
A code-along of Random Forest, AdaBoosting and Gradient Boosting methods for Data Science beginners.
Who is this for:
As the subtitle suggests, this code-along post is for beginners interested in building their first, more advanced, supervised machine learning model. Perhaps you want to know how to improve your Titanic score on Kaggle; this code-along will show you an approach that could boost your score significantly straight away.
This is for people who learn best by doing.
My aim is to demystify the application of machine learning. Yes, the theory behind machine learning can be quite complex and I strongly encourage you to dive deeper and expose yourself to the underlying ‘math’ of the things you do and use. However, we all need to start somewhere and sometimes getting a feel for how things work and seeing results can motivate your deeper learning.
I don’t like to repeat content that platforms like Medium are already saturated with, so I won’t be deep-diving into how these algorithms work here.
I will provide very brief “in a nutshell, this is what’s going on” explanations of each method and will point you in the direction of some relevant articles, blogs and papers (and I really do encourage you to dive deeper).
Prerequisites:
This code-along aims to help you jump right in and get your hands dirty with building a machine learning model using 3 different ensemble methods: random forest, AdaBoosting and gradient boosting.
I assume you have a general understanding of supervised vs. unsupervised learning and some knowledge of basic decision tree models.
I’ll be using:
- The Churn in Telecom dataset from Kaggle
- Pandas for data cleaning
- Scikit-Learn for modelling
What we’re going to do here:
- Quick EDA
- Create 4 models (FSM and 3 ensemble models)
- Compare accuracy and recall metrics of each model
Ensemble Models:
Don’t over think this one —
Ensemble models are an ensemble of models!
Mind blown?
The idea behind ensemble methods is the “wisdom of the crowd”. If you ask me
“Does my bum look good in this skirt?”
I tell you
“Hell yeah girl!”
But perhaps you want to make sure I’m not just being polite, so you ask 5 other people and they give you the same response. After asking 20 more people, you’re feeling pretty confident about yourself (as you should!).
This is the idea behind ensemble methods.
Ensemble models give better predictions by combining the predictions of lots of single models. This might be done through aggregation of prediction results or by improving upon model predictions. For this reason, ensemble methods tend to win competitions. For more on this, please see this article and this blog for starters.
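To make the aggregation idea concrete, here’s a tiny, purely hypothetical sketch of majority voting: three made-up models each predict churn (1) or stay (0) for five customers, and the ensemble goes with the majority.
import numpy as np
# Hypothetical predictions from three separate models (1 = churn, 0 = stay)
model_a = np.array([1, 0, 0, 1, 0])
model_b = np.array([1, 1, 0, 1, 0])
model_c = np.array([0, 0, 0, 1, 1])
# Majority vote: a customer is predicted to churn if at least 2 of the 3 models say so
votes = model_a + model_b + model_c
ensemble_prediction = (votes >= 2).astype(int)
print(ensemble_prediction)  # [1 0 0 1 0]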
Random forests, AdaBoosting and Gradient boosting are just 3 ensemble methods that I’ve chosen to look at today, but there are many others!
Random Forest:
Random Forest is a supervised learning algorithm for both classification and regression problems.
In a nutshell, the Random Forest algorithm is an ensemble of Decision Tree models.
The decision tree algorithm chooses its splits based on maximising information gain at every stage, so creating multiple decision trees on the same dataset will result in the same tree. For our ensemble method to be effective, we need variability in our individual models. The random forest algorithm takes advantage of bagging and the subspace sampling method to create this variability.
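If you’re curious where those two ideas live in scikit-learn, the sketch below points at the RandomForestClassifier parameters that control them; the values are illustrative, not tuned.
from sklearn.ensemble import RandomForestClassifier
# bootstrap=True resamples the training rows for each tree (bagging), and
# max_features limits the candidate features considered at each split
# (the subspace sampling method). Values here are illustrative only.
forest_sketch = RandomForestClassifier(
    n_estimators=100,    # number of trees in the forest
    bootstrap=True,      # each tree sees a bootstrap sample of the rows
    max_features="sqrt"  # each split considers a random subset of the features
)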
Please see Leo Breiman’s paper and website for the nitty-gritty details of random forest. Or if you’re more of a blog/article person, see here.
AdaBoosting (Adaptive Boosting):
AdaBoosting, a.k.a. Adaptive Boosting, was the first practical boosting algorithm, so I touch on it here out of nostalgia. Many boosting algorithms have since improved upon AdaBoosting, but it’s still a good place to start learning about boosting.
In a nutshell, an AdaBoost model is trained on a subsample of the dataset, assigns a weight to each point, and updates those weights on each model iteration: if the current learner classifies a point correctly, that point’s weight is reduced; if it misclassifies the point, the weight is increased, so the next learner pays more attention to the difficult points.
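To make the reweighting idea concrete, here’s a bare-bones sketch of one round of the classic update with made-up labels and predictions; scikit-learn’s AdaBoostClassifier handles all of this for us later on.
import numpy as np
# Hypothetical labels and one weak learner's predictions for six points
y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1])            # two mistakes
weights = np.full(len(y_true), 1 / len(y_true))  # start with equal weights
# weighted error of this learner and its "say" in the final vote
err = np.sum(weights * (y_pred != y_true))
alpha = 0.5 * np.log((1 - err) / err)
# increase the weights of misclassified points, decrease the rest, renormalise
weights *= np.exp(alpha * np.where(y_pred != y_true, 1, -1))
weights /= weights.sum()
print(weights.round(3))  # misclassified points now carry more weight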
See more on AdaBoosting here.
Gradient Boosting:
Gradient Boosting is a more advanced boosting algorithm and takes advantage of gradient descent, which you might remember from linear regression.
In a nutshell, Gradient Boosting improves upon each weak learner in a similar way to the AdaBoosting algorithm, except that it calculates the residuals at each point and combines them with a loss function. The algorithm uses gradient descent to minimise the overall loss, fitting each new learner to the gradients of that loss (the residuals).
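Here’s a toy sketch of a single boosting step, using squared-error loss so that the negative gradient is simply the residual; the data is made up and this is only meant to show the “fit the next learner to the residuals” idea.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
# Made-up one-feature regression data
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.array([3, 4, 6, 7, 9, 11, 12, 14, 15, 18], dtype=float)
prediction = np.full_like(y_toy, y_toy.mean())  # initial prediction
residuals = y_toy - prediction                  # negative gradient of squared-error loss
# fit the next weak learner to the residuals, then take a small step towards it
learner = DecisionTreeRegressor(max_depth=1).fit(X_toy, residuals)
learning_rate = 0.1
prediction += learning_rate * learner.predict(X_toy)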
For more, have a read of this and this.
OK OK! Let’s get to the fun part!
EDA:
Let’s do some super quick exploratory data analysis. I chose this dataset deliberately because it’s already pretty clean when you download it from Kaggle. But, as new data scientists, it’s important for us to continue to hone our EDA skills.
Data Question:
As you probably guessed from the title of the dataset, this model aims to predict churn — a very common problem businesses face.
To decide which metrics we should use to evaluate our model, let’s think about what we want it to predict and which is worse: a false negative prediction or a false positive prediction.
Our model should predict whether a customer in our dataset will stay with the company (False) or leave (True).
In this scenario, we have:
False negative: the model predicts that a customer stays with the company (False), when in fact that customer churns (True).
False positive: the model predicts that a customer will churn (True), when in fact they will stay (False).
Given this, we would probably argue that false negatives are more costly to the company as it would be a missed opportunity to market towards keeping those customers. For this reason, we will use accuracy and recall scores to evaluate our model performance.*
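If false negatives and false positives are new terms for you, the little sketch below (with made-up labels, True = churn) shows how to read them off a confusion matrix and how recall relates to them.
from sklearn.metrics import confusion_matrix, recall_score
# Hypothetical true labels and predictions (True = churn)
y_true = [True, False, True, False, True, False]
y_pred = [True, False, False, False, True, True]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(fn, fp)                        # 1 false negative, 1 false positive
print(recall_score(y_true, y_pred))  # recall = tp / (tp + fn) = 2/3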
Load and Preview Data:
First, download the data to your directory here.
Imports:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
Load and Preview:
df = pd.read_csv('data/raw/telecom_churn_data.csv')
df.head()
From here, we can see that a row represents a Telecom customer.
We can identify pretty quickly what our target variable is going to be: churn.
I don’t like those spaces in the column headings so we’ll change that, check our data types and inspect any missing data (null values).
df.columns = df.columns.str.replace(' ', '_')
df.info()
As I mentioned, I picked this dataset because it’s already pretty clean. Our datatypes make sense and you can see we have no null values. Of course, seeing 3333 non-null does not necessarily mean we don’t have null values — sometimes we have null values in disguise in our datasets. I have inspected the unique values for each column and can confirm that the data looks complete at this point (of course, please do let me know if you do find something suspicious that I’ve missed!).
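If you’d like to do that check yourself, one quick (and by no means only) way to hunt for nulls in disguise is to eyeball the unique values of each column:
# print the number of unique values and a small sample of them for each column
for col in df.columns:
    print(col, df[col].nunique(), df[col].unique()[:5])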
Create Target and Features Variables:
# our target, 'y' variable:
y = df.churn
# df of our features data 'X' - drop the target column
X = df.drop("churn", axis=1)
Dealing with Categorical Features:
You may have noticed from our df.info() earlier that we have 4 columns of object type. 3 of these columns are useful categories: state, international_plan and voice_mail_plan.
The other object column is phone_number which, you guessed it, is a customer’s phone number. I argue that a person’s phone number shouldn’t have any great bearing on whether they decide to stick with a phone company, so for this reason, I choose to simply drop this column from our feature set. **
# drop phone_number column
X = X.drop('phone_number', axis=1)
So, let’s now dummy out the remaining 3 categorical columns. This will add a lot of columns to our feature set, which in turn will add some complexity to our model, but for our example’s sake we won’t worry too much about this right now.
# create dummy variables for categorical columns
X = pd.get_dummies(X, drop_first = True)
To learn more about dummy variables, one-hot-encoding methods and why I specified drop_first = True (the dummy trap), have a read of this article.
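If the dummy trap is new to you, here’s a toy example of what drop_first = True actually does: with three categories, two dummy columns are enough, because the dropped category is implied when both are 0.
# Toy illustration of drop_first (made-up data, not part of our churn dataset)
toy = pd.DataFrame({"state": ["NY", "CA", "TX", "CA"]})
print(pd.get_dummies(toy))                   # state_CA, state_NY, state_TX
print(pd.get_dummies(toy, drop_first=True))  # state_NY, state_TX only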
This is a good opportunity to do another X.head() to look at how your data-frame looks now.
Train-Test Split:
First, we split our X and y data into a training set used for training the model, and a testing set used for (you guessed it) testing the model. I’ve chosen to do a 0.25 split here.
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.25,
                                                    random_state=15)
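One optional variation worth knowing about: because churn is imbalanced (more on that at the end), train_test_split can keep the class ratio consistent across both splits with the stratify argument. The results in this post were produced without it, so treat this as a sketch to experiment with.
# Optional: preserve the churn ratio in both splits (not used for the results below)
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.25, random_state=15, stratify=y)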
Modelling
Throughout our modelling here, I’m mostly going to use the default parameters for the model objects upon instantiating them. I will talk about hyperparameter tuning at the end but in short, all the fun is in tuning the parameters so I want to leave that up to you to explore. I want to focus on showing you how the models change just by using their defaults.
For brevity’s sake, I am simply going to fit the models and spit out some metrics to compare them.
As we mentioned above, we’ll use accuracy_score and recall_score as our metrics to compare.
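If the repetition below bothers you, you could wrap the two metrics in a small helper like this hypothetical one; I’ll keep printing them line by line so each step stays explicit.
def print_scores(model, X_train, X_test, y_train, y_test):
    # print train/test accuracy and recall for an already-fitted model
    for name, X_, y_ in [("train", X_train, y_train), ("test", X_test, y_test)]:
        preds = model.predict(X_)
        print(f"{name} accuracy: {accuracy_score(y_, preds):.3f}, "
              f"recall: {recall_score(y_, preds):.3f}")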
FSM — First Shitty Model: Single Decision Tree
To see if our ensemble methods are any good, we first need to know how a single, base model performs on our data. Since this is a classification problem, we’re going to use a Decision Tree as our FSM. Since Decision Trees have a habit of overfitting, I’m going to set a max_depth of 5.
# instantiate decision tree object (max_depth=5, other params left at their defaults)
dtc = DecisionTreeClassifier(max_depth=5)
# fit the model to our training data
dtc.fit(X_train, y_train)
Results:
Accuracy:
# calculate accuracy_score for training data:
print(accuracy_score(y_train, dtc.predict(X_train)))
0.9559823929571829
# calculate accuracy_score for test data:
print(accuracy_score(y_test, dtc.predict(X_test)))
0.934052757793765
This is suspiciously high and highlights some issues we have with accuracy score…
But, just going by accuracy here, this model isn’t doing too badly and there isn’t a huge discrepancy between training and test scores.
Recall:
# calculate recall_score for train data:
print(recall_score(y_train, dtc.predict(X_train)))
0.7388888888888889
# calculate recall_score for test data:
print(recall_score(y_test, dtc.predict(X_test)))
0.6422764227642277
We can see here there is a slightly larger difference between the train and test recall scores. Still, these scores are quite good for a first model.
Remember, for this problem we care more about the recall score since we want to minimise false negatives (missed churners). Recall is what we want to try to maximise with this model.
Model 2: Random Forest
Next, we create a random forest model with max_depth of 5. Some versions of sklearn raise a warning if we leave n_estimators blank so I’ve set it to 100 here.
# instantiate random forest classifier object
rft = RandomForestClassifier(n_estimators=100, max_depth=5)
# fit the model to the training data:
rft.fit(X_train, y_train)
Results:
Accuracy:
# calculate accuracy_score for training data:
print(accuracy_score(y_train, rft.predict(X_train)))
0.8855542216886755
# calculate accuracy_score for test data:
print(accuracy_score(y_test, rft.predict(X_test)))
0.86810551558753
Our accuracy score has actually gone down from our first model. Let’s check recall:
# calculate recall_score for train data:
print(recall_score(y_train, rft.predict(X_train)))
0.20555555555555555
# calculate recall_score for test data:
print(recall_score(y_test, rft.predict(X_test)))
0.1056910569105691
Wow, our recall score has gone way down for our random forest model! This might seem strange at first, but there are many reasons why this can happen. One conceptual reason to keep in mind is that the single decision tree is one model, whereas, by nature, the random forest makes its final predictions from the votes of all the trees in the forest. That voting makes the forest more conservative than the ‘opinion’ of one tree: with most of our customers labelled False, the trees mostly vote False, so far fewer churners get flagged and recall drops.
We also haven’t taken into consideration class imbalances or hyperparameters, so this example is a little contrived.
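If you want to see this conservatism for yourself, count how many churn (True) predictions each model actually makes on the test set; the forest should flag noticeably fewer customers than the single tree.
# compare how often each model predicts churn (True) on the test set
print(pd.Series(dtc.predict(X_test)).value_counts())
print(pd.Series(rft.predict(X_test)).value_counts())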
Let’s continue in the same way and see how to implement AdaBoosting and Gradient Boosting models and compare their performance:
Model 3: AdaBoosting
# instantiate adaboost classifier object
abc = AdaBoostClassifier(random_state = 15)
# fit the model to the training data:
abc.fit(X_train, y_train)
Results:
Accuracy:
# calculate accuracy_score for training data:
print(accuracy_score(y_train, abc.predict(X_train)))
0.8979591836734694
# calculate accuracy_score for test data:
print(accuracy_score(y_test, abc.predict(X_test)))
0.86810551558753
Surprisingly, our accuracy score for the test data stayed the same as the random forest’s, but we saw some improvement in the accuracy score for the training data.
Recall:
# calculate recall_score for train data:
print(recall_score(y_train, abc.predict(X_train)))
0.4638888888888889
# calculate recall_score for test data:
print(recall_score(y_test, abc.predict(X_test)))
0.3333333333333333
Our recall score has improved significantly on the random forest model.
Let’s see how Gradient Boosting performs.
Model 4: Gradient Boosting
# instantiate gradient boost classifier object
gbc = GradientBoostingClassifier(random_state = 15)
# fit the model to the training data:
gbc.fit(X_train, y_train)
Results:
Accuracy:
# calculate accuracy_score for training data:
print(accuracy_score(y_train, gbc.predict(X_train)))
0.9731892757102841
# calculate accuracy_score for test data:
print(accuracy_score(y_test, gbc.predict(X_test)))
0.9508393285371702
Our highest accuracy scores so far. They’re not too far away from our first decision tree model. There’s also no significant evidence of overfitting on preliminary inspection.
Recall:
# calculate recall_score for train data:
print(recall_score(y_train, gbc.predict(X_train)))
0.8194444444444444
# calculate recall_score for test data:
print(recall_score(y_test, gbc.predict(X_test)))
0.7235772357723578
Once again, these are our highest recall scores so far, and they significantly outperform our first model.
Given these 4 models, we would choose the gradient boosting model as our best.
Final Notes and Next Steps:
As you can see, it’s not difficult to employ these models: we’re simply creating model objects and comparing results, without thinking too deeply about what’s happening under the hood. It’s easy to create and experiment with these different methods to see which ones outperform the others.
It should be noted, however, that this is a very contrived example meant only to show you how you can play around with these models and that ensemble methods generally do better than single models. The metrics presented here don’t necessarily carry much meaning since we used almost all of the default parameters, so here are some next steps to start thinking about and acquainting yourself with…
Hyperparameter Tuning:
Hyperparameters are parameter values that are set before the learning process. This is different from model parameters that we ‘discover’ after we have trained our model.
If you are familiar with linear regression, the slope and intercept parameters are the model parameters that we are trying to optimise by training our model.
An example of a hyperparameter we tuned in this example was max_depth = 5, which we set before fitting the model.
I understand you may not have come across a lot of these concepts yet; they will become clearer as you encounter them. But I encourage you to experiment with different hyperparameters to see how the models change with different tunings.
Hint: In Jupyter and other IDEs, shift + tab inside the parentheses of any method or class allows you to quickly inspect the parameters the object takes. For the model classes, this will show most of the hyperparameters you can play with.
Don’t underestimate the value of “tinkering”
Here’s an article about hyperparameter tuning and feature engineering.
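To give you a concrete starting point for that experimentation, here’s a small grid-search sketch over a few gradient boosting hyperparameters; the grid values are just illustrative, not a recommendation.
from sklearn.model_selection import GridSearchCV
# try a handful of hyperparameter combinations, scored on recall with 5-fold CV
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [2, 3, 5],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=15),
                      param_grid, scoring="recall", cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)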
Class imbalances:
There’s one line of code that I can use to explain what class imbalance is:
y.value_counts()
False 2850
True 483
Name: churn, dtype: int64
As you can see here, nearly 86% of our data is labeled False. What this means is that our models can become unfairly biased towards False predictions simply because of the ratio of that label in our data. This is what we call a ‘class imbalance’ problem.
There are lots of ways to deal with class imbalances, including class weights and SMOTE. Taking this into consideration with our dataset here will improve our models significantly.
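As a starting point, here are sketches of both approaches; the class-weight version only needs scikit-learn, while SMOTE needs the separate imbalanced-learn package, and I haven’t tuned or evaluated either here.
# 1. Class weights: penalise mistakes on the minority (True) class more heavily
weighted_rft = RandomForestClassifier(n_estimators=100, max_depth=5,
                                      class_weight="balanced", random_state=15)
weighted_rft.fit(X_train, y_train)
# 2. SMOTE: oversample the minority class with synthetic points
#    (uncomment if you have the imbalanced-learn package installed)
# from imblearn.over_sampling import SMOTE
# X_res, y_res = SMOTE(random_state=15).fit_resample(X_train, y_train)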
Wrapping up
I hope you’ve seen the value in how these ensemble methods can improve upon your models and I encourage you to try them out on different datasets to get a feel for them. As you learn more about the theory behind these algorithms, their hyperparameters, applying regularisation and class balancing methods, you’ll have a good head start on how these features play out and what effect they have on your models.
Happy learning!
Footnotes
- * Some might argue that false positives are also costly to a company since you are spending money on retention when the customer would have stayed anyway. I leave it up to you to think about this question and experiment with different scoring metrics. Here’s an article on some of the different metrics you can investigate.
- ** I leave it up to you to explore how you might want to handle the phone_number feature.
- *** Notice I have done no feature engineering in this example. Tuning regularisation hyperparameters can help with this also.
Citations
Thank you to the authors of the blogs and papers I have pointed readers to, to supplement this blog.