Scikit-Learn Practice Solutions
This notebook offers a set of potential solutions to the Scikit-Learn exercise notebook.
Exercises are based off (and directly taken from) the quick introduction to Scikit-Learn notebook.
Different tasks will be detailed by comments or text.
For further reference and resources, it's advised to check out the Scikit-Learn documentation.
And if you get stuck, try searching for a question in the following format: "how to do XYZ with Scikit-Learn", where XYZ is the function you want to leverage from Scikit-Learn.
Since we'll be working with data, we'll import Matplotlib, NumPy and pandas alongside Scikit-Learn.
Let's get started.
End-to-end Scikit-Learn classification workflow
Let's start with an end-to-end Scikit-Learn workflow.
More specifically, we'll:
Get a dataset ready
Prepare a machine learning model to make predictions
Fit the model to the data and make a prediction
Evaluate the model's predictions
The data we'll be using is stored on GitHub. We'll start with `heart-disease.csv`, a dataset which contains anonymous patient data and whether or not they have heart disease.
Note: When viewing a `.csv` on GitHub, make sure it's in the raw format. For example, the URL should look like: https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/heart-disease.csv
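As a minimal sketch of the setup, the imports and data loading might look like this (the DataFrame name `heart_disease` is an assumption):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Read the heart disease dataset directly from the raw GitHub URL
heart_disease = pd.read_csv("https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/heart-disease.csv")

# View the first few rows
heart_disease.head()
```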
1. Getting a dataset ready
Our goal here is to build a machine learning model on all of the columns except `target` to predict `target`.
In essence, the `target` column is our target variable (also called `y` or labels) and the rest of the other columns are our independent variables (also called data or `X`).
And since our target variable is one thing or another (heart disease or not), we know our problem is a classification problem (classifying whether something is one thing or another).
Knowing this, let's create `X` and `y` by splitting our dataframe up.
Now we've split our data into `X` and `y`, we'll use Scikit-Learn to split it into training and test sets.
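One way to do this (the 80/20 split is an assumption; adjust `test_size` as you like):

```python
from sklearn.model_selection import train_test_split

# Reserve 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Check the shapes of each split
X_train.shape, X_test.shape, y_train.shape, y_test.shape
```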
What do you notice about the different shapes of the data?
Since our data is now in training and test sets, we'll build a machine learning model to fit patterns in the training data and then make predictions on the test data.
To figure out which machine learning model we should use, you can refer to Scikit-Learn's machine learning map.
After following the map, you decide to use the `RandomForestClassifier`.
2. Preparing a machine learning model
Now you've got a `RandomForestClassifier` instance, let's fit it to the training data.
Once it's fit, we'll make predictions on the test data.
3. Fitting a model and making predictions
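A sketch of these two steps combined, instantiating the model, fitting it to the training data and making predictions on the test data (default hyperparameters assumed):

```python
from sklearn.ensemble import RandomForestClassifier

# Instantiate the model with default hyperparameters
clf = RandomForestClassifier()

# Fit the model to the training data
clf.fit(X_train, y_train)

# Make predictions on the test data
y_preds = clf.predict(X_test)
```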
4. Evaluating a model's predictions
Evaluating predictions is as important as making them. Let's check how our model did by calling the `score()` method on it and passing it the training (`X_train`, `y_train`) and testing (`X_test`, `y_test`) data.
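For example (assuming the fitted classifier is called `clf`, as in the sketch above):

```python
# Evaluate the model on the training data
print(f"Train score: {clf.score(X_train, y_train)}")

# Evaluate the model on the test data
print(f"Test score: {clf.score(X_test, y_test)}")
```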
How did your model go?
What metric does `score()` return for classifiers?
Did your model do better on the training dataset or the test dataset?
Experimenting with different classification models
Now we've quickly covered an end-to-end Scikit-Learn workflow and since experimenting is a large part of machine learning, we'll now try a series of different machine learning models and see which gets the best results on our dataset.
Going through the Scikit-Learn machine learning map, we see there are a number of different classification models we can try (different models are in the green boxes).
For this exercise, the models we're going to try and compare are:
KNeighborsClassifier (also known as K-Nearest Neighbors or KNN)
SVC (also known as support vector classifier, a form of support vector machine)
LogisticRegression (despite the name, this is actually a classifier)
RandomForestClassifier (an ensemble method and what we used above)
We'll follow the same workflow we used above (except this time for multiple models):
Import a machine learning model
Get it ready
Fit it to the data and make predictions
Evaluate the fitted model
Note: Since we've already got the data ready, we can reuse it in this section.
Thanks to the consistency of Scikit-Learn's API design, we can use virtually the same code to fit, score and make predictions with each of our models.
To see which model performs best, we'll do the following:
Instantiate each model in a dictionary
Create an empty results dictionary
Fit each model on the training data
Score each model on the test data
Check the results
If you're wondering what it means to instantiate each model in a dictionary, see the example below.
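For instance, a dictionary of instantiated models might look like the sketch below (the dictionary names and exact constructor arguments are assumptions):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# One instance of each model, keyed by name
models = {"KNN": KNeighborsClassifier(),
          "SVC": SVC(),
          "LogisticRegression": LogisticRegression(),
          "RandomForestClassifier": RandomForestClassifier()}

# An empty dictionary to store each model's test score
results = {}
```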
Since each model we're using has the same `fit()` and `score()` methods, we can loop through our models dictionary, call `fit()` on the training data and then call `score()` on the test data.
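A sketch of such a loop, assuming the `models` and `results` dictionaries from above:

```python
# Fit and score each model in turn
for model_name, model in models.items():
    model.fit(X_train, y_train)                          # fit on the training data
    results[model_name] = model.score(X_test, y_test)    # score on the test data

results
```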
Which model performed the best?
Do the results change each time you run the cell?
Why do you think this is?
Due to the randomness of how each model finds patterns in the data, you might notice different results each time.
Without manually setting the random state using the `random_state` parameter of some models or using a NumPy random seed, every time you run the cell, you'll get slightly different results.
Let's see this in effect by running the same code as the cell above, except this time setting a NumPy random seed equal to 42.
Run the cell above a few times, what do you notice about the results?
Which model performs the best this time?
What happens if you add a NumPy random seed to the cell where you called `train_test_split()` (towards the top of the notebook) and then rerun the cell above?
Let's make our results a little more visual.
Using `np.random.seed(42)` results in the `LogisticRegression` model performing the best (at least on my computer).
Let's tune its hyperparameters and see if we can improve it.
Hyperparameter Tuning
Remember, if you're ever trying to tune a machine learning model's hyperparameters and you're not sure where to start, you can always search something like "MODEL_NAME hyperparameter tuning".
In the case of LogisticRegression, you might come across articles, such as Hyperparameter Tuning Using Grid Search by Chris Albon.
The article uses `GridSearchCV` but we're going to be using `RandomizedSearchCV`.
The different hyperparameters to search over have been set up for you in `log_reg_grid` but feel free to change them.
Since we've got a set of hyperparameters, we can import `RandomizedSearchCV`, pass it our dictionary of hyperparameters and let it search for the best combination.
Once `RandomizedSearchCV` has finished, we can find the best hyperparameters it found using the `best_params_` attribute.
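As a sketch, the hyperparameter grid and the randomized search might look like this (the exact values in `log_reg_grid`, plus `cv` and `n_iter`, are assumptions; feel free to change them):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Hyperparameter distributions to sample from (illustrative values)
log_reg_grid = {"C": np.logspace(-4, 4, 20),
                "solver": ["liblinear"]}

# Set up the randomized search over LogisticRegression hyperparameters
rs_log_reg = RandomizedSearchCV(estimator=LogisticRegression(),
                                param_distributions=log_reg_grid,
                                cv=5,       # 5-fold cross-validation
                                n_iter=20,  # number of combinations to try
                                verbose=True)

# Search for the best hyperparameters on the training data
rs_log_reg.fit(X_train, y_train)

# Inspect the best combination found
rs_log_reg.best_params_
```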
After hyperparameter tuning, did the model's score improve? What else could you try to improve it? Are there any other methods of hyperparameter tuning you can find for `LogisticRegression`?
Classifier Model Evaluation
We've tried to find the best hyperparameters on our model using `RandomizedSearchCV` and so far we've only been evaluating our model using the `score()` method, which returns accuracy for classifiers.
But when it comes to classification, you'll likely want to use a few more evaluation metrics, including:
Confusion matrix - Compares the predicted values with the true values in a tabular way; if 100% correct, all values in the matrix will lie on the diagonal from top left to bottom right.
Cross-validation - Splits your dataset into multiple parts, trains and tests your model on each part, then evaluates performance as an average.
Precision - Proportion of true positives over the total number of predicted positives (true positives + false positives). Higher precision leads to fewer false positives.
Recall - Proportion of true positives over the total number of actual positives (true positives + false negatives). Higher recall leads to fewer false negatives.
F1 score - Combines precision and recall into one metric. 1 is best, 0 is worst.
Classification report - Sklearn has a built-in function called `classification_report()` which returns some of the main classification metrics such as precision, recall and F1 score.
ROC Curve - Receiver Operating Characteristic is a plot of true positive rate versus false positive rate.
Area Under Curve (AUC) - The area underneath the ROC curve. A perfect model achieves a score of 1.0.
Before we get to these, we'll instantiate a new instance of our model using the best hyperparameters found by `RandomizedSearchCV`.
Now it's time to import the relevant Scikit-Learn methods for each of the classification evaluation metrics we're after.
Evaluation metrics very often compare a model's predictions to some ground truth labels.
Let's make some predictions on the test data using our latest model and save them to `y_preds`.
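For example (assuming the tuned model instance is called `clf`):

```python
# Make predictions on the test data with the tuned model
y_preds = clf.predict(X_test)

# View the first 10 predictions
y_preds[:10]
```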
Time to use the predictions our model has made to evaluate it beyond accuracy.
Challenge: The in-built `confusion_matrix` function in Scikit-Learn produces something not too visual. How could you make your confusion matrix more visual?
You might want to search something like "how to plot a confusion matrix". Note: There may be more than one way to do this.
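One possible approach is a sketch using `ConfusionMatrixDisplay` (available in recent Scikit-Learn versions); a Seaborn heatmap over the raw `confusion_matrix` output is another option:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Plot a confusion matrix from the true labels and the predictions
ConfusionMatrixDisplay.from_predictions(y_true=y_test, y_pred=y_preds)
plt.show()
```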
How about a classification report?
Challenge: Write down what each of the columns in this classification report are.
Precision - Indicates the proportion of positive identifications (model predicted class 1) which were actually correct. A model which produces no false positives has a precision of 1.0.
Recall - Indicates the proportion of actual positives which were correctly classified. A model which produces no false negatives has a recall of 1.0.
F1 score - A combination of precision and recall. A perfect model achieves an F1 score of 1.0.
Support - The number of samples each metric was calculated on.
Accuracy - The accuracy of the model in decimal form. Perfect accuracy is equal to 1.0.
Macro avg - Short for macro average, the average precision, recall and F1 score between classes. Macro avg doesn't take class imbalance into account, so if you do have class imbalances, pay attention to this metric.
Weighted avg - Short for weighted average, the weighted average precision, recall and F1 score between classes. Weighted means each metric is calculated with respect to how many samples there are in each class. This metric will favour the majority class (e.g. it will give a high value when one class outperforms another due to having more samples).
The classification report gives us a range of values for precision, recall and F1 score. Time to find these metrics using Scikit-Learn functions.
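A sketch using Scikit-Learn's metric functions on the test predictions from above:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Compare the predictions against the ground truth test labels
print(f"Precision: {precision_score(y_test, y_preds)}")
print(f"Recall: {recall_score(y_test, y_preds)}")
print(f"F1 score: {f1_score(y_test, y_preds)}")
```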
Confusion matrix: done. Classification report: done. ROC (receiver operator characteristic) curve & AUC (area under curve) score: not done.
Let's fix this.
If you're unfamiliar with what a ROC curve is, that's your first challenge: read up on what one is.
In a sentence, a ROC curve is a plot of the true positive rate versus the false positive rate.
And the AUC score is the area underneath the ROC curve.
Scikit-Learn provides a handy function for creating both of these called `plot_roc_curve()`.
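Note that `plot_roc_curve()` has been removed in newer Scikit-Learn releases; if it's unavailable in your version, `RocCurveDisplay` produces the same plot. A sketch assuming the fitted classifier `clf`:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay, roc_auc_score

# Plot the ROC curve using the fitted classifier and the test data
RocCurveDisplay.from_estimator(clf, X_test, y_test)
plt.show()

# AUC score from the predicted probabilities of the positive class
print(f"AUC: {roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])}")
```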
Beautiful! We've gone far beyond accuracy with a plethora of extra classification evaluation metrics.
If you're not sure about any of these, don't worry, they can take a while to understand. That could be an optional extension: reading up on a classification metric you're not sure of.
The thing to note here is all of these metrics have been calculated using a single training set and a single test set. Whilst this is okay, a more robust way is to calculate them using cross-validation.
We can calculate various evaluation metrics with cross-validation using Scikit-Learn's `cross_val_score()` function along with the `scoring` parameter.
In the examples, the cross-validated accuracy is found by taking the mean of the array returned by `cross_val_score()`.
Now it's time to find the same for precision, recall and F1 score.
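A sketch using `cross_val_score` with different `scoring` strings (5-fold cross-validation is an assumption):

```python
import numpy as np
from sklearn.model_selection import cross_val_score

# Cross-validated metrics, each averaged over 5 folds of the full dataset
cv_precision = np.mean(cross_val_score(clf, X, y, cv=5, scoring="precision"))
cv_recall = np.mean(cross_val_score(clf, X, y, cv=5, scoring="recall"))
cv_f1 = np.mean(cross_val_score(clf, X, y, cv=5, scoring="f1"))

cv_precision, cv_recall, cv_f1
```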
Exporting and importing a trained model
Once you've trained a model, you may want to export it and save it to file so you can share it or use it elsewhere.
One method of exporting and importing models is using the joblib library.
In Scikit-Learn, exporting and importing a trained model is known as model persistence.
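A sketch of exporting and re-importing a model with joblib (the filename is an assumption):

```python
from joblib import dump, load

# Save the trained model to file
dump(clf, "trained-classifier.joblib")

# Load the saved model back in and evaluate it on the test data
loaded_clf = load("trained-classifier.joblib")
loaded_clf.score(X_test, y_test)
```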
What do you notice about the loaded trained model results versus the original (pre-exported) model results?
Scikit-Learn Regression Practice
For the next few exercises, we're going to be working on a regression problem, in other words, using some data to predict a number.
Our dataset is a table of car sales, containing different car characteristics as well as a sale price.
We'll use Scikit-Learn's built-in regression machine learning models to try and learn the patterns in the car characteristics and their prices on a certain group of the dataset before trying to predict the sale price of a group of cars the model has never seen before.
To begin, we'll import the data from GitHub into a pandas DataFrame, check out some details about it and try to build a model as soon as possible.
Looking at the output of `info()`:
How many rows are there in total?
What datatypes are in each column?
How many missing values are there in each column?
Knowing this information, what would happen if we tried to model our data as it is?
Let's see.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-44-476d8071e1b5> in <module>
2 from sklearn.ensemble import RandomForestRegressor
3 car_sales_X, car_sales_y = car_sales.drop("Price", axis=1), car_sales.Price
----> 4 rf_regressor = RandomForestRegressor().fit(car_sales_X, car_sales_y)
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.6/site-packages/sklearn/ensemble/_forest.py in fit(self, X, y, sample_weight)
293 """
294 # Validate or convert input data
--> 295 X = check_array(X, accept_sparse="csc", dtype=DTYPE)
296 y = check_array(y, accept_sparse='csc', ensure_2d=False, dtype=None)
297 if sample_weight is not None:
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
513 array = array.astype(dtype, casting="unsafe", copy=False)
514 else:
--> 515 array = np.asarray(array, order=order, dtype=dtype)
516 except ComplexWarning:
517 raise ValueError("Complex data not supported\n"
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.6/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
83
84 """
---> 85 return array(a, dtype, copy=False, order=order)
86
87
ValueError: could not convert string to float: 'Honda'
As we see, the cell above breaks because our data contains non-numerical values as well as missing data.
To take care of some of the missing data, we'll remove the rows which have no labels (all the rows with missing values in the `Price` column).
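For example, using pandas on the `car_sales` DataFrame:

```python
# Remove rows with no label (missing values in the Price column)
car_sales.dropna(subset=["Price"], inplace=True)
```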
Building a pipeline
Since our `car_sales` data has missing numerical values and isn't all numerical, we'll have to fix these things before we can fit a machine learning model on it.
There are ways we could do this with pandas, but since we're practicing Scikit-Learn, we'll see how we might do it with the `Pipeline` class.
Because we're modifying columns in our dataframe (filling missing values, converting non-numerical data to numbers), we'll need the `ColumnTransformer`, `SimpleImputer` and `OneHotEncoder` classes as well.
Finally, because we'll need to split our data into training and test sets, we'll import `train_test_split` as well.
Now we've got the necessary tools we need to create our preprocessing `Pipeline`, which fills missing values along with turning all non-numerical data into numbers.
Let's start with the categorical features.
It would be safe to treat `Doors` as a categorical feature as well. However, since we know the vast majority of cars have 4 doors, we'll impute the missing `Doors` values as 4.
Now onto the numeric features. In this case, the only numeric feature is the `Odometer (KM)` column. Let's fill its missing values with the median.
Time to put all of our individual transformer `Pipeline`s into a single `ColumnTransformer` instance.
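A sketch of the full preprocessing setup, building one `Pipeline` per feature group and combining them in a `ColumnTransformer` (the exact column groupings such as "Make" and "Colour" are assumptions based on the descriptions above):

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Categorical features: fill missing values, then one-hot encode
categorical_features = ["Make", "Colour"]
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))])

# Doors: treated as categorical, but missing values imputed as 4
door_features = ["Doors"]
door_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value=4))])

# Numeric features: fill missing Odometer (KM) values with the median
numeric_features = ["Odometer (KM)"]
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median"))])

# Combine the individual transformers into a single preprocessor
preprocessor = ColumnTransformer(transformers=[
    ("categorical", categorical_transformer, categorical_features),
    ("door", door_transformer, door_features),
    ("numeric", numeric_transformer, numeric_features)])
```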
Boom! Now our `preprocessor` is ready, time to import some regression models to try out.
Comparing our data to the Scikit-Learn machine learning map, we can see there's a handful of different regression models we can try.
SVR(kernel="linear") - short for Support Vector Regressor with a linear kernel, a form of support vector machine.
SVR(kernel="rbf") - short for Support Vector Regressor with an RBF (radial basis function) kernel, a form of support vector machine.
RandomForestRegressor - the regression version of RandomForestClassifier.
Again, thanks to the design of the Scikit-Learn library, we're able to use very similar code for each of these models.
To test them all, we'll create a dictionary of regression models and an empty dictionary for regression model results.
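For example, a sketch of the two dictionaries (`Ridge` is included here as an assumption, since it's evaluated later in the notebook):

```python
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor

# Dictionary of regression models to compare
regression_models = {"Ridge": Ridge(),
                     "SVR_linear": SVR(kernel="linear"),
                     "SVR_rbf": SVR(kernel="rbf"),
                     "RandomForestRegressor": RandomForestRegressor()}

# Empty dictionary to store each model's results
regression_results = {}
```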
Our regression model dictionary is prepared as well as an empty dictionary to append results to. Time to get the data split into `X` (feature variables) and `y` (target variable) as well as training and test sets.
In our car sales problem, we're trying to use the different characteristics of a car (`X`) to predict its sale price (`y`).
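A sketch of the split (the 80/20 test size and the variable names are assumptions):

```python
from sklearn.model_selection import train_test_split

# Features are every column except Price, target is Price
car_sales_X = car_sales.drop("Price", axis=1)
car_sales_y = car_sales["Price"]

# Split into training and test sets
car_X_train, car_X_test, car_y_train, car_y_test = train_test_split(
    car_sales_X, car_sales_y, test_size=0.2)

# Check how many rows and columns ended up in each set
car_X_train.shape, car_X_test.shape
```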
How many rows are in each set?
How many columns are in each set?
Alright, our data is split into training and test sets, time to build a small loop (sketched below) which is going to:
Go through our `regression_models` dictionary
Create a `Pipeline` which contains our `preprocessor` as well as one of the models in the dictionary
Fit the `Pipeline` to the car sales training data
Evaluate the target model on the car sales test data and append the results to our `regression_results` dictionary
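Here's one possible version of that loop, assuming the names from the sketches above (`preprocessor`, `regression_models`, `regression_results` and the car sales train/test splits):

```python
from sklearn.pipeline import Pipeline

for model_name, model in regression_models.items():
    # Chain the preprocessor with the current model
    model_pipeline = Pipeline(steps=[("preprocessor", preprocessor),
                                     ("model", model)])

    # Fit the pipeline to the car sales training data
    model_pipeline.fit(car_X_train, car_y_train)

    # Score the pipeline on the car sales test data and store the result
    regression_results[model_name] = model_pipeline.score(car_X_test, car_y_test)

regression_results
```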
Our regression models have been fit, let's see how they did!
Which model did the best?
How could you improve its results?
What metric does the `score()` method of a regression model return by default?
Since we've fitted some models but only compared them via the default metric contained in the `score()` method (R^2 score or coefficient of determination), let's take the `Ridge` regression model and evaluate it with a few other regression metrics.
Specifically, let's find:
R^2 (pronounced r-squared) or coefficient of determination - Compares your model's predictions to the mean of the targets. Values can range from negative infinity (a very poor model) to 1. For example, if all your model does is predict the mean of the targets, its R^2 value would be 0. And if your model perfectly predicts a range of numbers, its R^2 value would be 1.
Mean absolute error (MAE) - The average of the absolute differences between predictions and actual values. It gives you an idea of how wrong your predictions were.
Mean squared error (MSE) - The average squared differences between predictions and actual values. Squaring the errors removes negative errors. It also amplifies outliers (samples which have larger errors).
Scikit-Learn has a few functions built in which are going to help us with these, namely `mean_absolute_error`, `mean_squared_error` and `r2_score`.
All the evaluation metrics we're concerned with compare a model's predictions with the ground truth labels. Knowing this, we'll have to make some predictions.
Let's create a `Pipeline` with the `preprocessor` and a `Ridge()` model, fit it on the car sales training data and then make predictions on the car sales test data.
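A sketch of that pipeline, reusing the `preprocessor` and the car sales splits from earlier:

```python
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge

# Preprocessing plus a Ridge model in one pipeline
ridge_pipeline = Pipeline(steps=[("preprocessor", preprocessor),
                                 ("model", Ridge())])

# Fit on the training data, then predict on the test data
ridge_pipeline.fit(car_X_train, car_y_train)
car_y_preds = ridge_pipeline.predict(car_X_test)
```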
Nice! Now we've got some predictions, time to evaluate them. We'll find the mean squared error (MSE), mean absolute error (MAE) and R^2 score (coefficient of determination) of our model.
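And evaluating those predictions (a sketch using the predictions made above):

```python
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Compare the test predictions against the ground truth prices
print(f"MSE: {mean_squared_error(car_y_test, car_y_preds)}")
print(f"MAE: {mean_absolute_error(car_y_test, car_y_preds)}")
print(f"R^2: {r2_score(car_y_test, car_y_preds)}")
```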
Boom! Our model could potentially do with some hyperparameter tuning (this would be a great extension). And we could probably do with finding some more data on our problem, 1000 rows doesn't seem to be sufficient.
How would you export the trained regression model?
Extensions
You should be proud. Getting this far means you've worked through a classification problem and regression problem using pure (mostly) Scikit-Learn (no easy feat!).
For more exercises, check out the Scikit-Learn getting started documentation. A good practice would be to read through it and for the parts you find interesting, add them into the end of this notebook.
Finally, as always, remember, the best way to learn something new is to try it. And try it relentlessly. If you're unsure of how to do something, never be afraid to ask a question or search for something such as, "how to tune the hyperparameters of a scikit-learn ridge regression model".