GitHub Repository: mrdbourke/zero-to-mastery-ml
Path: blob/master/section-2-data-science-and-ml-tools/scikit-learn-exercises.ipynb

Scikit-Learn Practice Exercises

This notebook offers a set of exercises for different tasks with Scikit-Learn.

Notes:

  • There may be more than one different way to answer a question or complete an exercise.

  • Some skeleton code has been implemented for you.

  • Exercises are based on (and directly taken from) the quick introduction to Scikit-Learn notebook.

  • Different tasks will be detailed by comments or text. Places to put your own code are defined by ### (don't remove anything other than ###).

For further reference and resources, it's advised to check out the Scikit-Learn documentation.

And if you get stuck, try searching for a question in the following format: "how to do XYZ with Scikit-Learn", where XYZ is the function you want to leverage from Scikit-Learn.

Since we'll be working with data, we'll also import Scikit-Learn's common companions: Matplotlib, NumPy and pandas.

Let's get started.

# Setup matplotlib to plot inline (within the notebook)
###

# Import the pyplot module of Matplotlib as plt
###

# Import pandas under the abbreviation 'pd'
###

# Import NumPy under the abbreviation 'np'
###
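If you get stuck, the imports might look something like the following (one possible solution, assuming you're working in a Jupyter environment where the %matplotlib inline magic is available):

# Possible solution (one of several ways to do it)
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np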

End-to-end Scikit-Learn classification workflow

Let's start with an end-to-end Scikit-Learn workflow.

More specifically, we'll:

  1. Get a dataset ready

  2. Prepare a machine learning model to make predictions

  3. Fit the model to the data and make a prediction

  4. Evaluate the model's predictions

The data we'll be using is stored on GitHub. We'll start with heart-disease.csv, a dataset which contains anonymous patient data and whether or not they have heart disease.

Note: When viewing a .csv on GitHub, make sure it's in the raw format. For example, the URL should look like: https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/heart-disease.csv

1. Getting a dataset ready

# Import the heart disease dataset and save it to a variable
# using pandas and read_csv()
# Hint: You can directly pass the URL of a csv to read_csv()
heart_disease = ###

# Check the first 5 rows of the data
###
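One possible way to fill in the cell above, passing the raw GitHub URL straight to read_csv():

# Possible solution
heart_disease = pd.read_csv("https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/heart-disease.csv")
heart_disease.head()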

Our goal here is to build a machine learning model on all of the columns except target to predict target.

In essence, the target column is our target variable (also called y or labels) and the rest of the columns are our independent variables (also called data or X).

And since our target variable is one thing or another (heart disease or not), we know our problem is a classification problem (classifying whether something is one thing or another).

Knowing this, let's create X and y by splitting our dataframe up.

# Create X (all columns except target)
X = ###

# Create y (only the target column)
y = ###
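A possible solution, using pandas' drop() to remove the target column from X:

# Possible solution
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]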

Now we've split our data into X and y, we'll use Scikit-Learn to split it into training and test sets.

# Import train_test_split from sklearn's model_selection module
###

# Use train_test_split to split X & y into training and test sets
X_train, X_test, y_train, y_test = ###
# View the different shapes of the training and test datasets ###
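A possible solution for the two cells above (the 80/20 split via test_size=0.2 is an assumption, feel free to change it):

# Possible solution
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

X_train.shape, X_test.shape, y_train.shape, y_test.shape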

What do you notice about the different shapes of the data?

Since our data is now in training and test sets, we'll build a machine learning model to fit patterns in the training data and then make predictions on the test data.

To figure out which machine learning model we should use, you can refer to Scikit-Learn's machine learning map.

After following the map, you decide to use the RandomForestClassifier.

2. Preparing a machine learning model

# Import the RandomForestClassifier from sklearn's ensemble module
###

# Instantiate an instance of RandomForestClassifier as clf
clf = ###
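A possible solution, using the default hyperparameters of RandomForestClassifier:

# Possible solution
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()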

Now you've got a RandomForestClassifier instance, let's fit it to the training data.

Once it's fit, we'll make predictions on the test data.

3. Fitting a model and making predictions

# Fit the RandomForestClassifier to the training data
clf.fit(###, ###)
# Use the fitted model to make predictions on the test data and
# save the predictions to a variable called y_preds
y_preds = clf.predict(###)
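A possible solution for the two cells above:

# Possible solution
clf.fit(X_train, y_train)
y_preds = clf.predict(X_test)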

4. Evaluating a model's predictions

Evaluating predictions is as important as making them. Let's check how our model did by calling the score() method on it and passing it the training (X_train, y_train) and testing data (X_test, y_test).

# Evaluate the fitted model on the training set using the score() function ###
# Evaluate the fitted model on the test set using the score() function ###
  • How did your model go?

  • What metric does score() return for classifiers?

  • Did your model do better on the training dataset or test dataset?

Experimenting with different classification models

Now we've quickly covered an end-to-end Scikit-Learn workflow. Since experimenting is a large part of machine learning, we'll now try a series of different machine learning models and see which gets the best results on our dataset.

Going through the Scikit-Learn machine learning map, we see there are a number of different classification models we can try (different models are in the green boxes).

For this exercise, the models we're going to try and compare are:

  • LinearSVC

  • KNeighborsClassifier (also known as K-Nearest Neighbours or KNN)

  • SVC

  • LogisticRegression

  • RandomForestClassifier

We'll follow the same workflow we used above (except this time for multiple models):

  1. Import a machine learning model

  2. Get it ready

  3. Fit it to the data and make predictions

  4. Evaluate the fitted model

Note: Since we've already got the data ready, we can reuse it in this section.

# Import LinearSVC from sklearn's svm module
###

# Import KNeighborsClassifier from sklearn's neighbors module
###

# Import SVC from sklearn's svm module
###

# Import LogisticRegression from sklearn's linear_model module
###

# Note: we don't have to import RandomForestClassifier, since we already imported it above

Thanks to the consistency of Scikit-Learn's API design, we can use virtually the same code to fit, score and make predictions with each of our models.

To see which model performs best, we'll do the following:

  1. Instantiate each model in a dictionary

  2. Create an empty results dictionary

  3. Fit each model on the training data

  4. Score each model on the test data

  5. Check the results

If you're wondering what it means to instantiate each model in a dictionary, see the example below.

# EXAMPLE: Instantiating a RandomForestClassifier() in a dictionary
example_dict = {"RandomForestClassifier": RandomForestClassifier()}

# Create a dictionary called models which contains all of the classification models we've imported
# Make sure the dictionary is in the same format as example_dict
# The models dictionary should contain 5 models
models = {"LinearSVC": ###,
          "KNN": ###,
          "SVC": ###,
          "LogisticRegression": ###,
          "RandomForestClassifier": ###}

# Create an empty dictionary called results
results = ###
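A possible way to fill in the models and results dictionaries (the dictionary keys are just labels, each instance uses default hyperparameters):

# Possible solution
models = {"LinearSVC": LinearSVC(),
          "KNN": KNeighborsClassifier(),
          "SVC": SVC(),
          "LogisticRegression": LogisticRegression(),
          "RandomForestClassifier": RandomForestClassifier()}

results = {}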

Since each model we're using has the same fit() and score() functions, we can loop through our models dictionary, call fit() on the training data and then call score() with the test data.

# EXAMPLE: Looping through example_dict fitting and scoring the model
example_results = {}
for model_name, model in example_dict.items():
    model.fit(X_train, y_train)
    example_results[model_name] = model.score(X_test, y_test)

# EXAMPLE: View the results
example_results
# Loop through the models dictionary items, fitting the model on the training data
# and appending the model name and model score on the test data to the results dictionary
for model_name, model in ###:
    model.fit(###)
    results[model_name] = model.score(###)

# View the results
results
  • Which model performed the best?

  • Do the results change each time you run the cell?

  • Why do you think this is?

Due to the randomness of how each model finds patterns in the data, you might notice different results each time.

Without manually setting the random state using the random_state parameter of some models or using a NumPy random seed, every time you run the cell, you'll get slightly different results.

Let's see this in effect by running the same code as the cell above, except this time setting a NumPy random seed equal to 42.

# Run the same code as the cell above, except this time set a NumPy random seed
# equal to 42
np.random.seed(###)

for model_name, model in models.items():
    model.fit(X_train, y_train)
    results[model_name] = model.score(X_test, y_test)

results
  • Run the cell above a few times, what do you notice about the results?

  • Which model performs the best this time?

  • What happens if you add a NumPy random seed to the cell where you called train_test_split() (towards the top of the notebook) and then rerun the cell above?

Let's make our results a little more visual.

# Create a pandas dataframe with the data as the values of the results dictionary,
# the index as the keys of the results dictionary and a single column called accuracy.
# Be sure to save the dataframe to a variable.
results_df = pd.DataFrame(results.###(),
                          results.###(),
                          columns=[###])

# Create a bar plot of the results dataframe using plot.bar()
###
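A possible solution (wrapping the dictionary views in list() keeps pandas happy across versions):

# Possible solution
results_df = pd.DataFrame(list(results.values()),
                          index=list(results.keys()),
                          columns=["accuracy"])

results_df.plot.bar()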

Using np.random.seed(42) results in the LogisticRegression model performing the best (at least on my computer).

Let's tune its hyperparameters and see if we can improve it.

Hyperparameter Tuning

Remember, if you're ever trying to tune a machine learning model's hyperparameters and you're not sure where to start, you can always search something like "MODEL_NAME hyperparameter tuning".

In the case of LogisticRegression, you might come across articles, such as Hyperparameter Tuning Using Grid Search by Chris Albon.

The article uses GridSearchCV but we're going to be using RandomizedSearchCV.

The different hyperparameters to search over have been setup for you in log_reg_grid but feel free to change them.

# Different LogisticRegression hyperparameters
log_reg_grid = {"C": np.logspace(-4, 4, 20),
                "solver": ["liblinear"]}

Since we've got a set of hyperparameters we can import RandomizedSearchCV, pass it our dictionary of hyperparameters and let it search for the best combination.

# Setup np random seed of 42
np.random.seed(###)

# Import RandomizedSearchCV from sklearn's model_selection module

# Setup an instance of RandomizedSearchCV with a LogisticRegression() estimator,
# our log_reg_grid as the param_distributions, a cv of 5 and n_iter of 5.
rs_log_reg = RandomizedSearchCV(estimator=###,
                                param_distributions=###,
                                cv=###,
                                n_iter=###,
                                verbose=###)

# Fit the instance of RandomizedSearchCV
###
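One way the finished cell might look (verbose=True is an assumption, any small integer works too):

# Possible solution
np.random.seed(42)

from sklearn.model_selection import RandomizedSearchCV

rs_log_reg = RandomizedSearchCV(estimator=LogisticRegression(),
                                param_distributions=log_reg_grid,
                                cv=5,
                                n_iter=5,
                                verbose=True)

rs_log_reg.fit(X_train, y_train)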

Once RandomizedSearchCV has finished, we can find the best hyperparameters it found using the best_params_ attribute.

# Find the best parameters of the RandomizedSearchCV instance using the best_params_ attribute ###
# Score the instance of RandomizedSearchCV using the test data ###

After hyperparameter tuning, did the model's score improve? What else could you try to improve it? Are there any other methods of hyperparameter tuning you can find for LogisticRegression?

Classifier Model Evaluation

We've tried to find the best hyperparameters on our model using RandomizedSearchCV and so far we've only been evaluating our model using the score() function which returns accuracy.

But when it comes to classification, you'll likely want to use a few more evaluation metrics, including:

  • Confusion matrix - Compares the predicted values with the true values in a tabular way. If the model is 100% correct, all of the values lie on the diagonal (top left to bottom right) of the matrix.

  • Cross-validation - Splits your dataset into multiple parts, trains and tests your model on each part, then evaluates performance as an average.

  • Precision - Proportion of true positives over the total number of predicted positives (true positives plus false positives). Higher precision leads to fewer false positives.

  • Recall - Proportion of true positives over the total number of actual positives (true positives plus false negatives). Higher recall leads to fewer false negatives.

  • F1 score - Combines precision and recall into one metric. 1 is best, 0 is worst.

  • Classification report - Sklearn has a built-in function called classification_report() which returns some of the main classification metrics such as precision, recall and f1-score.

  • ROC Curve - Receiver Operating Characteristic curve, a plot of true positive rate versus false positive rate.

  • Area Under Curve (AUC) - The area underneath the ROC curve. A perfect model achieves a score of 1.0.

Before we get to these, we'll instantiate a new instance of our model using the best hyperparameters found by RandomizedSearchCV.

# Instantiate a LogisticRegression classifier using the best hyperparameters from RandomizedSearchCV
clf = LogisticRegression(###)

# Fit the new instance of LogisticRegression with the best hyperparameters on the training data
###
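A possible solution: unpack best_params_ directly into the new classifier (you could also copy the values in by hand):

# Possible solution
clf = LogisticRegression(**rs_log_reg.best_params_)
clf.fit(X_train, y_train)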

Now it's time to import the relevant Scikit-Learn methods for each of the classification evaluation metrics we're after.

# Import confusion_matrix and classification_report from sklearn's metrics module
###

# Import precision_score, recall_score and f1_score from sklearn's metrics module
###

# Import plot_roc_curve from sklearn's metrics module
###
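A possible set of imports. Note: plot_roc_curve is only available in older Scikit-Learn versions (roughly 0.22 to 1.1); in newer versions it has been replaced by RocCurveDisplay.from_estimator(), so adjust depending on your install:

# Possible solution
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import plot_roc_curve  # older Scikit-Learn versions only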

Evaluation metrics are very often comparing a model's predictions to some ground truth labels.

Let's make some predictions on the test data using our latest model and save them to y_preds.

# Make predictions on test data and save them ###

Time to use the predictions our model has made to evaluate it beyond accuracy.

# Create a confusion matrix using the confusion_matrix function ###

Challenge: The built-in confusion_matrix function in Scikit-Learn produces something not very visual. How could you make your confusion matrix more visual?

You might want to search something like "how to plot a confusion matrix". Note: There may be more than one way to do this.

# Create a more visual confusion matrix ###
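One possible approach is a quick pd.crosstab() (seaborn's heatmap() or Scikit-Learn's ConfusionMatrixDisplay are other options worth exploring):

# Possible solution
pd.crosstab(y_test,
            y_preds,
            rownames=["Actual"],
            colnames=["Predicted"])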

How about a classification report?

# Create a classification report using the classification_report function ###

Challenge: Write down what each of the columns in this classification report are.

  • Precision - Indicates the proportion of positive identifications (model predicted class 1) which were actually correct. A model which produces no false positives has a precision of 1.0.

  • Recall - Indicates the proportion of actual positives which were correctly classified. A model which produces no false negatives has a recall of 1.0.

  • F1 score - A combination of precision and recall. A perfect model achieves an F1 score of 1.0.

  • Support - The number of samples each metric was calculated on.

  • Accuracy - The accuracy of the model in decimal form. Perfect accuracy is equal to 1.0.

  • Macro avg - Short for macro average, the average precision, recall and F1 score between classes. Macro avg doesn't take class imbalance into account, so if you do have class imbalances, pay attention to this metric.

  • Weighted avg - Short for weighted average, the weighted average precision, recall and F1 score between classes. Weighted means each metric is calculated with respect to how many samples there are in each class. This metric will favour the majority class (e.g. it will give a high value when one class outperforms another due to having more samples).

The classification report gives us a range of values for precision, recall and F1 score. Time to find these metrics using Scikit-Learn functions.

# Find the precision score of the model using precision_score() ###
# Find the recall score ###
# Find the F1 score ###
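A possible solution for the three cells above:

# Possible solution
print(f"Precision: {precision_score(y_test, y_preds)}")
print(f"Recall: {recall_score(y_test, y_preds)}")
print(f"F1: {f1_score(y_test, y_preds)}")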

Confusion matrix: done. Classification report: done. ROC (receiver operator characteristic) curve & AUC (area under curve) score: not done.

Let's fix this.

If you're unfamiliar with what a ROC curve is, that's your first challenge: read up on what one is.

In a sentence, a ROC curve is a plot of the true positive rate versus the false positive rate.

And the AUC score is the area under the ROC curve.

Scikit-Learn provides a handy function for creating both of these called plot_roc_curve().

# Plot a ROC curve using our current machine learning model using plot_roc_curve ###

Beautiful! We've gone far beyond accuracy with a plethora of extra classification evaluation metrics.

If you're not sure about any of these, don't worry, they can take a while to understand. That could be an optional extension, reading up on a classification metric you're not sure of.

The thing to note here is all of these metrics have been calculated using a single training set and a single test set. Whilst this is okay, a more robust way is to calculate them using cross-validation.

We can calculate various cross-validated evaluation metrics using Scikit-Learn's cross_val_score() function along with the scoring parameter.

# Import cross_val_score from sklearn's model_selection module ###
# EXAMPLE: By default cross_val_score returns 5 values (cv=5).
cross_val_score(clf, X, y, scoring="accuracy", cv=5)
# EXAMPLE: Taking the mean of the returned values from cross_val_score
# gives a cross-validated version of the scoring metric.
cross_val_acc = np.mean(cross_val_score(clf, X, y, scoring="accuracy", cv=5))
cross_val_acc

In the examples, the cross-validated accuracy is found by taking the mean of the array returned by cross_val_score().

Now it's time to find the same for precision, recall and F1 score.

# Find the cross-validated precision ###
# Find the cross-validated recall ###
# Find the cross-validated F1 score ###
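A possible solution, swapping the scoring parameter for each metric and taking the mean:

# Possible solution
cross_val_precision = np.mean(cross_val_score(clf, X, y, scoring="precision", cv=5))
cross_val_recall = np.mean(cross_val_score(clf, X, y, scoring="recall", cv=5))
cross_val_f1 = np.mean(cross_val_score(clf, X, y, scoring="f1", cv=5))

cross_val_precision, cross_val_recall, cross_val_f1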

Exporting and importing a trained model

Once you've trained a model, you may want to export it and save it to file so you can share it or use it elsewhere.

One method of exporting and importing models is using the joblib library.

In Scikit-Learn, exporting and importing a trained model is known as model persistence.

# Import the dump and load functions from the joblib library ###
# Use the dump function to export the trained model to file ###
# Use the load function to import the trained model you just exported
# Save it to a different variable name to the original trained model
###

# Evaluate the loaded trained model on the test data
###
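A possible solution for the cells above (the filename is just an example, call it whatever you like):

# Possible solution
from joblib import dump, load

dump(clf, "trained-classifier.joblib")

loaded_clf = load("trained-classifier.joblib")
loaded_clf.score(X_test, y_test)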

What do you notice about the loaded trained model results versus the original (pre-exported) model results?

Scikit-Learn Regression Practice

For the next few exercises, we're going to be working on a regression problem, in other words, using some data to predict a number.

Our dataset is a table of car sales, containing different car characteristics as well as a sale price.

We'll use Scikit-Learn's built-in regression machine learning models to try to learn the patterns between the car characteristics and their prices on one portion of the dataset, before trying to predict the sale prices of a group of cars the model has never seen before.

To begin, we'll import the data from GitHub into a pandas DataFrame, check out some details about it and try to build a model as soon as possible.

# Read in the car sales data
car_sales = pd.read_csv("https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/car-sales-extended-missing-data.csv")

# View the first 5 rows of the car sales data
###
# Get information about the car sales DataFrame ###

Looking at the output of info(),

  • How many rows are there total?

  • What datatypes are in each column?

  • How many missing values are there in each column?

# Find number of missing values in each column ###
# Find the datatypes of each column of car_sales ###

Knowing this information, what would happen if we tried to model our data as it is?

Let's see.

# EXAMPLE: This doesn't work because our car_sales data isn't all numerical
from sklearn.ensemble import RandomForestRegressor

car_sales_X, car_sales_y = car_sales.drop("Price", axis=1), car_sales.Price
rf_regressor = RandomForestRegressor().fit(car_sales_X, car_sales_y)

As we see, the cell above breaks because our data contains non-numerical values as well as missing data.

To take care of some of the missing data, we'll remove the rows which have no labels (all the rows with missing values in the Price column).

# Remove rows with no labels (NaN's in the Price column) ###

Building a pipeline

Since our car_sales data has missing values and isn't all numerical, we'll have to fix these things before we can fit a machine learning model on it.

There are ways we could do this with pandas but since we're practicing Scikit-Learn, we'll see how we might do it with the Pipeline class.

Because we're modifying columns in our dataframe (filling missing values, converting non-numerical data to numbers) we'll need the ColumnTransformer, SimpleImputer and OneHotEncoder classes as well.

Finally, because we'll need to split our data into training and test sets, we'll import train_test_split as well.

# Import Pipeline from sklearn's pipeline module ### # Import ColumnTransformer from sklearn's compose module ### # Import SimpleImputer from sklearn's impute module ### # Import OneHotEncoder from sklearn's preprocessing module ### # Import train_test_split from sklearn's model_selection module ###
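A possible set of imports for the cell above:

# Possible solution
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split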

Now we've got the necessary tools we need to create our preprocessing Pipeline which fills missing values along with turning all non-numerical data into numbers.

Let's start with the categorical features.

# Define different categorical features
categorical_features = ["Make", "Colour"]

# Create categorical transformer Pipeline
categorical_transformer = Pipeline(steps=[
    # Set SimpleImputer strategy to "constant" and fill value to "missing"
    ("imputer", SimpleImputer(strategy=###, fill_value=###)),
    # Set OneHotEncoder to ignore the unknowns
    ("onehot", OneHotEncoder(handle_unknown=###))])

It would be safe to treat Doors as a categorical feature as well, however since we know the vast majority of cars have 4 doors, we'll impute the missing Doors values as 4.

# Define Doors features
door_feature = ["Doors"]

# Create Doors transformer Pipeline
door_transformer = Pipeline(steps=[
    # Set SimpleImputer strategy to "constant" and fill value to 4
    ("imputer", SimpleImputer(strategy=###, fill_value=###))])

Now onto the numeric features. In this case, the only numeric feature is the Odometer (KM) column. Let's fill its missing values with the median.

# Define numeric features (only the Odometer (KM) column)
numeric_features = ["Odometer (KM)"]

# Create numeric transformer Pipeline
numeric_transformer = ###(steps=[
    # Set SimpleImputer strategy to fill missing values with the "median"
    ("imputer", ###(strategy=###))])

Time to put all of our individual transformer Pipelines into a single ColumnTransformer instance.

# Setup preprocessing steps (fill missing values, then convert to numbers)
preprocessor = ColumnTransformer(
    transformers=[
        # Use the categorical_transformer to transform the categorical_features
        ("cat", categorical_transformer, ###),
        # Use the door_transformer to transform the door_feature
        ("door", ###, door_feature),
        # Use the numeric_transformer to transform the numeric_features
        ("num", ###, ###)])
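Filled in, the preprocessor might look like this (assuming the three transformer Pipelines above have been completed):

# Possible solution
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", categorical_transformer, categorical_features),
        ("door", door_transformer, door_feature),
        ("num", numeric_transformer, numeric_features)])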

Boom! Now our preprocessor is ready, time to import some regression models to try out.

Comparing our data to the Scikit-Learn machine learning map, we can see there's a handful of different regression models we can try.

# Import Ridge from sklearn's linear_model module
# Import SVR from sklearn's svm module
# Import RandomForestRegressor from sklearn's ensemble module

Again, thanks to the design of the Scikit-Learn library, we're able to use very similar code for each of these models.

To test them all, we'll create a dictionary of regression models and an empty dictionary for regression model results.

# Create dictionary of model instances, there should be 4 total key, value pairs
# in the form {"model_name": model_instance}.
# Don't forget there's two versions of SVR, one with a "linear" kernel and the
# other with kernel set to "rbf".
regression_models = {"Ridge": ###,
                     "SVR_linear": ###,
                     "SVR_rbf": ###,
                     "RandomForestRegressor": ###}

# Create an empty dictionary for the regression results
regression_results = ###
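A possible solution, including the imports from the previous cell (all models use default hyperparameters apart from the SVR kernels):

# Possible solution
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor

regression_models = {"Ridge": Ridge(),
                     "SVR_linear": SVR(kernel="linear"),
                     "SVR_rbf": SVR(kernel="rbf"),
                     "RandomForestRegressor": RandomForestRegressor()}

regression_results = {}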

Our regression model dictionary is prepared, along with an empty dictionary to append results to. Time to split the data into X (feature variables) and y (target variable) as well as training and test sets.

In our car sales problem, we're trying to use the different characteristics of a car (X) to predict its sale price (y).

# Create car sales X data (every column of car_sales except Price)
car_sales_X = ###

# Create car sales y data (the Price column of car_sales)
car_sales_y = ###
# Use train_test_split to split the car_sales_X and car_sales_y data into
# training and test sets.
# Give the test set 20% of the data using the test_size parameter.
# For reproducibility set the random_state parameter to 42.
car_X_train, car_X_test, car_y_train, car_y_test = train_test_split(###,
                                                                    ###,
                                                                    test_size=###,
                                                                    random_state=###)

# Check the shapes of the training and test datasets
###
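A possible solution for the two cells above:

# Possible solution
car_sales_X = car_sales.drop("Price", axis=1)
car_sales_y = car_sales["Price"]

car_X_train, car_X_test, car_y_train, car_y_test = train_test_split(car_sales_X,
                                                                    car_sales_y,
                                                                    test_size=0.2,
                                                                    random_state=42)

car_X_train.shape, car_X_test.shape, car_y_train.shape, car_y_test.shape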
  • How many rows are in each set?

  • How many columns are in each set?

Alright, our data is split into training and test sets, time to build a small loop which is going to:

  1. Go through our regression_models dictionary

  2. Create a Pipeline which contains our preprocessor as well as one of the models in the dictionary

  3. Fit the Pipeline to the car sales training data

  4. Evaluate the target model on the car sales test data and append the results to our regression_results dictionary

# Loop through the items in the regression_models dictionary
for model_name, model in regression_models.items():
    # Create a model Pipeline with a preprocessor step and model step
    model_pipeline = Pipeline(steps=[("preprocessor", ###),
                                     ("model", ###)])

    # Fit the model Pipeline to the car sales training data
    print(f"Fitting {model_name}...")
    model_pipeline.###(###, ###)

    # Score the model Pipeline on the test data appending the model_name to the
    # results dictionary
    print(f"Scoring {model_name}...")
    regression_results[model_name] = model_pipeline.score(###, ###)
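The completed loop might look something like this:

# Possible solution
for model_name, model in regression_models.items():
    model_pipeline = Pipeline(steps=[("preprocessor", preprocessor),
                                     ("model", model)])

    print(f"Fitting {model_name}...")
    model_pipeline.fit(car_X_train, car_y_train)

    print(f"Scoring {model_name}...")
    regression_results[model_name] = model_pipeline.score(car_X_test, car_y_test)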

Our regression models have been fit, let's see how they did!

# Check the results of each regression model by printing the regression_results # dictionary ###
  • Which model did the best?

  • How could you improve its results?

  • What metric does the score() method of a regression model return by default?

Since we've fitted some models but only compared them via the default metric contained in the score() method (R^2 score or coefficient of determination), let's take the Ridge regression model and evaluate it with a few other regression metrics.

Specifically, let's find:

  1. R^2 (pronounced r-squared) or coefficient of determination - Compares your model's predictions to the mean of the targets. Values can range from negative infinity (a very poor model) to 1. For example, if all your model does is predict the mean of the targets, its R^2 value would be 0. And if your model perfectly predicts a range of numbers, its R^2 value would be 1.

  2. Mean absolute error (MAE) - The average of the absolute differences between predictions and actual values. It gives you an idea of how wrong your predictions were.

  3. Mean squared error (MSE) - The average squared differences between predictions and actual values. Squaring the errors removes negative errors. It also amplifies outliers (samples which have larger errors).

Scikit-Learn has a few functions built in which are going to help us with these, namely mean_absolute_error, mean_squared_error and r2_score.

# Import mean_absolute_error from sklearn's metrics module
###

# Import mean_squared_error from sklearn's metrics module
###

# Import r2_score from sklearn's metrics module
###

All the evaluation metrics we're concerned with compare a model's predictions with the ground truth labels. Knowing this, we'll have to make some predictions.

Let's create a Pipeline with the preprocessor and a Ridge() model, fit it on the car sales training data and then make predictions on the car sales test data.

# Create RidgeRegression Pipeline with preprocessor as the "preprocessor" and
# Ridge() as the "model".
ridge_pipeline = ###(steps=[("preprocessor", ###),
                            ("model", Ridge())])

# Fit the RidgeRegression Pipeline to the car sales training data
ridge_pipeline.fit(###, ###)

# Make predictions on the car sales test data using the RidgeRegression Pipeline
car_y_preds = ridge_pipeline.###(###)

# View the first 50 predictions
###
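A possible solution:

# Possible solution
ridge_pipeline = Pipeline(steps=[("preprocessor", preprocessor),
                                 ("model", Ridge())])

ridge_pipeline.fit(car_X_train, car_y_train)

car_y_preds = ridge_pipeline.predict(car_X_test)
car_y_preds[:50]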

Nice! Now we've got some predictions, time to evaluate them. We'll find the mean squared error (MSE), mean absolute error (MAE) and R^2 score (coefficient of determination) of our model.

# EXAMPLE: Find the MSE by comparing the car sales test labels to the car sales predictions
mse = mean_squared_error(car_y_test, car_y_preds)

# Return the MSE
mse
# Find the MAE by comparing the car sales test labels to the car sales predictions
###

# Return the MAE
###
# Find the R^2 score by comparing the car sales test labels to the car sales predictions
###

# Return the R^2 score
###
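A possible solution for the two cells above, mirroring the MSE example:

# Possible solution
# Mean absolute error
mae = mean_absolute_error(car_y_test, car_y_preds)
mae

# R^2 score (coefficient of determination)
r2 = r2_score(car_y_test, car_y_preds)
r2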

Boom! Our model could potentially do with some hyperparameter tuning (this would be a great extension). And we could probably do with finding some more data on our problem, 1000 rows doesn't seem to be sufficient.

  • How would you export the trained regression model?

Extensions

You should be proud. Getting this far means you've worked through a classification problem and regression problem using pure (mostly) Scikit-Learn (no easy feat!).

For more exercises, check out the Scikit-Learn getting started documentation. A good practice would be to read through it and for the parts you find interesting, add them into the end of this notebook.

Finally, as always, remember, the best way to learn something new is to try it. And try it relentlessly. If you're unsure of how to do something, never be afraid to ask a question or search for something such as, "how to tune the hyperparameters of a scikit-learn ridge regression model".