Scikit-Learn Practice Solutions
This notebook offers a set of potential solutions to the Scikit-Learn exercise notebook.
Exercises are based off (and directly taken from) the quick introduction to Scikit-Learn notebook.
Different tasks will be detailed by comments or text.
For further reference and resources, it's advised to check out the Scikit-Learn documentation.
And if you get stuck, try searching for a question in the following format: "how to do XYZ with Scikit-Learn", where XYZ is the function you want to leverage from Scikit-Learn.
Since we'll be working with data, we'll import Matplotlib, NumPy and pandas alongside Scikit-Learn.
Let's get started.
End-to-end Scikit-Learn classification workflow
Let's start with an end-to-end Scikit-Learn workflow.
More specifically, we'll:
Get a dataset ready
Prepare a machine learning model to make predictions
Fit the model to the data and make a prediction
Evaluate the model's predictions
The data we'll be using is stored on GitHub. We'll start with `heart-disease.csv`, a dataset which contains anonymous patient data and whether or not they have heart disease.
Note: When viewing a `.csv` on GitHub, make sure it's in the raw format. For example, the URL should look like: https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/heart-disease.csv
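As a minimal sketch of the setup, the imports and data loading might look like this (the DataFrame name `heart_disease` is an assumption):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Read the heart disease dataset directly from the raw GitHub URL
heart_disease = pd.read_csv("https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/heart-disease.csv")

# View the first few rows
heart_disease.head()
```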
1. Getting a dataset ready
Our goal here is to build a machine learning model on all of the columns except `target` to predict `target`.
In essence, the `target` column is our target variable (also called `y` or labels) and the rest of the other columns are our independent variables (also called data or `X`).
And since our target variable is one thing or another (heart disease or not), we know our problem is a classification problem (classifying whether something is one thing or another).
Knowing this, let's create `X` and `y` by splitting our dataframe up.
Now we've split our data into `X` and `y`, we'll use Scikit-Learn to split it into training and test sets.
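One way to do this (the 80/20 split is an assumption; adjust `test_size` as you like):

```python
from sklearn.model_selection import train_test_split

# Reserve 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Check the shapes of each split
X_train.shape, X_test.shape, y_train.shape, y_test.shape
```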
What do you notice about the different shapes of the data?
Since our data is now in training and test sets, we'll build a machine learning model to fit patterns in the training data and then make predictions on the test data.
To figure out which machine learning model we should use, you can refer to Scikit-Learn's machine learning map.
After following the map, you decide to use the `RandomForestClassifier`.
2. Preparing a machine learning model
Now you've got a `RandomForestClassifier` instance, let's fit it to the training data.
Once it's fit, we'll make predictions on the test data.
3. Fitting a model and making predictions
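A sketch of these two steps combined, instantiating the model, fitting it to the training data and making predictions on the test data (default hyperparameters assumed):

```python
from sklearn.ensemble import RandomForestClassifier

# Instantiate the model with default hyperparameters
clf = RandomForestClassifier()

# Fit the model to the training data
clf.fit(X_train, y_train)

# Make predictions on the test data
y_preds = clf.predict(X_test)
```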
4. Evaluating a model's predictions
Evaluating predictions is as important as making them. Let's check how our model did by calling the `score()` method on it and passing it the training (`X_train`, `y_train`) and testing (`X_test`, `y_test`) data.
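For example (assuming the fitted classifier is called `clf`, as in the sketch above):

```python
# Evaluate the model on the training data
print(f"Train score: {clf.score(X_train, y_train)}")

# Evaluate the model on the test data
print(f"Test score: {clf.score(X_test, y_test)}")
```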
How did your model go?
What metric does `score()` return for classifiers?
Did your model do better on the training dataset or the test dataset?
Experimenting with different classification models
Now we've quickly covered an end-to-end Scikit-Learn workflow and since experimenting is a large part of machine learning, we'll now try a series of different machine learning models and see which gets the best results on our dataset.
Going through the Scikit-Learn machine learning map, we see there are a number of different classification models we can try (different models are in the green boxes).
For this exercise, the models we're going to try and compare are:
KNeighborsClassifier (also known as K-Nearest Neighbors or KNN)
SVC (also known as support vector classifier, a form of support vector machine)
LogisticRegression (despite the name, this is actually a classifier)
RandomForestClassifier (an ensemble method and what we used above)
We'll follow the same workflow we used above (except this time for multiple models):
Import a machine learning model
Get it ready
Fit it to the data and make predictions
Evaluate the fitted model
Note: Since we've already got the data ready, we can reuse it in this section.
Thanks to the consistency of Scikit-Learn's API design, we can use virtually the same code to fit, score and make predictions with each of our models.
To see which model performs best, we'll do the following:
Instantiate each model in a dictionary
Create an empty results dictionary
Fit each model on the training data
Score each model on the test data
Check the results
If you're wondering what it means to instantiate each model in a dictionary, see the example below.
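For instance, a dictionary of instantiated models might look like the sketch below (the dictionary names and exact constructor arguments are assumptions):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# One instance of each model, keyed by name
models = {"KNN": KNeighborsClassifier(),
          "SVC": SVC(),
          "LogisticRegression": LogisticRegression(),
          "RandomForestClassifier": RandomForestClassifier()}

# An empty dictionary to store each model's test score
results = {}
```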
Since each model we're using has the same `fit()` and `score()` methods, we can loop through our models dictionary, call `fit()` on the training data and then call `score()` on the test data.
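A sketch of such a loop, assuming the `models` and `results` dictionaries from above:

```python
# Fit and score each model in turn
for model_name, model in models.items():
    model.fit(X_train, y_train)                          # fit on the training data
    results[model_name] = model.score(X_test, y_test)    # score on the test data

results
```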
Which model performed the best?
Do the results change each time you run the cell?
Why do you think this is?
Due to the randomness of how each model finds patterns in the data, you might notice different results each time.
Without manually setting the random state using the `random_state` parameter of some models or using a NumPy random seed, every time you run the cell, you'll get slightly different results.
Let's see this in effect by running the same code as the cell above, except this time setting a NumPy random seed equal to 42.
Run the cell above a few times, what do you notice about the results?
Which model performs the best this time?
What happens if you add a NumPy random seed to the cell where you called `train_test_split()` (towards the top of the notebook) and then rerun the cell above?
Let's make our results a little more visual.
Using `np.random.seed(42)` results in the `LogisticRegression` model performing the best (at least on my computer).
Let's tune its hyperparameters and see if we can improve it.
Hyperparameter Tuning
Remember, if you're ever trying to tune a machine learning model's hyperparameters and you're not sure where to start, you can always search something like "MODEL_NAME hyperparameter tuning".
In the case of LogisticRegression, you might come across articles, such as Hyperparameter Tuning Using Grid Search by Chris Albon.
The article uses `GridSearchCV` but we're going to be using `RandomizedSearchCV`.
The different hyperparameters to search over have been set up for you in `log_reg_grid` but feel free to change them.
Since we've got a set of hyperparameters, we can import `RandomizedSearchCV`, pass it our dictionary of hyperparameters and let it search for the best combination.
Once `RandomizedSearchCV` has finished, we can find the best hyperparameters it found using the `best_params_` attribute.
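As a sketch, the hyperparameter grid and the randomized search might look like this (the exact values in `log_reg_grid`, plus `cv` and `n_iter`, are assumptions; feel free to change them):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Hyperparameter distributions to sample from (illustrative values)
log_reg_grid = {"C": np.logspace(-4, 4, 20),
                "solver": ["liblinear"]}

# Set up the randomized search over LogisticRegression hyperparameters
rs_log_reg = RandomizedSearchCV(estimator=LogisticRegression(),
                                param_distributions=log_reg_grid,
                                cv=5,       # 5-fold cross-validation
                                n_iter=20,  # number of combinations to try
                                verbose=True)

# Search for the best hyperparameters on the training data
rs_log_reg.fit(X_train, y_train)

# Inspect the best combination found
rs_log_reg.best_params_
```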
After hyperparameter tuning, did the model's score improve? What else could you try to improve it? Are there any other methods of hyperparameter tuning you can find for `LogisticRegression`?
Classifier Model Evaluation
We've tried to find the best hyperparameters on our model using `RandomizedSearchCV` and so far we've only been evaluating our model using the `score()` method, which returns accuracy for classifiers.
But when it comes to classification, you'll likely want to use a few more evaluation metrics, including:
Confusion matrix - Compares the predicted values with the true values in a tabular way; if 100% correct, all values in the matrix will lie on the diagonal from top left to bottom right.
Cross-validation - Splits your dataset into multiple parts, trains and tests your model on each part, then evaluates performance as an average.
Precision - Proportion of true positives over the total number of predicted positives (true positives + false positives). Higher precision leads to fewer false positives.
Recall - Proportion of true positives over the total number of actual positives (true positives + false negatives). Higher recall leads to fewer false negatives.
F1 score - Combines precision and recall into one metric. 1 is best, 0 is worst.
Classification report - Sklearn has a built-in function called `classification_report()` which returns some of the main classification metrics such as precision, recall and F1 score.
ROC Curve - Receiver Operating Characteristic is a plot of true positive rate versus false positive rate.
Area Under Curve (AUC) - The area underneath the ROC curve. A perfect model achieves a score of 1.0.
Before we get to these, we'll instantiate a new instance of our model using the best hyperparameters found by `RandomizedSearchCV`.
Now it's time to import the relevant Scikit-Learn methods for each of the classification evaluation metrics we're after.
Evaluation metrics very often compare a model's predictions to some ground truth labels.
Let's make some predictions on the test data using our latest model and save them to `y_preds`.
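For example (assuming the tuned model instance is called `clf`):

```python
# Make predictions on the test data with the tuned model
y_preds = clf.predict(X_test)

# View the first 10 predictions
y_preds[:10]
```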
Time to use the predictions our model has made to evaluate it beyond accuracy.
Challenge: The in-built `confusion_matrix` function in Scikit-Learn produces something not too visual. How could you make your confusion matrix more visual?
You might want to search something like "how to plot a confusion matrix". Note: There may be more than one way to do this.
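One possible approach is a sketch using `ConfusionMatrixDisplay` (available in recent Scikit-Learn versions); a Seaborn heatmap over the raw `confusion_matrix` output is another option:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Plot a confusion matrix from the true labels and the predictions
ConfusionMatrixDisplay.from_predictions(y_true=y_test, y_pred=y_preds)
plt.show()
```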
How about a classification report?
Challenge: Write down what each of the columns in this classification report are.
Precision - Indicates the proportion of positive identifications (model predicted class 1) which were actually correct. A model which produces no false positives has a precision of 1.0.
Recall - Indicates the proportion of actual positives which were correctly classified. A model which produces no false negatives has a recall of 1.0.
F1 score - A combination of precision and recall. A perfect model achieves an F1 score of 1.0.
Support - The number of samples each metric was calculated on.
Accuracy - The accuracy of the model in decimal form. Perfect accuracy is equal to 1.0.
Macro avg - Short for macro average, the average precision, recall and F1 score between classes. Macro avg doesn't take class imbalance into account, so if you do have class imbalances, pay attention to this metric.
Weighted avg - Short for weighted average, the weighted average precision, recall and F1 score between classes. Weighted means each metric is calculated with respect to how many samples there are in each class. This metric will favour the majority class (e.g. it will give a high value when one class outperforms another due to having more samples).
The classification report gives us a range of values for precision, recall and F1 score. Time to find these metrics using Scikit-Learn functions.
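A sketch using Scikit-Learn's metric functions on the test predictions from above:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Compare the predictions against the ground truth test labels
print(f"Precision: {precision_score(y_test, y_preds)}")
print(f"Recall: {recall_score(y_test, y_preds)}")
print(f"F1 score: {f1_score(y_test, y_preds)}")
```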
Confusion matrix: done. Classification report: done. ROC (receiver operator characteristic) curve & AUC (area under curve) score: not done.
Let's fix this.
If you're unfamiliar with what a ROC curve is, that's your first challenge: read up on what one is.
In a sentence, a ROC curve is a plot of the true positive rate versus the false positive rate.
And the AUC score is the area underneath the ROC curve.
Scikit-Learn provides a handy function for creating both of these called `plot_roc_curve()`.
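Note that `plot_roc_curve()` has been removed in newer Scikit-Learn releases; if it's unavailable in your version, `RocCurveDisplay` produces the same plot. A sketch assuming the fitted classifier `clf`:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay, roc_auc_score

# Plot the ROC curve using the fitted classifier and the test data
RocCurveDisplay.from_estimator(clf, X_test, y_test)
plt.show()

# AUC score from the predicted probabilities of the positive class
print(f"AUC: {roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])}")
```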
Beautiful! We've gone far beyond accuracy with a plethora of extra classification evaluation metrics.
If you're not sure about any of these, don't worry, they can take a while to understand. That could be an optional extension: reading up on a classification metric you're not sure of.
The thing to note here is all of these metrics have been calculated using a single training set and a single test set. Whilst this is okay, a more robust way is to calculate them using cross-validation.
We can calculate various evaluation metrics with cross-validation using Scikit-Learn's `cross_val_score()` function along with the `scoring` parameter.
In the examples, the cross-validated accuracy is found by taking the mean of the array returned by `cross_val_score()`.
Now it's time to find the same for precision, recall and F1 score.
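A sketch using `cross_val_score` with different `scoring` strings (5-fold cross-validation is an assumption):

```python
import numpy as np
from sklearn.model_selection import cross_val_score

# Cross-validated metrics, each averaged over 5 folds of the full dataset
cv_precision = np.mean(cross_val_score(clf, X, y, cv=5, scoring="precision"))
cv_recall = np.mean(cross_val_score(clf, X, y, cv=5, scoring="recall"))
cv_f1 = np.mean(cross_val_score(clf, X, y, cv=5, scoring="f1"))

cv_precision, cv_recall, cv_f1
```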
Exporting and importing a trained model
Once you've trained a model, you may want to export it and save it to file so you can share it or use it elsewhere.
One method of exporting and importing models is using the joblib library.
In Scikit-Learn, exporting and importing a trained model is known as model persistence.
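A sketch of exporting and re-importing a model with joblib (the filename is an assumption):

```python
from joblib import dump, load

# Save the trained model to file
dump(clf, "trained-classifier.joblib")

# Load the saved model back in and evaluate it on the test data
loaded_clf = load("trained-classifier.joblib")
loaded_clf.score(X_test, y_test)
```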
What do you notice about the loaded trained model results versus the original (pre-exported) model results?
Scikit-Learn Regression Practice
For the next few exercises, we're going to be working on a regression problem, in other words, using some data to predict a number.
Our dataset is a table of car sales, containing different car characteristics as well as a sale price.
We'll use Scikit-Learn's built-in regression machine learning models to try and learn the patterns in the car characteristics and their prices on a certain group of the dataset before trying to predict the sale price of a group of cars the model has never seen before.
To begin, we'll import the data from GitHub into a pandas DataFrame, check out some details about it and try to build a model as soon as possible.
Looking at the output of `info()`:
How many rows are there in total?
What datatypes are in each column?
How many missing values are there in each column?
Knowing this information, what would happen if we tried to model our data as it is?
Let's see.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-44-476d8071e1b5> in <module>
2 from sklearn.ensemble import RandomForestRegressor
3 car_sales_X, car_sales_y = car_sales.drop("Price", axis=1), car_sales.Price
----> 4 rf_regressor = RandomForestRegressor().fit(car_sales_X, car_sales_y)
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.6/site-packages/sklearn/ensemble/_forest.py in fit(self, X, y, sample_weight)
293 """
294 # Validate or convert input data
--> 295 X = check_array(X, accept_sparse="csc", dtype=DTYPE)
296 y = check_array(y, accept_sparse='csc', ensure_2d=False, dtype=None)
297 if sample_weight is not None:
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
513 array = array.astype(dtype, casting="unsafe", copy=False)
514 else:
--> 515 array = np.asarray(array, order=order, dtype=dtype)
516 except ComplexWarning:
517 raise ValueError("Complex data not supported\n"
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.6/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
83
84 """
---> 85 return array(a, dtype, copy=False, order=order)
86
87
ValueError: could not convert string to float: 'Honda'
As we see, the cell above breaks because our data contains non-numerical values as well as missing data.
To take care of some of the missing data, we'll remove the rows which have no labels (all the rows with missing values in the `Price` column).
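For example, using pandas on the `car_sales` DataFrame:

```python
# Remove rows with no label (missing values in the Price column)
car_sales.dropna(subset=["Price"], inplace=True)
```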
Building a pipeline
Since our `car_sales` data has missing numerical values and isn't all numerical, we'll have to fix these things before we can fit a machine learning model on it.
There are ways we could do this with pandas, but since we're practicing Scikit-Learn, we'll see how we might do it with the `Pipeline` class.
Because we're modifying columns in our dataframe (filling missing values, converting non-numerical data to numbers), we'll need the `ColumnTransformer`, `SimpleImputer` and `OneHotEncoder` classes as well.
Finally, because we'll need to split our data into training and test sets, we'll import `train_test_split` as well.
Now we've got the necessary tools we need to create our preprocessing `Pipeline`, which fills missing values along with turning all non-numerical data into numbers.
Let's start with the categorical features.
It would be safe to treat `Doors` as a categorical feature as well. However, since we know the vast majority of cars have 4 doors, we'll impute the missing `Doors` values as 4.
Now onto the numeric features. In this case, the only numeric feature is the `Odometer (KM)` column. Let's fill its missing values with the median.
Time to put all of our individual transformer `Pipeline`s into a single `ColumnTransformer` instance.
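A sketch of the full preprocessing setup, building one `Pipeline` per feature group and combining them in a `ColumnTransformer` (the exact column groupings such as "Make" and "Colour" are assumptions based on the descriptions above):

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Categorical features: fill missing values, then one-hot encode
categorical_features = ["Make", "Colour"]
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))])

# Doors: treated as categorical, but missing values imputed as 4
door_features = ["Doors"]
door_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value=4))])

# Numeric features: fill missing Odometer (KM) values with the median
numeric_features = ["Odometer (KM)"]
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median"))])

# Combine the individual transformers into a single preprocessor
preprocessor = ColumnTransformer(transformers=[
    ("categorical", categorical_transformer, categorical_features),
    ("door", door_transformer, door_features),
    ("numeric", numeric_transformer, numeric_features)])
```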
Boom! Now our `preprocessor` is ready, time to import some regression models to try out.
Comparing our data to the Scikit-Learn machine learning map, we can see there's a handful of different regression models we can try.
SVR(kernel="linear") - short for Support Vector Regressor with a linear kernel, a form of support vector machine.
SVR(kernel="rbf") - short for Support Vector Regressor with an RBF (radial basis function) kernel, a form of support vector machine.
RandomForestRegressor - the regression version of RandomForestClassifier.
Again, thanks to the design of the Scikit-Learn library, we're able to use very similar code for each of these models.
To test them all, we'll create a dictionary of regression models and an empty dictionary for regression model results.
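For example, a sketch of the two dictionaries (`Ridge` is included here as an assumption, since it's evaluated later in the notebook):

```python
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor

# Dictionary of regression models to compare
regression_models = {"Ridge": Ridge(),
                     "SVR_linear": SVR(kernel="linear"),
                     "SVR_rbf": SVR(kernel="rbf"),
                     "RandomForestRegressor": RandomForestRegressor()}

# Empty dictionary to store each model's results
regression_results = {}
```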
Our regression model dictionary is prepared as well as an empty dictionary to append results to. Time to get the data split into `X` (feature variables) and `y` (target variable) as well as training and test sets.
In our car sales problem, we're trying to use the different characteristics of a car (`X`) to predict its sale price (`y`).
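A sketch of the split (the 80/20 test size and the variable names are assumptions):

```python
from sklearn.model_selection import train_test_split

# Features are every column except Price, target is Price
car_sales_X = car_sales.drop("Price", axis=1)
car_sales_y = car_sales["Price"]

# Split into training and test sets
car_X_train, car_X_test, car_y_train, car_y_test = train_test_split(
    car_sales_X, car_sales_y, test_size=0.2)

# Check how many rows and columns ended up in each set
car_X_train.shape, car_X_test.shape
```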
How many rows are in each set?
How many columns are in each set?
Alright, our data is split into training and test sets, time to build a small loop (sketched below) which is going to:
Go through our `regression_models` dictionary
Create a `Pipeline` which contains our `preprocessor` as well as one of the models in the dictionary
Fit the `Pipeline` to the car sales training data
Evaluate the target model on the car sales test data and append the results to our `regression_results` dictionary
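Here's one possible version of that loop, assuming the names from the sketches above (`preprocessor`, `regression_models`, `regression_results` and the car sales train/test splits):

```python
from sklearn.pipeline import Pipeline

for model_name, model in regression_models.items():
    # Chain the preprocessor with the current model
    model_pipeline = Pipeline(steps=[("preprocessor", preprocessor),
                                     ("model", model)])

    # Fit the pipeline to the car sales training data
    model_pipeline.fit(car_X_train, car_y_train)

    # Score the pipeline on the car sales test data and store the result
    regression_results[model_name] = model_pipeline.score(car_X_test, car_y_test)

regression_results
```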
Our regression models have been fit, let's see how they did!
Which model did the best?
How could you improve its results?
What metric does the `score()` method of a regression model return by default?
Since we've fitted some models but only compared them via the default metric contained in the `score()` method (R^2 score or coefficient of determination), let's take the `Ridge` regression model and evaluate it with a few other regression metrics.
Specifically, let's find:
R^2 (pronounced r-squared) or coefficient of determination - Compares your model's predictions to the mean of the targets. Values can range from negative infinity (a very poor model) to 1. For example, if all your model does is predict the mean of the targets, its R^2 value would be 0. And if your model perfectly predicts a range of numbers, its R^2 value would be 1.
Mean absolute error (MAE) - The average of the absolute differences between predictions and actual values. It gives you an idea of how wrong your predictions were.
Mean squared error (MSE) - The average squared differences between predictions and actual values. Squaring the errors removes negative errors. It also amplifies outliers (samples which have larger errors).
Scikit-Learn has a few functions built in which are going to help us with these, namely `mean_absolute_error`, `mean_squared_error` and `r2_score`.
All the evaluation metrics we're concerned with compare a model's predictions with the ground truth labels. Knowing this, we'll have to make some predictions.
Let's create a `Pipeline` with the `preprocessor` and a `Ridge()` model, fit it on the car sales training data and then make predictions on the car sales test data.
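A sketch of that pipeline, reusing the `preprocessor` and the car sales splits from earlier:

```python
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge

# Preprocessing plus a Ridge model in one pipeline
ridge_pipeline = Pipeline(steps=[("preprocessor", preprocessor),
                                 ("model", Ridge())])

# Fit on the training data, then predict on the test data
ridge_pipeline.fit(car_X_train, car_y_train)
car_y_preds = ridge_pipeline.predict(car_X_test)
```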
Nice! Now we've got some predictions, time to evaluate them. We'll find the mean squared error (MSE), mean absolute error (MAE) and R^2 score (coefficient of determination) of our model.
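And evaluating those predictions (a sketch using the predictions made above):

```python
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Compare the test predictions against the ground truth prices
print(f"MSE: {mean_squared_error(car_y_test, car_y_preds)}")
print(f"MAE: {mean_absolute_error(car_y_test, car_y_preds)}")
print(f"R^2: {r2_score(car_y_test, car_y_preds)}")
```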
Boom! Our model could potentially do with some hyperparameter tuning (this would be a great extension). And we could probably do with finding some more data on our problem, 1000 rows doesn't seem to be sufficient.
How would you export the trained regression model?
Extensions
You should be proud. Getting this far means you've worked through a classification problem and regression problem using pure (mostly) Scikit-Learn (no easy feat!).
For more exercises, check out the Scikit-Learn getting started documentation. A good practice would be to read through it and for the parts you find interesting, add them into the end of this notebook.
Finally, as always, remember, the best way to learn something new is to try it. And try it relentlessly. If you're unsure of how to do something, never be afraid to ask a question or search for something such as, "how to tune the hyperparameters of a scikit-learn ridge regression model".