GitHub Repository: mrdbourke/zero-to-mastery-ml
Path: blob/master/section-2-data-science-and-ml-tools/scikit-learn-what-were-covering.ipynb
⁸⁷⁴ views

Kernel: Python 3

What we're covering in the Scikit-Learn Introduction

This notebook outlines the content convered in the Scikit-Learn Introduction.

It's a quick stop to see all the Scikit-Learn functions and modules for each section outlined.

What we're covering follows the following diagram detailing a Scikit-Learn workflow.

0. Standard library imports

For all machine learning projects, you'll often see these libraries (Matplotlib, NumPy and pandas) imported at the top.

In [1]:

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

We'll use 2 datasets for demonstration purposes.

heart_disease - a classification dataset (predicting whether someone has heart disease or not)
boston_df - a regression dataset (predicting the median house prices of cities in Boston)

In [2]:

# Classification data
heart_disease = pd.read_csv("../data/heart-disease.csv")

# Regression data
from sklearn.datasets import load_boston
boston = load_boston() # loads as dictionary
# Convert dictionary to dataframe
boston_df = pd.DataFrame(boston["data"], columns=boston["feature_names"])
boston_df["target"] = pd.Series(boston["target"])

1. Get the data ready

In [3]:

# Split data into X & y
X = heart_disease.drop("target", axis=1) # use all columns except target
y = heart_disease["target"] # we want to predict y using X

In [4]:

# Split the data into training and test sets
from sklearn.model_selection import train_test_split
# Example use case (requires X & y)
X_train, X_test, y_train, y_test = train_test_split(X, y)

2. Pick a model/estimator (to suit your problem)

To pick a model we use the Scikit-Learn machine learning map.

Note: Scikit-Learn refers to machine learning models and algorithms as estimators.

In [5]:

# Random Forest Classifier (for classification problems)
from sklearn.ensemble import RandomForestClassifier
# Instantiating a Random Forest Classifier (clf short for classifier)
clf = RandomForestClassifier()

In [6]:

# Random Forest Regressor (for regression problems)
from sklearn.ensemble import RandomForestRegressor
# Instantiating a Random Forest Regressor
model = RandomForestRegressor()

3. Fit the model to the data and make a prediction

In [7]:

# All models/estimators have the fit() function built-in
clf.fit(X_train, y_train)

# Once fit is called, you can make predictions using predict()
y_preds = clf.predict(X_test)

# You can also predict with probabilities (on classification models)
y_probs = clf.predict_proba(X_test)

# View preds/probabilities
y_preds, y_probs

Out[7]:

/Users/daniel/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)

(array([0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0,
        0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1,
        0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0,
        1, 1, 1, 1, 1, 0, 1, 0, 1, 0]), array([[0.5, 0.5],
        [0.2, 0.8],
        [0.4, 0.6],
        [0.5, 0.5],
        [0.8, 0.2],
        [0.9, 0.1],
        [0.5, 0.5],
        [0.9, 0.1],
        [0.2, 0.8],
        [1. , 0. ],
        [0.5, 0.5],
        [0.8, 0.2],
        [0.5, 0.5],
        [0.5, 0.5],
        [0.2, 0.8],
        [0.5, 0.5],
        [0.6, 0.4],
        [0.2, 0.8],
        [0.4, 0.6],
        [1. , 0. ],
        [0.5, 0.5],
        [0.5, 0.5],
        [0.9, 0.1],
        [0.6, 0.4],
        [0.2, 0.8],
        [0.4, 0.6],
        [0.5, 0.5],
        [0.4, 0.6],
        [0.9, 0.1],
        [0.7, 0.3],
        [0.7, 0.3],
        [0.2, 0.8],
        [1. , 0. ],
        [0.1, 0.9],
        [0.6, 0.4],
        [0.8, 0.2],
        [1. , 0. ],
        [0.1, 0.9],
        [1. , 0. ],
        [0.5, 0.5],
        [0.6, 0.4],
        [1. , 0. ],
        [0.3, 0.7],
        [0. , 1. ],
        [0.9, 0.1],
        [0.6, 0.4],
        [0. , 1. ],
        [0.3, 0.7],
        [0.8, 0.2],
        [0.6, 0.4],
        [0.4, 0.6],
        [0.6, 0.4],
        [0.2, 0.8],
        [0.4, 0.6],
        [0. , 1. ],
        [0.9, 0.1],
        [0.8, 0.2],
        [0.3, 0.7],
        [0. , 1. ],
        [0.9, 0.1],
        [0.1, 0.9],
        [0.1, 0.9],
        [0.3, 0.7],
        [1. , 0. ],
        [0.9, 0.1],
        [0.6, 0.4],
        [0.3, 0.7],
        [0.3, 0.7],
        [0. , 1. ],
        [0.3, 0.7],
        [0.1, 0.9],
        [0.6, 0.4],
        [0. , 1. ],
        [0.7, 0.3],
        [0. , 1. ],
        [1. , 0. ]]))

4. Evaluate the model

Every Scikit-Learn model has a default metric which is accessible through the score() function.

However there are a range of different evaluation metrics you can use depending on the model you're using.

A full list of evaluation metrics can be found in the documentation.

In [8]:

# All models/estimators have a score() function
clf.score(X_test, y_test)

Out[8]:

0.8026315789473685

In [9]:

# Evaluting a model using cross-validation is possible with cross_val_score
from sklearn.model_selection import cross_val_score

# scoring=None means default score() metric is used
print(cross_val_score(estimator=clf, 
                      X=X, 
                      y=y, 
                      cv=5, # use 5-fold cross-validation
                      scoring=None)) 

# Evaluate a model with a different scoring method
print(cross_val_score(estimator=clf, 
                      X=X, 
                      y=y,
                      cv=5, # use 5-fold cross-validation
                      scoring="precision"))

Out[9]:

[0.78688525 0.86885246 0.7704918  0.78333333 0.81666667]
[0.8        0.92592593 0.85185185 0.83870968 0.75      ]

In [10]:

# Different classification metrics

# Accuracy
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_preds))

# Reciver Operating Characteristic (ROC curve)/Area under curve (AUC)
from sklearn.metrics import roc_curve, roc_auc_score
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_probs[:, 1])
print(roc_auc_score(y_test, y_preds))

# Confusion matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_preds))

# Classification report
from sklearn.metrics import classification_report
print(classification_report(y_test, y_preds))

Out[10]:

0.8026315789473685
0.804920304920305
[[33  4]
 [11 28]]
              precision    recall  f1-score   support

           0       0.75      0.89      0.81        37
           1       0.88      0.72      0.79        39

    accuracy                           0.80        76
   macro avg       0.81      0.80      0.80        76
weighted avg       0.81      0.80      0.80        76

In [11]:

# Different regression metrics

# Make predictions first
X = boston_df.drop("target", axis=1)
y = boston_df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = RandomForestRegressor()
model.fit(X_train, y_train)
y_preds = model.predict(X_test)

# R^2 (pronounced r-squared) or coefficient of determination
from sklearn.metrics import r2_score
print(r2_score(y_test, y_preds))

# Mean absolute error (MAE)
from sklearn.metrics import mean_absolute_error
print(mean_absolute_error(y_test, y_preds))

# Mean square error (MSE)
from sklearn.metrics import mean_squared_error
print(mean_squared_error(y_test, y_preds))

Out[11]:

8987155770408454
9618627450980388
75367352941176

/Users/daniel/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)

5. Improve through experimentation

Two of the main methods to improve a models baseline metrics (the first evaluation metrics you get).

From a data perspective asks:

Could we collect more data? In machine learning, more data is generally better, as it gives a model more opportunities to learn patterns.
Could we improve our data? This could mean filling in misisng values or finding a better encoding (turning things into numbers) strategy.

From a model perspective asks:

Is there a better model we could use? If you've started out with a simple model, could you use a more complex one? (we saw an example of this when looking at the Scikit-Learn machine learning map, ensemble methods are generally considered more complex models)
Could we improve the current model? If the model you're using performs well straight out of the box, can the hyperparameters be tuned to make it even better?

Hyperparameters are like settings on a model you can adjust so some of the ways it uses to find patterns are altered and potentially improved. Adjusting hyperparameters is referred to as hyperparameter tuning.

In [12]:

# How to find a model's hyperparameters
clf = RandomForestClassifier()
clf.get_params() # returns a list of adjustable hyperparameters

Out[12]:

{'bootstrap': True,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 'warn',
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [13]:

# Example of adjusting hyperparameters by hand

# Split data into X & y
X = heart_disease.drop("target", axis=1) # use all columns except target
y = heart_disease["target"] # we want to predict y using X

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Instantiate two models with different settings
clf_1 = RandomForestClassifier(n_estimators=100)
clf_2 = RandomForestClassifier(n_estimators=200)

# Fit both models on training data
clf_1.fit(X_train, y_train)
clf_2.fit(X_train, y_train)

# Evaluate both models on test data and see which is best
print(clf_1.score(X_test, y_test))
print(clf_2.score(X_test, y_test))

Out[13]:

0.868421052631579
0.8552631578947368

In [14]:

# Example of adjusting hyperparameters computationally (recommended)

from sklearn.model_selection import RandomizedSearchCV

# Define a grid of hyperparameters
grid = {"n_estimators": [10, 100, 200, 500, 1000, 1200],
        "max_depth": [None, 5, 10, 20, 30],
        "max_features": ["auto", "sqrt"],
        "min_samples_split": [2, 4, 6],
        "min_samples_leaf": [1, 2, 4]}

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Set n_jobs to -1 to use all cores (NOTE: n_jobs=-1 is broken as of 8 Dec 2019, using n_jobs=1 works)
clf = RandomForestClassifier(n_jobs=1)

# Setup RandomizedSearchCV
rs_clf = RandomizedSearchCV(estimator=clf,
                            param_distributions=grid,
                            n_iter=10, # try 10 models total
                            cv=5, # 5-fold cross-validation
                            verbose=2) # print out results

# Fit the RandomizedSearchCV version of clf
rs_clf.fit(X_train, y_train);

# Find the best hyperparameters
print(rs_clf.best_params_)

# Scoring automatically uses the best hyperparameters
rs_clf.score(X_test, y_test)

Out[14]:

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] n_estimators=100, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10 
[CV]  n_estimators=100, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10, total=   0.2s
[CV] n_estimators=100, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10 

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.2s remaining:    0.0s

[CV]  n_estimators=100, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10, total=   0.2s
[CV] n_estimators=100, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10 
[CV]  n_estimators=100, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10, total=   0.1s
[CV] n_estimators=100, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10 
[CV]  n_estimators=100, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10, total=   0.1s
[CV] n_estimators=100, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10 
[CV]  n_estimators=100, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10, total=   0.2s
[CV] n_estimators=200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=None 
[CV]  n_estimators=200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=None, total=   0.2s
[CV] n_estimators=200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=None 
[CV]  n_estimators=200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=None, total=   0.3s
[CV] n_estimators=200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=None 
[CV]  n_estimators=200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=None, total=   0.3s
[CV] n_estimators=200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=None 
[CV]  n_estimators=200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=None, total=   0.3s
[CV] n_estimators=200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=None 
[CV]  n_estimators=200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=None, total=   0.2s
[CV] n_estimators=10, min_samples_split=4, min_samples_leaf=4, max_features=sqrt, max_depth=20 
[CV]  n_estimators=10, min_samples_split=4, min_samples_leaf=4, max_features=sqrt, max_depth=20, total=   0.0s
[CV] n_estimators=10, min_samples_split=4, min_samples_leaf=4, max_features=sqrt, max_depth=20 
[CV]  n_estimators=10, min_samples_split=4, min_samples_leaf=4, max_features=sqrt, max_depth=20, total=   0.0s
[CV] n_estimators=10, min_samples_split=4, min_samples_leaf=4, max_features=sqrt, max_depth=20 
[CV]  n_estimators=10, min_samples_split=4, min_samples_leaf=4, max_features=sqrt, max_depth=20, total=   0.0s
[CV] n_estimators=10, min_samples_split=4, min_samples_leaf=4, max_features=sqrt, max_depth=20 
[CV]  n_estimators=10, min_samples_split=4, min_samples_leaf=4, max_features=sqrt, max_depth=20, total=   0.0s
[CV] n_estimators=10, min_samples_split=4, min_samples_leaf=4, max_features=sqrt, max_depth=20 
[CV]  n_estimators=10, min_samples_split=4, min_samples_leaf=4, max_features=sqrt, max_depth=20, total=   0.0s
[CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=20 
[CV]  n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=20, total=   0.0s
[CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=20 
[CV]  n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=20, total=   0.0s
[CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=20 
[CV]  n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=20, total=   0.0s
[CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=20 
[CV]  n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=20, total=   0.0s
[CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=20 
[CV]  n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=20, total=   0.0s
[CV] n_estimators=100, min_samples_split=6, min_samples_leaf=1, max_features=sqrt, max_depth=20 
[CV]  n_estimators=100, min_samples_split=6, min_samples_leaf=1, max_features=sqrt, max_depth=20, total=   0.1s
[CV] n_estimators=100, min_samples_split=6, min_samples_leaf=1, max_features=sqrt, max_depth=20 
[CV]  n_estimators=100, min_samples_split=6, min_samples_leaf=1, max_features=sqrt, max_depth=20, total=   0.1s
[CV] n_estimators=100, min_samples_split=6, min_samples_leaf=1, max_features=sqrt, max_depth=20 
[CV]  n_estimators=100, min_samples_split=6, min_samples_leaf=1, max_features=sqrt, max_depth=20, total=   0.1s
[CV] n_estimators=100, min_samples_split=6, min_samples_leaf=1, max_features=sqrt, max_depth=20 
[CV]  n_estimators=100, min_samples_split=6, min_samples_leaf=1, max_features=sqrt, max_depth=20, total=   0.1s
[CV] n_estimators=100, min_samples_split=6, min_samples_leaf=1, max_features=sqrt, max_depth=20 
[CV]  n_estimators=100, min_samples_split=6, min_samples_leaf=1, max_features=sqrt, max_depth=20, total=   0.1s
[CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=2, max_features=auto, max_depth=5 
[CV]  n_estimators=1000, min_samples_split=4, min_samples_leaf=2, max_features=auto, max_depth=5, total=   1.4s
[CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=2, max_features=auto, max_depth=5 
[CV]  n_estimators=1000, min_samples_split=4, min_samples_leaf=2, max_features=auto, max_depth=5, total=   1.5s
[CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=2, max_features=auto, max_depth=5 
[CV]  n_estimators=1000, min_samples_split=4, min_samples_leaf=2, max_features=auto, max_depth=5, total=   1.4s
[CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=2, max_features=auto, max_depth=5 
[CV]  n_estimators=1000, min_samples_split=4, min_samples_leaf=2, max_features=auto, max_depth=5, total=   1.9s
[CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=2, max_features=auto, max_depth=5 
[CV]  n_estimators=1000, min_samples_split=4, min_samples_leaf=2, max_features=auto, max_depth=5, total=   2.2s
[CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=1, max_features=auto, max_depth=None 
[CV]  n_estimators=1000, min_samples_split=4, min_samples_leaf=1, max_features=auto, max_depth=None, total=   2.8s
[CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=1, max_features=auto, max_depth=None 
[CV]  n_estimators=1000, min_samples_split=4, min_samples_leaf=1, max_features=auto, max_depth=None, total=   1.7s
[CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=1, max_features=auto, max_depth=None 
[CV]  n_estimators=1000, min_samples_split=4, min_samples_leaf=1, max_features=auto, max_depth=None, total=   1.6s
[CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=1, max_features=auto, max_depth=None 
[CV]  n_estimators=1000, min_samples_split=4, min_samples_leaf=1, max_features=auto, max_depth=None, total=   1.3s
[CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=1, max_features=auto, max_depth=None 
[CV]  n_estimators=1000, min_samples_split=4, min_samples_leaf=1, max_features=auto, max_depth=None, total=   2.0s
[CV] n_estimators=500, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10 
[CV]  n_estimators=500, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10, total=   0.6s
[CV] n_estimators=500, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10 
[CV]  n_estimators=500, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10, total=   0.6s
[CV] n_estimators=500, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10 
[CV]  n_estimators=500, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10, total=   1.1s
[CV] n_estimators=500, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10 
[CV]  n_estimators=500, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10, total=   0.7s
[CV] n_estimators=500, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10 
[CV]  n_estimators=500, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10, total=   0.6s
[CV] n_estimators=1200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=30 
[CV]  n_estimators=1200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=30, total=   1.4s
[CV] n_estimators=1200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=30 
[CV]  n_estimators=1200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=30, total=   1.4s
[CV] n_estimators=1200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=30 
[CV]  n_estimators=1200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=30, total=   1.3s
[CV] n_estimators=1200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=30 
[CV]  n_estimators=1200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=30, total=   1.3s
[CV] n_estimators=1200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=30 
[CV]  n_estimators=1200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=30, total=   1.3s
[CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=30 
[CV]  n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=30, total=   0.0s
[CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=30 
[CV]  n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=30, total=   0.0s
[CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=30 
[CV]  n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=30, total=   0.0s
[CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=30 
[CV]  n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=30, total=   0.0s
[CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=30 
[CV]  n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=30, total=   0.0s

[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:   31.0s finished
/Users/daniel/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklearn/model_selection/_search.py:814: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.
  DeprecationWarning)

{'n_estimators': 1000, 'min_samples_split': 4, 'min_samples_leaf': 2, 'max_features': 'auto', 'max_depth': 5}

0.819672131147541

6. Save and reload your trained model

You can save and load a model with pickle.

In [15]:

# Saving a model with pickle
import pickle

# Save an existing model to file
pickle.dump(rs_clf, open("rs_random_forest_model_1.pkl", "wb"))

In [16]:

# Load a saved pickle model
loaded_pickle_model = pickle.load(open("rs_random_forest_model_1.pkl", "rb"))

# Evaluate loaded model
loaded_pickle_model.score(X_test, y_test)

Out[16]:

0.819672131147541

You can do the same with joblib. joblib is usually more efficient with numerical data (what our models are).

In [17]:

# Saving a model with joblib
from joblib import dump, load

# Save a model to file
dump(rs_clf, filename="gs_random_forest_model_1.joblib")

Out[17]:

['gs_random_forest_model_1.joblib']

In [18]:

# Import a saved joblib model
loaded_joblib_model = load(filename="gs_random_forest_model_1.joblib")

In [19]:

# Evaluate joblib predictions 
loaded_joblib_model.score(X_test, y_test)

Out[19]:

0.819672131147541

7. Putting it all together (not pictured)

We can put a number of different Scikit-Learn functions together using Pipeline.

As an example, we'll use car-sales-extended-missing-data.csv. Which has missing data as well as non-numeric data. For a machine learning model to work, there can be no missing data or non-numeric values.

The problem we're solving here is predicting a cars sales price given a number of parameters about the car (a regression problem).

In [20]:

# Getting data ready
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Modelling
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV

# Setup random seed
import numpy as np
np.random.seed(42)

# Import data and drop the rows with missing labels
data = pd.read_csv("../data/car-sales-extended-missing-data.csv")
data.dropna(subset=["Price"], inplace=True)

# Define different features and transformer pipelines
categorical_features = ["Make", "Colour"]
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))])

door_feature = ["Doors"]
door_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value=4))])

numeric_features = ["Odometer (KM)"]
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean"))
])

# Setup preprocessing steps (fill missing values, then convert to numbers)
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", categorical_transformer, categorical_features),
        ("door", door_transformer, door_feature),
        ("num", numeric_transformer, numeric_features)])

# Create a preprocessing and modelling pipeline
model = Pipeline(steps=[("preprocessor", preprocessor),
                        ("model", RandomForestRegressor())])

# Split data
X = data.drop("Price", axis=1)
y = data["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Fit and score the model
model.fit(X_train, y_train)
model.score(X_test, y_test)

Out[20]:

/Users/daniel/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)

0.1821575815702311