Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
mrdbourke
GitHub Repository: mrdbourke/zero-to-mastery-ml
Path: blob/master/section-2-data-science-and-ml-tools/scikit-learn-what-were-covering.ipynb
874 views
Kernel: Python 3

What we're covering in the Scikit-Learn Introduction

This notebook outlines the content convered in the Scikit-Learn Introduction.

It's a quick stop to see all the Scikit-Learn functions and modules for each section outlined.

What we're covering follows the following diagram detailing a Scikit-Learn workflow.

0. Standard library imports

For all machine learning projects, you'll often see these libraries (Matplotlib, NumPy and pandas) imported at the top.

%matplotlib inline import matplotlib.pyplot as plt import numpy as np import pandas as pd

We'll use 2 datasets for demonstration purposes.

  • heart_disease - a classification dataset (predicting whether someone has heart disease or not)

  • boston_df - a regression dataset (predicting the median house prices of cities in Boston)

# Classification data heart_disease = pd.read_csv("../data/heart-disease.csv") # Regression data from sklearn.datasets import load_boston boston = load_boston() # loads as dictionary # Convert dictionary to dataframe boston_df = pd.DataFrame(boston["data"], columns=boston["feature_names"]) boston_df["target"] = pd.Series(boston["target"])

1. Get the data ready

# Split data into X & y X = heart_disease.drop("target", axis=1) # use all columns except target y = heart_disease["target"] # we want to predict y using X
# Split the data into training and test sets from sklearn.model_selection import train_test_split # Example use case (requires X & y) X_train, X_test, y_train, y_test = train_test_split(X, y)

2. Pick a model/estimator (to suit your problem)

To pick a model we use the Scikit-Learn machine learning map.

Note: Scikit-Learn refers to machine learning models and algorithms as estimators.

# Random Forest Classifier (for classification problems) from sklearn.ensemble import RandomForestClassifier # Instantiating a Random Forest Classifier (clf short for classifier) clf = RandomForestClassifier()
# Random Forest Regressor (for regression problems) from sklearn.ensemble import RandomForestRegressor # Instantiating a Random Forest Regressor model = RandomForestRegressor()

3. Fit the model to the data and make a prediction

# All models/estimators have the fit() function built-in clf.fit(X_train, y_train) # Once fit is called, you can make predictions using predict() y_preds = clf.predict(X_test) # You can also predict with probabilities (on classification models) y_probs = clf.predict_proba(X_test) # View preds/probabilities y_preds, y_probs
/Users/daniel/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22. "10 in version 0.20 to 100 in 0.22.", FutureWarning)
(array([0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0]), array([[0.5, 0.5], [0.2, 0.8], [0.4, 0.6], [0.5, 0.5], [0.8, 0.2], [0.9, 0.1], [0.5, 0.5], [0.9, 0.1], [0.2, 0.8], [1. , 0. ], [0.5, 0.5], [0.8, 0.2], [0.5, 0.5], [0.5, 0.5], [0.2, 0.8], [0.5, 0.5], [0.6, 0.4], [0.2, 0.8], [0.4, 0.6], [1. , 0. ], [0.5, 0.5], [0.5, 0.5], [0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.4, 0.6], [0.5, 0.5], [0.4, 0.6], [0.9, 0.1], [0.7, 0.3], [0.7, 0.3], [0.2, 0.8], [1. , 0. ], [0.1, 0.9], [0.6, 0.4], [0.8, 0.2], [1. , 0. ], [0.1, 0.9], [1. , 0. ], [0.5, 0.5], [0.6, 0.4], [1. , 0. ], [0.3, 0.7], [0. , 1. ], [0.9, 0.1], [0.6, 0.4], [0. , 1. ], [0.3, 0.7], [0.8, 0.2], [0.6, 0.4], [0.4, 0.6], [0.6, 0.4], [0.2, 0.8], [0.4, 0.6], [0. , 1. ], [0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0. , 1. ], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9], [0.3, 0.7], [1. , 0. ], [0.9, 0.1], [0.6, 0.4], [0.3, 0.7], [0.3, 0.7], [0. , 1. ], [0.3, 0.7], [0.1, 0.9], [0.6, 0.4], [0. , 1. ], [0.7, 0.3], [0. , 1. ], [1. , 0. ]]))

4. Evaluate the model

Every Scikit-Learn model has a default metric which is accessible through the score() function.

However there are a range of different evaluation metrics you can use depending on the model you're using.

A full list of evaluation metrics can be found in the documentation.

# All models/estimators have a score() function clf.score(X_test, y_test)
0.8026315789473685
# Evaluting a model using cross-validation is possible with cross_val_score from sklearn.model_selection import cross_val_score # scoring=None means default score() metric is used print(cross_val_score(estimator=clf, X=X, y=y, cv=5, # use 5-fold cross-validation scoring=None)) # Evaluate a model with a different scoring method print(cross_val_score(estimator=clf, X=X, y=y, cv=5, # use 5-fold cross-validation scoring="precision"))
[0.78688525 0.86885246 0.7704918 0.78333333 0.81666667] [0.8 0.92592593 0.85185185 0.83870968 0.75 ]
# Different classification metrics # Accuracy from sklearn.metrics import accuracy_score print(accuracy_score(y_test, y_preds)) # Reciver Operating Characteristic (ROC curve)/Area under curve (AUC) from sklearn.metrics import roc_curve, roc_auc_score false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_probs[:, 1]) print(roc_auc_score(y_test, y_preds)) # Confusion matrix from sklearn.metrics import confusion_matrix print(confusion_matrix(y_test, y_preds)) # Classification report from sklearn.metrics import classification_report print(classification_report(y_test, y_preds))
0.8026315789473685 0.804920304920305 [[33 4] [11 28]] precision recall f1-score support 0 0.75 0.89 0.81 37 1 0.88 0.72 0.79 39 accuracy 0.80 76 macro avg 0.81 0.80 0.80 76 weighted avg 0.81 0.80 0.80 76
# Different regression metrics # Make predictions first X = boston_df.drop("target", axis=1) y = boston_df["target"] X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) model = RandomForestRegressor() model.fit(X_train, y_train) y_preds = model.predict(X_test) # R^2 (pronounced r-squared) or coefficient of determination from sklearn.metrics import r2_score print(r2_score(y_test, y_preds)) # Mean absolute error (MAE) from sklearn.metrics import mean_absolute_error print(mean_absolute_error(y_test, y_preds)) # Mean square error (MSE) from sklearn.metrics import mean_squared_error print(mean_squared_error(y_test, y_preds))
0.8987155770408454 1.9618627450980388 7.75367352941176
/Users/daniel/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22. "10 in version 0.20 to 100 in 0.22.", FutureWarning)

5. Improve through experimentation

Two of the main methods to improve a models baseline metrics (the first evaluation metrics you get).

From a data perspective asks:

  • Could we collect more data? In machine learning, more data is generally better, as it gives a model more opportunities to learn patterns.

  • Could we improve our data? This could mean filling in misisng values or finding a better encoding (turning things into numbers) strategy.

From a model perspective asks:

  • Is there a better model we could use? If you've started out with a simple model, could you use a more complex one? (we saw an example of this when looking at the Scikit-Learn machine learning map, ensemble methods are generally considered more complex models)

  • Could we improve the current model? If the model you're using performs well straight out of the box, can the hyperparameters be tuned to make it even better?

Hyperparameters are like settings on a model you can adjust so some of the ways it uses to find patterns are altered and potentially improved. Adjusting hyperparameters is referred to as hyperparameter tuning.

# How to find a model's hyperparameters clf = RandomForestClassifier() clf.get_params() # returns a list of adjustable hyperparameters
{'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 'warn', 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}
# Example of adjusting hyperparameters by hand # Split data into X & y X = heart_disease.drop("target", axis=1) # use all columns except target y = heart_disease["target"] # we want to predict y using X # Split data into train and test sets X_train, X_test, y_train, y_test = train_test_split(X, y) # Instantiate two models with different settings clf_1 = RandomForestClassifier(n_estimators=100) clf_2 = RandomForestClassifier(n_estimators=200) # Fit both models on training data clf_1.fit(X_train, y_train) clf_2.fit(X_train, y_train) # Evaluate both models on test data and see which is best print(clf_1.score(X_test, y_test)) print(clf_2.score(X_test, y_test))
0.868421052631579 0.8552631578947368
# Example of adjusting hyperparameters computationally (recommended) from sklearn.model_selection import RandomizedSearchCV # Define a grid of hyperparameters grid = {"n_estimators": [10, 100, 200, 500, 1000, 1200], "max_depth": [None, 5, 10, 20, 30], "max_features": ["auto", "sqrt"], "min_samples_split": [2, 4, 6], "min_samples_leaf": [1, 2, 4]} # Split into train and test sets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) # Set n_jobs to -1 to use all cores (NOTE: n_jobs=-1 is broken as of 8 Dec 2019, using n_jobs=1 works) clf = RandomForestClassifier(n_jobs=1) # Setup RandomizedSearchCV rs_clf = RandomizedSearchCV(estimator=clf, param_distributions=grid, n_iter=10, # try 10 models total cv=5, # 5-fold cross-validation verbose=2) # print out results # Fit the RandomizedSearchCV version of clf rs_clf.fit(X_train, y_train); # Find the best hyperparameters print(rs_clf.best_params_) # Scoring automatically uses the best hyperparameters rs_clf.score(X_test, y_test)
Fitting 5 folds for each of 10 candidates, totalling 50 fits [CV] n_estimators=100, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10 [CV] n_estimators=100, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10, total= 0.2s [CV] n_estimators=100, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers. [Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.2s remaining: 0.0s
[CV] n_estimators=100, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10, total= 0.2s [CV] n_estimators=100, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10 [CV] n_estimators=100, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10, total= 0.1s [CV] n_estimators=100, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10 [CV] n_estimators=100, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10, total= 0.1s [CV] n_estimators=100, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10 [CV] n_estimators=100, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10, total= 0.2s [CV] n_estimators=200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=None [CV] n_estimators=200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=None, total= 0.2s [CV] n_estimators=200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=None [CV] n_estimators=200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=None, total= 0.3s [CV] n_estimators=200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=None [CV] n_estimators=200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=None, total= 0.3s [CV] n_estimators=200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=None [CV] n_estimators=200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=None, total= 0.3s [CV] n_estimators=200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=None [CV] n_estimators=200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=None, total= 0.2s [CV] n_estimators=10, min_samples_split=4, min_samples_leaf=4, max_features=sqrt, max_depth=20 [CV] n_estimators=10, min_samples_split=4, min_samples_leaf=4, max_features=sqrt, max_depth=20, total= 0.0s [CV] n_estimators=10, min_samples_split=4, min_samples_leaf=4, max_features=sqrt, max_depth=20 [CV] n_estimators=10, min_samples_split=4, min_samples_leaf=4, max_features=sqrt, max_depth=20, total= 0.0s [CV] n_estimators=10, min_samples_split=4, min_samples_leaf=4, max_features=sqrt, max_depth=20 [CV] n_estimators=10, min_samples_split=4, min_samples_leaf=4, max_features=sqrt, max_depth=20, total= 0.0s [CV] n_estimators=10, min_samples_split=4, min_samples_leaf=4, max_features=sqrt, max_depth=20 [CV] n_estimators=10, min_samples_split=4, min_samples_leaf=4, max_features=sqrt, max_depth=20, total= 0.0s [CV] n_estimators=10, min_samples_split=4, min_samples_leaf=4, max_features=sqrt, max_depth=20 [CV] n_estimators=10, min_samples_split=4, min_samples_leaf=4, max_features=sqrt, max_depth=20, total= 0.0s [CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=20 [CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=20, total= 0.0s [CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=20 [CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=20, total= 0.0s [CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=20 [CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=20, total= 0.0s [CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=20 [CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=20, total= 0.0s [CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=20 [CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=20, total= 0.0s [CV] n_estimators=100, min_samples_split=6, min_samples_leaf=1, max_features=sqrt, max_depth=20 [CV] n_estimators=100, min_samples_split=6, min_samples_leaf=1, max_features=sqrt, max_depth=20, total= 0.1s [CV] n_estimators=100, min_samples_split=6, min_samples_leaf=1, max_features=sqrt, max_depth=20 [CV] n_estimators=100, min_samples_split=6, min_samples_leaf=1, max_features=sqrt, max_depth=20, total= 0.1s [CV] n_estimators=100, min_samples_split=6, min_samples_leaf=1, max_features=sqrt, max_depth=20 [CV] n_estimators=100, min_samples_split=6, min_samples_leaf=1, max_features=sqrt, max_depth=20, total= 0.1s [CV] n_estimators=100, min_samples_split=6, min_samples_leaf=1, max_features=sqrt, max_depth=20 [CV] n_estimators=100, min_samples_split=6, min_samples_leaf=1, max_features=sqrt, max_depth=20, total= 0.1s [CV] n_estimators=100, min_samples_split=6, min_samples_leaf=1, max_features=sqrt, max_depth=20 [CV] n_estimators=100, min_samples_split=6, min_samples_leaf=1, max_features=sqrt, max_depth=20, total= 0.1s [CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=2, max_features=auto, max_depth=5 [CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=2, max_features=auto, max_depth=5, total= 1.4s [CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=2, max_features=auto, max_depth=5 [CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=2, max_features=auto, max_depth=5, total= 1.5s [CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=2, max_features=auto, max_depth=5 [CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=2, max_features=auto, max_depth=5, total= 1.4s [CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=2, max_features=auto, max_depth=5 [CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=2, max_features=auto, max_depth=5, total= 1.9s [CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=2, max_features=auto, max_depth=5 [CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=2, max_features=auto, max_depth=5, total= 2.2s [CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=1, max_features=auto, max_depth=None [CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=1, max_features=auto, max_depth=None, total= 2.8s [CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=1, max_features=auto, max_depth=None [CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=1, max_features=auto, max_depth=None, total= 1.7s [CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=1, max_features=auto, max_depth=None [CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=1, max_features=auto, max_depth=None, total= 1.6s [CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=1, max_features=auto, max_depth=None [CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=1, max_features=auto, max_depth=None, total= 1.3s [CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=1, max_features=auto, max_depth=None [CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=1, max_features=auto, max_depth=None, total= 2.0s [CV] n_estimators=500, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10 [CV] n_estimators=500, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10, total= 0.6s [CV] n_estimators=500, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10 [CV] n_estimators=500, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10, total= 0.6s [CV] n_estimators=500, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10 [CV] n_estimators=500, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10, total= 1.1s [CV] n_estimators=500, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10 [CV] n_estimators=500, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10, total= 0.7s [CV] n_estimators=500, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10 [CV] n_estimators=500, min_samples_split=2, min_samples_leaf=2, max_features=sqrt, max_depth=10, total= 0.6s [CV] n_estimators=1200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=30 [CV] n_estimators=1200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=30, total= 1.4s [CV] n_estimators=1200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=30 [CV] n_estimators=1200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=30, total= 1.4s [CV] n_estimators=1200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=30 [CV] n_estimators=1200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=30, total= 1.3s [CV] n_estimators=1200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=30 [CV] n_estimators=1200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=30, total= 1.3s [CV] n_estimators=1200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=30 [CV] n_estimators=1200, min_samples_split=6, min_samples_leaf=4, max_features=auto, max_depth=30, total= 1.3s [CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=30 [CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=30, total= 0.0s [CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=30 [CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=30, total= 0.0s [CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=30 [CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=30, total= 0.0s [CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=30 [CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=30, total= 0.0s [CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=30 [CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4, max_features=auto, max_depth=30, total= 0.0s
[Parallel(n_jobs=1)]: Done 50 out of 50 | elapsed: 31.0s finished /Users/daniel/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklearn/model_selection/_search.py:814: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal. DeprecationWarning)
{'n_estimators': 1000, 'min_samples_split': 4, 'min_samples_leaf': 2, 'max_features': 'auto', 'max_depth': 5}
0.819672131147541

6. Save and reload your trained model

You can save and load a model with pickle.

# Saving a model with pickle import pickle # Save an existing model to file pickle.dump(rs_clf, open("rs_random_forest_model_1.pkl", "wb"))
# Load a saved pickle model loaded_pickle_model = pickle.load(open("rs_random_forest_model_1.pkl", "rb")) # Evaluate loaded model loaded_pickle_model.score(X_test, y_test)
0.819672131147541

You can do the same with joblib. joblib is usually more efficient with numerical data (what our models are).

# Saving a model with joblib from joblib import dump, load # Save a model to file dump(rs_clf, filename="gs_random_forest_model_1.joblib")
['gs_random_forest_model_1.joblib']
# Import a saved joblib model loaded_joblib_model = load(filename="gs_random_forest_model_1.joblib")
# Evaluate joblib predictions loaded_joblib_model.score(X_test, y_test)
0.819672131147541

7. Putting it all together (not pictured)

We can put a number of different Scikit-Learn functions together using Pipeline.

As an example, we'll use car-sales-extended-missing-data.csv. Which has missing data as well as non-numeric data. For a machine learning model to work, there can be no missing data or non-numeric values.

The problem we're solving here is predicting a cars sales price given a number of parameters about the car (a regression problem).

# Getting data ready import pandas as pd from sklearn.compose import ColumnTransformer from sklearn.pipeline import Pipeline from sklearn.impute import SimpleImputer from sklearn.preprocessing import OneHotEncoder # Modelling from sklearn.ensemble import RandomForestRegressor from sklearn.model_selection import train_test_split, GridSearchCV # Setup random seed import numpy as np np.random.seed(42) # Import data and drop the rows with missing labels data = pd.read_csv("../data/car-sales-extended-missing-data.csv") data.dropna(subset=["Price"], inplace=True) # Define different features and transformer pipelines categorical_features = ["Make", "Colour"] categorical_transformer = Pipeline(steps=[ ("imputer", SimpleImputer(strategy="constant", fill_value="missing")), ("onehot", OneHotEncoder(handle_unknown="ignore"))]) door_feature = ["Doors"] door_transformer = Pipeline(steps=[ ("imputer", SimpleImputer(strategy="constant", fill_value=4))]) numeric_features = ["Odometer (KM)"] numeric_transformer = Pipeline(steps=[ ("imputer", SimpleImputer(strategy="mean")) ]) # Setup preprocessing steps (fill missing values, then convert to numbers) preprocessor = ColumnTransformer( transformers=[ ("cat", categorical_transformer, categorical_features), ("door", door_transformer, door_feature), ("num", numeric_transformer, numeric_features)]) # Create a preprocessing and modelling pipeline model = Pipeline(steps=[("preprocessor", preprocessor), ("model", RandomForestRegressor())]) # Split data X = data.drop("Price", axis=1) y = data["Price"] X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) # Fit and score the model model.fit(X_train, y_train) model.score(X_test, y_test)
/Users/daniel/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22. "10 in version 0.20 to 100 in 0.22.", FutureWarning)
0.1821575815702311