Introduction to Scikit-Learn (sklearn)
This notebook demonstrates some of the most useful functions of the beautiful Scikit-Learn library.
What we're going to cover:
0. An end-to-end Scikit-Learn workflow
1. Getting our data ready to be used with machine learning
2. Choosing the right estimator/algorithm for your problem
3. Fit the model/algorithm on our data and use it to make predictions
4. Evaluating a machine learning model
5. Improving a model
6. Saving and loading trained machine learning models
7. Putting it all together!
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-10-7cea9660990e> in <module>
1 # make a prediction
----> 2 y_label = clf.predict(np.array([0, 2, 3, 4]))
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.8/site-packages/sklearn/ensemble/_forest.py in predict(self, X)
795 The predicted classes.
796 """
--> 797 proba = self.predict_proba(X)
798
799 if self.n_outputs_ == 1:
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.8/site-packages/sklearn/ensemble/_forest.py in predict_proba(self, X)
837 check_is_fitted(self)
838 # Check data
--> 839 X = self._validate_X_predict(X)
840
841 # Assign chunk of trees to jobs
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.8/site-packages/sklearn/ensemble/_forest.py in _validate_X_predict(self, X)
566 Validate X whenever one tries to predict, apply, predict_proba."""
567 check_is_fitted(self)
--> 568 X = self._validate_data(X, dtype=DTYPE, accept_sparse="csr", reset=False)
569 if issparse(X) and (X.indices.dtype != np.intc or X.indptr.dtype != np.intc):
570 raise ValueError("No support for np.int64 index based sparse matrices")
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.8/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
555 raise ValueError("Validation should be done on X, y or both.")
556 elif not no_val_X and no_val_y:
--> 557 X = check_array(X, **check_params)
558 out = X
559 elif no_val_X and not no_val_y:
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.8/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
759 # If input is 1D raise error
760 if array.ndim == 1:
--> 761 raise ValueError(
762 "Expected 2D array, got 1D array instead:\narray={}.\n"
763 "Reshape your data either using array.reshape(-1, 1) if "
ValueError: Expected 2D array, got 1D array instead:
array=[0. 2. 3. 4.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
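The fix is to give predict() a 2-dimensional array with one row per sample. A minimal sketch, using hypothetical stand-in data with the same four features:

import numpy as np
from sklearn.datasets import make_classification  # hypothetical stand-in data
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=4, random_state=42)
clf = RandomForestClassifier().fit(X, y)

# reshape(1, -1) turns the 4 values into 1 sample with 4 features
y_label = clf.predict(np.array([0, 2, 3, 4]).reshape(1, -1))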
1. Getting our data ready to be used with machine learning
Three main things we have to do:
1. Split the data into features and labels (usually X & y) - see the sketch below
2. Filling (also called imputing) or disregarding missing values
3. Converting non-numerical values to numerical values (also called feature encoding)
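A minimal sketch of step 1, assuming a hypothetical heart-disease CSV with a target column:

import pandas as pd
from sklearn.model_selection import train_test_split

heart_disease = pd.read_csv("heart-disease.csv")  # hypothetical file name

X = heart_disease.drop("target", axis=1)  # features
y = heart_disease["target"]               # labels

# hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)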
1.1 Make sure it's all numerical
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-34-2eeea2d0b490> in <module>
3
4 model = RandomForestRegressor()
----> 5 model.fit(X_train, y_train)
6 model.score(X_test, y_test)
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.8/site-packages/sklearn/ensemble/_forest.py in fit(self, X, y, sample_weight)
324 if issparse(y):
325 raise ValueError("sparse multilabel-indicator for y is not supported.")
--> 326 X, y = self._validate_data(
327 X, y, multi_output=True, accept_sparse="csc", dtype=DTYPE
328 )
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.8/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
570 y = check_array(y, **check_y_params)
571 else:
--> 572 X, y = check_X_y(X, y, **check_params)
573 out = X, y
574
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.8/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
954 raise ValueError("y cannot be None")
955
--> 956 X = check_array(
957 X,
958 accept_sparse=accept_sparse,
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.8/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
736 array = array.astype(dtype, casting="unsafe", copy=False)
737 else:
--> 738 array = np.asarray(array, order=order, dtype=dtype)
739 except ComplexWarning as complex_warning:
740 raise ValueError(
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.8/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
83
84 """
---> 85 return array(a, dtype, copy=False, order=order)
86
87
ValueError: could not convert string to float: 'Toyota'
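The error comes from string columns like the car make. One way to fix it is to encode categorical features as numbers before fitting; a hedged sketch using OneHotEncoder, with file and column names assumed from a hypothetical car-sales dataset:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

car_sales = pd.read_csv("car-sales-extended.csv")   # hypothetical file name
categorical_features = ["Make", "Colour", "Doors"]  # assumed column names

one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", one_hot, categorical_features)],
                                remainder="passthrough")  # leave other columns as-is
transformed_X = transformer.fit_transform(car_sales.drop("Price", axis=1))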
1.2 What if there were missing values?
Fill them with some value (also known as imputation).
Remove the samples with missing data altogether.
Option 1: Fill missing data with Pandas
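A sketch of the Pandas route, continuing with the hypothetical car_sales DataFrame from above (column names are assumptions):

# fill categorical gaps with a placeholder, numeric gaps with the column mean
car_sales["Make"] = car_sales["Make"].fillna("missing")
car_sales["Odometer (KM)"] = car_sales["Odometer (KM)"].fillna(
    car_sales["Odometer (KM)"].mean())

# rows missing the label can't be used for supervised learning, so drop them
car_sales = car_sales.dropna(subset=["Price"])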
Option 2: Filling missing data and transforming categorical data with Scikit-Learn
Note: This section is different to the video. The video shows filling and transforming the entire dataset (X) and although the techniques are correct, it's best to fill and transform training and test sets separately (as shown in the code below).
The main takeaways:
Split your data first (into train/test)
Fill/transform the training set and test sets separately
Thank you Robert for pointing this out.
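A hedged sketch of the Scikit-Learn route, fitting the imputers on the training set only (column names assumed as before):

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X = car_sales.drop("Price", axis=1)
y = car_sales["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

imputer = ColumnTransformer([
    ("cat_imputer", SimpleImputer(strategy="constant", fill_value="missing"),
     ["Make", "Colour"]),
    ("num_imputer", SimpleImputer(strategy="mean"), ["Odometer (KM)", "Doors"]),
])

# learn fill values from the training set, then apply them to both sets
filled_X_train = imputer.fit_transform(X_train)
filled_X_test = imputer.transform(X_test)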
Note: The transformed data has 50 fewer values because we dropped the 50 rows with missing values in the Price column.
2. Choosing the right estimator/algorithm for your problem
Some things to note:
Sklearn refers to machine learning models/algorithms as estimators.
Classification problem - predicting a category (heart disease or not)
Sometimes you'll see clf (short for classifier) used as a classification estimator
Regression problem - predicting a number (selling price of a car)
If you're working on a machine learning problem, looking to use Sklearn and not sure which model you should use, refer to the Sklearn machine learning map: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
2.1 Picking a machine learning model for a regression problem
Let's use the California Housing dataset - https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html
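A minimal sketch of fetching the data and trying Ridge:

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = Ridge()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # for regressors, score() returns R^2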
What if Ridge didn't work or the score didn't fit our needs?
Well, we could always try a different model...
How about we try an ensemble model (an ensemble is a combination of smaller models that tries to make better predictions than a single model)?
Sklearn's ensemble models can be found here: https://scikit-learn.org/stable/modules/ensemble.html
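For example, swapping Ridge for RandomForestRegressor (reusing the split above); it often scores higher here, though results vary:

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))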
2.2 Picking a machine learning model for a classification problem
Let's go to the map... https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
Consulting the map, it says to try LinearSVC.
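A sketch, assuming the hypothetical heart_disease DataFrame from section 1 (LinearSVC can need a higher max_iter to converge on some data):

from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = LinearSVC(max_iter=10000)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # for classifiers, score() returns mean accuracy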
Tidbit:
3. Fit the model/algorithm on our data and use it to make predictions
3.1 Fitting the model to the data
Different names for:
X = features, feature variables, data
y = labels, targets, target variables
Random Forest model deep dive
These resources will help you understand what's happening inside the Random Forest models we've been using.
3.2 Make predictions using a machine learning model
2 ways to make predictions:
predict()
predict_proba()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-82-5908053f578c> in <module>
1 # Use a trained model to make predictions
----> 2 clf.predict(np.array([1, 7, 8, 3, 4])) # this doesn't work...
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.8/site-packages/sklearn/ensemble/_forest.py in predict(self, X)
795 The predicted classes.
796 """
--> 797 proba = self.predict_proba(X)
798
799 if self.n_outputs_ == 1:
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.8/site-packages/sklearn/ensemble/_forest.py in predict_proba(self, X)
837 check_is_fitted(self)
838 # Check data
--> 839 X = self._validate_X_predict(X)
840
841 # Assign chunk of trees to jobs
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.8/site-packages/sklearn/ensemble/_forest.py in _validate_X_predict(self, X)
566 Validate X whenever one tries to predict, apply, predict_proba."""
567 check_is_fitted(self)
--> 568 X = self._validate_data(X, dtype=DTYPE, accept_sparse="csr", reset=False)
569 if issparse(X) and (X.indices.dtype != np.intc or X.indptr.dtype != np.intc):
570 raise ValueError("No support for np.int64 index based sparse matrices")
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.8/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
555 raise ValueError("Validation should be done on X, y or both.")
556 elif not no_val_X and no_val_y:
--> 557 X = check_array(X, **check_params)
558 out = X
559 elif no_val_X and not no_val_y:
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.8/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
759 # If input is 1D raise error
760 if array.ndim == 1:
--> 761 raise ValueError(
762 "Expected 2D array, got 1D array instead:\narray={}.\n"
763 "Reshape your data either using array.reshape(-1, 1) if "
ValueError: Expected 2D array, got 1D array instead:
array=[1. 7. 8. 3. 4.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Make predictions with predict_proba() - use this if someone asks you "what's the probability your model is assigning to each prediction?"
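A quick sketch. Note LinearSVC doesn't implement predict_proba, so this assumes clf was refit as a RandomForestClassifier:

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier().fit(X_train, y_train)

# each row is one sample; columns are class probabilities, ordered as clf.classes_
print(clf.predict_proba(X_test[:5]))
print(clf.predict(X_test[:5]))  # the same predictions as hard labels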
predict() can also be used for regression models.
4. Evaluating a machine learning model
Three ways to evaluate Scikit-Learn models/estimators:
1. Estimator's built-in score() method
2. The scoring parameter
3. Problem-specific metric functions
You can read more about these here: https://scikit-learn.org/stable/modules/model_evaluation.html
4.1 Evaluating a model with the score() method
Let's use the score() method on our regression problem...
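A one-liner, assuming the fitted housing model and split from section 2.1:

model.score(X_test, y_test)  # the default score() for regressors is R^2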
4.2 Evaluating a model using the scoring parameter
4.2.1 Classification model evaluation metrics
Accuracy
Area under ROC curve
Confusion matrix
Classification report
Accuracy
Area under the receiver operating characteristic curve (AUC/ROC)
Area under curve (AUC)
ROC curve
ROC curves are a comparison of a model's true positive rate (TPR) versus a model's false positive rate (FPR).
True positive = model predicts 1 when truth is 1
False positive = model predicts 1 when truth is 0
True negative = model predicts 0 when truth is 0
False negative = model predicts 0 when truth is 1
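A sketch of computing the curve and the AUC with sklearn.metrics, assuming the fitted clf and the classification test split:

from sklearn.metrics import roc_curve, roc_auc_score

y_probs = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, y_probs)
print(roc_auc_score(y_test, y_probs))  # 1.0 = perfect, 0.5 = no better than guessing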
Confusion matrix
The next way to evaluate a classification model is by using a confusion matrix.
A confusion matrix is a quick way to compare the labels a model predicts and the actual labels it was supposed to predict. In essence, giving you an idea of where the model is getting confused.
Again, this is probably easier visualized.
One way to do it is with pd.crosstab().
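For example, assuming y_test and predictions from clf:

import pandas as pd

y_preds = clf.predict(X_test)
pd.crosstab(y_test,
            y_preds,
            rownames=["Actual Labels"],
            colnames=["Predicted Labels"])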
Creating a confusion matrix using Scikit-Learn
Scikit-Learn has multiple different implementations of plotting confusion matrices:
sklearn.metrics.ConfusionMatrixDisplay.from_estimator(estimator, X, y) - this takes a fitted estimator (like our clf model), features (X) and labels (y), then uses the trained estimator to make predictions on X and compares the predictions to y by displaying a confusion matrix.
sklearn.metrics.ConfusionMatrixDisplay.from_predictions(y_true, y_pred) - this takes truth labels and predicted labels and compares them by displaying a confusion matrix.
Note: Both of these methods/classes require Scikit-Learn 1.0+. To check your version of Scikit-Learn run:
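import sklearn
print(sklearn.__version__)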
If you don't have 1.0+, you can upgrade at: https://scikit-learn.org/stable/install.html
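A sketch of both, assuming the fitted clf and test split from above:

from sklearn.metrics import ConfusionMatrixDisplay

# option 1: the estimator makes the predictions for you
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)

# option 2: pass predictions you've already made
ConfusionMatrixDisplay.from_predictions(y_test, y_preds)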
Classification Report
To summarize classification metrics:
Accuracy is a good measure to start with if all classes are balanced (e.g. the same number of samples labelled 0 or 1).
Precision and recall become more important when classes are imbalanced.
If false positive predictions are worse than false negatives, aim for higher precision.
If false negative predictions are worse than false positives, aim for higher recall.
F1-score is a combination of precision and recall.
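All of these come out of classification_report; a quick sketch:

from sklearn.metrics import classification_report

print(classification_report(y_test, y_preds))  # per-class precision, recall, f1-score and support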
4.2.2 Regression model evaluation metrics
Model evaluation metrics documentation - https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics
The ones we're going to cover are:
R^2 (pronounced r-squared) or coefficient of determination
Mean absolute error (MAE)
Mean squared error (MSE)
R^2
What R-squared does: compares your model's predictions to the mean of the targets. Values can range from negative infinity (a very poor model) to 1. For example, if all your model does is predict the mean of the targets, its R^2 value would be 0. And if your model perfectly predicts a range of numbers, its R^2 value would be 1.
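A small sketch demonstrating both ends, assuming y_test holds the regression targets from section 2.1:

import numpy as np
from sklearn.metrics import r2_score

y_test_mean = np.full(len(y_test), y_test.mean())
print(r2_score(y_test, y_test_mean))  # 0.0 - predicting the mean scores 0
print(r2_score(y_test, y_test))       # 1.0 - perfect predictions score 1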
Mean absolute error (MAE)
MAE is the average of the absolute differences between predictions and actual values.
It gives you an idea of how wrong your model's predictions are.
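A sketch, assuming predictions from the fitted housing model:

from sklearn.metrics import mean_absolute_error

y_preds = model.predict(X_test)
print(mean_absolute_error(y_test, y_preds))  # average absolute gap between truth and prediction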
Mean squared error (MSE)
MSE is the mean of the square of the errors between actual and predicted values.
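Because the errors are squared, MSE punishes large errors (outliers) more heavily than MAE. The matching call:

from sklearn.metrics import mean_squared_error

print(mean_squared_error(y_test, y_preds))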
4.2.3 Finally using the scoring parameter
Let's see the scoring parameter being used for a regression problem...
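A sketch with cross_val_score, reusing the housing X and y (loss-based scoring strings are negated so that higher is always better):

from sklearn.model_selection import cross_val_score

print(cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error"))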
4.3 Using different evaluation metrics as Scikit-Learn functions
The third way to evaluate Scikit-Learn machine learning models/estimators is to use the sklearn.metrics module - https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
5. Improving a model
First predictions = baseline predictions. First model = baseline model.
From a data perspective:
Could we collect more data? (generally, the more data, the better)
Could we improve our data?
From a model perspective:
Is there a better model we could use?
Could we improve the current model?
Hyperparameters vs. Parameters
Parameters = patterns a model finds in data
Hyperparameters = settings on a model you can adjust to (potentially) improve its ability to find patterns
Three ways to adjust hyperparameters:
By hand
Randomly with RandomizedSearchCV
Exhaustively with GridSearchCV
5.1 Tuning hyperparameters by hand
Let's make 3 sets, training, validation and test.
We're going to try and adjust:
max_depth
max_features
min_samples_leaf
min_samples_split
n_estimators
5.2 Hyperparameter tuning with RandomizedSearchCV
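A hedged sketch over the hyperparameters listed in 5.1 (the grid values are illustrative, not recommendations):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

grid = {"n_estimators": [10, 100, 200, 500],
        "max_depth": [None, 5, 10, 20],
        "max_features": ["sqrt", "log2"],
        "min_samples_split": [2, 4, 6],
        "min_samples_leaf": [1, 2, 4]}

rs_clf = RandomizedSearchCV(estimator=RandomForestClassifier(),
                            param_distributions=grid,
                            n_iter=10,  # try 10 random combinations
                            cv=5,
                            verbose=2)
rs_clf.fit(X_train, y_train)
print(rs_clf.best_params_)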
5.3 Hyperparameter tuning with GridSearchCV
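GridSearchCV has the same interface but exhaustively tries every combination, so keep the grid small; a compact sketch:

from sklearn.model_selection import GridSearchCV

gs_clf = GridSearchCV(estimator=RandomForestClassifier(),
                      param_grid={"n_estimators": [100, 200],
                                  "max_depth": [None, 10]},
                      cv=5,
                      verbose=2)
gs_clf.fit(X_train, y_train)
print(gs_clf.best_params_)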
Let's compare our different models' metrics.
6. Saving and loading trained machine learning models
Two ways to save and load machine learning models:
With Python's pickle module
With the joblib module
Pickle
Joblib
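A sketch of both, saving the tuned model from section 5 (file names are arbitrary):

import pickle
from joblib import dump, load

# pickle
with open("gs_random_forest_model_1.pkl", "wb") as f:
    pickle.dump(gs_clf, f)
with open("gs_random_forest_model_1.pkl", "rb") as f:
    loaded_pickle_model = pickle.load(f)

# joblib - often preferred for models that carry large NumPy arrays
dump(gs_clf, "gs_random_forest_model_1.joblib")
loaded_joblib_model = load("gs_random_forest_model_1.joblib")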
7. Putting it all together!
Steps we want to do (all in one cell):
Fill missing data
Convert data to numbers
Build a model on the data
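A hedged sketch of all three steps in one Pipeline, on the hypothetical car-sales data (file and column names are assumptions):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

data = pd.read_csv("car-sales-extended-missing-data.csv")  # hypothetical file name
data = data.dropna(subset=["Price"])  # drop rows missing the label

# fill + encode categorical columns, fill numeric columns
categorical = Pipeline([("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
                        ("onehot", OneHotEncoder(handle_unknown="ignore"))])
numeric = Pipeline([("imputer", SimpleImputer(strategy="mean"))])

preprocessor = ColumnTransformer([("cat", categorical, ["Make", "Colour", "Doors"]),
                                  ("num", numeric, ["Odometer (KM)"])])

model = Pipeline([("preprocessor", preprocessor),
                  ("model", RandomForestRegressor())])

X = data.drop("Price", axis=1)
y = data["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))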
It's also possible to use GridSearchCV or RandomizedSearchCV with our Pipeline, as sketched below.
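To tune a step inside the Pipeline, prefix its hyperparameter with the step names joined by double underscores; a short sketch continuing from above:

from sklearn.model_selection import GridSearchCV

pipe_grid = {"preprocessor__num__imputer__strategy": ["mean", "median"],
             "model__n_estimators": [100, 200]}

gs_model = GridSearchCV(model, pipe_grid, cv=5, verbose=2)
gs_model.fit(X_train, y_train)
print(gs_model.score(X_test, y_test))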