Introduction to Scikit-Learn (sklearn)
This notebook demonstrates some of the most useful functions of the beautiful Scikit-Learn library.
What we're going to cover:
0. An end-to-end Scikit-Learn workflow
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-10-7cea9660990e> in <module>
1 # make a prediction
----> 2 y_label = clf.predict(np.array([0, 2, 3, 4]))
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklearn/ensemble/_forest.py in predict(self, X)
610 The predicted classes.
611 """
--> 612 proba = self.predict_proba(X)
613
614 if self.n_outputs_ == 1:
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklearn/ensemble/_forest.py in predict_proba(self, X)
654 check_is_fitted(self)
655 # Check data
--> 656 X = self._validate_X_predict(X)
657
658 # Assign chunk of trees to jobs
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklearn/ensemble/_forest.py in _validate_X_predict(self, X)
410 check_is_fitted(self)
411
--> 412 return self.estimators_[0]._validate_X_predict(X, check_input=True)
413
414 @property
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklearn/tree/_classes.py in _validate_X_predict(self, X, check_input)
378 """Validate X whenever one tries to predict, apply, predict_proba"""
379 if check_input:
--> 380 X = check_array(X, dtype=DTYPE, accept_sparse="csr")
381 if issparse(X) and (X.indices.dtype != np.intc or
382 X.indptr.dtype != np.intc):
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
554 "Reshape your data either using array.reshape(-1, 1) if "
555 "your data has a single feature or array.reshape(1, -1) "
--> 556 "if it contains a single sample.".format(array))
557
558 # in the future np.flexible dtypes will be handled like object dtypes
ValueError: Expected 2D array, got 1D array instead:
array=[0. 2. 3. 4.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
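The traceback above comes from passing a 1D array to predict(), which expects a 2D array of shape (n_samples, n_features). A minimal sketch of an end-to-end workflow follows, assuming synthetic data from make_classification in place of the notebook's dataset; the four-feature shape and the variable names are illustrative, not the original code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 200 samples, 4 features
X, y = make_classification(n_samples=200, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit a classifier on the training split
clf = RandomForestClassifier(n_estimators=20, random_state=42)
clf.fit(X_train, y_train)

# A single sample must be reshaped to (1, n_features) before predicting
single_sample = np.array([0, 2, 3, 4]).reshape(1, -1)
y_label = clf.predict(single_sample)

# Evaluate on the held-out test split (mean accuracy)
test_accuracy = clf.score(X_test, y_test)
```

reshape(1, -1) is the option the error message suggests for a single sample; reshape(-1, 1) would instead describe many samples with one feature each.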
1. Getting our data ready to be used with machine learning
Three main things we have to do:
1. Split the data into features and labels (usually X & y)
2. Fill (also called imputing) or disregard missing values
3. Convert non-numerical values to numerical values (also called feature encoding)
1.1 Make sure it's all numerical
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-34-2eeea2d0b490> in <module>
3
4 model = RandomForestRegressor()
----> 5 model.fit(X_train, y_train)
6 model.score(X_test, y_test)
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklearn/ensemble/_forest.py in fit(self, X, y, sample_weight)
293 """
294 # Validate or convert input data
--> 295 X = check_array(X, accept_sparse="csc", dtype=DTYPE)
296 y = check_array(y, accept_sparse='csc', ensure_2d=False, dtype=None)
297 if sample_weight is not None:
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
529 array = array.astype(dtype, casting="unsafe", copy=False)
530 else:
--> 531 array = np.asarray(array, order=order, dtype=dtype)
532 except ComplexWarning:
533 raise ValueError("Complex data not supported\n"
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
83
84 """
---> 85 return array(a, dtype, copy=False, order=order)
86
87
ValueError: could not convert string to float: 'Toyota'
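The error above is raised because the model receives strings such as 'Toyota' where it expects numbers. One common way around it is one-hot encoding the categorical columns. The sketch below assumes a small made-up DataFrame; the column names and values are hypothetical, not the notebook's data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; column names are assumptions for illustration
car_sales = pd.DataFrame({
    "Make": ["Toyota", "Honda", "Toyota", "BMW"],
    "Colour": ["White", "Red", "Blue", "White"],
    "Odometer": [35431, 192714, 84714, 154365],
})

categorical_features = ["Make", "Colour"]
transformer = ColumnTransformer(
    [("one_hot", OneHotEncoder(), categorical_features)],
    remainder="passthrough",  # keep the numeric columns unchanged
    sparse_threshold=0,       # always return a dense array
)

# 3 Make categories + 3 Colour categories + 1 passthrough column
transformed_X = transformer.fit_transform(car_sales)
```

The result is entirely numeric, so it can be passed to an estimator's fit() method.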
1.2 What if there were missing values?
Fill them with some value (also known as imputation).
Remove the samples with missing data altogether.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-46-f532939289ac> in <module>
11 remainder="passthrough")
12
---> 13 transformed_X = transformer.fit_transform(X)
14 transformed_X
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py in fit_transform(self, X, y)
516 self._validate_remainder(X)
517
--> 518 result = self._fit_transform(X, y, _fit_transform_one)
519
520 if not result:
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py in _fit_transform(self, X, y, func, fitted)
455 message=self._log_message(name, idx, len(transformers)))
456 for idx, (name, trans, column, weight) in enumerate(
--> 457 self._iter(fitted=fitted, replace_strings=True), 1))
458 except ValueError as e:
459 if "Expected 2D array, got 1D array instead" in str(e):
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
1002 # remaining jobs.
1003 self._iterating = False
-> 1004 if self.dispatch_one_batch(iterator):
1005 self._iterating = self._original_iterator is not None
1006
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
833 return False
834 else:
--> 835 self._dispatch(tasks)
836 return True
837
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/joblib/parallel.py in _dispatch(self, batch)
752 with self._lock:
753 job_idx = len(self._jobs)
--> 754 job = self._backend.apply_async(batch, callback=cb)
755 # A job can complete so quickly than its callback is
756 # called before we get here, causing self._jobs to
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
207 def apply_async(self, func, callback=None):
208 """Schedule a func to be run"""
--> 209 result = ImmediateResult(func)
210 if callback:
211 callback(result)
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/joblib/_parallel_backends.py in __init__(self, batch)
588 # Don't delay the application, to avoid keeping the input
589 # arguments in memory
--> 590 self.results = batch()
591
592 def get(self):
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/joblib/parallel.py in __call__(self)
254 with parallel_backend(self._backend, n_jobs=self._n_jobs):
255 return [func(*args, **kwargs)
--> 256 for func, args, kwargs in self.items]
257
258 def __len__(self):
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/joblib/parallel.py in <listcomp>(.0)
254 with parallel_backend(self._backend, n_jobs=self._n_jobs):
255 return [func(*args, **kwargs)
--> 256 for func, args, kwargs in self.items]
257
258 def __len__(self):
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
726 with _print_elapsed_time(message_clsname, message):
727 if hasattr(transformer, 'fit_transform'):
--> 728 res = transformer.fit_transform(X, y, **fit_params)
729 else:
730 res = transformer.fit(X, y, **fit_params).transform(X)
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py in fit_transform(self, X, y)
370 """
371 self._validate_keywords()
--> 372 return super().fit_transform(X, y)
373
374 def transform(self, X):
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
569 if y is None:
570 # fit method of arity 1 (unsupervised transformation)
--> 571 return self.fit(X, **fit_params).transform(X)
572 else:
573 # fit method of arity 2 (supervised transformation)
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py in fit(self, X, y)
345 """
346 self._validate_keywords()
--> 347 self._fit(X, handle_unknown=self.handle_unknown)
348 self.drop_idx_ = self._compute_drop_idx()
349 return self
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py in _fit(self, X, handle_unknown)
72
73 def _fit(self, X, handle_unknown='error'):
---> 74 X_list, n_samples, n_features = self._check_X(X)
75
76 if self.categories != 'auto':
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py in _check_X(self, X)
59 Xi = self._get_feature(X, feature_idx=i)
60 Xi = check_array(Xi, ensure_2d=False, dtype=None,
---> 61 force_all_finite=needs_validation)
62 X_columns.append(Xi)
63
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
576 if force_all_finite:
577 _assert_all_finite(array,
--> 578 allow_nan=force_all_finite == 'allow-nan')
579
580 if ensure_min_samples > 0:
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
63 elif X.dtype == np.dtype('object') and not allow_nan:
64 if _object_dtype_isnan(X).any():
---> 65 raise ValueError("Input contains NaN")
66
67
ValueError: Input contains NaN
Option 1: Fill missing data with Pandas
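A minimal sketch of this option, assuming a made-up DataFrame (the column names and fill values are hypothetical choices, not the notebook's): feature columns are filled with fillna() and rows with a missing label are dropped with dropna().

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values
car_sales_missing = pd.DataFrame({
    "Make": ["Toyota", np.nan, "Honda", "BMW"],
    "Doors": [4.0, 4.0, np.nan, 5.0],
    "Price": [15000.0, 9000.0, np.nan, 22000.0],
})

# Fill categorical and numeric feature columns with placeholder values
car_sales_filled = car_sales_missing.copy()
car_sales_filled["Make"] = car_sales_filled["Make"].fillna("missing")
car_sales_filled["Doors"] = car_sales_filled["Doors"].fillna(4)

# Drop rows where the label (Price) is missing
car_sales_filled = car_sales_filled.dropna(subset=["Price"])
```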
Option 2: Filling missing data and transforming categorical data with Scikit-Learn
Note: This section is different to the video. The video shows filling and transforming the entire dataset (X) and, although the techniques are correct, it's best to fill and transform training and test sets separately (as shown in the code below).
The main takeaways:
Split your data first (into train/test)
Fill/transform the training and test sets separately
Thank you Robert for pointing this out.
Note: The transformed data has 50 fewer values because we dropped the 50 rows with missing values in the Price column.
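The split-first takeaway can be sketched with SimpleImputer. This is one common reading of it, not necessarily the notebook's exact code: the imputer learns its fill values from the training set only and then reuses them on the test set. The random data below is an assumption for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical numeric feature matrix with roughly 10% missing entries
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[rng.random(X.shape) < 0.1] = np.nan
y = rng.normal(size=100)

# Split first so the test set does not influence the fill values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

imputer = SimpleImputer(strategy="mean")
X_train_filled = imputer.fit_transform(X_train)  # learn means from training data
X_test_filled = imputer.transform(X_test)        # reuse the training means
```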
2. Choosing the right estimator/algorithm for our problem
Scikit-Learn uses estimator as another term for machine learning model or algorithm.
Classification - predicting whether a sample is one thing or another
Regression - predicting a number
Step 1 - Check the Scikit-Learn machine learning map... https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
2.1 Picking a machine learning model for a regression problem
How do we improve this score?
What if Ridge wasn't working?
Let's refer back to the map... https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
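As a rough sketch of trying a second estimator when the first one disappoints, the snippet below fits Ridge and RandomForestRegressor on the same split and records each test score. It assumes synthetic data from make_regression rather than the notebook's dataset, so the scores say nothing about which model suits the original problem.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption)
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Ridge": Ridge(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Fit each model and record its R^2 on the test set
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
```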
2.2 Choosing an estimator for a classification problem
Let's go to the map... https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
<ipython-input-73-44f78f3704d4> in <module>
----> 1 heart_disease = pd.read_csv("data/heart-disease.csv")
2 heart_disease.head()
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
674 )
675
--> 676 return _read(filepath_or_buffer, kwds)
677
678 parser_f.__name__ = name
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
446
447 # Create the parser.
--> 448 parser = TextFileReader(fp_or_buf, **kwds)
449
450 if chunksize or iterator:
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
878 self.options["has_index_names"] = kwds["has_index_names"]
879
--> 880 self._make_engine(self.engine)
881
882 def close(self):
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
1112 def _make_engine(self, engine="c"):
1113 if engine == "c":
-> 1114 self._engine = CParserWrapper(self.f, **self.options)
1115 else:
1116 if engine == "python":
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
1889 kwds["usecols"] = self.usecols
1890
-> 1891 self._reader = parsers.TextReader(src, **kwds)
1892 self.unnamed_cols = self._reader.unnamed_cols
1893
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._setup_parser_source()
FileNotFoundError: [Errno 2] File data/heart-disease.csv does not exist: 'data/heart-disease.csv'
Consulting the map, it says to try LinearSVC.
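A minimal sketch of fitting LinearSVC, assuming synthetic data from make_classification because the heart disease CSV is not available here; max_iter is raised as a precaution against convergence warnings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption)
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = LinearSVC(max_iter=10000, random_state=42)
clf.fit(X_train, y_train)

# Mean accuracy on the test set
accuracy = clf.score(X_test, y_test)
```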
Tidbit:
3. Fit the model/algorithm on our data and use it to make predictions
3.1 Fitting the model to the data
Different names for:
X = features, feature variables, data
y = labels, targets, target variables
Random Forest model deep dive
These resources will help you understand what's happening inside the Random Forest models we've been using.
3.2 Make predictions using a machine learning model
2 ways to make predictions:
predict()
predict_proba()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-82-5908053f578c> in <module>
1 # Use a trained model to make predictions
----> 2 clf.predict(np.array([1, 7, 8, 3, 4])) # this doesn't work...
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklearn/ensemble/_forest.py in predict(self, X)
610 The predicted classes.
611 """
--> 612 proba = self.predict_proba(X)
613
614 if self.n_outputs_ == 1:
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklearn/ensemble/_forest.py in predict_proba(self, X)
654 check_is_fitted(self)
655 # Check data
--> 656 X = self._validate_X_predict(X)
657
658 # Assign chunk of trees to jobs
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklearn/ensemble/_forest.py in _validate_X_predict(self, X)
410 check_is_fitted(self)
411
--> 412 return self.estimators_[0]._validate_X_predict(X, check_input=True)
413
414 @property
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklearn/tree/_classes.py in _validate_X_predict(self, X, check_input)
378 """Validate X whenever one tries to predict, apply, predict_proba"""
379 if check_input:
--> 380 X = check_array(X, dtype=DTYPE, accept_sparse="csr")
381 if issparse(X) and (X.indices.dtype != np.intc or
382 X.indptr.dtype != np.intc):
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
554 "Reshape your data either using array.reshape(-1, 1) if "
555 "your data has a single feature or array.reshape(1, -1) "
--> 556 "if it contains a single sample.".format(array))
557
558 # in the future np.flexible dtypes will be handled like object dtypes
ValueError: Expected 2D array, got 1D array instead:
array=[1. 7. 8. 3. 4.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Make predictions with predict_proba()
- use this if someone asks you "what's the probability your model is assigning to each prediction?"
predict() can also be used for regression models.
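To make the difference between the two methods concrete, the sketch below calls both on the same five samples. It assumes synthetic data and a RandomForestClassifier; the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption)
X, y = make_classification(n_samples=200, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(n_estimators=20, random_state=42)
clf.fit(X_train, y_train)

# predict() returns one class label per sample
y_preds = clf.predict(X_test[:5])

# predict_proba() returns one probability per class for each sample
y_probs = clf.predict_proba(X_test[:5])

# Each row of probabilities sums to 1
row_sums = y_probs.sum(axis=1)
```

Note that X_test[:5] is already 2D (five rows of six features), which is why no reshape is needed here.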
4. Evaluating a machine learning model
Three ways to evaluate Scikit-Learn models/estimators:
Estimator score method
The scoring parameter
Problem-specific metric functions
4.1 Evaluating a model with the score method
Let's do the same but for regression...
4.2 Evaluating a model using the scoring parameter
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-108-cca012993b3a> in <module>
1 # Default scoring parameter of classifier = mean accuracy
----> 2 clf.score()
TypeError: score() missing 2 required positional arguments: 'X' and 'y'
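The TypeError above appears because score() was called without data: it needs the features and the true labels to compare predictions against. A sketch of the working call for both a classifier and a regressor, assuming synthetic data and illustrative variable names:

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Classification: score() returns mean accuracy by default
Xc, yc = make_classification(n_samples=200, n_features=6, random_state=42)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    Xc, yc, test_size=0.2, random_state=42
)
clf = RandomForestClassifier(n_estimators=20, random_state=42)
clf.fit(Xc_train, yc_train)
clf_accuracy = clf.score(Xc_test, yc_test)

# Regression: score() returns R^2 by default
Xr, yr = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=42)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    Xr, yr, test_size=0.2, random_state=42
)
reg = RandomForestRegressor(n_estimators=20, random_state=42)
reg.fit(Xr_train, yr_train)
reg_r2 = reg.score(Xr_test, yr_test)
```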
4.2.1 Classification model evaluation metrics
Accuracy
Area under ROC curve
Confusion matrix
Classification report
Accuracy
Area under the receiver operating characteristic curve (AUC/ROC)
Area under curve (AUC)
ROC curve
ROC curves are a comparison of a model's true positive rate (tpr) versus a model's false positive rate (fpr).
True positive = model predicts 1 when truth is 1
False positive = model predicts 1 when truth is 0
True negative = model predicts 0 when truth is 0
False negative = model predicts 0 when truth is 1
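These definitions feed directly into the ROC curve. As a small sketch on four hand-written samples (the labels and scores are made up for illustration), roc_curve() returns the false and true positive rates at each threshold and roc_auc_score() returns the area under that curve.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up true labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

# False positive rate and true positive rate at each decision threshold
fpr, tpr, thresholds = roc_curve(y_true, y_scores)

# Area under the ROC curve; 3 of the 4 positive/negative pairs
# are ranked correctly, so this is 0.75
auc = roc_auc_score(y_true, y_scores)
```

In practice y_scores would typically be the positive-class column of predict_proba().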
Confusion Matrix
A confusion matrix is a quick way to compare the labels a model predicts and the actual labels it was supposed to predict.
In essence, giving you an idea of where the model is getting confused.
Note: In the original notebook, the function below had the "True label" as the x-axis label and the "Predicted label" as the y-axis label. But due to the way confusion_matrix() outputs values, these should be swapped around. The code below has been corrected.
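The orientation is easiest to see on a tiny hand-made example. In the sketch below (labels are made up for illustration), confusion_matrix() puts true labels on the rows and predicted labels on the columns.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 3 negatives and 4 positives
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 0]

# Rows = true labels, columns = predicted labels
conf_mat = confusion_matrix(y_true, y_pred)

# For binary labels the flattened order is TN, FP, FN, TP
tn, fp, fn, tp = conf_mat.ravel()
```

Here the first row holds the samples whose true label is 0 (one predicted 0, two predicted 1) and the second row holds those whose true label is 1 (one predicted 0, three predicted 1).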
Classification Report
To summarize classification metrics:
Accuracy is a good measure to start with if all classes are balanced (e.g. the same number of samples labelled 0 and 1).
Precision and recall become more important when classes are imbalanced.
If false positive predictions are worse than false negatives, aim for higher precision.
If false negative predictions are worse than false positives, aim for higher recall.
F1-score is a combination of precision and recall.
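A short sketch of these metrics on made-up labels, computing each one with its own function and then printing the combined classification report:

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up labels: TP = 3, FP = 2, FN = 1, TN = 1
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 4/7
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3/5
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3/4
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

# Text summary of precision, recall and F1 for every class
report = classification_report(y_true, y_pred)
print(report)
```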
4.2.2 Regression model evaluation metrics
Model evaluation metrics documentation - https://scikit-learn.org/stable/modules/model_evaluation.html
R^2 (pronounced r-squared) or coefficient of determination.
Mean absolute error (MAE)
Mean squared error (MSE)
R^2
What R-squared does: Compares your model's predictions to the mean of the targets. Values can range from negative infinity (a very poor model) to 1. For example, if all your model does is predict the mean of the targets, its R^2 value would be 0. And if your model perfectly predicts a range of numbers, its R^2 value would be 1.
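The two reference points in this description can be checked directly with r2_score() on a handful of made-up target values:

```python
from sklearn.metrics import r2_score

# Made-up targets with a mean of 3.0
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]

# Predicting the mean for every sample gives R^2 = 0
y_mean_preds = [3.0] * len(y_true)
r2_mean = r2_score(y_true, y_mean_preds)

# Predicting every target exactly gives R^2 = 1
r2_perfect = r2_score(y_true, y_true)

# Predictions worse than the mean give a negative R^2
r2_bad = r2_score(y_true, [5.0, 4.0, 3.0, 2.0, 1.0])
```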
Mean absolute error (MAE)
MAE is the average of the absolute differences between predictions and actual values. It gives you an idea of how wrong your model's predictions are.
Mean squared error (MSE)
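MSE is the average of the squared differences between predictions and actual values, so larger errors are penalised more heavily than in MAE. A small worked sketch of both metrics on made-up numbers:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up targets and predictions; the errors are -0.5, 0, 1.5 and 1
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

# MAE: (0.5 + 0 + 1.5 + 1) / 4 = 0.75
mae = mean_absolute_error(y_true, y_pred)

# MSE: (0.25 + 0 + 2.25 + 1) / 4 = 0.875
mse = mean_squared_error(y_true, y_pred)
```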
4.2.3 Finally using the scoring parameter
How about our regression model?
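A sketch of the scoring parameter with cross_val_score(), assuming synthetic data and small forests to keep it quick. The scoring strings used here ("accuracy", "precision", "r2", "neg_mean_absolute_error") are built-in Scikit-Learn scorer names; error metrics are negated so that higher is always better.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Classification example on synthetic data (assumption)
X_clf, y_clf = make_classification(n_samples=200, n_features=6, random_state=42)
clf = RandomForestClassifier(n_estimators=20, random_state=42)
cv_accuracy = cross_val_score(clf, X_clf, y_clf, cv=5, scoring="accuracy")
cv_precision = cross_val_score(clf, X_clf, y_clf, cv=5, scoring="precision")

# Regression example on synthetic data (assumption)
X_reg, y_reg = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=42)
reg = RandomForestRegressor(n_estimators=20, random_state=42)
cv_r2 = cross_val_score(reg, X_reg, y_reg, cv=5, scoring="r2")
cv_neg_mae = cross_val_score(reg, X_reg, y_reg, cv=5, scoring="neg_mean_absolute_error")
```

Each result is an array with one score per fold, so taking its mean gives a single cross-validated figure.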
4.3 Using different evaluation metrics as Scikit-Learn functions
Classification evaluation functions
Regression evaluation functions
5. Improving a model
First predictions = baseline predictions. First model = baseline model.
From a data perspective:
Could we collect more data? (generally, the more data, the better)
Could we improve our data?
From a model perspective:
Is there a better model we could use?
Could we improve the current model?
Hyperparameters vs. Parameters
Parameters = patterns in data that the model finds by itself
Hyperparameters = settings on a model you can adjust to (potentially) improve its ability to find patterns
Three ways to adjust hyperparameters:
By hand
Randomly with RandomizedSearchCV
Exhaustively with GridSearchCV
5.1 Tuning hyperparameters by hand
Let's make three sets: training, validation and test.
We're going to try and adjust:
max_depth
max_features
min_samples_leaf
min_samples_split
n_estimators
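A sketch of tuning by hand with three sets, assuming synthetic data and a 70/15/15 split (the split ratio and the hyperparameter values tried are illustrative choices, not the notebook's). Candidate settings are compared on the validation set and the test set is left for a final check.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# 70% training, then split the remaining 30% evenly into validation and test
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42
)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)

# Baseline model with default hyperparameters
baseline = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Second model with hand-picked hyperparameters
adjusted = RandomForestClassifier(
    n_estimators=50, max_depth=10, min_samples_leaf=2, random_state=42
).fit(X_train, y_train)

# Compare both on the validation set
valid_scores = {
    "baseline": accuracy_score(y_valid, baseline.predict(X_valid)),
    "adjusted": accuracy_score(y_valid, adjusted.predict(X_valid)),
}
```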
5.2 Hyperparameter tuning with RandomizedSearchCV
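A minimal sketch of RandomizedSearchCV, assuming synthetic data and a deliberately small search space; the candidate values below are illustrative. n_iter controls how many random combinations are tried and cv sets the number of cross-validation folds.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (assumption)
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate values for each hyperparameter (illustrative)
grid = {
    "n_estimators": [10, 30, 50],
    "max_depth": [None, 5, 10],
    "max_features": ["sqrt", "log2"],
    "min_samples_split": [2, 4, 6],
    "min_samples_leaf": [1, 2, 4],
}

rs_clf = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=grid,
    n_iter=5,          # try 5 random combinations
    cv=3,              # 3-fold cross-validation for each one
    random_state=42,
)
rs_clf.fit(X_train, y_train)

best_params = rs_clf.best_params_
test_accuracy = rs_clf.score(X_test, y_test)  # uses the refit best model
```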
5.3 Hyperparameter tuning with GridSearchCV
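GridSearchCV tries every combination in the grid rather than a random sample, so the grid is usually kept smaller. A sketch under the same assumptions as before (synthetic data, illustrative candidate values):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption)
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2 x 2 = 4 combinations, each evaluated with 3-fold cross-validation
grid = {
    "n_estimators": [10, 30],
    "max_depth": [5, 10],
}

gs_clf = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=grid,
    cv=3,
)
gs_clf.fit(X_train, y_train)

best_params = gs_clf.best_params_
test_accuracy = gs_clf.score(X_test, y_test)  # uses the refit best model
```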
Let's compare our different models' metrics.
6. Saving and loading trained machine learning models
Two ways to save and load machine learning models:
With Python's pickle module
With the joblib module
Pickle
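A sketch of saving and loading a fitted model with pickle. The filename is hypothetical and a temporary directory is used so the example leaves nothing behind; the data is synthetic.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Fit a small model on synthetic data (assumption)
X, y = make_classification(n_samples=100, n_features=5, random_state=42)
clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

# Hypothetical file location inside a temporary directory
model_path = os.path.join(tempfile.mkdtemp(), "random_forest_model.pkl")

# Save the model to file
with open(model_path, "wb") as f:
    pickle.dump(clf, f)

# Load it back
with open(model_path, "rb") as f:
    loaded_pickle_model = pickle.load(f)

# The loaded model makes the same predictions as the original
same_predictions = (loaded_pickle_model.predict(X) == clf.predict(X)).all()
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.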
Joblib
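The joblib equivalent uses dump() and load(). As before, the filename is hypothetical, the directory is temporary and the data is synthetic.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Fit a small model on synthetic data (assumption)
X, y = make_classification(n_samples=100, n_features=5, random_state=42)
clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

# Hypothetical file location inside a temporary directory
model_path = os.path.join(tempfile.mkdtemp(), "random_forest_model.joblib")

# Save the model to file, then load it back
dump(clf, filename=model_path)
loaded_joblib_model = load(filename=model_path)

# The loaded model makes the same predictions as the original
same_predictions = (loaded_joblib_model.predict(X) == clf.predict(X)).all()
```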
7. Putting it all together!
Steps we want to do (all in one cell):
Fill missing data
Convert data to numbers
Build a model on the data
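The three steps above can be sketched as a single Pipeline. This is an illustration under stated assumptions rather than the notebook's original cell: the DataFrame is generated on the spot, and its column names, the fill strategies and the step names ("imputer", "onehot", "preprocessor", "model") are all hypothetical choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with a few missing feature values
rng = np.random.default_rng(42)
n = 60
data = pd.DataFrame({
    "Make": pd.Series(rng.choice(["Toyota", "Honda", "BMW"], size=n), dtype=object),
    "Doors": rng.choice([3.0, 4.0, 5.0], size=n),
    "Odometer": rng.integers(10_000, 200_000, size=n).astype(float),
})
data["Price"] = 30000 - 0.1 * data["Odometer"] + rng.normal(0, 1000, size=n)
data.loc[data.index[::10], "Make"] = np.nan
data.loc[data.index[5::10], "Odometer"] = np.nan

# Step 1 and 2: fill missing data, then convert categories to numbers
categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
numeric_transformer = SimpleImputer(strategy="mean")
preprocessor = ColumnTransformer([
    ("cat", categorical_transformer, ["Make"]),
    ("num", numeric_transformer, ["Doors", "Odometer"]),
])

# Step 3: build a model on the preprocessed data
model = Pipeline([
    ("preprocessor", preprocessor),
    ("model", RandomForestRegressor(n_estimators=20, random_state=42)),
])

X = data.drop("Price", axis=1)
y = data["Price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model.fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)
```

Because the preprocessing lives inside the Pipeline, it is fitted on the training data only and then reapplied to the test data when score() is called.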
It's also possible to use GridSearchCV or RandomizedSearchCV with our Pipeline.
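When searching over a Pipeline, each hyperparameter is addressed as the step name, two underscores, then the parameter name. A sketch assuming a simple two-step pipeline on synthetic data; the step names "imputer" and "model" and the candidate values are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data with roughly 5% of the feature values removed (assumption)
X, y = make_regression(n_samples=120, n_features=4, noise=5.0, random_state=42)
rng = np.random.default_rng(42)
X[rng.random(X.shape) < 0.05] = np.nan

pipe = Pipeline([
    ("imputer", SimpleImputer()),
    ("model", RandomForestRegressor(random_state=42)),
])

# Keys follow the pattern <step name>__<parameter name>
pipe_grid = {
    "imputer__strategy": ["mean", "median"],
    "model__n_estimators": [10, 30],
}

gs_pipe = GridSearchCV(pipe, param_grid=pipe_grid, cv=3)
gs_pipe.fit(X, y)

best_params = gs_pipe.best_params_
```

This lets the search tune preprocessing choices, such as the imputation strategy, alongside the model's own hyperparameters.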