🚜 Predicting the Sale Price of Bulldozers using Machine Learning
In this notebook, we're going to go through an example machine learning project with the goal of predicting the sale price of bulldozers.
1. Problem definition
How well can we predict the future sale price of a bulldozer, given its characteristics and previous examples of how much similar bulldozers have been sold for?
2. Data
The data is downloaded from the Kaggle Bluebook for Bulldozers competition: https://www.kaggle.com/c/bluebook-for-bulldozers/data
There are 3 main datasets:
Train.csv is the training set, which contains data through the end of 2011.
Valid.csv is the validation set, which contains data from January 1, 2012 - April 30, 2012. You make predictions on this set throughout the majority of the competition. Your score on this set is used to create the public leaderboard.
Test.csv is the test set, which won't be released until the last week of the competition. It contains data from May 1, 2012 - November 2012. Your score on the test set determines your final rank for the competition.
3. Evaluation
The evaluation metric for this competition is the RMSLE (root mean squared log error) between the actual and predicted auction prices.
For more on the evaluation of this project check: https://www.kaggle.com/c/bluebook-for-bulldozers/overview/evaluation
Note: The goal for most regression evaluation metrics is to minimize the error. For example, our goal for this project will be to build a machine learning model which minimises RMSLE.
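For reference, RMSLE is commonly written as below, where n is the number of sales, ŷᵢ is the predicted price and yᵢ is the actual price. The "+ 1" inside the logarithm is an assumption here: it is the convention used by scikit-learn's mean_squared_log_error, so check the competition's evaluation page for the exact form Kaggle uses.

```latex
\mathrm{RMSLE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \log(\hat{y}_i + 1) - \log(y_i + 1) \right)^2}
```

Because the error is taken on the log of the prices, being off by 10% on a cheap bulldozer costs about as much as being off by 10% on an expensive one.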
4. Features
Kaggle provides a data dictionary detailing all of the features of the dataset. You can view this data dictionary on Google Sheets: https://docs.google.com/spreadsheets/d/18ly-bLR8sbDJLITkWG7ozKm8l3RyieQ2Fpgix-beSYI/edit?usp=sharing
Parsing dates
When we work with time series data, we want to enrich the time & date component as much as possible.
We can do that by telling pandas which of our columns has dates in it using the parse_dates parameter.
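As a minimal sketch of parse_dates in action, the example below reads a tiny inline CSV instead of the real Train.csv. The three rows are made up for illustration only, and the SalesID column name is an assumption.

```python
import io

import pandas as pd

# Tiny stand-in for Train.csv (made-up rows, for illustration only)
csv_text = """SalesID,SalePrice,saledate
1,66000,2006-11-16
2,57000,2004-03-26
3,10000,2004-02-26
"""

# parse_dates tells pandas to convert the listed columns to datetime64
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["saledate"])
print(df["saledate"].dtype)
```

Without parse_dates, the saledate column would be read in as plain strings, and none of the date-based operations later on would work.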
Sort DataFrame by saledate
When working with time series data, it's a good idea to sort it by date.
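A small sketch of the sort, using a made-up three-row DataFrame in place of the real data:

```python
import pandas as pd

# Made-up rows standing in for the bulldozer data
df = pd.DataFrame({
    "SalePrice": [66000, 57000, 10000],
    "saledate": pd.to_datetime(["2006-11-16", "2004-03-26", "2004-02-26"]),
})

# Sort oldest sale first; inplace=True modifies df directly
df.sort_values(by=["saledate"], inplace=True, ascending=True)
print(df["saledate"].head())
```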
Make a copy of the original DataFrame
We make a copy of the original dataframe so when we manipulate the copy, we've still got our original data.
Add datetime parameters for saledate column
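The copy and the datetime enrichment could look something like the sketch below. The new column names (saleYear, saleMonth and so on) are assumptions chosen for illustration, and the two rows are made up.

```python
import pandas as pd

# Made-up rows standing in for the bulldozer data
df = pd.DataFrame({
    "SalePrice": [10000, 57000],
    "saledate": pd.to_datetime(["2004-02-26", "2006-11-16"]),
})

# Work on a copy so the original DataFrame stays untouched
df_tmp = df.copy()

# The .dt accessor exposes the components of a datetime column
df_tmp["saleYear"] = df_tmp["saledate"].dt.year
df_tmp["saleMonth"] = df_tmp["saledate"].dt.month
df_tmp["saleDay"] = df_tmp["saledate"].dt.day
df_tmp["saleDayOfWeek"] = df_tmp["saledate"].dt.dayofweek
df_tmp["saleDayOfYear"] = df_tmp["saledate"].dt.dayofyear

# Once expanded, the raw datetime column can be dropped from the copy
df_tmp = df_tmp.drop("saledate", axis=1)
print(df_tmp.head())
```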
5. Modelling
We've done enough EDA for now (we could always do more), so let's start on some model-driven EDA.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-37-cd4fbc98a101> in <module>
5 random_state=42)
6
----> 7 model.fit(df_tmp.drop("SalePrice", axis=1), df_tmp["SalePrice"])
~/Desktop/ml-course/bulldozer-price-prediction-project/env/lib/python3.7/site-packages/sklearn/ensemble/_forest.py in fit(self, X, y, sample_weight)
293 """
294 # Validate or convert input data
--> 295 X = check_array(X, accept_sparse="csc", dtype=DTYPE)
296 y = check_array(y, accept_sparse='csc', ensure_2d=False, dtype=None)
297 if sample_weight is not None:
~/Desktop/ml-course/bulldozer-price-prediction-project/env/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
529 array = array.astype(dtype, casting="unsafe", copy=False)
530 else:
--> 531 array = np.asarray(array, order=order, dtype=dtype)
532 except ComplexWarning:
533 raise ValueError("Complex data not supported\n"
~/Desktop/ml-course/bulldozer-price-prediction-project/env/lib/python3.7/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
83
84 """
---> 85 return array(a, dtype, copy=False, order=order)
86
87
ValueError: could not convert string to float: 'Low'
Convert string to categories
One way we can turn all of our data into numbers is by converting the string columns into pandas categories.
We can check the different datatypes compatible with pandas here: https://pandas.pydata.org/pandas-docs/stable/reference/general_utility_functions.html#data-types-related-functionality
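Here is a rough sketch of the conversion on a made-up DataFrame. The UsageBand column name is an assumption (the 'Low' value in the error above suggests a column like it), and checking for "not numeric" is just one way to pick out the string columns.

```python
import pandas as pd

# Made-up rows; UsageBand is an illustrative string column
df_tmp = pd.DataFrame({
    "UsageBand": ["Low", "High", None, "Medium"],
    "SalePrice": [10000, 57000, 23000, 41000],
})

# Convert every non-numeric column to an ordered pandas category
for label in list(df_tmp.columns):
    content = df_tmp[label]
    if not pd.api.types.is_numeric_dtype(content):
        df_tmp[label] = content.astype("category").cat.as_ordered()

# Each category is backed by an integer code; missing values get -1
print(df_tmp["UsageBand"].cat.categories)
print(df_tmp["UsageBand"].cat.codes)
```

The DataFrame still displays the original strings, but the integer codes underneath are what a model can eventually work with.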
Thanks to pandas Categories we now have a way to access all of our data in the form of numbers.
But we still have a bunch of missing data...
Save preprocessed data
Fill missing values
Fill numerical missing values first
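One common approach, sketched below under assumptions, is to fill each numeric gap with the column median and keep a boolean column recording where the gaps were. The MachineHoursCurrentMeter name, the "_is_missing" suffix and the values are illustrative, not taken from the text above.

```python
import numpy as np
import pandas as pd

# Made-up rows with gaps in one numeric column
df_tmp = pd.DataFrame({
    "MachineHoursCurrentMeter": [68.0, np.nan, 4640.0, np.nan, 722.0],
    "SalePrice": [10000, 57000, 23000, 41000, 9500],
})

for label in list(df_tmp.columns):
    content = df_tmp[label]
    if pd.api.types.is_numeric_dtype(content) and content.isnull().any():
        # Record which rows were missing before the gap is filled
        df_tmp[label + "_is_missing"] = content.isnull()
        # The median is less sensitive to outliers than the mean
        df_tmp[label] = content.fillna(content.median())

print(df_tmp)
```

Keeping the indicator column means the model can still learn from the fact that a value was missing, which may itself carry information.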
Filling and turning categorical variables into numbers
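A sketch of one way to do this: replace each categorical column with its integer codes plus one, so missing values (which pandas encodes as -1) become 0. The column name and the "_is_missing" suffix are assumptions for illustration.

```python
import pandas as pd

# Made-up rows; UsageBand is an illustrative categorical column with a gap
df_tmp = pd.DataFrame({
    "UsageBand": pd.Categorical(["Low", "High", None, "Medium"]),
    "SalePrice": [10000, 57000, 23000, 41000],
})

for label in list(df_tmp.columns):
    content = df_tmp[label]
    if not pd.api.types.is_numeric_dtype(content):
        # Record which rows had no category at all
        df_tmp[label + "_is_missing"] = content.isnull()
        # Missing categories have code -1, so adding 1 maps them to 0
        df_tmp[label] = pd.Categorical(content).codes + 1

print(df_tmp)
```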
Now that all of our data is numeric and our dataframe has no missing values, we should be able to build a machine learning model.
Question: Why doesn't the above metric hold water? (Why isn't the metric reliable?)
Splitting data into train/validation sets
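Since the training data runs through the end of 2011 and the validation data covers early 2012, one way to split a combined DataFrame is by sale year. The sketch below assumes a saleYear column exists and uses made-up rows; YearMade is only there as an example feature.

```python
import pandas as pd

# Made-up rows; saleYear is assumed to have been extracted from saledate
df_tmp = pd.DataFrame({
    "saleYear": [2010, 2011, 2011, 2012, 2012],
    "YearMade": [2001, 2005, 1998, 2007, 2003],
    "SalePrice": [10000, 57000, 23000, 41000, 9500],
})

# Validation set: sales from 2012; training set: everything before 2012
df_val = df_tmp[df_tmp["saleYear"] == 2012]
df_train = df_tmp[df_tmp["saleYear"] < 2012]

# Separate the features (X) from the target (y)
X_train, y_train = df_train.drop("SalePrice", axis=1), df_train["SalePrice"]
X_valid, y_valid = df_val.drop("SalePrice", axis=1), df_val["SalePrice"]
print(X_train.shape, X_valid.shape)
```

Splitting by date rather than at random keeps the validation set in the "future" relative to the training set, which matches the problem we're trying to solve.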
Building an evaluation function
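A minimal sketch of an RMSLE helper, built from scikit-learn's mean_squared_log_error. The function name rmsle and the sample prices are our own choices for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error


def rmsle(y_true, y_pred):
    """Root mean squared log error between true and predicted values."""
    return np.sqrt(mean_squared_log_error(y_true, y_pred))


# Made-up prices to exercise the function
y_true = np.array([10000.0, 57000.0, 23000.0])
perfect = rmsle(y_true, y_true)       # identical predictions
off = rmsle(y_true, y_true * 1.1)     # every prediction 10% too high
print(perfect, off)
```

Calling the same function on both the training and validation predictions makes it easy to see whether the model is overfitting.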
Testing our model on a subset (to tune the hyperparameters)
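One way to speed up experiments is to let each tree in the forest see only a sample of the rows, using RandomForestRegressor's max_samples parameter. The sketch below uses synthetic data and arbitrary numbers (20 trees, 100 rows per tree) purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data standing in for the preprocessed training set
rng = np.random.default_rng(42)
X_train = rng.random((500, 4))
y_train = X_train @ np.array([3.0, 1.0, 0.5, 0.0]) + 10.0

# max_samples caps how many rows each tree is trained on
# (it only applies when bootstrap=True, which is the default)
model = RandomForestRegressor(n_estimators=20, max_samples=100, random_state=42)
model.fit(X_train, y_train)
print(model.score(X_train, y_train))
```

Each tree now trains on 100 rows instead of 500, so fitting is much faster, at the cost of some accuracy. That trade is usually worth it while searching for hyperparameters.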
Hyperparameter tuning with RandomizedSearchCV
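A rough sketch of a randomized search is below. The search space, the synthetic data and the small n_iter are all assumptions chosen so the example runs quickly; they are not the settings used for the final model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for the preprocessed training set
rng = np.random.default_rng(42)
X_train = rng.random((200, 4))
y_train = X_train @ np.array([3.0, 1.0, 0.5, 0.0]) + 10.0

# Hypothetical search space; widen or narrow it to suit the data
rf_grid = {
    "n_estimators": np.arange(10, 50, 10),
    "max_depth": [None, 3, 5, 10],
    "min_samples_split": np.arange(2, 20, 2),
    "min_samples_leaf": np.arange(1, 20, 2),
    "max_features": [0.5, 1.0, "sqrt"],
}

rs_model = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=rf_grid,
    n_iter=5,          # number of random combinations to try
    cv=3,              # 3-fold cross-validation for each combination
    random_state=42,
)
rs_model.fit(X_train, y_train)
print(rs_model.best_params_)
```

Raising n_iter tries more combinations and tends to find better settings, but the run time grows with it.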
Train a model with the best hyperparameters
Note: These were found after 100 iterations of RandomizedSearchCV.
Make predictions on test data
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-140-dcaddf54aa59> in <module>
1 # Make predictions on the test dataset
----> 2 test_preds = ideal_model.predict(df_test)
~/Desktop/ml-course/bulldozer-price-prediction-project/env/lib/python3.7/site-packages/sklearn/ensemble/_forest.py in predict(self, X)
764 check_is_fitted(self)
765 # Check data
--> 766 X = self._validate_X_predict(X)
767
768 # Assign chunk of trees to jobs
~/Desktop/ml-course/bulldozer-price-prediction-project/env/lib/python3.7/site-packages/sklearn/ensemble/_forest.py in _validate_X_predict(self, X)
410 check_is_fitted(self)
411
--> 412 return self.estimators_[0]._validate_X_predict(X, check_input=True)
413
414 @property
~/Desktop/ml-course/bulldozer-price-prediction-project/env/lib/python3.7/site-packages/sklearn/tree/_classes.py in _validate_X_predict(self, X, check_input)
378 """Validate X whenever one tries to predict, apply, predict_proba"""
379 if check_input:
--> 380 X = check_array(X, dtype=DTYPE, accept_sparse="csr")
381 if issparse(X) and (X.indices.dtype != np.intc or
382 X.indptr.dtype != np.intc):
~/Desktop/ml-course/bulldozer-price-prediction-project/env/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
529 array = array.astype(dtype, casting="unsafe", copy=False)
530 else:
--> 531 array = np.asarray(array, order=order, dtype=dtype)
532 except ComplexWarning:
533 raise ValueError("Complex data not supported\n"
~/Desktop/ml-course/bulldozer-price-prediction-project/env/lib/python3.7/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
83
84 """
---> 85 return array(a, dtype, copy=False, order=order)
86
87
ValueError: could not convert string to float: 'Low'
Preprocessing the data (getting the test dataset in the same format as our training dataset)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-144-816969861579> in <module>
1 # Make predictions on updated test data
----> 2 test_preds = ideal_model.predict(df_test)
~/Desktop/ml-course/bulldozer-price-prediction-project/env/lib/python3.7/site-packages/sklearn/ensemble/_forest.py in predict(self, X)
764 check_is_fitted(self)
765 # Check data
--> 766 X = self._validate_X_predict(X)
767
768 # Assign chunk of trees to jobs
~/Desktop/ml-course/bulldozer-price-prediction-project/env/lib/python3.7/site-packages/sklearn/ensemble/_forest.py in _validate_X_predict(self, X)
410 check_is_fitted(self)
411
--> 412 return self.estimators_[0]._validate_X_predict(X, check_input=True)
413
414 @property
~/Desktop/ml-course/bulldozer-price-prediction-project/env/lib/python3.7/site-packages/sklearn/tree/_classes.py in _validate_X_predict(self, X, check_input)
389 "match the input. Model n_features is %s and "
390 "input n_features is %s "
--> 391 % (self.n_features_, n_features))
392
393 return X
ValueError: Number of features of the model must match the input. Model n_features is 102 and input n_features is 101
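An error like this means the training data has a column the test data lacks. A sketch of how to find and fix the gap is below. The column names are hypothetical, and filling with False is only appropriate when the missing column is a "was this value missing?" indicator.

```python
import pandas as pd

# Hypothetical training features, including a missing-value indicator
X_train = pd.DataFrame({
    "YearMade": [2001, 2005],
    "auctioneerID": [3.0, 1.0],
    "auctioneerID_is_missing": [False, True],
})
# Hypothetical test features: no indicator column, different column order
df_test = pd.DataFrame({
    "auctioneerID": [2.0, 1.0],
    "YearMade": [1999, 2007],
})

# Columns the model was trained on but the test set lacks
missing_cols = set(X_train.columns) - set(df_test.columns)
print(missing_cols)

# Add each missing indicator column (False = no value was missing),
# then reorder the test columns to match the training columns
for col in missing_cols:
    df_test[col] = False
df_test = df_test[X_train.columns]
```

The reordering step matters too: the model expects the features in the same order it saw them during training.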
Finally, now that our test dataframe has the same features as our training dataframe, we can make predictions!
We've made some predictions but they're not in the same format Kaggle is asking for: https://www.kaggle.com/c/bluebook-for-bulldozers/overview/evaluation
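A sketch of reshaping the predictions is below. It assumes the required format is one row per sale with an ID column and a predicted price column. The SalesID and SalePrice header names, the IDs and the prices here are placeholders, so check the evaluation page linked above for the exact headers.

```python
import numpy as np
import pandas as pd

# Placeholder IDs and predictions, for illustration only
df_test = pd.DataFrame({"SalesID": [101, 102, 103]})
test_preds = np.array([20500.0, 18200.0, 49800.0])

# Pair each prediction with the ID of the sale it belongs to
df_preds = pd.DataFrame({
    "SalesID": df_test["SalesID"],
    "SalePrice": test_preds,
})

# With no path given, to_csv returns the CSV text instead of writing a file
csv_text = df_preds.to_csv(index=False)
print(csv_text)
```

Passing a file path as the first argument to to_csv writes the same content to disk, ready to upload.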
Feature Importance
Feature importance seeks to figure out which attributes of the data were most important when it comes to predicting the target variable (SalePrice).
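For a fitted random forest, scikit-learn exposes these scores through the feature_importances_ attribute. The sketch below uses synthetic data with generic feature names, built so the target depends mostly on one column.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features with generic names
rng = np.random.default_rng(42)
X_train = pd.DataFrame(
    rng.random((300, 3)),
    columns=["feature_a", "feature_b", "feature_c"],
)
# Synthetic target that depends mostly on feature_a and not at all on feature_c
y_train = 10 * X_train["feature_a"] + X_train["feature_b"]

model = RandomForestRegressor(n_estimators=30, random_state=42)
model.fit(X_train, y_train)

# Pair each feature name with its importance and sort, highest first
importances = (
    pd.DataFrame({
        "feature": X_train.columns,
        "importance": model.feature_importances_,
    })
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)
print(importances)
```

The importances sum to 1, so each value can be read as that feature's share of the model's total split quality.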
Question to finish: Why might knowing the feature importances of a trained machine learning model be helpful?
Final challenge/extension: What other machine learning models could you try on our dataset? Hint: check out the regression section of the map at https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html, or try to look at something like CatBoost.ai or XGBoost.ai.