GitHub Repository: mrdbourke/zero-to-mastery-ml
Path: blob/master/section-3-structured-data-projects/end-to-end-bluebook-bulldozer-price-regression-v2.ipynb
⁸⁷⁴ views

Kernel: Python 3 (ipykernel)

View source code | Read notebook in online book format

Predicting the Sale Price of Bulldozers using Machine Learning 🚜

In this notebook, we're going to go through an example machine learning project to use the characteristics of bulldozers and their past sales prices to predict the sale price of future bulldozers based on their characteristics.

Inputs: Bulldozer characteristics such as make year, base model, model series, state of sale (e.g. which US state was it sold in), drive system and more.
Outputs: Bulldozer sale price (in USD).

Since we're trying to predict a number, this kind of problem is known as a regression problem.

And since we're going to predicting results with a time component (predicting future sales based on past sales), this is also known as a time series or forecasting problem.

The data and evaluation metric we'll be using (root mean square log error or RMSLE) is from the Kaggle Bluebook for Bulldozers competition.

The techniques used in here have been inspired and adapted from the fast.ai machine learning course.

Overview

Since we already have a dataset, we'll approach the problem with the following machine learning modelling framework.

| | |:--😐 | 6 Step Machine Learning Modelling Framework (read more) |

To work through these topics, we'll use pandas, Matplotlib and NumPy for data analysis, as well as, Scikit-Learn for machine learning and modelling tasks.

| | |:--😐 | Tools that can be used for each step of the machine learning modelling process. |

We'll work through each step and by the end of the notebook, we'll have a trained machine learning model which predicts the sale price of a bulldozer given different characteristics about it.

6 Step Machine Learning Framework

1. Problem Definition

For this dataset, the problem we're trying to solve, or better, the question we're trying to answer is,

How well can we predict the future sale price of a bulldozer, given its characteristics previous examples of how much similar bulldozers have been sold for?

2. Data

Looking at the dataset from Kaggle we see that it contains historical sales data of bulldozers. Including things like, model type, size, sale date and more.

There are 3 datasets:

Train.csv - Historical bulldozer sales examples up to 2011 (close to 400,000 examples with 50+ different attributes, including SalePrice which is the target variable).
Valid.csv - Historical bulldozer sales examples from January 1 2012 to April 30 2012 (close to 12,000 examples with the same attributes as Train.csv).
Test.csv - Historical bulldozer sales examples from May 1 2012 to November 2012 (close to 12,000 examples but missing the SalePrice attribute, as this is what we'll be trying to predict).

Note: You can download the dataset bluebook-for-bulldozers dataset directly from Kaggle. Alternatively, you can also download it directly from the course GitHub.

3. Evaluation

For this problem, Kaggle has set the evaluation metric to being root mean squared log error (RMSLE). As with many regression evaluations, the goal will be to get this value as low as possible (a low error value means our model's predictions are close to what the real values are).

To see how well our model is doing, we'll calculate the RMSLE and then compare our results to others on the Kaggle leaderboard.

4. Features

Features are different parts and attributes of the data.

During this step, you'll want to start finding out what you can about the data.

One of the most common ways to do this is to create a data dictionary.

For this dataset, Kaggle provides a data dictionary which contains information about what each attribute of the dataset means.

For example:

Variable Name	Description	Variable Type
SalesID	unique identifier of a particular sale of a machine at auction	Independent variable
MachineID	identifier for a particular machine; machines may have multiple sales	Independent variable
ModelID	identifier for a unique machine model (i.e. fiModelDesc)	Independent variable
datasource	source of the sale record; some sources are more diligent about reporting attributes of the machine than others. Note that a particular datasource may report on multiple auctioneerIDs.	Independent variable
auctioneerID	identifier of a particular auctioneer, i.e. company that sold the machine at auction. Not the same as datasource.	Independent variable
YearMade	year of manufacturer of the Machine	Independent variable
MachineHoursCurrentMeter	current usage of the machine in hours at time of sale (saledate); null or 0 means no hours have been reported for that sale	Independent variable
UsageBand	value (low, medium, high) calculated comparing this particular Machine-Sale hours to average usage for the fiBaseModel; e.g. 'Low' means this machine has fewer hours given its lifespan relative to the average of fiBaseModel.	Independent variable
Saledate	time of sale	Independent variable
fiModelDesc	Description of a unique machine model (see ModelID); concatenation of fiBaseModel & fiSecondaryDesc & fiModelSeries & fiModelDescriptor	Independent variable
State	US State in which sale occurred	Independent variable
Drive_System	machine configuration; typically describes whether 2 or 4 wheel drive	Independent variable
Enclosure	machine configuration - does the machine have an enclosed cab or not	Independent variable
Forks	machine configuration - attachment used for lifting	Independent variable
Pad_Type	machine configuration - type of treads a crawler machine uses	Independent variable
Ride_Control	machine configuration - optional feature on loaders to make the ride smoother	Independent variable
Transmission	machine configuration - describes type of transmission; typically automatic or manual	Independent variable
...	...	...
SalePrice	cost of sale in USD	Target/dependent variable

You can download the full version of this file directly from the Kaggle competition page (Kaggle account required) or view it on Google Sheets.

With all of this being known, let's get started!

First, we'll import the dataset and start exploring.

In [8]:

# Timestamp
import datetime

import datetime
print(f"Notebook last run (end-to-end): {datetime.datetime.now()}")

Out[8]:

Notebook last run (end-to-end): 2024-10-30 11:54:38.504966

1. Importing the data and preparing it for modelling

First thing is first, let's get the libraries we need imported and the data we'll need for the project.

We'll start by importing pandas, NumPy and matplotlib.

In [9]:

# Import data analysis tools 
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

# Print the versions we're using (as long as your versions are equal or higher than these, the code should work)
print(f"pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"matplotlib version: {matplotlib.__version__}")

Out[9]:

pandas version: 2.2.2
NumPy version: 2.1.1
matplotlib version: 3.9.2

Now we've got our tools for data analysis ready, we can import the data and start to explore it.

For this project, I've downloaded the data from Kaggle and stored it on the course GitHub under the file path ../data/bluebook-for-bulldozers.

We can write some code to check if the files are available locally (on our computer) and if not, we can download them.

Note: If you're running this notebook on Google Colab, the code below will enable you to download the dataset programmatically. Just beware that each time Google Colab shuts down, the data will have to be redownloaded. There's also an example Google Colab notebook showing how to download the data programmatically.

In [10]:

from pathlib import Path

# Check if 'bluebook-for-bulldozers' exists in the current or parent directory
# Link to data (see the file "bluebook-for-bulldozers"): https://github.com/mrdbourke/zero-to-mastery-ml/tree/master/data
dataset_dir = Path("../data/bluebook-for-bulldozers")
if not (dataset_dir.is_dir()):
    print(f"[INFO] Can't find existing 'bluebook-for-bulldozers' dataset in current directory or parent directory, downloading...")

    # Download and unzip the bluebook for bulldozers dataset
    !wget https://github.com/mrdbourke/zero-to-mastery-ml/raw/refs/heads/master/data/bluebook-for-bulldozers.zip
    !unzip bluebook-for-bulldozers.zip

    # Ensure a data directory exists and move the downloaded dataset there
    !mkdir ../data/
    !mv bluebook-for-bulldozers ../data/
    print(f"[INFO] Current dataset dir: {dataset_dir}")

    # Remove .zip file from notebook directory
    !rm -rf bluebook-for-bulldozers.zip
else:
    # If the target dataset directory exists, we don't need to download it
    print(f"[INFO] 'bluebook-for-bulldozers' dataset exists, feel free to proceed!")
    print(f"[INFO] Current dataset dir: {dataset_dir}")

Out[10]:

[INFO] 'bluebook-for-bulldozers' dataset exists, feel free to proceed!
[INFO] Current dataset dir: ../data/bluebook-for-bulldozers

Dataset downloaded!

Let's check what files are available.

In [11]:

import os

print(f"[INFO] Files/folders available in {dataset_dir}:")
os.listdir(dataset_dir)

Out[11]:

[INFO] Files/folders available in ../data/bluebook-for-bulldozers:

['random_forest_benchmark_test.csv',
 'Valid.csv',
 'median_benchmark.csv',
 'Valid.zip',
 'TrainAndValid.7z',
 'Test.csv',
 'predictions.csv',
 'Train.7z',
 'TrainAndValid_object_values_as_categories.parquet',
 'test_predictions.csv',
 'ValidSolution.csv',
 'train_tmp.csv',
 'Machine_Appendix.csv',
 'Train.csv',
 'Valid.7z',
 'TrainAndValid_object_values_as_categories.csv',
 'TrainAndValid_object_values_as_categories_and_missing_values_filled.parquet',
 'Data Dictionary.xlsx',
 'TrainAndValid.csv',
 'Train.zip',
 'TrainAndValid.zip']

You can explore each of these files individually or read about them on the Kaggle Competition page.

For now, the main file we're interested in is TrainAndValid.csv (this is also a combination of Train.csv and Valid.csv), this is a combination of the training and validation datasets.

The training data (Train.csv) contains sale data from 1989 up to the end of 2011.
The validation data (Valid.csv) contains sale data from January 1, 2012 - April 30, 2012.
The test data (Test.csv) contains sale data from May 1, 2012 - November 2012.

We'll use the training data to train our model to predict the sale price of bulldozers, we'll then validate its performance on the validation data to see if our model can be improved in any way. Finally, we'll evaluate our best model on the test dataset.

But more on this later on.

Let's import the TrainAndValid.csv file and turn it into a pandas DataFrame.

In [12]:

# Import the training and validation set
df = pd.read_csv(filepath_or_buffer="../data/bluebook-for-bulldozers/TrainAndValid.csv")

Out[12]:

/var/folders/c4/qj4gdk190td18bqvjjh0p3p00000gn/T/ipykernel_20423/1127193594.py:2: DtypeWarning: Columns (13,39,40,41) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv(filepath_or_buffer="../data/bluebook-for-bulldozers/TrainAndValid.csv")

Wonderful! We've got our DataFrame ready to explore.

You might see a warning appear in the form:

DtypeWarning: Columns (13,39,40,41) have mixed types. Specify dtype option on import or set low_memory=False. df = pd.read_csv("../data/bluebook-for-bulldozers/TrainAndValid.csv")

This is just saying that some of our columns have multiple/mixed data types. For example, a column may contain strings but also contain integers. This is okay for now and can be addressed later on if necessary.

How about we get some information about our DataFrame?

In [13]:

# Get info about DataFrame
df.info()

Out[13]:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 412698 entries, 0 to 412697
Data columns (total 53 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 SalesID                   412698 non-null  int64  
 SalePrice                 412698 non-null  float64
 MachineID                 412698 non-null  int64  
 ModelID                   412698 non-null  int64  
 datasource                412698 non-null  int64  
 auctioneerID              392562 non-null  float64
 YearMade                  412698 non-null  int64  
 MachineHoursCurrentMeter  147504 non-null  float64
 UsageBand                 73670 non-null   object 
 saledate                  412698 non-null  object 
fiModelDesc               412698 non-null  object 
fiBaseModel               412698 non-null  object 
fiSecondaryDesc           271971 non-null  object 
fiModelSeries             58667 non-null   object 
fiModelDescriptor         74816 non-null   object 
ProductSize               196093 non-null  object 
fiProductClassDesc        412698 non-null  object 
state                     412698 non-null  object 
ProductGroup              412698 non-null  object 
ProductGroupDesc          412698 non-null  object 
Drive_System              107087 non-null  object 
Enclosure                 412364 non-null  object 
Forks                     197715 non-null  object 
Pad_Type                  81096 non-null   object 
Ride_Control              152728 non-null  object 
Stick                     81096 non-null   object 
Transmission              188007 non-null  object 
Turbocharged              81096 non-null   object 
Blade_Extension           25983 non-null   object 
Blade_Width               25983 non-null   object 
Enclosure_Type            25983 non-null   object 
Engine_Horsepower         25983 non-null   object 
Hydraulics                330133 non-null  object 
Pushblock                 25983 non-null   object 
Ripper                    106945 non-null  object 
Scarifier                 25994 non-null   object 
Tip_Control               25983 non-null   object 
Tire_Size                 97638 non-null   object 
Coupler                   220679 non-null  object 
Coupler_System            44974 non-null   object 
Grouser_Tracks            44875 non-null   object 
Hydraulics_Flow           44875 non-null   object 
Track_Type                102193 non-null  object 
Undercarriage_Pad_Width   102916 non-null  object 
Stick_Length              102261 non-null  object 
Thumb                     102332 non-null  object 
Pattern_Changer           102261 non-null  object 
Grouser_Type              102193 non-null  object 
Backhoe_Mounting          80712 non-null   object 
Blade_Type                81875 non-null   object 
Travel_Controls           81877 non-null   object 
Differential_Type         71564 non-null   object 
Steering_Controls         71522 non-null   object 
dtypes: float64(3), int64(5), object(45)
memory usage: 166.9+ MB

Woah! Over 400,000 entries!

That's a much larger dataset than what we've worked with before.

One thing you might have noticed is that the saledate column value is being treated as a Python object (it's okay if you didn't notice, these things take practice).

When the Dtype is object, it's saying that it's a string.

However, when we look at it...

In [15]:

df["saledate"][:10]

Out[15]:

  11/16/2006 0:00
   3/26/2004 0:00
   2/26/2004 0:00
   5/19/2011 0:00
   7/23/2009 0:00
  12/18/2008 0:00
   8/26/2004 0:00
  11/17/2005 0:00
   8/27/2009 0:00
    8/9/2007 0:00
Name: saledate, dtype: object

We can see that these object's are in the form of dates.

Since we're working on a time series problem (a machine learning problem with a time component), it's probably worth it to turn these strings into Python datetime objects.

Before we do, let's try visualize our saledate column against our SalePrice column.

To do so, we can create a scatter plot.

And to prevent our plot from being too big, how about we visualize the first 1000 values?

In [16]:

fig, ax = plt.subplots()
ax.scatter(x=df["saledate"][:1000], # visualize the first 1000 values
           y=df["SalePrice"][:1000])
ax.set_xlabel("Sale Date")
ax.set_ylabel("Sale Price ($)");

Out[16]:

Hmm... looks like the x-axis is quite crowded.

Maybe we can fix this by turning the saledate column into datetime format.

Good news is that is looks like our SalePrice column is already in float64 format so we can view its distribution directly from the DataFrame using a histogram plot.

In [17]:

# View SalePrice distribution 
df.SalePrice.plot.hist(xlabel="Sale Price ($)");

Out[17]:

1.1 Parsing dates

When working with time series data, it's a good idea to make sure any date data is the format of a datetime object (a Python data type which encodes specific information about dates).

We can tell pandas which columns to read in as dates by setting the parse_dates parameter in pd.read_csv.

Once we've imported our CSV with the saledate column parsed, we can view information about our DataFrame again with df.info().

In [18]:

df = pd.read_csv(filepath_or_buffer="../data/bluebook-for-bulldozers/TrainAndValid.csv",
                 low_memory=False, # set low_memory=False to prevent mixed data types warning 
                 parse_dates=["saledate"]) # can use the parse_dates parameter and specify which column to treat as a date column

# With parse_dates... check dtype of "saledate"
df.info()

Out[18]:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 412698 entries, 0 to 412697
Data columns (total 53 columns):
 #   Column                    Non-Null Count   Dtype         
---  ------                    --------------   -----         
 SalesID                   412698 non-null  int64         
 SalePrice                 412698 non-null  float64       
 MachineID                 412698 non-null  int64         
 ModelID                   412698 non-null  int64         
 datasource                412698 non-null  int64         
 auctioneerID              392562 non-null  float64       
 YearMade                  412698 non-null  int64         
 MachineHoursCurrentMeter  147504 non-null  float64       
 UsageBand                 73670 non-null   object        
 saledate                  412698 non-null  datetime64[ns]
fiModelDesc               412698 non-null  object        
fiBaseModel               412698 non-null  object        
fiSecondaryDesc           271971 non-null  object        
fiModelSeries             58667 non-null   object        
fiModelDescriptor         74816 non-null   object        
ProductSize               196093 non-null  object        
fiProductClassDesc        412698 non-null  object        
state                     412698 non-null  object        
ProductGroup              412698 non-null  object        
ProductGroupDesc          412698 non-null  object        
Drive_System              107087 non-null  object        
Enclosure                 412364 non-null  object        
Forks                     197715 non-null  object        
Pad_Type                  81096 non-null   object        
Ride_Control              152728 non-null  object        
Stick                     81096 non-null   object        
Transmission              188007 non-null  object        
Turbocharged              81096 non-null   object        
Blade_Extension           25983 non-null   object        
Blade_Width               25983 non-null   object        
Enclosure_Type            25983 non-null   object        
Engine_Horsepower         25983 non-null   object        
Hydraulics                330133 non-null  object        
Pushblock                 25983 non-null   object        
Ripper                    106945 non-null  object        
Scarifier                 25994 non-null   object        
Tip_Control               25983 non-null   object        
Tire_Size                 97638 non-null   object        
Coupler                   220679 non-null  object        
Coupler_System            44974 non-null   object        
Grouser_Tracks            44875 non-null   object        
Hydraulics_Flow           44875 non-null   object        
Track_Type                102193 non-null  object        
Undercarriage_Pad_Width   102916 non-null  object        
Stick_Length              102261 non-null  object        
Thumb                     102332 non-null  object        
Pattern_Changer           102261 non-null  object        
Grouser_Type              102193 non-null  object        
Backhoe_Mounting          80712 non-null   object        
Blade_Type                81875 non-null   object        
Travel_Controls           81877 non-null   object        
Differential_Type         71564 non-null   object        
Steering_Controls         71522 non-null   object        
dtypes: datetime64[ns](1), float64(3), int64(5), object(44)
memory usage: 166.9+ MB

Nice!

Looks like our saledate column is now of type datetime64[ns], a NumPy-specific datetime format with high precision.

Since pandas works well with NumPy, we can keep it in this format.

How about we view a few samples from our SaleDate column again?

In [19]:

df["saledate"][:10]

Out[19]:

 2006-11-16
 2004-03-26
 2004-02-26
 2011-05-19
 2009-07-23
 2008-12-18
 2004-08-26
 2005-11-17
 2009-08-27
 2007-08-09
Name: saledate, dtype: datetime64[ns]

Beautiful! That's looking much better already.

We'll see how having our dates in this format is really helpful later on.

For now, how about we visualize our saledate column against our SalePrice column again?

In [20]:

fig, ax = plt.subplots()
ax.scatter(x=df["saledate"][:1000], # visualize the first 1000 values
           y=df["SalePrice"][:1000])
ax.set_xlabel("Sale Date")
ax.set_ylabel("Sale Price ($)");

Out[20]:

1.2 Sorting our DataFrame by saledate

Now we've formatted our saledate column to be NumPy datetime64[ns] objects, we can use built-in pandas methods such as sort_values to sort our DataFrame by date.

And considering this is a time series problem, sorting our DataFrame by date has the added benefit of making sure our data is sequential.

In other words, we want to use examples from the past (example sale prices from previous dates) to try and predict future bulldozer sale prices.

Let's use the pandas.DataFrame.sort_values method to sort our DataFrame by saledate in ascending order.

In [21]:

# Sort DataFrame in date order
df.sort_values(by=["saledate"], inplace=True, ascending=True)
df.saledate.head(10), df.saledate.tail(10)

Out[21]:

(205615   1989-01-17
 274835   1989-01-31
 141296   1989-01-31
 212552   1989-01-31
 62755    1989-01-31
 54653    1989-01-31
 81383    1989-01-31
 204924   1989-01-31
 135376   1989-01-31
 113390   1989-01-31
 Name: saledate, dtype: datetime64[ns],
 409202   2012-04-28
 408976   2012-04-28
 411695   2012-04-28
 411319   2012-04-28
 408889   2012-04-28
 410879   2012-04-28
 412476   2012-04-28
 411927   2012-04-28
 407124   2012-04-28
 409203   2012-04-28
 Name: saledate, dtype: datetime64[ns])

Nice!

Looks like our older samples are now coming first and the newer samples are towards the end of the DataFrame.

1.3 Adding extra features to our DataFrame

One way to potentially increase the predictive power of our data is to enhance it with more features.

This practice is known as feature engineering, taking existing features and using them to create more/different features.

There is no set in stone way to do feature engineering and often it takes quite a bit of practice/exploration/experimentation to figure out what might work and what won't.

For now, we'll use our saledate column to add extra features such as:

Year of sale
Month of sale
Day of sale
Day of week sale (e.g. Monday = 1, Tuesday = 2)
Day of year sale (e.g. January 1st = 1, January 2nd = 2)

Since we're going to be manipulating the data, we'll make a copy of the original DataFrame and perform our changes there.

This will keep the original DataFrame in tact if we need it again.

In [22]:

# Make a copy of the original DataFrame to perform edits on
df_tmp = df.copy()

Because we imported the data using read_csv() and we asked pandas to parse the dates using parase_dates=["saledate"], we can now access the different datetime attributes of the saledate column.

Let's use these attributes to add a series of different feature columns to our dataset.

After we've added these extra columns, we can remove the original saledate column as its information will be dispersed across these new columns.

In [23]:

# Add datetime parameters for saledate
df_tmp["saleYear"] = df_tmp.saledate.dt.year
df_tmp["saleMonth"] = df_tmp.saledate.dt.month
df_tmp["saleDay"] = df_tmp.saledate.dt.day
df_tmp["saleDayofweek"] = df_tmp.saledate.dt.dayofweek
df_tmp["saleDayofyear"] = df_tmp.saledate.dt.dayofyear

# Drop original saledate column
df_tmp.drop("saledate", axis=1, inplace=True)

We could add more of these style of columns, such as, whether it was the start or end of a quarter (the sale being at the end of a quarter may bye influenced by things such as quarterly budgets) but these will do for now.

Challenge: See what other datetime attributes you can add to df_tmp using a similar technique to what we've used above. Hint: check the bottom of the pandas.DatetimeIndex docs.

How about we view some of our newly created columns?

In [24]:

# View newly created columns
df_tmp[["SalePrice", "saleYear", "saleMonth", "saleDay", "saleDayofweek", "saleDayofyear"]].head()

Out[24]:

Cool!

Now we've broken our saledate column into columns/features, we can perform further exploratory analysis such as visualizing the SalePrice against the saleMonth.

How about we view the first 10,000 samples (we could also randomly select 10,000 samples too) to see if reveals anything about which month has the highest sales?

In [25]:

# View 10,000 samples SalePrice against saleMonth
fig, ax = plt.subplots()
ax.scatter(x=df_tmp["saleMonth"][:10000], # visualize the first 10000 values
           y=df_tmp["SalePrice"][:10000])
ax.set_xlabel("Sale Month")
ax.set_ylabel("Sale Price ($)");

Out[25]:

Hmm... doesn't look like there's too much conclusive evidence here about which month has the highest sales value.

How about we plot the median sale price of each month?

We can do so by grouping on the saleMonth column with pandas.DataFrame.groupby and then getting the median of the SalePrice column.

In [26]:

# Group DataFrame by saleMonth and then find the median SalePrice
df_tmp.groupby(["saleMonth"])["SalePrice"].median().plot()
plt.xlabel("Month")
plt.ylabel("Median Sale Price ($)");

Out[26]:

Ohhh it looks like the median sale prices of January and February (months 1 and 2) are quite a bit higher than the other months of the year.

Could this be because of New Year budget spending?

Perhaps... but this would take a bit more investigation.

In the meantime, there are many other values we could look further into.

1.4 Inspect values of other columns

When first exploring a new problem, it's often a good idea to become as familiar with the data as you can.

Of course, with a dataset that has over 400,000 samples, it's unlikely you'll ever get through every sample.

But that's where the power of data analysis and machine learning can help.

We can use pandas to aggregate thousands of samples into smaller more managable pieces.

And as we'll see later on, we can use machine learning models to model the data and then later inspect which features the model thought were most important.

How about we see which states sell the most bulldozers?

In [27]:

# Check the different values of different columns
df_tmp.state.value_counts()[:10]

Out[27]:

state
Florida        67320
Texas          53110
California     29761
Washington     16222
Georgia        14633
Maryland       13322
Mississippi    13240
Ohio           12369
Illinois       11540
Colorado       11529
Name: count, dtype: int64

Woah! Looks like Flordia sells a fair few bulldozers.

How about we go even further and group our samples by state and then find the median SalePrice per state?

We also compare this to the median SalePrice for all samples.

In [28]:

# Group DataFrame by saleMonth and then find the median SalePrice per state as well as across the whole dataset
median_prices_by_state = df_tmp.groupby(["state"])["SalePrice"].median() # this will return a pandas Series rather than a DataFrame
median_sale_price = df_tmp["SalePrice"].median()

# Create a plot comparing median sale price per state to median sale price overall
plt.figure(figsize=(10, 7))
plt.bar(x=median_prices_by_state.index, # Because we're working with a Series, we can use the index (state names) as the x values
        height=median_prices_by_state.values)
plt.xlabel("State")
plt.ylabel("Median Sale Price ($)")
plt.xticks(rotation=90, fontsize=7);
plt.axhline(y=median_sale_price, 
            color="red", 
            linestyle="--", 
            label=f"Median Sale Price: ${median_sale_price:,.0f}")
plt.legend();

Out[28]:

Now that's a nice looking figure!

Interestingly Florida has the most sales and the median sale price is above the overall median of all other states.

And if you had a bulldozer and were chasing the highest sale price, the data would reveal that perhaps selling in South Dakota would be your best bet.

Perhaps bulldozers are in higher demand in South Dakota because of a building or mining boom?

Answering this would require a bit more research.

But what we're doing here is slowly building up a mental model of our data.

So that if we saw an example in the future, we could compare its values to the ones we've already seen.

2. Model driven data exploration

We've performed a small Exploratory Data Analysis (EDA) as well as enriched it with some datetime attributes, now let's try to model it.

Why model so early?

Well, we know the evaluation metric (root mean squared log error or RMSLE) we're heading towards.

We could spend more time doing EDA, finding more out about the data ourselves but what we'll do instead is use a machine learning model to help us do EDA whilst simultaneously working towards the best evaluation metric we can get.

Remember, one of the biggest goals of starting any new machine learning project is reducing the time between experiments.

Following the Scikit-Learn machine learning map and taking into account the fact we've got over 100,000 examples, we find a sklearn.linear_model.SGDRegressor or a sklearn.ensemble.RandomForestRegressor model might be a good candidate.

Since we're worked with the Random Forest algorithm before (on the heart disease classification problem), let's try it out on our regression problem.

Note: We're trying just one model here for now. But you can try many other kinds of models from the Scikit-Learn library, they mostly work with a similar API. There are even libraries such as LazyPredict which will try many models simultaneously and return a table with the results.

In [29]:

# This won't work since we've got missing numbers and categories
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_jobs=-1)
model.fit(X=df_tmp.drop("SalePrice", axis=1), # use all columns except SalePrice as X input
          y=df_tmp.SalePrice) # use SalePrice column as y input

Out[29]:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/c4/qj4gdk190td18bqvjjh0p3p00000gn/T/ipykernel_20423/2824176890.py in ?()
# This won't work since we've got missing numbers and categories
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_jobs=-1)
----> 5 model.fit(X=df_tmp.drop("SalePrice", axis=1), # use all columns except SalePrice as X input
         y=df_tmp.SalePrice) # use SalePrice column as y input
~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/base.py in ?(estimator, *args, **kwargs)
               skip_parameter_validation=(
                   prefer_skip_nested_validation or global_skip_validation
               )
           ):
-> 1473                 return fit_method(estimator, *args, **kwargs)

~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/ensemble/_forest.py in ?(self, X, y, sample_weight)
       # Validate or convert input data
       if issparse(y):
           raise ValueError("sparse multilabel-indicator for y is not supported.")

--> 363         X, y = self._validate_data(
           X,
           y,
           multi_output=True,
~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/base.py in ?(self, X, y, reset, validate_separately, cast_to_ndarray, **check_params)
               if "estimator" not in check_y_params:
                   check_y_params = {**default_check_params, **check_y_params}
               y = check_array(y, input_name="y", **check_y_params)
           else:
--> 650                 X, y = check_X_y(X, y, **check_params)
           out = X, y

       if not no_val_X and check_params.get("ensure_2d", True):
~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/utils/validation.py in ?(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_writeable, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
       raise ValueError(
           f"{estimator_name} requires y to be passed, but the target y is None"
       )

-> 1301     X = check_array(
       X,
       accept_sparse=accept_sparse,
       accept_large_sparse=accept_large_sparse,
~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/utils/validation.py in ?(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_writeable, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
                       )
                   array = xp.astype(array, dtype, copy=False)
               else:
                   array = _asarray_with_order(array, order=order, dtype=dtype, xp=xp)
-> 1013             except ComplexWarning as complex_warning:
               raise ValueError(
                   "Complex data not supported\n{}\n".format(array)
               ) from complex_warning
~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/utils/_array_api.py in ?(array, dtype, order, copy, xp, device)
       # Use NumPy API to support order
       if copy is True:
           array = numpy.array(array, order=order, dtype=dtype)
       else:
--> 751             array = numpy.asarray(array, order=order, dtype=dtype)

       # At this point array is a NumPy ndarray. We convert it to an array
       # container that is consistent with the input's namespace.
~/miniforge3/envs/ai/lib/python3.11/site-packages/pandas/core/generic.py in ?(self, dtype, copy)
   def __array__(
       self, dtype: npt.DTypeLike | None = None, copy: bool_t | None = None
   ) -> np.ndarray:
       values = self._values
-> 2153         arr = np.asarray(values, dtype=dtype)
       if (
           astype_is_view(values.dtype, arr.dtype)
           and using_copy_on_write()
ValueError: could not convert string to float: 'Low'

Oh no!

When we try to fit our model to the data, we get a value error similar to:

ValueError: could not convert string to float: 'Low'

The problem here is that some of the features of our data are in string format and machine learning models love numbers.

Not to mention some of our samples have missing values.

And typically, machine learning models require all data to be in numerical format as well as all missing values to be filled.

Let's start to fix this by inspecting the different datatypes in our DataFrame.

We can do so using the pandas.DataFrame.info() method, this will give us the different datatypes as well as how many non-null (a null value is generally a missing value) in our df_tmp DataFrame.

Note: There are some ML models such as sklearn.ensemble.HistGradientBoostingRegressor, CatBoost and XGBoost which can handle missing values, however, I'll leave exploring each of these as extra-curriculum/extensions.

In [30]:

# Check for missing values and different datatypes 
df_tmp.info();

Out[30]:

<class 'pandas.core.frame.DataFrame'>
Index: 412698 entries, 205615 to 409203
Data columns (total 57 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 SalesID                   412698 non-null  int64  
 SalePrice                 412698 non-null  float64
 MachineID                 412698 non-null  int64  
 ModelID                   412698 non-null  int64  
 datasource                412698 non-null  int64  
 auctioneerID              392562 non-null  float64
 YearMade                  412698 non-null  int64  
 MachineHoursCurrentMeter  147504 non-null  float64
 UsageBand                 73670 non-null   object 
 fiModelDesc               412698 non-null  object 
fiBaseModel               412698 non-null  object 
fiSecondaryDesc           271971 non-null  object 
fiModelSeries             58667 non-null   object 
fiModelDescriptor         74816 non-null   object 
ProductSize               196093 non-null  object 
fiProductClassDesc        412698 non-null  object 
state                     412698 non-null  object 
ProductGroup              412698 non-null  object 
ProductGroupDesc          412698 non-null  object 
Drive_System              107087 non-null  object 
Enclosure                 412364 non-null  object 
Forks                     197715 non-null  object 
Pad_Type                  81096 non-null   object 
Ride_Control              152728 non-null  object 
Stick                     81096 non-null   object 
Transmission              188007 non-null  object 
Turbocharged              81096 non-null   object 
Blade_Extension           25983 non-null   object 
Blade_Width               25983 non-null   object 
Enclosure_Type            25983 non-null   object 
Engine_Horsepower         25983 non-null   object 
Hydraulics                330133 non-null  object 
Pushblock                 25983 non-null   object 
Ripper                    106945 non-null  object 
Scarifier                 25994 non-null   object 
Tip_Control               25983 non-null   object 
Tire_Size                 97638 non-null   object 
Coupler                   220679 non-null  object 
Coupler_System            44974 non-null   object 
Grouser_Tracks            44875 non-null   object 
Hydraulics_Flow           44875 non-null   object 
Track_Type                102193 non-null  object 
Undercarriage_Pad_Width   102916 non-null  object 
Stick_Length              102261 non-null  object 
Thumb                     102332 non-null  object 
Pattern_Changer           102261 non-null  object 
Grouser_Type              102193 non-null  object 
Backhoe_Mounting          80712 non-null   object 
Blade_Type                81875 non-null   object 
Travel_Controls           81877 non-null   object 
Differential_Type         71564 non-null   object 
Steering_Controls         71522 non-null   object 
saleYear                  412698 non-null  int32  
saleMonth                 412698 non-null  int32  
saleDay                   412698 non-null  int32  
saleDayofweek             412698 non-null  int32  
saleDayofyear             412698 non-null  int32  
dtypes: float64(3), int32(5), int64(5), object(44)
memory usage: 174.7+ MB

Ok, it seems as though we've got a fair few different datatypes.

There are int64 types such as MachineID.

There are float64 types such as SalePrice.

And there are object (the object dtype can hold any Python object, including strings) types such as UseageBand.

Resource: You can see a list of all the pandas dtypes in the pandas user guide.

How about we find out how many missing values are in each column?

We can do so using pandas.DataFrame.isna() (isna stands for 'is null or NaN') which will return a boolean True/False if a value is missing (True if missing, False if not).

Let's start by checking the missing values in the head of our DataFrame.

In [31]:

# Find missing values in the head of our DataFrame 
df_tmp.head().isna()

Out[31]:

Alright it seems as though we've got some missing values in the MachineHoursCurrentMeter as well as the UsageBand and a few other columns.

But so far we've only viewed the first few rows.

It'll be very time consuming to go through each row one by one so how about we get the total missing values per column?

We can do so by calling .isna() on the whole DataFrame and then chaining it together with .sum().

Doing so will give us the total True/False values in a given column (when summing, True = 1, False = 0).

In [32]:

# Check for total missing values per column
df_tmp.isna().sum()

Out[32]:

SalesID                          0
SalePrice                        0
MachineID                        0
ModelID                          0
datasource                       0
auctioneerID                 20136
YearMade                         0
MachineHoursCurrentMeter    265194
UsageBand                   339028
fiModelDesc                      0
fiBaseModel                      0
fiSecondaryDesc             140727
fiModelSeries               354031
fiModelDescriptor           337882
ProductSize                 216605
fiProductClassDesc               0
state                            0
ProductGroup                     0
ProductGroupDesc                 0
Drive_System                305611
Enclosure                      334
Forks                       214983
Pad_Type                    331602
Ride_Control                259970
Stick                       331602
Transmission                224691
Turbocharged                331602
Blade_Extension             386715
Blade_Width                 386715
Enclosure_Type              386715
Engine_Horsepower           386715
Hydraulics                   82565
Pushblock                   386715
Ripper                      305753
Scarifier                   386704
Tip_Control                 386715
Tire_Size                   315060
Coupler                     192019
Coupler_System              367724
Grouser_Tracks              367823
Hydraulics_Flow             367823
Track_Type                  310505
Undercarriage_Pad_Width     309782
Stick_Length                310437
Thumb                       310366
Pattern_Changer             310437
Grouser_Type                310505
Backhoe_Mounting            331986
Blade_Type                  330823
Travel_Controls             330821
Differential_Type           341134
Steering_Controls           341176
saleYear                         0
saleMonth                        0
saleDay                          0
saleDayofweek                    0
saleDayofyear                    0
dtype: int64

Woah! It looks like our DataFrame has quite a few missing values.

Not to worry, we can work on fixing this later on.

How about we start by tring to turn all of our data in numbers?

2.1 Inspecting the datatypes in our DataFrame

One way to help turn all of our data into numbers is to convert the columns with the object datatype into a category datatype using pandas.CategoricalDtype.

Note: There are many different ways to convert values into numbers. And often the best way will be specific to the value you're trying to convert. The method we're going to use, converting all objects (that are mostly strings) to categories is one of the faster methods as it makes a quick assumption that each unique value is its own number.

We can check the datatype of an individual column using the .dtype attribute and we can get its full name using .dtype.name.

In [33]:

# Get the dtype of a given column
df_tmp["UsageBand"].dtype, df_tmp["UsageBand"].dtype.name

Out[33]:

(dtype('O'), 'object')

Beautiful!

Now we've got a way to check a column's datatype individually.

There's also another group of methods to check a column's datatype directly.

For example, using pd.api.types.is_object_dtype(arr_or_dtype) we can get a boolean response as to whether the input is an object or not.

Note: There are many more of these checks you can perform for other datatypes such as strings under a similar name space pd.api.types.is_XYZ_dtype. See the pandas documentation for more.

Let's see how it works on our df_tmp["UsageBand"] column.

In [34]:

# Check whether a column is an object
pd.api.types.is_object_dtype(df_tmp["UsageBand"])

Out[34]:

True

We can also check whether a column is a string with pd.api.types.is_string_dtype(arr_or_dtype).

In [35]:

# Check whether a column is a string
pd.api.types.is_string_dtype(df_tmp["state"])

Out[35]:

True

Nice!

We can even loop through the items (columns and their labels) in our DataFrame using pandas.DataFrame.items() (in Python dictionary terms, calling .items() on a DataFrame will treat the column names as the keys and the column values as the values) and print out samples of columns which have the string datatype.

As an extra check, passing the sample to pd.api.types.infer_dtype() will return the datatype of the sample.

This will be a good way to keep exploring our data.

In [36]:

# Quick exampke of calling .items() on a dictionary
random_dict = {"key1": "hello",
               "key2": "world!"}

for key, value in random_dict.items():
    print(f"This is a key: {key}")
    print(f"This is a value: {value}")

Out[36]:

This is a key: key1
This is a value: hello
This is a key: key2
This is a value: world!

In [37]:

# Print column names and example content of columns which contain strings
for label, content in df_tmp.items():
    if pd.api.types.is_string_dtype(content):
        # Check datatype of target column
        column_datatype = df_tmp[label].dtype.name

        # Get random sample from column values
        example_value = content.sample(1).values

        # Infer random sample datatype
        example_value_dtype = pd.api.types.infer_dtype(example_value)
        print(f"Column name: {label} | Column dtype: {column_datatype} | Example value: {example_value} | Example value dtype: {example_value_dtype}")

Out[37]:

Column name: fiModelDesc | Column dtype: object | Example value: ['35ZTS'] | Example value dtype: string
Column name: fiBaseModel | Column dtype: object | Example value: ['PC75'] | Example value dtype: string
Column name: fiProductClassDesc | Column dtype: object | Example value: ['Backhoe Loader - 14.0 to 15.0 Ft Standard Digging Depth'] | Example value dtype: string
Column name: state | Column dtype: object | Example value: ['Florida'] | Example value dtype: string
Column name: ProductGroup | Column dtype: object | Example value: ['TTT'] | Example value dtype: string
Column name: ProductGroupDesc | Column dtype: object | Example value: ['Track Excavators'] | Example value dtype: string

Hmm... it seems that there are many more columns in the df_tmp with the object type that didn't display when checking for the string datatype (we know there are many object datatype columns in our DataFrame from using df_tmp.info()).

How about we try the same as above, except this time instead of pd.api.types.is_string_dtype, we use pd.api.types.is_object_dtype?

Let's try it.

In [38]:

# Start a count of how many object type columns there are
number_of_object_type_columns = 0

for label, content in df_tmp.items():
    # Check to see if column is of object type (this will include the string columns)
    if pd.api.types.is_object_dtype(content): 
        # Check datatype of target column
        column_datatype = df_tmp[label].dtype.name

        # Get random sample from column values
        example_value = content.sample(1).values

        # Infer random sample datatype
        example_value_dtype = pd.api.types.infer_dtype(example_value)
        print(f"Column name: {label} | Column dtype: {column_datatype} | Example value: {example_value} | Example value dtype: {example_value_dtype}")

        number_of_object_type_columns += 1

print(f"\n[INFO] Total number of object type columns: {number_of_object_type_columns}")

Out[38]:

Column name: UsageBand | Column dtype: object | Example value: [nan] | Example value dtype: empty
Column name: fiModelDesc | Column dtype: object | Example value: ['590SUPER MII'] | Example value dtype: string
Column name: fiBaseModel | Column dtype: object | Example value: ['580'] | Example value dtype: string
Column name: fiSecondaryDesc | Column dtype: object | Example value: [nan] | Example value dtype: empty
Column name: fiModelSeries | Column dtype: object | Example value: [nan] | Example value dtype: empty
Column name: fiModelDescriptor | Column dtype: object | Example value: ['H'] | Example value dtype: string
Column name: ProductSize | Column dtype: object | Example value: [nan] | Example value dtype: empty
Column name: fiProductClassDesc | Column dtype: object | Example value: ['Track Type Tractor, Dozer - 75.0 to 85.0 Horsepower'] | Example value dtype: string
Column name: state | Column dtype: object | Example value: ['Florida'] | Example value dtype: string
Column name: ProductGroup | Column dtype: object | Example value: ['TEX'] | Example value dtype: string
Column name: ProductGroupDesc | Column dtype: object | Example value: ['Skid Steer Loaders'] | Example value dtype: string
Column name: Drive_System | Column dtype: object | Example value: [nan] | Example value dtype: empty
Column name: Enclosure | Column dtype: object | Example value: ['EROPS'] | Example value dtype: string
Column name: Forks | Column dtype: object | Example value: [nan] | Example value dtype: empty
Column name: Pad_Type | Column dtype: object | Example value: ['None or Unspecified'] | Example value dtype: string
Column name: Ride_Control | Column dtype: object | Example value: [nan] | Example value dtype: empty
Column name: Stick | Column dtype: object | Example value: ['Extended'] | Example value dtype: string
Column name: Transmission | Column dtype: object | Example value: [nan] | Example value dtype: empty
Column name: Turbocharged | Column dtype: object | Example value: ['None or Unspecified'] | Example value dtype: string
Column name: Blade_Extension | Column dtype: object | Example value: [nan] | Example value dtype: empty
Column name: Blade_Width | Column dtype: object | Example value: [nan] | Example value dtype: empty
Column name: Enclosure_Type | Column dtype: object | Example value: [nan] | Example value dtype: empty
Column name: Engine_Horsepower | Column dtype: object | Example value: [nan] | Example value dtype: empty
Column name: Hydraulics | Column dtype: object | Example value: ['Standard'] | Example value dtype: string
Column name: Pushblock | Column dtype: object | Example value: [nan] | Example value dtype: empty
Column name: Ripper | Column dtype: object | Example value: [nan] | Example value dtype: empty
Column name: Scarifier | Column dtype: object | Example value: [nan] | Example value dtype: empty
Column name: Tip_Control | Column dtype: object | Example value: [nan] | Example value dtype: empty
Column name: Tire_Size | Column dtype: object | Example value: [nan] | Example value dtype: empty
Column name: Coupler | Column dtype: object | Example value: [nan] | Example value dtype: empty
Column name: Coupler_System | Column dtype: object | Example value: [nan] | Example value dtype: empty
Column name: Grouser_Tracks | Column dtype: object | Example value: ['None or Unspecified'] | Example value dtype: string
Column name: Hydraulics_Flow | Column dtype: object | Example value: [nan] | Example value dtype: empty
Column name: Track_Type | Column dtype: object | Example value: [nan] | Example value dtype: empty
Column name: Undercarriage_Pad_Width | Column dtype: object | Example value: ['20 inch'] | Example value dtype: string
Column name: Stick_Length | Column dtype: object | Example value: [nan] | Example value dtype: empty
Column name: Thumb | Column dtype: object | Example value: ['None or Unspecified'] | Example value dtype: string
Column name: Pattern_Changer | Column dtype: object | Example value: ['None or Unspecified'] | Example value dtype: string
Column name: Grouser_Type | Column dtype: object | Example value: [nan] | Example value dtype: empty
Column name: Backhoe_Mounting | Column dtype: object | Example value: [nan] | Example value dtype: empty
Column name: Blade_Type | Column dtype: object | Example value: ['Straight'] | Example value dtype: string
Column name: Travel_Controls | Column dtype: object | Example value: ['None or Unspecified'] | Example value dtype: string
Column name: Differential_Type | Column dtype: object | Example value: [nan] | Example value dtype: empty
Column name: Steering_Controls | Column dtype: object | Example value: [nan] | Example value dtype: empty

[INFO] Total number of object type columns: 44

Wonderful, looks like we've got sample outputs from all of the columns with the object datatype.

It also looks like that many of random samples are missing values.

2.2 Converting strings to categories with pandas

In pandas, one way to convert object/string values to numerical values is to convert them to categories or more specifically, the pd.CategoricalDtype datatype.

This datatype keeps the underlying data the same (e.g. doesn't change the string) but enables easy conversion to a numeric code using .cat.codes.

For example, the column state might have the values 'Alabama', 'Alaska', 'Arizona'... and these could be mapped to numeric values 1, 2, 3... respectively.

To see this in action, let's first convert the object datatype columns to "category" datatype.

We can do so by looping through the .items() of our DataFrame and reassigning each object datatype column using pandas.Series.astype(dtype="category").

In [39]:

# This will turn all of the object columns into category values
for label, content in df_tmp.items(): 
    if pd.api.types.is_object_dtype(content):
        df_tmp[label] = df_tmp[label].astype("category")

Wonderful!

Now let's check if it worked by calling .info() on our DataFrame.

In [40]:

df_tmp.info()

Out[40]:

<class 'pandas.core.frame.DataFrame'>
Index: 412698 entries, 205615 to 409203
Data columns (total 57 columns):
 #   Column                    Non-Null Count   Dtype   
---  ------                    --------------   -----   
 SalesID                   412698 non-null  int64   
 SalePrice                 412698 non-null  float64 
 MachineID                 412698 non-null  int64   
 ModelID                   412698 non-null  int64   
 datasource                412698 non-null  int64   
 auctioneerID              392562 non-null  float64 
 YearMade                  412698 non-null  int64   
 MachineHoursCurrentMeter  147504 non-null  float64 
 UsageBand                 73670 non-null   category
 fiModelDesc               412698 non-null  category
fiBaseModel               412698 non-null  category
fiSecondaryDesc           271971 non-null  category
fiModelSeries             58667 non-null   category
fiModelDescriptor         74816 non-null   category
ProductSize               196093 non-null  category
fiProductClassDesc        412698 non-null  category
state                     412698 non-null  category
ProductGroup              412698 non-null  category
ProductGroupDesc          412698 non-null  category
Drive_System              107087 non-null  category
Enclosure                 412364 non-null  category
Forks                     197715 non-null  category
Pad_Type                  81096 non-null   category
Ride_Control              152728 non-null  category
Stick                     81096 non-null   category
Transmission              188007 non-null  category
Turbocharged              81096 non-null   category
Blade_Extension           25983 non-null   category
Blade_Width               25983 non-null   category
Enclosure_Type            25983 non-null   category
Engine_Horsepower         25983 non-null   category
Hydraulics                330133 non-null  category
Pushblock                 25983 non-null   category
Ripper                    106945 non-null  category
Scarifier                 25994 non-null   category
Tip_Control               25983 non-null   category
Tire_Size                 97638 non-null   category
Coupler                   220679 non-null  category
Coupler_System            44974 non-null   category
Grouser_Tracks            44875 non-null   category
Hydraulics_Flow           44875 non-null   category
Track_Type                102193 non-null  category
Undercarriage_Pad_Width   102916 non-null  category
Stick_Length              102261 non-null  category
Thumb                     102332 non-null  category
Pattern_Changer           102261 non-null  category
Grouser_Type              102193 non-null  category
Backhoe_Mounting          80712 non-null   category
Blade_Type                81875 non-null   category
Travel_Controls           81877 non-null   category
Differential_Type         71564 non-null   category
Steering_Controls         71522 non-null   category
saleYear                  412698 non-null  int32   
saleMonth                 412698 non-null  int32   
saleDay                   412698 non-null  int32   
saleDayofweek             412698 non-null  int32   
saleDayofyear             412698 non-null  int32   
dtypes: category(44), float64(3), int32(5), int64(5)
memory usage: 55.4 MB

It looks like it worked!

All of the object datatype columns now have the category datatype.

We can inspect this on a single column using pandas.Series.dtype.

In [41]:

# Check the datatype of a single column
df_tmp.state.dtype

Out[41]:

CategoricalDtype(categories=['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
                  'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia',
                  'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas',
                  'Kentucky', 'Louisiana', 'Maine', 'Maryland',
                  'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi',
                  'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire',
                  'New Jersey', 'New Mexico', 'New York', 'North Carolina',
                  'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania',
                  'Puerto Rico', 'Rhode Island', 'South Carolina',
                  'South Dakota', 'Tennessee', 'Texas', 'Unspecified', 'Utah',
                  'Vermont', 'Virginia', 'Washington', 'Washington DC',
                  'West Virginia', 'Wisconsin', 'Wyoming'],
, ordered=False, categories_dtype=object)

Excellent, notice how the column is now of type pd.CategoricalDtype.

We can also access these categories using pandas.Series.cat.categories.

In [42]:

# Get the category names of a given column
df_tmp.state.cat.categories

Out[42]:

Index(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado',
       'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho',
       'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana',
       'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',
       'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada',
       'New Hampshire', 'New Jersey', 'New Mexico', 'New York',
       'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon',
       'Pennsylvania', 'Puerto Rico', 'Rhode Island', 'South Carolina',
       'South Dakota', 'Tennessee', 'Texas', 'Unspecified', 'Utah', 'Vermont',
       'Virginia', 'Washington', 'Washington DC', 'West Virginia', 'Wisconsin',
       'Wyoming'],
      dtype='object')

Finally, we can get the category codes (the numeric values representing the category) using pandas.Series.cat.codes.

In [43]:

# Inspect the category codes
df_tmp.state.cat.codes

Out[43]:

205615    43
274835     8
141296     8
212552     8
62755      8
          ..
410879     4
412476     4
411927     4
407124     4
409203     4
Length: 412698, dtype: int8

This gives us a numeric representation of our object/string datatype columns.

In [44]:

# Get example string using category number
target_state_cat_number = 43
target_state_cat_value = df_tmp.state.cat.categories[target_state_cat_number] 
print(f"[INFO] Target state category number {target_state_cat_number} maps to: {target_state_cat_value}")

Out[44]:

[INFO] Target state category number 43 maps to: Texas

Epic!

All of our data is categorical and thus we can now turn the categories into numbers, however it's still missing values, not to worry though, we'll get to these shortly.

2.3 Saving our preprocessed data (part 1)

Before we start doing any further preprocessing steps on our DataFrame, how about we save our current DataFrame to file so we could import it again later if necessary.

Saving and updating your dataset as you go is common practice in machine learning problems. As your problem changes and evolves, the dataset you're working with will likely change too.

Making checkpoints of your dataset is similar to making checkpoints of your code.

In [45]:

# Save preprocessed data to file
df_tmp.to_csv("../data/bluebook-for-bulldozers/TrainAndValid_object_values_as_categories.csv",
              index=False)

Now we've saved our preprocessed data to file, we can re-import it and make sure it's in the same format.

In [46]:

# Import preprocessed data to file
df_tmp = pd.read_csv("../data/bluebook-for-bulldozers/TrainAndValid_object_values_as_categories.csv",
                     low_memory=False)

df_tmp.head()

Out[46]:

Excellent, looking at the tale end (the far right side) our processed DataFrame has the columns we added to it (the extra data features) but it's still missing values.

But if we check df_tmp.info()...

In [47]:

df_tmp.info()

Out[47]:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 412698 entries, 0 to 412697
Data columns (total 57 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 SalesID                   412698 non-null  int64  
 SalePrice                 412698 non-null  float64
 MachineID                 412698 non-null  int64  
 ModelID                   412698 non-null  int64  
 datasource                412698 non-null  int64  
 auctioneerID              392562 non-null  float64
 YearMade                  412698 non-null  int64  
 MachineHoursCurrentMeter  147504 non-null  float64
 UsageBand                 73670 non-null   object 
 fiModelDesc               412698 non-null  object 
fiBaseModel               412698 non-null  object 
fiSecondaryDesc           271971 non-null  object 
fiModelSeries             58667 non-null   object 
fiModelDescriptor         74816 non-null   object 
ProductSize               196093 non-null  object 
fiProductClassDesc        412698 non-null  object 
state                     412698 non-null  object 
ProductGroup              412698 non-null  object 
ProductGroupDesc          412698 non-null  object 
Drive_System              107087 non-null  object 
Enclosure                 412364 non-null  object 
Forks                     197715 non-null  object 
Pad_Type                  81096 non-null   object 
Ride_Control              152728 non-null  object 
Stick                     81096 non-null   object 
Transmission              188007 non-null  object 
Turbocharged              81096 non-null   object 
Blade_Extension           25983 non-null   object 
Blade_Width               25983 non-null   object 
Enclosure_Type            25983 non-null   object 
Engine_Horsepower         25983 non-null   object 
Hydraulics                330133 non-null  object 
Pushblock                 25983 non-null   object 
Ripper                    106945 non-null  object 
Scarifier                 25994 non-null   object 
Tip_Control               25983 non-null   object 
Tire_Size                 97638 non-null   object 
Coupler                   220679 non-null  object 
Coupler_System            44974 non-null   object 
Grouser_Tracks            44875 non-null   object 
Hydraulics_Flow           44875 non-null   object 
Track_Type                102193 non-null  object 
Undercarriage_Pad_Width   102916 non-null  object 
Stick_Length              102261 non-null  object 
Thumb                     102332 non-null  object 
Pattern_Changer           102261 non-null  object 
Grouser_Type              102193 non-null  object 
Backhoe_Mounting          80712 non-null   object 
Blade_Type                81875 non-null   object 
Travel_Controls           81877 non-null   object 
Differential_Type         71564 non-null   object 
Steering_Controls         71522 non-null   object 
saleYear                  412698 non-null  int64  
saleMonth                 412698 non-null  int64  
saleDay                   412698 non-null  int64  
saleDayofweek             412698 non-null  int64  
saleDayofyear             412698 non-null  int64  
dtypes: float64(3), int64(10), object(44)
memory usage: 179.5+ MB

Hmm... what happened here?

Notice that all of the category datatype columns are back to the object datatype.

This is strange since we already converted the object datatype columns to category.

Well then why did they change back?

This happens because of the limitations of the CSV (.csv) file format, it doesn't preserve data types, rather it stores all the values as strings.

So when we read in a CSV, pandas defaults to interpreting strings as object datatypes.

Not to worry though, we can easily convert them to the category datatype as we did before.

Note: If you'd like to retain the datatypes when saving your data, you can use file formats such as parquet (Apache Parquet) and feather. These filetypes have several advantages over CSV in terms of processing speeds and storage size. However, data stored in these formats is not human-readable so you won't be able to open the files and inspect them without specific tools. For more on different file formats in pandas, see the IO tools documentation page.

In [48]:

for label, content in df_tmp.items():
    if pd.api.types.is_object_dtype(content):
        # Turn object columns into category datatype
        df_tmp[label] = df_tmp[label].astype("category")

Now if we wanted to preserve the datatypes of our data, we can save to parquet or feather format.

Let's try using parquet format.

To do so, we can use the pandas.DataFrame.to_parquet() method.

Files in the parquet format typically have the file extension of .parquet.

In [49]:

# To save to parquet format requires pyarrow or fastparquet (or both)
# Can install via `pip install pyarrow fastparquet`
df_tmp.to_parquet(path="../data/bluebook-for-bulldozers/TrainAndValid_object_values_as_categories.parquet", 
                  engine="auto") # "auto" will automatically use pyarrow or fastparquet, defaulting to pyarrow first

Wonderful! Now let's try importing our DataFrame from the parquet format and check it using df_tmp.info().

In [50]:

# Read in df_tmp from parquet format
df_tmp = pd.read_parquet(path="../data/bluebook-for-bulldozers/TrainAndValid_object_values_as_categories.parquet",
                         engine="auto")

# Using parquet format, datatypes are preserved
df_tmp.info()

Out[50]:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 412698 entries, 0 to 412697
Data columns (total 57 columns):
 #   Column                    Non-Null Count   Dtype   
---  ------                    --------------   -----   
 SalesID                   412698 non-null  int64   
 SalePrice                 412698 non-null  float64 
 MachineID                 412698 non-null  int64   
 ModelID                   412698 non-null  int64   
 datasource                412698 non-null  int64   
 auctioneerID              392562 non-null  float64 
 YearMade                  412698 non-null  int64   
 MachineHoursCurrentMeter  147504 non-null  float64 
 UsageBand                 73670 non-null   category
 fiModelDesc               412698 non-null  category
fiBaseModel               412698 non-null  category
fiSecondaryDesc           271971 non-null  category
fiModelSeries             58667 non-null   category
fiModelDescriptor         74816 non-null   category
ProductSize               196093 non-null  category
fiProductClassDesc        412698 non-null  category
state                     412698 non-null  category
ProductGroup              412698 non-null  category
ProductGroupDesc          412698 non-null  category
Drive_System              107087 non-null  category
Enclosure                 412364 non-null  category
Forks                     197715 non-null  category
Pad_Type                  81096 non-null   category
Ride_Control              152728 non-null  category
Stick                     81096 non-null   category
Transmission              188007 non-null  category
Turbocharged              81096 non-null   category
Blade_Extension           25983 non-null   category
Blade_Width               25983 non-null   category
Enclosure_Type            25983 non-null   category
Engine_Horsepower         25983 non-null   category
Hydraulics                330133 non-null  category
Pushblock                 25983 non-null   category
Ripper                    106945 non-null  category
Scarifier                 25994 non-null   category
Tip_Control               25983 non-null   category
Tire_Size                 97638 non-null   category
Coupler                   220679 non-null  category
Coupler_System            44974 non-null   category
Grouser_Tracks            44875 non-null   category
Hydraulics_Flow           44875 non-null   category
Track_Type                102193 non-null  category
Undercarriage_Pad_Width   102916 non-null  category
Stick_Length              102261 non-null  category
Thumb                     102332 non-null  category
Pattern_Changer           102261 non-null  category
Grouser_Type              102193 non-null  category
Backhoe_Mounting          80712 non-null   category
Blade_Type                81875 non-null   category
Travel_Controls           81877 non-null   category
Differential_Type         71564 non-null   category
Steering_Controls         71522 non-null   category
saleYear                  412698 non-null  int64   
saleMonth                 412698 non-null  int64   
saleDay                   412698 non-null  int64   
saleDayofweek             412698 non-null  int64   
saleDayofyear             412698 non-null  int64   
dtypes: category(44), float64(3), int64(10)
memory usage: 60.1 MB

Nice! Looks like using the parquet format preserved all of our datatypes.

For more on the parquet and feather formats, be sure to check out the pandas IO (input/output) documentation.

2.4 Finding and filling missing values

Let's remind ourselves of the missing values by getting the top 20 columns with the most missing values.

We do so by summing the results of pandas.DataFrame.isna() and then using sort_values(ascending=False) to showcase the rows with the most missing.

In [51]:

# Check missing values
df_tmp.isna().sum().sort_values(ascending=False)[:20]

Out[51]:

Blade_Width          386715
Engine_Horsepower    386715
Tip_Control          386715
Pushblock            386715
Blade_Extension      386715
Enclosure_Type       386715
Scarifier            386704
Hydraulics_Flow      367823
Grouser_Tracks       367823
Coupler_System       367724
fiModelSeries        354031
Steering_Controls    341176
Differential_Type    341134
UsageBand            339028
fiModelDescriptor    337882
Backhoe_Mounting     331986
Stick                331602
Turbocharged         331602
Pad_Type             331602
Blade_Type           330823
dtype: int64

Ok, it seems like there are a fair few columns with missing values and there are several datatypes across these columns (numerical, categorical).

How about we break the problem down and work on filling each datatype separately?

2.5 Filling missing numerical values

There's no set way to fill missing values in your dataset.

And unless you're filling the missing samples with newly discovered actual data, every way you fill your dataset's missing values will introduce some sort of noise or bias.

We'll start by filling the missing numerical values in ourdataet.

To do this, we'll first find the numeric datatype columns.

We can do by looping through the columns in our DataFrame and calling pd.api.types.is_numeric_dtype(arr_or_dtype) on them.

In [52]:

# Find numeric columns 
for label, content in df_tmp.items():
    if pd.api.types.is_numeric_dtype(content):
        # Check datatype of target column
        column_datatype = df_tmp[label].dtype.name

        # Get random sample from column values
        example_value = content.sample(1).values

        # Infer random sample datatype
        example_value_dtype = pd.api.types.infer_dtype(example_value)
        print(f"Column name: {label} | Column dtype: {column_datatype} | Example value: {example_value} | Example value dtype: {example_value_dtype}")

Out[52]:

Column name: SalesID | Column dtype: int64 | Example value: [1748586] | Example value dtype: integer
Column name: SalePrice | Column dtype: float64 | Example value: [13000.] | Example value dtype: floating
Column name: MachineID | Column dtype: int64 | Example value: [1441940] | Example value dtype: integer
Column name: ModelID | Column dtype: int64 | Example value: [1333] | Example value dtype: integer
Column name: datasource | Column dtype: int64 | Example value: [132] | Example value dtype: integer
Column name: auctioneerID | Column dtype: float64 | Example value: [2.] | Example value dtype: floating
Column name: YearMade | Column dtype: int64 | Example value: [1000] | Example value dtype: integer
Column name: MachineHoursCurrentMeter | Column dtype: float64 | Example value: [nan] | Example value dtype: floating
Column name: saleYear | Column dtype: int64 | Example value: [2010] | Example value dtype: integer
Column name: saleMonth | Column dtype: int64 | Example value: [6] | Example value dtype: integer
Column name: saleDay | Column dtype: int64 | Example value: [16] | Example value dtype: integer
Column name: saleDayofweek | Column dtype: int64 | Example value: [3] | Example value dtype: integer
Column name: saleDayofyear | Column dtype: int64 | Example value: [285] | Example value dtype: integer

Beautiful! Looks like we've got a mixture of int64 and float64 numerical datatypes.

Now how about we find out which numeric columns are missing values?

We can do so by using pandas.isnull(obj).sum() to detect and sum the missing values in a given array-like object (in our case, the data in a target column).

Let's loop through our DataFrame columns, find the numeric datatypes and check if they have any missing values.

In [53]:

# Check for which numeric columns have null values
for label, content in df_tmp.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            print(f"Column name: {label} | Has missing values: {True}")
        else:
            print(f"Column name: {label} | Has missing values: {False}")

Out[53]:

Column name: SalesID | Has missing values: False
Column name: SalePrice | Has missing values: False
Column name: MachineID | Has missing values: False
Column name: ModelID | Has missing values: False
Column name: datasource | Has missing values: False
Column name: auctioneerID | Has missing values: True
Column name: YearMade | Has missing values: False
Column name: MachineHoursCurrentMeter | Has missing values: True
Column name: saleYear | Has missing values: False
Column name: saleMonth | Has missing values: False
Column name: saleDay | Has missing values: False
Column name: saleDayofweek | Has missing values: False
Column name: saleDayofyear | Has missing values: False

Okay, it looks like our auctioneerID and MachineHoursCurrentMeter columns have missing numeric values.

Let's have a look at how we might handle these.

2.6 Discussing possible ways to handle missing values

As previously discussed, there are many ways to fill missing values.

For missing numeric values, some potential options are:

Method	Pros	Cons
Fill with mean of column	- Easy to calculate/implement - Retains overall data distribution	- Averages out variation - Affected by outliers (e.g. if one value is much higher/lower than others)
Fill with median of column	- Easy to calculate/implement - Robust to outliers - Preserves center of data	- Ignores data distribution shape
Fill with mode of column	- Easy to calculate/implement - More useful for categorical-like data	- May not make sense for continuous/numerical data
Fill with 0 (or another constant)	- Simple to implement - Useful in certain contexts like counts	- Introduces bias (e.g. if 0 was a value that meant something) - Skews data (e.g. if many missing values, replacing all with 0 makes it look like that's the most common value)
Forward/Backward fill (use previous/future values to fill future/previous values)	- Maintains temporal continuity (for time series)	- Assumes data is continuous, which may not be valid
Use a calculation from other columns	- Takes existing information and reinterprets it	- Can result in unlikely outputs if calculations are not continuous
Interpolate (e.g. like dragging a cell in Excel/Google Sheets)	- Captures trends - Suitable for ordered data	- Can introduce errors - May assume linearity (data continues in a straight line)
Drop missing values	- Ensures complete data (only use samples with all information) - Useful for small datasets	- Can result in data loss (e.g. if many missing values are scattered across columns, data size can be dramatically reduced) - Reduces dataset size

Which method you choose will be dataset and problem dependant and will likely require several phases of experimentation to see what works and what doesn't.

For now, we'll fill our missing numeric values with the median value of the target column.

We'll also add a binary column (0 or 1) with rows reflecting whether or not a value was missing.

For example, MachineHoursCurrentMeter_is_missing will be a column with rows which have a value of 0 if that row's MachineHoursCurrentMeter column was not missing and 1 if it was.

In [54]:

# Fill missing numeric values with the median of the target column
for label, content in df_tmp.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            
            # Add a binary column which tells if the data was missing our not
            df_tmp[label+"_is_missing"] = pd.isnull(content).astype(int) # this will add a 0 or 1 value to rows with missing values (e.g. 0 = not missing, 1 = missing)

            # Fill missing numeric values with median since it's more robust than the mean
            df_tmp[label] = content.fillna(content.median())

Why add a binary column indicating whether the data was missing or not?

We can easily fill all of the missing numeric values in our dataset with the median.

However, a numeric value may be missing for a reason.

Adding a binary column which indicates whether the value was missing or not helps to retain this information. It also means we can inspect these rows later on.

In [55]:

# Show rows where MachineHoursCurrentMeter_is_missing == 1
df_tmp[df_tmp["MachineHoursCurrentMeter_is_missing"] == 1].sample(5)

Out[55]:

Missing numeric values filled!

How about we check again whether or not the numeric columns have missing values?

In [56]:

# Check for which numeric columns have null values
for label, content in df_tmp.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            print(f"Column name: {label} | Has missing values: {True}")
        else:
            print(f"Column name: {label} | Has missing values: {False}")

Out[56]:

Column name: SalesID | Has missing values: False
Column name: SalePrice | Has missing values: False
Column name: MachineID | Has missing values: False
Column name: ModelID | Has missing values: False
Column name: datasource | Has missing values: False
Column name: auctioneerID | Has missing values: False
Column name: YearMade | Has missing values: False
Column name: MachineHoursCurrentMeter | Has missing values: False
Column name: saleYear | Has missing values: False
Column name: saleMonth | Has missing values: False
Column name: saleDay | Has missing values: False
Column name: saleDayofweek | Has missing values: False
Column name: saleDayofyear | Has missing values: False
Column name: auctioneerID_is_missing | Has missing values: False
Column name: MachineHoursCurrentMeter_is_missing | Has missing values: False

Woohoo! Numeric missing values filled!

And thanks to our binary _is_missing columns, we can even check how many were missing.

In [57]:

# Check to see how many examples in the auctioneerID were missing
df_tmp.auctioneerID_is_missing.value_counts()

Out[57]:

auctioneerID_is_missing
0    392562
1     20136
Name: count, dtype: int64

2.7 Filling missing categorical values with pandas

Now we've filled the numeric values, we'll do the same with the categorical values whilst ensuring that they are all numerical too.

Let's first investigate the columns which aren't numeric (we've already worked with these).

In [58]:

# Check columns which aren't numeric
print(f"[INFO] Columns which are not numeric:")
for label, content in df_tmp.items():
    if not pd.api.types.is_numeric_dtype(content):
        print(f"Column name: {label} | Column dtype: {df_tmp[label].dtype.name}")

Out[58]:

[INFO] Columns which are not numeric:
Column name: UsageBand | Column dtype: category
Column name: fiModelDesc | Column dtype: category
Column name: fiBaseModel | Column dtype: category
Column name: fiSecondaryDesc | Column dtype: category
Column name: fiModelSeries | Column dtype: category
Column name: fiModelDescriptor | Column dtype: category
Column name: ProductSize | Column dtype: category
Column name: fiProductClassDesc | Column dtype: category
Column name: state | Column dtype: category
Column name: ProductGroup | Column dtype: category
Column name: ProductGroupDesc | Column dtype: category
Column name: Drive_System | Column dtype: category
Column name: Enclosure | Column dtype: category
Column name: Forks | Column dtype: category
Column name: Pad_Type | Column dtype: category
Column name: Ride_Control | Column dtype: category
Column name: Stick | Column dtype: category
Column name: Transmission | Column dtype: category
Column name: Turbocharged | Column dtype: category
Column name: Blade_Extension | Column dtype: category
Column name: Blade_Width | Column dtype: category
Column name: Enclosure_Type | Column dtype: category
Column name: Engine_Horsepower | Column dtype: category
Column name: Hydraulics | Column dtype: category
Column name: Pushblock | Column dtype: category
Column name: Ripper | Column dtype: category
Column name: Scarifier | Column dtype: category
Column name: Tip_Control | Column dtype: category
Column name: Tire_Size | Column dtype: category
Column name: Coupler | Column dtype: category
Column name: Coupler_System | Column dtype: category
Column name: Grouser_Tracks | Column dtype: category
Column name: Hydraulics_Flow | Column dtype: category
Column name: Track_Type | Column dtype: category
Column name: Undercarriage_Pad_Width | Column dtype: category
Column name: Stick_Length | Column dtype: category
Column name: Thumb | Column dtype: category
Column name: Pattern_Changer | Column dtype: category
Column name: Grouser_Type | Column dtype: category
Column name: Backhoe_Mounting | Column dtype: category
Column name: Blade_Type | Column dtype: category
Column name: Travel_Controls | Column dtype: category
Column name: Differential_Type | Column dtype: category
Column name: Steering_Controls | Column dtype: category

Okay, we've got plenty of category type columns.

Let's now write some code to fill the missing categorical values as well as ensure they are numerical (non-string).

To do so, we'll:

Create a blank column to category dictionary, we'll use this to store categorical value names (e.g. their string name) as well as their categorical code. We'll end with a dictionary of dictionaries in the form {"column_name": {category_code: "category_value"...}...}.
Loop through the items in the DataFrame.
Check if the column is numeric or not.
Add a binary column in the form ORIGINAL_COLUMN_NAME_is_missing with a 0 or 1 value for if the row had a missing value.
Ensure the column values are in the pd.Categorical datatype and get their category codes with pd.Series.cat.codes (we'll add 1 to these values since pandas defaults to assigning -1 to NaN values, we'll use 0 instead).
Turn the column categories and column category codes from 5 into a dictionary with Python's dict(zip(category_names, category_codes)) and save this to the blank dictionary from 1 with the target column name as key.
Set the target column value to the numerical category values from 5.

Phew!

That's a fair few steps but nothing we can't handle.

Let's do it!

In [59]:

# 1. Create a dictionary to store column to category values (e.g. we turn our category types into numbers but we keep a record so we can go back)
column_to_category_dict = {} 

# 2. Turn categorical variables into numbers
for label, content in df_tmp.items():

    # 3. Check columns which *aren't* numeric
    if not pd.api.types.is_numeric_dtype(content):

        # 4. Add binary column to inidicate whether sample had missing value
        df_tmp[label+"_is_missing"] = pd.isnull(content).astype(int)

        # 5. Ensure content is categorical and get its category codes
        content_categories = pd.Categorical(content)
        content_category_codes = content_categories.codes + 1 # prevents -1 (the default for NaN values) from being used for missing values (we'll treat missing values as 0)

        # 6. Add column key to dictionary with code: category mapping per column
        column_to_category_dict[label] = dict(zip(content_category_codes, content_categories))
        
        # 7. Set the column to the numerical values (the category code value) 
        df_tmp[label] = content_category_codes

Ho ho! No errors!

Let's check out a few random samples of our DataFrame.

In [60]:

df_tmp.sample(5)

Out[60]:

Beautiful! Looks like our data is all in numerical form.

How about we investigate an item from our column_to_category_dict?

This will show the mapping from numerical value to category (most likely a string) value.

In [61]:

# Check the UsageBand (measure of bulldozer usage)
for key, value in sorted(column_to_category_dict["UsageBand"].items()): # note: calling sorted() on dictionary.items() sorts the dictionary by keys 
    print(f"{key} -> {value}")

Out[61]:

-> nan
-> High
-> Low
-> Medium

Note: Categorical values do not necessarily have order. They are strictly a mapping from number to value. In this case, our categorical values are mapped in numerical order. If you feel that the order of a value may influence a model in a negative way (e.g. 1 -> High is lower than 3 -> Medium but should be higher), you may want to look into ordering the values in a particular way or using a different numerical encoding technique such as one-hot encoding.

And we can do the same for the state column values.

In [62]:

# Check the first 10 state column values
for key, value in sorted(column_to_category_dict["state"].items())[:10]:
    print(f"{key} -> {value}")

Out[62]:

-> Alabama
-> Alaska
-> Arizona
-> Arkansas
-> California
-> Colorado
-> Connecticut
-> Delaware
-> Florida
-> Georgia

Beautiful!

How about we check to see all of the missing values have been filled?

In [63]:

# Check total number of missing values
total_missing_values = df_tmp.isna().sum().sum()

if total_missing_values == 0:
    print(f"[INFO] Total missing values: {total_missing_values} - Woohoo! Let's build a model!")
else:
    print(f"[INFO] Uh ohh... total missing values: {total_missing_values} - Perhaps we might have to retrace our steps to fill the values?")

Out[63]:

[INFO] Total missing values: 0 - Woohoo! Let's build a model!

2.8 Saving our preprocessed data (part 2)

One more step before we train new model!

Let's save our work so far so we could re-import our preprocessed dataset if we wanted to.

We'll save it to the parquet format again, this time with a suffix to show we've filled the missing values.

In [64]:

# Save preprocessed data with object values as categories as well as missing values filled
df_tmp.to_parquet(path="../data/bluebook-for-bulldozers/TrainAndValid_object_values_as_categories_and_missing_values_filled.parquet",
                  engine="auto")

And to make sure it worked, we can re-import it.

In [65]:

# Read in preprocessed dataset
df_tmp = pd.read_parquet(path="../data/bluebook-for-bulldozers/TrainAndValid_object_values_as_categories_and_missing_values_filled.parquet",
                         engine="auto")

Does it have any missing values?

In [66]:

# Check total number of missing values
total_missing_values = df_tmp.isna().sum().sum()

if total_missing_values == 0:
    print(f"[INFO] Total missing values: {total_missing_values} - Woohoo! Let's build a model!")
else:
    print(f"[INFO] Uh ohh... total missing values: {total_missing_values} - Perhaps we might have to retrace our steps to fill the values?")

Out[66]:

[INFO] Total missing values: 0 - Woohoo! Let's build a model!

Checkpoint reached!

We've turned all of our data into numbers as well as filled the missing values, time to try fitting a model to it again.

2.9 Fitting a machine learning model to our preprocessed data

Now all of our data is numeric and there are no missing values, we should be able to fit a machine learning model to it!

Let's reinstantiate our trusty sklearn.ensemble.RandomForestRegressor model.

Since our dataset has a substantial amount of rows (~400k+), let's first make sure the model will work on a smaller sample of 1000 or so.

Note: It's common practice on machine learning problems to see if your experiments will work on smaller scale problems (e.g. smaller amounts of data) before scaling them up to the full dataset. This practice enables you to try many different kinds of experiments with faster runtimes. The benefit of this is that you can figure out what doesn't work before spending more time on what does.

Our X values (features) will be every column except the SalePrice column.

And our y values (labels) will be the entirety of the SalePrice column.

We'll time how long our smaller experiment takes using the magic function %%time and placing it at the top of the notebook cell.

Note: You can find out more about the %%time magic command by typing %%time? (note the question mark on the end) in a notebook cell.

In [67]:

%%time

# Sample 1000 samples with random state 42 for reproducibility
df_tmp_sample_1k = df_tmp.sample(n=1000, random_state=42)

# Instantiate a model
model = RandomForestRegressor(n_jobs=-1) # use -1 to utilise all available processors

# Create features and labels
X_sample_1k = df_tmp_sample_1k.drop("SalePrice", axis=1) # use all columns except SalePrice as X values
y_sample_1k = df_tmp_sample_1k["SalePrice"] # use SalePrice as y values (target variable)

# Fit the model to the sample data
model.fit(X=X_sample_1k, 
          y=y_sample_1k)

Out[67]:

CPU times: user 1.06 s, sys: 2.37 s, total: 3.43 s
Wall time: 976 ms

Woah! It looks like things worked!

And quite quick too (since we're only using a relatively small number of rows).

How about we score our model?

We can do so using the built-in method score().

By default, sklearn.ensemble.RandomForestRegressor uses coefficient of determination ( $R^2$ or R-squared) as the evaluation metric (higher is better, with a score of 1.0 being perfect).

In [68]:

# Evaluate the model
model_sample_1k_score = model.score(X=X_sample_1k,
                                    y=y_sample_1k)

print(f"[INFO] Model score on {len(df_tmp_sample_1k)} samples: {model_sample_1k_score}")

Out[68]:

[INFO] Model score on 1000 samples: 0.9563062437082765

Wow, it looks like our model got a pretty good score on only 1000 samples (the best possible score it could achieve would've been 1.0).

How about we try our model on the whole dataset?

In [69]:

%%time

# Instantiate model
model = RandomForestRegressor(n_jobs=-1) # note: this could take quite a while depending on your machine (it took ~1.5 minutes on my MacBook Pro M1 Pro with 10 cores)

# Create features and labels with entire dataset
X_all = df_tmp.drop("SalePrice", axis=1)
y_all = df_tmp["SalePrice"]

# Fit the model
model.fit(X=X_all, 
          y=y_all)

Out[69]:

CPU times: user 10min 21s, sys: 8min 31s, total: 18min 53s
Wall time: 3min 24s

Ok, that took a little bit longer than fitting on 1000 samples (but that's too be expected, as many more calculations had to be made).

There's a reason we used n_jobs=-1 too.

If we stuck with the default of n_jobs=None (the same as n_jobs=1), it would've taken much longer.

Configuration (MacBook Pro M1 Pro, 10 Cores)	CPU Times (User)	CPU Times (Sys)	CPU Times (Total)	Wall Time
`n_jobs=-1` (all cores)	9min 14s	3.85s	9min 18s	1min 15s
`n_jobs=None` (default)	7min 14s	1.75s	7min 16s	7min 25s

And as we've discussed many times, one of the main goals when starting a machine learning project is to reduce your time between experiments.

How about we score the model trained on all of the data?

In [70]:

# Evaluate the model
model_sample_all_score = model.score(X=X_all,
                                     y=y_all)

print(f"[INFO] Model score on {len(df_tmp)} samples: {model_sample_all_score}")

Out[70]:

[INFO] Model score on 412698 samples: 0.9875710658782831

An even better score!

Oh wait...

Oh no...

I think we've got an error... (you might've noticed it already)

Why might this metric be unreliable?

Hint: Compare the data we trained on versus the data we evaluated on.

2.10 A big (but fixable) mistake

One of the hard things about bugs in machine learning projects is that they are often silent.

For example, our model seems to have fit the data with no issues and then evaluated with a good score.

So what's wrong?

It seems we've stumbled across one of the most common bugs in machine learning and that's data leakage (data from the training set leaking into the validation/testing sets).

We've evaluated our model on the same data it was trained on.

This isn't the model's fault either.

It's our fault.

Right back at the start we imported a file called TrainAndValid.csv, this file contains both the training and validation data.

And while we preprocessed it to make sure there were no missing values and the samples were all numeric, we never split the data into separate training and validation splits.

The right workflow would've been to train the model on the training split and then evaluate it on the unseen and separate validation split.

Our evaluation scores above are quite good but they can't necessarily be trusted to be replicated on unseen data (data in the real world) because they've been obtained by evaluating the model on data its already seen during training.

This would be the equivalent of a final exam at university containing all of the same questions as the practice exam without any changes, you may get a good grade, but does that good grade translate to the real world?

Not to worry, we can fix this!

How?

We can import the training and validation datasets separately via Train.csv and Valid.csv respectively.

Or we could import TrainAndValid.csv and perform the appropriate splits according the original Kaggle competition page (training data includes all samples prior to 2012 and validation data includes samples from January 1 2012 to April 30 2012).

In both methods, we'll have to perform similar preprocessing steps we've done so far.

Except because the validation data is supposed to remain as unseen data, we'll only use information from the training set to preprocess the validation set (and not mix the two).

We'll work on this in the subsequent sections.

The takeaway?

Always (if possible) create appropriate data splits at the start of a project.

Because it's one thing to train a machine learning model but if you can't evaluate it properly (on unseen data), how can you know how it'll perform (or may perform) in the real world on new and unseen data?

3. Splitting data into the right train/validation sets

The bad news is, we evaluated our model on the same data we trained it on.

The good news is, we get to practice importing and preprocessing our data again.

This time we'll make sure we've got separate training and validation splits.

Previously, we used pandas to ensure our data was all numeric and had no missing values.

And we can still use pandas for things such as creating/altering date-related columns.

But using pandas for all of our data preprocessing can be an issue with larger scale datasets or when new data is introduced.

How about this time we add Scikit-Learn to the mix and make a reproducible pipeline for our data preprocessing needs?

Note: Scikit-Learn has a fantastic guide on data transformations and in particular data preprocessing. I'd highly recommend spending an hour or so reading through this documentation, even if it doesn't make a lot of sense to begin with. Rest assured, with practice and experimentation you'll start to get the hang of it.

According to the Kaggle data page, the train, validation and test sets are split according to dates.

This makes sense since we're working on a time series problem (using past sale prices to try and predict future sale prices).

Knowing this, randomly splitting our data into train, validation and test sets using something like sklearn.model_selection.train_test_split() wouldn't work as this would mix samples from different dates in an unintended way.

Instead, we split our data into training, validation and test sets using the date each sample occured.

In our case:

Training data (Train.csv) = all samples up until 2011.
Validation data (Valid.csv) = all samples form January 1, 2012 - April 30, 2012.
Testing data (Test.csv) = all samples from May 1, 2012 - November 2012.

Previously we imported TrainAndValid.csv which is a combination of Train.csv and Valid.csv in one file.

We could split this based on the saledate column.

However, we could also import the Train.csv and Valid.csv files separately (we'll import Test.csv later on when we've trained a model).

We'll also import ValidSolution.csv which contains the SalePrice of Valid.csv and make sure we match the columns based on the SalesID key.

Note: For more on making good training, validation and test sets, check out the post How (and why) to create a good validation set by Rachel Thomas as well as The importance of a test set by Daniel Bourke.

In [71]:

# Import train samples (making sure to parse dates and then sort by them)
train_df = pd.read_csv(filepath_or_buffer="../data/bluebook-for-bulldozers/Train.csv",
                       parse_dates=["saledate"],
                       low_memory=False).sort_values(by="saledate", ascending=True)

# Import validation samples (making sure to parse dates and then sort by them)
valid_df = pd.read_csv(filepath_or_buffer="../data/bluebook-for-bulldozers/Valid.csv",
                       parse_dates=["saledate"])

# The ValidSolution.csv contains the SalePrice values for the samples in Valid.csv
valid_solution = pd.read_csv(filepath_or_buffer="../data/bluebook-for-bulldozers/ValidSolution.csv")

# Map valid_solution to valid_df
valid_df["SalePrice"] = valid_df["SalesID"].map(valid_solution.set_index("SalesID")["SalePrice"])

# Make sure valid_df is sorted by saledate still
valid_df = valid_df.sort_values("saledate", ascending=True).reset_index(drop=True)

# How many samples are in each DataFrame?
print(f"[INFO] Number of samples in training DataFrame: {len(train_df)}")
print(f"[INFO] Number of samples in validation DataFrame: {len(valid_df)}")

Out[71]:

[INFO] Number of samples in training DataFrame: 401125
[INFO] Number of samples in validation DataFrame: 11573

In [72]:

# Let's check out the training DataFrame
train_df.sample(5)

Out[72]:

In [73]:

# And how about the validation DataFrame?
valid_df.sample(5)

Out[73]:

Nice!

We've now got separate training and validation datasets imported.

In a previous section, we created a function to decompose the saledate column into multiple features such as saleYear, saleMonth, saleDay and more.

Let's now replicate that function here and apply it to our train_df and valid_df.

In [74]:

# Make a function to add date columns
def add_datetime_features_to_df(df, date_column="saledate"):
    # Add datetime parameters for saledate
    df["saleYear"] = df[date_column].dt.year
    df["saleMonth"] = df[date_column].dt.month
    df["saleDay"] = df[date_column].dt.day
    df["saleDayofweek"] = df[date_column].dt.dayofweek
    df["saleDayofyear"] = df[date_column].dt.dayofyear

    # Drop original saledate column
    df.drop("saledate", axis=1, inplace=True)

    return df

train_df = add_datetime_features_to_df(df=train_df)
valid_df = add_datetime_features_to_df(df=valid_df)

Wonderful, now let's make sure it worked by inspecting the last 5 columns of train_df.

In [75]:

# Display the last 5 columns (the recently added datetime breakdowns)
train_df.iloc[:, -5:].sample(5)

Out[75]:

Perfect! How about we try and fit a model?

3.1 Trying to fit a model on our training data

I'm a big fan of trying to fit a model on your dataset as early as possible.

If it works, you'll have to inspect and check its results.

And if it doesn't work, you'll get some insights into what you may have to do to your dataset to prepare it.

Let's turn our DataFrames into features (X) by dropping the SalePrice column (this is the value we're trying to predict) and labels (y) by extracting the SalePrice column.

Then we'll create a model using sklearn.ensemble.RandomForestRegressor and finally we'll try to fit it to only the training data.

In [76]:

# Split training data into features and labels
X_train = train_df.drop("SalePrice", axis=1)
y_train = train_df["SalePrice"]

# Split validation data into features and labels
X_valid = valid_df.drop("SalePrice", axis=1)
y_valid = valid_df["SalePrice"]

# Create a model
model = RandomForestRegressor(n_jobs=-1)

# Fit a model to the training data only
model.fit(X=X_train,
          y=y_train)

Out[76]:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/c4/qj4gdk190td18bqvjjh0p3p00000gn/T/ipykernel_20423/150598518.py in ?()
# Create a model
model = RandomForestRegressor(n_jobs=-1)

# Fit a model to the training data only
---> 13 model.fit(X=X_train,
         y=y_train)
~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/base.py in ?(estimator, *args, **kwargs)
               skip_parameter_validation=(
                   prefer_skip_nested_validation or global_skip_validation
               )
           ):
-> 1473                 return fit_method(estimator, *args, **kwargs)

~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/ensemble/_forest.py in ?(self, X, y, sample_weight)
       # Validate or convert input data
       if issparse(y):
           raise ValueError("sparse multilabel-indicator for y is not supported.")

--> 363         X, y = self._validate_data(
           X,
           y,
           multi_output=True,
~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/base.py in ?(self, X, y, reset, validate_separately, cast_to_ndarray, **check_params)
               if "estimator" not in check_y_params:
                   check_y_params = {**default_check_params, **check_y_params}
               y = check_array(y, input_name="y", **check_y_params)
           else:
--> 650                 X, y = check_X_y(X, y, **check_params)
           out = X, y

       if not no_val_X and check_params.get("ensure_2d", True):
~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/utils/validation.py in ?(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_writeable, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
       raise ValueError(
           f"{estimator_name} requires y to be passed, but the target y is None"
       )

-> 1301     X = check_array(
       X,
       accept_sparse=accept_sparse,
       accept_large_sparse=accept_large_sparse,
~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/utils/validation.py in ?(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_writeable, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
                       )
                   array = xp.astype(array, dtype, copy=False)
               else:
                   array = _asarray_with_order(array, order=order, dtype=dtype, xp=xp)
-> 1013             except ComplexWarning as complex_warning:
               raise ValueError(
                   "Complex data not supported\n{}\n".format(array)
               ) from complex_warning
~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/utils/_array_api.py in ?(array, dtype, order, copy, xp, device)
       # Use NumPy API to support order
       if copy is True:
           array = numpy.array(array, order=order, dtype=dtype)
       else:
--> 751             array = numpy.asarray(array, order=order, dtype=dtype)

       # At this point array is a NumPy ndarray. We convert it to an array
       # container that is consistent with the input's namespace.
~/miniforge3/envs/ai/lib/python3.11/site-packages/pandas/core/generic.py in ?(self, dtype, copy)
   def __array__(
       self, dtype: npt.DTypeLike | None = None, copy: bool_t | None = None
   ) -> np.ndarray:
       values = self._values
-> 2153         arr = np.asarray(values, dtype=dtype)
       if (
           astype_is_view(values.dtype, arr.dtype)
           and using_copy_on_write()
ValueError: could not convert string to float: 'Medium'

Oh no!

We run into the error:

ValueError: could not convert string to float: 'Medium'

Hmm...

Where have we seen this error before?

It looks like since we re-imported our training dataset (from Train.csv) its no longer all numerical (hence the ValueError above).

Not to worry, we can fix this!

3.2 Encoding categorical features as numbers using Scikit-Learn

We've preprocessed our data previously with pandas.

And while this is a viable approach, how about we practice using another method?

This time we'll use Scikit-Learn's built-in preprocessing methods.

Why?

Because it's good exposure to different techniques.

And Scikit-Learn has many built-in helpful and well tested methods for preparing data.

You can also string together many of these methods and create a reusable pipeline (you can think of this pipeline as plumbing for data).

To preprocess our data with Scikit-Learn, we'll first define the numerical and categorical features of our dataset.

In [77]:

# Define numerical and categorical features
numerical_features = [label for label, content in X_train.items() if pd.api.types.is_numeric_dtype(content)]
categorical_features = [label for label, content in X_train.items() if not pd.api.types.is_numeric_dtype(content)]

print(f"[INFO] Numeric features: {numerical_features}")
print(f"[INFO] Categorical features: {categorical_features[:10]}...")

Out[77]:

[INFO] Numeric features: ['SalesID', 'MachineID', 'ModelID', 'datasource', 'auctioneerID', 'YearMade', 'MachineHoursCurrentMeter', 'saleYear', 'saleMonth', 'saleDay', 'saleDayofweek', 'saleDayofyear']
[INFO] Categorical features: ['UsageBand', 'fiModelDesc', 'fiBaseModel', 'fiSecondaryDesc', 'fiModelSeries', 'fiModelDescriptor', 'ProductSize', 'fiProductClassDesc', 'state', 'ProductGroup']...

Nice!

We define our different feature types so we can use different preprocessing methods on each type.

Scikit-Learn has many built-in methods for preprocessing data under the sklearn.preprocessing module.

And I'd encourage you to spend some time reading the preprocessing data section of the Scikit-Learn user guide for more details.

For now, let's focus on turning our categorical features into numbers (from object/string datatype to numeric datatype).

The practice of turning non-numerical features into numerical features is often referred to as encoding.

There are several encoders available for different use cases.

Encoder	Description	Use case	For use on
`LabelEncoder`	Encode target labels with values between 0 and n_classes-1.	Useful for turning classification target values into numeric representations.	Target labels.
`OneHotEncoder`	Encode categorical features as a one-hot numeric array.	Turns every positive class of a unique category into a 1 and every negative class into a 0.	Categorical variables/features.
`OrdinalEncoder`	Encode categorical features as an integer array.	Turn unique categorical values into a range of integers, for example, 0 maps to 'cat', 1 maps to 'dog', etc.	Categorical variables/features.
`TargetEncoder`	Encode regression and classification targets into a shrunk estimate of the average target values for observations of the category.	Useful for converting targets into a certain range of values.	Target variables.

For our case, we're going to start with OrdinalEncoder.

When transforming/encoding values with Scikit-Learn, the steps as follows:

Instantiate an encoder, for example, sklearn.preprocessing.OrdinalEncoder.
Use the sklearn.preprocessing.OrdinalEncoder.fit method on the training data (this helps the encoder learn a mapping of categorical to numeric values).
Use the sklearn.preprocessing.OrdinalEncoder.transform method on the training data to apply the learned mapping from categorical to numeric values.
- Note: The sklearn.preprocessing.OrdinalEncoder.fit_transform method combines steps 1 & 2 into a single method.
Apply the learned mapping to subsequent datasets such as validation and test splits using sklearn.preprocessing.OrdinalEncoder.transform only.

Notice how the fit and fit_transform methods were reserved for the training dataset only.

This is because in practice the validation and testing datasets are meant to be unseen, meaning only information from the training dataset should be used to preprocess the validation/test datasets.

In short:

Instantiate an encoder such as sklearn.preprocessing.OrdinalEncoder.
Fit the encoder to and transform the training dataset categorical variables/features with sklearn.preprocessing.OrdinalEncoder.fit_transform.
Transform categorical variables/features from subsequent datasets such as the validation and test datasets with the learned encoding from step 2 using sklearn.preprocessing.OridinalEncoder.transform.
- Note: Notice the use of the transform method on validation/test datasets rather than fit_transform.

Let's do it!

We'll use the OrdinalEncoder class to fill any missing values with np.nan (NaN).

We'll also make sure to only use the OrdinalEncoder on the categorical features of our DataFrame.

Finally, the OrdinalEncoder expects all input variables to be of the same type (e.g. either numeric only or string only) so we'll make sure all the input variables are strings only using pandas.DataFrame.astype(str).

In [78]:

from sklearn.preprocessing import OrdinalEncoder

# 1. Create an ordinal encoder (turns category items into numeric representation)
ordinal_encoder = OrdinalEncoder(categories="auto",
                                 handle_unknown="use_encoded_value",
                                 unknown_value=np.nan,
                                 encoded_missing_value=np.nan) # treat unknown categories as np.nan (or None)

# 2. Fit and transform the categorical columns of X_train
X_train_preprocessed = X_train.copy() # make copies of the oringal DataFrames so we can keep the original values in tact and view them later
X_train_preprocessed[categorical_features] = ordinal_encoder.fit_transform(X_train_preprocessed[categorical_features].astype(str)) # OrdinalEncoder expects all values as the same type (e.g. string or numeric only)

# 3. Transform the categorical columns of X_valid 
X_valid_preprocessed = X_valid.copy()
X_valid_preprocessed[categorical_features] = ordinal_encoder.transform(X_valid_preprocessed[categorical_features].astype(str)) # only use `transform` on the validation data

Wonderful!

Let's see if it worked.

First, we'll check out the original X_train DataFrame.

In [79]:

X_train.head()

Out[79]:

And how about X_train_preprocessed?

In [80]:

X_train_preprocessed.head()

Out[80]:

Beautiful!

Notice all of the non-numerical values in X_train have been converted to numerical values in X_train_preprocessed.

Now how about missing values?

Let's see the top 10 columns with the highest number of missing values from X_train.

In [81]:

X_train[categorical_features].isna().sum().sort_values(ascending=False)[:10]

Out[81]:

Engine_Horsepower    375906
Blade_Extension      375906
Tip_Control          375906
Pushblock            375906
Enclosure_Type       375906
Blade_Width          375906
Scarifier            375895
Hydraulics_Flow      357763
Grouser_Tracks       357763
Coupler_System       357667
dtype: int64

Ok, plenty of missing values.

How about X_train_preprocessed?

In [82]:

X_train_preprocessed[categorical_features].isna().sum().sort_values(ascending=False)[:10]

Out[82]:

UsageBand         0
fiModelDesc       0
Pushblock         0
Ripper            0
Scarifier         0
Tip_Control       0
Tire_Size         0
Coupler           0
Coupler_System    0
Grouser_Tracks    0
dtype: int64

Perfect! No missing values as well!

Now, what if we wanted to retrieve the original categorical values?

We can do using the OrdinalEncoder.categories_ attribute.

This will return the categories of each feature found during fit (or during fit_transform), the categories will be in the order of the features seen (same order as the columns of the DataFrame).

In [83]:

# Let's inspect the first three categories
ordinal_encoder.categories_[:3]

Out[83]:

[array(['High', 'Low', 'Medium', 'nan'], dtype=object),
 array(['100C', '104', '1066', ..., 'ZX800LC', 'ZX80LCK', 'ZX850H'],
       dtype=object),
 array(['10', '100', '104', ..., 'ZX80', 'ZX800', 'ZX850'], dtype=object)]

Since these come in the order of the features seen, we can create a mapping of these using the categorical column names of our DataFrame.

In [84]:

# Create a dictionary of dictionaries mapping column names and their variables to their numerical encoding
column_to_category_mapping = {}

for column_name, category_values in zip(categorical_features, ordinal_encoder.categories_):
    int_to_category = {i: category for i, category in enumerate(category_values)}
    column_to_category_mapping[column_name] = int_to_category

# Inspect an example column name to category mapping
column_to_category_mapping["UsageBand"]

Out[84]:

{0: 'High', 1: 'Low', 2: 'Medium', 3: 'nan'}

We can also reverse our OrdinalEncoder values with the inverse_transform() method.

This is helpful for reversing a preprocessing step or viewing the original data again if necessary.

In [85]:

# Create a copy of the preprocessed DataFrame
X_train_unprocessed = X_train_preprocessed[categorical_features].copy()

# This will return an array of the original untransformed data
X_train_unprocessed = ordinal_encoder.inverse_transform(X_train_unprocessed)

# Turn back into a DataFrame for viewing pleasure
X_train_unprocessed_df = pd.DataFrame(X_train_unprocessed, columns=categorical_features)

# Check out a sample
X_train_unprocessed_df.sample(5)

Out[85]:

Nice!

Now how about we try fitting a model again?

3.3 Fitting a model to our preprocessed training data

We've used Scikit-Learn to convert the categorical data in our training and validation sets into numbers.

But we haven't yet done anything with missing numerical values.

As it turns out, we can still try and fit a model.

Why?

Because there are several estimators/models in Scikit-Learn that can handle missing (NaN) values.

And our trusty sklearn.ensemble.RandomForestRegressor is one of them!

Let's try it out on our X_train_preprocessed DataFrame.

Note: For a list of all Scikit-Learn estimators that can handle NaN values, check out the Scikit-Learn imputation of missing values user guide.

In [86]:

%%time

# Instantiate a Random Forest Regression model
model = RandomForestRegressor(n_jobs=-1)

# Fit the model to the preprocessed training data
model.fit(X=X_train_preprocessed,
          y=y_train)

Out[86]:

CPU times: user 8min 54s, sys: 6min 26s, total: 15min 20s
Wall time: 2min 40s

It worked!

Now you might be thinking, "well if we could fit a model on a dataset with missing values, why did we bother filling them before?"

And that's a great question.

The main reason is to practice, practice, practice.

While there are some models which can handle missing values, others can't.

So it's good to have experience with both of these scenarios.

Let's see how our model scores on the validation set, data our model has never seen.

In [88]:

%%time

# Check model performance on the validation set
model.score(X=X_valid,
            y=y_valid)

Out[88]:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<timed eval> in ?()
'Could not get source, probably due dynamically evaluated source code.'
~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/base.py in ?(self, X, y, sample_weight)
       """

       from .metrics import r2_score

--> 848         y_pred = self.predict(X)
       return r2_score(y, y_pred, sample_weight=sample_weight)
~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/ensemble/_forest.py in ?(self, X)
           The predicted values.
       """
       check_is_fitted(self)
       # Check data
-> 1063         X = self._validate_X_predict(X)

       # Assign chunk of trees to jobs
       n_jobs, _, _ = _partition_estimators(self.n_estimators, self.n_jobs)
~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/ensemble/_forest.py in ?(self, X)
           force_all_finite = "allow-nan"
       else:
           force_all_finite = True

--> 641         X = self._validate_data(
           X,
           dtype=DTYPE,
           accept_sparse="csr",
~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/base.py in ?(self, X, y, reset, validate_separately, cast_to_ndarray, **check_params)
               out = y
           else:
               out = X, y
       elif not no_val_X and no_val_y:
--> 633             out = check_array(X, input_name="X", **check_params)
       elif no_val_X and not no_val_y:
           out = _check_y(y, **check_params)
       else:
~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/utils/validation.py in ?(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_writeable, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
                       )
                   array = xp.astype(array, dtype, copy=False)
               else:
                   array = _asarray_with_order(array, order=order, dtype=dtype, xp=xp)
-> 1013             except ComplexWarning as complex_warning:
               raise ValueError(
                   "Complex data not supported\n{}\n".format(array)
               ) from complex_warning
~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/utils/_array_api.py in ?(array, dtype, order, copy, xp, device)
       # Use NumPy API to support order
       if copy is True:
           array = numpy.array(array, order=order, dtype=dtype)
       else:
--> 751             array = numpy.asarray(array, order=order, dtype=dtype)

       # At this point array is a NumPy ndarray. We convert it to an array
       # container that is consistent with the input's namespace.
~/miniforge3/envs/ai/lib/python3.11/site-packages/pandas/core/generic.py in ?(self, dtype, copy)
   def __array__(
       self, dtype: npt.DTypeLike | None = None, copy: bool_t | None = None
   ) -> np.ndarray:
       values = self._values
-> 2153         arr = np.asarray(values, dtype=dtype)
       if (
           astype_is_view(values.dtype, arr.dtype)
           and using_copy_on_write()
ValueError: could not convert string to float: 'Low'

Oops!

Looks like we get an error:

ValueError: could not convert string to float: 'Low'

This is because we tried to evaluate our model on the original X_valid dataset which still contains strings rather than X_valid_preprocessed which contains all numerical values.

As we've discussed before, in machine learning problems, it's important to evaluate your models on data in the same format as they were trained on.

Knowing this, let's evaluate our model on our preprocessed validation dataset.

In [89]:

%%time

# Check model performance on the validation set
model.score(X=X_valid_preprocessed,
            y=y_valid)

Out[89]:

CPU times: user 766 ms, sys: 3.54 s, total: 4.3 s
Wall time: 1.27 s

0.8700295442271035

Excellent!

Now you might be wondering why this score ( $R^2$ or R-squared by default) is lower than the previous score of ~0.9875.

That's because this score is based on a model that has only seen the training data and is being evaluated on an unseen dataset (training on Train.csv, evaluating on Valid.csv).

Our previous score was from a model that had all of the evaluation samples in the training data (training and evaluating on TrainAndValid.csv).

So in practice, we would consider the most recent score as a much more reliable metric of how well our model might perform on future unseen data.

Just for fun, let's see how our model scores on the training dataset.

In [90]:

%%time

# Check model performance on the training set
model.score(X=X_train_preprocessed,
            y=y_train)

Out[90]:

CPU times: user 17.6 s, sys: 19.2 s, total: 36.8 s
Wall time: 7.42 s

0.9872786621410867

As expected our model performs better on the training set than the validation set.

It also scores much closer to the previous score of ~0.9875 we obtained when training and scoring on TrainAndValid.csv combined.

Note: It is common to see a model perform slightly worse on a validation/testing dataset than on a training set. This is because the model has seen all of the examples in the training set, where as, if done correctly, the validation and test sets are keep separate during training. So you would expect a model to do better on problems that it has seen before versus problems it hasn't. If you find your model scoring much higher on unseen data versus seen data (e.g. higher scores on the test set compared to the training set), you might want to inspect your data to make sure there isn't any leakage from the validation/test set into the training set.

4. Building an evaluation function

Evaluating a machine learning model is just as important as training one.

And so because of this, let's create an evaluation function to make evaluation faster and reproducible.

According to Kaggle for the Bluebook for Bulldozers competition, the evaluation function they use is root mean squared log error (RMSLE).

\text{RMSLE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \log(1 + \hat{y}_i) - \log(1 + y_i) \right)^2}

Where:

$\hat{y}_i$ is the predicted value,
$y_i$ is the actual value,
$n$ is the number of observations.

Contrast this with mean absolute error (MAE), another common regression metric.

\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{y}_i - y_i \right|

With RMSLE, the relative error is more meaningful than the absolute error. You care more about ratios than absolute errors. For example, being off by $100 on a $1000 prediction (10% error) is more significant than being off by $100 on a $10,000 prediction (1% error). RMSLE is sensitive to large percentage errors.

Where as with MAE, is more about exact differences, a $100 prediction error is weighted the same regardless of the actual value.

In each of case, a lower value (closer to 0) is better.

For any problem, it's important to define the evaluation metric you're going to try and improve on.

In our case, let's create a function that calculates multiple evaluation metrics.

Namely, we'll use:

MAE (mean absolute error) via sklearn.metrics.mean_absolute_error - lower is better.
RMSLE (root mean squared log error) via sklearn.metrics.root_mean_squared_log_error - lower is better.
$R^2$ (R-squared or coefficient of determination) via the score method - higher is better.

For MAE and RMSLE we'll be comparing the model's predictions to the truth labels.

We can get an array of predicted values from our model using model.predict(X=features_to_predict_on).

In [93]:

# Create evaluation function (the competition uses Root Mean Square Log Error)
from sklearn.metrics import mean_absolute_error, root_mean_squared_log_error

# Create function to evaluate our model
def show_scores(model, 
                train_features=X_train_preprocessed,
                train_labels=y_train,
                valid_features=X_valid_preprocessed,
                valid_labels=y_valid):
    
    # Make predictions on train and validation features
    train_preds = model.predict(X=train_features)
    val_preds = model.predict(X=valid_features)

    # Create a scores dictionary of different evaluation metrics
    scores = {"Training MAE": mean_absolute_error(y_true=train_labels, 
                                                  y_pred=train_preds),
              "Valid MAE": mean_absolute_error(y_true=valid_labels, 
                                               y_pred=val_preds),
              "Training RMSLE": root_mean_squared_log_error(y_true=train_labels, 
                                                            y_pred=train_preds),
              "Valid RMSLE": root_mean_squared_log_error(y_true=valid_labels, 
                                                         y_pred=val_preds),
              "Training R^2": model.score(X=train_features, 
                                          y=train_labels),
              "Valid R^2": model.score(X=valid_features, 
                                       y=valid_labels)}
    return scores

Now that's a nice looking function!

How about we test it out?

In [94]:

# Try our model scoring function out
model_scores = show_scores(model=model)
model_scores

Out[94]:

{'Training MAE': np.float64(1596.4113176025767),
 'Valid MAE': np.float64(6172.124644142976),
 'Training RMSLE': np.float64(0.08546822305943352),
 'Valid RMSLE': np.float64(0.2576977236694938),
 'Training R^2': 0.9872786621410867,
 'Valid R^2': 0.8700295442271035}

Beautiful!

Now we can reuse this in the future for evaluating other models.

5. Tuning our model's hyperparameters

Hyperparameters are the settings we can change on our model.

And tuning hyperparameters on a given model can often alter its performance on a given dataset.

Ideally, changing hyperparameters would lead to better results.

However, it's often hard to know what hyperparameter changes would improve a model ahead of time.

So what we can do is run several experiments across various different hyperparameter settings and record which lead to the best results.

5.1 Making our modelling experiments faster (to speed up hyperparameter tuning)

Because of the size of our dataset (~400,000 rows), retraining an entire model (about 1-1.5 minutes on my MacBook Pro M1 Pro) for each new set of hyperparameters would take far too long to continuing experimenting as fast as we want to.

So what we'll do is take a sample of the training set and tune the hyperparameters on that before training a larger model.

Note: If you're experiments are taking longer than 10-seconds (or far longer than what you can interact with), you should be trying to speed things up. You can speed experiments up by sampling less data, using a faster computer or using a smaller model.

We can take a artificial sample of the training set by altering the number of samples seen by each n_estimator (an n_estimator is a decision tree a random forest will create during training, more trees generally leads to better performance but sacrifices compute time) in sklearn.ensemble.RandomForestRegressor using the max_samples parameter.

For example, setting max_samples to 10,000 means every n_estimator (default 100) in our RandomForestRegressor will only see 10,000 random samples from our DataFrame instead of the entire ~400,000.

In other words, we'll be looking at 40x less samples which means we should get faster computation speeds but we should also expect our results to worsen (because the model has less samples to learn patterns from).

Let's see if reducing the number samples speeds up our modelling time.

In [95]:

%%time

# Change max samples in RandomForestRegressor
model = RandomForestRegressor(n_estimators=100, # this is the default
                              n_jobs=-1,
                              max_samples=10000) # each estimator sees max_samples (the default is to see all available samples)

# Cutting down the max number of samples each tree can see improves training time
model.fit(X_train_preprocessed, 
          y_train)

Out[95]:

CPU times: user 19.2 s, sys: 18.5 s, total: 37.6 s
Wall time: 7.53 s

Nice! That worked much faster than training on the whole dataset.

Let's evaluate our model with our show_scores function.

In [96]:

# Get evaluation metrics from reduced sample model
base_model_scores = show_scores(model=model)
base_model_scores

Out[96]:

{'Training MAE': np.float64(5605.344206319725),
 'Valid MAE': np.float64(7176.1651147786515),
 'Training RMSLE': np.float64(0.26030112528907273),
 'Valid RMSLE': np.float64(0.2935839690284876),
 'Training R^2': 0.858111849057448,
 'Valid R^2': 0.828549722372896}

Excellent! Even though our new model saw far less data than the previous model, it still looks to be performing quite well.

With this faster model, we can start to run a series of different hyperparameter experiments.

5.2 Hyperparameter tuning with RandomizedSearchCV

The goal of hyperparameter tuning is to values for our model's settings which lead to better results.

We could sit there and do this by hand, adjusting parameters on sklearn.ensemble.RandomForestRegressor such as n_estimators, max_depth, min_samples_split and more.

However, this would quite tedious.

Instead, we can define a dictionary of hyperparametmer settings in the form {"hyperparamter_name": [values_to_test]} and then use sklearn.model_selection.RandomizedSearchCV (randomly search for best combination of hyperparameters) or sklearn.model_selection.GridSearchCV (exhaustively search for best combination of hyperparameters) to go through all of these settings for us on a given model and dataset and then record which perform best.

A general workflow is to start with a large number and wide range of potential settings and use RandomizedSearchCV to search across these randomly for a limited number of iterations (e.g. n_iter=10).

And then take the best results and narrow the search space down before exhaustively search for the best hyperparameters with GridSearchCV.

Let's start trying to find better hyperparameters by:

Define a dictionary of hyperparameter values for our RandomForestRegressor model. We'll keep max_samples=10000 so our experiments run faster.
Setup an instance of RandomizedSearchCV to explore the parameter values defined in step 1. We can adjust how many sets of hyperparameters our model tries using the n_iter parameter as well as how many times our model performs cross-validation using the cv parameter. For example, setting n_iter=20 and cv=3 means there will be 3 cross-validation folds for each of the 20 different combinations of hyperparameters, a total of 60 (3*20) experiments.
Fit the instance of RandomizedSearchCV to the data. This will automatically go through the defined number of iterations and record the results for each. The best model gets loaded at the end.

Note: You can read more about the tuning of hyperparameters of an esimator/model in the Scikit-Learn user guide.

In [97]:

%%time

from sklearn.model_selection import RandomizedSearchCV

# 1. Define a dictionary with different values for RandomForestRegressor hyperparameters
# See documatation for potential different values - https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html 
rf_grid = {"n_estimators": np.arange(10, 200, 10),
           "max_depth": [None, 10, 20],
           "min_samples_split": np.arange(2, 10, 1), # min_samples_split must be an int in the range [2, inf) or a float in the range (0.0, 1.0]
           "min_samples_leaf": np.arange(1, 10, 1),
           "max_features": [0.5, 1.0, "sqrt"], # Note: "max_features='auto'" is equivalent to "max_features=1.0", as of Scikit-Learn version 1.1
           "max_samples": [10000]}

# 2. Setup instance of RandomizedSearchCV to explore different parameters 
rs_model = RandomizedSearchCV(estimator=RandomForestRegressor(), # can pass new model instance directly, all settings will be taken from the rf_grid
                              param_distributions=rf_grid,
                              n_iter=20,
                            #   scoring="neg_root_mean_squared_log_error", # want to optimize for RMSLE, though sometimes optimizing for the default metric (R^2) can lead to just as good results all round
                              cv=3,
                              verbose=3) # control how much output gets produced, higher number = more output

# 3. Fit the model using a series of different hyperparameter values
rs_model.fit(X=X_train_preprocessed, 
             y=y_train)

Out[97]:

Fitting 3 folds for each of 20 candidates, totalling 60 fits
[CV 1/3] END max_depth=10, max_features=1.0, max_samples=10000, min_samples_leaf=1, min_samples_split=3, n_estimators=160;, score=0.539 total time=  21.6s
[CV 2/3] END max_depth=10, max_features=1.0, max_samples=10000, min_samples_leaf=1, min_samples_split=3, n_estimators=160;, score=0.720 total time=  23.0s
[CV 3/3] END max_depth=10, max_features=1.0, max_samples=10000, min_samples_leaf=1, min_samples_split=3, n_estimators=160;, score=0.596 total time=  22.2s
[CV 1/3] END max_depth=10, max_features=sqrt, max_samples=10000, min_samples_leaf=9, min_samples_split=5, n_estimators=60;, score=0.491 total time=   3.3s
[CV 2/3] END max_depth=10, max_features=sqrt, max_samples=10000, min_samples_leaf=9, min_samples_split=5, n_estimators=60;, score=0.655 total time=   3.4s
[CV 3/3] END max_depth=10, max_features=sqrt, max_samples=10000, min_samples_leaf=9, min_samples_split=5, n_estimators=60;, score=0.614 total time=   3.3s
[CV 1/3] END max_depth=None, max_features=sqrt, max_samples=10000, min_samples_leaf=9, min_samples_split=8, n_estimators=130;, score=0.520 total time=   6.7s
[CV 2/3] END max_depth=None, max_features=sqrt, max_samples=10000, min_samples_leaf=9, min_samples_split=8, n_estimators=130;, score=0.702 total time=   6.6s
[CV 3/3] END max_depth=None, max_features=sqrt, max_samples=10000, min_samples_leaf=9, min_samples_split=8, n_estimators=130;, score=0.636 total time=   6.6s
[CV 1/3] END max_depth=20, max_features=sqrt, max_samples=10000, min_samples_leaf=8, min_samples_split=9, n_estimators=30;, score=0.512 total time=   3.0s
[CV 2/3] END max_depth=20, max_features=sqrt, max_samples=10000, min_samples_leaf=8, min_samples_split=9, n_estimators=30;, score=0.703 total time=   2.7s
[CV 3/3] END max_depth=20, max_features=sqrt, max_samples=10000, min_samples_leaf=8, min_samples_split=9, n_estimators=30;, score=0.636 total time=   2.7s
[CV 1/3] END max_depth=None, max_features=0.5, max_samples=10000, min_samples_leaf=9, min_samples_split=3, n_estimators=100;, score=0.541 total time=   9.9s
[CV 2/3] END max_depth=None, max_features=0.5, max_samples=10000, min_samples_leaf=9, min_samples_split=3, n_estimators=100;, score=0.745 total time=  11.1s
[CV 3/3] END max_depth=None, max_features=0.5, max_samples=10000, min_samples_leaf=9, min_samples_split=3, n_estimators=100;, score=0.632 total time=  10.2s
[CV 1/3] END max_depth=10, max_features=0.5, max_samples=10000, min_samples_leaf=3, min_samples_split=8, n_estimators=50;, score=0.529 total time=   6.1s
[CV 2/3] END max_depth=10, max_features=0.5, max_samples=10000, min_samples_leaf=3, min_samples_split=8, n_estimators=50;, score=0.713 total time=   5.5s
[CV 3/3] END max_depth=10, max_features=0.5, max_samples=10000, min_samples_leaf=3, min_samples_split=8, n_estimators=50;, score=0.625 total time=   5.2s
[CV 1/3] END max_depth=10, max_features=0.5, max_samples=10000, min_samples_leaf=4, min_samples_split=6, n_estimators=170;, score=0.532 total time=  13.9s
[CV 2/3] END max_depth=10, max_features=0.5, max_samples=10000, min_samples_leaf=4, min_samples_split=6, n_estimators=170;, score=0.712 total time=  14.6s
[CV 3/3] END max_depth=10, max_features=0.5, max_samples=10000, min_samples_leaf=4, min_samples_split=6, n_estimators=170;, score=0.631 total time=  14.2s
[CV 1/3] END max_depth=20, max_features=0.5, max_samples=10000, min_samples_leaf=1, min_samples_split=5, n_estimators=40;, score=0.545 total time=   6.6s
[CV 2/3] END max_depth=20, max_features=0.5, max_samples=10000, min_samples_leaf=1, min_samples_split=5, n_estimators=40;, score=0.767 total time=   6.5s
[CV 3/3] END max_depth=20, max_features=0.5, max_samples=10000, min_samples_leaf=1, min_samples_split=5, n_estimators=40;, score=0.619 total time=   6.2s
[CV 1/3] END max_depth=None, max_features=sqrt, max_samples=10000, min_samples_leaf=9, min_samples_split=5, n_estimators=120;, score=0.518 total time=   6.0s
[CV 2/3] END max_depth=None, max_features=sqrt, max_samples=10000, min_samples_leaf=9, min_samples_split=5, n_estimators=120;, score=0.703 total time=   6.4s
[CV 3/3] END max_depth=None, max_features=sqrt, max_samples=10000, min_samples_leaf=9, min_samples_split=5, n_estimators=120;, score=0.637 total time=   6.2s
[CV 1/3] END max_depth=10, max_features=0.5, max_samples=10000, min_samples_leaf=9, min_samples_split=9, n_estimators=120;, score=0.533 total time=  10.7s
[CV 2/3] END max_depth=10, max_features=0.5, max_samples=10000, min_samples_leaf=9, min_samples_split=9, n_estimators=120;, score=0.708 total time=  13.7s
[CV 3/3] END max_depth=10, max_features=0.5, max_samples=10000, min_samples_leaf=9, min_samples_split=9, n_estimators=120;, score=0.628 total time=  10.5s
[CV 1/3] END max_depth=20, max_features=1.0, max_samples=10000, min_samples_leaf=2, min_samples_split=7, n_estimators=90;, score=0.542 total time=  20.0s
[CV 2/3] END max_depth=20, max_features=1.0, max_samples=10000, min_samples_leaf=2, min_samples_split=7, n_estimators=90;, score=0.756 total time=  19.9s
[CV 3/3] END max_depth=20, max_features=1.0, max_samples=10000, min_samples_leaf=2, min_samples_split=7, n_estimators=90;, score=0.611 total time=  17.1s
[CV 1/3] END max_depth=None, max_features=sqrt, max_samples=10000, min_samples_leaf=8, min_samples_split=8, n_estimators=190;, score=0.524 total time=   9.1s
[CV 2/3] END max_depth=None, max_features=sqrt, max_samples=10000, min_samples_leaf=8, min_samples_split=8, n_estimators=190;, score=0.706 total time=   8.7s
[CV 3/3] END max_depth=None, max_features=sqrt, max_samples=10000, min_samples_leaf=8, min_samples_split=8, n_estimators=190;, score=0.637 total time=   8.6s
[CV 1/3] END max_depth=None, max_features=1.0, max_samples=10000, min_samples_leaf=5, min_samples_split=8, n_estimators=70;, score=0.539 total time=  16.0s
[CV 2/3] END max_depth=None, max_features=1.0, max_samples=10000, min_samples_leaf=5, min_samples_split=8, n_estimators=70;, score=0.754 total time=  14.8s
[CV 3/3] END max_depth=None, max_features=1.0, max_samples=10000, min_samples_leaf=5, min_samples_split=8, n_estimators=70;, score=0.615 total time=  14.2s
[CV 1/3] END max_depth=None, max_features=0.5, max_samples=10000, min_samples_leaf=1, min_samples_split=9, n_estimators=60;, score=0.548 total time=   9.3s
[CV 2/3] END max_depth=None, max_features=0.5, max_samples=10000, min_samples_leaf=1, min_samples_split=9, n_estimators=60;, score=0.766 total time=   8.6s
[CV 3/3] END max_depth=None, max_features=0.5, max_samples=10000, min_samples_leaf=1, min_samples_split=9, n_estimators=60;, score=0.623 total time=   9.0s
[CV 1/3] END max_depth=10, max_features=1.0, max_samples=10000, min_samples_leaf=9, min_samples_split=5, n_estimators=170;, score=0.535 total time=  23.8s
[CV 2/3] END max_depth=10, max_features=1.0, max_samples=10000, min_samples_leaf=9, min_samples_split=5, n_estimators=170;, score=0.715 total time=  25.5s
[CV 3/3] END max_depth=10, max_features=1.0, max_samples=10000, min_samples_leaf=9, min_samples_split=5, n_estimators=170;, score=0.595 total time=  27.2s
[CV 1/3] END max_depth=20, max_features=0.5, max_samples=10000, min_samples_leaf=2, min_samples_split=3, n_estimators=80;, score=0.544 total time=  12.0s
[CV 2/3] END max_depth=20, max_features=0.5, max_samples=10000, min_samples_leaf=2, min_samples_split=3, n_estimators=80;, score=0.764 total time=  11.5s
[CV 3/3] END max_depth=20, max_features=0.5, max_samples=10000, min_samples_leaf=2, min_samples_split=3, n_estimators=80;, score=0.642 total time=  10.4s
[CV 1/3] END max_depth=None, max_features=0.5, max_samples=10000, min_samples_leaf=6, min_samples_split=7, n_estimators=70;, score=0.538 total time=   8.2s
[CV 2/3] END max_depth=None, max_features=0.5, max_samples=10000, min_samples_leaf=6, min_samples_split=7, n_estimators=70;, score=0.752 total time=   8.4s
[CV 3/3] END max_depth=None, max_features=0.5, max_samples=10000, min_samples_leaf=6, min_samples_split=7, n_estimators=70;, score=0.640 total time=   8.5s
[CV 1/3] END max_depth=None, max_features=0.5, max_samples=10000, min_samples_leaf=9, min_samples_split=5, n_estimators=90;, score=0.537 total time=   9.8s
[CV 2/3] END max_depth=None, max_features=0.5, max_samples=10000, min_samples_leaf=9, min_samples_split=5, n_estimators=90;, score=0.747 total time=   9.5s
[CV 3/3] END max_depth=None, max_features=0.5, max_samples=10000, min_samples_leaf=9, min_samples_split=5, n_estimators=90;, score=0.630 total time=   9.7s
[CV 1/3] END max_depth=10, max_features=1.0, max_samples=10000, min_samples_leaf=2, min_samples_split=8, n_estimators=180;, score=0.536 total time=  28.3s
[CV 2/3] END max_depth=10, max_features=1.0, max_samples=10000, min_samples_leaf=2, min_samples_split=8, n_estimators=180;, score=0.721 total time=  28.6s
[CV 3/3] END max_depth=10, max_features=1.0, max_samples=10000, min_samples_leaf=2, min_samples_split=8, n_estimators=180;, score=0.597 total time=  28.1s
[CV 1/3] END max_depth=None, max_features=sqrt, max_samples=10000, min_samples_leaf=2, min_samples_split=3, n_estimators=150;, score=0.539 total time=   9.0s
[CV 2/3] END max_depth=None, max_features=sqrt, max_samples=10000, min_samples_leaf=2, min_samples_split=3, n_estimators=150;, score=0.733 total time=  10.7s
[CV 3/3] END max_depth=None, max_features=sqrt, max_samples=10000, min_samples_leaf=2, min_samples_split=3, n_estimators=150;, score=0.643 total time=   8.9s
CPU times: user 8min 6s, sys: 25min 58s, total: 34min 4s
Wall time: 11min 54s

Phew! That's quite a bit of testing!

Good news for us is that we can check the best hyperparameters with the best_params_ attribute.

In [113]:

# Find the best parameters from RandomizedSearchCV
rs_model.best_params_

Out[113]:

{'n_estimators': np.int64(80),
 'min_samples_split': np.int64(3),
 'min_samples_leaf': np.int64(2),
 'max_samples': 10000,
 'max_features': 0.5,
 'max_depth': 20}

And we can evaluate this model with our show_scores function.

In [114]:

# Evaluate the RandomizedSearch model
rs_model_scores = show_scores(rs_model)
rs_model_scores

Out[114]:

{'Training MAE': np.float64(5804.886346446167),
 'Valid MAE': np.float64(7271.010705137403),
 'Training RMSLE': np.float64(0.2668477962708691),
 'Valid RMSLE': np.float64(0.2985683128197976),
 'Training R^2': 0.8494436266937344,
 'Valid R^2': 0.8280568050158131}

5.3 Training a model with the best hyperparameters

Like all good machine learning cooking shows, I prepared a model earlier.

I tried 100 different combinations of hyperparameters (setting n_iter=100 in RandomizedSearchCV) and found the best results came from the settings below.

n_estimators=90
max_depth=None
min_samples_leaf=1
min_samples_split=5
max_features=0.5
n_jobs=-1
max_samples=None

Note: This search (n_iter=100) took ~2-hours on my MacBook Pro M1 Pro. So it's kind of a set and come back later experiment. That's one of the things you'll have to get used to as a machine learning engineer, figuring out what to do whilst your model trains. I like to go for long walks or to the gym (rule of thumb: while my model trains, I train).

We'll instantiate a new model with these discovered hyperparameters and reset the max_samples back to its original value.

In [115]:

%%time

# Create a model with best found hyperparameters 
# Note: There may be better values out there with longer searches but these are 
# the best I found with a ~2 hour search. A good challenge would be to see if you 
# can find better values.
ideal_model = RandomForestRegressor(n_estimators=90,
                                    max_depth=None,
                                    min_samples_leaf=1,
                                    min_samples_split=5,
                                    max_features=0.5,
                                    n_jobs=-1,
                                    max_samples=None)

# Fit a model to the preprocessed data
ideal_model.fit(X=X_train_preprocessed, 
                y=y_train)

Out[115]:

CPU times: user 4min 6s, sys: 4min 34s, total: 8min 40s
Wall time: 2min

And of course, we can evaluate our ideal_model with our show_scores function.

In [116]:

%%time

# Evaluate ideal model
ideal_model_scores = show_scores(model=ideal_model)
ideal_model_scores

Out[116]:

CPU times: user 28.8 s, sys: 37.4 s, total: 1min 6s
Wall time: 14.5 s

{'Training MAE': np.float64(1955.980118634043),
 'Valid MAE': np.float64(5979.47564414195),
 'Training RMSLE': np.float64(0.10224456852444506),
 'Valid RMSLE': np.float64(0.24733387014318542),
 'Training R^2': 0.9809704227866279,
 'Valid R^2': 0.8810497144604977}

Woohoo!

With these new hyperparameters as well as using all the samples, we can see an improvement to our models performance.

One thing to keep in mind is that a larger model isn't always the best for a given problem even if it performs better.

For example, you may require a model that performs inference (makes predictions) very fast with a slight tradeoff to performance.

One way to a faster model is by altering some of the hyperparameters to create a smaller overall model.

Particularly by lowering n_estimators since each increase in n_estimators is basically building another small model.

Let's half our n_estimators value and see how it goes.

In [117]:

%%time

# Halve the number of estimators
fast_model = RandomForestRegressor(n_estimators=45,
                                   max_depth=None,
                                   min_samples_leaf=1,
                                   min_samples_split=5,
                                   max_features=0.5,
                                   n_jobs=-1,
                                   max_samples=None)

# Fit the faster model to the data
fast_model.fit(X=X_train_preprocessed, 
               y=y_train)

Out[117]:

CPU times: user 2min, sys: 1min 58s, total: 3min 58s
Wall time: 44.9 s

Nice! The faster model fits to the training data in about half the time of the full model.

Now how does it go on performance?

In [118]:

%%time

# Get results from the fast model
fast_model_scores = show_scores(model=fast_model)
fast_model_scores

Out[118]:

CPU times: user 14.6 s, sys: 23.7 s, total: 38.3 s
Wall time: 9.59 s

{'Training MAE': np.float64(1989.0544948757317),
 'Valid MAE': np.float64(6029.137329100962),
 'Training RMSLE': np.float64(0.10373049008046713),
 'Valid RMSLE': np.float64(0.24897544966690316),
 'Training R^2': 0.9802744452357592,
 'Valid R^2': 0.8788749110488039}

Woah! Looks like our faster model evaluates (performs inference/makes predictions) in about half the time too.

And only for a small tradeoff in validation RMSLE performance.

5.4 Comparing our model's scores

We've built four models so far with varying amounts of data and hyperparameters.

Let's compile the results into a DataFrame and then make a plot to compare them.

In [119]:

# Add names of models to dictionaries
base_model_scores["model_name"] = "default_model"
rs_model_scores["model_name"] = "random_search_model"
ideal_model_scores["model_name"] = "ideal_model" 
fast_model_scores["model_name"] = "fast_model" 

# Turn all model score dictionaries into a list
all_model_scores = [base_model_scores, 
                    rs_model_scores, 
                    ideal_model_scores,
                    fast_model_scores]

# Create DataFrame and sort model scores by validation RMSLE
model_comparison_df = pd.DataFrame(all_model_scores).sort_values(by="Valid RMSLE", ascending=False)
model_comparison_df.head()

Out[119]:

Now we've got our model result data in a DataFrame, let's turn it into a bar plot comparing the validation RMSLE of each model.

In [120]:

# Get mean RSMLE score of all models
mean_rsmle_score = model_comparison_df["Valid RMSLE"].mean()

# Plot validation RMSLE against each other 
plt.figure(figsize=(10, 5))
plt.bar(x=model_comparison_df["model_name"],
        height=model_comparison_df["Valid RMSLE"].values)
plt.xlabel("Model")
plt.ylabel("Validation RMSLE (lower is better)")
plt.xticks(rotation=0, fontsize=10);
plt.axhline(y=mean_rsmle_score, 
            color="red", 
            linestyle="--", 
            label=f"Mean RMSLE: {mean_rsmle_score:.4f}")
plt.legend();

Out[120]:

By the looks of the plot, our ideal_model is indeed the ideal model, slightly edging out fast_model in terms of validation RMSLE.

6. Saving our best model to file

Since we've confirmed our best model as our ideal_model object, we can save it to file so we can load it in later and use it without having to retrain it.

Note: For more on model saving options with Scikit-Learn, see the documentation on model persistence.

To save our model we can use the joblib.dump method.

In [121]:

import joblib

bulldozer_price_prediction_model_name = "randomforest_regressor_best_RMSLE.pkl"

# Save model to file
joblib.dump(value=ideal_model, 
            filename=bulldozer_price_prediction_model_name)

Out[121]:

['randomforest_regressor_best_RMSLE.pkl']

And to load our model we can use the joblib.load method.

In [122]:

# Load the best model
best_model = joblib.load(filename=bulldozer_price_prediction_model_name)
best_model

Out[122]:

We can make sure our model saving and loading worked by evaluating our best_model with show_scores.

In [123]:

# Confirm that the model works
best_model_scores = show_scores(model=best_model)
best_model_scores

Out[123]:

{'Training MAE': np.float64(1955.9801186340424),
 'Valid MAE': np.float64(5979.47564414195),
 'Training RMSLE': np.float64(0.10224456852444506),
 'Valid RMSLE': np.float64(0.24733387014318542),
 'Training R^2': 0.9809704227866279,
 'Valid R^2': 0.8810497144604977}

And to confirm our ideal_model and best_model results are very close (if not the exact same), we can compare them with:

The equality operator ==.
np.iclose and setting the absolute tolerance (atol) to 1e-4.

In [124]:

# See if loaded model and pre-saved model results are the same
# Note: these values may be very slightly different depending on how precise your computer stores values.
best_model_scores["Valid RMSLE"] == ideal_model_scores["Valid RMSLE"]

Out[124]:

np.True_

In [125]:

# Is the loaded model as good as the non-loaded model?
if np.isclose(a=best_model_scores["Valid RMSLE"], 
              b=ideal_model_scores["Valid RMSLE"],
              atol=1e-4): # Make sure values are within 0.0001 of each other
    print(f"[INFO] Model results are close!")
else:
    print(f"[INFO] Model results aren't close, did something go wrong?")

Out[125]:

[INFO] Model results are close!

Note: When saving and loading a model, it is often the case to have very slightly different values at the extremes. For example, the pre-saved model may have an RMSLE of 0.24654150224930685 where as the loaded model may have an RMSLE of 0.24654150224930684 where in this case the values are off by 0.00000000000000001 (a very small number). This is due to the precision of computing and the way computers store values, where numbers are exact but can be represented up to a certain amount of precision. This is why we generally compare results with many decimals using np.isclose rather than the == operator.

7. Making predictions on test data

Now we've got a trained model saved and loaded, it's time to make predictions on the test data.

Our model is trained on data prior to 2011, however, the test data is from May 1 2012 to November 2012.

So what we're doing is trying to use the patterns our model has learned from the training data to predict the sale price of a bulldozer with characteristics it's never seen before but are assumed to be similar to that of those in the training data.

Let's load in the test data from Test.csv, we'll make sure to parse the dates of the saledate column.

In [126]:

# Load the test data
test_df = pd.read_csv(filepath_or_buffer="../data/bluebook-for-bulldozers/Test.csv",
                      parse_dates=["saledate"])
test_df.head()

Out[126]:

You might notice that the test_df is missing the SalePrice column.

That's because that's the variable we're trying to predict based on all of the other variables.

We can make predictions with our best_model using the predict method.

In [127]:

# Let's see how the model goes predicting on the test data
test_preds = best_model.predict(X=test_df)

Out[127]:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[127], line 2
      1 # Let's see how the model goes predicting on the test data
----> 2 test_preds = best_model.predict(X=test_df)
File ~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/ensemble/_forest.py:1063, in ForestRegressor.predict(self, X)
   1061 check_is_fitted(self)
   1062 # Check data
-> 1063 X = self._validate_X_predict(X)
   1065 # Assign chunk of trees to jobs
   1066 n_jobs, _, _ = _partition_estimators(self.n_estimators, self.n_jobs)
File ~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/ensemble/_forest.py:641, in BaseForest._validate_X_predict(self, X)
    638 else:
    639     force_all_finite = True
--> 641 X = self._validate_data(
    642     X,
    643     dtype=DTYPE,
    644     accept_sparse="csr",
    645     reset=False,
    646     force_all_finite=force_all_finite,
    647 )
    648 if issparse(X) and (X.indices.dtype != np.intc or X.indptr.dtype != np.intc):
    649     raise ValueError("No support for np.int64 index based sparse matrices")
File ~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/base.py:608, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, cast_to_ndarray, **check_params)
    537 def _validate_data(
    538     self,
    539     X="no_validation",
   (...)
    544     **check_params,
    545 ):
    546     """Validate input data and set or check the `n_features_in_` attribute.
    547 
    548     Parameters
   (...)
    606         validated.
    607     """
--> 608     self._check_feature_names(X, reset=reset)
    610     if y is None and self._get_tags()["requires_y"]:
    611         raise ValueError(
    612             f"This {self.__class__.__name__} estimator "
    613             "requires y to be passed, but the target y is None."
    614         )
File ~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/base.py:535, in BaseEstimator._check_feature_names(self, X, reset)
    530 if not missing_names and not unexpected_names:
    531     message += (
    532         "Feature names must be in the same order as they were in fit.\n"
    533     )
--> 535 raise ValueError(message)
ValueError: The feature names should match those that were passed during fit.
Feature names unseen at fit time:
- saledate
Feature names seen at fit time, yet now missing:
- saleDay
- saleDayofweek
- saleDayofyear
- saleMonth
- saleYear

Oh no!

We get an error:

ValueError: The feature names should match those that were passed during fit. Feature names unseen at fit time:
saledate Feature names seen at fit time, yet now missing:
saleDay
saleDayofweek
saleDayofyear
saleMonth
saleYear

Ahhh... the test data isn't in the same format of our other data, so we have to fix it.

7.1 Preprocessing the test data (to be in the same format as the training data)

Our model has been trained on data preprocessed in a certain way.

This means in order to make predictions on the test data, we need to take the same steps we used to preprocess the training data to preprocess the test data.

Remember, whatever you do to preprocess the training data, you have to do to the test data.

Let's recreate the steps we used for preprocessing the training data except this time we'll do it on the test data.

First, we'll add the extra date features to breakdown the saledate column.

In [128]:

# Make a function to add date columns
def add_datetime_features_to_df(df, date_column="saledate"):
    # Add datetime parameters for saledate
    df["saleYear"] = df[date_column].dt.year
    df["saleMonth"] = df[date_column].dt.month
    df["saleDay"] = df[date_column].dt.day
    df["saleDayofweek"] = df[date_column].dt.dayofweek
    df["saleDayofyear"] = df[date_column].dt.dayofyear

    # Drop original saledate column
    df.drop("saledate", axis=1, inplace=True)

    return df

# Preprocess test_df to have same columns as train_df (add the datetime features)
test_df = add_datetime_features_to_df(df=test_df)
test_df.head()

Out[128]:

Date features added!

Now can we make predictions with our model on the test data?

In [129]:

# Try to predict with model
test_preds = best_model.predict(test_df)

Out[129]:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/c4/qj4gdk190td18bqvjjh0p3p00000gn/T/ipykernel_20423/2042912174.py in ?()
# Try to predict with model
----> 2 test_preds = best_model.predict(test_df)

~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/ensemble/_forest.py in ?(self, X)
           The predicted values.
       """
       check_is_fitted(self)
       # Check data
-> 1063         X = self._validate_X_predict(X)

       # Assign chunk of trees to jobs
       n_jobs, _, _ = _partition_estimators(self.n_estimators, self.n_jobs)
~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/ensemble/_forest.py in ?(self, X)
           force_all_finite = "allow-nan"
       else:
           force_all_finite = True

--> 641         X = self._validate_data(
           X,
           dtype=DTYPE,
           accept_sparse="csr",
~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/base.py in ?(self, X, y, reset, validate_separately, cast_to_ndarray, **check_params)
               out = y
           else:
               out = X, y
       elif not no_val_X and no_val_y:
--> 633             out = check_array(X, input_name="X", **check_params)
       elif no_val_X and not no_val_y:
           out = _check_y(y, **check_params)
       else:
~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/utils/validation.py in ?(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_writeable, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
                       )
                   array = xp.astype(array, dtype, copy=False)
               else:
                   array = _asarray_with_order(array, order=order, dtype=dtype, xp=xp)
-> 1013             except ComplexWarning as complex_warning:
               raise ValueError(
                   "Complex data not supported\n{}\n".format(array)
               ) from complex_warning
~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/utils/_array_api.py in ?(array, dtype, order, copy, xp, device)
       # Use NumPy API to support order
       if copy is True:
           array = numpy.array(array, order=order, dtype=dtype)
       else:
--> 751             array = numpy.asarray(array, order=order, dtype=dtype)

       # At this point array is a NumPy ndarray. We convert it to an array
       # container that is consistent with the input's namespace.
~/miniforge3/envs/ai/lib/python3.11/site-packages/pandas/core/generic.py in ?(self, dtype, copy)
   def __array__(
       self, dtype: npt.DTypeLike | None = None, copy: bool_t | None = None
   ) -> np.ndarray:
       values = self._values
-> 2153         arr = np.asarray(values, dtype=dtype)
       if (
           astype_is_view(values.dtype, arr.dtype)
           and using_copy_on_write()
ValueError: could not convert string to float: 'Low'

Another error...

ValueError: could not convert string to float: 'Low'

We can fix this by running our ordinal_encoder (that we used to preprocess the training data) on the categorical features in our test DataFrame.

In [130]:

# Create a copy of the test DataFrame to keep the original intact
test_df_preprocessed = test_df.copy()

# Transform the categorical features of the test DataFrame into numbers
test_df_preprocessed[categorical_features] = ordinal_encoder.transform(test_df_preprocessed[categorical_features].astype(str))
test_df_preprocessed.info()

Out[130]:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12457 entries, 0 to 12456
Data columns (total 56 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 SalesID                   12457 non-null  int64  
 MachineID                 12457 non-null  int64  
 ModelID                   12457 non-null  int64  
 datasource                12457 non-null  int64  
 auctioneerID              12457 non-null  int64  
 YearMade                  12457 non-null  int64  
 MachineHoursCurrentMeter  2129 non-null   float64
 UsageBand                 12457 non-null  float64
 fiModelDesc               12349 non-null  float64
 fiBaseModel               12431 non-null  float64
fiSecondaryDesc           12449 non-null  float64
fiModelSeries             12456 non-null  float64
fiModelDescriptor         12452 non-null  float64
ProductSize               12457 non-null  float64
fiProductClassDesc        12457 non-null  float64
state                     12457 non-null  float64
ProductGroup              12457 non-null  float64
ProductGroupDesc          12457 non-null  float64
Drive_System              12457 non-null  float64
Enclosure                 12457 non-null  float64
Forks                     12457 non-null  float64
Pad_Type                  12457 non-null  float64
Ride_Control              12457 non-null  float64
Stick                     12457 non-null  float64
Transmission              12457 non-null  float64
Turbocharged              12457 non-null  float64
Blade_Extension           12457 non-null  float64
Blade_Width               12457 non-null  float64
Enclosure_Type            12457 non-null  float64
Engine_Horsepower         12457 non-null  float64
Hydraulics                12457 non-null  float64
Pushblock                 12457 non-null  float64
Ripper                    12457 non-null  float64
Scarifier                 12457 non-null  float64
Tip_Control               12457 non-null  float64
Tire_Size                 12457 non-null  float64
Coupler                   12457 non-null  float64
Coupler_System            12457 non-null  float64
Grouser_Tracks            12457 non-null  float64
Hydraulics_Flow           12457 non-null  float64
Track_Type                12457 non-null  float64
Undercarriage_Pad_Width   12457 non-null  float64
Stick_Length              12457 non-null  float64
Thumb                     12457 non-null  float64
Pattern_Changer           12457 non-null  float64
Grouser_Type              12457 non-null  float64
Backhoe_Mounting          12457 non-null  float64
Blade_Type                12457 non-null  float64
Travel_Controls           12457 non-null  float64
Differential_Type         12457 non-null  float64
Steering_Controls         12457 non-null  float64
saleYear                  12457 non-null  int32  
saleMonth                 12457 non-null  int32  
saleDay                   12457 non-null  int32  
saleDayofweek             12457 non-null  int32  
saleDayofyear             12457 non-null  int32  
dtypes: float64(45), int32(5), int64(6)
memory usage: 5.1 MB

Ok, date features created and categorical features turned into numbers, can we make predictions on the test data now?

In [131]:

# Make predictions on the preprocessed test data
test_preds = best_model.predict(test_df_preprocessed)

Holy smokes! It worked!

Let's check out our test_preds.

In [132]:

# Check the first 10 test predictions
test_preds[:10]

Out[132]:

array([14384.79497354, 31377.65862841, 48589.23540965, 95857.57194966,
       26910.53992304, 29401.41534392, 27061.53819945, 20377.23364598,
       17325.67857143, 33646.67768959])

Wonderful, looks like we're getting the price predictions of a given bulldozer.

How many predictions are there?

In [133]:

# Check number of test predictions
test_preds.shape, test_df.shape

Out[133]:

((12457,), (12457, 56))

Perfect, looks like theres one prediction per sample in the test DataFrame.

Now how would we submit our predictions to Kaggle?

Well, when looking at the Kaggle submission requirements, we see that if we wanted to make a submission, the data is required to be in a certain format.

Namely, a DataFrame containing the SalesID and the predicted SalePrice of the bulldozer.

Let's make it.

In [134]:

# Create DataFrame compatible with Kaggle submission requirements
pred_df = pd.DataFrame()
pred_df["SalesID"] = test_df["SalesID"]
pred_df["SalePrice"] = test_preds
pred_df.sample(5)

Out[134]:

Excellent! We've got a SalePrice prediction for every SalesID in the test DataFrame.

Let's save this to CSV so we could upload it or share it with someone else if we had to.

In [135]:

# Export test dataset predictions to CSV
pred_df.to_csv("../data/bluebook-for-bulldozers/predictions.csv",
               index=False)

8. Making a prediction on a custom sample

We've made predictions on the test dataset which contains sale data from May to November 2012.

But how does our model go on a more recent bulldozer sale?

If we were to find an advertisement on a bulldozer sale, could we use our model on the information in the advertisement to predict the sale price?

In other words, how could we use our model on a single custom sample?

It's one thing to predict on data that has already been formatted but it's another thing to be able to predict a on a completely new and unseen sample.

Note: For predicting on a custom sample, the same rules apply as making predictions on the test dataset. The data you make predictions on should be in the same format that your model was trained on. For example, it should have all the same features and the numerical encodings should be in the same ballpark (e.g. preprocessed by the ordinal_encoder we fit to the training set). It's likely that samples you collect from the wild may not be as well formatted as samples in a pre-existing dataset. So it's the job of the machine learning engineer to be able to format/preprocess new samples in the same way a model was trained on.

If we're going to make a prediction on a custom sample, it'll need to be in the same format as our other datasets.

So let's remind ourselves of the columns/features in our test dataset.

In [136]:

# Get example from test_df
test_df_preprocessed_sample = test_df_preprocessed.sample(n=1, random_state=42)

# Turn back into original format
test_df_unpreprocessed_sample = test_df_preprocessed_sample.copy() 
test_df_unpreprocessed_sample[categorical_features] = ordinal_encoder.inverse_transform(test_df_unpreprocessed_sample[categorical_features])
test_df_unpreprocessed_sample.to_dict(orient="records")

Out[136]:

[{'SalesID': 1229148,
  'MachineID': 1042578,
  'ModelID': 9579,
  'datasource': 121,
  'auctioneerID': 3,
  'YearMade': 2004,
  'MachineHoursCurrentMeter': 3290.0,
  'UsageBand': 'Medium',
  'fiModelDesc': 'S250',
  'fiBaseModel': 'S250',
  'fiSecondaryDesc': 'nan',
  'fiModelSeries': 'nan',
  'fiModelDescriptor': 'nan',
  'ProductSize': 'nan',
  'fiProductClassDesc': 'Skid Steer Loader - 2201.0 to 2701.0 Lb Operating Capacity',
  'state': 'Missouri',
  'ProductGroup': 'SSL',
  'ProductGroupDesc': 'Skid Steer Loaders',
  'Drive_System': 'nan',
  'Enclosure': 'EROPS',
  'Forks': 'None or Unspecified',
  'Pad_Type': 'nan',
  'Ride_Control': 'nan',
  'Stick': 'nan',
  'Transmission': 'nan',
  'Turbocharged': 'nan',
  'Blade_Extension': 'nan',
  'Blade_Width': 'nan',
  'Enclosure_Type': 'nan',
  'Engine_Horsepower': 'nan',
  'Hydraulics': 'Auxiliary',
  'Pushblock': 'nan',
  'Ripper': 'nan',
  'Scarifier': 'nan',
  'Tip_Control': 'nan',
  'Tire_Size': 'nan',
  'Coupler': 'Hydraulic',
  'Coupler_System': 'Yes',
  'Grouser_Tracks': 'None or Unspecified',
  'Hydraulics_Flow': 'Standard',
  'Track_Type': 'nan',
  'Undercarriage_Pad_Width': 'nan',
  'Stick_Length': 'nan',
  'Thumb': 'nan',
  'Pattern_Changer': 'nan',
  'Grouser_Type': 'nan',
  'Backhoe_Mounting': 'nan',
  'Blade_Type': 'nan',
  'Travel_Controls': 'nan',
  'Differential_Type': 'nan',
  'Steering_Controls': 'nan',
  'saleYear': 2012,
  'saleMonth': 6,
  'saleDay': 15,
  'saleDayofweek': 4,
  'saleDayofyear': 167}]

Wonderful, so if we're going to make a prediction on a custom sample, we'll need to fill out these details as much as we can.

Let's try and make a prediction on the example test sample.

In [137]:

# Make a prediction on the preprocessed test sample
best_model.predict(test_df_preprocessed_sample)

Out[137]:

array([13519.31657848])

Nice!

We get an output array containing a predicted SalePrice.

Let's now try it on a custom sample.

Again, like all good machine learning cooking shows, I've searched the internet for "bulldozer sales in America" and found a sale from 6th July 2024 (I'm writing these materials in mid 2024 so if it's many years in the future and the link doesn't work, check out the screenshot below).

| Image of a 2004 Caterpillar D6R XL dozer from an online auction listing. The dozer is yellow, has an enclosed cab with AC and heat, and a large front blade. Key specs shown include 11,770 hours, a Caterpillar C9 diesel engine, and joystick steering. Tracks are 22 inches wide with single grouser pads. The contract price is $72,600. | |:--😐 | Screenshot of a bulldozer sale advertisement. I took information from this advertisement to create our own custom sample for testing our machine learning model on data from the wild. Source. |

I went through the advertisement online and collected as much detail as I could and formatted the dictionary below with all of the related fields.

It may not be perfect but data in the real world is rarely perfect.

For values I couldn't find or were inconspicuous, I filled them with np.nan (or NaN).

Some values such as SalesID were unobtainable because they were part of the original collected dataset, for these I've also used np.nan.

Also notice how I've already created the extra date features saleYear, saleMonth, saleDay and more by manually breaking down the listed sale date of 6 July 2024.

In [138]:

# Create a dictionary of features and values from an internet-based bulldozer advertisement
# See link: https://www.purplewave.com/auction/240606/item/EK8504/2004-Caterpillar-D6R_XL-Crawlers-Crawler_Dozer-Missouri (note: this link is/was valid as of October 2024 but may be invalid in the future)
custom_sample = {
  "SalesID": np.nan,
  "MachineID": 8504,
  "ModelID": np.nan,
  "datasource": np.nan,
  "auctioneerID": np.nan,
  "YearMade": 2004,
  "MachineHoursCurrentMeter": 11770.0,
  "UsageBand": "High",
  "fiModelDesc": "D6RXL",
  "fiBaseModel": "D6",
  "fiSecondaryDesc": "XL",
  "fiModelSeries": np.nan,
  "fiModelDescriptor": np.nan,
  "ProductSize": "Medium",
  "fiProductClassDesc": "Track Type Tractor, Dozer - 130.0 to 160.0 Horsepower",
  "state": "Missouri",
  "ProductGroup": "TTT",
  "ProductGroupDesc": "Track Type Tractors",
  "Drive_System": "No",
  "Enclosure": "EROPS",
  "Forks": "None or Unspecified",
  "Pad_Type": "Grouser",
  "Ride_Control": "None or Unspecified",
  "Stick": "nan",
  "Transmission": "Powershift",
  "Turbocharged": "None or Unspecified",
  "Blade_Extension": "None or Unspecified",
  "Blade_Width": np.nan,
  "Enclosure_Type": np.nan,
  "Engine_Horsepower": np.nan,
  "Hydraulics": np.nan,
  "Pushblock": "None or Unspecified",
  "Ripper": "None or Unspecified",
  "Scarifier": "None or Unspecified",
  "Tip_Control": "Tip",
  "Tire_Size": np.nan,
  "Coupler": np.nan,
  "Coupler_System": np.nan,
  "Grouser_Tracks": "Yes",
  "Hydraulics_Flow": np.nan,
  "Track_Type": "Steel",
  "Undercarriage_Pad_Width": "22 inch",
  "Stick_Length": np.nan,
  "Thumb": np.nan,
  "Pattern_Changer": np.nan,
  "Grouser_Type": "Single",
  "Backhoe_Mounting": "None or Unspecified",
  "Blade_Type": "Semi U",
  "Travel_Controls": np.nan,
  "Differential_Type": np.nan,
  "Steering_Controls": "Command Control",
  "saleYear": 2024,
  "saleMonth": 6,
  "saleDay": 7,
  "saleDayofweek": 5,
  "saleDayofyear": 159
}

Now we've got a single custom sample in the form of a dictionary, we can turn it into a DataFrame.

In [139]:

# Turn single sample in a DataFrame
custom_sample_df = pd.DataFrame(custom_sample, index=[0])
custom_sample_df.head()

Out[139]:

And of course, we can preprocess the categoricial features using our ordinal_encoder (we use the same instance of OrdinalEncoder as we trained on the training dataset).

In [140]:

# Transform the categorical features of the custom sample
custom_sample_df[categorical_features] = ordinal_encoder.transform(custom_sample_df[categorical_features].astype(str))
custom_sample_df

Out[140]:

Custom sample preprocessed, let's make a prediction!

In [141]:

# Make a prediction on the preprocessed custom sample
custom_sample_pred = best_model.predict(custom_sample_df)
print(f"[INFO] Predicted sale price of custom sample: ${round(custom_sample_pred[0], 2)}")

Out[141]:

[INFO] Predicted sale price of custom sample: $51474.96

Now how close was this to the actual sale price (listed on the advertisement) of $72,600?

In [142]:

from sklearn.metrics import mean_absolute_error, root_mean_squared_log_error

# Evaluate our model versus the actual sale price
custom_sample_actual_sale_price = [72600] # this is the sale price listed on the advertisement

print(f"[INFO] Model MAE on custom sample: {mean_absolute_error(y_pred=custom_sample_pred, y_true=custom_sample_actual_sale_price)}")
print(f"[INFO] Model RMSLE on custom sample: {root_mean_squared_log_error(y_pred=custom_sample_pred, y_true=custom_sample_actual_sale_price)}")

Out[142]:

[INFO] Model MAE on custom sample: 21125.040564373892
[INFO] Model RMSLE on custom sample: 0.3438638042344433

Woah!

We get a quite high MAE value, however, it looks like our model's RMSLE performance on the custom sample was even better than the best_model on the validation dataset.

Not too bad for a model trained on sales data over 12 years older than our custom sample's sale date.

Note: In practice, to make this process easier, rather than manually typing out all of the feature values by hand, you might want to create an application capable of ingesting these values in a nice user interface. To create such machine learning applications, I'd practice by checking out Streamlit or Gradio.

9. Finding the most important predictive features

Since we've built a model which is able to make predictions, the people you share these predictions with (or yourself) might be curious of what parts of the data led to these predictions.

This is where feature importance comes in.

Feature importance seeks to figure out which different attributes of the data were most important when it comes to predicting the target variable.

In our case, after our model learned the patterns in the data, which bulldozer sale attributes were most important for predicting its overall sale price?

We can do this for our sklearn.ensemble.RandomForestRegressor instance using the feature_importances_ attribute.

Let's check it out.

In [143]:

# Find feature importance of our best model
best_model_feature_importances = best_model.feature_importances_
best_model_feature_importances

Out[143]:

array([3.78948522e-02, 2.70954102e-02, 5.85804002e-02, 1.79438322e-03,
       5.25621132e-03, 1.92040831e-01, 6.71461619e-03, 1.42137572e-03,
       4.79324438e-02, 4.73967258e-02, 4.12235661e-02, 4.75379381e-03,
       2.55283197e-02, 1.60578799e-01, 5.08919397e-02, 8.34245434e-03,
       3.43077232e-03, 4.16871935e-03, 1.45645185e-03, 6.32089976e-02,
       1.93106853e-03, 7.91189110e-04, 2.16468186e-03, 2.42755109e-04,
       1.44729959e-03, 1.10292279e-04, 4.69525167e-03, 4.70046399e-03,
       2.18877572e-03, 4.03668217e-03, 4.46781002e-03, 2.86947732e-03,
       5.20668987e-03, 3.50894384e-03, 1.75215277e-03, 1.16769900e-02,
       1.84682779e-03, 2.08450645e-02, 1.17370327e-02, 5.26785421e-03,
       2.07101299e-03, 1.36424627e-03, 1.60680297e-03, 9.71604299e-04,
       7.85735364e-04, 7.29302663e-04, 6.74283032e-04, 3.28828690e-03,
       2.83781098e-03, 3.92432694e-04, 4.73081800e-04, 7.26042221e-02,
       5.42512552e-03, 8.54840059e-03, 4.42773246e-03, 1.26015531e-02])

Woah, looks like we get one value per feature in our dataset.

In [144]:

print(f"[INFO] Number of feature importance values: {best_model_feature_importances.shape[0]}") 
print(f"[INFO] Number of features in training dataset: {X_train_preprocessed.shape[1]}")

Out[144]:

[INFO] Number of feature importance values: 56
[INFO] Number of features in training dataset: 56

We can inspect these further by turning them into a DataFrame.

We'll sort it descending order so we can see which feature our model is assigning the highest value.

In [145]:

# Create feature importance DataFrame
column_names = test_df.columns
feature_importance_df = pd.DataFrame({"feature_names": column_names,
                                      "feature_importance": best_model_feature_importances}).sort_values(by="feature_importance",
                                                                                                         ascending=False)
feature_importance_df.head()

Out[145]:

Hmmm... looks like YearMade may be contributing the most value in the model's eyes.

How about we turn our DataFrame into a plot to compare values?

In [146]:

# Plot the top feature importance values
top_n = 20
plt.figure(figsize=(10, 5))
plt.barh(y=feature_importance_df["feature_names"][:top_n], # Plot the top_n feature importance values
         width=feature_importance_df["feature_importance"][:top_n])
plt.title(f"Top {top_n} Feature Importance Values for Best RandomForestRegressor Model")
plt.xlabel("Feature importance value")
plt.ylabel("Feature name")
plt.gca().invert_yaxis();

Out[146]:

Ok, looks like the top 4 features contributing to our model's predictions are YearMade, ProductSize, Enclosure and saleYear.

Referring to the original data dictionary, do these values make sense to be contributing the most to the model?

YearMade - Year of manufacture of the machine.
ProductSize - Size of the bulldozer.
Enclosure - Type of bulldozer enclosure (e.g. OROPS = Open Rollover Protective Structures, EROPS = Enclosed Rollover Protective Structures).
saleYear - The year the bulldozer was sold (this is one of our engineered features from saledate).

Now I've never sold a bulldozer but reading about each of these values seems to make sense that they would contribute significantly to the sale price.

I know when I've bought cars in the past, the year that is was made was an important part of my decision.

And it also makes sense that ProductSize be an important feature when deciding on the price of a bulldozer.

Let's check out the unique values for ProductSize and Enclosure.

In [147]:

print(f"[INFO] Unique ProductSize values: {train_df['ProductSize'].unique()}")
print(f"[INFO] Unique Enclosure values: {train_df['Enclosure'].unique()}")

Out[147]:

[INFO] Unique ProductSize values: ['Medium' nan 'Compact' 'Small' 'Large' 'Large / Medium' 'Mini']
[INFO] Unique Enclosure values: ['OROPS' 'EROPS' 'EROPS w AC' nan 'EROPS AC' 'NO ROPS'
 'None or Unspecified']

My guess is that a bulldozer with a ProductSize of 'Mini' would sell for less than a bulldozer with a size of 'Large'.

We could investigate this further in an extension to model driven data exploratory analysis or we could take this information to a colleague or client to discuss further.

Either way, we've now got a machine learning model capable of predicting the sale price of bulldozers given their features/attributes!

That's a huuuuuuge effort!

And you should be very proud of yourself for making it this far.

Summary

We've covered a lot of ground.

But there are some main takeaways to go over.

Every machine learning problem is different - Since machine learning is such a widespread technology, it can be used for a multitude of different problems. In saying this, there will often be many different ways to approach a problem. In this example, we've focused on predicting a number, which is a regression problem. And since our data had a time component, it could also be considered a time series problem.
The machine learner's motto: Experiment, experiment, experiment! - Since there are many different ways to approach machine learning problems, one of the best habits you can develop is an experimental mindset. That means not being afraid to try new things over and over. Because the more things you try, the quicker you can figure what doesn't work and the quicker you can start to move towards what does.
Always keep the test set separate - If you can't evaluate your model on unseen data, how would you know how it will perform in the real world on future unseen data? Of course, using a test set isn't a perfect replica of the real world but if it's done right, it can give you a good idea. Because evaluating a model is just as important as training a model.
If you've trained a model on a data in a certain format, you'll have to make predictions in the same format - Any preprocessing you do to the training dataset, you'll have to do to the validation, test and custom data. Any computed values should happen on the training set only and then be used to update any subsequent datasets.

Exercises

Fill the missing values in the numeric columns with the median using Scikit-Learn and see if that helps our best model's performance (hint: see sklearn.impute.SimpleImputer for more).
Try putting multiple steps together (e.g. preprocessing -> modelling) with Scikit-Learn's sklearn.pipeline.Pipeline features.
Try using another regression model/estimator on our preprocessed dataset and see how it goes. See the Scikit-Learn machine learning map for potential model options.
Try replacing the sklearn.preprocessing.OrdinalEncoder we used for the categorical variables with sklearn.preprocessing.OneHotEncoder (you may even want to do this within a pipeline) with the sklearn.ensemble.RandomForestRegressor model and see how it performs. Which is better for our specific dataset?

Extra-curriculum

The following resources are suggested extra reading and activities to add backing to the materials we've covered in this project.

Reading documentation and knowing where to find information is one of the best skills you can develop as an engineer.

Read the pandas IO tools documentation page for an idea of all the possible ways to get data in and out of pandas.
See all of the available datatypes in the pandas user guide (knowing what type your data is in can help prevent a lot of future errors).
Read the Scikit-Learn dataset transformations and data preprocessing guide for an overview of all the different ways you can preprocess and transform data.
For more on saving and loading model objects with Scikit-Learn, see the documentation on model persistence.
For more on the importance of creating good validation and test sets, I'd recommend reading How (and why) to create a good validation set by Rachel Thomas as well as The importance of a test set by Daniel Bourke.
We've covered a handful of models in the Scikit-Learn library, however, there are some other ML models which are worth exploring such as CatBoost and XGBoost. Both of these models can handle missing values and are often touted as some of the most performant ML models on the market. A good extension would be to try get one of them working on our bulldozer data.
- Bonus: You can also see a list of models in Scikit-Learn which can handle missing/NaN values.

Example Exercise Solutions

The following are examples of how to solve the above exercises.

1. Fill the missing values in the numeric columns with the median using Scikit-Learn and see if that helps our best model's performance

In [244]:

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import mean_absolute_error, root_mean_squared_log_error
from sklearn.ensemble import RandomForestRegressor

# Import train samples (making sure to parse dates and then sort by them)
train_df = pd.read_csv(filepath_or_buffer="../data/bluebook-for-bulldozers/Train.csv",
                       parse_dates=["saledate"],
                       low_memory=False).sort_values(by="saledate", ascending=True)

# Import validation samples (making sure to parse dates and then sort by them)
valid_df = pd.read_csv(filepath_or_buffer="../data/bluebook-for-bulldozers/Valid.csv",
                       parse_dates=["saledate"])

# The ValidSolution.csv contains the SalePrice values for the samples in Valid.csv
valid_solution = pd.read_csv(filepath_or_buffer="../data/bluebook-for-bulldozers/ValidSolution.csv")

# Map valid_solution to valid_df
valid_df["SalePrice"] = valid_df["SalesID"].map(valid_solution.set_index("SalesID")["SalePrice"])

# Make sure valid_df is sorted by saledate still
valid_df = valid_df.sort_values("saledate", ascending=True).reset_index(drop=True)

# How many samples are in each DataFrame?
print(f"[INFO] Number of samples in training DataFrame: {len(train_df)}")
print(f"[INFO] Number of samples in validation DataFrame: {len(valid_df)}")

# Make a function to add date columns
def add_datetime_features_to_df(df, date_column="saledate"):
    # Add datetime parameters for saledate
    df["saleYear"] = df[date_column].dt.year
    df["saleMonth"] = df[date_column].dt.month
    df["saleDay"] = df[date_column].dt.day
    df["saleDayofweek"] = df[date_column].dt.dayofweek
    df["saleDayofyear"] = df[date_column].dt.dayofyear

    # Drop original saledate column
    df.drop("saledate", axis=1, inplace=True)

    return df

# Add datetime features to DataFrames
train_df = add_datetime_features_to_df(df=train_df)
valid_df = add_datetime_features_to_df(df=valid_df)

# Split training data into features and labels
X_train = train_df.drop("SalePrice", axis=1)
y_train = train_df["SalePrice"]

# Split validation data into features and labels
X_valid = valid_df.drop("SalePrice", axis=1)
y_valid = valid_df["SalePrice"]

# Define numerical and categorical features
numeric_features = [label for label, content in X_train.items() if pd.api.types.is_numeric_dtype(content)]
categorical_features = [label for label, content in X_train.items() if not pd.api.types.is_numeric_dtype(content)]

### Filling missing values ### 

# Create an ordinal encoder (turns category items into numeric representation)
ordinal_encoder = OrdinalEncoder(categories="auto",
                                 handle_unknown="use_encoded_value",
                                 unknown_value=np.nan,
                                 encoded_missing_value=np.nan) # treat unknown categories as np.nan (or None)

# Create a simple imputer to fill missing values with median
simple_imputer_median = SimpleImputer(missing_values=np.nan,
                                      strategy="median")

# Fit and transform the categorical and numerical columns of X_train
X_train_preprocessed = X_train.copy() # make copies of the oringal DataFrames so we can keep the original values in tact and view them later
X_train_preprocessed[categorical_features] = ordinal_encoder.fit_transform(X_train_preprocessed[categorical_features].astype(str)) # OrdinalEncoder expects all values as the same type (e.g. string or numeric only)
X_train_preprocessed[numerical_features] = simple_imputer_median.fit_transform(X_train_preprocessed[numerical_features])

# Transform the categorical and numerical columns of X_valid 
X_valid_preprocessed = X_valid.copy()
X_valid_preprocessed[categorical_features] = ordinal_encoder.transform(X_valid_preprocessed[categorical_features].astype(str)) # only use `transform` on the validation data
X_valid_preprocessed[numerical_features] = simple_imputer_median.transform(X_valid_preprocessed[numerical_features])

# Create function to evaluate our model
def show_scores(model, 
                train_features=X_train_preprocessed,
                train_labels=y_train,
                valid_features=X_valid_preprocessed,
                valid_labels=y_valid):
    
    # Make predictions on train and validation features
    train_preds = model.predict(X=train_features)
    val_preds = model.predict(X=valid_features)

    # Create a scores dictionary of different evaluation metrics
    scores = {"Training MAE": mean_absolute_error(y_true=train_labels, 
                                                  y_pred=train_preds),
              "Valid MAE": mean_absolute_error(y_true=valid_labels, 
                                               y_pred=val_preds),
              "Training RMSLE": root_mean_squared_log_error(y_true=train_labels, 
                                                            y_pred=train_preds),
              "Valid RMSLE": root_mean_squared_log_error(y_true=valid_labels, 
                                                         y_pred=val_preds),
              "Training R^2": model.score(X=train_features, 
                                          y=train_labels),
              "Valid R^2": model.score(X=valid_features, 
                                       y=valid_labels)}
    return scores

# Instantiate a model with best hyperparameters 
ideal_model_2 = RandomForestRegressor(n_estimators=90,
                                      max_depth=None,
                                      min_samples_leaf=1,
                                      min_samples_split=5,
                                      max_features=0.5,
                                      n_jobs=-1,
                                      max_samples=None)

# Fit a model to the preprocessed data
ideal_model_2.fit(X=X_train_preprocessed, 
                  y=y_train)

# Evalute the model
ideal_model_2_scores = show_scores(model=ideal_model_2)
ideal_model_2_scores

Out[244]:

[INFO] Number of samples in training DataFrame: 401125
[INFO] Number of samples in validation DataFrame: 11573

{'Training MAE': np.float64(1951.2971558280735),
 'Valid MAE': np.float64(5964.025764507629),
 'Training RMSLE': np.float64(0.101909965049995),
 'Valid RMSLE': np.float64(0.24697812443315573),
 'Training R^2': 0.9810825663665007,
 'Valid R^2': 0.8809697755766817}

Looks like filling the missing numeric values made our ideal_model_2 perform slightly worse than our original ideal_model.

ideal_model_2 had a validation RMSLE of 0.24697812443315573 where as ideal_model had a validation RMSLE of 0.24654150224930685.

2. Try putting multiple steps together (e.g. preprocessing -> modelling) with Scikit-Learn's `sklearn.pipeline.Pipeline`

In [247]:

import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error, root_mean_squared_log_error
from sklearn.preprocessing import OrdinalEncoder, FunctionTransformer
from sklearn.pipeline import Pipeline


# Import and prepare data
train_df = pd.read_csv("../data/bluebook-for-bulldozers/Train.csv",
                       parse_dates=["saledate"],
                       low_memory=False).sort_values(by="saledate", ascending=True)

valid_df = pd.read_csv("../data/bluebook-for-bulldozers/Valid.csv",
                       parse_dates=["saledate"])

valid_solution = pd.read_csv("../data/bluebook-for-bulldozers/ValidSolution.csv")
valid_df["SalePrice"] = valid_df["SalesID"].map(valid_solution.set_index("SalesID")["SalePrice"])
valid_df = valid_df.sort_values("saledate", ascending=True).reset_index(drop=True)

# Add datetime features
def add_datetime_features_to_df(df, date_column="saledate"):
    df = df.copy()
    df["saleYear"] = df[date_column].dt.year
    df["saleMonth"] = df[date_column].dt.month
    df["saleDay"] = df[date_column].dt.day
    df["saleDayofweek"] = df[date_column].dt.dayofweek
    df["saleDayofyear"] = df[date_column].dt.dayofyear
    return df.drop(date_column, axis=1)

# Apply datetime features
train_df = add_datetime_features_to_df(train_df)
valid_df = add_datetime_features_to_df(valid_df)

# Split data into features and labels
X_train = train_df.drop("SalePrice", axis=1)
y_train = train_df["SalePrice"]
X_valid = valid_df.drop("SalePrice", axis=1)
y_valid = valid_df["SalePrice"]

# Define feature types
numeric_features = [label for label, content in X_train.items() 
                   if pd.api.types.is_numeric_dtype(content)]
categorical_features = [label for label, content in X_train.items() 
                       if not pd.api.types.is_numeric_dtype(content)]

# Create preprocessing steps
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))
])

categorical_transformer = Pipeline(steps=[
    ('string_converter', FunctionTransformer(lambda x: x.astype(str))), # convert values to string
    ('ordinal', OrdinalEncoder(handle_unknown='use_encoded_value',
                              unknown_value=np.nan,
                              encoded_missing_value=np.nan)),
])

# Create preprocessor using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('numerical_transforms', numeric_transformer, numeric_features),
        ('categorical_transforms', categorical_transformer, categorical_features)
    ])

# Create full pipeline
model_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(
        n_estimators=90,
        max_depth=None,
        min_samples_leaf=1,
        min_samples_split=5,
        max_features=0.5,
        n_jobs=-1,
        max_samples=None
    ))
])

# Function to evaluate the pipeline
def evaluate_pipeline(pipeline, X_train, y_train, X_valid, y_valid):
    # Make predictions
    train_preds = pipeline.predict(X_train)
    valid_preds = pipeline.predict(X_valid)
    
    # Calculate scores
    scores = {
        "Training MAE": mean_absolute_error(y_train, train_preds),
        "Valid MAE": mean_absolute_error(y_valid, valid_preds),
        "Training RMSLE": root_mean_squared_log_error(y_train, train_preds),
        "Valid RMSLE": root_mean_squared_log_error(y_valid, valid_preds),
        "Training R^2": pipeline.score(X_train, y_train),
        "Valid R^2": pipeline.score(X_valid, y_valid)
    }
    return scores

# Fit and evaluate pipeline
model_pipeline.fit(X_train, y_train)
pipeline_scores = evaluate_pipeline(model_pipeline, X_train, y_train, X_valid, y_valid)
print("\nPipeline Scores:")
pipeline_scores

Out[247]:

Pipeline Scores:

{'Training MAE': np.float64(1951.4776197781914),
 'Valid MAE': np.float64(5974.931566226864),
 'Training RMSLE': np.float64(0.10196097739473307),
 'Valid RMSLE': np.float64(0.24760612684722114),
 'Training R^2': 0.9811027965058758,
 'Valid R^2': 0.8807288353268701}

3. Try using another regression model/estimator on our preprocessed dataset and see how it goes

Going to use sklearn.linear_model.HistGradientBoostingRegressor.

In [148]:

import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error, root_mean_squared_log_error
from sklearn.preprocessing import OrdinalEncoder, FunctionTransformer, StandardScaler
from sklearn.pipeline import Pipeline

# Import and prepare data
train_df = pd.read_csv("../data/bluebook-for-bulldozers/Train.csv",
                       parse_dates=["saledate"],
                       low_memory=False).sort_values(by="saledate", ascending=True)

valid_df = pd.read_csv("../data/bluebook-for-bulldozers/Valid.csv",
                       parse_dates=["saledate"])

valid_solution = pd.read_csv("../data/bluebook-for-bulldozers/ValidSolution.csv")
valid_df["SalePrice"] = valid_df["SalesID"].map(valid_solution.set_index("SalesID")["SalePrice"])
valid_df = valid_df.sort_values("saledate", ascending=True).reset_index(drop=True)

# Add datetime features
def add_datetime_features_to_df(df, date_column="saledate"):
    df = df.copy()
    df["saleYear"] = df[date_column].dt.year
    df["saleMonth"] = df[date_column].dt.month
    df["saleDay"] = df[date_column].dt.day
    df["saleDayofweek"] = df[date_column].dt.dayofweek
    df["saleDayofyear"] = df[date_column].dt.dayofyear
    return df.drop(date_column, axis=1)

# Apply datetime features
train_df = add_datetime_features_to_df(train_df)
valid_df = add_datetime_features_to_df(valid_df)

# Split data into features and labels
X_train = train_df.drop("SalePrice", axis=1)
y_train = train_df["SalePrice"]
X_valid = valid_df.drop("SalePrice", axis=1)
y_valid = valid_df["SalePrice"]

# Define feature types
numeric_features = [label for label, content in X_train.items() 
                   if pd.api.types.is_numeric_dtype(content)]
categorical_features = [label for label, content in X_train.items() 
                       if not pd.api.types.is_numeric_dtype(content)]

# Create preprocessing steps for different types of values
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
])

categorical_transformer = Pipeline(steps=[
    ('string_converter', FunctionTransformer(lambda x: x.astype(str))), # convert values to string
    ('ordinal', OrdinalEncoder(categories='auto',
                               handle_unknown='use_encoded_value',
                               unknown_value=np.nan,
                               encoded_missing_value=np.nan)), 
])

# Create preprocessor using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('numerical_transforms', numeric_transformer, numeric_features),
        ('categorical_transforms', categorical_transformer, categorical_features)
    ])

# Create full pipeline
model_pipeline_hist_gradient_boosting_regressor = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', HistGradientBoostingRegressor()) # Change model to HistGradientBoostingRegressor
])

# Function to evaluate the pipeline
def evaluate_pipeline(pipeline, X_train, y_train, X_valid, y_valid):
    # Make predictions
    train_preds = pipeline.predict(X_train)
    valid_preds = pipeline.predict(X_valid)
    
    # Calculate scores
    scores = {
        "Training MAE": mean_absolute_error(y_train, train_preds),
        "Valid MAE": mean_absolute_error(y_valid, valid_preds),
        "Training RMSLE": root_mean_squared_log_error(y_train, train_preds),
        "Valid RMSLE": root_mean_squared_log_error(y_valid, valid_preds),
        "Training R^2": pipeline.score(X_train, y_train),
        "Valid R^2": pipeline.score(X_valid, y_valid)
    }
    return scores

# Fit and evaluate pipeline
print(f"[INFO] Fitting HistGradientBoostingRegressor model with pipeline...")
model_pipeline_hist_gradient_boosting_regressor.fit(X_train, y_train)
print(f"[INFO] Evaluating HistGradientBoostingRegressor model with pipeline...")
pipeline_hist_scores = evaluate_pipeline(model_pipeline_hist_gradient_boosting_regressor, X_train, y_train, X_valid, y_valid)
print("\nPipeline HistGradientBoostingRegressor Scores:")
pipeline_hist_scores

Out[148]:

[INFO] Fitting HistGradientBoostingRegressor model with pipeline...
[INFO] Evaluating HistGradientBoostingRegressor model with pipeline...

Pipeline HistGradientBoostingRegressor Scores:

{'Training MAE': np.float64(5638.6121797753785),
 'Valid MAE': np.float64(7264.258786098576),
 'Training RMSLE': np.float64(0.2691456681483351),
 'Valid RMSLE': np.float64(0.30482586120872424),
 'Training R^2': 0.8646511348082063,
 'Valid R^2': 0.8319021596407035}

4. Try replacing the `sklearn.preprocessing.OrdinalEncoder` we used for the categorical variables with `sklearn.preprocessing.OneHotEncoder`

Note: This may take quite a long time depending on your machine. For example, on my MacBook Pro M1 Pro it took ~10 minutes with n_estimators=10 (9x lower than what we used for our best_model). This is because using sklearn.preprocessing.OneHotEncoder adds many more features to our dataset (each feature gets turned into an array of 0's and 1's for each unique value). And the more features, the longer it takes to compute and find patterns between them.

In [7]:

%%time

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, root_mean_squared_log_error
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

# Import and prepare data
train_df = pd.read_csv("../data/bluebook-for-bulldozers/Train.csv",
                       parse_dates=["saledate"],
                       low_memory=False).sort_values(by="saledate", ascending=True)

valid_df = pd.read_csv("../data/bluebook-for-bulldozers/Valid.csv",
                       parse_dates=["saledate"])

valid_solution = pd.read_csv("../data/bluebook-for-bulldozers/ValidSolution.csv")
valid_df["SalePrice"] = valid_df["SalesID"].map(valid_solution.set_index("SalesID")["SalePrice"])
valid_df = valid_df.sort_values("saledate", ascending=True).reset_index(drop=True)

# Add datetime features
def add_datetime_features_to_df(df, date_column="saledate"):
    df = df.copy()
    df["saleYear"] = df[date_column].dt.year
    df["saleMonth"] = df[date_column].dt.month
    df["saleDay"] = df[date_column].dt.day
    df["saleDayofweek"] = df[date_column].dt.dayofweek
    df["saleDayofyear"] = df[date_column].dt.dayofyear
    return df.drop(date_column, axis=1)

# Apply datetime features
train_df = add_datetime_features_to_df(train_df)
valid_df = add_datetime_features_to_df(valid_df)

# Split data
X_train = train_df.drop("SalePrice", axis=1)
y_train = train_df["SalePrice"]
X_valid = valid_df.drop("SalePrice", axis=1)
y_valid = valid_df["SalePrice"]

# Define feature types
numeric_features = [label for label, content in X_train.items() 
                   if pd.api.types.is_numeric_dtype(content)]
categorical_features = [label for label, content in X_train.items() 
                       if not pd.api.types.is_numeric_dtype(content)]

# Create preprocessing steps
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))
])

categorical_transformer = Pipeline(steps=[
    ('string_converter', FunctionTransformer(lambda x: x.astype(str))),
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')), # fill missing values with the term "missing"
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=True)) # use OneHotEncoder instead of OrdinalEncoder
])

# Create preprocessor using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ],
    verbose_feature_names_out=False  # Simplify feature names
)

# Create full pipeline
model_one_hot_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(
        n_estimators=10,
        max_depth=None,
        min_samples_leaf=1,
        min_samples_split=5,
        max_features=0.5,
        n_jobs=-1,
        max_samples=None
    ))
])

# Function to evaluate the pipeline
def evaluate_pipeline(pipeline, X_train, y_train, X_valid, y_valid):
    # Make predictions
    train_preds = pipeline.predict(X_train)
    valid_preds = pipeline.predict(X_valid)
    
    # Calculate scores
    scores = {
        "Training MAE": mean_absolute_error(y_train, train_preds),
        "Valid MAE": mean_absolute_error(y_valid, valid_preds),
        "Training RMSLE": root_mean_squared_log_error(y_train, train_preds),
        "Valid RMSLE": root_mean_squared_log_error(y_valid, valid_preds),
        "Training R^2": pipeline.score(X_train, y_train),
        "Valid R^2": pipeline.score(X_valid, y_valid)
    }
    return scores

# Fit and evaluate pipeline
print(f"[INFO] Fitting model with one hot encoded values...")
model_one_hot_pipeline.fit(X_train, y_train)
print(f"[INFO] Evaluating model with one hot encoded values...")
pipeline_one_hot_scores = evaluate_pipeline(model_one_hot_pipeline, X_train, y_train, X_valid, y_valid)
print("[INFO] Pipeline with one hot encoding scores:")
pipeline_one_hot_scores

Out[7]:

[INFO] Fitting model with one hot encoded values...
[INFO] Evaluating model with one hot encoded values...
[INFO] Pipeline with one hot encoding scores:
CPU times: user 29min, sys: 23min 12s, total: 52min 13s
Wall time: 9min 14s

{'Training MAE': np.float64(2133.748251811842),
 'Valid MAE': np.float64(6176.810802667383),
 'Training RMSLE': np.float64(0.11021214524792695),
 'Valid RMSLE': np.float64(0.2539881442090813),
 'Training R^2': 0.9759312990258391,
 'Valid R^2': 0.870741470996933}

Predicting the Sale Price of Bulldozers using Machine Learning 🚜

Overview

6 Step Machine Learning Framework

1. Problem Definition

2. Data

3. Evaluation

4. Features

1. Importing the data and preparing it for modelling

1.1 Parsing dates

1.2 Sorting our DataFrame by saledate

1.3 Adding extra features to our DataFrame

1.4 Inspect values of other columns

2. Model driven data exploration

2.1 Inspecting the datatypes in our DataFrame

2.2 Converting strings to categories with pandas

2.3 Saving our preprocessed data (part 1)

2.4 Finding and filling missing values

2.5 Filling missing numerical values

2.6 Discussing possible ways to handle missing values

2.7 Filling missing categorical values with pandas

2.8 Saving our preprocessed data (part 2)

2.9 Fitting a machine learning model to our preprocessed data

2.10 A big (but fixable) mistake

3. Splitting data into the right train/validation sets

3.1 Trying to fit a model on our training data

3.2 Encoding categorical features as numbers using Scikit-Learn

3.3 Fitting a model to our preprocessed training data

4. Building an evaluation function

5. Tuning our model's hyperparameters

5.1 Making our modelling experiments faster (to speed up hyperparameter tuning)

5.2 Hyperparameter tuning with RandomizedSearchCV

5.3 Training a model with the best hyperparameters

5.4 Comparing our model's scores

6. Saving our best model to file

7. Making predictions on test data

7.1 Preprocessing the test data (to be in the same format as the training data)

8. Making a prediction on a custom sample

9. Finding the most important predictive features

Summary

Exercises

Extra-curriculum

Example Exercise Solutions

1. Fill the missing values in the numeric columns with the median using Scikit-Learn and see if that helps our best model's performance

2. Try putting multiple steps together (e.g. preprocessing -> modelling) with Scikit-Learn's sklearn.pipeline.Pipeline

3. Try using another regression model/estimator on our preprocessed dataset and see how it goes

4. Try replacing the sklearn.preprocessing.OrdinalEncoder we used for the categorical variables with sklearn.preprocessing.OneHotEncoder

2. Try putting multiple steps together (e.g. preprocessing -> modelling) with Scikit-Learn's `sklearn.pipeline.Pipeline`

4. Try replacing the `sklearn.preprocessing.OrdinalEncoder` we used for the categorical variables with `sklearn.preprocessing.OneHotEncoder`