Creating fake data for car_sales (to make it a bit bigger)
This notebook manufactures data for the car_sales dataframe so it can be used to demonstrate techniques for handling missing data and converting non-numeric values to numbers.
Create fake "Make" data
Create fake "Colour" data
Create fake Odometer (KM) data
Create fake "Doors" data
Create fake "Price" data
Create base dataframe with manufactured data
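The steps above could be sketched roughly as follows. This is a minimal illustration, not the notebook's original code: the sample size, seed, colour list, door probabilities and numeric ranges are all assumptions, and only the four makes and the name `car_sales_extended` come from this notebook's text.

```python
import numpy as np
import pandas as pd

# Assumed values: seed, sample size, colours and numeric ranges are illustrative only.
rng = np.random.default_rng(seed=42)
n_samples = 1000

makes = ["Toyota", "Honda", "BMW", "Nissan"]          # makes named in this notebook
colours = ["White", "Red", "Blue", "Black", "Green"]  # hypothetical colour list

# Draw each column independently, then combine into one dataframe.
car_sales_extended = pd.DataFrame({
    "Make": rng.choice(makes, size=n_samples),
    "Colour": rng.choice(colours, size=n_samples),
    "Odometer (KM)": rng.integers(10_000, 250_000, size=n_samples),
    "Doors": rng.choice([3, 4, 5], size=n_samples, p=[0.1, 0.8, 0.1]),
    "Price": rng.integers(4_000, 30_000, size=n_samples),
})

print(car_sales_extended.head())
```

Using a seeded `np.random.default_rng` keeps the manufactured data reproducible between runs.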
Adjust the price column
For the price column:
Generate random numbers between certain values
If the Odometer reading is above 100K, multiply price by 0.75
If the Odometer reading is above 150K, multiply price by 0.6
If the Odometer reading is above 200K, multiply price by 0.5
If the Make column is BMW, multiply price by 1.5, then add $2500
If the Make column is Toyota, multiply price by 1.2
If the Make is Nissan, multiply price by 1.1
If the Make is Honda, add $1000 to price
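One possible way to apply these rules is sketched below. Two assumptions are worth flagging: the odometer discounts are treated as mutually exclusive bands (only the highest threshold passed applies), and the BMW rule is read as `price * 1.5 + 2500`. The function name `adjust_price` and the small demo frame are hypothetical.

```python
import numpy as np
import pandas as pd

def adjust_price(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with Price adjusted by odometer band and make."""
    out = df.copy()
    km = out["Odometer (KM)"]
    price = out["Price"].astype(float)

    # np.select uses the first condition that matches, so the highest
    # threshold is listed first (assumption: bands do not stack).
    odo_factor = np.select(
        [km > 200_000, km > 150_000, km > 100_000],
        [0.5, 0.6, 0.75],
        default=1.0,
    )
    price = price * odo_factor

    # Make-specific adjustments, applied after the odometer discount.
    make = out["Make"]
    price.loc[make == "BMW"] = price.loc[make == "BMW"] * 1.5 + 2500
    price.loc[make == "Toyota"] *= 1.2
    price.loc[make == "Nissan"] *= 1.1
    price.loc[make == "Honda"] += 1000

    out["Price"] = price
    return out

# Tiny hypothetical frame to exercise each rule once.
demo = pd.DataFrame({
    "Make": ["BMW", "Toyota", "Nissan", "Honda"],
    "Odometer (KM)": [50_000, 120_000, 160_000, 250_000],
    "Price": [10_000, 10_000, 10_000, 10_000],
})
adjusted = adjust_price(demo)
print(adjusted)
```

If the discounts were instead meant to compound, the `np.select` call could be replaced by three successive masked multiplications.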
NEXT:
Drop some values at random (to manufacture missing data)
Build a random forest model to predict (this will involve changing categories to numerical data)
Make missing data in car_sales_extended
What we want to do
Remove some row values, or replace them, at random
E.g. replace strings with empty strings ("")
And numbers with NaN or something similar...
Keep the number of samples and their order the same; just put some holes in the data
One way to do it would be to generate 50 random integers for each column and then drop or replace the values at those indices.
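A rough sketch of that idea follows, assuming numeric columns receive NaN and every other column receives an empty string. The helper name `punch_holes`, its parameters and the demo frame are hypothetical. Positions are sampled without replacement so that each column gets exactly 50 distinct holes, which is a slight tightening of "50 random integers".

```python
import numpy as np
import pandas as pd

def punch_holes(df: pd.DataFrame, n_missing: int = 50, seed: int = 0) -> pd.DataFrame:
    """Return a copy of df with n_missing values blanked out in every column."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in out.columns:
        # Pick distinct row positions for this column.
        idx = rng.choice(len(out), size=n_missing, replace=False)
        col_pos = out.columns.get_loc(col)
        if pd.api.types.is_numeric_dtype(out[col]):
            # Integer columns cannot hold NaN, so convert to float first.
            out[col] = out[col].astype(float)
            out.iloc[idx, col_pos] = np.nan
        else:
            out.iloc[idx, col_pos] = ""
    return out

# Hypothetical 200-row frame; row count and order are preserved.
demo = pd.DataFrame({
    "Make": ["Toyota", "Honda", "BMW", "Nissan"] * 50,
    "Price": np.arange(200) * 100 + 5000,
})
holed = punch_holes(demo)
print(holed.isna().sum())
```

Because each column draws its own positions, a given row may end up with zero, one or several holes.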