📚 The CoCalc Library - books, templates and other resources
License: OTHER
Kernel: Python [conda env:py37]
In [1]:
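# [sketch] The original setup cell was lost in export. This notebook appears
# to follow the feature-engineering chapter of Mueller & Guido's
# "Introduction to Machine Learning with Python"; the minimal setup the rest
# of the cells seem to assume is:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mglearn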
Representing Data and Engineering Features
Categorical Variables
One-Hot-Encoding (Dummy Variables)
In [2]:
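# [sketch] Reconstruction of the lost load cell, consistent with the column
# names printed in Out[4] below. The path via mglearn.datasets.DATA_PATH is
# an assumption about where the adult census data lives.
import os
data = pd.read_csv(
    os.path.join(mglearn.datasets.DATA_PATH, "adult.data"), header=None,
    index_col=False,
    names=['age', 'workclass', 'fnlwgt', 'education', 'education-num',
           'marital-status', 'occupation', 'relationship', 'race', 'gender',
           'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
           'income'])
# keep only the columns used in this chapter
data = data[['age', 'workclass', 'education', 'gender', 'hours-per-week',
             'occupation', 'income']]
data.head()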
Out[2]:
Checking string-encoded categorical data
In [3]:
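# [sketch] Check that the string-encoded column contains only sensible
# values; the counts match the output below.
data.gender.value_counts()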
Out[3]:
Male 21790
Female 10771
Name: gender, dtype: int64
In [4]:
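# [sketch] pd.get_dummies one-hot-encodes every object/categorical column and
# leaves the numeric ones ('age', 'hours-per-week') untouched.
print("Original features:\n", list(data.columns), "\n")
data_dummies = pd.get_dummies(data)
print("Features after get_dummies:\n", list(data_dummies.columns))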
Out[4]:
Original features:
['age', 'workclass', 'education', 'gender', 'hours-per-week', 'occupation', 'income']
Features after get_dummies:
['age', 'hours-per-week', 'workclass_ ?', 'workclass_ Federal-gov', 'workclass_ Local-gov', 'workclass_ Never-worked', 'workclass_ Private', 'workclass_ Self-emp-inc', 'workclass_ Self-emp-not-inc', 'workclass_ State-gov', 'workclass_ Without-pay', 'education_ 10th', 'education_ 11th', 'education_ 12th', 'education_ 1st-4th', 'education_ 5th-6th', 'education_ 7th-8th', 'education_ 9th', 'education_ Assoc-acdm', 'education_ Assoc-voc', 'education_ Bachelors', 'education_ Doctorate', 'education_ HS-grad', 'education_ Masters', 'education_ Preschool', 'education_ Prof-school', 'education_ Some-college', 'gender_ Female', 'gender_ Male', 'occupation_ ?', 'occupation_ Adm-clerical', 'occupation_ Armed-Forces', 'occupation_ Craft-repair', 'occupation_ Exec-managerial', 'occupation_ Farming-fishing', 'occupation_ Handlers-cleaners', 'occupation_ Machine-op-inspct', 'occupation_ Other-service', 'occupation_ Priv-house-serv', 'occupation_ Prof-specialty', 'occupation_ Protective-serv', 'occupation_ Sales', 'occupation_ Tech-support', 'occupation_ Transport-moving', 'income_ <=50K', 'income_ >50K']
In [5]:
Out[5]:
In [6]:
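# [sketch] Slice out the feature columns and a single target column. Note
# that get_dummies also encoded the target, so 'income_ >50K' serves as y and
# both income columns are excluded from X.
features = data_dummies.loc[:, 'age':'occupation_ Transport-moving']
X = features.values
y = data_dummies['income_ >50K'].values
print("X.shape: {}  y.shape: {}".format(X.shape, y.shape))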
Out[6]:
X.shape: (32561, 44) y.shape: (32561,)
In [7]:
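# [sketch] A train/test split and a logistic regression baseline on the
# one-hot-encoded data.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print("Test score: {:.2f}".format(logreg.score(X_test, y_test)))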
Out[7]:
Test score: 0.81
Numbers Can Encode Categoricals
In [8]:
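# [sketch] A toy frame with one integer and one string feature; the integers
# are just as categorical as the strings, which matters for the encoders
# compared below.
demo_df = pd.DataFrame({'Integer Feature': [0, 1, 2, 1],
                        'Categorical Feature': ['socks', 'fox', 'socks', 'box']})
demo_df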
Out[8]:
In [9]:
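# [sketch] get_dummies encodes only the string column; the integer column is
# passed through unchanged, even though it is categorical too. (The rendered
# table was lost in export.)
pd.get_dummies(demo_df)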
Out[9]:
In [10]:
Out[10]:
OneHotEncoder and ColumnTransformer: Categorical Variables with scikit-learn
In [11]:
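# [sketch] OneHotEncoder encodes *all* columns, integer-valued or not.
# sparse=False returns a dense array; newer scikit-learn releases spell this
# sparse_output=False.
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse=False)
print(ohe.fit_transform(demo_df))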
Out[11]:
[[1. 0. 0. 0. 0. 1.]
[0. 1. 0. 0. 1. 0.]
[0. 0. 1. 0. 0. 1.]
[0. 1. 0. 1. 0. 0.]]
In [12]:
Out[12]:
['x0_0' 'x0_1' 'x0_2' 'x1_box' 'x1_fox' 'x1_socks']
In [13]:
Out[13]:
In [14]:
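# [sketch] ColumnTransformer applies a different transformation to each group
# of columns: scaling for the two continuous features, one-hot-encoding for
# the categorical ones.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

ct = ColumnTransformer(
    [("scaling", StandardScaler(), ['age', 'hours-per-week']),
     ("onehot", OneHotEncoder(sparse=False),
      ['workclass', 'education', 'gender', 'occupation'])])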
In [15]:
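# [sketch] Split the raw DataFrame (not the dummies), then fit and apply the
# ColumnTransformer; the 44 output columns match the get_dummies result above.
data_features = data.drop("income", axis=1)
X_train, X_test, y_train, y_test = train_test_split(
    data_features, data.income, random_state=0)

ct.fit(X_train)
X_train_trans = ct.transform(X_train)
print(X_train_trans.shape)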
Out[15]:
(24420, 44)
In [16]:
Out[16]:
Test score: 0.81
In [17]:
Out[17]:
OneHotEncoder(categorical_features=None, categories=None,
dtype=<class 'numpy.float64'>, handle_unknown='error',
n_values=None, sparse=False)
Convenient ColumnTransformer creation with make_column_transformer
In [18]:
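# [sketch] make_column_transformer names each step automatically. The
# (columns, transformer) argument order shown here is the scikit-learn 0.20
# API this notebook was run against; newer releases expect (transformer,
# columns) instead.
from sklearn.compose import make_column_transformer

ct = make_column_transformer(
    (['age', 'hours-per-week'], StandardScaler()),
    (['workclass', 'education', 'gender', 'occupation'],
     OneHotEncoder(sparse=False)))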
Binning, Discretization, Linear Models, and Trees
In [19]:
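# [sketch] Compare a decision tree and a linear model on the one-dimensional
# wave dataset; the tree can follow the nonlinearity, the straight line
# cannot. Later cells reuse X, y and line defined here.
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X, y = mglearn.datasets.make_wave(n_samples=120)
line = np.linspace(-3, 3, 1000, endpoint=False).reshape(-1, 1)

reg = DecisionTreeRegressor(min_samples_leaf=3).fit(X, y)
plt.plot(line, reg.predict(line), label="decision tree")

reg = LinearRegression().fit(X, y)
plt.plot(line, reg.predict(line), label="linear regression")

plt.plot(X[:, 0], y, 'o', c='k')
plt.ylabel("Regression output")
plt.xlabel("Input feature")
plt.legend(loc="best")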
In [20]:
In [21]:
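# [sketch] KBinsDiscretizer splits the input range into 10 equal-width bins;
# the fitted edges are printed below.
from sklearn.preprocessing import KBinsDiscretizer

kb = KBinsDiscretizer(n_bins=10, strategy='uniform')
kb.fit(X)
print("bin edges: \n", kb.bin_edges_)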
Out[21]:
bin edges:
[array([-2.967, -2.378, -1.789, -1.2 , -0.612, -0.023, 0.566, 1.155,
1.744, 2.333, 2.921])]
In [22]:
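# [sketch] The default encode='onehot' yields a sparse matrix with one
# indicator column per bin -- exactly one nonzero entry per sample.
X_binned = kb.transform(X)
X_binned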
Out[22]:
<120x10 sparse matrix of type '<class 'numpy.float64'>'
with 120 stored elements in Compressed Sparse Row format>
In [23]:
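# [sketch] Side by side: the first ten raw values and their one-hot bin
# memberships.
print(X[:10])
X_binned.toarray()[:10]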
Out[23]:
[[-0.753]
[ 2.704]
[ 1.392]
[ 0.592]
[-2.064]
[-2.064]
[-2.651]
[ 2.197]
[ 0.607]
[ 1.248]]
array([[0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
[0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
[0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
[0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
[0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 1., 0., 0.]])
In [24]:
In [25]:
Out[25]:
[figure removed in export; x-axis: "Input feature"]
Interactions and Polynomials
In [26]:
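# [sketch] Put the original feature back next to its bin indicators. This
# assumes X_binned was re-encoded densely in the previous cell, e.g. with
# KBinsDiscretizer(n_bins=10, strategy='uniform', encode='onehot-dense').
X_combined = np.hstack([X, X_binned])
print(X_combined.shape)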
Out[26]:
(120, 11)
In [27]:
In [28]:
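# [sketch] Products of the original feature with each bin indicator give a
# linear model a separate slope per bin: 10 indicators + 10 products = 20.
X_product = np.hstack([X_binned, X * X_binned])
print(X_product.shape)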
Out[28]:
(120, 20)
In [29]:
In [30]:
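# [sketch] Polynomial expansion of the single wave feature up to x ** 10;
# include_bias=False drops the constant-1 column.
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=10, include_bias=False)
poly.fit(X)
X_poly = poly.transform(X)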
In [31]:
Out[31]:
X_poly.shape: (120, 10)
In [32]:
Out[32]:
Entries of X:
[[-0.753]
[ 2.704]
[ 1.392]
[ 0.592]
[-2.064]]
Entries of X_poly:
[[ -0.753 0.567 -0.427 0.321 -0.242 0.182 -0.137
0.103 -0.078 0.058]
[ 2.704 7.313 19.777 53.482 144.632 391.125 1057.714
2860.36 7735.232 20918.278]
[ 1.392 1.938 2.697 3.754 5.226 7.274 10.125
14.094 19.618 27.307]
[ 0.592 0.35 0.207 0.123 0.073 0.043 0.025
0.015 0.009 0.005]
[ -2.064 4.26 -8.791 18.144 -37.448 77.289 -159.516
329.222 -679.478 1402.367]]
In [33]:
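# [sketch] The semantics of each generated column (get_feature_names is the
# scikit-learn 0.20 spelling; newer releases use get_feature_names_out):
print("Polynomial feature names:\n{}".format(poly.get_feature_names()))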
Out[33]:
Polynomial feature names:
['x0', 'x0^2', 'x0^3', 'x0^4', 'x0^5', 'x0^6', 'x0^7', 'x0^8', 'x0^9', 'x0^10']
In [34]:
In [35]:
In [36]:
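# [sketch] Load the Boston Housing data and rescale it to [0, 1]; the scaled
# copy feeds the polynomial expansion below. (load_boston has since been
# removed from scikit-learn; it was available in this notebook's environment.)
from sklearn.datasets import load_boston
from sklearn.preprocessing import MinMaxScaler

boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(
    boston.data, boston.target, random_state=0)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)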
In [37]:
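# [sketch] Degree-2 expansion: 13 originals, 78 pairwise interactions,
# 13 squares and a bias column give 105 features.
poly = PolynomialFeatures(degree=2).fit(X_train_scaled)
X_train_poly = poly.transform(X_train_scaled)
X_test_poly = poly.transform(X_test_scaled)
print("X_train.shape: {}".format(X_train.shape))
print("X_train_poly.shape: {}".format(X_train_poly.shape))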
Out[37]:
X_train.shape: (379, 13)
X_train_poly.shape: (379, 105)
In [38]:
Out[38]:
Polynomial feature names:
['1', 'x0', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11', 'x12', 'x0^2', 'x0 x1', 'x0 x2', 'x0 x3', 'x0 x4', 'x0 x5', 'x0 x6', 'x0 x7', 'x0 x8', 'x0 x9', 'x0 x10', 'x0 x11', 'x0 x12', 'x1^2', 'x1 x2', 'x1 x3', 'x1 x4', 'x1 x5', 'x1 x6', 'x1 x7', 'x1 x8', 'x1 x9', 'x1 x10', 'x1 x11', 'x1 x12', 'x2^2', 'x2 x3', 'x2 x4', 'x2 x5', 'x2 x6', 'x2 x7', 'x2 x8', 'x2 x9', 'x2 x10', 'x2 x11', 'x2 x12', 'x3^2', 'x3 x4', 'x3 x5', 'x3 x6', 'x3 x7', 'x3 x8', 'x3 x9', 'x3 x10', 'x3 x11', 'x3 x12', 'x4^2', 'x4 x5', 'x4 x6', 'x4 x7', 'x4 x8', 'x4 x9', 'x4 x10', 'x4 x11', 'x4 x12', 'x5^2', 'x5 x6', 'x5 x7', 'x5 x8', 'x5 x9', 'x5 x10', 'x5 x11', 'x5 x12', 'x6^2', 'x6 x7', 'x6 x8', 'x6 x9', 'x6 x10', 'x6 x11', 'x6 x12', 'x7^2', 'x7 x8', 'x7 x9', 'x7 x10', 'x7 x11', 'x7 x12', 'x8^2', 'x8 x9', 'x8 x10', 'x8 x11', 'x8 x12', 'x9^2', 'x9 x10', 'x9 x11', 'x9 x12', 'x10^2', 'x10 x11', 'x10 x12', 'x11^2', 'x11 x12', 'x12^2']
In [39]:
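# [sketch] Ridge regression benefits clearly from the interaction features.
from sklearn.linear_model import Ridge

ridge = Ridge().fit(X_train_scaled, y_train)
print("Score without interactions: {:.3f}".format(
    ridge.score(X_test_scaled, y_test)))
ridge = Ridge().fit(X_train_poly, y_train)
print("Score with interactions: {:.3f}".format(
    ridge.score(X_test_poly, y_test)))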
Out[39]:
Score without interactions: 0.621
Score with interactions: 0.753
In [40]:
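# [sketch] A random forest models interactions on its own, so the expanded
# features buy nothing here and even hurt slightly.
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100).fit(X_train_scaled, y_train)
print("Score without interactions: {:.3f}".format(
    rf.score(X_test_scaled, y_test)))
rf = RandomForestRegressor(n_estimators=100).fit(X_train_poly, y_train)
print("Score with interactions: {:.3f}".format(rf.score(X_test_poly, y_test)))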
Out[40]:
Score without interactions: 0.788
Score with interactions: 0.761
Univariate Nonlinear Transformations
In [41]:
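# [sketch] Simulated count data: Poisson-distributed features whose log is
# linearly related to the target.
rnd = np.random.RandomState(0)
X_org = rnd.normal(size=(1000, 3))
w = rnd.normal(size=3)

X = rnd.poisson(10 * np.exp(X_org))
y = np.dot(X_org, w)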
In [42]:
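# [sketch] How often each integer value appears in the first feature.
print("Number of feature appearances:\n{}".format(np.bincount(X[:, 0])))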
Out[42]:
Number of feature appearances:
[28 38 68 48 61 59 45 56 37 40 35 34 36 26 23 26 27 21 23 23 18 21 10 9
17 9 7 14 12 7 3 8 4 5 5 3 4 2 4 1 1 3 2 5 3 8 2 5
2 1 2 3 3 2 2 3 3 0 1 2 1 0 0 3 1 0 0 0 1 3 0 1
0 2 0 1 1 0 0 0 0 1 0 0 2 2 0 1 1 0 0 0 0 1 1 0
0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0
1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
In [43]:
Out[43]:
[figure removed in export; x-axis: "Value"]
In [44]:
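# [sketch] Ridge struggles on the raw, heavily skewed counts.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
score = Ridge().fit(X_train, y_train).score(X_test, y_test)
print("Test score: {:.3f}".format(score))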
Out[44]:
Test score: 0.622
In [45]:
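# [sketch] Apply log(X + 1), since the data contains zero counts.
X_train_log = np.log(X_train + 1)
X_test_log = np.log(X_test + 1)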
In [46]:
Out[46]:
[figure removed in export; x-axis: "Value"]
In [47]:
Out[47]:
Test score: 0.875
Automatic Feature Selection
Univariate Statistics
In [48]:
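# [sketch] Breast-cancer data padded with 50 pure-noise features;
# SelectPercentile keeps the 50% of features with the best univariate
# ANOVA F-scores (f_classif, the default).
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectPercentile

cancer = load_breast_cancer()

# deterministic noise, appended after the 30 real features
rng = np.random.RandomState(42)
noise = rng.normal(size=(len(cancer.data), 50))
X_w_noise = np.hstack([cancer.data, noise])

X_train, X_test, y_train, y_test = train_test_split(
    X_w_noise, cancer.target, random_state=0, test_size=.5)
select = SelectPercentile(percentile=50)
select.fit(X_train, y_train)
X_train_selected = select.transform(X_train)

print("X_train.shape: {}".format(X_train.shape))
print("X_train_selected.shape: {}".format(X_train_selected.shape))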
Out[48]:
X_train.shape: (284, 80)
X_train_selected.shape: (284, 40)
In [49]:
Out[49]:
[ True True True True True True True True True False True False
True True True True True True False False True True True True
True True True True True True False False False True False True
False False True False False False False True False False True False
False True False True False False False False False False True False
True False False False False True False True False False False False
True True False True False False False False]
In [50]:
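# [sketch] Dropping the noise features gives a small accuracy boost.
X_test_selected = select.transform(X_test)

lr = LogisticRegression()
lr.fit(X_train, y_train)
print("Score with all features: {:.3f}".format(lr.score(X_test, y_test)))
lr.fit(X_train_selected, y_train)
print("Score with only selected features: {:.3f}".format(
    lr.score(X_test_selected, y_test)))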
Out[50]:
Score with all features: 0.930
Score with only selected features: 0.940
Model-based Feature Selection
In [51]:
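# [sketch] Model-based selection: keep the features whose random-forest
# importance lies above the median, i.e. 40 of the 80.
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

select = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold="median")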
In [52]:
Out[52]:
X_train.shape: (284, 80)
X_train_l1.shape: (284, 40)
In [53]:
In [54]:
Out[54]:
Test score: 0.951
Iterative Feature Selection
In [55]:
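# [sketch] Recursive feature elimination: repeatedly refit the forest and
# drop the least important feature until 40 remain, then plot the mask of
# selected features.
from sklearn.feature_selection import RFE

select = RFE(RandomForestClassifier(n_estimators=100, random_state=42),
             n_features_to_select=40)
select.fit(X_train, y_train)

mask = select.get_support()
plt.matshow(mask.reshape(1, -1), cmap='gray_r')
plt.xlabel("Sample index")
plt.yticks(())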
In [56]:
Out[56]:
Test score: 0.951
In [57]:
Out[57]:
Test score: 0.951
Utilizing Expert Knowledge
In [58]:
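# [sketch] The Citi Bike rental counts ship with mglearn, resampled to
# three-hour intervals for one station in August 2015.
citibike = mglearn.datasets.load_citibike()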
In [59]:
Out[59]:
Citibike data:
starttime
2015-08-01 00:00:00 3
2015-08-01 03:00:00 0
2015-08-01 06:00:00 9
2015-08-01 09:00:00 41
2015-08-01 12:00:00 39
Freq: 3H, Name: one, dtype: int64
In [60]:
Out[60]:
[figure removed in export; y-axis: "Rentals"]
In [61]:
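# [sketch] The naive feature: each timestamp as a single POSIX-time integer
# (nanoseconds since the epoch, divided down to seconds).
y = citibike.values
X = citibike.index.astype("int64").values.reshape(-1, 1) // 10**9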
In [62]:
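# [sketch] Helper used by all the experiments below: train on the first 184
# three-hour windows (23 days), evaluate on the rest, and plot both. The
# plotting is simplified relative to the lost original.
n_train = 184

def eval_on_features(features, target, regressor):
    X_train, X_test = features[:n_train], features[n_train:]
    y_train, y_test = target[:n_train], target[n_train:]
    regressor.fit(X_train, y_train)
    print("Test-set R^2: {:.2f}".format(regressor.score(X_test, y_test)))
    y_pred = regressor.predict(X_test)
    y_pred_train = regressor.predict(X_train)
    plt.figure(figsize=(10, 3))
    plt.plot(range(n_train), y_train, label="train")
    plt.plot(range(n_train, len(target)), y_test, '-', label="test")
    plt.plot(range(n_train), y_pred_train, '--', label="prediction train")
    plt.plot(range(n_train, len(target)), y_pred, '--', label="prediction test")
    plt.legend(loc=(1.01, 0))
    plt.xlabel("Date")
    plt.ylabel("Rentals")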
In [63]:
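# [sketch] At test time a random forest sees only POSIX times outside its
# training range, so it cannot extrapolate and does no better than a
# constant prediction.
from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor(n_estimators=100, random_state=0)
eval_on_features(X, y, regressor)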
Out[63]:
Test-set R^2: -0.04
In [64]:
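# [sketch] Hour of day alone already captures the daily pattern.
X_hour = citibike.index.hour.values.reshape(-1, 1)
eval_on_features(X_hour, y, regressor)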
Out[64]:
Test-set R^2: 0.60
In [65]:
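# [sketch] Hour of day plus day of week recovers most of the structure.
X_hour_week = np.hstack([citibike.index.dayofweek.values.reshape(-1, 1),
                         citibike.index.hour.values.reshape(-1, 1)])
eval_on_features(X_hour_week, y, regressor)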
Out[65]:
Test-set R^2: 0.84
In [66]:
Out[66]:
Test-set R^2: 0.13
In [67]:
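# [sketch] A linear model cannot use the integer-coded day/hour directly
# (R^2 of 0.13 above), so one-hot-encode both columns first.
enc = OneHotEncoder()
X_hour_week_onehot = enc.fit_transform(X_hour_week).toarray()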
In [68]:
Out[68]:
Test-set R^2: 0.62
In [69]:
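# [sketch] Interaction features between day and hour let Ridge learn a
# separate coefficient per (day, hour) combination.
poly_transformer = PolynomialFeatures(degree=2, interaction_only=True,
                                      include_bias=False)
X_hour_week_onehot_poly = poly_transformer.fit_transform(X_hour_week_onehot)
lr = Ridge()
eval_on_features(X_hour_week_onehot_poly, y, lr)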
Out[69]:
Test-set R^2: 0.85
In [70]:
In [71]:
In [72]:
Out[72]:
[figure removed in export; y-axis: "Feature magnitude"]