"""
2
Title: Structured data learning with Wide, Deep, and Cross networks
3
Author: [Khalid Salama](https://www.linkedin.com/in/khalid-salama-24403144/)
4
Date created: 2020/12/31
5
Last modified: 2025/01/03
6
Description: Using Wide & Deep and Deep & Cross networks for structured data classification.
7
Accelerator: GPU
8
"""
9
10
"""
11
## Introduction
12
13
This example demonstrates how to do structured data classification using the two modeling
14
techniques:
15
16
1. [Wide & Deep](https://ai.googleblog.com/2016/06/wide-deep-learning-better-together-with.html) models
17
2. [Deep & Cross](https://arxiv.org/abs/1708.05123) models
18
19
Note that this example should be run with TensorFlow 2.5 or higher.
20
"""
21
22
"""
23
## The dataset
24
25
This example uses the [Covertype](https://archive.ics.uci.edu/ml/datasets/covertype) dataset from the UCI
26
Machine Learning Repository. The task is to predict forest cover type from cartographic variables.
27
The dataset includes 506,011 instances with 12 input features: 10 numerical features and 2
28
categorical features. Each instance is categorized into 1 of 7 classes.
29
"""
30
31
"""
32
## Setup
33
"""
34
35
import os

# Only the TensorFlow backend supports string inputs.
os.environ["KERAS_BACKEND"] = "tensorflow"

import math
import numpy as np
import pandas as pd
from tensorflow import data as tf_data
import keras
from keras import layers

"""
48
## Prepare the data
49
50
First, let's load the dataset from the UCI Machine Learning Repository into a Pandas
51
DataFrame:
52
"""
53
54
data_url = (
55
"https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.data.gz"
56
)
57
raw_data = pd.read_csv(data_url, header=None)
58
print(f"Dataset shape: {raw_data.shape}")
59
raw_data.head()
60
61
"""
62
The two categorical features in the dataset are binary-encoded.
63
We will convert this dataset representation to the typical representation, where each
64
categorical feature is represented as a single integer value.
65
"""
66
67
soil_type_values = [f"soil_type_{idx+1}" for idx in range(40)]
68
wilderness_area_values = [f"area_type_{idx+1}" for idx in range(4)]
69
70
soil_type = raw_data.loc[:, 14:53].apply(
71
lambda x: soil_type_values[0::1][x.to_numpy().nonzero()[0][0]], axis=1
72
)
73
wilderness_area = raw_data.loc[:, 10:13].apply(
74
lambda x: wilderness_area_values[0::1][x.to_numpy().nonzero()[0][0]], axis=1
75
)
76
77
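"""
To make the trick above concrete, here is a toy illustration (hypothetical values, not
taken from the dataset): for a row whose four wilderness-area indicator columns are
`[0, 0, 1, 0]`, `nonzero()` returns index 2, which maps to the string `"area_type_3"`.
"""

# Toy sanity check of the one-hot-to-string conversion used above.
example_indicators = np.array([0, 0, 1, 0])
print(wilderness_area_values[example_indicators.nonzero()[0][0]])  # -> "area_type_3"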
CSV_HEADER = [
    "Elevation",
    "Aspect",
    "Slope",
    "Horizontal_Distance_To_Hydrology",
    "Vertical_Distance_To_Hydrology",
    "Horizontal_Distance_To_Roadways",
    "Hillshade_9am",
    "Hillshade_Noon",
    "Hillshade_3pm",
    "Horizontal_Distance_To_Fire_Points",
    "Wilderness_Area",
    "Soil_Type",
    "Cover_Type",
]

data = pd.concat(
    [raw_data.loc[:, 0:9], wilderness_area, soil_type, raw_data.loc[:, 54]],
    axis=1,
    ignore_index=True,
)
data.columns = CSV_HEADER

# Convert the target label indices into a range from 0 to 6 (there are 7 labels in total).
data["Cover_Type"] = data["Cover_Type"] - 1

print(f"Dataset shape: {data.shape}")
data.head().T

"""
107
The shape of the DataFrame shows there are 13 columns per sample
108
(12 for the features and 1 for the target label).
109
110
Let's split the data into training (85%) and test (15%) sets.
111
"""
112
113
train_splits = []
114
test_splits = []
115
116
for _, group_data in data.groupby("Cover_Type"):
117
random_selection = np.random.rand(len(group_data.index)) <= 0.85
118
train_splits.append(group_data[random_selection])
119
test_splits.append(group_data[~random_selection])
120
121
train_data = pd.concat(train_splits).sample(frac=1).reset_index(drop=True)
122
test_data = pd.concat(test_splits).sample(frac=1).reset_index(drop=True)
123
124
print(f"Train split size: {len(train_data.index)}")
125
print(f"Test split size: {len(test_data.index)}")
126
127
"""
128
Next, store the training and test data in separate CSV files.
129
"""
130
131
train_data_file = "train_data.csv"
132
test_data_file = "test_data.csv"
133
134
train_data.to_csv(train_data_file, index=False)
135
test_data.to_csv(test_data_file, index=False)
136
137
"""
138
## Define dataset metadata
139
140
Here, we define the metadata of the dataset that will be useful for reading and parsing
141
the data into input features, and encoding the input features with respect to their types.
142
"""
143
144
TARGET_FEATURE_NAME = "Cover_Type"
145
146
TARGET_FEATURE_LABELS = ["0", "1", "2", "3", "4", "5", "6"]
147
148
NUMERIC_FEATURE_NAMES = [
149
"Aspect",
150
"Elevation",
151
"Hillshade_3pm",
152
"Hillshade_9am",
153
"Hillshade_Noon",
154
"Horizontal_Distance_To_Fire_Points",
155
"Horizontal_Distance_To_Hydrology",
156
"Horizontal_Distance_To_Roadways",
157
"Slope",
158
"Vertical_Distance_To_Hydrology",
159
]
160
161
CATEGORICAL_FEATURES_WITH_VOCABULARY = {
162
"Soil_Type": list(data["Soil_Type"].unique()),
163
"Wilderness_Area": list(data["Wilderness_Area"].unique()),
164
}
165
166
CATEGORICAL_FEATURE_NAMES = list(CATEGORICAL_FEATURES_WITH_VOCABULARY.keys())
167
168
FEATURE_NAMES = NUMERIC_FEATURE_NAMES + CATEGORICAL_FEATURE_NAMES
169
170
COLUMN_DEFAULTS = [
171
[0] if feature_name in NUMERIC_FEATURE_NAMES + [TARGET_FEATURE_NAME] else ["NA"]
172
for feature_name in CSV_HEADER
173
]
174
175
NUM_CLASSES = len(TARGET_FEATURE_LABELS)
176
177
"""
178
## Experiment setup
179
180
Next, let's define an input function that reads and parses the file, then converts features
181
and labels into a[`tf.data.Dataset`](https://www.tensorflow.org/guide/datasets)
182
for training or evaluation.
183
"""
184
185
186
# To convert the datasets elements to from OrderedDict to Dictionary
187
def process(features, target):
188
return dict(features), target
189
190
191
def get_dataset_from_csv(csv_file_path, batch_size, shuffle=False):
192
dataset = tf_data.experimental.make_csv_dataset(
193
csv_file_path,
194
batch_size=batch_size,
195
column_names=CSV_HEADER,
196
column_defaults=COLUMN_DEFAULTS,
197
label_name=TARGET_FEATURE_NAME,
198
num_epochs=1,
199
header=True,
200
shuffle=shuffle,
201
).map(process)
202
return dataset.cache()
203
204
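"""
As a quick sanity check (a minimal sketch, not required for the rest of the example),
we can peek at a single batch to confirm that the features arrive as a dictionary of
tensors keyed by feature name, with the label tensor alongside:
"""

for features, target in get_dataset_from_csv(train_data_file, batch_size=5).take(1):
    for feature_name, value in features.items():
        print(f"{feature_name}: dtype={value.dtype.name}, shape={value.shape}")
    print(f"{TARGET_FEATURE_NAME} (target): {target.numpy()}")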
"""
Here we configure the parameters and implement the procedure for running a training and
evaluation experiment given a model.
"""

learning_rate = 0.001
dropout_rate = 0.1
batch_size = 265
num_epochs = 1

hidden_units = [32, 32]


def run_experiment(model):
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
        loss=keras.losses.SparseCategoricalCrossentropy(),
        metrics=[keras.metrics.SparseCategoricalAccuracy()],
    )

    train_dataset = get_dataset_from_csv(train_data_file, batch_size, shuffle=True)

    test_dataset = get_dataset_from_csv(test_data_file, batch_size)

    print("Start training the model...")
    history = model.fit(train_dataset, epochs=num_epochs)
    print("Model training finished")

    _, accuracy = model.evaluate(test_dataset, verbose=0)

    print(f"Test accuracy: {round(accuracy * 100, 2)}%")


"""
239
## Create model inputs
240
241
Now, define the inputs for the models as a dictionary, where the key is the feature name,
242
and the value is a `keras.layers.Input` tensor with the corresponding feature shape
243
and data type.
244
"""
245
246
247
def create_model_inputs():
    inputs = {}
    for feature_name in FEATURE_NAMES:
        if feature_name in NUMERIC_FEATURE_NAMES:
            inputs[feature_name] = layers.Input(
                name=feature_name, shape=(), dtype="float32"
            )
        else:
            inputs[feature_name] = layers.Input(
                name=feature_name, shape=(), dtype="string"
            )
    return inputs


"""
262
## Encode features
263
264
We create two representations of our input features: sparse and dense:
265
1. In the **sparse** representation, the categorical features are encoded with one-hot
266
encoding using the `CategoryEncoding` layer. This representation can be useful for the
267
model to *memorize* particular feature values to make certain predictions.
268
2. In the **dense** representation, the categorical features are encoded with
269
low-dimensional embeddings using the `Embedding` layer. This representation helps
270
the model to *generalize* well to unseen feature combinations.
271
"""
272
273
274
def encode_inputs(inputs, use_embedding=False):
    encoded_features = []
    for feature_name in inputs:
        if feature_name in CATEGORICAL_FEATURE_NAMES:
            vocabulary = CATEGORICAL_FEATURES_WITH_VOCABULARY[feature_name]
            # Create a lookup to convert string values to integer indices.
            # Since we are not using a mask token, nor expecting any out-of-vocabulary
            # (OOV) token, we set mask_token to None and num_oov_indices to 0.
            lookup = layers.StringLookup(
                vocabulary=vocabulary,
                mask_token=None,
                num_oov_indices=0,
                output_mode="int" if use_embedding else "binary",
            )
            if use_embedding:
                # Convert the string input values into integer indices.
                encoded_feature = lookup(inputs[feature_name])
                embedding_dims = int(math.sqrt(len(vocabulary)))
                # Create an embedding layer with the specified dimensions.
                embedding = layers.Embedding(
                    input_dim=len(vocabulary), output_dim=embedding_dims
                )
                # Convert the index values to embedding representations.
                encoded_feature = embedding(encoded_feature)
            else:
                # Convert the string input values into a one-hot encoding.
                encoded_feature = lookup(
                    keras.ops.expand_dims(inputs[feature_name], -1)
                )
        else:
            # Use the numerical features as-is.
            encoded_feature = keras.ops.expand_dims(inputs[feature_name], -1)

        encoded_features.append(encoded_feature)

    all_features = layers.concatenate(encoded_features)
    return all_features


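"""
Note that, with the `int(math.sqrt(len(vocabulary)))` heuristic above, the `Soil_Type`
vocabulary (up to 40 values) is embedded into `int(math.sqrt(40)) = 6` dimensions, and
the `Wilderness_Area` vocabulary (up to 4 values) into 2 dimensions.
"""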
"""
314
## Experiment 1: a baseline model
315
316
In the first experiment, let's create a multi-layer feed-forward network,
317
where the categorical features are one-hot encoded.
318
"""
319
320
321
def create_baseline_model():
    inputs = create_model_inputs()
    features = encode_inputs(inputs)

    for units in hidden_units:
        features = layers.Dense(units)(features)
        features = layers.BatchNormalization()(features)
        features = layers.ReLU()(features)
        features = layers.Dropout(dropout_rate)(features)

    outputs = layers.Dense(units=NUM_CLASSES, activation="softmax")(features)
    model = keras.Model(inputs=inputs, outputs=outputs)
    return model


baseline_model = create_baseline_model()
keras.utils.plot_model(baseline_model, show_shapes=True, rankdir="LR")

"""
Let's run it:
"""

run_experiment(baseline_model)

"""
The baseline model achieves ~76% test accuracy.
"""

"""
350
## Experiment 2: Wide & Deep model
351
352
In the second experiment, we create a Wide & Deep model. The wide part of the model
353
a linear model, while the deep part of the model is a multi-layer feed-forward network.
354
355
Use the sparse representation of the input features in the wide part of the model and the
356
dense representation of the input features for the deep part of the model.
357
358
Note that every input features contributes to both parts of the model with different
359
representations.
360
"""
361
362
363
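"""
Restating what the code below does: the wide branch sees the one-hot features (followed by
batch normalization), the deep branch sees the embedded features passed through the hidden
layers, and the two branches are combined by concatenation, so the prediction is roughly
`softmax(W [wide_features ; deep_features] + b)` for a learned weight matrix `W` and bias `b`.
"""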
def create_wide_and_deep_model():
    inputs = create_model_inputs()
    wide = encode_inputs(inputs)
    wide = layers.BatchNormalization()(wide)

    deep = encode_inputs(inputs, use_embedding=True)
    for units in hidden_units:
        deep = layers.Dense(units)(deep)
        deep = layers.BatchNormalization()(deep)
        deep = layers.ReLU()(deep)
        deep = layers.Dropout(dropout_rate)(deep)

    merged = layers.concatenate([wide, deep])
    outputs = layers.Dense(units=NUM_CLASSES, activation="softmax")(merged)
    model = keras.Model(inputs=inputs, outputs=outputs)
    return model


wide_and_deep_model = create_wide_and_deep_model()
keras.utils.plot_model(wide_and_deep_model, show_shapes=True, rankdir="LR")

"""
Let's run it:
"""

run_experiment(wide_and_deep_model)

"""
The wide and deep model achieves ~79% test accuracy.
"""

"""
395
## Experiment 3: Deep & Cross model
396
397
In the third experiment, we create a Deep & Cross model. The deep part of this model
398
is the same as the deep part created in the previous experiment. The key idea of
399
the cross part is to apply explicit feature crossing in an efficient way,
400
where the degree of cross features grows with layer depth.
401
"""
402
403
404
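"""
Concretely, each cross layer in the code below computes

`x_{l+1} = x_0 * (W_l x_l + b_l) + x_l`

where `x_0` is the embedded input, `x_l` is the output of the previous cross layer,
`*` denotes the element-wise product, and the `Dense` layer supplies the weights `W_l`
and bias `b_l`. Multiplying by `x_0` at every layer is what raises the degree of the
feature interactions as the stack gets deeper (this simply restates the computation in
`create_deep_and_cross_model` below).
"""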
def create_deep_and_cross_model():
    inputs = create_model_inputs()
    x0 = encode_inputs(inputs, use_embedding=True)

    cross = x0
    for _ in hidden_units:
        units = cross.shape[-1]
        x = layers.Dense(units)(cross)
        cross = x0 * x + cross
    cross = layers.BatchNormalization()(cross)

    deep = x0
    for units in hidden_units:
        deep = layers.Dense(units)(deep)
        deep = layers.BatchNormalization()(deep)
        deep = layers.ReLU()(deep)
        deep = layers.Dropout(dropout_rate)(deep)

    merged = layers.concatenate([cross, deep])
    outputs = layers.Dense(units=NUM_CLASSES, activation="softmax")(merged)
    model = keras.Model(inputs=inputs, outputs=outputs)
    return model


deep_and_cross_model = create_deep_and_cross_model()
keras.utils.plot_model(deep_and_cross_model, show_shapes=True, rankdir="LR")

"""
Let's run it:
"""

run_experiment(deep_and_cross_model)

"""
The deep and cross model achieves ~81% test accuracy.
"""

"""
442
## Conclusion
443
444
You can use Keras Preprocessing Layers to easily handle categorical features
445
with different encoding mechanisms, including one-hot encoding and feature embedding.
446
In addition, different model architectures — like wide, deep, and cross networks
447
— have different advantages, with respect to different dataset properties.
448
You can explore using them independently or combining them to achieve the best result
449
for your dataset.
450
"""
451
452