"""
2
Title: Structured data learning with Wide, Deep, and Cross networks
3
Author: [Khalid Salama](https://www.linkedin.com/in/khalid-salama-24403144/)
4
Date created: 2020/12/31
5
Last modified: 2025/01/03
6
Description: Using Wide & Deep and Deep & Cross networks for structured data classification.
7
Accelerator: GPU
8
"""
9
10
"""
11
## Introduction
12
13
This example demonstrates how to do structured data classification using the two modeling
14
techniques:
15
16
1. [Wide & Deep](https://ai.googleblog.com/2016/06/wide-deep-learning-better-together-with.html) models
17
2. [Deep & Cross](https://arxiv.org/abs/1708.05123) models
18
19
Note that this example should be run with TensorFlow 2.5 or higher.
20
"""
21
22
"""
23
## The dataset
24
25
This example uses the [Covertype](https://archive.ics.uci.edu/ml/datasets/covertype) dataset from the UCI
26
Machine Learning Repository. The task is to predict forest cover type from cartographic variables.
27
The dataset includes 506,011 instances with 12 input features: 10 numerical features and 2
28
categorical features. Each instance is categorized into 1 of 7 classes.
29
"""
30
31
"""
32
## Setup
33
"""
34
35
import os

# Only the TensorFlow backend supports string inputs.
os.environ["KERAS_BACKEND"] = "tensorflow"

import math
import numpy as np
import pandas as pd
from tensorflow import data as tf_data
import keras
from keras import layers

"""
48
## Prepare the data
49
50
First, let's load the dataset from the UCI Machine Learning Repository into a Pandas
51
DataFrame:
52
"""
53
54
data_url = (
55
"https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.data.gz"
56
)
57
raw_data = pd.read_csv(data_url, header=None)
58
print(f"Dataset shape: {raw_data.shape}")
59
raw_data.head()
60
61
"""
62
The two categorical features in the dataset are binary-encoded.
63
We will convert this dataset representation to the typical representation, where each
64
categorical feature is represented as a single integer value.
65
"""
66
67
soil_type_values = [f"soil_type_{idx+1}" for idx in range(40)]
68
wilderness_area_values = [f"area_type_{idx+1}" for idx in range(4)]
69
70
soil_type = raw_data.loc[:, 14:53].apply(
71
lambda x: soil_type_values[0::1][x.to_numpy().nonzero()[0][0]], axis=1
72
)
73
wilderness_area = raw_data.loc[:, 10:13].apply(
74
lambda x: wilderness_area_values[0::1][x.to_numpy().nonzero()[0][0]], axis=1
75
)
76
77
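"""
To make the trick above concrete, here is a toy illustration (hypothetical values, not
taken from the dataset): for a row whose four wilderness-area indicator columns are
`[0, 0, 1, 0]`, `nonzero()` returns index 2, which maps to the string `"area_type_3"`.
"""

# Toy sanity check of the one-hot-to-string conversion used above.
example_indicators = np.array([0, 0, 1, 0])
print(wilderness_area_values[example_indicators.nonzero()[0][0]])  # -> "area_type_3"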
CSV_HEADER = [
    "Elevation",
    "Aspect",
    "Slope",
    "Horizontal_Distance_To_Hydrology",
    "Vertical_Distance_To_Hydrology",
    "Horizontal_Distance_To_Roadways",
    "Hillshade_9am",
    "Hillshade_Noon",
    "Hillshade_3pm",
    "Horizontal_Distance_To_Fire_Points",
    "Wilderness_Area",
    "Soil_Type",
    "Cover_Type",
]

data = pd.concat(
    [raw_data.loc[:, 0:9], wilderness_area, soil_type, raw_data.loc[:, 54]],
    axis=1,
    ignore_index=True,
)
data.columns = CSV_HEADER

# Convert the target label indices into a range from 0 to 6 (there are 7 labels in total).
data["Cover_Type"] = data["Cover_Type"] - 1

print(f"Dataset shape: {data.shape}")
data.head().T

"""
107
The shape of the DataFrame shows there are 13 columns per sample
108
(12 for the features and 1 for the target label).
109
110
Let's split the data into training (85%) and test (15%) sets.
111
"""
112
113
train_splits = []
114
test_splits = []
115
116
for _, group_data in data.groupby("Cover_Type"):
117
random_selection = np.random.rand(len(group_data.index)) <= 0.85
118
train_splits.append(group_data[random_selection])
119
test_splits.append(group_data[~random_selection])
120
121
train_data = pd.concat(train_splits).sample(frac=1).reset_index(drop=True)
122
test_data = pd.concat(test_splits).sample(frac=1).reset_index(drop=True)
123
124
print(f"Train split size: {len(train_data.index)}")
125
print(f"Test split size: {len(test_data.index)}")
126
127
"""
128
Next, store the training and test data in separate CSV files.
129
"""
130
131
train_data_file = "train_data.csv"
132
test_data_file = "test_data.csv"
133
134
train_data.to_csv(train_data_file, index=False)
135
test_data.to_csv(test_data_file, index=False)
136
137
"""
138
## Define dataset metadata
139
140
Here, we define the metadata of the dataset that will be useful for reading and parsing
141
the data into input features, and encoding the input features with respect to their types.
142
"""
143
144
TARGET_FEATURE_NAME = "Cover_Type"
145
146
TARGET_FEATURE_LABELS = ["0", "1", "2", "3", "4", "5", "6"]
147
148
NUMERIC_FEATURE_NAMES = [
149
"Aspect",
150
"Elevation",
151
"Hillshade_3pm",
152
"Hillshade_9am",
153
"Hillshade_Noon",
154
"Horizontal_Distance_To_Fire_Points",
155
"Horizontal_Distance_To_Hydrology",
156
"Horizontal_Distance_To_Roadways",
157
"Slope",
158
"Vertical_Distance_To_Hydrology",
159
]
160
161
CATEGORICAL_FEATURES_WITH_VOCABULARY = {
162
"Soil_Type": list(data["Soil_Type"].unique()),
163
"Wilderness_Area": list(data["Wilderness_Area"].unique()),
164
}
165
166
CATEGORICAL_FEATURE_NAMES = list(CATEGORICAL_FEATURES_WITH_VOCABULARY.keys())
167
168
FEATURE_NAMES = NUMERIC_FEATURE_NAMES + CATEGORICAL_FEATURE_NAMES
169
170
COLUMN_DEFAULTS = [
171
[0] if feature_name in NUMERIC_FEATURE_NAMES + [TARGET_FEATURE_NAME] else ["NA"]
172
for feature_name in CSV_HEADER
173
]
174
175
NUM_CLASSES = len(TARGET_FEATURE_LABELS)
176
177
"""
178
## Experiment setup
179
180
Next, let's define an input function that reads and parses the file, then converts features
181
and labels into a[`tf.data.Dataset`](https://www.tensorflow.org/guide/datasets)
182
for training or evaluation.
183
"""
184
185
186
# To convert the datasets elements to from OrderedDict to Dictionary
187
def process(features, target):
188
return dict(features), target
189
190
191
def get_dataset_from_csv(csv_file_path, batch_size, shuffle=False):
192
dataset = tf_data.experimental.make_csv_dataset(
193
csv_file_path,
194
batch_size=batch_size,
195
column_names=CSV_HEADER,
196
column_defaults=COLUMN_DEFAULTS,
197
label_name=TARGET_FEATURE_NAME,
198
num_epochs=1,
199
header=True,
200
shuffle=shuffle,
201
).map(process)
202
return dataset.cache()
203
204
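"""
As a quick sanity check (a minimal sketch, not required for the rest of the example),
we can peek at a single batch to confirm that the features arrive as a dictionary of
tensors keyed by feature name, with the label tensor alongside:
"""

for features, target in get_dataset_from_csv(train_data_file, batch_size=5).take(1):
    for feature_name, value in features.items():
        print(f"{feature_name}: dtype={value.dtype.name}, shape={value.shape}")
    print(f"{TARGET_FEATURE_NAME} (target): {target.numpy()}")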
"""
Here we configure the parameters and implement the procedure for running a training and
evaluation experiment given a model.
"""

learning_rate = 0.001
dropout_rate = 0.1
batch_size = 265
num_epochs = 1

hidden_units = [32, 32]


def run_experiment(model):
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
        loss=keras.losses.SparseCategoricalCrossentropy(),
        metrics=[keras.metrics.SparseCategoricalAccuracy()],
    )

    train_dataset = get_dataset_from_csv(train_data_file, batch_size, shuffle=True)

    test_dataset = get_dataset_from_csv(test_data_file, batch_size)

    print("Start training the model...")
    history = model.fit(train_dataset, epochs=num_epochs)
    print("Model training finished")

    _, accuracy = model.evaluate(test_dataset, verbose=0)

    print(f"Test accuracy: {round(accuracy * 100, 2)}%")


"""
239
## Create model inputs
240
241
Now, define the inputs for the models as a dictionary, where the key is the feature name,
242
and the value is a `keras.layers.Input` tensor with the corresponding feature shape
243
and data type.
244
"""
245
246
247
def create_model_inputs():
    inputs = {}
    for feature_name in FEATURE_NAMES:
        if feature_name in NUMERIC_FEATURE_NAMES:
            inputs[feature_name] = layers.Input(
                name=feature_name, shape=(), dtype="float32"
            )
        else:
            inputs[feature_name] = layers.Input(
                name=feature_name, shape=(), dtype="string"
            )
    return inputs


"""
262
## Encode features
263
264
We create two representations of our input features: sparse and dense:
265
1. In the **sparse** representation, the categorical features are encoded with one-hot
266
encoding using the `CategoryEncoding` layer. This representation can be useful for the
267
model to *memorize* particular feature values to make certain predictions.
268
2. In the **dense** representation, the categorical features are encoded with
269
low-dimensional embeddings using the `Embedding` layer. This representation helps
270
the model to *generalize* well to unseen feature combinations.
271
"""
272
273
274
def encode_inputs(inputs, use_embedding=False):
    encoded_features = []
    for feature_name in inputs:
        if feature_name in CATEGORICAL_FEATURE_NAMES:
            vocabulary = CATEGORICAL_FEATURES_WITH_VOCABULARY[feature_name]
            # Create a lookup to convert string values to integer indices.
            # Since we are not using a mask token, nor expecting any out-of-vocabulary
            # (OOV) token, we set mask_token to None and num_oov_indices to 0.
            lookup = layers.StringLookup(
                vocabulary=vocabulary,
                mask_token=None,
                num_oov_indices=0,
                output_mode="int" if use_embedding else "binary",
            )
            if use_embedding:
                # Convert the string input values into integer indices.
                encoded_feature = lookup(inputs[feature_name])
                embedding_dims = int(math.sqrt(len(vocabulary)))
                # Create an embedding layer with the specified dimensions.
                embedding = layers.Embedding(
                    input_dim=len(vocabulary), output_dim=embedding_dims
                )
                # Convert the index values to embedding representations.
                encoded_feature = embedding(encoded_feature)
            else:
                # Convert the string input values into a one-hot encoding.
                encoded_feature = lookup(
                    keras.ops.expand_dims(inputs[feature_name], -1)
                )
        else:
            # Use the numerical features as-is.
            encoded_feature = keras.ops.expand_dims(inputs[feature_name], -1)

        encoded_features.append(encoded_feature)

    all_features = layers.concatenate(encoded_features)
    return all_features


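"""
Note that, with the `int(math.sqrt(len(vocabulary)))` heuristic above, the `Soil_Type`
vocabulary (up to 40 values) is embedded into `int(math.sqrt(40)) = 6` dimensions, and
the `Wilderness_Area` vocabulary (up to 4 values) into 2 dimensions.
"""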
"""
314
## Experiment 1: a baseline model
315
316
In the first experiment, let's create a multi-layer feed-forward network,
317
where the categorical features are one-hot encoded.
318
"""
319
320
321
def create_baseline_model():
    inputs = create_model_inputs()
    features = encode_inputs(inputs)

    for units in hidden_units:
        features = layers.Dense(units)(features)
        features = layers.BatchNormalization()(features)
        features = layers.ReLU()(features)
        features = layers.Dropout(dropout_rate)(features)

    outputs = layers.Dense(units=NUM_CLASSES, activation="softmax")(features)
    model = keras.Model(inputs=inputs, outputs=outputs)
    return model


baseline_model = create_baseline_model()
keras.utils.plot_model(baseline_model, show_shapes=True, rankdir="LR")

"""
Let's run it:
"""

run_experiment(baseline_model)

"""
The baseline model achieves ~76% test accuracy.
"""

"""
350
## Experiment 2: Wide & Deep model
351
352
In the second experiment, we create a Wide & Deep model. The wide part of the model
353
a linear model, while the deep part of the model is a multi-layer feed-forward network.
354
355
Use the sparse representation of the input features in the wide part of the model and the
356
dense representation of the input features for the deep part of the model.
357
358
Note that every input features contributes to both parts of the model with different
359
representations.
360
"""
361
362
363
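"""
Restating what the code below does: the wide branch sees the one-hot features (followed by
batch normalization), the deep branch sees the embedded features passed through the hidden
layers, and the two branches are combined by concatenation, so the prediction is roughly
`softmax(W [wide_features ; deep_features] + b)` for a learned weight matrix `W` and bias `b`.
"""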
def create_wide_and_deep_model():
    inputs = create_model_inputs()
    wide = encode_inputs(inputs)
    wide = layers.BatchNormalization()(wide)

    deep = encode_inputs(inputs, use_embedding=True)
    for units in hidden_units:
        deep = layers.Dense(units)(deep)
        deep = layers.BatchNormalization()(deep)
        deep = layers.ReLU()(deep)
        deep = layers.Dropout(dropout_rate)(deep)

    merged = layers.concatenate([wide, deep])
    outputs = layers.Dense(units=NUM_CLASSES, activation="softmax")(merged)
    model = keras.Model(inputs=inputs, outputs=outputs)
    return model


wide_and_deep_model = create_wide_and_deep_model()
keras.utils.plot_model(wide_and_deep_model, show_shapes=True, rankdir="LR")

"""
Let's run it:
"""

run_experiment(wide_and_deep_model)

"""
The wide and deep model achieves ~79% test accuracy.
"""

"""
395
## Experiment 3: Deep & Cross model
396
397
In the third experiment, we create a Deep & Cross model. The deep part of this model
398
is the same as the deep part created in the previous experiment. The key idea of
399
the cross part is to apply explicit feature crossing in an efficient way,
400
where the degree of cross features grows with layer depth.
401
"""
402
403
404
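"""
Concretely, each cross layer in the code below computes

`x_{l+1} = x_0 * (W_l x_l + b_l) + x_l`

where `x_0` is the embedded input, `x_l` is the output of the previous cross layer,
`*` denotes the element-wise product, and the `Dense` layer supplies the weights `W_l`
and bias `b_l`. Multiplying by `x_0` at every layer is what raises the degree of the
feature interactions as the stack gets deeper (this simply restates the computation in
`create_deep_and_cross_model` below).
"""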
def create_deep_and_cross_model():
    inputs = create_model_inputs()
    x0 = encode_inputs(inputs, use_embedding=True)

    cross = x0
    for _ in hidden_units:
        units = cross.shape[-1]
        x = layers.Dense(units)(cross)
        cross = x0 * x + cross
    cross = layers.BatchNormalization()(cross)

    deep = x0
    for units in hidden_units:
        deep = layers.Dense(units)(deep)
        deep = layers.BatchNormalization()(deep)
        deep = layers.ReLU()(deep)
        deep = layers.Dropout(dropout_rate)(deep)

    merged = layers.concatenate([cross, deep])
    outputs = layers.Dense(units=NUM_CLASSES, activation="softmax")(merged)
    model = keras.Model(inputs=inputs, outputs=outputs)
    return model


deep_and_cross_model = create_deep_and_cross_model()
keras.utils.plot_model(deep_and_cross_model, show_shapes=True, rankdir="LR")

"""
Let's run it:
"""

run_experiment(deep_and_cross_model)

"""
The deep and cross model achieves ~81% test accuracy.
"""

"""
442
## Conclusion
443
444
You can use Keras Preprocessing Layers to easily handle categorical features
445
with different encoding mechanisms, including one-hot encoding and feature embedding.
446
In addition, different model architectures — like wide, deep, and cross networks
447
— have different advantages, with respect to different dataset properties.
448
You can explore using them independently or combining them to achieve the best result
449
for your dataset.
450
"""
451
452