CoCalc -- classification_with

GitHub Repository: keras-team/keras-io
Path: blob/master/examples/structured_data/classification_with_tfdf.py
³⁵⁰⁷ views
1
"""
2
Title: Classification with TensorFlow Decision Forests
3
Author: [Khalid Salama](https://www.linkedin.com/in/khalid-salama-24403144/)
4
Date created: 2022/01/25
5
Last modified: 2022/01/25
6
Description: Using TensorFlow Decision Forests for structured data classification.
7
Accelerator: GPU
8
"""
9

10
"""
11
## Introduction
12

13
[TensorFlow Decision Forests](https://www.tensorflow.org/decision_forests)
14
is a collection of state-of-the-art algorithms of Decision Forest models
15
that are compatible with Keras APIs.
16
The models include [Random Forests](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/RandomForestModel),
17
[Gradient Boosted Trees](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/GradientBoostedTreesModel),
18
and [CART](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/CartModel),
19
and can be used for regression, classification, and ranking task.
20
For a beginner's guide to TensorFlow Decision Forests,
21
please refer to this [tutorial](https://www.tensorflow.org/decision_forests/tutorials/beginner_colab).
22

23

24
This example uses Gradient Boosted Trees model in binary classification of
25
structured data, and covers the following scenarios:
26

27
1. Build a decision forests model by specifying the input feature usage.
28
2. Implement a custom *Binary Target encoder* as a [Keras Preprocessing layer](https://keras.io/api/layers/preprocessing_layers/)
29
to encode the categorical features with respect to their target value co-occurrences,
30
and then use the encoded features to build a decision forests model.
31
3. Encode the categorical features as [embeddings](https://keras.io/api/layers/core_layers/embedding),
32
train these embeddings in a simple NN model, and then use the
33
trained embeddings as inputs to build decision forests model.
34

35
This example uses TensorFlow 2.7 or higher,
36
as well as [TensorFlow Decision Forests](https://www.tensorflow.org/decision_forests),
37
which you can install using the following command:
38

39
```python
40
pip install -U tensorflow_decision_forests
41
```
42
"""
43

44
"""
45
## Setup
46
"""
47

48
import math
49
import urllib
50
import numpy as np
51
import pandas as pd
52
import tensorflow as tf
53
from tensorflow import keras
54
from tensorflow.keras import layers
55
import tensorflow_decision_forests as tfdf
56

57
"""
58
## Prepare the data
59

60
This example uses the
61
[United States Census Income Dataset](https://archive.ics.uci.edu/ml/datasets/Census-Income+%28KDD%29)
62
provided by the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php).
63
The task is binary classification to determine whether a person makes over 50K a year.
64

65
The dataset includes ~300K instances with 41 input features: 7 numerical features
66
and 34 categorical features.
67

68
First we load the data from the UCI Machine Learning Repository into a Pandas DataFrame.
69
"""
70

71
BASE_PATH = "https://kdd.ics.uci.edu/databases/census-income/census-income"
72
CSV_HEADER = [
73
    l.decode("utf-8").split(":")[0].replace(" ", "_")
74
    for l in urllib.request.urlopen(f"{BASE_PATH}.names")
75
    if not l.startswith(b"|")
76
][2:]
77
CSV_HEADER.append("income_level")
78

79
train_data = pd.read_csv(
80
    f"{BASE_PATH}.data.gz",
81
    header=None,
82
    names=CSV_HEADER,
83
)
84
test_data = pd.read_csv(
85
    f"{BASE_PATH}.test.gz",
86
    header=None,
87
    names=CSV_HEADER,
88
)
89

90
"""
91
## Define dataset metadata
92

93
Here, we define the metadata of the dataset that will be useful for encoding
94
the input features with respect to their types.
95
"""
96

97
# Target column name.
98
TARGET_COLUMN_NAME = "income_level"
99
# The labels of the target columns.
100
TARGET_LABELS = [" - 50000.", " 50000+."]
101
# Weight column name.
102
WEIGHT_COLUMN_NAME = "instance_weight"
103
# Numeric feature names.
104
NUMERIC_FEATURE_NAMES = [
105
    "age",
106
    "wage_per_hour",
107
    "capital_gains",
108
    "capital_losses",
109
    "dividends_from_stocks",
110
    "num_persons_worked_for_employer",
111
    "weeks_worked_in_year",
112
]
113
# Categorical features and their vocabulary lists.
114
CATEGORICAL_FEATURE_NAMES = [
115
    "class_of_worker",
116
    "detailed_industry_recode",
117
    "detailed_occupation_recode",
118
    "education",
119
    "enroll_in_edu_inst_last_wk",
120
    "marital_stat",
121
    "major_industry_code",
122
    "major_occupation_code",
123
    "race",
124
    "hispanic_origin",
125
    "sex",
126
    "member_of_a_labor_union",
127
    "reason_for_unemployment",
128
    "full_or_part_time_employment_stat",
129
    "tax_filer_stat",
130
    "region_of_previous_residence",
131
    "state_of_previous_residence",
132
    "detailed_household_and_family_stat",
133
    "detailed_household_summary_in_household",
134
    "migration_code-change_in_msa",
135
    "migration_code-change_in_reg",
136
    "migration_code-move_within_reg",
137
    "live_in_this_house_1_year_ago",
138
    "migration_prev_res_in_sunbelt",
139
    "family_members_under_18",
140
    "country_of_birth_father",
141
    "country_of_birth_mother",
142
    "country_of_birth_self",
143
    "citizenship",
144
    "own_business_or_self_employed",
145
    "fill_inc_questionnaire_for_veteran's_admin",
146
    "veterans_benefits",
147
    "year",
148
]
149

150

151
"""
152
Now we perform basic data preparation.
153
"""
154

155

156
def prepare_dataframe(dataframe):
157
    # Convert the target labels from string to integer.
158
    dataframe[TARGET_COLUMN_NAME] = dataframe[TARGET_COLUMN_NAME].map(
159
        TARGET_LABELS.index
160
    )
161
    # Cast the categorical features to string.
162
    for feature_name in CATEGORICAL_FEATURE_NAMES:
163
        dataframe[feature_name] = dataframe[feature_name].astype(str)
164

165

166
prepare_dataframe(train_data)
167
prepare_dataframe(test_data)
168

169
"""
170
Now let's show the shapes of the training and test dataframes, and display some instances.
171
"""
172

173
print(f"Train data shape: {train_data.shape}")
174
print(f"Test data shape: {test_data.shape}")
175
print(train_data.head().T)
176

177
"""
178
## Configure hyperparameters
179

180
You can find all the parameters of the Gradient Boosted Tree model in the
181
[documentation](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/GradientBoostedTreesModel)
182
"""
183

184
# Maximum number of decision trees. The effective number of trained trees can be smaller if early stopping is enabled.
185
NUM_TREES = 250
186
# Minimum number of examples in a node.
187
MIN_EXAMPLES = 6
188
# Maximum depth of the tree. max_depth=1 means that all trees will be roots.
189
MAX_DEPTH = 5
190
# Ratio of the dataset (sampling without replacement) used to train individual trees for the random sampling method.
191
SUBSAMPLE = 0.65
192
# Control the sampling of the datasets used to train individual trees.
193
SAMPLING_METHOD = "RANDOM"
194
# Ratio of the training dataset used to monitor the training. Require to be >0 if early stopping is enabled.
195
VALIDATION_RATIO = 0.1
196

197
"""
198
## Implement a training and evaluation procedure
199

200
The `run_experiment()` method is responsible loading the train and test datasets,
201
training a given model, and evaluating the trained model.
202

203
Note that when training a Decision Forests model, only one epoch is needed to
204
read the full dataset. Any extra steps will result in unnecessary slower training.
205
Therefore, the default `num_epochs=1` is used in the `run_experiment()` method.
206
"""
207

208

209
def run_experiment(model, train_data, test_data, num_epochs=1, batch_size=None):
210
    train_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(
211
        train_data, label=TARGET_COLUMN_NAME, weight=WEIGHT_COLUMN_NAME
212
    )
213
    test_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(
214
        test_data, label=TARGET_COLUMN_NAME, weight=WEIGHT_COLUMN_NAME
215
    )
216

217
    model.fit(train_dataset, epochs=num_epochs, batch_size=batch_size)
218
    _, accuracy = model.evaluate(test_dataset, verbose=0)
219
    print(f"Test accuracy: {round(accuracy * 100, 2)}%")
220

221

222
"""
223
## Experiment 1: Decision Forests with raw features
224
"""
225

226
"""
227
### Specify model input feature usages
228

229
You can attach semantics to each feature to control how it is used by the model.
230
If not specified, the semantics are inferred from the representation type.
231
It is recommended to specify the [feature usages](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/FeatureUsage)
232
explicitly to avoid incorrect inferred semantics is incorrect.
233
For example, a categorical value identifier (integer) will be be inferred as numerical,
234
while it is semantically categorical.
235

236
For numerical features, you can set the `discretized` parameters to the number
237
of buckets by which the numerical feature should be discretized.
238
This makes the training faster but may lead to worse models.
239
"""
240

241

242
def specify_feature_usages():
243
    feature_usages = []
244

245
    for feature_name in NUMERIC_FEATURE_NAMES:
246
        feature_usage = tfdf.keras.FeatureUsage(
247
            name=feature_name, semantic=tfdf.keras.FeatureSemantic.NUMERICAL
248
        )
249
        feature_usages.append(feature_usage)
250

251
    for feature_name in CATEGORICAL_FEATURE_NAMES:
252
        feature_usage = tfdf.keras.FeatureUsage(
253
            name=feature_name, semantic=tfdf.keras.FeatureSemantic.CATEGORICAL
254
        )
255
        feature_usages.append(feature_usage)
256

257
    return feature_usages
258

259

260
"""
261
### Create a Gradient Boosted Trees model
262

263
When compiling a decision forests model, you may only provide extra evaluation metrics.
264
The loss is specified in the model construction,
265
and the optimizer is irrelevant to decision forests models.
266
"""
267

268

269
def create_gbt_model():
270
    # See all the model parameters in https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/GradientBoostedTreesModel
271
    gbt_model = tfdf.keras.GradientBoostedTreesModel(
272
        features=specify_feature_usages(),
273
        exclude_non_specified_features=True,
274
        num_trees=NUM_TREES,
275
        max_depth=MAX_DEPTH,
276
        min_examples=MIN_EXAMPLES,
277
        subsample=SUBSAMPLE,
278
        validation_ratio=VALIDATION_RATIO,
279
        task=tfdf.keras.Task.CLASSIFICATION,
280
    )
281

282
    gbt_model.compile(metrics=[keras.metrics.BinaryAccuracy(name="accuracy")])
283
    return gbt_model
284

285

286
"""
287
### Train and evaluate the model
288
"""
289

290
gbt_model = create_gbt_model()
291
run_experiment(gbt_model, train_data, test_data)
292

293
"""
294
### Inspect the model
295

296
The `model.summary()` method will display several types of information about
297
your decision trees model, model type, task, input features, and feature importance.
298
"""
299

300
print(gbt_model.summary())
301

302
"""
303
## Experiment 2: Decision Forests with target encoding
304

305
[Target encoding](https://dl.acm.org/doi/10.1145/507533.507538) is a common preprocessing
306
technique for categorical features that convert them into numerical features.
307
Using categorical features with high cardinality as-is may lead to overfitting.
308
Target encoding aims to replace each categorical feature value with one or more
309
numerical values that represent its co-occurrence with the target labels.
310

311
More precisely, given a categorical feature, the binary target encoder in this example
312
will produce three new numerical features:
313

314
1. `positive_frequency`: How many times each feature value occurred with a positive target label.
315
2. `negative_frequency`: How many times each feature value occurred with a negative target label.
316
3. `positive_probability`: The probability that the target label is positive,
317
given the feature value, which is computed as
318
`positive_frequency / (positive_frequency + negative_frequency + correction)`.
319
The `correction` term is added in to make the division more stable for rare categorical values.
320
The default value for `correction` is 1.0.
321

322

323

324
Note that target encoding is effective with models that cannot automatically
325
learn dense representations to categorical features, such as decision forests
326
or kernel methods. If neural network models are used, its recommended to
327
encode categorical features as embeddings.
328
"""
329

330
"""
331
### Implement Binary Target Encoder
332

333
For simplicity, we assume that the inputs for the `adapt` and `call` methods
334
are in the expected data types and shapes, so no validation logic is added.
335

336
It is recommended to pass the `vocabulary_size` of the categorical feature to the
337
`BinaryTargetEncoding` constructor. If not specified, it will be computed during
338
the `adapt()` method execution.
339
"""
340

341

342
class BinaryTargetEncoding(layers.Layer):
343
    def __init__(self, vocabulary_size=None, correction=1.0, **kwargs):
344
        super().__init__(**kwargs)
345
        self.vocabulary_size = vocabulary_size
346
        self.correction = correction
347

348
    def adapt(self, data):
349
        # data is expected to be an integer numpy array to a Tensor shape [num_exmples, 2].
350
        # This contains feature values for a given feature in the dataset, and target values.
351

352
        # Convert the data to a tensor.
353
        data = tf.convert_to_tensor(data)
354
        # Separate the feature values and target values
355
        feature_values = tf.cast(data[:, 0], tf.dtypes.int32)
356
        target_values = tf.cast(data[:, 1], tf.dtypes.bool)
357

358
        # Compute the vocabulary_size of not specified.
359
        if self.vocabulary_size is None:
360
            self.vocabulary_size = tf.unique(feature_values).y.shape[0]
361

362
        # Filter the data where the target label is positive.
363
        positive_indices = tf.where(condition=target_values)
364
        positive_feature_values = tf.gather_nd(
365
            params=feature_values, indices=positive_indices
366
        )
367
        # Compute how many times each feature value occurred with a positive target label.
368
        positive_frequency = tf.math.unsorted_segment_sum(
369
            data=tf.ones(
370
                shape=(positive_feature_values.shape[0], 1), dtype=tf.dtypes.float64
371
            ),
372
            segment_ids=positive_feature_values,
373
            num_segments=self.vocabulary_size,
374
        )
375

376
        # Filter the data where the target label is negative.
377
        negative_indices = tf.where(condition=tf.math.logical_not(target_values))
378
        negative_feature_values = tf.gather_nd(
379
            params=feature_values, indices=negative_indices
380
        )
381
        # Compute how many times each feature value occurred with a negative target label.
382
        negative_frequency = tf.math.unsorted_segment_sum(
383
            data=tf.ones(
384
                shape=(negative_feature_values.shape[0], 1), dtype=tf.dtypes.float64
385
            ),
386
            segment_ids=negative_feature_values,
387
            num_segments=self.vocabulary_size,
388
        )
389
        # Compute positive probability for the input feature values.
390
        positive_probability = positive_frequency / (
391
            positive_frequency + negative_frequency + self.correction
392
        )
393
        # Concatenate the computed statistics for traget_encoding.
394
        target_encoding_statistics = tf.cast(
395
            tf.concat(
396
                [positive_frequency, negative_frequency, positive_probability], axis=1
397
            ),
398
            dtype=tf.dtypes.float32,
399
        )
400
        self.target_encoding_statistics = tf.constant(target_encoding_statistics)
401

402
    def call(self, inputs):
403
        # inputs is expected to be an integer numpy array to a Tensor shape [num_exmples, 1].
404
        # This includes the feature values for a given feature in the dataset.
405

406
        # Raise an error if the target encoding statistics are not computed.
407
        if self.target_encoding_statistics == None:
408
            raise ValueError(
409
                f"You need to call the adapt method to compute target encoding statistics."
410
            )
411

412
        # Convert the inputs to a tensor.
413
        inputs = tf.convert_to_tensor(inputs)
414
        # Cast the inputs int64 a tensor.
415
        inputs = tf.cast(inputs, tf.dtypes.int64)
416
        # Lookup target encoding statistics for the input feature values.
417
        target_encoding_statistics = tf.cast(
418
            tf.gather_nd(self.target_encoding_statistics, inputs),
419
            dtype=tf.dtypes.float32,
420
        )
421
        return target_encoding_statistics
422

423

424
"""
425
Let's test the binary target encoder
426
"""
427

428
data = tf.constant(
429
    [
430
        [0, 1],
431
        [2, 0],
432
        [0, 1],
433
        [1, 1],
434
        [1, 1],
435
        [2, 0],
436
        [1, 0],
437
        [0, 1],
438
        [2, 1],
439
        [1, 0],
440
        [0, 1],
441
        [2, 0],
442
        [0, 1],
443
        [1, 1],
444
        [1, 1],
445
        [2, 0],
446
        [1, 0],
447
        [0, 1],
448
        [2, 0],
449
    ]
450
)
451

452
binary_target_encoder = BinaryTargetEncoding()
453
binary_target_encoder.adapt(data)
454
print(binary_target_encoder([[0], [1], [2]]))
455

456
"""
457
### Create model inputs
458
"""
459

460

461
def create_model_inputs():
462
    inputs = {}
463

464
    for feature_name in NUMERIC_FEATURE_NAMES:
465
        inputs[feature_name] = layers.Input(
466
            name=feature_name, shape=(), dtype=tf.float32
467
        )
468

469
    for feature_name in CATEGORICAL_FEATURE_NAMES:
470
        inputs[feature_name] = layers.Input(
471
            name=feature_name, shape=(), dtype=tf.string
472
        )
473

474
    return inputs
475

476

477
"""
478
### Implement a feature encoding with target encoding
479
"""
480

481

482
def create_target_encoder():
483
    inputs = create_model_inputs()
484
    target_values = train_data[[TARGET_COLUMN_NAME]].to_numpy()
485
    encoded_features = []
486
    for feature_name in inputs:
487
        if feature_name in CATEGORICAL_FEATURE_NAMES:
488
            # Get the vocabulary of the categorical feature.
489
            vocabulary = sorted(
490
                [str(value) for value in list(train_data[feature_name].unique())]
491
            )
492
            # Create a lookup to convert string values to an integer indices.
493
            # Since we are not using a mask token nor expecting any out of vocabulary
494
            # (oov) token, we set mask_token to None and  num_oov_indices to 0.
495
            lookup = layers.StringLookup(
496
                vocabulary=vocabulary, mask_token=None, num_oov_indices=0
497
            )
498
            # Convert the string input values into integer indices.
499
            value_indices = lookup(inputs[feature_name])
500
            # Prepare the data to adapt the target encoding.
501
            print("### Adapting target encoding for:", feature_name)
502
            feature_values = train_data[[feature_name]].to_numpy().astype(str)
503
            feature_value_indices = lookup(feature_values)
504
            data = tf.concat([feature_value_indices, target_values], axis=1)
505
            feature_encoder = BinaryTargetEncoding()
506
            feature_encoder.adapt(data)
507
            # Convert the feature value indices to target encoding representations.
508
            encoded_feature = feature_encoder(tf.expand_dims(value_indices, -1))
509
        else:
510
            # Expand the dimensions of the numerical input feature and use it as-is.
511
            encoded_feature = tf.expand_dims(inputs[feature_name], -1)
512
        # Add the encoded feature to the list.
513
        encoded_features.append(encoded_feature)
514
    # Concatenate all the encoded features.
515
    encoded_features = tf.concat(encoded_features, axis=1)
516
    # Create and return a Keras model with encoded features as outputs.
517
    return keras.Model(inputs=inputs, outputs=encoded_features)
518

519

520
"""
521
### Create a Gradient Boosted Trees model with a preprocessor
522

523
In this scenario, we use the target encoding as a preprocessor for the Gradient Boosted Tree model,
524
and let the model infer semantics of the input features.
525
"""
526

527

528
def create_gbt_with_preprocessor(preprocessor):
529
    gbt_model = tfdf.keras.GradientBoostedTreesModel(
530
        preprocessing=preprocessor,
531
        num_trees=NUM_TREES,
532
        max_depth=MAX_DEPTH,
533
        min_examples=MIN_EXAMPLES,
534
        subsample=SUBSAMPLE,
535
        validation_ratio=VALIDATION_RATIO,
536
        task=tfdf.keras.Task.CLASSIFICATION,
537
    )
538

539
    gbt_model.compile(metrics=[keras.metrics.BinaryAccuracy(name="accuracy")])
540

541
    return gbt_model
542

543

544
"""
545
### Train and evaluate the model
546
"""
547

548
gbt_model = create_gbt_with_preprocessor(create_target_encoder())
549
run_experiment(gbt_model, train_data, test_data)
550

551
"""
552
## Experiment 3: Decision Forests with trained embeddings
553

554
In this scenario, we build an encoder model that codes the categorical
555
features to embeddings, where the size of the embedding for a given categorical
556
feature is the square root to the size of its vocabulary.
557

558
We train these embeddings in a simple NN model through backpropagation.
559
After the embedding encoder is trained, we used it as a preprocessor to the
560
input features of a Gradient Boosted Tree model.
561

562
Note that the embeddings and a decision forest model cannot be trained
563
synergically in one phase, since decision forest models do not train with backpropagation.
564
Rather, embeddings has to be trained in an initial phase,
565
and then used as static inputs to the decision forest model.
566
"""
567

568
"""
569
### Implement feature encoding with embeddings
570
"""
571

572

573
def create_embedding_encoder(size=None):
574
    inputs = create_model_inputs()
575
    encoded_features = []
576
    for feature_name in inputs:
577
        if feature_name in CATEGORICAL_FEATURE_NAMES:
578
            # Get the vocabulary of the categorical feature.
579
            vocabulary = sorted(
580
                [str(value) for value in list(train_data[feature_name].unique())]
581
            )
582
            # Create a lookup to convert string values to an integer indices.
583
            # Since we are not using a mask token nor expecting any out of vocabulary
584
            # (oov) token, we set mask_token to None and  num_oov_indices to 0.
585
            lookup = layers.StringLookup(
586
                vocabulary=vocabulary, mask_token=None, num_oov_indices=0
587
            )
588
            # Convert the string input values into integer indices.
589
            value_index = lookup(inputs[feature_name])
590
            # Create an embedding layer with the specified dimensions
591
            vocabulary_size = len(vocabulary)
592
            embedding_size = int(math.sqrt(vocabulary_size))
593
            feature_encoder = layers.Embedding(
594
                input_dim=len(vocabulary), output_dim=embedding_size
595
            )
596
            # Convert the index values to embedding representations.
597
            encoded_feature = feature_encoder(value_index)
598
        else:
599
            # Expand the dimensions of the numerical input feature and use it as-is.
600
            encoded_feature = tf.expand_dims(inputs[feature_name], -1)
601
        # Add the encoded feature to the list.
602
        encoded_features.append(encoded_feature)
603
    # Concatenate all the encoded features.
604
    encoded_features = layers.concatenate(encoded_features, axis=1)
605
    # Apply dropout.
606
    encoded_features = layers.Dropout(rate=0.25)(encoded_features)
607
    # Perform non-linearity projection.
608
    encoded_features = layers.Dense(
609
        units=size if size else encoded_features.shape[-1], activation="gelu"
610
    )(encoded_features)
611
    # Create and return a Keras model with encoded features as outputs.
612
    return keras.Model(inputs=inputs, outputs=encoded_features)
613

614

615
"""
616
### Build an NN model to train the embeddings
617
"""
618

619

620
def create_nn_model(encoder):
621
    inputs = create_model_inputs()
622
    embeddings = encoder(inputs)
623
    output = layers.Dense(units=1, activation="sigmoid")(embeddings)
624

625
    nn_model = keras.Model(inputs=inputs, outputs=output)
626
    nn_model.compile(
627
        optimizer=keras.optimizers.Adam(),
628
        loss=keras.losses.BinaryCrossentropy(),
629
        metrics=[keras.metrics.BinaryAccuracy("accuracy")],
630
    )
631
    return nn_model
632

633

634
embedding_encoder = create_embedding_encoder(size=64)
635
run_experiment(
636
    create_nn_model(embedding_encoder),
637
    train_data,
638
    test_data,
639
    num_epochs=5,
640
    batch_size=256,
641
)
642

643
"""
644
### Train and evaluate a Gradient Boosted Tree model with embeddings
645
"""
646

647
gbt_model = create_gbt_with_preprocessor(embedding_encoder)
648
run_experiment(gbt_model, train_data, test_data)
649

650
"""
651
## Concluding remarks
652

653
TensorFlow Decision Forests provide powerful models, especially with structured data.
654
In our experiments, the Gradient Boosted Tree model achieved 95.79% test accuracy.
655
When using the target encoding with categorical feature, the same model achieved 95.81% test accuracy.
656
When pretraining embeddings to be used as inputs to the Gradient Boosted Tree model,
657
we achieved 95.82% test accuracy.
658

659
Decision Forests can be used with Neural Networks, either by
660
1) using Neural Networks to learn useful representation of the input data,
661
and then using Decision Forests for the supervised learning task, or by
662
2) creating an ensemble of both Decision Forests and Neural Network models.
663

664
Note that TensorFlow Decision Forests does not (yet) support hardware accelerators.
665
All training and inference is done on the CPU.
666
Besides, Decision Forests require a finite dataset that fits in memory
667
for their training procedures. However, there are diminishing returns
668
for increasing the size of the dataset, and Decision Forests algorithms
669
arguably need fewer examples for convergence than large Neural Network models.
670
"""
671

672
Product

Resources

Company