"""1Title: Classification with TensorFlow Decision Forests2Author: [Khalid Salama](https://www.linkedin.com/in/khalid-salama-24403144/)3Date created: 2022/01/254Last modified: 2022/01/255Description: Using TensorFlow Decision Forests for structured data classification.6Accelerator: GPU7"""89"""10## Introduction1112[TensorFlow Decision Forests](https://www.tensorflow.org/decision_forests)13is a collection of state-of-the-art algorithms of Decision Forest models14that are compatible with Keras APIs.15The models include [Random Forests](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/RandomForestModel),16[Gradient Boosted Trees](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/GradientBoostedTreesModel),17and [CART](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/CartModel),18and can be used for regression, classification, and ranking task.19For a beginner's guide to TensorFlow Decision Forests,20please refer to this [tutorial](https://www.tensorflow.org/decision_forests/tutorials/beginner_colab).212223This example uses Gradient Boosted Trees model in binary classification of24structured data, and covers the following scenarios:25261. Build a decision forests model by specifying the input feature usage.272. Implement a custom *Binary Target encoder* as a [Keras Preprocessing layer](https://keras.io/api/layers/preprocessing_layers/)28to encode the categorical features with respect to their target value co-occurrences,29and then use the encoded features to build a decision forests model.303. Encode the categorical features as [embeddings](https://keras.io/api/layers/core_layers/embedding),31train these embeddings in a simple NN model, and then use the32trained embeddings as inputs to build decision forests model.3334This example uses TensorFlow 2.7 or higher,35as well as [TensorFlow Decision Forests](https://www.tensorflow.org/decision_forests),36which you can install using the following command:3738```python39pip install -U tensorflow_decision_forests40```41"""4243"""44## Setup45"""4647import math48import urllib49import numpy as np50import pandas as pd51import tensorflow as tf52from tensorflow import keras53from tensorflow.keras import layers54import tensorflow_decision_forests as tfdf5556"""57## Prepare the data5859This example uses the60[United States Census Income Dataset](https://archive.ics.uci.edu/ml/datasets/Census-Income+%28KDD%29)61provided by the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php).62The task is binary classification to determine whether a person makes over 50K a year.6364The dataset includes ~300K instances with 41 input features: 7 numerical features65and 34 categorical features.6667First we load the data from the UCI Machine Learning Repository into a Pandas DataFrame.68"""6970BASE_PATH = "https://kdd.ics.uci.edu/databases/census-income/census-income"71CSV_HEADER = [72l.decode("utf-8").split(":")[0].replace(" ", "_")73for l in urllib.request.urlopen(f"{BASE_PATH}.names")74if not l.startswith(b"|")75][2:]76CSV_HEADER.append("income_level")7778train_data = pd.read_csv(79f"{BASE_PATH}.data.gz",80header=None,81names=CSV_HEADER,82)83test_data = pd.read_csv(84f"{BASE_PATH}.test.gz",85header=None,86names=CSV_HEADER,87)8889"""90## Define dataset metadata9192Here, we define the metadata of the dataset that will be useful for encoding93the input features with respect to their types.94"""9596# Target column name.97TARGET_COLUMN_NAME = "income_level"98# The labels of the target 

"""
## Define dataset metadata

Here, we define the metadata of the dataset that will be useful for encoding
the input features with respect to their types.
"""

# Target column name.
TARGET_COLUMN_NAME = "income_level"
# The labels of the target column.
TARGET_LABELS = [" - 50000.", " 50000+."]
# Weight column name.
WEIGHT_COLUMN_NAME = "instance_weight"
# Numeric feature names.
NUMERIC_FEATURE_NAMES = [
    "age",
    "wage_per_hour",
    "capital_gains",
    "capital_losses",
    "dividends_from_stocks",
    "num_persons_worked_for_employer",
    "weeks_worked_in_year",
]
# Categorical feature names.
CATEGORICAL_FEATURE_NAMES = [
    "class_of_worker",
    "detailed_industry_recode",
    "detailed_occupation_recode",
    "education",
    "enroll_in_edu_inst_last_wk",
    "marital_stat",
    "major_industry_code",
    "major_occupation_code",
    "race",
    "hispanic_origin",
    "sex",
    "member_of_a_labor_union",
    "reason_for_unemployment",
    "full_or_part_time_employment_stat",
    "tax_filer_stat",
    "region_of_previous_residence",
    "state_of_previous_residence",
    "detailed_household_and_family_stat",
    "detailed_household_summary_in_household",
    "migration_code-change_in_msa",
    "migration_code-change_in_reg",
    "migration_code-move_within_reg",
    "live_in_this_house_1_year_ago",
    "migration_prev_res_in_sunbelt",
    "family_members_under_18",
    "country_of_birth_father",
    "country_of_birth_mother",
    "country_of_birth_self",
    "citizenship",
    "own_business_or_self_employed",
    "fill_inc_questionnaire_for_veteran's_admin",
    "veterans_benefits",
    "year",
]


"""
Now we perform basic data preparation.
"""


def prepare_dataframe(dataframe):
    # Convert the target labels from string to integer.
    dataframe[TARGET_COLUMN_NAME] = dataframe[TARGET_COLUMN_NAME].map(
        TARGET_LABELS.index
    )
    # Cast the categorical features to string.
    for feature_name in CATEGORICAL_FEATURE_NAMES:
        dataframe[feature_name] = dataframe[feature_name].astype(str)


prepare_dataframe(train_data)
prepare_dataframe(test_data)

"""
Now let's show the shapes of the training and test dataframes, and display some instances.
"""

print(f"Train data shape: {train_data.shape}")
print(f"Test data shape: {test_data.shape}")
print(train_data.head().T)

"""
## Configure hyperparameters

You can find all the parameters of the Gradient Boosted Tree model in the
[documentation](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/GradientBoostedTreesModel).
"""

# Maximum number of decision trees. The effective number of trained trees can be smaller if early stopping is enabled.
NUM_TREES = 250
# Minimum number of examples in a node.
MIN_EXAMPLES = 6
# Maximum depth of the tree. max_depth=1 means that all trees will be roots.
MAX_DEPTH = 5
# Ratio of the dataset (sampling without replacement) used to train individual trees for the random sampling method.
SUBSAMPLE = 0.65
# Controls the sampling of the datasets used to train individual trees.
SAMPLING_METHOD = "RANDOM"
# Ratio of the training dataset used to monitor the training. Required to be >0 if early stopping is enabled.
VALIDATION_RATIO = 0.1

"""
## Implement a training and evaluation procedure

The `run_experiment()` method is responsible for loading the train and test datasets,
training a given model, and evaluating the trained model.

Note that when training a Decision Forests model, only one epoch is needed to
read the full dataset. Any extra steps will result in unnecessarily slower training.
Therefore, the default `num_epochs=1` is used in the `run_experiment()` method.
"""


def run_experiment(model, train_data, test_data, num_epochs=1, batch_size=None):
    train_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(
        train_data, label=TARGET_COLUMN_NAME, weight=WEIGHT_COLUMN_NAME
    )
    test_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(
        test_data, label=TARGET_COLUMN_NAME, weight=WEIGHT_COLUMN_NAME
    )

    model.fit(train_dataset, epochs=num_epochs, batch_size=batch_size)
    _, accuracy = model.evaluate(test_dataset, verbose=0)
    print(f"Test accuracy: {round(accuracy * 100, 2)}%")
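
"""
The conversion above is done by `tfdf.keras.pd_dataframe_to_tf_dataset`, which
turns a Pandas DataFrame into a batched `tf.data.Dataset`. A minimal sketch of
what it yields (a hypothetical snippet, assuming elements are
`(features, label, weight)` tuples when a weight column is provided):

```python
# Illustrative only: peek at one element of the converted dataset.
example_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(
    train_data.head(100), label=TARGET_COLUMN_NAME, weight=WEIGHT_COLUMN_NAME
)
for features, label, weight in example_dataset.take(1):
    print(sorted(features.keys())[:3], label.dtype, weight.dtype)
```
"""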

"""
## Experiment 1: Decision Forests with raw features
"""

"""
### Specify model input feature usages

You can attach semantics to each feature to control how it is used by the model.
If not specified, the semantics are inferred from the representation type.
It is recommended to specify the [feature usages](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/FeatureUsage)
explicitly to avoid incorrect inferred semantics.
For example, a categorical value identifier (integer) will be inferred as numerical,
while it is semantically categorical.

For numerical features, you can set the `discretized` parameter to the number
of buckets by which the numerical feature should be discretized.
This makes the training faster but may lead to worse models.
"""


def specify_feature_usages():
    feature_usages = []

    for feature_name in NUMERIC_FEATURE_NAMES:
        feature_usage = tfdf.keras.FeatureUsage(
            name=feature_name, semantic=tfdf.keras.FeatureSemantic.NUMERICAL
        )
        feature_usages.append(feature_usage)

    for feature_name in CATEGORICAL_FEATURE_NAMES:
        feature_usage = tfdf.keras.FeatureUsage(
            name=feature_name, semantic=tfdf.keras.FeatureSemantic.CATEGORICAL
        )
        feature_usages.append(feature_usage)

    return feature_usages


"""
### Create a Gradient Boosted Trees model

When compiling a decision forests model, you may only provide extra evaluation metrics.
The loss is specified in the model construction,
and the optimizer is irrelevant to decision forests models.
"""


def create_gbt_model():
    # See all the model parameters in https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/GradientBoostedTreesModel
    gbt_model = tfdf.keras.GradientBoostedTreesModel(
        features=specify_feature_usages(),
        exclude_non_specified_features=True,
        num_trees=NUM_TREES,
        max_depth=MAX_DEPTH,
        min_examples=MIN_EXAMPLES,
        subsample=SUBSAMPLE,
        validation_ratio=VALIDATION_RATIO,
        task=tfdf.keras.Task.CLASSIFICATION,
    )

    gbt_model.compile(metrics=[keras.metrics.BinaryAccuracy(name="accuracy")])
    return gbt_model


"""
### Train and evaluate the model
"""

gbt_model = create_gbt_model()
run_experiment(gbt_model, train_data, test_data)

"""
### Inspect the model

The `model.summary()` method displays several types of information about your
decision trees model, including the model type, task, input features, and feature importance.
"""

print(gbt_model.summary())
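
"""
Beyond `summary()`, you can also query the trained model programmatically
(a hypothetical snippet, assuming your `tensorflow_decision_forests` version
exposes the `make_inspector()` API):

```python
# Illustrative only: inspect the trained model.
inspector = gbt_model.make_inspector()
print("Model type:", inspector.model_type())
print("Number of trees:", inspector.num_trees())
print("Variable importances:", list(inspector.variable_importances().keys()))
```
"""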

"""
## Experiment 2: Decision Forests with target encoding

[Target encoding](https://dl.acm.org/doi/10.1145/507533.507538) is a common preprocessing
technique for categorical features that converts them into numerical features.
Using categorical features with high cardinality as-is may lead to overfitting.
Target encoding aims to replace each categorical feature value with one or more
numerical values that represent its co-occurrence with the target labels.

More precisely, given a categorical feature, the binary target encoder in this example
will produce three new numerical features:

1. `positive_frequency`: How many times each feature value occurred with a positive target label.
2. `negative_frequency`: How many times each feature value occurred with a negative target label.
3. `positive_probability`: The probability that the target label is positive,
given the feature value, which is computed as
`positive_frequency / (positive_frequency + negative_frequency + correction)`
(see the worked example below).
The `correction` term is added to make the division more stable for rare categorical values.
The default value for `correction` is 1.0.
"""
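
"""
For example (an illustration with made-up counts, not taken from the dataset):
if a feature value occurs 3 times with a positive target label and once with a
negative target label, the encoder with the default `correction=1.0` produces:

```python
positive_frequency = 3.0
negative_frequency = 1.0
positive_probability = 3.0 / (3.0 + 1.0 + 1.0)  # = 0.6
```
"""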

"""
Note that target encoding is effective with models that cannot automatically
learn dense representations of categorical features, such as decision forests
or kernel methods. If neural network models are used, it is recommended to
encode categorical features as embeddings.
"""

"""
### Implement Binary Target Encoder

For simplicity, we assume that the inputs for the `adapt` and `call` methods
are in the expected data types and shapes, so no validation logic is added.

It is recommended to pass the `vocabulary_size` of the categorical feature to the
`BinaryTargetEncoding` constructor. If not specified, it will be computed during
the `adapt()` method execution.
"""


class BinaryTargetEncoding(layers.Layer):
    def __init__(self, vocabulary_size=None, correction=1.0, **kwargs):
        super().__init__(**kwargs)
        self.vocabulary_size = vocabulary_size
        self.correction = correction
        # The statistics are computed by the adapt() method.
        self.target_encoding_statistics = None

    def adapt(self, data):
        # data is expected to be an integer numpy array or a Tensor of shape [num_examples, 2].
        # This contains the feature values for a given feature in the dataset, and the target values.

        # Convert the data to a tensor.
        data = tf.convert_to_tensor(data)
        # Separate the feature values and target values.
        feature_values = tf.cast(data[:, 0], tf.dtypes.int32)
        target_values = tf.cast(data[:, 1], tf.dtypes.bool)

        # Compute the vocabulary_size if not specified.
        if self.vocabulary_size is None:
            self.vocabulary_size = tf.unique(feature_values).y.shape[0]

        # Filter the data where the target label is positive.
        positive_indices = tf.where(condition=target_values)
        positive_feature_values = tf.gather_nd(
            params=feature_values, indices=positive_indices
        )
        # Compute how many times each feature value occurred with a positive target label.
        positive_frequency = tf.math.unsorted_segment_sum(
            data=tf.ones(
                shape=(positive_feature_values.shape[0], 1), dtype=tf.dtypes.float64
            ),
            segment_ids=positive_feature_values,
            num_segments=self.vocabulary_size,
        )

        # Filter the data where the target label is negative.
        negative_indices = tf.where(condition=tf.math.logical_not(target_values))
        negative_feature_values = tf.gather_nd(
            params=feature_values, indices=negative_indices
        )
        # Compute how many times each feature value occurred with a negative target label.
        negative_frequency = tf.math.unsorted_segment_sum(
            data=tf.ones(
                shape=(negative_feature_values.shape[0], 1), dtype=tf.dtypes.float64
            ),
            segment_ids=negative_feature_values,
            num_segments=self.vocabulary_size,
        )
        # Compute the positive probability for the input feature values.
        positive_probability = positive_frequency / (
            positive_frequency + negative_frequency + self.correction
        )
        # Concatenate the computed statistics for target_encoding.
        target_encoding_statistics = tf.cast(
            tf.concat(
                [positive_frequency, negative_frequency, positive_probability], axis=1
            ),
            dtype=tf.dtypes.float32,
        )
        self.target_encoding_statistics = tf.constant(target_encoding_statistics)

    def call(self, inputs):
        # inputs is expected to be an integer numpy array or a Tensor of shape [num_examples, 1].
        # This includes the feature values for a given feature in the dataset.

        # Raise an error if the target encoding statistics are not computed.
        if self.target_encoding_statistics is None:
            raise ValueError(
                "You need to call the adapt method to compute target encoding statistics."
            )

        # Convert the inputs to a tensor.
        inputs = tf.convert_to_tensor(inputs)
        # Cast the inputs to int64.
        inputs = tf.cast(inputs, tf.dtypes.int64)
        # Look up the target encoding statistics for the input feature values.
        target_encoding_statistics = tf.cast(
            tf.gather_nd(self.target_encoding_statistics, inputs),
            dtype=tf.dtypes.float32,
        )
        return target_encoding_statistics


"""
Let's test the binary target encoder.
"""

data = tf.constant(
    [
        [0, 1],
        [2, 0],
        [0, 1],
        [1, 1],
        [1, 1],
        [2, 0],
        [1, 0],
        [0, 1],
        [2, 1],
        [1, 0],
        [0, 1],
        [2, 0],
        [0, 1],
        [1, 1],
        [1, 1],
        [2, 0],
        [1, 0],
        [0, 1],
        [2, 0],
    ]
)

binary_target_encoder = BinaryTargetEncoding()
binary_target_encoder.adapt(data)
print(binary_target_encoder([[0], [1], [2]]))

"""
### Create model inputs
"""


def create_model_inputs():
    inputs = {}

    for feature_name in NUMERIC_FEATURE_NAMES:
        inputs[feature_name] = layers.Input(
            name=feature_name, shape=(), dtype=tf.float32
        )

    for feature_name in CATEGORICAL_FEATURE_NAMES:
        inputs[feature_name] = layers.Input(
            name=feature_name, shape=(), dtype=tf.string
        )

    return inputs


"""
### Implement a feature encoding with target encoding
"""


def create_target_encoder():
    inputs = create_model_inputs()
    target_values = train_data[[TARGET_COLUMN_NAME]].to_numpy()
    encoded_features = []
    for feature_name in inputs:
        if feature_name in CATEGORICAL_FEATURE_NAMES:
            # Get the vocabulary of the categorical feature.
            vocabulary = sorted(
                [str(value) for value in list(train_data[feature_name].unique())]
            )
            # Create a lookup to convert string values to integer indices.
            # Since we are not using a mask token, nor expecting any out-of-vocabulary
            # (oov) token, we set mask_token to None and num_oov_indices to 0.
            lookup = layers.StringLookup(
                vocabulary=vocabulary, mask_token=None, num_oov_indices=0
            )
            # Convert the string input values into integer indices.
            value_indices = lookup(inputs[feature_name])
            # Prepare the data to adapt the target encoding.
            print("### Adapting target encoding for:", feature_name)
            feature_values = train_data[[feature_name]].to_numpy().astype(str)
            feature_value_indices = lookup(feature_values)
            data = tf.concat([feature_value_indices, target_values], axis=1)
            feature_encoder = BinaryTargetEncoding()
            feature_encoder.adapt(data)
            # Convert the feature value indices to target encoding representations.
            encoded_feature = feature_encoder(tf.expand_dims(value_indices, -1))
        else:
            # Expand the dimensions of the numerical input feature and use it as-is.
            encoded_feature = tf.expand_dims(inputs[feature_name], -1)
        # Add the encoded feature to the list.
        encoded_features.append(encoded_feature)
    # Concatenate all the encoded features.
    encoded_features = tf.concat(encoded_features, axis=1)
    # Create and return a Keras model with encoded features as outputs.
    return keras.Model(inputs=inputs, outputs=encoded_features)
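
"""
As a quick sanity check of the encoder (a hypothetical snippet, not part of the
original example), each categorical feature should contribute three statistics
to the output, and each numeric feature a single column:

```python
# Illustrative only: check the output width of the target encoder.
expected_width = 3 * len(CATEGORICAL_FEATURE_NAMES) + len(NUMERIC_FEATURE_NAMES)
target_encoder = create_target_encoder()
print(target_encoder.output_shape, "expected width:", expected_width)
```
"""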

"""
### Create a Gradient Boosted Trees model with a preprocessor

In this scenario, we use the target encoding as a preprocessor for the Gradient Boosted Tree model,
and let the model infer the semantics of the input features.
"""


def create_gbt_with_preprocessor(preprocessor):
    gbt_model = tfdf.keras.GradientBoostedTreesModel(
        preprocessing=preprocessor,
        num_trees=NUM_TREES,
        max_depth=MAX_DEPTH,
        min_examples=MIN_EXAMPLES,
        subsample=SUBSAMPLE,
        validation_ratio=VALIDATION_RATIO,
        task=tfdf.keras.Task.CLASSIFICATION,
    )

    gbt_model.compile(metrics=[keras.metrics.BinaryAccuracy(name="accuracy")])

    return gbt_model


"""
### Train and evaluate the model
"""

gbt_model = create_gbt_with_preprocessor(create_target_encoder())
run_experiment(gbt_model, train_data, test_data)

"""
## Experiment 3: Decision Forests with trained embeddings

In this scenario, we build an encoder model that encodes the categorical
features as embeddings, where the size of the embedding for a given categorical
feature is the square root of the size of its vocabulary, as illustrated below.

We train these embeddings in a simple NN model through backpropagation.
After the embedding encoder is trained, we use it as a preprocessor to the
input features of a Gradient Boosted Tree model.

Note that the embeddings and a decision forest model cannot be trained
jointly in one phase, since decision forest models do not train with backpropagation.
Rather, the embeddings have to be trained in an initial phase,
and then used as static inputs to the decision forest model.
"""
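
"""
For example (with a made-up vocabulary size), a categorical feature with 51
distinct values would get an embedding of size 7:

```python
# Illustrative only: the embedding size rule used by the encoder below.
vocabulary_size = 51  # hypothetical vocabulary size
embedding_size = int(math.sqrt(vocabulary_size))  # int(7.14...) -> 7
print(embedding_size)
```
"""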

"""
### Implement feature encoding with embeddings
"""


def create_embedding_encoder(size=None):
    inputs = create_model_inputs()
    encoded_features = []
    for feature_name in inputs:
        if feature_name in CATEGORICAL_FEATURE_NAMES:
            # Get the vocabulary of the categorical feature.
            vocabulary = sorted(
                [str(value) for value in list(train_data[feature_name].unique())]
            )
            # Create a lookup to convert string values to integer indices.
            # Since we are not using a mask token, nor expecting any out-of-vocabulary
            # (oov) token, we set mask_token to None and num_oov_indices to 0.
            lookup = layers.StringLookup(
                vocabulary=vocabulary, mask_token=None, num_oov_indices=0
            )
            # Convert the string input values into integer indices.
            value_index = lookup(inputs[feature_name])
            # Create an embedding layer with the specified dimensions.
            vocabulary_size = len(vocabulary)
            embedding_size = int(math.sqrt(vocabulary_size))
            feature_encoder = layers.Embedding(
                input_dim=vocabulary_size, output_dim=embedding_size
            )
            # Convert the index values to embedding representations.
            encoded_feature = feature_encoder(value_index)
        else:
            # Expand the dimensions of the numerical input feature and use it as-is.
            encoded_feature = tf.expand_dims(inputs[feature_name], -1)
        # Add the encoded feature to the list.
        encoded_features.append(encoded_feature)
    # Concatenate all the encoded features.
    encoded_features = layers.concatenate(encoded_features, axis=1)
    # Apply dropout.
    encoded_features = layers.Dropout(rate=0.25)(encoded_features)
    # Apply a non-linear projection.
    encoded_features = layers.Dense(
        units=size if size else encoded_features.shape[-1], activation="gelu"
    )(encoded_features)
    # Create and return a Keras model with encoded features as outputs.
    return keras.Model(inputs=inputs, outputs=encoded_features)


"""
### Build an NN model to train the embeddings
"""


def create_nn_model(encoder):
    inputs = create_model_inputs()
    embeddings = encoder(inputs)
    output = layers.Dense(units=1, activation="sigmoid")(embeddings)

    nn_model = keras.Model(inputs=inputs, outputs=output)
    nn_model.compile(
        optimizer=keras.optimizers.Adam(),
        loss=keras.losses.BinaryCrossentropy(),
        metrics=[keras.metrics.BinaryAccuracy("accuracy")],
    )
    return nn_model


embedding_encoder = create_embedding_encoder(size=64)
run_experiment(
    create_nn_model(embedding_encoder),
    train_data,
    test_data,
    num_epochs=5,
    batch_size=256,
)

"""
### Train and evaluate a Gradient Boosted Tree model with embeddings
"""

gbt_model = create_gbt_with_preprocessor(embedding_encoder)
run_experiment(gbt_model, train_data, test_data)

"""
## Concluding remarks

TensorFlow Decision Forests provides powerful models, especially with structured data.
In our experiments, the Gradient Boosted Tree model achieved 95.79% test accuracy.
When using the target encoding with categorical features, the same model achieved 95.81% test accuracy.
When pretraining embeddings to be used as inputs to the Gradient Boosted Tree model,
we achieved 95.82% test accuracy.

Decision Forests can be used with Neural Networks, either by
1) using Neural Networks to learn useful representations of the input data,
and then using Decision Forests for the supervised learning task, or by
2) creating an ensemble of both Decision Forests and Neural Network models.

Note that TensorFlow Decision Forests does not (yet) support hardware accelerators.
All training and inference is done on the CPU.
Furthermore, Decision Forests require a finite dataset that fits in memory
for their training procedures. However, there are diminishing returns
for increasing the size of the dataset, and Decision Forests algorithms
arguably need fewer examples for convergence than large Neural Network models.
"""