"""1Title: Text classification from scratch2Authors: Mark Omernick, Francois Chollet3Date created: 2019/11/064Last modified: 2020/05/175Description: Text sentiment classification starting from raw text files.6Accelerator: GPU7"""89"""10## Introduction1112This example shows how to do text classification starting from raw text (as13a set of text files on disk). We demonstrate the workflow on the IMDB sentiment14classification dataset (unprocessed version). We use the `TextVectorization` layer for15word splitting & indexing.16"""1718"""19## Setup20"""2122import os2324os.environ["KERAS_BACKEND"] = "tensorflow"2526import keras27import tensorflow as tf28import numpy as np29from keras import layers3031"""32## Load the data: IMDB movie review sentiment classification3334Let's download the data and inspect its structure.35"""3637"""shell38curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz39tar -xf aclImdb_v1.tar.gz40"""4142"""43The `aclImdb` folder contains a `train` and `test` subfolder:44"""4546"""shell47ls aclImdb48"""4950"""shell51ls aclImdb/test52"""5354"""shell55ls aclImdb/train56"""5758"""59The `aclImdb/train/pos` and `aclImdb/train/neg` folders contain text files, each of60which represents one review (either positive or negative):61"""6263"""shell64cat aclImdb/train/pos/6248_7.txt65"""6667"""68We are only interested in the `pos` and `neg` subfolders, so let's delete the other subfolder that has text files in it:69"""7071"""shell72rm -r aclImdb/train/unsup73"""7475"""76You can use the utility `keras.utils.text_dataset_from_directory` to77generate a labeled `tf.data.Dataset` object from a set of text files on disk filed78into class-specific folders.7980Let's use it to generate the training, validation, and test datasets. The validation81and training datasets are generated from two subsets of the `train` directory, with 20%82of samples going to the validation dataset and 80% going to the training dataset.8384Having a validation dataset in addition to the test dataset is useful for tuning85hyperparameters, such as the model architecture, for which the test dataset should not86be used.8788Before putting the model out into the real world however, it should be retrained using all89available training data (without creating a validation dataset), so its performance is maximized.9091When using the `validation_split` & `subset` arguments, make sure to either specify a92random seed, or to pass `shuffle=False`, so that the validation & training splits you93get have no overlap.9495"""9697batch_size = 3298raw_train_ds = keras.utils.text_dataset_from_directory(99"aclImdb/train",100batch_size=batch_size,101validation_split=0.2,102subset="training",103seed=1337,104)105raw_val_ds = keras.utils.text_dataset_from_directory(106"aclImdb/train",107batch_size=batch_size,108validation_split=0.2,109subset="validation",110seed=1337,111)112raw_test_ds = keras.utils.text_dataset_from_directory(113"aclImdb/test", batch_size=batch_size114)115116print(f"Number of batches in raw_train_ds: {raw_train_ds.cardinality()}")117print(f"Number of batches in raw_val_ds: {raw_val_ds.cardinality()}")118print(f"Number of batches in raw_test_ds: {raw_test_ds.cardinality()}")119120"""121Let's preview a few samples:122"""123124# It's important to take a look at your raw data to ensure your normalization125# and tokenization will work as expected. 
"""
Let's preview a few samples:
"""

# It's important to take a look at your raw data to ensure your normalization
# and tokenization will work as expected. We can do that by taking a few
# examples from the training set and looking at them.
# This is one of the places where eager execution shines:
# we can just evaluate these tensors using .numpy()
# instead of needing to evaluate them in a Session/Graph context.
for text_batch, label_batch in raw_train_ds.take(1):
    for i in range(5):
        print(text_batch.numpy()[i])
        print(label_batch.numpy()[i])

"""
## Prepare the data

In particular, we remove `<br />` tags.
"""

import string
import re


# Having looked at our data above, we see that the raw text contains HTML break
# tags of the form '<br />'. These tags will not be removed by the default
# standardizer (which doesn't strip HTML). Because of this, we will need to
# create a custom standardization function.
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    return tf.strings.regex_replace(
        stripped_html, f"[{re.escape(string.punctuation)}]", ""
    )


# Model constants.
max_features = 20000
embedding_dim = 128
sequence_length = 500

# Now that we have our custom standardization, we can instantiate our text
# vectorization layer. We are using this layer to normalize, split, and map
# strings to integers, so we set our 'output_mode' to 'int'.
# Note that we're using the default split function,
# and the custom standardization defined above.
# We also set an explicit maximum sequence length, since the CNNs later in our
# model won't support ragged sequences.
vectorize_layer = keras.layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    output_mode="int",
    output_sequence_length=sequence_length,
)

# Now that the vectorize_layer has been created, call `adapt` on a text-only
# dataset to create the vocabulary. You don't have to batch, but for very large
# datasets this means you're not keeping spare copies of the dataset in memory.

# Let's make a text-only dataset (no labels):
text_ds = raw_train_ds.map(lambda x, y: x)
# Let's call `adapt`:
vectorize_layer.adapt(text_ds)

"""
## Two options to vectorize the data

There are 2 ways we can use our text vectorization layer:

**Option 1: Make it part of the model**, so as to obtain a model that processes raw
strings, like this:
"""

"""

```python
text_input = keras.Input(shape=(1,), dtype=tf.string, name='text')
x = vectorize_layer(text_input)
x = layers.Embedding(max_features + 1, embedding_dim)(x)
...
```

**Option 2: Apply it to the text dataset** to obtain a dataset of word indices, then
feed it into a model that expects integer sequences as inputs.

An important difference between the two is that option 2 enables you to do
**asynchronous CPU processing and buffering** of your data when training on GPU.
So if you're training the model on GPU, you probably want to go with this option to get
the best performance. This is what we will do below.

If we were to export our model to production, we'd ship a model that accepts raw
strings as input, like in the code snippet for option 1 above. This can be done after
training. We do this in the last section.
"""
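"""
Before applying option 2, let's get a feel for what the adapted layer produces
by vectorizing a single example by hand (a minimal sketch; the exact token ids
depend on the adapted vocabulary, and positions beyond the text are padded with 0):
"""

# Vectorize one raw string and look at the first few token ids.
# The layer returns a dense int tensor of shape (batch, sequence_length).
sample = tf.constant([["this movie was great"]])
print(vectorize_layer(sample)[0, :8])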
def vectorize_text(text, label):
    text = tf.expand_dims(text, -1)
    return vectorize_layer(text), label


# Vectorize the data.
train_ds = raw_train_ds.map(vectorize_text)
val_ds = raw_val_ds.map(vectorize_text)
test_ds = raw_test_ds.map(vectorize_text)

# Do async prefetching / buffering of the data for best performance on GPU.
train_ds = train_ds.cache().prefetch(buffer_size=10)
val_ds = val_ds.cache().prefetch(buffer_size=10)
test_ds = test_ds.cache().prefetch(buffer_size=10)

"""
## Build a model

We choose a simple 1D convnet starting with an `Embedding` layer.
"""

# An integer input for vocab indices.
inputs = keras.Input(shape=(None,), dtype="int64")

# Next, we add a layer to map those vocab indices into a space of dimensionality
# 'embedding_dim'.
x = layers.Embedding(max_features, embedding_dim)(inputs)
x = layers.Dropout(0.5)(x)

# Conv1D + global max pooling
x = layers.Conv1D(128, 7, padding="valid", activation="relu", strides=3)(x)
x = layers.Conv1D(128, 7, padding="valid", activation="relu", strides=3)(x)
x = layers.GlobalMaxPooling1D()(x)

# We add a vanilla hidden layer:
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)

# We project onto a single unit output layer, and squash it with a sigmoid:
predictions = layers.Dense(1, activation="sigmoid", name="predictions")(x)

model = keras.Model(inputs, predictions)

# Compile the model with binary crossentropy loss and an adam optimizer.
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

"""
## Train the model
"""

epochs = 3

# Fit the model using the train and validation datasets.
model.fit(train_ds, validation_data=val_ds, epochs=epochs)

"""
## Evaluate the model on the test set
"""

model.evaluate(test_ds)

"""
## Make an end-to-end model

If you want to obtain a model capable of processing raw strings, you can simply
create a new model (using the weights we just trained):
"""

# A string input
inputs = keras.Input(shape=(1,), dtype="string")
# Turn strings into vocab indices
indices = vectorize_layer(inputs)
# Turn vocab indices into predictions
outputs = model(indices)

# Our end-to-end model
end_to_end_model = keras.Model(inputs, outputs)
end_to_end_model.compile(
    loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"]
)

# Test it with `raw_test_ds`, which yields raw strings
end_to_end_model.evaluate(raw_test_ds)
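"""
As a final check, the end-to-end model can also be used for inference on raw
strings directly (a minimal sketch; the exact scores depend on the trained
weights, and the example sentences are made up for illustration):
"""

# Each input string yields a probability: close to 1 means a positive review,
# close to 0 means a negative one. Inputs are shaped (batch, 1) to match the
# string input defined above.
examples = tf.constant(
    [
        ["The movie was a fantastic ride from start to finish!"],
        ["The plot was dull and the acting was even worse."],
    ]
)
print(end_to_end_model.predict(examples))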