"""
Title: Text classification from scratch
Authors: Mark Omernick, Francois Chollet
Date created: 2019/11/06
Last modified: 2020/05/17
Description: Text sentiment classification starting from raw text files.
Accelerator: GPU
"""

"""
## Introduction

This example shows how to do text classification starting from raw text (as
a set of text files on disk). We demonstrate the workflow on the IMDB sentiment
classification dataset (unprocessed version). We use the `TextVectorization` layer for
word splitting & indexing.
"""

"""
## Setup
"""

import os

os.environ["KERAS_BACKEND"] = "tensorflow"

import keras
import tensorflow as tf
import numpy as np
from keras import layers

"""
## Load the data: IMDB movie review sentiment classification

Let's download the data and inspect its structure.
"""

"""shell
curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar -xf aclImdb_v1.tar.gz
"""

"""
The `aclImdb` folder contains `train` and `test` subfolders:
"""

"""shell
ls aclImdb
"""

"""shell
ls aclImdb/test
"""

"""shell
ls aclImdb/train
"""

"""
The `aclImdb/train/pos` and `aclImdb/train/neg` folders contain text files, each of
which represents one review (either positive or negative):
"""

"""shell
cat aclImdb/train/pos/6248_7.txt
"""

"""
We are only interested in the `pos` and `neg` subfolders, so let's delete the
`unsup` subfolder, which also contains text files:
"""

"""shell
rm -r aclImdb/train/unsup
"""

"""
You can use the utility `keras.utils.text_dataset_from_directory` to
generate a labeled `tf.data.Dataset` object from a set of text files on disk filed
into class-specific folders.

Let's use it to generate the training, validation, and test datasets. The validation
and training datasets are generated from two subsets of the `train` directory, with 20%
of samples going to the validation dataset and 80% going to the training dataset.

Having a validation dataset in addition to the test dataset is useful for tuning
hyperparameters, such as the model architecture, for which the test dataset should not
be used.

Before putting the model out into the real world, however, it should be retrained using all
available training data (without creating a validation dataset), so that its performance is maximized.

When using the `validation_split` & `subset` arguments, make sure to either specify a
random seed, or to pass `shuffle=False`, so that the validation & training splits you
get have no overlap.
"""

batch_size = 32
raw_train_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train",
    batch_size=batch_size,
    validation_split=0.2,
    subset="training",
    seed=1337,
)
raw_val_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train",
    batch_size=batch_size,
    validation_split=0.2,
    subset="validation",
    seed=1337,
)
raw_test_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/test", batch_size=batch_size
)

print(f"Number of batches in raw_train_ds: {raw_train_ds.cardinality()}")
print(f"Number of batches in raw_val_ds: {raw_val_ds.cardinality()}")
print(f"Number of batches in raw_test_ds: {raw_test_ds.cardinality()}")

"""
Let's preview a few samples:
"""

# It's important to take a look at your raw data to ensure your normalization
# and tokenization will work as expected. We can do that by taking a few
# examples from the training set and looking at them.
# This is one of the places where eager execution shines:
# we can just evaluate these tensors using .numpy()
# instead of needing to evaluate them in a Session/Graph context.
for text_batch, label_batch in raw_train_ds.take(1):
    for i in range(5):
        print(text_batch.numpy()[i])
        print(label_batch.numpy()[i])

"""
## Prepare the data

In particular, we remove `<br />` tags.
"""

import string
import re


# Having looked at our data above, we see that the raw text contains HTML break
# tags of the form '<br />'. These tags will not be removed by the default
# standardizer (which doesn't strip HTML). Because of this, we will need to
# create a custom standardization function.
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    return tf.strings.regex_replace(
        stripped_html, f"[{re.escape(string.punctuation)}]", ""
    )
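
# As a quick sanity check, we can run the standardizer on a made-up review
# fragment (a hypothetical sample, just for illustration). It should come back
# lowercased, with the `<br />` tag replaced by a space and punctuation removed.
print(custom_standardization(tf.constant(["What a GREAT movie!<br />Loved it."])))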

# Model constants.
max_features = 20000
embedding_dim = 128
sequence_length = 500

# Now that we have our custom standardization, we can instantiate our text
# vectorization layer. We are using this layer to normalize, split, and map
# strings to integers, so we set our 'output_mode' to 'int'.
# Note that we're using the default split function,
# and the custom standardization defined above.
# We also set an explicit maximum sequence length, since the CNNs later in our
# model won't support ragged sequences.
vectorize_layer = keras.layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    output_mode="int",
    output_sequence_length=sequence_length,
)

# Now that the vectorize_layer has been created, call `adapt` on a text-only
# dataset to create the vocabulary. You don't have to batch, but for very large
# datasets batching means you're not keeping spare copies of the dataset in memory.

# Let's make a text-only dataset (no labels):
text_ds = raw_train_ds.map(lambda x, y: x)
# Let's call `adapt`:
vectorize_layer.adapt(text_ds)
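
# To get a feel for what `adapt` learned, we can peek at the first few entries
# of the vocabulary (an optional check; with default settings, index 0 is the
# padding token and index 1 is the out-of-vocabulary token):
print(vectorize_layer.get_vocabulary()[:10])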

"""
## Two options to vectorize the data

There are 2 ways we can use our text vectorization layer:

**Option 1: Make it part of the model**, so as to obtain a model that processes raw
strings, like this:
"""

"""

```python
text_input = keras.Input(shape=(1,), dtype=tf.string, name='text')
x = vectorize_layer(text_input)
x = layers.Embedding(max_features + 1, embedding_dim)(x)
...
```

**Option 2: Apply it to the text dataset** to obtain a dataset of word indices, then
feed it into a model that expects integer sequences as inputs.

An important difference between the two is that option 2 enables you to do
**asynchronous CPU processing and buffering** of your data when training on GPU.
So if you're training the model on GPU, you probably want to go with this option to get
the best performance. This is what we will do below.

If we were to export our model to production, we'd ship a model that accepts raw
strings as input, like in the code snippet for option 1 above. This can be done after
training. We do this in the last section.
"""


def vectorize_text(text, label):
    text = tf.expand_dims(text, -1)
    return vectorize_layer(text), label


# Vectorize the data.
train_ds = raw_train_ds.map(vectorize_text)
val_ds = raw_val_ds.map(vectorize_text)
test_ds = raw_test_ds.map(vectorize_text)

# Do async prefetching / buffering of the data for best performance on GPU.
train_ds = train_ds.cache().prefetch(buffer_size=10)
val_ds = val_ds.cache().prefetch(buffer_size=10)
test_ds = test_ds.cache().prefetch(buffer_size=10)
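
# As an optional inspection, we can check the shape and dtype of one vectorized
# batch: each batch should now be a dense int64 tensor of shape
# (batch_size, sequence_length).
for x_batch, y_batch in train_ds.take(1):
    print(x_batch.shape, x_batch.dtype)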

"""
## Build a model

We choose a simple 1D convnet starting with an `Embedding` layer.
"""

# An integer input for vocab indices.
inputs = keras.Input(shape=(None,), dtype="int64")

# Next, we add a layer to map those vocab indices into a space of dimensionality
# 'embedding_dim'.
x = layers.Embedding(max_features, embedding_dim)(inputs)
x = layers.Dropout(0.5)(x)

# Conv1D + global max pooling
x = layers.Conv1D(128, 7, padding="valid", activation="relu", strides=3)(x)
x = layers.Conv1D(128, 7, padding="valid", activation="relu", strides=3)(x)
x = layers.GlobalMaxPooling1D()(x)

# We add a vanilla hidden layer:
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)

# We project onto a single unit output layer, and squash it with a sigmoid:
predictions = layers.Dense(1, activation="sigmoid", name="predictions")(x)

model = keras.Model(inputs, predictions)

# Compile the model with binary crossentropy loss and an adam optimizer.
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
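
# Printing a model summary (optional) is a quick way to confirm the architecture
# and parameter count before training:
model.summary()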

"""
## Train the model
"""

epochs = 3

# Fit the model using the train and validation datasets.
model.fit(train_ds, validation_data=val_ds, epochs=epochs)

"""
## Evaluate the model on the test set
"""

model.evaluate(test_ds)

"""
## Make an end-to-end model

If you want to obtain a model capable of processing raw strings, you can simply
create a new model (using the weights we just trained):
"""

# A string input
inputs = keras.Input(shape=(1,), dtype="string")
# Turn strings into vocab indices
indices = vectorize_layer(inputs)
# Turn vocab indices into predictions
outputs = model(indices)

# Our end-to-end model
end_to_end_model = keras.Model(inputs, outputs)
end_to_end_model.compile(
    loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"]
)

# Test it with `raw_test_ds`, which yields raw strings
end_to_end_model.evaluate(raw_test_ds)
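
# Because the end-to-end model takes raw strings, we can also score new reviews
# directly. The review below is a made-up example, just for illustration; the
# output is the predicted probability of the positive class.
sample_review = np.array([["This movie was absolutely wonderful!"]])
print(end_to_end_model.predict(sample_review))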