"""1Title: Text classification from scratch2Authors: Mark Omernick, Francois Chollet3Date created: 2019/11/064Last modified: 2020/05/175Description: Text sentiment classification starting from raw text files.6Accelerator: GPU7"""89"""10## Introduction1112This example shows how to do text classification starting from raw text (as13a set of text files on disk). We demonstrate the workflow on the IMDB sentiment14classification dataset (unprocessed version). We use the `TextVectorization` layer for15word splitting & indexing.16"""1718"""19## Setup20"""2122import os2324os.environ["KERAS_BACKEND"] = "tensorflow"2526import keras27import tensorflow as tf28import numpy as np29from keras import layers3031"""32## Load the data: IMDB movie review sentiment classification3334Let's download the data and inspect its structure.35"""3637"""shell38curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz39tar -xf aclImdb_v1.tar.gz40"""4142"""43The `aclImdb` folder contains a `train` and `test` subfolder:44"""4546"""shell47ls aclImdb48"""4950"""shell51ls aclImdb/test52"""5354"""shell55ls aclImdb/train56"""5758"""59The `aclImdb/train/pos` and `aclImdb/train/neg` folders contain text files, each of60which represents one review (either positive or negative):61"""6263"""shell64cat aclImdb/train/pos/6248_7.txt65"""6667"""68We are only interested in the `pos` and `neg` subfolders, so let's delete the other subfolder that has text files in it:69"""7071"""shell72rm -r aclImdb/train/unsup73"""7475"""76You can use the utility `keras.utils.text_dataset_from_directory` to77generate a labeled `tf.data.Dataset` object from a set of text files on disk filed78into class-specific folders.7980Let's use it to generate the training, validation, and test datasets. The validation81and training datasets are generated from two subsets of the `train` directory, with 20%82of samples going to the validation dataset and 80% going to the training dataset.8384Having a validation dataset in addition to the test dataset is useful for tuning85hyperparameters, such as the model architecture, for which the test dataset should not86be used.8788Before putting the model out into the real world however, it should be retrained using all89available training data (without creating a validation dataset), so its performance is maximized.9091When using the `validation_split` & `subset` arguments, make sure to either specify a92random seed, or to pass `shuffle=False`, so that the validation & training splits you93get have no overlap.9495"""9697batch_size = 3298raw_train_ds = keras.utils.text_dataset_from_directory(99"aclImdb/train",100batch_size=batch_size,101validation_split=0.2,102subset="training",103seed=1337,104)105raw_val_ds = keras.utils.text_dataset_from_directory(106"aclImdb/train",107batch_size=batch_size,108validation_split=0.2,109subset="validation",110seed=1337,111)112raw_test_ds = keras.utils.text_dataset_from_directory(113"aclImdb/test", batch_size=batch_size114)115116print(f"Number of batches in raw_train_ds: {raw_train_ds.cardinality()}")117print(f"Number of batches in raw_val_ds: {raw_val_ds.cardinality()}")118print(f"Number of batches in raw_test_ds: {raw_test_ds.cardinality()}")119120"""121Let's preview a few samples:122"""123124# It's important to take a look at your raw data to ensure your normalization125# and tokenization will work as expected. 
"""
Let's preview a few samples:
"""

# It's important to take a look at your raw data to ensure your normalization
# and tokenization will work as expected. We can do that by taking a few
# examples from the training set and looking at them.
# This is one of the places where eager execution shines:
# we can just evaluate these tensors using .numpy()
# instead of needing to evaluate them in a Session/Graph context.
for text_batch, label_batch in raw_train_ds.take(1):
    for i in range(5):
        print(text_batch.numpy()[i])
        print(label_batch.numpy()[i])

"""
## Prepare the data

In particular, we remove `<br />` tags.
"""

import string
import re


# Having looked at our data above, we see that the raw text contains HTML break
# tags of the form '<br />'. These tags will not be removed by the default
# standardizer (which doesn't strip HTML). Because of this, we will need to
# create a custom standardization function.
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    return tf.strings.regex_replace(
        stripped_html, f"[{re.escape(string.punctuation)}]", ""
    )


# Model constants.
max_features = 20000
embedding_dim = 128
sequence_length = 500

# Now that we have our custom standardization, we can instantiate our text
# vectorization layer. We are using this layer to normalize, split, and map
# strings to integers, so we set our 'output_mode' to 'int'.
# Note that we're using the default split function,
# and the custom standardization defined above.
# We also set an explicit maximum sequence length, since the CNNs later in our
# model won't support ragged sequences.
vectorize_layer = keras.layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    output_mode="int",
    output_sequence_length=sequence_length,
)

# Now that the vectorize_layer has been created, call `adapt` on a text-only
# dataset to create the vocabulary. You don't have to batch, but for very large
# datasets this means you're not keeping spare copies of the dataset in memory.

# Let's make a text-only dataset (no labels):
text_ds = raw_train_ds.map(lambda x, y: x)
# Let's call `adapt`:
vectorize_layer.adapt(text_ds)

"""
## Two options to vectorize the data

There are 2 ways we can use our text vectorization layer:

**Option 1: Make it part of the model**, so as to obtain a model that processes raw
strings, like this:
"""

"""

```python
text_input = keras.Input(shape=(1,), dtype=tf.string, name='text')
x = vectorize_layer(text_input)
x = layers.Embedding(max_features + 1, embedding_dim)(x)
...
```

**Option 2: Apply it to the text dataset** to obtain a dataset of word indices, then
feed it into a model that expects integer sequences as inputs.

An important difference between the two is that option 2 enables you to do
**asynchronous CPU processing and buffering** of your data when training on GPU.
So if you're training the model on GPU, you probably want to go with this option to get
the best performance. This is what we will do below.

If we were to export our model to production, we'd ship a model that accepts raw
strings as input, like in the code snippet for option 1 above. This can be done after
training. We do this in the last section.
"""
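"""
Before applying option 2, let's get a feel for what the adapted layer produces
by vectorizing a single example by hand (a minimal sketch; the exact token ids
depend on the adapted vocabulary, and positions beyond the text are padded with 0):
"""

# Vectorize one raw string and look at the first few token ids.
# The layer returns a dense int tensor of shape (batch, sequence_length).
sample = tf.constant([["this movie was great"]])
print(vectorize_layer(sample)[0, :8])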
def vectorize_text(text, label):
    text = tf.expand_dims(text, -1)
    return vectorize_layer(text), label


# Vectorize the data.
train_ds = raw_train_ds.map(vectorize_text)
val_ds = raw_val_ds.map(vectorize_text)
test_ds = raw_test_ds.map(vectorize_text)

# Do async prefetching / buffering of the data for best performance on GPU.
train_ds = train_ds.cache().prefetch(buffer_size=10)
val_ds = val_ds.cache().prefetch(buffer_size=10)
test_ds = test_ds.cache().prefetch(buffer_size=10)

"""
## Build a model

We choose a simple 1D convnet starting with an `Embedding` layer.
"""

# An integer input for vocab indices.
inputs = keras.Input(shape=(None,), dtype="int64")

# Next, we add a layer to map those vocab indices into a space of dimensionality
# 'embedding_dim'.
x = layers.Embedding(max_features, embedding_dim)(inputs)
x = layers.Dropout(0.5)(x)

# Conv1D + global max pooling
x = layers.Conv1D(128, 7, padding="valid", activation="relu", strides=3)(x)
x = layers.Conv1D(128, 7, padding="valid", activation="relu", strides=3)(x)
x = layers.GlobalMaxPooling1D()(x)

# We add a vanilla hidden layer:
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)

# We project onto a single unit output layer, and squash it with a sigmoid:
predictions = layers.Dense(1, activation="sigmoid", name="predictions")(x)

model = keras.Model(inputs, predictions)

# Compile the model with binary crossentropy loss and an adam optimizer.
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

"""
## Train the model
"""

epochs = 3

# Fit the model using the train and validation datasets.
model.fit(train_ds, validation_data=val_ds, epochs=epochs)

"""
## Evaluate the model on the test set
"""

model.evaluate(test_ds)

"""
## Make an end-to-end model

If you want to obtain a model capable of processing raw strings, you can simply
create a new model (using the weights we just trained):
"""

# A string input
inputs = keras.Input(shape=(1,), dtype="string")
# Turn strings into vocab indices
indices = vectorize_layer(inputs)
# Turn vocab indices into predictions
outputs = model(indices)

# Our end-to-end model
end_to_end_model = keras.Model(inputs, outputs)
end_to_end_model.compile(
    loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"]
)

# Test it with `raw_test_ds`, which yields raw strings
end_to_end_model.evaluate(raw_test_ds)
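"""
As a final check, the end-to-end model can also be used for inference on raw
strings directly (a minimal sketch; the exact scores depend on the trained
weights, and the example sentences are made up for illustration):
"""

# Each input string yields a probability: close to 1 means a positive review,
# close to 0 means a negative one. Inputs are shaped (batch, 1) to match the
# string input defined above.
examples = tf.constant(
    [
        ["The movie was a fantastic ride from start to finish!"],
        ["The plot was dull and the acting was even worse."],
    ]
)
print(end_to_end_model.predict(examples))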