"""
2
Title: Using pre-trained word embeddings
3
Author: [fchollet](https://twitter.com/fchollet)
4
Date created: 2020/05/05
5
Last modified: 2020/05/05
6
Description: Text classification on the Newsgroup20 dataset using pre-trained GloVe word embeddings.
7
Accelerator: GPU
8
"""

"""
## Setup
"""

import os

# Only the TensorFlow backend supports string inputs.
os.environ["KERAS_BACKEND"] = "tensorflow"

import pathlib
import numpy as np
import tensorflow.data as tf_data
import keras
from keras import layers

"""
26
## Introduction
27
28
In this example, we show how to train a text classification model that uses pre-trained
29
word embeddings.
30
31
We'll work with the Newsgroup20 dataset, a set of 20,000 message board messages
32
belonging to 20 different topic categories.
33
34
For the pre-trained word embeddings, we'll use
35
[GloVe embeddings](http://nlp.stanford.edu/projects/glove/).
36
"""
37
38
"""
39
## Download the Newsgroup20 data
40
"""
41
42
data_path = keras.utils.get_file(
    "news20.tar.gz",
    "http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz",
    untar=True,
)

"""
## Let's take a look at the data
"""

data_dir = pathlib.Path(data_path).parent / "20_newsgroup"
dirnames = os.listdir(data_dir)
print("Number of directories:", len(dirnames))
print("Directory names:", dirnames)

fnames = os.listdir(data_dir / "comp.graphics")
print("Number of files in comp.graphics:", len(fnames))
print("Some example filenames:", fnames[:5])

"""
Here's an example of what one file contains:
"""

print(open(data_dir / "comp.graphics" / "38987").read())

"""
68
As you can see, there are header lines that are leaking the file's category, either
69
explicitly (the first line is literally the category name), or implicitly, e.g. via the
70
`Organization` filed. Let's get rid of the headers:
71
"""
72
73
samples = []
labels = []
class_names = []
class_index = 0
for dirname in sorted(os.listdir(data_dir)):
    class_names.append(dirname)
    dirpath = data_dir / dirname
    fnames = os.listdir(dirpath)
    print("Processing %s, %d files found" % (dirname, len(fnames)))
    for fname in fnames:
        fpath = dirpath / fname
        f = open(fpath, encoding="latin-1")
        content = f.read()
        lines = content.split("\n")
        # Drop the first 10 lines, which contain the category-leaking headers.
        lines = lines[10:]
        content = "\n".join(lines)
        samples.append(content)
        labels.append(class_index)
    class_index += 1

print("Classes:", class_names)
print("Number of samples:", len(samples))

"""
97
There's actually one category that doesn't have the expected number of files, but the
98
difference is small enough that the problem remains a balanced classification problem.
99
"""
100
101
"""
102
## Shuffle and split the data into training & validation sets
103
"""
104
105
# Shuffle the data. Re-seeding with the same seed for both shuffles keeps
# samples and labels aligned.
seed = 1337
rng = np.random.RandomState(seed)
rng.shuffle(samples)
rng = np.random.RandomState(seed)
rng.shuffle(labels)

# Extract a training & validation split
validation_split = 0.2
num_validation_samples = int(validation_split * len(samples))
train_samples = samples[:-num_validation_samples]
val_samples = samples[-num_validation_samples:]
train_labels = labels[:-num_validation_samples]
val_labels = labels[-num_validation_samples:]

"""
121
## Create a vocabulary index
122
123
Let's use the `TextVectorization` to index the vocabulary found in the dataset.
124
Later, we'll use the same layer instance to vectorize the samples.
125
126
Our layer will only consider the top 20,000 words, and will truncate or pad sequences to
127
be actually 200 tokens long.
128
"""
129
130
vectorizer = layers.TextVectorization(max_tokens=20000, output_sequence_length=200)
text_ds = tf_data.Dataset.from_tensor_slices(train_samples).batch(128)
vectorizer.adapt(text_ds)

"""
You can retrieve the computed vocabulary via `vectorizer.get_vocabulary()`. Let's
print the top 5 words:
"""

vectorizer.get_vocabulary()[:5]

"""
Let's vectorize a test sentence:
"""

output = vectorizer([["the cat sat on the mat"]])
output.numpy()[0, :6]

"""
As you can see, "the" gets represented as "2". Why not 0, given that "the" was the first
word in the vocabulary? That's because index 0 is reserved for padding and index 1 is
reserved for "out of vocabulary" tokens.

Here's a dict mapping words to their indices:
"""

voc = vectorizer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))

"""
160
As you can see, we obtain the same encoding as above for our test sentence:
161
"""
162
163
test = ["the", "cat", "sat", "on", "the", "mat"]
164
[word_index[w] for w in test]
165
166
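"""
As a quick optional check (an illustrative addition, not part of the original example),
a made-up word should map to the "out of vocabulary" index 1, and the unused trailing
positions of the 200-token output should be padded with 0:
"""

oov_output = vectorizer([["the quxzle sat on the mat"]]).numpy()[0]
print(oov_output[:6])  # the made-up word "quxzle" should map to index 1
print(oov_output[-3:])  # trailing positions should be all zeros (padding)
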
"""
167
## Load pre-trained word embeddings
168
"""
169
170
"""
171
Let's download pre-trained GloVe embeddings (a 822M zip file).
172
173
You'll need to run the following commands:
174
"""
175
176
"""shell
177
wget https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
178
unzip -q glove.6B.zip
179
"""
180
"""
The archive contains text-encoded vectors of various sizes: 50-dimensional,
100-dimensional, 200-dimensional, and 300-dimensional. We'll use the 100D ones.

Let's make a dict mapping words (strings) to their NumPy vector representation:
"""

path_to_glove_file = "glove.6B.100d.txt"

embeddings_index = {}
with open(path_to_glove_file) as f:
    for line in f:
        # Each line is a word followed by its space-separated float coefficients.
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))

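"""
As a quick peek (an illustrative addition, not part of the original example), each entry
is a 100-dimensional NumPy vector of floats:
"""

print(embeddings_index["computer"].shape)  # expected: (100,)
print(embeddings_index["computer"][:5])
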
"""
200
Now, let's prepare a corresponding embedding matrix that we can use in a Keras
201
`Embedding` layer. It's a simple NumPy matrix where entry at index `i` is the pre-trained
202
vector for the word of index `i` in our `vectorizer`'s vocabulary.
203
"""
204
205
num_tokens = len(voc) + 2
embedding_dim = 100
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Words not found in embedding index will be all-zeros.
        # This includes the representation for "padding" and "OOV"
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))

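"""
As an optional sanity check (illustrative, not part of the original example), the matrix
row for a word that was found in GloVe should match its GloVe vector exactly:
"""

print(np.allclose(embedding_matrix[word_index["the"]], embeddings_index["the"]))  # expected: True
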
"""
225
Next, we load the pre-trained word embeddings matrix into an `Embedding` layer.
226
227
Note that we set `trainable=False` so as to keep the embeddings fixed (we don't want to
228
update them during training).
229
"""
230
from keras.layers import Embedding

embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    trainable=False,
)
# Build the layer, then load the pre-trained weights.
embedding_layer.build((1,))
embedding_layer.set_weights([embedding_matrix])

"""
242
## Build the model
243
244
A simple 1D convnet with global max pooling and a classifier at the end.
245
"""
246
247
int_sequences_input = keras.Input(shape=(None,), dtype="int32")
248
embedded_sequences = embedding_layer(int_sequences_input)
249
x = layers.Conv1D(128, 5, activation="relu")(embedded_sequences)
250
x = layers.MaxPooling1D(5)(x)
251
x = layers.Conv1D(128, 5, activation="relu")(x)
252
x = layers.MaxPooling1D(5)(x)
253
x = layers.Conv1D(128, 5, activation="relu")(x)
254
x = layers.GlobalMaxPooling1D()(x)
255
x = layers.Dense(128, activation="relu")(x)
256
x = layers.Dropout(0.5)(x)
257
preds = layers.Dense(len(class_names), activation="softmax")(x)
258
model = keras.Model(int_sequences_input, preds)
259
model.summary()
260
261
"""
262
## Train the model
263
264
First, convert our list-of-strings data to NumPy arrays of integer indices. The arrays
265
are right-padded.
266
"""
267
268
x_train = vectorizer(np.array([[s] for s in train_samples])).numpy()
269
x_val = vectorizer(np.array([[s] for s in val_samples])).numpy()
270
271
y_train = np.array(train_labels)
272
y_val = np.array(val_labels)
273
274
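"""
As an optional check (an illustrative addition, not part of the original example), each
row should now be a fixed-length sequence of 200 token indices:
"""

print(x_train.shape)  # expected: (number of training samples, 200)
print(x_val.shape)  # expected: (number of validation samples, 200)
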
"""
275
We use categorical crossentropy as our loss since we're doing softmax classification.
276
Moreover, we use `sparse_categorical_crossentropy` since our labels are integers.
277
"""
278
279
model.compile(
280
loss="sparse_categorical_crossentropy", optimizer="rmsprop", metrics=["acc"]
281
)
282
model.fit(x_train, y_train, batch_size=128, epochs=20, validation_data=(x_val, y_val))
283
284
"""
285
## Export an end-to-end model
286
287
Now, we may want to export a `Model` object that takes as input a string of arbitrary
288
length, rather than a sequence of indices. It would make the model much more portable,
289
since you wouldn't have to worry about the input preprocessing pipeline.
290
291
Our `vectorizer` is actually a Keras layer, so it's simple:
292
"""
293
294
string_input = keras.Input(shape=(1,), dtype="string")
295
x = vectorizer(string_input)
296
preds = model(x)
297
end_to_end_model = keras.Model(string_input, preds)
298
299
probabilities = end_to_end_model(
300
keras.ops.convert_to_tensor(
301
[["this message is about computer graphics and 3D modeling"]]
302
)
303
)
304
305
print(class_names[np.argmax(probabilities[0])])
306
307
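"""
To see how confident the model is (an optional, illustrative addition that is not part
of the original example), we can also look at the top 3 predicted categories for the
same input:
"""

probs = np.asarray(probabilities[0])
top_3 = np.argsort(probs)[::-1][:3]
for idx in top_3:
    print(class_names[idx], probs[idx])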