"""1Title: Using pre-trained word embeddings2Author: [fchollet](https://twitter.com/fchollet)3Date created: 2020/05/054Last modified: 2020/05/055Description: Text classification on the Newsgroup20 dataset using pre-trained GloVe word embeddings.6Accelerator: GPU7"""89"""10## Setup11"""1213import os1415# Only the TensorFlow backend supports string inputs.16os.environ["KERAS_BACKEND"] = "tensorflow"1718import pathlib19import numpy as np20import tensorflow.data as tf_data21import keras22from keras import layers2324"""25## Introduction2627In this example, we show how to train a text classification model that uses pre-trained28word embeddings.2930We'll work with the Newsgroup20 dataset, a set of 20,000 message board messages31belonging to 20 different topic categories.3233For the pre-trained word embeddings, we'll use34[GloVe embeddings](http://nlp.stanford.edu/projects/glove/).35"""3637"""38## Download the Newsgroup20 data39"""4041data_path = keras.utils.get_file(42"news20.tar.gz",43"http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz",44untar=True,45)4647"""48## Let's take a look at the data49"""5051data_dir = pathlib.Path(data_path).parent / "20_newsgroup"52dirnames = os.listdir(data_dir)53print("Number of directories:", len(dirnames))54print("Directory names:", dirnames)5556fnames = os.listdir(data_dir / "comp.graphics")57print("Number of files in comp.graphics:", len(fnames))58print("Some example filenames:", fnames[:5])5960"""61Here's a example of what one file contains:62"""6364print(open(data_dir / "comp.graphics" / "38987").read())6566"""67As you can see, there are header lines that are leaking the file's category, either68explicitly (the first line is literally the category name), or implicitly, e.g. via the69`Organization` filed. 
"""
Here's an example of what one file contains:
"""

print(open(data_dir / "comp.graphics" / "38987").read())

"""
As you can see, there are header lines that are leaking the file's category, either
explicitly (the first line is literally the category name), or implicitly, e.g. via the
`Organization` field. Let's get rid of the headers:
"""

samples = []
labels = []
class_names = []
class_index = 0
for dirname in sorted(os.listdir(data_dir)):
    class_names.append(dirname)
    dirpath = data_dir / dirname
    fnames = os.listdir(dirpath)
    print("Processing %s, %d files found" % (dirname, len(fnames)))
    for fname in fnames:
        fpath = dirpath / fname
        f = open(fpath, encoding="latin-1")
        content = f.read()
        lines = content.split("\n")
        lines = lines[10:]
        content = "\n".join(lines)
        samples.append(content)
        labels.append(class_index)
    class_index += 1

print("Classes:", class_names)
print("Number of samples:", len(samples))

"""
There's actually one category that doesn't have the expected number of files, but the
difference is small enough that the problem remains a balanced classification problem.
"""

"""
## Shuffle and split the data into training & validation sets
"""

# Shuffle the data
seed = 1337
rng = np.random.RandomState(seed)
rng.shuffle(samples)
rng = np.random.RandomState(seed)
rng.shuffle(labels)

# Extract a training & validation split
validation_split = 0.2
num_validation_samples = int(validation_split * len(samples))
train_samples = samples[:-num_validation_samples]
val_samples = samples[-num_validation_samples:]
train_labels = labels[:-num_validation_samples]
val_labels = labels[-num_validation_samples:]

"""
## Create a vocabulary index

Let's use the `TextVectorization` layer to index the vocabulary found in the dataset.
Later, we'll use the same layer instance to vectorize the samples.

Our layer will only consider the top 20,000 words, and will truncate or pad sequences to
be exactly 200 tokens long.
"""

vectorizer = layers.TextVectorization(max_tokens=20000, output_sequence_length=200)
text_ds = tf_data.Dataset.from_tensor_slices(train_samples).batch(128)
vectorizer.adapt(text_ds)

"""
You can retrieve the computed vocabulary via `vectorizer.get_vocabulary()`. Let's
print the top 5 words:
"""

vectorizer.get_vocabulary()[:5]

"""
Let's vectorize a test sentence:
"""

output = vectorizer([["the cat sat on the mat"]])
output.numpy()[0, :6]

"""
As you can see, "the" gets represented as "2". Why not 0, given that "the" was the first
word in the vocabulary? That's because index 0 is reserved for padding and index 1 is
reserved for "out of vocabulary" tokens.

Here's a dict mapping words to their indices:
"""

voc = vectorizer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))

"""
As you can see, we obtain the same encoding as above for our test sentence:
"""

test = ["the", "cat", "sat", "on", "the", "mat"]
[word_index[w] for w in test]

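"""
Going the other way (an optional aside, not used later): since the vocabulary is just a
list of tokens, we can decode a sequence of indices back into words by indexing into it.
Index 0 decodes to the padding token (an empty string) and index 1 to the
"out of vocabulary" token:
"""

# Decode the test sentence we vectorized above back into tokens.
print([voc[int(i)] for i in output.numpy()[0, :6]])
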
"""
## Load pre-trained word embeddings
"""

"""
Let's download pre-trained GloVe embeddings (an 822MB zip file).

You'll need to run the following commands:
"""

"""shell
wget https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
unzip -q glove.6B.zip
"""

"""
The archive contains text-encoded vectors of various sizes: 50-dimensional,
100-dimensional, 200-dimensional, 300-dimensional. We'll use the 100D ones.

Let's make a dict mapping words (strings) to their NumPy vector representation:
"""

path_to_glove_file = "glove.6B.100d.txt"

embeddings_index = {}
with open(path_to_glove_file) as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))

"""
Now, let's prepare a corresponding embedding matrix that we can use in a Keras
`Embedding` layer. It's a simple NumPy matrix where the entry at index `i` is the
pre-trained vector for the word of index `i` in our `vectorizer`'s vocabulary.
"""

num_tokens = len(voc) + 2
embedding_dim = 100
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Words not found in the embedding index will be all-zeros.
        # This includes the representation for "padding" and "OOV".
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))

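"""
As an optional sanity check of the indexing logic (assuming the word "computer" appears
both in our top-20,000 vocabulary and in the GloVe file, which is likely but not
guaranteed), the matrix row for a known word should match its GloVe vector:
"""

check_word = "computer"
if check_word in word_index and check_word in embeddings_index:
    # The row assigned in the loop above should equal the GloVe vector for this word.
    row = embedding_matrix[word_index[check_word]]
    print(np.allclose(row, embeddings_index[check_word]))
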
"""
Next, we load the pre-trained word embeddings matrix into an `Embedding` layer.

Note that we set `trainable=False` so as to keep the embeddings fixed (we don't want to
update them during training).
"""

from keras.layers import Embedding

embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    trainable=False,
)
embedding_layer.build((1,))
embedding_layer.set_weights([embedding_matrix])

"""
## Build the model

A simple 1D convnet with global max pooling and a classifier at the end.
"""

int_sequences_input = keras.Input(shape=(None,), dtype="int32")
embedded_sequences = embedding_layer(int_sequences_input)
x = layers.Conv1D(128, 5, activation="relu")(embedded_sequences)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)
preds = layers.Dense(len(class_names), activation="softmax")(x)
model = keras.Model(int_sequences_input, preds)
model.summary()

"""
## Train the model

First, convert our list-of-strings data to NumPy arrays of integer indices. The arrays
are right-padded.
"""

x_train = vectorizer(np.array([[s] for s in train_samples])).numpy()
x_val = vectorizer(np.array([[s] for s in val_samples])).numpy()

y_train = np.array(train_labels)
y_val = np.array(val_labels)

"""
We use categorical crossentropy as our loss since we're doing softmax classification.
Specifically, we use `sparse_categorical_crossentropy` because our labels are integers.
"""

model.compile(
    loss="sparse_categorical_crossentropy", optimizer="rmsprop", metrics=["acc"]
)
model.fit(x_train, y_train, batch_size=128, epochs=20, validation_data=(x_val, y_val))

"""
## Export an end-to-end model

Now, we may want to export a `Model` object that takes as input a string of arbitrary
length, rather than a sequence of indices. That would make the model much more portable,
since you wouldn't have to worry about the input preprocessing pipeline.

Our `vectorizer` is actually a Keras layer, so it's simple:
"""

string_input = keras.Input(shape=(1,), dtype="string")
x = vectorizer(string_input)
preds = model(x)
end_to_end_model = keras.Model(string_input, preds)

probabilities = end_to_end_model(
    keras.ops.convert_to_tensor(
        [["this message is about computer graphics and 3D modeling"]]
    )
)

print(class_names[np.argmax(probabilities[0])])

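"""
Since the end-to-end model takes raw strings, we can also run it on a batch of messages
at once. This is an optional sketch: the example sentences below are made up, and the
exact predictions will depend on how training went.
"""

batch = keras.ops.convert_to_tensor(
    [
        ["this message is about computer graphics and 3D modeling"],
        ["the team needs a new goalie before the playoffs start"],
    ]
)
# Predict class probabilities for each message and print the top class name.
batch_probabilities = end_to_end_model.predict(batch)
for scores in batch_probabilities:
    print(class_names[np.argmax(scores)])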