"""1Title: Using pre-trained word embeddings2Author: [fchollet](https://twitter.com/fchollet)3Date created: 2020/05/054Last modified: 2020/05/055Description: Text classification on the Newsgroup20 dataset using pre-trained GloVe word embeddings.6Accelerator: GPU7"""89"""10## Setup11"""1213import os1415# Only the TensorFlow backend supports string inputs.16os.environ["KERAS_BACKEND"] = "tensorflow"1718import pathlib19import numpy as np20import tensorflow.data as tf_data21import keras22from keras import layers2324"""25## Introduction2627In this example, we show how to train a text classification model that uses pre-trained28word embeddings.2930We'll work with the Newsgroup20 dataset, a set of 20,000 message board messages31belonging to 20 different topic categories.3233For the pre-trained word embeddings, we'll use34[GloVe embeddings](http://nlp.stanford.edu/projects/glove/).35"""3637"""38## Download the Newsgroup20 data39"""4041data_path = keras.utils.get_file(42"news20.tar.gz",43"http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz",44untar=True,45)4647"""48## Let's take a look at the data49"""5051data_dir = pathlib.Path(data_path).parent / "20_newsgroup"52dirnames = os.listdir(data_dir)53print("Number of directories:", len(dirnames))54print("Directory names:", dirnames)5556fnames = os.listdir(data_dir / "comp.graphics")57print("Number of files in comp.graphics:", len(fnames))58print("Some example filenames:", fnames[:5])5960"""61Here's a example of what one file contains:62"""6364print(open(data_dir / "comp.graphics" / "38987").read())6566"""67As you can see, there are header lines that are leaking the file's category, either68explicitly (the first line is literally the category name), or implicitly, e.g. via the69`Organization` filed. 
"""
Here's an example of what one file contains:
"""

print(open(data_dir / "comp.graphics" / "38987").read())

"""
As you can see, there are header lines that are leaking the file's category, either
explicitly (the first line is literally the category name), or implicitly, e.g. via the
`Organization` field. Let's get rid of the headers:
"""

samples = []
labels = []
class_names = []
class_index = 0
for dirname in sorted(os.listdir(data_dir)):
    class_names.append(dirname)
    dirpath = data_dir / dirname
    fnames = os.listdir(dirpath)
    print("Processing %s, %d files found" % (dirname, len(fnames)))
    for fname in fnames:
        fpath = dirpath / fname
        f = open(fpath, encoding="latin-1")
        content = f.read()
        lines = content.split("\n")
        lines = lines[10:]
        content = "\n".join(lines)
        samples.append(content)
        labels.append(class_index)
    class_index += 1

print("Classes:", class_names)
print("Number of samples:", len(samples))

"""
There's actually one category that doesn't have the expected number of files, but the
difference is small enough that the problem remains a balanced classification problem.
"""

"""
## Shuffle and split the data into training & validation sets
"""

# Shuffle the data
seed = 1337
rng = np.random.RandomState(seed)
rng.shuffle(samples)
rng = np.random.RandomState(seed)
rng.shuffle(labels)

# Extract a training & validation split
validation_split = 0.2
num_validation_samples = int(validation_split * len(samples))
train_samples = samples[:-num_validation_samples]
val_samples = samples[-num_validation_samples:]
train_labels = labels[:-num_validation_samples]
val_labels = labels[-num_validation_samples:]

"""
## Create a vocabulary index

Let's use the `TextVectorization` layer to index the vocabulary found in the dataset.
Later, we'll use the same layer instance to vectorize the samples.

Our layer will only consider the top 20,000 words, and will truncate or pad sequences to
be exactly 200 tokens long.
"""

vectorizer = layers.TextVectorization(max_tokens=20000, output_sequence_length=200)
text_ds = tf_data.Dataset.from_tensor_slices(train_samples).batch(128)
vectorizer.adapt(text_ds)

"""
You can retrieve the computed vocabulary via `vectorizer.get_vocabulary()`. Let's
print the top 5 words:
"""

vectorizer.get_vocabulary()[:5]

"""
Let's vectorize a test sentence:
"""

output = vectorizer([["the cat sat on the mat"]])
output.numpy()[0, :6]

"""
As you can see, "the" gets represented as "2". Why not 0, given that "the" was the first
word in the vocabulary? That's because index 0 is reserved for padding and index 1 is
reserved for "out of vocabulary" tokens.

Here's a dict mapping words to their indices:
"""

voc = vectorizer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))

"""
As you can see, we obtain the same encoding as above for our test sentence:
"""

test = ["the", "cat", "sat", "on", "the", "mat"]
[word_index[w] for w in test]

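"""
Going the other way (an optional aside, not used later): since the vocabulary is just a
list of tokens, we can decode a sequence of indices back into words by indexing into it.
Index 0 decodes to the padding token (an empty string) and index 1 to the
"out of vocabulary" token:
"""

# Decode the test sentence we vectorized above back into tokens.
print([voc[int(i)] for i in output.numpy()[0, :6]])
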
"""
## Load pre-trained word embeddings
"""

"""
Let's download pre-trained GloVe embeddings (an 822MB zip file).

You'll need to run the following commands:
"""

"""shell
wget https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
unzip -q glove.6B.zip
"""

"""
The archive contains text-encoded vectors of various sizes: 50-dimensional,
100-dimensional, 200-dimensional, 300-dimensional. We'll use the 100D ones.

Let's make a dict mapping words (strings) to their NumPy vector representation:
"""

path_to_glove_file = "glove.6B.100d.txt"

embeddings_index = {}
with open(path_to_glove_file) as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))

"""
Now, let's prepare a corresponding embedding matrix that we can use in a Keras
`Embedding` layer. It's a simple NumPy matrix where the entry at index `i` is the
pre-trained vector for the word of index `i` in our `vectorizer`'s vocabulary.
"""

num_tokens = len(voc) + 2
embedding_dim = 100
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Words not found in the embedding index will be all-zeros.
        # This includes the representation for "padding" and "OOV".
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))

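"""
As an optional sanity check of the indexing logic (assuming the word "computer" appears
both in our top-20,000 vocabulary and in the GloVe file, which is likely but not
guaranteed), the matrix row for a known word should match its GloVe vector:
"""

check_word = "computer"
if check_word in word_index and check_word in embeddings_index:
    # The row assigned in the loop above should equal the GloVe vector for this word.
    row = embedding_matrix[word_index[check_word]]
    print(np.allclose(row, embeddings_index[check_word]))
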
"""
Next, we load the pre-trained word embeddings matrix into an `Embedding` layer.

Note that we set `trainable=False` so as to keep the embeddings fixed (we don't want to
update them during training).
"""

from keras.layers import Embedding

embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    trainable=False,
)
embedding_layer.build((1,))
embedding_layer.set_weights([embedding_matrix])

"""
## Build the model

A simple 1D convnet with global max pooling and a classifier at the end.
"""

int_sequences_input = keras.Input(shape=(None,), dtype="int32")
embedded_sequences = embedding_layer(int_sequences_input)
x = layers.Conv1D(128, 5, activation="relu")(embedded_sequences)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)
preds = layers.Dense(len(class_names), activation="softmax")(x)
model = keras.Model(int_sequences_input, preds)
model.summary()

"""
## Train the model

First, convert our list-of-strings data to NumPy arrays of integer indices. The arrays
are right-padded.
"""

x_train = vectorizer(np.array([[s] for s in train_samples])).numpy()
x_val = vectorizer(np.array([[s] for s in val_samples])).numpy()

y_train = np.array(train_labels)
y_val = np.array(val_labels)

"""
We use categorical crossentropy as our loss since we're doing softmax classification.
Specifically, we use `sparse_categorical_crossentropy` because our labels are integers.
"""

model.compile(
    loss="sparse_categorical_crossentropy", optimizer="rmsprop", metrics=["acc"]
)
model.fit(x_train, y_train, batch_size=128, epochs=20, validation_data=(x_val, y_val))

"""
## Export an end-to-end model

Now, we may want to export a `Model` object that takes as input a string of arbitrary
length, rather than a sequence of indices. That would make the model much more portable,
since you wouldn't have to worry about the input preprocessing pipeline.

Our `vectorizer` is actually a Keras layer, so it's simple:
"""

string_input = keras.Input(shape=(1,), dtype="string")
x = vectorizer(string_input)
preds = model(x)
end_to_end_model = keras.Model(string_input, preds)

probabilities = end_to_end_model(
    keras.ops.convert_to_tensor(
        [["this message is about computer graphics and 3D modeling"]]
    )
)

print(class_names[np.argmax(probabilities[0])])

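"""
Since the end-to-end model takes raw strings, we can also run it on a batch of messages
at once. This is an optional sketch: the example sentences below are made up, and the
exact predictions will depend on how training went.
"""

batch = keras.ops.convert_to_tensor(
    [
        ["this message is about computer graphics and 3D modeling"],
        ["the team needs a new goalie before the playoffs start"],
    ]
)
# Predict class probabilities for each message and print the top class name.
batch_probabilities = end_to_end_model.predict(batch)
for scores in batch_probabilities:
    print(class_names[np.argmax(scores)])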