Path: blob/main/C3/W4/ungraded_labs/C3_W4_Lab_1.ipynb
Ungraded Lab: Generating Text with Neural Networks
For this week, you will look at techniques to prepare data and build models for text generation. You will train a neural network with lyrics from an Irish song, then let it make a new song for you. Though this might sound like a more complex application, you'll soon see that the process is very similar to the one you've been using in the previous weeks; only minor modifications are needed. Let's see what these are in the next sections.
Imports
First, you will import the required libraries. You've used all of these already in the previous labs.
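A minimal import cell might look like the sketch below; the actual notebook may include a few more utilities.

```python
# Core libraries used throughout this lab
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
```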
Building the Word Vocabulary
The dataset is the lyrics of *Lanigan's Ball*, a traditional Irish song. You will split it per line, then use a `TextVectorization` layer to build the vocabulary.
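A sketch of how this might look, assuming the lyrics are loaded into a string named `data` (an illustrative name, not necessarily the one in the notebook):

```python
# Split the song into lines and lowercase them
corpus = data.lower().split("\n")

# Build the vocabulary from the corpus
vectorize_layer = tf.keras.layers.TextVectorization()
vectorize_layer.adapt(corpus)

# Index 0 is reserved for padding and index 1 for out-of-vocabulary words
vocabulary = vectorize_layer.get_vocabulary()
vocab_size = len(vocabulary)
print(f'{vocab_size} tokens in the vocabulary')
```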
You can view the results with the cell below. The resulting vocabulary has 262 words (not including the special tokens for padding and out-of-vocabulary words). You will use these variables later.
Preprocessing the Dataset
As discussed in the lectures, you will take each line of the song and create inputs and labels from it. For example, if you only have one sentence, "I am using Tensorflow", you want the model to learn the next word given any subphrase of this sentence:

- input: `I` → label: `am`
- input: `I am` → label: `using`
- input: `I am using` → label: `Tensorflow`
The next cell shows how to implement this concept in code. The result would be inputs as padded sequences, and labels as one-hot encoded arrays.
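A sketch of that preprocessing, reusing the `corpus` and `vectorize_layer` from above; variable names like `xs` and `ys` follow the ones referenced later in this lab:

```python
# Generate all n-gram subphrases for every line
input_sequences = []
for line in corpus:
    sequence = vectorize_layer(line).numpy()
    for i in range(2, len(sequence) + 1):
        input_sequences.append(sequence[:i])

# Pre-pad so every sequence has the same length
max_sequence_len = max(len(seq) for seq in input_sequences)
padded = tf.keras.utils.pad_sequences(
    input_sequences, maxlen=max_sequence_len, padding='pre')

# Inputs are all tokens but the last; the label is the last token
xs, labels = padded[:, :-1], padded[:, -1]

# One-hot encode the labels across the vocabulary
ys = tf.keras.utils.to_categorical(labels, num_classes=vocab_size)
```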
Check the result for the first line of the song. The particular line and the expected token sequence are shown in the cell below:
Since there are 8 tokens here, you can expect to find this particular line in the first 7 elements of the `xs` that you generated earlier. The longest subphrase generated from it should be found in `xs[6]`. See the padded token sequence below:
If you print out the label, it should show `174` because that is the token of the next word in the phrase (i.e. `lanigan`). See the one-hot encoded form below. You can use the `np.argmax()` method to get the index of the 'hot' label.
If you pick the element before that, you will see the same subphrase as above minus one word:
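One quick way to run these checks, assuming the `xs` and `ys` arrays built above:

```python
# Longest subphrase of the first line, pre-padded with zeros
print(xs[6])

# Index of the 'hot' label -- the token for the next word
print(np.argmax(ys[6]))

# Same subphrase minus the last word
print(xs[5])
```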
Build the Model
Next, you will build the model with basically the same layers as before. The main difference is that you will remove the sigmoid output and use a softmax-activated `Dense` layer instead. This output layer will have one neuron for each word in the vocabulary, so given an input token list, the output array of the final layer will contain the probability for each word.
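A sketch of such a model; the layer sizes here (e.g. the embedding and LSTM dimensions) are illustrative placeholders, not the notebook's exact settings:

```python
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(max_sequence_len - 1,)),
    # Map each token to a dense vector
    tf.keras.layers.Embedding(vocab_size, 64),
    # Read the sequence in both directions
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(20)),
    # One neuron per word in the vocabulary, softmax for probabilities
    tf.keras.layers.Dense(vocab_size, activation='softmax'),
])

model.summary()
```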
Train the Model
You can now train the model. We have a relatively small vocabulary, so it will only take a couple of minutes to complete 500 epochs.
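Compiling and fitting might look like this; categorical crossentropy matches the one-hot labels built earlier:

```python
model.compile(
    loss='categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy'])

history = model.fit(xs, ys, epochs=500)
```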
You can visualize the results with the utility below. With the default settings, you should see around 95% accuracy after 500 epochs.
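One possible version of that utility, assuming `matplotlib` is available:

```python
def plot_graphs(history, metric):
    """Plot a single metric from the training history."""
    plt.plot(history.history[metric])
    plt.xlabel('Epochs')
    plt.ylabel(metric)
    plt.show()

plot_graphs(history, 'accuracy')
```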
Generating Text
With the model trained, you can now use it to make its own song! The process looks like this:

1. Feed a seed text to initiate the process.
2. The model predicts the index of the most probable next word.
3. Look up the index in the reverse word index dictionary.
4. Append the next word to the seed text.
5. Feed the result to the model again.
Steps 2 to 5 will repeat until the desired length of the song is reached. See how it is implemented in the code below:
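A sketch of that loop, reusing the trained `model`, the `vectorize_layer`, and the `vocabulary` list from earlier; the seed text and song length are arbitrary choices:

```python
seed_text = 'Laurence went to Dublin'  # arbitrary seed
next_words = 100                       # desired number of generated words

for _ in range(next_words):
    # Tokenize the current text and pad it like the training inputs
    sequence = vectorize_layer(seed_text).numpy()
    padded = tf.keras.utils.pad_sequences(
        [sequence], maxlen=max_sequence_len - 1, padding='pre')

    # Predict the probabilities and take the most likely index
    probabilities = model.predict(padded, verbose=0)
    predicted = np.argmax(probabilities, axis=-1)[0]

    # Look up the word and append it to the seed text
    seed_text += ' ' + vocabulary[predicted]

print(seed_text)
```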
In the output above, you might notice frequent repetition of words the longer the sentence gets. There are ways to get around this, and the next cell shows one: instead of getting the index with the maximum probability, you will get the top three indices and choose one at random. See if the output text makes more sense with this approach. This is not the most time-efficient solution because it always sorts the entire array even though you only need the top three. Feel free to improve it and, of course, you can also develop your own method of picking the next word.
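One way to implement that idea; `np.argsort` still sorts the full array as noted, and a comment marks where `np.argpartition` could speed it up:

```python
seed_text = 'Laurence went to Dublin'
next_words = 100

for _ in range(next_words):
    sequence = vectorize_layer(seed_text).numpy()
    padded = tf.keras.utils.pad_sequences(
        [sequence], maxlen=max_sequence_len - 1, padding='pre')

    probabilities = model.predict(padded, verbose=0)

    # Indices of the three highest probabilities
    # (np.argpartition would avoid sorting the entire array)
    top_three = np.argsort(probabilities[0])[-3:]

    # Pick one of the three at random
    predicted = np.random.choice(top_three)

    seed_text += ' ' + vocabulary[predicted]

print(seed_text)
```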
Wrap Up
In this lab, you got a first look at preparing data and building a model for text generation. The corpus in this particular exercise is fairly small; in the next lessons, you will be building one from a larger body of text. See you there!
Run the cell below to free up resources for the next lab.