# -*- coding: utf-8 -*-
"""
NLP From Scratch: Generating Names with a Character-Level RNN
*************************************************************
**Author**: `Sean Robertson <https://github.com/spro>`_

This tutorial is part of a three-part series:

* `NLP From Scratch: Classifying Names with a Character-Level RNN <https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html>`__
* `NLP From Scratch: Generating Names with a Character-Level RNN <https://pytorch.org/tutorials/intermediate/char_rnn_generation_tutorial.html>`__
* `NLP From Scratch: Translation with a Sequence to Sequence Network and Attention <https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html>`__

This is our second of three tutorials on "NLP From Scratch".
In the `first tutorial </tutorials/intermediate/char_rnn_classification_tutorial>`_
we used an RNN to classify names into their language of origin. This time
we'll turn around and generate names from languages.

.. code-block:: sh

    > python sample.py Russian RUS
    Rovakov
    Uantov
    Shavakov

    > python sample.py German GER
    Gerren
    Ereng
    Rosher

    > python sample.py Spanish SPA
    Salla
    Parer
    Allan

    > python sample.py Chinese CHI
    Chan
    Hang
    Iun

We are still hand-crafting a small RNN with a few linear layers. The big
difference is that instead of predicting a category after reading in all
the letters of a name, we input a category and output one letter at a time.
Recurrently predicting characters to form language (this could also be
done with words or other higher order constructs) is often referred to
as a "language model".
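
Concretely, the per-step interface we will build looks roughly like this
(a sketch only; ``rnn`` is defined later in this tutorial, and
``category_tensor``/``letter_tensor`` stand for one-hot encodings of the
category and the current letter):

.. code-block:: python

    # One step of generation: condition on the category, the current letter,
    # and the hidden state, and get back scores over possible next letters.
    output, hidden = rnn(category_tensor, letter_tensor, hidden)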

**Recommended Reading:**

I assume you have at least installed PyTorch, know Python, and
understand Tensors:

-  https://pytorch.org/ For installation instructions
-  :doc:`/beginner/deep_learning_60min_blitz` to get started with PyTorch in general
-  :doc:`/beginner/pytorch_with_examples` for a wide and deep overview
-  :doc:`/beginner/former_torchies_tutorial` if you are a former Lua Torch user

It would also be useful to know about RNNs and how they work:

-  `The Unreasonable Effectiveness of Recurrent Neural
   Networks <https://karpathy.github.io/2015/05/21/rnn-effectiveness/>`__
   shows a bunch of real life examples
-  `Understanding LSTM
   Networks <https://colah.github.io/posts/2015-08-Understanding-LSTMs/>`__
   is about LSTMs specifically but also informative about RNNs in
   general

I also suggest the previous tutorial, :doc:`/intermediate/char_rnn_classification_tutorial`


Preparing the Data
==================

.. note::
   Download the data from
   `here <https://download.pytorch.org/tutorial/data.zip>`_
   and extract it to the current directory.

See the last tutorial for more detail of this process. In short, there
are a bunch of plain text files ``data/names/[Language].txt`` with a
name per line. We split lines into an array, convert Unicode to ASCII,
and end up with a dictionary ``{language: [names ...]}``.

"""
from io import open
import glob
import os
import unicodedata
import string

all_letters = string.ascii_letters + " .,;'-"
n_letters = len(all_letters) + 1 # Plus EOS marker

def findFiles(path): return glob.glob(path)

# Turn a Unicode string to plain ASCII, thanks to https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )

# Read a file and split into lines
def readLines(filename):
    with open(filename, encoding='utf-8') as some_file:
        return [unicodeToAscii(line.strip()) for line in some_file]

# Build the category_lines dictionary, a list of lines per category
category_lines = {}
all_categories = []
for filename in findFiles('data/names/*.txt'):
    category = os.path.splitext(os.path.basename(filename))[0]
    all_categories.append(category)
    lines = readLines(filename)
    category_lines[category] = lines

n_categories = len(all_categories)

if n_categories == 0:
    raise RuntimeError('Data not found. Make sure that you downloaded data '
        'from https://download.pytorch.org/tutorial/data.zip and extract it to '
        'the current directory.')

print('# categories:', n_categories, all_categories)
print(unicodeToAscii("O'Néàl"))

######################################################################
# Creating the Network
# ====================
#
# This network extends `the last tutorial's RNN <#Creating-the-Network>`__
# with an extra argument for the category tensor, which is concatenated
# along with the others. The category tensor is a one-hot vector just like
# the letter input.
#
# We will interpret the output as the probability of the next letter. When
# sampling, the most likely output letter is used as the next input
# letter.
#
# I added a second linear layer ``o2o`` (after combining hidden and
# output) to give it more muscle to work with. There's also a dropout
# layer, which `randomly zeros parts of its
# input <https://arxiv.org/abs/1207.0580>`__ with a given probability
# (here 0.1) and is usually used to fuzz inputs to prevent overfitting.
# Here we're using it towards the end of the network to purposely add some
# chaos and increase sampling variety.
#
# .. figure:: https://i.imgur.com/jzVrf7f.png
#    :alt:
#
#

import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size

        self.i2h = nn.Linear(n_categories + input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(n_categories + input_size + hidden_size, output_size)
        self.o2o = nn.Linear(hidden_size + output_size, output_size)
        self.dropout = nn.Dropout(0.1)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, category, input, hidden):
        input_combined = torch.cat((category, input, hidden), 1)
        hidden = self.i2h(input_combined)
        output = self.i2o(input_combined)
        output_combined = torch.cat((hidden, output), 1)
        output = self.o2o(output_combined)
        output = self.dropout(output)
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)


######################################################################
# Training
# =========
# Preparing for Training
# ----------------------
#
# First of all, helper functions to get random pairs of (category, line):
#

import random

# Random item from a list
def randomChoice(l):
    return l[random.randint(0, len(l) - 1)]

# Get a random category and random line from that category
def randomTrainingPair():
    category = randomChoice(all_categories)
    line = randomChoice(category_lines[category])
    return category, line


######################################################################
# For each timestep (that is, for each letter in a training word) the
# inputs of the network will be
# ``(category, current letter, hidden state)`` and the outputs will be
# ``(next letter, next hidden state)``. So for each training example, we'll
# need the category, a set of input letters, and a set of output/target
# letters.
#
# Since we are predicting the next letter from the current letter for each
# timestep, the letter pairs are groups of consecutive letters from the
# line - e.g. for ``"ABCD<EOS>"`` we would create ("A", "B"), ("B", "C"),
# ("C", "D"), ("D", "EOS").
#
# .. figure:: https://i.imgur.com/JH58tXY.png
#    :alt:
#
# The category tensor is a `one-hot
# tensor <https://en.wikipedia.org/wiki/One-hot>`__ of size
# ``<1 x n_categories>``. When training we feed it to the network at every
# timestep - this is a design choice; it could have been included as part
# of the initial hidden state or some other strategy.
#
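# As an aside, a rough sketch of that alternative (not what we do below)
# could map the one-hot category into the initial hidden state and then
# feed only letters at each timestep; the helper names here are
# hypothetical, for illustration only:
#
# .. code-block:: python
#
#     # Hypothetical variant: condition only through the initial hidden state.
#     category_to_hidden = nn.Linear(n_categories, 128)
#
#     def initHiddenFromCategory(category_tensor):
#         # <1 x n_categories> -> <1 x 128>
#         return torch.tanh(category_to_hidden(category_tensor))
#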

# One-hot vector for category
def categoryTensor(category):
    li = all_categories.index(category)
    tensor = torch.zeros(1, n_categories)
    tensor[0][li] = 1
    return tensor

# One-hot matrix of first to last letters (not including EOS) for input
def inputTensor(line):
    tensor = torch.zeros(len(line), 1, n_letters)
    for li in range(len(line)):
        letter = line[li]
        tensor[li][0][all_letters.find(letter)] = 1
    return tensor

# ``LongTensor`` of second letter to end (EOS) for target
def targetTensor(line):
    letter_indexes = [all_letters.find(line[li]) for li in range(1, len(line))]
    letter_indexes.append(n_letters - 1) # EOS
    return torch.LongTensor(letter_indexes)
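
######################################################################
# As a quick, purely illustrative check of these helpers: for a short
# string such as ``"abc"``, the input is a ``<3 x 1 x n_letters>`` one-hot
# tensor and the target holds the indices of ``"b"``, ``"c"``, and the EOS
# marker (``n_letters - 1``).

print(inputTensor('abc').size())  # torch.Size([3, 1, n_letters])
print(targetTensor('abc'))        # tensor([1, 2, 58]) with the alphabet above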


######################################################################
# For convenience during training we'll make a ``randomTrainingExample``
# function that fetches a random (category, line) pair and turns them into
# the required (category, input, target) tensors.
#

# Make category, input, and target tensors from a random category, line pair
def randomTrainingExample():
    category, line = randomTrainingPair()
    category_tensor = categoryTensor(category)
    input_line_tensor = inputTensor(line)
    target_line_tensor = targetTensor(line)
    return category_tensor, input_line_tensor, target_line_tensor


######################################################################
# Training the Network
# --------------------
#
# In contrast to classification, where only the last output is used, we
# are making a prediction at every step, so we are calculating loss at
# every step.
#
# The magic of autograd allows you to simply sum these losses at each step
# and call backward at the end.
#

criterion = nn.NLLLoss()

learning_rate = 0.0005

def train(category_tensor, input_line_tensor, target_line_tensor):
    target_line_tensor.unsqueeze_(-1)
    hidden = rnn.initHidden()

    rnn.zero_grad()

    loss = torch.Tensor([0]) # you can also just simply use ``loss = 0``

    for i in range(input_line_tensor.size(0)):
        output, hidden = rnn(category_tensor, input_line_tensor[i], hidden)
        l = criterion(output, target_line_tensor[i])
        loss += l

    loss.backward()

    for p in rnn.parameters():
        p.data.add_(p.grad.data, alpha=-learning_rate)

    return output, loss.item() / input_line_tensor.size(0)


######################################################################
# To keep track of how long training takes I am adding a
# ``timeSince(timestamp)`` function which returns a human-readable string:
#

import time
import math

def timeSince(since):
    now = time.time()
    s = now - since
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)


######################################################################
# Training is business as usual - call train a bunch of times and wait a
# few minutes, printing the current time and loss every ``print_every``
# examples, and storing an average loss per ``plot_every`` examples
# in ``all_losses`` for plotting later.
#

rnn = RNN(n_letters, 128, n_letters)

n_iters = 100000
print_every = 5000
plot_every = 500
all_losses = []
total_loss = 0 # Reset every ``plot_every`` ``iters``

start = time.time()

for iter in range(1, n_iters + 1):
    output, loss = train(*randomTrainingExample())
    total_loss += loss

    if iter % print_every == 0:
        print('%s (%d %d%%) %.4f' % (timeSince(start), iter, iter / n_iters * 100, loss))

    if iter % plot_every == 0:
        all_losses.append(total_loss / plot_every)
        total_loss = 0


######################################################################
# Plotting the Losses
# -------------------
#
# Plotting the historical loss from ``all_losses`` shows the network
# learning:
#

import matplotlib.pyplot as plt

plt.figure()
plt.plot(all_losses)


######################################################################
# Sampling the Network
# ====================
#
# To sample we give the network a letter and ask what the next one is,
# feed that in as the next letter, and repeat until the EOS token.
#
# -  Create tensors for input category, starting letter, and empty hidden
#    state
# -  Create a string ``output_name`` with the starting letter
# -  Up to a maximum output length,
#
#    -  Feed the current letter to the network
#    -  Get the next letter from the highest output, and the next hidden state
#    -  If the letter is EOS, stop here
#    -  If a regular letter, add to ``output_name`` and continue
#
# -  Return the final name
#
# .. note::
#    Rather than having to give it a starting letter, another
#    strategy would have been to include a "start of string" token in
#    training and have the network choose its own starting letter.
#

max_length = 20

# Sample from a category and starting letter
def sample(category, start_letter='A'):
    with torch.no_grad():  # no need to track history in sampling
        category_tensor = categoryTensor(category)
        input = inputTensor(start_letter)
        hidden = rnn.initHidden()

        output_name = start_letter

        for i in range(max_length):
            output, hidden = rnn(category_tensor, input[0], hidden)
            topv, topi = output.topk(1)
            topi = topi[0][0]
            if topi == n_letters - 1:
                break
            else:
                letter = all_letters[topi]
                output_name += letter
            input = inputTensor(letter)

        return output_name

# Get multiple samples from one category and multiple starting letters
def samples(category, start_letters='ABC'):
    for start_letter in start_letters:
        print(sample(category, start_letter))

samples('Russian', 'RUS')

samples('German', 'GER')

samples('Spanish', 'SPA')

samples('Chinese', 'CHI')


######################################################################
# Exercises
# =========
#
# -  Try with a different dataset of category -> line, for example:
#
#    -  Fictional series -> Character name
#    -  Part of speech -> Word
#    -  Country -> City
#
# -  Use a "start of sentence" token so that sampling can be done without
#    choosing a start letter
# -  Get better results with a bigger and/or better shaped network
#
#    -  Try the ``nn.LSTM`` and ``nn.GRU`` layers (a rough ``nn.GRU``
#       sketch follows below)
#    -  Combine multiple of these RNNs as a higher level network
#
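

######################################################################
# For the ``nn.GRU`` exercise, here is a minimal sketch of one possible
# drop-in generator. It keeps the same conditioning idea (concatenating the
# one-hot category with the one-hot letter at every timestep) but lets
# ``nn.GRU`` manage the recurrence. The class name and layer layout are
# illustrative assumptions, not part of the tutorial above; since it returns
# the same ``(output, hidden)`` pair as ``RNN``, the ``train()`` and
# ``sample()`` loops could be reused after constructing
# ``rnn = GRUGenerator(n_letters, 128, n_letters)``.

class GRUGenerator(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(GRUGenerator, self).__init__()
        self.hidden_size = hidden_size
        # The GRU sees the category and the current letter at each timestep
        self.gru = nn.GRU(n_categories + input_size, hidden_size)
        self.h2o = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=2)

    def forward(self, category, input, hidden):
        # category: <1 x n_categories>, input: <1 x n_letters>
        combined = torch.cat((category, input), 1).unsqueeze(0)  # <1 x 1 x features>
        output, hidden = self.gru(combined, hidden)
        output = self.softmax(self.h2o(output))
        return output[0], hidden  # <1 x output_size>, <1 x 1 x hidden_size>

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size)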