# -*- coding: utf-8 -*-
"""
NLP From Scratch: Generating Names with a Character-Level RNN
*************************************************************
**Author**: `Sean Robertson <https://github.com/spro>`_

This is the second of three tutorials on "NLP From Scratch".
In the `first tutorial </tutorials/intermediate/char_rnn_classification_tutorial>`_
we used an RNN to classify names into their language of origin. This time
we'll turn around and generate names from languages.

.. code-block:: sh

    > python sample.py Russian RUS
    Rovakov
    Uantov
    Shavakov

    > python sample.py German GER
    Gerren
    Ereng
    Rosher

    > python sample.py Spanish SPA
    Salla
    Parer
    Allan

    > python sample.py Chinese CHI
    Chan
    Hang
    Iun

We are still hand-crafting a small RNN with a few linear layers. The big
difference is that instead of predicting a category after reading in all the
letters of a name, we input a category and output one letter at a time.
Recurrently predicting characters to form language (this could also be
done with words or other higher-order constructs) is often referred to
as a "language model".

**Recommended Reading:**

I assume you have at least installed PyTorch, know Python, and
understand Tensors:

-  https://pytorch.org/ for installation instructions
-  :doc:`/beginner/deep_learning_60min_blitz` to get started with PyTorch in general
-  :doc:`/beginner/pytorch_with_examples` for a wide and deep overview
-  :doc:`/beginner/former_torchies_tutorial` if you are a former Lua Torch user

It would also be useful to know about RNNs and how they work:

-  `The Unreasonable Effectiveness of Recurrent Neural
   Networks <https://karpathy.github.io/2015/05/21/rnn-effectiveness/>`__
   shows a bunch of real life examples
-  `Understanding LSTM
   Networks <https://colah.github.io/posts/2015-08-Understanding-LSTMs/>`__
   is about LSTMs specifically but also informative about RNNs in
   general

I also suggest the previous tutorial, :doc:`/intermediate/char_rnn_classification_tutorial`


Preparing the Data
==================

.. note::
   Download the data from
   `here <https://download.pytorch.org/tutorial/data.zip>`_
   and extract it to the current directory.

See the last tutorial for more detail on this process. In short, there
are a bunch of plain text files ``data/names/[Language].txt`` with a
name per line. We split lines into an array, convert Unicode to ASCII,
and end up with a dictionary ``{language: [names ...]}``.

"""
from io import open
import glob
import os
import unicodedata
import string

all_letters = string.ascii_letters + " .,;'-"
n_letters = len(all_letters) + 1 # Plus EOS marker

def findFiles(path): return glob.glob(path)

# Turn a Unicode string to plain ASCII, thanks to https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )

# Read a file and split into lines
def readLines(filename):
    with open(filename, encoding='utf-8') as some_file:
        return [unicodeToAscii(line.strip()) for line in some_file]

# Build the category_lines dictionary, a list of lines per category
category_lines = {}
all_categories = []
for filename in findFiles('data/names/*.txt'):
    category = os.path.splitext(os.path.basename(filename))[0]
    all_categories.append(category)
    lines = readLines(filename)
    category_lines[category] = lines

n_categories = len(all_categories)

if n_categories == 0:
    raise RuntimeError('Data not found. Make sure that you downloaded data '
        'from https://download.pytorch.org/tutorial/data.zip and extract it to '
        'the current directory.')

print('# categories:', n_categories, all_categories)
print(unicodeToAscii("O'Néàl"))

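######################################################################
# As a quick check (an aside, not part of the original tutorial), we can peek
# at a few entries of the resulting ``{language: [names ...]}`` dictionary:
#

print(category_lines[all_categories[0]][:5])
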
######################################################################
# Creating the Network
# ====================
#
# This network extends `the last tutorial's RNN <#Creating-the-Network>`__
# with an extra argument for the category tensor, which is concatenated
# along with the others. The category tensor is a one-hot vector just like
# the letter input.
#
# We will interpret the output as the probability of the next letter. When
# sampling, the most likely output letter is used as the next input
# letter.
#
# I added a second linear layer ``o2o`` (after combining hidden and
# output) to give it more muscle to work with. There's also a dropout
# layer, which `randomly zeros parts of its
# input <https://arxiv.org/abs/1207.0580>`__ with a given probability
# (here 0.1) and is usually used to fuzz inputs to prevent overfitting.
# Here we're using it towards the end of the network to purposely add some
# chaos and increase sampling variety.
#
# .. figure:: https://i.imgur.com/jzVrf7f.png
#    :alt:
#
#

import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size

        self.i2h = nn.Linear(n_categories + input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(n_categories + input_size + hidden_size, output_size)
        self.o2o = nn.Linear(hidden_size + output_size, output_size)
        self.dropout = nn.Dropout(0.1)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, category, input, hidden):
        input_combined = torch.cat((category, input, hidden), 1)
        hidden = self.i2h(input_combined)
        output = self.i2o(input_combined)
        output_combined = torch.cat((hidden, output), 1)
        output = self.o2o(output_combined)
        output = self.dropout(output)
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)

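######################################################################
# As a quick sanity check (an aside, not part of the original tutorial), we
# can push placeholder one-hot tensors through one step of an untrained
# network and confirm the shapes line up: one log-probability per letter in
# the output, and a hidden state of the requested size.
#

check_rnn = RNN(n_letters, 128, n_letters)
check_category = torch.zeros(1, n_categories)   # placeholder one-hot category
check_input = torch.zeros(1, n_letters)         # placeholder one-hot letter
check_output, check_hidden = check_rnn(check_category, check_input, check_rnn.initHidden())
print(check_output.shape, check_hidden.shape)   # (1, n_letters) and (1, 128)
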
######################################################################
# Training
# =========
# Preparing for Training
# ----------------------
#
# First of all, helper functions to get random pairs of (category, line):
#

import random

# Random item from a list
def randomChoice(l):
    return l[random.randint(0, len(l) - 1)]

# Get a random category and random line from that category
def randomTrainingPair():
    category = randomChoice(all_categories)
    line = randomChoice(category_lines[category])
    return category, line

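######################################################################
# For illustration (an aside, not in the original tutorial), drawing a few
# random pairs shows the kind of (category, line) data the network trains on:
#

for _ in range(3):
    print(randomTrainingPair())
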
######################################################################
# For each timestep (that is, for each letter in a training word) the
# inputs of the network will be
# ``(category, current letter, hidden state)`` and the outputs will be
# ``(next letter, next hidden state)``. So for each training example, we'll
# need the category, a set of input letters, and a set of output/target
# letters.
#
# Since we are predicting the next letter from the current letter for each
# timestep, the letter pairs are groups of consecutive letters from the
# line - e.g. for ``"ABCD<EOS>"`` we would create ("A", "B"), ("B", "C"),
# ("C", "D"), ("D", "EOS").
#
# .. figure:: https://i.imgur.com/JH58tXY.png
#    :alt:
#
# The category tensor is a `one-hot
# tensor <https://en.wikipedia.org/wiki/One-hot>`__ of size
# ``<1 x n_categories>``. When training we feed it to the network at every
# timestep - this is a design choice; it could have been included as part
# of the initial hidden state or some other strategy.
#

# One-hot vector for category
def categoryTensor(category):
    li = all_categories.index(category)
    tensor = torch.zeros(1, n_categories)
    tensor[0][li] = 1
    return tensor

# One-hot matrix of first to last letters (not including EOS) for input
def inputTensor(line):
    tensor = torch.zeros(len(line), 1, n_letters)
    for li in range(len(line)):
        letter = line[li]
        tensor[li][0][all_letters.find(letter)] = 1
    return tensor

# ``LongTensor`` of second letter to end (EOS) for target
def targetTensor(line):
    letter_indexes = [all_letters.find(line[li]) for li in range(1, len(line))]
    letter_indexes.append(n_letters - 1) # EOS
    return torch.LongTensor(letter_indexes)

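######################################################################
# As a brief illustration (an aside, not in the original tutorial), for a
# four-letter name these helpers produce a ``4 x 1 x n_letters`` one-hot
# input tensor and a length-4 target of letter indices ending in EOS:
#

print(categoryTensor(all_categories[0]).shape)  # torch.Size([1, n_categories])
print(inputTensor("Anna").shape)                # torch.Size([4, 1, n_letters])
print(targetTensor("Anna"))                     # indices of "nna" followed by the EOS index
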
######################################################################
# For convenience during training we'll make a ``randomTrainingExample``
# function that fetches a random (category, line) pair and turns them into
# the required (category, input, target) tensors.
#

# Make category, input, and target tensors from a random category, line pair
def randomTrainingExample():
    category, line = randomTrainingPair()
    category_tensor = categoryTensor(category)
    input_line_tensor = inputTensor(line)
    target_line_tensor = targetTensor(line)
    return category_tensor, input_line_tensor, target_line_tensor

######################################################################
# Training the Network
# --------------------
#
# In contrast to classification, where only the last output is used, we
# are making a prediction at every step, so we are calculating loss at
# every step.
#
# The magic of autograd allows you to simply sum the losses from each step
# and call ``backward()`` once at the end.
#

criterion = nn.NLLLoss()

learning_rate = 0.0005

def train(category_tensor, input_line_tensor, target_line_tensor):
    target_line_tensor.unsqueeze_(-1)
    hidden = rnn.initHidden()

    rnn.zero_grad()

    loss = torch.Tensor([0]) # you can also just simply use ``loss = 0``

    for i in range(input_line_tensor.size(0)):
        output, hidden = rnn(category_tensor, input_line_tensor[i], hidden)
        l = criterion(output, target_line_tensor[i])
        loss += l

    loss.backward()

    for p in rnn.parameters():
        p.data.add_(p.grad.data, alpha=-learning_rate)

    return output, loss.item() / input_line_tensor.size(0)

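######################################################################
# (An aside, not from the original tutorial.) The loop over
# ``rnn.parameters()`` above is a hand-rolled SGD step. The same update,
# sketched with ``torch.optim`` instead, would look roughly like:
#
# .. code-block:: python
#
#    optimizer = torch.optim.SGD(rnn.parameters(), lr=learning_rate)
#
#    # inside train(), in place of rnn.zero_grad() and the manual update:
#    optimizer.zero_grad()
#    # ... forward passes and loss accumulation as above ...
#    loss.backward()
#    optimizer.step()
#
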
######################################################################
# To keep track of how long training takes I am adding a
# ``timeSince(timestamp)`` function which returns a human-readable string:
#

import time
import math

def timeSince(since):
    now = time.time()
    s = now - since
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

######################################################################
# Training is business as usual - call ``train`` a bunch of times and wait a
# few minutes, printing the current time and loss every ``print_every``
# examples, and storing an average loss per ``plot_every`` examples
# in ``all_losses`` for plotting later.
#

rnn = RNN(n_letters, 128, n_letters)

n_iters = 100000
print_every = 5000
plot_every = 500
all_losses = []
total_loss = 0 # Reset every ``plot_every`` ``iters``

start = time.time()

for iter in range(1, n_iters + 1):
    output, loss = train(*randomTrainingExample())
    total_loss += loss

    if iter % print_every == 0:
        print('%s (%d %d%%) %.4f' % (timeSince(start), iter, iter / n_iters * 100, loss))

    if iter % plot_every == 0:
        all_losses.append(total_loss / plot_every)
        total_loss = 0

######################################################################
# Plotting the Losses
# -------------------
#
# Plotting the historical loss from ``all_losses`` shows the network
# learning:
#

import matplotlib.pyplot as plt

plt.figure()
plt.plot(all_losses)

######################################################################
# Sampling the Network
# ====================
#
# To sample we give the network a letter and ask what the next one is,
# feed that in as the next letter, and repeat until the EOS token.
#
# -  Create tensors for the input category, starting letter, and empty hidden
#    state
# -  Create a string ``output_name`` with the starting letter
# -  Up to a maximum output length,
#
#    -  Feed the current letter to the network
#    -  Get the next letter from the highest output, and the next hidden state
#    -  If the letter is EOS, stop here
#    -  If a regular letter, add to ``output_name`` and continue
#
# -  Return the final name
#
# .. note::
#    Rather than having to give it a starting letter, another
#    strategy would have been to include a "start of string" token in
#    training and have the network choose its own starting letter.
#

max_length = 20

# Sample from a category and starting letter
def sample(category, start_letter='A'):
    with torch.no_grad():  # no need to track history in sampling
        category_tensor = categoryTensor(category)
        input = inputTensor(start_letter)
        hidden = rnn.initHidden()

        output_name = start_letter

        for i in range(max_length):
            output, hidden = rnn(category_tensor, input[0], hidden)
            topv, topi = output.topk(1)
            topi = topi[0][0]
            if topi == n_letters - 1:
                break
            else:
                letter = all_letters[topi]
                output_name += letter
            input = inputTensor(letter)

        return output_name

# Get multiple samples from one category and multiple starting letters
def samples(category, start_letters='ABC'):
    for start_letter in start_letters:
        print(sample(category, start_letter))

samples('Russian', 'RUS')

samples('German', 'GER')

samples('Spanish', 'SPA')

samples('Chinese', 'CHI')

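######################################################################
# As an optional variation (an addition, not part of the original tutorial),
# instead of always taking the single most likely letter we could sample from
# the network's output distribution, sharpened or flattened by a temperature.
# The helper name ``sample_multinomial`` is made up for this sketch; it reuses
# the network and tensor helpers defined above.
#

def sample_multinomial(category, start_letter='A', temperature=0.8):
    with torch.no_grad():
        category_tensor = categoryTensor(category)
        input = inputTensor(start_letter)
        hidden = rnn.initHidden()

        output_name = start_letter

        for i in range(max_length):
            output, hidden = rnn(category_tensor, input[0], hidden)
            # output holds log-probabilities; rescale by the temperature and
            # exponentiate to get (unnormalized) sampling weights.
            probs = torch.exp(output / temperature)
            topi = torch.multinomial(probs, 1)[0][0]
            if topi == n_letters - 1:
                break
            letter = all_letters[topi]
            output_name += letter
            input = inputTensor(letter)

        return output_name

print(sample_multinomial('Russian', 'R'))
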
######################################################################
# Exercises
# =========
#
# -  Try with a different dataset of category -> line, for example:
#
#    -  Fictional series -> Character name
#    -  Part of speech -> Word
#    -  Country -> City
#
# -  Use a "start of sentence" token so that sampling can be done without
#    choosing a start letter
# -  Get better results with a bigger and/or better-shaped network
#
#    -  Try the ``nn.LSTM`` and ``nn.GRU`` layers (a rough ``nn.GRU`` sketch
#       follows below)
#    -  Combine multiple of these RNNs as a higher-level network
#

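######################################################################
# As a starting point for the ``nn.GRU`` exercise above (an addition, not
# part of the original tutorial), here is one possible sketch: the
# hand-written recurrence is replaced by a single-layer ``nn.GRU`` that
# consumes the concatenated category and letter vectors. Apart from the extra
# dimension on the hidden state (which ``initHidden`` takes care of), it can
# be dropped into the same ``train`` and ``sample`` functions.
#

class GRUGenerator(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        # The GRU sees the (category, letter) concatenation at every timestep.
        self.gru = nn.GRU(n_categories + input_size, hidden_size)
        self.h2o = nn.Linear(hidden_size, output_size)
        self.dropout = nn.Dropout(0.1)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, category, input, hidden):
        # category: <1 x n_categories>, input: <1 x n_letters>,
        # hidden: <1 x 1 x hidden_size> (num_layers x batch x hidden)
        combined = torch.cat((category, input), 1).unsqueeze(0)
        out, hidden = self.gru(combined, hidden)
        output = self.softmax(self.dropout(self.h2o(out.squeeze(0))))
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size)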