# -*- coding: utf-8 -*-
"""
NLP From Scratch: Generating Names with a Character-Level RNN
*************************************************************
**Author**: `Sean Robertson <https://github.com/spro>`_

This tutorial is part of a three-part series:

* `NLP From Scratch: Classifying Names with a Character-Level RNN <https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html>`__
* `NLP From Scratch: Generating Names with a Character-Level RNN <https://pytorch.org/tutorials/intermediate/char_rnn_generation_tutorial.html>`__
* `NLP From Scratch: Translation with a Sequence to Sequence Network and Attention <https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html>`__

This is our second of three tutorials on "NLP From Scratch".
In the `first tutorial </tutorials/intermediate/char_rnn_classification_tutorial>`_
we used an RNN to classify names into their language of origin. This time
we'll turn around and generate names from languages.

.. code-block:: sh

    > python sample.py Russian RUS
    Rovakov
    Uantov
    Shavakov

    > python sample.py German GER
    Gerren
    Ereng
    Rosher

    > python sample.py Spanish SPA
    Salla
    Parer
    Allan

    > python sample.py Chinese CHI
    Chan
    Hang
    Iun

We are still hand-crafting a small RNN with a few linear layers. The big
difference is that instead of predicting a category after reading in all the
letters of a name, we input a category and output one letter at a time.
Recurrently predicting characters to form language (this could also be
done with words or other higher order constructs) is often referred to
as a "language model".

**Recommended Reading:**

I assume you have at least installed PyTorch, know Python, and
understand Tensors:

-  https://pytorch.org/ for installation instructions
-  :doc:`/beginner/deep_learning_60min_blitz` to get started with PyTorch in general
-  :doc:`/beginner/pytorch_with_examples` for a wide and deep overview
-  :doc:`/beginner/former_torchies_tutorial` if you are a former Lua Torch user

It would also be useful to know about RNNs and how they work:

-  `The Unreasonable Effectiveness of Recurrent Neural
   Networks <https://karpathy.github.io/2015/05/21/rnn-effectiveness/>`__
   shows a bunch of real-life examples
-  `Understanding LSTM
   Networks <https://colah.github.io/posts/2015-08-Understanding-LSTMs/>`__
   is about LSTMs specifically but also informative about RNNs in
   general

I also suggest the previous tutorial, :doc:`/intermediate/char_rnn_classification_tutorial`.


Preparing the Data
==================

.. note::
   Download the data from
   `here <https://download.pytorch.org/tutorial/data.zip>`_
   and extract it to the current directory.

See the last tutorial for more details on this process. In short, there
are a bunch of plain text files ``data/names/[Language].txt`` with a
name per line. We split lines into an array, convert Unicode to ASCII,
and end up with a dictionary ``{language: [names ...]}``.

"""
from io import open
import glob
import os
import unicodedata
import string

all_letters = string.ascii_letters + " .,;'-"
n_letters = len(all_letters) + 1  # Plus EOS marker

def findFiles(path): return glob.glob(path)

# Turn a Unicode string to plain ASCII, thanks to https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )

# Read a file and split into lines
def readLines(filename):
    with open(filename, encoding='utf-8') as some_file:
        return [unicodeToAscii(line.strip()) for line in some_file]

# Build the category_lines dictionary, a list of lines per category
category_lines = {}
all_categories = []
for filename in findFiles('data/names/*.txt'):
    category = os.path.splitext(os.path.basename(filename))[0]
    all_categories.append(category)
    lines = readLines(filename)
    category_lines[category] = lines

n_categories = len(all_categories)

if n_categories == 0:
    raise RuntimeError('Data not found. Make sure that you downloaded data '
        'from https://download.pytorch.org/tutorial/data.zip and extracted it to '
        'the current directory.')

print('# categories:', n_categories, all_categories)
print(unicodeToAscii("O'Néàl"))
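

######################################################################
# As a quick sanity check, we can also print a few of the loaded names for
# one language. This assumes the ``Italian.txt`` file from the data archive
# was extracted along with the others; any other key in ``category_lines``
# would work just as well:

print(category_lines['Italian'][:5])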


######################################################################
# Creating the Network
# ====================
#
# This network extends `the last tutorial's RNN <#Creating-the-Network>`__
# with an extra argument for the category tensor, which is concatenated
# along with the others. The category tensor is a one-hot vector just like
# the letter input.
#
# We will interpret the output as the probability of the next letter. When
# sampling, the most likely output letter is used as the next input
# letter.
#
# I added a second linear layer ``o2o`` (after combining hidden and
# output) to give it more muscle to work with. There's also a dropout
# layer, which `randomly zeros parts of its
# input <https://arxiv.org/abs/1207.0580>`__ with a given probability
# (here 0.1) and is usually used to fuzz inputs to prevent overfitting.
# Here we're using it towards the end of the network to purposely add some
# chaos and increase sampling variety.
#
# .. figure:: https://i.imgur.com/jzVrf7f.png
#    :alt:
#
#

import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size

        self.i2h = nn.Linear(n_categories + input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(n_categories + input_size + hidden_size, output_size)
        self.o2o = nn.Linear(hidden_size + output_size, output_size)
        self.dropout = nn.Dropout(0.1)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, category, input, hidden):
        input_combined = torch.cat((category, input, hidden), 1)
        hidden = self.i2h(input_combined)
        output = self.i2o(input_combined)
        output_combined = torch.cat((hidden, output), 1)
        output = self.o2o(output_combined)
        output = self.dropout(output)
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)
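

######################################################################
# As a quick shape check, we can push one dummy step through an instance of
# the network. The ``check_rnn`` name and the all-zero stand-in tensors below
# are just for illustration; the model actually used for training is created
# in the training section further down:

check_rnn = RNN(n_letters, 128, n_letters)
dummy_category = torch.zeros(1, n_categories)  # stand-in for a one-hot category
dummy_letter = torch.zeros(1, n_letters)       # stand-in for a one-hot letter
dummy_hidden = check_rnn.initHidden()
dummy_output, dummy_hidden = check_rnn(dummy_category, dummy_letter, dummy_hidden)
print(dummy_output.size())  # torch.Size([1, n_letters]) - log-probabilities over next letters
print(dummy_hidden.size())  # torch.Size([1, 128]) - the next hidden state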


######################################################################
# Training
# =========
# Preparing for Training
# ----------------------
#
# First of all, helper functions to get random pairs of (category, line):
#

import random

# Random item from a list
def randomChoice(l):
    return l[random.randint(0, len(l) - 1)]

# Get a random category and random line from that category
def randomTrainingPair():
    category = randomChoice(all_categories)
    line = randomChoice(category_lines[category])
    return category, line


######################################################################
# For each timestep (that is, for each letter in a training word) the
# inputs of the network will be
# ``(category, current letter, hidden state)`` and the outputs will be
# ``(next letter, next hidden state)``. So for each training set, we'll
# need the category, a set of input letters, and a set of output/target
# letters.
#
# Since we are predicting the next letter from the current letter for each
# timestep, the letter pairs are groups of consecutive letters from the
# line - e.g. for ``"ABCD<EOS>"`` we would create ("A", "B"), ("B", "C"),
# ("C", "D"), ("D", "EOS").
#
# .. figure:: https://i.imgur.com/JH58tXY.png
#    :alt:
#
# The category tensor is a `one-hot
# tensor <https://en.wikipedia.org/wiki/One-hot>`__ of size
# ``<1 x n_categories>``. When training we feed it to the network at every
# timestep. This is a design choice; it could have been included as part
# of the initial hidden state or some other strategy.
#

# One-hot vector for category
def categoryTensor(category):
    li = all_categories.index(category)
    tensor = torch.zeros(1, n_categories)
    tensor[0][li] = 1
    return tensor

# One-hot matrix of first to last letters (not including EOS) for input
def inputTensor(line):
    tensor = torch.zeros(len(line), 1, n_letters)
    for li in range(len(line)):
        letter = line[li]
        tensor[li][0][all_letters.find(letter)] = 1
    return tensor

# ``LongTensor`` of second letter to end (EOS) for target
def targetTensor(line):
    letter_indexes = [all_letters.find(line[li]) for li in range(1, len(line))]
    letter_indexes.append(n_letters - 1)  # EOS
    return torch.LongTensor(letter_indexes)
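

######################################################################
# To make the input/target pairing concrete, here is what these helpers
# produce for a short made-up name (``'Abe'`` is just an illustrative string,
# not necessarily in the dataset): the input matrix has one one-hot row per
# letter, and the target holds the same sequence shifted by one, ending with
# the EOS index ``n_letters - 1``:

example_line = 'Abe'
print(inputTensor(example_line).size())  # torch.Size([3, 1, n_letters])
print(targetTensor(example_line))        # indices of 'b', 'e', then n_letters - 1 (EOS)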


######################################################################
# For convenience during training we'll make a ``randomTrainingExample``
# function that fetches a random (category, line) pair and turns them into
# the required (category, input, target) tensors.
#

# Make category, input, and target tensors from a random category, line pair
def randomTrainingExample():
    category, line = randomTrainingPair()
    category_tensor = categoryTensor(category)
    input_line_tensor = inputTensor(line)
    target_line_tensor = targetTensor(line)
    return category_tensor, input_line_tensor, target_line_tensor


######################################################################
# Training the Network
# --------------------
#
# In contrast to classification, where only the last output is used, we
# are making a prediction at every step, so we are calculating loss at
# every step.
#
# The magic of autograd allows you to simply sum these losses at each step
# and call backward at the end.
#

criterion = nn.NLLLoss()

learning_rate = 0.0005

def train(category_tensor, input_line_tensor, target_line_tensor):
    target_line_tensor.unsqueeze_(-1)
    hidden = rnn.initHidden()

    rnn.zero_grad()

    loss = torch.Tensor([0])  # you can also just simply use ``loss = 0``

    for i in range(input_line_tensor.size(0)):
        output, hidden = rnn(category_tensor, input_line_tensor[i], hidden)
        l = criterion(output, target_line_tensor[i])
        loss += l

    loss.backward()

    for p in rnn.parameters():
        p.data.add_(p.grad.data, alpha=-learning_rate)

    return output, loss.item() / input_line_tensor.size(0)


######################################################################
# To keep track of how long training takes, I am adding a
# ``timeSince(timestamp)`` function which returns a human-readable string:
#

import time
import math

def timeSince(since):
    now = time.time()
    s = now - since
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)


######################################################################
# Training is business as usual - call train a bunch of times and wait a
# few minutes, printing the current time and loss every ``print_every``
# examples, and storing an average loss per ``plot_every`` examples
# in ``all_losses`` for plotting later.
#

rnn = RNN(n_letters, 128, n_letters)

n_iters = 100000
print_every = 5000
plot_every = 500
all_losses = []
total_loss = 0  # Reset every ``plot_every`` ``iters``

start = time.time()

for iter in range(1, n_iters + 1):
    output, loss = train(*randomTrainingExample())
    total_loss += loss

    if iter % print_every == 0:
        print('%s (%d %d%%) %.4f' % (timeSince(start), iter, iter / n_iters * 100, loss))

    if iter % plot_every == 0:
        all_losses.append(total_loss / plot_every)
        total_loss = 0


######################################################################
# Plotting the Losses
# -------------------
#
# Plotting the historical loss from ``all_losses`` shows the network
# learning:
#

import matplotlib.pyplot as plt

plt.figure()
plt.plot(all_losses)


######################################################################
# Sampling the Network
# ====================
#
# To sample we give the network a letter and ask what the next one is,
# feed that in as the next letter, and repeat until the EOS token.
#
# -  Create tensors for input category, starting letter, and empty hidden
#    state
# -  Create a string ``output_name`` with the starting letter
# -  Up to a maximum output length,
#
#    -  Feed the current letter to the network
#    -  Get the next letter from highest output, and next hidden state
#    -  If the letter is EOS, stop here
#    -  If a regular letter, add to ``output_name`` and continue
#
# -  Return the final name
#
# .. note::
#    Rather than having to give it a starting letter, another
#    strategy would have been to include a "start of string" token in
#    training and have the network choose its own starting letter.
#

max_length = 20

# Sample from a category and starting letter
def sample(category, start_letter='A'):
    with torch.no_grad():  # no need to track history in sampling
        category_tensor = categoryTensor(category)
        input = inputTensor(start_letter)
        hidden = rnn.initHidden()

        output_name = start_letter

        for i in range(max_length):
            output, hidden = rnn(category_tensor, input[0], hidden)
            topv, topi = output.topk(1)
            topi = topi[0][0]
            if topi == n_letters - 1:
                break
            else:
                letter = all_letters[topi]
                output_name += letter
            input = inputTensor(letter)

        return output_name

# Get multiple samples from one category and multiple starting letters
def samples(category, start_letters='ABC'):
    for start_letter in start_letters:
        print(sample(category, start_letter))

samples('Russian', 'RUS')

samples('German', 'GER')

samples('Spanish', 'SPA')

samples('Chinese', 'CHI')
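

######################################################################
# The shell session at the top of this tutorial runs a standalone ``sample.py``
# script, which implies the trained model gets reused outside this file. One
# simple way to hand the weights over (the file name below is just an example)
# is to save the ``state_dict`` once training has finished:

torch.save(rnn.state_dict(), 'char_rnn_generation.pt')

# In a separate script (such as a ``sample.py``) you could then rebuild the
# model and load the saved weights before sampling:
#
#   rnn = RNN(n_letters, 128, n_letters)
#   rnn.load_state_dict(torch.load('char_rnn_generation.pt'))
#   rnn.eval()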


######################################################################
# Exercises
# =========
#
# -  Try with a different dataset of category -> line, for example:
#
#    -  Fictional series -> Character name
#    -  Part of speech -> Word
#    -  Country -> City
#
# -  Use a "start of sentence" token so that sampling can be done without
#    choosing a start letter
# -  Get better results with a bigger and/or better shaped network
#
#    -  Try the ``nn.LSTM`` and ``nn.GRU`` layers
#    -  Combine multiple of these RNNs as a higher level network
#
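

######################################################################
# For the ``nn.GRU`` exercise above, one possible starting point (a sketch;
# the class and attribute names here are chosen just for illustration) is to
# let ``nn.GRU`` handle the recurrence while still feeding the category
# vector in as an extra input feature at every step. The inputs and outputs
# match what ``train()`` and ``sample()`` expect, so it could be swapped in
# for the hand-rolled ``RNN``:

class GRUGenerator(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(GRUGenerator, self).__init__()
        self.hidden_size = hidden_size
        # The GRU consumes the concatenated (category, letter) vector at each step
        self.gru = nn.GRU(n_categories + input_size, hidden_size)
        self.h2o = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, category, input, hidden):
        combined = torch.cat((category, input), 1).unsqueeze(0)  # (1, 1, features)
        output, hidden = self.gru(combined, hidden)
        output = self.softmax(self.h2o(output[0]))                # (1, output_size)
        return output, hidden

    def initHidden(self):
        # ``nn.GRU`` expects a (num_layers, batch, hidden_size) hidden state
        return torch.zeros(1, 1, self.hidden_size)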