# -*- coding: utf-8 -*-
"""
NLP From Scratch: Classifying Names with a Character-Level RNN
**************************************************************
**Author**: `Sean Robertson <https://github.com/spro>`_

We will be building and training a basic character-level Recurrent Neural
Network (RNN) to classify words. This tutorial, along with two other
Natural Language Processing (NLP) "from scratch" tutorials
:doc:`/intermediate/char_rnn_generation_tutorial` and
:doc:`/intermediate/seq2seq_translation_tutorial`, shows how to
preprocess data for NLP modeling. In particular, these tutorials do not
use many of the convenience functions of `torchtext`, so you can see how
NLP preprocessing works at a low level.

A character-level RNN reads words as a series of characters -
outputting a prediction and "hidden state" at each step, and feeding its
previous hidden state into each next step. We take the final prediction
to be the output, i.e. which class the word belongs to.
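
For a name like "Ada", a sketch of the unrolled loop (using the ``rnn``
module and ``lineToTensor`` helper defined later in this tutorial) looks
like:

.. code-block:: python

    hidden = rnn.initHidden()
    for letter_tensor in lineToTensor('Ada'):
        output, hidden = rnn(letter_tensor, hidden)
    # the last step's ``output`` is the prediction for the whole name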

Specifically, we'll train on a few thousand surnames from 18 languages
of origin, and predict which language a name is from based on the
spelling:

.. code-block:: sh

    $ python predict.py Hinton
    (-0.47) Scottish
    (-1.52) English
    (-3.57) Irish

    $ python predict.py Schmidhuber
    (-0.19) German
    (-2.48) Czech
    (-2.68) Dutch


Recommended Preparation
=======================

Before starting this tutorial it is recommended that you have PyTorch
installed, and have a basic understanding of the Python programming
language and Tensors:

- https://pytorch.org/ for installation instructions
- :doc:`/beginner/deep_learning_60min_blitz` to get started with PyTorch
  in general and learn the basics of Tensors
- :doc:`/beginner/pytorch_with_examples` for a wide and deep overview
- :doc:`/beginner/former_torchies_tutorial` if you are a former Lua Torch user

It would also be useful to know about RNNs and how they work:

- `The Unreasonable Effectiveness of Recurrent Neural
  Networks <https://karpathy.github.io/2015/05/21/rnn-effectiveness/>`__
  shows a bunch of real-life examples
- `Understanding LSTM
  Networks <https://colah.github.io/posts/2015-08-Understanding-LSTMs/>`__
  is about LSTMs specifically but also informative about RNNs in
  general

Preparing the Data
==================

.. note::
   Download the data from
   `here <https://download.pytorch.org/tutorial/data.zip>`_
   and extract it to the current directory.

Included in the ``data/names`` directory are 18 text files named as
``[Language].txt``. Each file contains a bunch of names, one name per
line, mostly romanized (but we still need to convert from Unicode to
ASCII).

We'll end up with a dictionary of lists of names per language,
``{language: [names ...]}``. The generic variables "category" and "line"
(for language and name in our case) are used for later extensibility.
"""
from io import open
import glob
import os

def findFiles(path): return glob.glob(path)

print(findFiles('data/names/*.txt'))

import unicodedata
import string

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)

# Turn a Unicode string to plain ASCII, thanks to https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )

print(unicodeToAscii('Ślusàrski'))

# Build the category_lines dictionary, a list of names per language
category_lines = {}
all_categories = []

# Read a file and split into lines
def readLines(filename):
    lines = open(filename, encoding='utf-8').read().strip().split('\n')
    return [unicodeToAscii(line) for line in lines]

for filename in findFiles('data/names/*.txt'):
    category = os.path.splitext(os.path.basename(filename))[0]
    all_categories.append(category)
    lines = readLines(filename)
    category_lines[category] = lines

n_categories = len(all_categories)

######################################################################
# Now we have ``category_lines``, a dictionary mapping each category
# (language) to a list of lines (names). We also kept track of
# ``all_categories`` (just a list of languages) and ``n_categories`` for
# later reference.
#

print(category_lines['Italian'][:5])


######################################################################
# Turning Names into Tensors
# --------------------------
#
# Now that we have all the names organized, we need to turn them into
# Tensors to make any use of them.
#
# To represent a single letter, we use a "one-hot vector" of size
# ``<1 x n_letters>``. A one-hot vector is filled with 0s except for a 1
# at the index of the current letter, e.g. ``"b" = <0 1 0 0 0 ...>``.
#
# To make a word we join a bunch of those into a 3D tensor of size
# ``<line_length x 1 x n_letters>``.
#
# That extra 1 dimension is because PyTorch assumes everything is in
# batches - we're just using a batch size of 1 here.
#

import torch

# Find letter index from all_letters, e.g. "a" = 0
def letterToIndex(letter):
    return all_letters.find(letter)

# Just for demonstration, turn a letter into a <1 x n_letters> Tensor
def letterToTensor(letter):
    tensor = torch.zeros(1, n_letters)
    tensor[0][letterToIndex(letter)] = 1
    return tensor

# Turn a line into a <line_length x 1 x n_letters>,
# or an array of one-hot letter vectors
def lineToTensor(line):
    tensor = torch.zeros(len(line), 1, n_letters)
    for li, letter in enumerate(line):
        tensor[li][0][letterToIndex(letter)] = 1
    return tensor

print(letterToTensor('J'))

print(lineToTensor('Jones').size())
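# Expected shape, assuming the 57-character alphabet defined above:
# torch.Size([5, 1, 57]) -- 5 letters, a batch of 1, one-hot width 57.
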
######################################################################
# Creating the Network
# ====================
#
# Before autograd, creating a recurrent neural network in Torch involved
# cloning the parameters of a layer over several timesteps. The layers
# held hidden state and gradients, which are now entirely handled by the
# graph itself. This means you can implement an RNN in a very "pure" way,
# as regular feed-forward layers.
#
# This RNN module implements a "vanilla RNN" and is just 3 linear layers
# which operate on an input and hidden state, with a ``LogSoftmax`` layer
# after the output.
#

import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()

        self.hidden_size = hidden_size

        self.i2h = nn.Linear(input_size, hidden_size)
        self.h2h = nn.Linear(hidden_size, hidden_size)
        self.h2o = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        hidden = torch.tanh(self.i2h(input) + self.h2h(hidden))
        output = self.h2o(hidden)
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)

n_hidden = 128
rnn = RNN(n_letters, n_hidden, n_categories)


######################################################################
# To run a step of this network we need to pass an input (in our case, the
# Tensor for the current letter) and a previous hidden state (which we
# initialize as zeros at first). We'll get back the output (the
# log-probability of each language) and a next hidden state (which we
# keep for the next step).
#

input = letterToTensor('A')
hidden = torch.zeros(1, n_hidden)

output, next_hidden = rnn(input, hidden)


######################################################################
# For the sake of efficiency we don't want to be creating a new Tensor for
# every step, so we will use ``lineToTensor`` instead of
# ``letterToTensor`` and use slices. This could be further optimized by
# precomputing batches of Tensors.
#

input = lineToTensor('Albert')
hidden = torch.zeros(1, n_hidden)

output, next_hidden = rnn(input[0], hidden)
print(output)


######################################################################
# As you can see the output is a ``<1 x n_categories>`` Tensor, where
# every item is the log-likelihood of that category (higher is more
# likely).
#
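
# Since the final layer is ``LogSoftmax``, these are log-probabilities; a
# quick, purely illustrative way to read them as ordinary probabilities
# (summing to 1) is to exponentiate:
print(torch.exp(output))        # one probability per category
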
######################################################################
#
# Training
# ========
#
# Preparing for Training
# ----------------------
#
# Before going into training we should make a few helper functions. The
# first is to interpret the output of the network, which we know to be a
# log-likelihood of each category. We can use ``Tensor.topk`` to get the
# index of the greatest value:
#

def categoryFromOutput(output):
    top_n, top_i = output.topk(1)
    category_i = top_i[0].item()
    return all_categories[category_i], category_i

print(categoryFromOutput(output))


######################################################################
# We will also want a quick way to get a training example (a name and its
# language):
#

import random

def randomChoice(l):
    return l[random.randint(0, len(l) - 1)]

def randomTrainingExample():
    category = randomChoice(all_categories)
    line = randomChoice(category_lines[category])
    category_tensor = torch.tensor([all_categories.index(category)], dtype=torch.long)
    line_tensor = lineToTensor(line)
    return category, line, category_tensor, line_tensor

for i in range(10):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    print('category =', category, '/ line =', line)


######################################################################
# Training the Network
# --------------------
#
# Now all it takes to train this network is to show it a bunch of
# examples, have it make guesses, and tell it if it's wrong.
#
# For the loss function ``nn.NLLLoss`` is appropriate, since the last
# layer of the RNN is ``nn.LogSoftmax``.
#

criterion = nn.NLLLoss()
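
######################################################################
# Aside (an illustrative check, not part of the tutorial's pipeline):
# ``nn.NLLLoss`` applied to ``LogSoftmax`` outputs computes the same value
# as ``nn.CrossEntropyLoss`` applied to the raw, pre-softmax scores, so
# the two formulations are interchangeable. A minimal sketch with made-up
# scores:

scores = torch.randn(1, n_categories)   # hypothetical raw scores (logits)
target = torch.tensor([0])              # arbitrary target category index
nll = criterion(nn.LogSoftmax(dim=1)(scores), target)
ce = nn.CrossEntropyLoss()(scores, target)
print(torch.isclose(nll, ce))           # expected: tensor(True)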
######################################################################
# Each loop of training will:
#
# - Create input and target tensors
# - Create a zeroed initial hidden state
# - Read each letter in and
#
#   - Keep hidden state for next letter
#
# - Compare final output to target
# - Back-propagate
# - Return the output and loss
#

learning_rate = 0.005 # If you set this too high, it might explode. If too low, it might not learn

def train(category_tensor, line_tensor):
    hidden = rnn.initHidden()

    rnn.zero_grad()

    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)

    loss = criterion(output, category_tensor)
    loss.backward()

    # Add parameters' gradients to their values, multiplied by learning rate
    for p in rnn.parameters():
        p.data.add_(p.grad.data, alpha=-learning_rate)

    return output, loss.item()

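
######################################################################
# Aside (not in the original tutorial): the manual parameter update in
# ``train`` is plain stochastic gradient descent. With ``torch.optim``
# the same update would look like this sketch:
#
# .. code-block:: python
#
#    optimizer = torch.optim.SGD(rnn.parameters(), lr=learning_rate)
#    # ... after loss.backward():
#    optimizer.step()   # applies p -= lr * p.grad for every parameter
#
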
######################################################################
# Now we just have to run that with a bunch of examples. Since the
# ``train`` function returns both the output and loss we can print its
# guesses and also keep track of loss for plotting. Since there are 1000s
# of examples we print only every ``print_every`` examples, and take an
# average of the loss.
#

import time
import math

n_iters = 100000
print_every = 5000
plot_every = 1000

# Keep track of losses for plotting
current_loss = 0
all_losses = []

def timeSince(since):
    now = time.time()
    s = now - since
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

start = time.time()

for iter in range(1, n_iters + 1):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    output, loss = train(category_tensor, line_tensor)
    current_loss += loss

    # Print ``iter`` number, loss, name and guess
    if iter % print_every == 0:
        guess, guess_i = categoryFromOutput(output)
        correct = '✓' if guess == category else '✗ (%s)' % category
        print('%d %d%% (%s) %.4f %s / %s %s' % (iter, iter / n_iters * 100, timeSince(start), loss, line, guess, correct))

    # Add current loss avg to list of losses
    if iter % plot_every == 0:
        all_losses.append(current_loss / plot_every)
        current_loss = 0


######################################################################
# Plotting the Results
# --------------------
#
# Plotting the historical loss from ``all_losses`` shows the network
# learning:
#

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

plt.figure()
plt.plot(all_losses)


######################################################################
# Evaluating the Results
# ======================
#
# To see how well the network performs on different categories, we will
# create a confusion matrix, indicating for every actual language (rows)
# which language the network guesses (columns). To calculate the confusion
# matrix a bunch of samples are run through the network with
# ``evaluate()``, which is the same as ``train()`` minus the backprop.
#

# Keep track of correct guesses in a confusion matrix
confusion = torch.zeros(n_categories, n_categories)
n_confusion = 10000

# Just return an output given a line
def evaluate(line_tensor):
    hidden = rnn.initHidden()

    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)

    return output

# Go through a bunch of examples and record which are correctly guessed
for i in range(n_confusion):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    output = evaluate(line_tensor)
    guess, guess_i = categoryFromOutput(output)
    category_i = all_categories.index(category)
    confusion[category_i][guess_i] += 1

# Normalize by dividing every row by its sum
for i in range(n_categories):
    confusion[i] = confusion[i] / confusion[i].sum()

# Set up plot
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(confusion.numpy())
fig.colorbar(cax)

# Set up axes
ax.set_xticklabels([''] + all_categories, rotation=90)
ax.set_yticklabels([''] + all_categories)

# Force label at every tick
ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

# sphinx_gallery_thumbnail_number = 2
plt.show()


######################################################################
# You can pick out bright spots off the main diagonal that show which
# languages it guesses incorrectly, e.g. Chinese for Korean, and Spanish
# for Italian. It seems to do very well with Greek, and very poorly with
# English (perhaps because of overlap with other languages).
#


######################################################################
# Running on User Input
# ---------------------
#

def predict(input_line, n_predictions=3):
    print('\n> %s' % input_line)
    with torch.no_grad():
        output = evaluate(lineToTensor(input_line))

        # Get top N categories
        topv, topi = output.topk(n_predictions, 1, True)
        predictions = []

        for i in range(n_predictions):
            value = topv[0][i].item()
            category_index = topi[0][i].item()
            print('(%.2f) %s' % (value, all_categories[category_index]))
            predictions.append([value, all_categories[category_index]])

predict('Dovesky')
predict('Jackson')
predict('Satoshi')


######################################################################
# The final versions of the scripts `in the Practical PyTorch
# repo <https://github.com/spro/practical-pytorch/tree/master/char-rnn-classification>`__
# split the above code into a few files:
#
# - ``data.py`` (loads files)
# - ``model.py`` (defines the RNN)
# - ``train.py`` (runs training)
# - ``predict.py`` (runs ``predict()`` with command line arguments)
# - ``server.py`` (serves predictions as a JSON API with ``bottle.py``)
#
# Run ``train.py`` to train and save the network.
#
# Run ``predict.py`` with a name to view predictions:
#
# .. code-block:: sh
#
#     $ python predict.py Hazaki
#     (-0.42) Japanese
#     (-1.39) Polish
#     (-3.51) Czech
#
# Run ``server.py`` and visit http://localhost:5533/Yourname to get JSON
# output of predictions.
#

######################################################################
# Exercises
# =========
#
# - Try with a different dataset of line -> category, for example:
#
#   - Any word -> language
#   - First name -> gender
#   - Character name -> writer
#   - Page title -> blog or subreddit
#
# - Get better results with a bigger and/or better shaped network
#
#   - Add more linear layers
#   - Try the ``nn.LSTM`` and ``nn.GRU`` layers
#   - Combine multiple of these RNNs as a higher level network
#
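
######################################################################
# As a starting point for the network exercises, a minimal sketch of a
# classifier built on ``nn.GRU`` (one possible shape among many; the
# class name and layout here are illustrative assumptions):

class GRUClassifier(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(GRUClassifier, self).__init__()
        self.gru = nn.GRU(input_size, hidden_size)   # consumes the whole sequence at once
        self.h2o = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, line_tensor):
        # line_tensor: <line_length x 1 x n_letters>; nn.GRU returns all
        # step outputs plus the final hidden state - classify from the latter
        _, hidden = self.gru(line_tensor)
        return self.softmax(self.h2o(hidden[0]))

gru_rnn = GRUClassifier(n_letters, n_hidden, n_categories)
print(gru_rnn(lineToTensor('Albert')).size())   # expected: torch.Size([1, 18])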