Path: blob/main/intermediate_source/char_rnn_classification_tutorial.py
# -*- coding: utf-8 -*-
"""
NLP From Scratch: Classifying Names with a Character-Level RNN
**************************************************************
**Author**: `Sean Robertson <https://github.com/spro>`_

We will be building and training a basic character-level Recurrent Neural
Network (RNN) to classify words. This tutorial, along with two other
Natural Language Processing (NLP) "from scratch" tutorials
:doc:`/intermediate/char_rnn_generation_tutorial` and
:doc:`/intermediate/seq2seq_translation_tutorial`, shows how to
preprocess data for NLP modeling. In particular, these tutorials do not
use many of the convenience functions of `torchtext`, so you can see how
preprocessing for NLP modeling works at a low level.

A character-level RNN reads words as a series of characters -
outputting a prediction and "hidden state" at each step, feeding its
previous hidden state into each next step. We take the final prediction
to be the output, i.e. which class the word belongs to.

Specifically, we'll train on a few thousand surnames from 18 languages
of origin, and predict which language a name is from based on the
spelling:

.. code-block:: sh

    $ python predict.py Hinton
    (-0.47) Scottish
    (-1.52) English
    (-3.57) Irish

    $ python predict.py Schmidhuber
    (-0.19) German
    (-2.48) Czech
    (-2.68) Dutch


Recommended Preparation
=======================

Before starting this tutorial it is recommended that you have PyTorch
installed, and have a basic understanding of the Python programming
language and Tensors:

-  https://pytorch.org/ for installation instructions
-  :doc:`/beginner/deep_learning_60min_blitz` to get started with PyTorch
   in general and learn the basics of Tensors
-  :doc:`/beginner/pytorch_with_examples` for a wide and deep overview
-  :doc:`/beginner/former_torchies_tutorial` if you are a former Lua Torch
   user

It would also be useful to know about RNNs and how they work:

-  `The Unreasonable Effectiveness of Recurrent Neural
   Networks <https://karpathy.github.io/2015/05/21/rnn-effectiveness/>`__
   shows a bunch of real life examples
-  `Understanding LSTM
   Networks <https://colah.github.io/posts/2015-08-Understanding-LSTMs/>`__
   is about LSTMs specifically but also informative about RNNs in
   general

Preparing the Data
==================

.. note::
   Download the data from
   `here <https://download.pytorch.org/tutorial/data.zip>`_
   and extract it to the current directory.

Included in the ``data/names`` directory are 18 text files named as
``[Language].txt``. Each file contains a bunch of names, one name per
line, mostly romanized (but we still need to convert from Unicode to
ASCII).

We'll end up with a dictionary of lists of names per language,
``{language: [names ...]}``. The generic variables "category" and "line"
(for language and name in our case) are used for later extensibility.
"""
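######################################################################
# If the dataset is not already in place, a minimal sketch for fetching
# and extracting it with the standard library might look like this
# (this snippet is our own addition, assuming network access; the URL
# is the one from the note above):

import os
import urllib.request
import zipfile

if not os.path.isdir('data/names'):
    urllib.request.urlretrieve('https://download.pytorch.org/tutorial/data.zip', 'data.zip')
    with zipfile.ZipFile('data.zip') as zf:
        zf.extractall('.')  # creates the data/names directory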
from io import open
import glob
import os

def findFiles(path): return glob.glob(path)

print(findFiles('data/names/*.txt'))

import unicodedata
import string

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)

# Turn a Unicode string to plain ASCII, thanks to https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )

print(unicodeToAscii('Ślusàrski'))

# Build the category_lines dictionary, a list of names per language
category_lines = {}
all_categories = []

# Read a file and split into lines
def readLines(filename):
    lines = open(filename, encoding='utf-8').read().strip().split('\n')
    return [unicodeToAscii(line) for line in lines]

for filename in findFiles('data/names/*.txt'):
    category = os.path.splitext(os.path.basename(filename))[0]
    all_categories.append(category)
    lines = readLines(filename)
    category_lines[category] = lines

n_categories = len(all_categories)


######################################################################
# Now we have ``category_lines``, a dictionary mapping each category
# (language) to a list of lines (names). We also kept track of
# ``all_categories`` (just a list of languages) and ``n_categories`` for
# later reference.
#

print(category_lines['Italian'][:5])


######################################################################
# Turning Names into Tensors
# --------------------------
#
# Now that we have all the names organized, we need to turn them into
# Tensors to make any use of them.
#
# To represent a single letter, we use a "one-hot vector" of size
# ``<1 x n_letters>``. A one-hot vector is filled with 0s except for a 1
# at the index of the current letter, e.g. ``"b" = <0 1 0 0 0 ...>``.
#
# To make a word we join a bunch of those into a 2D matrix
# ``<line_length x 1 x n_letters>``.
#
# That extra 1 dimension is because PyTorch assumes everything is in
# batches - we're just using a batch size of 1 here.
#

import torch

# Find letter index from all_letters, e.g. "a" = 0
def letterToIndex(letter):
    return all_letters.find(letter)

# Just for demonstration, turn a letter into a <1 x n_letters> Tensor
def letterToTensor(letter):
    tensor = torch.zeros(1, n_letters)
    tensor[0][letterToIndex(letter)] = 1
    return tensor

# Turn a line into a <line_length x 1 x n_letters>,
# or an array of one-hot letter vectors
def lineToTensor(line):
    tensor = torch.zeros(len(line), 1, n_letters)
    for li, letter in enumerate(line):
        tensor[li][0][letterToIndex(letter)] = 1
    return tensor

print(letterToTensor('J'))

print(lineToTensor('Jones').size())
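######################################################################
# As an aside, the same encoding can be produced with
# ``torch.nn.functional.one_hot`` (an equivalent alternative shown only
# for illustration; the rest of the tutorial keeps ``lineToTensor``):

indices = torch.tensor([letterToIndex(c) for c in 'Jones'])
one_hot = torch.nn.functional.one_hot(indices, num_classes=n_letters)
# one_hot is integer-typed; convert and add the batch dimension
print(one_hot.float().unsqueeze(1).size())  # matches lineToTensor('Jones').size()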
######################################################################
# Creating the Network
# ====================
#
# Before autograd, creating a recurrent neural network in Torch involved
# cloning the parameters of a layer over several timesteps. The layers
# held hidden state and gradients which are now entirely handled by the
# graph itself. This means you can implement an RNN in a very "pure" way,
# as regular feed-forward layers.
#
# This RNN module implements a "vanilla RNN" and is just 3 linear layers
# which operate on an input and hidden state, with a ``LogSoftmax`` layer
# after the output.
#

import torch.nn as nn
import torch.nn.functional as F

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()

        self.hidden_size = hidden_size

        self.i2h = nn.Linear(input_size, hidden_size)
        self.h2h = nn.Linear(hidden_size, hidden_size)
        self.h2o = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        hidden = F.tanh(self.i2h(input) + self.h2h(hidden))
        output = self.h2o(hidden)
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)

n_hidden = 128
rnn = RNN(n_letters, n_hidden, n_categories)


######################################################################
# To run a step of this network we need to pass an input (in our case, the
# Tensor for the current letter) and a previous hidden state (which we
# initialize as zeros at first). We'll get back the output (probability of
# each language) and a next hidden state (which we keep for the next
# step).
#

input = letterToTensor('A')
hidden = torch.zeros(1, n_hidden)

output, next_hidden = rnn(input, hidden)


######################################################################
# For the sake of efficiency we don't want to be creating a new Tensor for
# every step, so we will use ``lineToTensor`` instead of
# ``letterToTensor`` and use slices. This could be further optimized by
# precomputing batches of Tensors.
#

input = lineToTensor('Albert')
hidden = torch.zeros(1, n_hidden)

output, next_hidden = rnn(input[0], hidden)
print(output)


######################################################################
# As you can see the output is a ``<1 x n_categories>`` Tensor, where
# every item is the log-likelihood of that category (higher is more
# likely).
#
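######################################################################
# Because the final layer is ``LogSoftmax``, exponentiating the output
# recovers probabilities that sum to 1 (a quick sanity check of our own,
# not part of the original tutorial):

print(torch.exp(output))        # one probability per category
print(torch.exp(output).sum())  # tensor(1.) up to floating-point error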
######################################################################
#
# Training
# ========
# Preparing for Training
# ----------------------
#
# Before going into training we should make a few helper functions. The
# first is to interpret the output of the network, which we know to be a
# likelihood of each category. We can use ``Tensor.topk`` to get the index
# of the greatest value:
#

def categoryFromOutput(output):
    top_n, top_i = output.topk(1)
    category_i = top_i[0].item()
    return all_categories[category_i], category_i

print(categoryFromOutput(output))


######################################################################
# We will also want a quick way to get a training example (a name and its
# language):
#

import random

def randomChoice(l):
    return l[random.randint(0, len(l) - 1)]

def randomTrainingExample():
    category = randomChoice(all_categories)
    line = randomChoice(category_lines[category])
    category_tensor = torch.tensor([all_categories.index(category)], dtype=torch.long)
    line_tensor = lineToTensor(line)
    return category, line, category_tensor, line_tensor

for i in range(10):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    print('category =', category, '/ line =', line)


######################################################################
# Training the Network
# --------------------
#
# Now all it takes to train this network is to show it a bunch of
# examples, have it make guesses, and tell it if it's wrong.
#
# For the loss function ``nn.NLLLoss`` is appropriate, since the last
# layer of the RNN is ``nn.LogSoftmax``.
#

criterion = nn.NLLLoss()


######################################################################
# Each loop of training will:
#
# -  Create input and target tensors
# -  Create a zeroed initial hidden state
# -  Read each letter in and
#
#    -  Keep hidden state for next letter
#
# -  Compare final output to target
# -  Back-propagate
# -  Return the output and loss
#

learning_rate = 0.005 # If you set this too high, it might explode. If too low, it might not learn

def train(category_tensor, line_tensor):
    hidden = rnn.initHidden()

    rnn.zero_grad()

    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)

    loss = criterion(output, category_tensor)
    loss.backward()

    # Add parameters' gradients to their values, multiplied by learning rate
    for p in rnn.parameters():
        p.data.add_(p.grad.data, alpha=-learning_rate)

    return output, loss.item()
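######################################################################
# The parameter update above is plain SGD done by hand. An equivalent
# sketch using ``torch.optim`` (our own variant, named
# ``trainWithOptimizer`` here; the rest of the tutorial keeps the manual
# update) would look like this:

import torch.optim as optim

optimizer = optim.SGD(rnn.parameters(), lr=learning_rate)

def trainWithOptimizer(category_tensor, line_tensor):
    hidden = rnn.initHidden()
    optimizer.zero_grad()
    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)
    loss = criterion(output, category_tensor)
    loss.backward()
    optimizer.step()  # applies the same update as the manual parameter loop
    return output, loss.item()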
######################################################################
# Now we just have to run that with a bunch of examples. Since the
# ``train`` function returns both the output and loss we can print its
# guesses and also keep track of loss for plotting. Since there are
# thousands of examples we print only every ``print_every`` examples, and
# take an average of the loss.
#

import time
import math

n_iters = 100000
print_every = 5000
plot_every = 1000

# Keep track of losses for plotting
current_loss = 0
all_losses = []

def timeSince(since):
    now = time.time()
    s = now - since
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

start = time.time()

for iter in range(1, n_iters + 1):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    output, loss = train(category_tensor, line_tensor)
    current_loss += loss

    # Print ``iter`` number, loss, name and guess
    if iter % print_every == 0:
        guess, guess_i = categoryFromOutput(output)
        correct = '✓' if guess == category else '✗ (%s)' % category
        print('%d %d%% (%s) %.4f %s / %s %s' % (iter, iter / n_iters * 100, timeSince(start), loss, line, guess, correct))

    # Add current loss avg to list of losses
    if iter % plot_every == 0:
        all_losses.append(current_loss / plot_every)
        current_loss = 0


######################################################################
# Plotting the Results
# --------------------
#
# Plotting the historical loss from ``all_losses`` shows the network
# learning:
#

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

plt.figure()
plt.plot(all_losses)


######################################################################
# Evaluating the Results
# ======================
#
# To see how well the network performs on different categories, we will
# create a confusion matrix, indicating for every actual language (rows)
# which language the network guesses (columns). To calculate the confusion
# matrix a bunch of samples are run through the network with
# ``evaluate()``, which is the same as ``train()`` minus the backprop.
#

# Keep track of correct guesses in a confusion matrix
confusion = torch.zeros(n_categories, n_categories)
n_confusion = 10000

# Just return an output given a line
def evaluate(line_tensor):
    hidden = rnn.initHidden()

    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)

    return output

# Go through a bunch of examples and record which are correctly guessed
for i in range(n_confusion):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    output = evaluate(line_tensor)
    guess, guess_i = categoryFromOutput(output)
    category_i = all_categories.index(category)
    confusion[category_i][guess_i] += 1

# Normalize by dividing every row by its sum
for i in range(n_categories):
    confusion[i] = confusion[i] / confusion[i].sum()

# Set up plot
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(confusion.numpy())
fig.colorbar(cax)

# Set up axes
ax.set_xticklabels([''] + all_categories, rotation=90)
ax.set_yticklabels([''] + all_categories)

# Force label at every tick
ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

# sphinx_gallery_thumbnail_number = 2
plt.show()
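######################################################################
# After the row normalization above, the diagonal of ``confusion`` holds
# per-language recall, so averaging it gives a rough overall accuracy
# figure (a quick check of our own, not part of the original tutorial):

print('macro-average accuracy: %.2f' % confusion.diag().mean().item())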
######################################################################
# You can pick out bright spots off the main diagonal that show which
# languages it guesses incorrectly, e.g. Chinese for Korean, and Spanish
# for Italian. It seems to do very well with Greek, and very poorly with
# English (perhaps because of overlap with other languages).
#


######################################################################
# Running on User Input
# ---------------------
#

def predict(input_line, n_predictions=3):
    print('\n> %s' % input_line)
    with torch.no_grad():
        output = evaluate(lineToTensor(input_line))

        # Get top N categories
        topv, topi = output.topk(n_predictions, 1, True)
        predictions = []

        for i in range(n_predictions):
            value = topv[0][i].item()
            category_index = topi[0][i].item()
            print('(%.2f) %s' % (value, all_categories[category_index]))
            predictions.append([value, all_categories[category_index]])

predict('Dovesky')
predict('Jackson')
predict('Satoshi')


######################################################################
# The final versions of the scripts `in the Practical PyTorch
# repo <https://github.com/spro/practical-pytorch/tree/master/char-rnn-classification>`__
# split the above code into a few files:
#
# -  ``data.py`` (loads files)
# -  ``model.py`` (defines the RNN)
# -  ``train.py`` (runs training)
# -  ``predict.py`` (runs ``predict()`` with command line arguments)
# -  ``server.py`` (serves predictions as a JSON API with ``bottle.py``)
#
# Run ``train.py`` to train and save the network.
#
# Run ``predict.py`` with a name to view predictions:
#
# .. code-block:: sh
#
#     $ python predict.py Hazaki
#     (-0.42) Japanese
#     (-1.39) Polish
#     (-3.51) Czech
#
# Run ``server.py`` and visit http://localhost:5533/Yourname to get JSON
# output of predictions.
#


######################################################################
# Exercises
# =========
#
# -  Try with a different dataset of line -> category, for example:
#
#    -  Any word -> language
#    -  First name -> gender
#    -  Character name -> writer
#    -  Page title -> blog or subreddit
#
# -  Get better results with a bigger and/or better shaped network
#
#    -  Add more linear layers
#    -  Try the ``nn.LSTM`` and ``nn.GRU`` layers (see the sketch below)
#    -  Combine multiple of these RNNs as a higher level network
#
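######################################################################
# As a starting point for the ``nn.LSTM`` exercise, here is a minimal
# sketch of a drop-in classifier built on ``nn.LSTM`` (the class name
# ``LSTMClassifier`` is our own; it is untrained here and shown only to
# illustrate the wiring):

class LSTMClassifier(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(LSTMClassifier, self).__init__()
        self.lstm = nn.LSTM(input_size, hidden_size)
        self.h2o = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, line_tensor):
        # nn.LSTM consumes the whole <line_length x 1 x n_letters> tensor
        # at once and manages its own hidden and cell state
        _, (hidden, _) = self.lstm(line_tensor)
        return self.softmax(self.h2o(hidden[0]))

lstm_classifier = LSTMClassifier(n_letters, n_hidden, n_categories)
print(lstm_classifier(lineToTensor('Albert')).size())  # <1 x n_categories>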