# -*- coding: utf-8 -*-
"""
NLP From Scratch: Generating Names with a Character-Level RNN
*************************************************************
**Author**: `Sean Robertson <https://github.com/spro>`_

This is our second of three tutorials on "NLP From Scratch".
In the `first tutorial </tutorials/intermediate/char_rnn_classification_tutorial>`_
we used an RNN to classify names into their language of origin. This time
we'll turn around and generate names from languages.

.. code-block:: sh

    > python sample.py Russian RUS
    Rovakov
    Uantov
    Shavakov

    > python sample.py German GER
    Gerren
    Ereng
    Rosher

    > python sample.py Spanish SPA
    Salla
    Parer
    Allan

    > python sample.py Chinese CHI
    Chan
    Hang
    Iun

We are still hand-crafting a small RNN with a few linear layers. The big
difference is that instead of predicting a category after reading in all the
letters of a name, we input a category and output one letter at a time.
Recurrently predicting characters to form language (this could also be
done with words or other higher-order constructs) is often referred to
as a "language model".

**Recommended Reading:**

I assume you have at least installed PyTorch, know Python, and
understand Tensors:

-  https://pytorch.org/ for installation instructions
-  :doc:`/beginner/deep_learning_60min_blitz` to get started with PyTorch in general
-  :doc:`/beginner/pytorch_with_examples` for a wide and deep overview
-  :doc:`/beginner/former_torchies_tutorial` if you are a former Lua Torch user

It would also be useful to know about RNNs and how they work:

-  `The Unreasonable Effectiveness of Recurrent Neural
   Networks <https://karpathy.github.io/2015/05/21/rnn-effectiveness/>`__
   shows a bunch of real-life examples
-  `Understanding LSTM
   Networks <https://colah.github.io/posts/2015-08-Understanding-LSTMs/>`__
   is about LSTMs specifically but also informative about RNNs in
   general

I also suggest the previous tutorial, :doc:`/intermediate/char_rnn_classification_tutorial`.


Preparing the Data
==================

.. note::
   Download the data from
   `here <https://download.pytorch.org/tutorial/data.zip>`_
   and extract it to the current directory.

See the last tutorial for more detail on this process. In short, there
are a bunch of plain text files ``data/names/[Language].txt`` with a
name per line.
We split lines into an array, convert Unicode to ASCII,
and end up with a dictionary ``{language: [names ...]}``.

"""
from io import open
import glob
import os
import unicodedata
import string

all_letters = string.ascii_letters + " .,;'-"
n_letters = len(all_letters) + 1  # Plus EOS marker

def findFiles(path): return glob.glob(path)

# Turn a Unicode string to plain ASCII, thanks to https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )

# Read a file and split into lines
def readLines(filename):
    with open(filename, encoding='utf-8') as some_file:
        return [unicodeToAscii(line.strip()) for line in some_file]

# Build the category_lines dictionary, a list of lines per category
category_lines = {}
all_categories = []
for filename in findFiles('data/names/*.txt'):
    category = os.path.splitext(os.path.basename(filename))[0]
    all_categories.append(category)
    lines = readLines(filename)
    category_lines[category] = lines

n_categories = len(all_categories)

if n_categories == 0:
    raise RuntimeError('Data not found. Make sure that you downloaded data '
        'from https://download.pytorch.org/tutorial/data.zip and extracted it to '
        'the current directory.')

print('# categories:', n_categories, all_categories)
print(unicodeToAscii("O'Néàl"))
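

######################################################################
# As a quick check (not part of the original tutorial), we can peek at a
# few of the loaded names. ``'Italian'`` here assumes the standard
# ``data.zip`` categories; any key printed above works.

print(category_lines['Italian'][:5])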


######################################################################
# Creating the Network
# ====================
#
# This network extends `the last tutorial's RNN <#Creating-the-Network>`__
# with an extra argument for the category tensor, which is concatenated
# along with the others. The category tensor is a one-hot vector just like
# the letter input.
#
# We will interpret the output as the probability of the next letter. When
# sampling, the most likely output letter is used as the next input
# letter.
#
# I added a second linear layer ``o2o`` (after combining hidden and
# output) to give it more muscle to work with. There's also a dropout
# layer, which `randomly zeros parts of its
# input <https://arxiv.org/abs/1207.0580>`__ with a given probability
# (here 0.1) and is usually used to fuzz inputs to prevent overfitting.
# Here we're using it towards the end of the network to purposely add some
# chaos and increase sampling variety.
#
# .. figure:: https://i.imgur.com/jzVrf7f.png
#    :alt:
#

import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size

        self.i2h = nn.Linear(n_categories + input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(n_categories + input_size + hidden_size, output_size)
        self.o2o = nn.Linear(hidden_size + output_size, output_size)
        self.dropout = nn.Dropout(0.1)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, category, input, hidden):
        input_combined = torch.cat((category, input, hidden), 1)
        hidden = self.i2h(input_combined)
        output = self.i2o(input_combined)
        output_combined = torch.cat((hidden, output), 1)
        output = self.o2o(output_combined)
        output = self.dropout(output)
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)
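

######################################################################
# As a quick sanity check (not part of the original tutorial), we can
# push one dummy timestep through an untrained instance and confirm the
# shapes. The names ``rnn_check``, ``category_demo``, and ``letter_demo``
# are illustrative only.

rnn_check = RNN(n_letters, 128, n_letters)
category_demo = torch.zeros(1, n_categories)
category_demo[0][0] = 1  # one-hot category
letter_demo = torch.zeros(1, n_letters)
letter_demo[0][all_letters.find('A')] = 1  # one-hot letter 'A'
output_demo, hidden_demo = rnn_check(category_demo, letter_demo, rnn_check.initHidden())
print(output_demo.shape)  # torch.Size([1, n_letters]) -- log-probabilities over next letters
print(hidden_demo.shape)  # torch.Size([1, 128])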


######################################################################
# Training
# ========
# Preparing for Training
# ----------------------
#
# First of all, helper functions to get random pairs of (category, line):
#

import random

# Random item from a list
def randomChoice(l):
    return l[random.randint(0, len(l) - 1)]

# Get a random category and random line from that category
def randomTrainingPair():
    category = randomChoice(all_categories)
    line = randomChoice(category_lines[category])
    return category, line


######################################################################
# For each timestep (that is, for each letter in a training word) the
# inputs of the network will be
# ``(category, current letter, hidden state)`` and the outputs will be
# ``(next letter, next hidden state)``. So for each training set, we'll
# need the category, a set of input letters, and a set of output/target
# letters.
#
# Since we are predicting the next letter from the current letter for each
# timestep, the letter pairs are groups of consecutive letters from the
# line - e.g. for ``"ABCD<EOS>"`` we would create ("A", "B"), ("B", "C"),
# ("C", "D"), ("D", "EOS").
#
# .. figure:: https://i.imgur.com/JH58tXY.png
#    :alt:
#
# The category tensor is a `one-hot
# tensor <https://en.wikipedia.org/wiki/One-hot>`__ of size
# ``<1 x n_categories>``. When training we feed it to the network at every
# timestep - this is a design choice; it could have been included as part
# of the initial hidden state or some other strategy.
#

# One-hot vector for category
def categoryTensor(category):
    li = all_categories.index(category)
    tensor = torch.zeros(1, n_categories)
    tensor[0][li] = 1
    return tensor

# One-hot matrix of first to last letters (not including EOS) for input
def inputTensor(line):
    tensor = torch.zeros(len(line), 1, n_letters)
    for li in range(len(line)):
        letter = line[li]
        tensor[li][0][all_letters.find(letter)] = 1
    return tensor

# ``LongTensor`` of second letter to end (EOS) for target
def targetTensor(line):
    letter_indexes = [all_letters.find(line[li]) for li in range(1, len(line))]
    letter_indexes.append(n_letters - 1)  # EOS
    return torch.LongTensor(letter_indexes)
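

######################################################################
# To make the shapes concrete (this check is not part of the original
# tutorial), here is what the helpers produce for a short, made-up name:

print(categoryTensor('German').shape)  # torch.Size([1, n_categories])
print(inputTensor('Kai').shape)        # torch.Size([3, 1, n_letters]) -- one one-hot row per letter
print(targetTensor('Kai'))             # indexes of 'a' and 'i', then the EOS index (n_letters - 1)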


######################################################################
# For convenience during training we'll make a ``randomTrainingExample``
# function that fetches a random (category, line) pair and turns them into
# the required (category, input, target) tensors.
#

# Make category, input, and target tensors from a random category, line pair
def randomTrainingExample():
    category, line = randomTrainingPair()
    category_tensor = categoryTensor(category)
    input_line_tensor = inputTensor(line)
    target_line_tensor = targetTensor(line)
    return category_tensor, input_line_tensor, target_line_tensor


######################################################################
# Training the Network
# --------------------
#
# In contrast to classification, where only the last output is used, we
# are making a prediction at every step, so we are calculating loss at
# every step.
#
# The magic of autograd allows you to simply sum these losses at each step
# and call backward at the end.
#

criterion = nn.NLLLoss()

learning_rate = 0.0005

def train(category_tensor, input_line_tensor, target_line_tensor):
    target_line_tensor.unsqueeze_(-1)
    hidden = rnn.initHidden()

    rnn.zero_grad()

    loss = torch.Tensor([0])  # you can also just simply use ``loss = 0``

    for i in range(input_line_tensor.size(0)):
        output, hidden = rnn(category_tensor, input_line_tensor[i], hidden)
        l = criterion(output, target_line_tensor[i])
        loss += l

    loss.backward()

    for p in rnn.parameters():
        p.data.add_(p.grad.data, alpha=-learning_rate)

    return output, loss.item() / input_line_tensor.size(0)


######################################################################
# To keep track of how long training takes I am adding a
# ``timeSince(timestamp)`` function which returns a human-readable string:
#

import time
import math

def timeSince(since):
    now = time.time()
    s = now - since
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)


######################################################################
# Training is business as usual - call train a bunch of times and wait a
# few minutes, printing the current time and loss every ``print_every``
# examples, and storing an average loss per ``plot_every`` examples
# in ``all_losses`` for plotting later.
#

rnn = RNN(n_letters, 128, n_letters)

n_iters = 100000
print_every = 5000
plot_every = 500
all_losses = []
total_loss = 0  # Reset every ``plot_every`` iterations

start = time.time()

for iter in range(1, n_iters + 1):
    output, loss = train(*randomTrainingExample())
    total_loss += loss

    if iter % print_every == 0:
        print('%s (%d %d%%) %.4f' % (timeSince(start), iter, iter / n_iters * 100, loss))

    if iter % plot_every == 0:
        all_losses.append(total_loss / plot_every)
        total_loss = 0


######################################################################
# Plotting the Losses
# -------------------
#
# Plotting the historical loss from ``all_losses`` shows the network
# learning:
#

import matplotlib.pyplot as plt

plt.figure()
plt.plot(all_losses)


######################################################################
# Sampling the Network
# ====================
#
# To sample we give the network a letter and ask what the next one is,
# feed that in as the next letter, and repeat until the EOS token.
#
# -  Create tensors for input category, starting letter, and empty hidden
#    state
# -  Create a string ``output_name`` with the starting letter
# -  Up to a maximum output length,
#
#    -  Feed the current letter to the network
#    -  Get the next letter from highest output, and next hidden state
#    -  If the letter is EOS, stop here
#    -  If a regular letter, add to ``output_name`` and continue
#
# -  Return the final name
#
# .. note::
#    Rather than having to give it a starting letter, another
#    strategy would have been to include a "start of string" token in
#    training and have the network choose its own starting letter.
#

max_length = 20

# Sample from a category and starting letter
def sample(category, start_letter='A'):
    with torch.no_grad():  # no need to track history in sampling
        category_tensor = categoryTensor(category)
        input = inputTensor(start_letter)
        hidden = rnn.initHidden()

        output_name = start_letter

        for i in range(max_length):
            output, hidden = rnn(category_tensor, input[0], hidden)
            topv, topi = output.topk(1)
            topi = topi[0][0]
            if topi == n_letters - 1:
                break
            else:
                letter = all_letters[topi]
                output_name += letter
                input = inputTensor(letter)

        return output_name

# Get multiple samples from one category and multiple starting letters
def samples(category, start_letters='ABC'):
    for start_letter in start_letters:
        print(sample(category, start_letter))

samples('Russian', 'RUS')

samples('German', 'GER')

samples('Spanish', 'SPA')

samples('Chinese', 'CHI')


######################################################################
# Exercises
# =========
#
# -  Try with a different dataset of category -> line, for example:
#
#    -  Fictional series -> Character name
#    -  Part of speech -> Word
#    -  Country -> City
#
# -  Use a "start of sentence" token so that sampling can be done without
#    choosing a start letter
# -  Get better results with a bigger and/or better-shaped network
#
#    -  Try the ``nn.LSTM`` and ``nn.GRU`` layers
#    -  Combine multiple of these RNNs as a higher-level network
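

######################################################################
# One easy extension beyond the exercises above (this sketch is not part
# of the original tutorial): instead of greedily taking the single most
# likely letter at each step, draw from the output distribution with
# ``torch.multinomial``. The name ``sample_with_temperature`` and the
# default ``temperature=0.8`` are illustrative choices; lower
# temperatures approach the greedy behavior of ``sample`` above, higher
# ones increase variety.

def sample_with_temperature(category, start_letter='A', temperature=0.8):
    with torch.no_grad():
        category_tensor = categoryTensor(category)
        input = inputTensor(start_letter)
        hidden = rnn.initHidden()

        output_name = start_letter

        for i in range(max_length):
            output, hidden = rnn(category_tensor, input[0], hidden)
            # ``output`` holds log-probabilities; dividing by the
            # temperature before exponentiating sharpens (T < 1) or
            # flattens (T > 1) the distribution
            weights = torch.exp(output[0] / temperature)
            topi = torch.multinomial(weights, 1)[0]
            if topi == n_letters - 1:  # EOS
                break
            letter = all_letters[topi]
            output_name += letter
            input = inputTensor(letter)

        return output_name

print(sample_with_temperature('Russian', 'R'))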