# -*- coding: utf-8 -*-
"""
Chatbot Tutorial
================
**Author:** `Matthew Inkawhich <https://github.com/MatthewInkawhich>`_
"""


######################################################################
# In this tutorial, we explore a fun and interesting use-case of recurrent
# sequence-to-sequence models. We will train a simple chatbot using movie
# scripts from the `Cornell Movie-Dialogs
# Corpus <https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html>`__.
#
# Conversational models are a hot topic in artificial intelligence
# research. Chatbots can be found in a variety of settings, including
# customer service applications and online helpdesks. These bots are often
# powered by retrieval-based models, which output predefined responses to
# questions of certain forms. In a highly restricted domain like a
# company's IT helpdesk, these models may be sufficient. However, they are
# not robust enough for more general use-cases. Teaching a machine to
# carry out a meaningful conversation with a human in multiple domains is
# a research question that is far from solved. Recently, the deep learning
# boom has allowed for powerful generative models like Google's `Neural
# Conversational Model <https://arxiv.org/abs/1506.05869>`__, which marks
# a large step towards multi-domain generative conversational models. In
# this tutorial, we will implement this kind of model in PyTorch.
#
# .. figure:: /_static/img/chatbot/bot.png
#    :align: center
#    :alt: bot
#
# .. code-block:: python
#
#    > hello?
#    Bot: hello .
#    > where am I?
#    Bot: you re in a hospital .
#    > who are you?
#    Bot: i m a lawyer .
#    > how are you doing?
#    Bot: i m fine .
#    > are you my friend?
#    Bot: no .
#    > you're under arrest
#    Bot: i m trying to help you !
#    > i'm just kidding
#    Bot: i m sorry .
#    > where are you from?
#    Bot: san francisco .
#    > it's time for me to leave
#    Bot: i know .
#    > goodbye
#    Bot: goodbye .
#
# **Tutorial Highlights**
#
# - Handle loading and preprocessing of the `Cornell Movie-Dialogs
#   Corpus <https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html>`__
#   dataset
# - Implement a sequence-to-sequence model with `Luong attention
#   mechanism(s) <https://arxiv.org/abs/1508.04025>`__
# - Jointly train encoder and decoder models using mini-batches
# - Implement greedy-search decoding module
# - Interact with the trained chatbot
#
# **Acknowledgments**
#
# This tutorial borrows code from the following sources:
#
# 1) Yuan-Kuei Wu's pytorch-chatbot implementation:
#    https://github.com/ywk991112/pytorch-chatbot
#
# 2) Sean Robertson's practical-pytorch seq2seq-translation example:
#    https://github.com/spro/practical-pytorch/tree/master/seq2seq-translation
#
# 3) FloydHub's Cornell Movie Corpus preprocessing code:
#    https://github.com/floydhub/textutil-preprocess-cornell-movie-corpus
#


######################################################################
# Preparations
# ------------
#
# To get started, `download <https://zissou.infosci.cornell.edu/convokit/datasets/movie-corpus/movie-corpus.zip>`__
# the Movie-Dialogs Corpus zip file and put it in a ``data/`` directory
# under the current directory.
#
# After that, let's import some necessities.
#

import torch
from torch.jit import script, trace
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
import csv
import random
import re
import os
import unicodedata
import codecs
from io import open
import itertools
import math
import json


USE_CUDA = torch.cuda.is_available()
device = torch.device("cuda" if USE_CUDA else "cpu")


######################################################################
# Load & Preprocess Data
# ----------------------
#
# The next step is to reformat our data file and load the data into
# structures that we can work with.
#
# The `Cornell Movie-Dialogs
# Corpus <https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html>`__
# is a rich dataset of movie character dialog:
#
# - 220,579 conversational exchanges between 10,292 pairs of movie
#   characters
# - 9,035 characters from 617 movies
# - 304,713 total utterances
#
# This dataset is large and diverse, and there is great variation in
# language formality, time period, sentiment, etc. Our hope is that this
# diversity makes our model robust to many forms of inputs and queries.
#
# First, we'll take a look at some lines of our datafile to see the
# original format.
#

corpus_name = "movie-corpus"
corpus = os.path.join("data", corpus_name)

def printLines(file, n=10):
    with open(file, 'rb') as datafile:
        lines = datafile.readlines()
    for line in lines[:n]:
        print(line)

printLines(os.path.join(corpus, "utterances.jsonl"))


######################################################################
# Create formatted data file
# ~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# For convenience, we'll create a nicely formatted data file in which each line
# contains a tab-separated *query sentence* and a *response sentence* pair.
#
# The following functions facilitate the parsing of the raw
# ``utterances.jsonl`` data file.
#
# - ``loadLinesAndConversations`` splits each line of the file into a dictionary of
#   lines with fields: ``lineID``, ``characterID``, and ``text``, and then groups them
#   into conversations with fields: ``conversationID``, ``movieID``, and ``lines``.
# - ``extractSentencePairs`` extracts pairs of sentences from
#   conversations
#

# Splits each line of the file to create lines and conversations
def loadLinesAndConversations(fileName):
    lines = {}
    conversations = {}
    with open(fileName, 'r', encoding='iso-8859-1') as f:
        for line in f:
            lineJson = json.loads(line)
            # Extract fields for line object
            lineObj = {}
            lineObj["lineID"] = lineJson["id"]
            lineObj["characterID"] = lineJson["speaker"]
            lineObj["text"] = lineJson["text"]
            lines[lineObj['lineID']] = lineObj

            # Extract fields for conversation object
            if lineJson["conversation_id"] not in conversations:
                convObj = {}
                convObj["conversationID"] = lineJson["conversation_id"]
                convObj["movieID"] = lineJson["meta"]["movie_id"]
                convObj["lines"] = [lineObj]
            else:
                convObj = conversations[lineJson["conversation_id"]]
                convObj["lines"].insert(0, lineObj)
            conversations[convObj["conversationID"]] = convObj

    return lines, conversations


# Extracts pairs of sentences from conversations
def extractSentencePairs(conversations):
    qa_pairs = []
    for conversation in conversations.values():
        # Iterate over all the lines of the conversation
        for i in range(len(conversation["lines"]) - 1):  # We ignore the last line (no answer for it)
            inputLine = conversation["lines"][i]["text"].strip()
            targetLine = conversation["lines"][i+1]["text"].strip()
            # Filter wrong samples (if one of the lists is empty)
            if inputLine and targetLine:
                qa_pairs.append([inputLine, targetLine])
    return qa_pairs
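

######################################################################
# As a quick sanity check, here is ``extractSentencePairs`` applied to a
# hypothetical conversations dictionary (made-up lines, not from the
# corpus): consecutive lines become query/response pairs.
#
# .. code-block:: python
#
#    toy = {"c1": {"lines": [{"text": "Hi."}, {"text": "Hello!"}, {"text": "Bye."}]}}
#    extractSentencePairs(toy)
#    # [['Hi.', 'Hello!'], ['Hello!', 'Bye.']]
#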


######################################################################
# Now we'll call these functions and create the file. We'll call it
# ``formatted_movie_lines.txt``.
#

# Define path to new file
datafile = os.path.join(corpus, "formatted_movie_lines.txt")

delimiter = '\t'
# Unescape the delimiter
delimiter = str(codecs.decode(delimiter, "unicode_escape"))

# Initialize lines dict and conversations dict
lines = {}
conversations = {}
# Load lines and conversations
print("\nProcessing corpus into lines and conversations...")
lines, conversations = loadLinesAndConversations(os.path.join(corpus, "utterances.jsonl"))

# Write new csv file
print("\nWriting newly formatted file...")
with open(datafile, 'w', encoding='utf-8') as outputfile:
    writer = csv.writer(outputfile, delimiter=delimiter, lineterminator='\n')
    for pair in extractSentencePairs(conversations):
        writer.writerow(pair)

# Print a sample of lines
print("\nSample lines from file:")
printLines(datafile)


######################################################################
# Load and trim data
# ~~~~~~~~~~~~~~~~~~
#
# Our next order of business is to create a vocabulary and load
# query/response sentence pairs into memory.
#
# Note that we are dealing with sequences of **words**, which do not have
# an implicit mapping to a discrete numerical space. Thus, we must create
# one by mapping each unique word that we encounter in our dataset to an
# index value.
#
# For this we define a ``Voc`` class, which keeps a mapping from words to
# indexes, a reverse mapping of indexes to words, a count of each word, and
# a total word count. The class provides methods for adding a word to the
# vocabulary (``addWord``), adding all words in a sentence
# (``addSentence``), and trimming infrequently seen words (``trim``). More
# on trimming later.
#
# Default word tokens
PAD_token = 0  # Used for padding short sentences
SOS_token = 1  # Start-of-sentence token
EOS_token = 2  # End-of-sentence token

class Voc:
    def __init__(self, name):
        self.name = name
        self.trimmed = False
        self.word2index = {}
        self.word2count = {}
        self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
        self.num_words = 3  # Count SOS, EOS, PAD

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.num_words
            self.word2count[word] = 1
            self.index2word[self.num_words] = word
            self.num_words += 1
        else:
            self.word2count[word] += 1

    # Remove words below a certain count threshold
    def trim(self, min_count):
        if self.trimmed:
            return
        self.trimmed = True

        keep_words = []

        for k, v in self.word2count.items():
            if v >= min_count:
                keep_words.append(k)

        print('keep_words {} / {} = {:.4f}'.format(
            len(keep_words), len(self.word2index), len(keep_words) / len(self.word2index)
        ))

        # Reinitialize dictionaries
        self.word2index = {}
        self.word2count = {}
        self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
        self.num_words = 3  # Count default tokens

        for word in keep_words:
            self.addWord(word)
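

######################################################################
# A small illustration of how ``Voc`` assigns indexes (the sentence here
# is made up for demonstration):
#
# .. code-block:: python
#
#    voc = Voc("example")
#    voc.addSentence("hello world hello")
#    voc.word2index  # {'hello': 3, 'world': 4}
#    voc.word2count  # {'hello': 2, 'world': 1}
#    voc.num_words   # 5 (3 default tokens + 2 new words)
#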


######################################################################
# Now we can assemble our vocabulary and query/response sentence pairs.
# Before we are ready to use this data, we must perform some
# preprocessing.
#
# First, we must convert the Unicode strings to ASCII using
# ``unicodeToAscii``. Next, we should convert all letters to lowercase and
# trim all non-letter characters except for basic punctuation
# (``normalizeString``). Finally, to aid in training convergence, we will
# filter out sentences with length greater than the ``MAX_LENGTH``
# threshold (``filterPairs``).
#

MAX_LENGTH = 10  # Maximum sentence length to consider

# Turn a Unicode string to plain ASCII, thanks to
# https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

# Lowercase, trim, and remove non-letter characters
def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    s = re.sub(r"\s+", r" ", s).strip()
    return s
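

######################################################################
# For example, on an arbitrary input string (shown purely for
# illustration):
#
# .. code-block:: python
#
#    normalizeString("Héllo!!   How are you?")
#    # 'hello ! ! how are you ?'
#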
# Read query/response pairs and return a voc object
def readVocs(datafile, corpus_name):
    print("Reading lines...")
    # Read the file and split into lines
    lines = open(datafile, encoding='utf-8').\
        read().strip().split('\n')
    # Split every line into pairs and normalize
    pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]
    voc = Voc(corpus_name)
    return voc, pairs

# Returns True if both sentences in a pair 'p' are under the MAX_LENGTH threshold
def filterPair(p):
    # Input sequences need to preserve the last word for EOS token
    return len(p[0].split(' ')) < MAX_LENGTH and len(p[1].split(' ')) < MAX_LENGTH

# Filter pairs using the ``filterPair`` condition
def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]

# Using the functions defined above, return a populated voc object and pairs list
def loadPrepareData(corpus, corpus_name, datafile, save_dir):
    print("Start preparing training data ...")
    voc, pairs = readVocs(datafile, corpus_name)
    print("Read {!s} sentence pairs".format(len(pairs)))
    pairs = filterPairs(pairs)
    print("Trimmed to {!s} sentence pairs".format(len(pairs)))
    print("Counting words...")
    for pair in pairs:
        voc.addSentence(pair[0])
        voc.addSentence(pair[1])
    print("Counted words:", voc.num_words)
    return voc, pairs


# Load/Assemble voc and pairs
save_dir = os.path.join("data", "save")
voc, pairs = loadPrepareData(corpus, corpus_name, datafile, save_dir)
# Print some pairs to validate
print("\npairs:")
for pair in pairs[:10]:
    print(pair)


######################################################################
# Another tactic that is beneficial to achieving faster convergence during
# training is trimming rarely used words out of our vocabulary. Decreasing
# the feature space will also soften the difficulty of the function that
# the model must learn to approximate. We will do this as a two-step
# process:
#
# 1) Trim words used under the ``MIN_COUNT`` threshold using the ``voc.trim``
#    function.
#
# 2) Filter out pairs with trimmed words.
#

MIN_COUNT = 3  # Minimum word count threshold for trimming

def trimRareWords(voc, pairs, MIN_COUNT):
    # Trim words used under the MIN_COUNT from the voc
    voc.trim(MIN_COUNT)
    # Filter out pairs with trimmed words
    keep_pairs = []
    for pair in pairs:
        input_sentence = pair[0]
        output_sentence = pair[1]
        keep_input = True
        keep_output = True
        # Check input sentence
        for word in input_sentence.split(' '):
            if word not in voc.word2index:
                keep_input = False
                break
        # Check output sentence
        for word in output_sentence.split(' '):
            if word not in voc.word2index:
                keep_output = False
                break

        # Only keep pairs that do not contain trimmed word(s) in their input or output sentence
        if keep_input and keep_output:
            keep_pairs.append(pair)

    print("Trimmed from {} pairs to {}, {:.4f} of total".format(len(pairs), len(keep_pairs), len(keep_pairs) / len(pairs)))
    return keep_pairs


# Trim voc and pairs
pairs = trimRareWords(voc, pairs, MIN_COUNT)


######################################################################
# Prepare Data for Models
# -----------------------
#
# Although we have put a great deal of effort into preparing and massaging our
# data into a nice vocabulary object and list of sentence pairs, our models
# will ultimately expect numerical torch tensors as inputs. One way to
# prepare the processed data for the models can be found in the `seq2seq
# translation
# tutorial <https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html>`__.
# In that tutorial, we use a batch size of 1, meaning that all we have to
# do is convert the words in our sentence pairs to their corresponding
# indexes from the vocabulary and feed this to the models.
#
# However, if you're interested in speeding up training and/or would like
# to leverage GPU parallelization capabilities, you will need to train
# with mini-batches.
#
# Using mini-batches also means that we must be mindful of the variation
# of sentence length in our batches. To accommodate sentences of different
# sizes in the same batch, we will make our batched input tensor of shape
# *(max_length, batch_size)*, where sentences shorter than the
# *max_length* are zero padded after an *EOS_token*.
#
# If we simply convert our English sentences to tensors by converting
# words to their indexes (``indexesFromSentence``) and zero-pad, our
# tensor would have shape *(batch_size, max_length)* and indexing the
# first dimension would return a full sequence across all time-steps.
# However, we need to be able to index our batch along time, and across
# all sequences in the batch. Therefore, we transpose our input batch
# shape to *(max_length, batch_size)*, so that indexing across the first
# dimension returns a time step across all sentences in the batch. We
# handle this transpose implicitly in the ``zeroPadding`` function; a
# small worked example follows the function definitions below.
#
# .. figure:: /_static/img/chatbot/seq2seq_batches.png
#    :align: center
#    :alt: batches
#
# The ``inputVar`` function handles the process of converting sentences to
# tensor, ultimately creating a correctly shaped zero-padded tensor. It
# also returns a tensor of ``lengths`` for each of the sequences in the
# batch, which will be passed to our encoder later.
#
# The ``outputVar`` function performs a similar function to ``inputVar``,
# but instead of returning a ``lengths`` tensor, it returns a binary mask
# tensor and a maximum target sentence length. The binary mask tensor has
# the same shape as the output target tensor, but every element that is a
# *PAD_token* is 0 and all others are 1.
#
# ``batch2TrainData`` simply takes a bunch of pairs and returns the input
# and target tensors using the aforementioned functions.
#

def indexesFromSentence(voc, sentence):
    return [voc.word2index[word] for word in sentence.split(' ')] + [EOS_token]


def zeroPadding(l, fillvalue=PAD_token):
    return list(itertools.zip_longest(*l, fillvalue=fillvalue))

def binaryMatrix(l, value=PAD_token):
    m = []
    for i, seq in enumerate(l):
        m.append([])
        for token in seq:
            if token == PAD_token:
                m[i].append(0)
            else:
                m[i].append(1)
    return m
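

######################################################################
# As a worked example with hypothetical index sequences (recall that
# ``EOS_token = 2`` and ``PAD_token = 0``):
#
# .. code-block:: python
#
#    seqs = [[7, 8, 9, 2], [5, 6, 2]]
#    zeroPadding(seqs)
#    # [(7, 5), (8, 6), (9, 2), (2, 0)]  # shape (max_length, batch_size)
#    binaryMatrix(zeroPadding(seqs))
#    # [[1, 1], [1, 1], [1, 1], [1, 0]]  # 0 marks the PAD_token
#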

# Returns padded input sequence tensor and lengths
def inputVar(l, voc):
    indexes_batch = [indexesFromSentence(voc, sentence) for sentence in l]
    lengths = torch.tensor([len(indexes) for indexes in indexes_batch])
    padList = zeroPadding(indexes_batch)
    padVar = torch.LongTensor(padList)
    return padVar, lengths

# Returns padded target sequence tensor, padding mask, and max target length
def outputVar(l, voc):
    indexes_batch = [indexesFromSentence(voc, sentence) for sentence in l]
    max_target_len = max([len(indexes) for indexes in indexes_batch])
    padList = zeroPadding(indexes_batch)
    mask = binaryMatrix(padList)
    mask = torch.BoolTensor(mask)
    padVar = torch.LongTensor(padList)
    return padVar, mask, max_target_len

# Returns all items for a given batch of pairs
def batch2TrainData(voc, pair_batch):
    pair_batch.sort(key=lambda x: len(x[0].split(" ")), reverse=True)
    input_batch, output_batch = [], []
    for pair in pair_batch:
        input_batch.append(pair[0])
        output_batch.append(pair[1])
    inp, lengths = inputVar(input_batch, voc)
    output, mask, max_target_len = outputVar(output_batch, voc)
    return inp, lengths, output, mask, max_target_len


# Example for validation
small_batch_size = 5
batches = batch2TrainData(voc, [random.choice(pairs) for _ in range(small_batch_size)])
input_variable, lengths, target_variable, mask, max_target_len = batches

print("input_variable:", input_variable)
print("lengths:", lengths)
print("target_variable:", target_variable)
print("mask:", mask)
print("max_target_len:", max_target_len)


######################################################################
# Define Models
# -------------
#
# Seq2Seq Model
# ~~~~~~~~~~~~~
#
# The brain of our chatbot is a sequence-to-sequence (seq2seq) model. The
# goal of a seq2seq model is to take a variable-length sequence as an
# input, and return a variable-length sequence as an output using a
# fixed-sized model.
#
# `Sutskever et al. <https://arxiv.org/abs/1409.3215>`__ discovered that
# by using two separate recurrent neural nets together, we can accomplish
# this task. One RNN acts as an **encoder**, which encodes a variable
# length input sequence to a fixed-length context vector. In theory, this
# context vector (the final hidden layer of the RNN) will contain semantic
# information about the query sentence that is input to the bot.
# The second RNN is a **decoder**, which takes an input word and the
# context vector, and returns a guess for the next word in the sequence
# and a hidden state to use in the next iteration.
#
# .. figure:: /_static/img/chatbot/seq2seq_ts.png
#    :align: center
#    :alt: model
#
# Image source:
# https://jeddy92.github.io/JEddy92.github.io/ts_seq2seq_intro/
#


######################################################################
# Encoder
# ~~~~~~~
#
# The encoder RNN iterates through the input sentence one token
# (e.g. word) at a time, at each time step outputting an "output" vector
# and a "hidden state" vector. The hidden state vector is then passed to
# the next time step, while the output vector is recorded. The encoder
# transforms the context it saw at each point in the sequence into a set
# of points in a high-dimensional space, which the decoder will use to
# generate a meaningful output for the given task.
#
# At the heart of our encoder is a multi-layered Gated Recurrent Unit,
# invented by `Cho et al. <https://arxiv.org/pdf/1406.1078v3.pdf>`__ in
# 2014. We will use a bidirectional variant of the GRU, meaning that there
# are essentially two independent RNNs: one that is fed the input sequence
# in normal sequential order, and one that is fed the input sequence in
# reverse order. The outputs of each network are summed at each time step.
# Using a bidirectional GRU will give us the advantage of encoding both
# past and future contexts.
#
# Bidirectional RNN:
#
# .. figure:: /_static/img/chatbot/RNN-bidirectional.png
#    :width: 70%
#    :align: center
#    :alt: rnn_bidir
#
# Image source: https://colah.github.io/posts/2015-09-NN-Types-FP/
#
# Note that an ``embedding`` layer is used to encode our word indices in
# an arbitrarily sized feature space. For our models, this layer will map
# each word to a feature space of size *hidden_size*. When trained, these
# values should encode semantic similarity between words with similar
# meanings.
#
# Finally, if passing a padded batch of sequences to an RNN module, we
# must pack and unpack padding around the RNN pass using
# ``nn.utils.rnn.pack_padded_sequence`` and
# ``nn.utils.rnn.pad_packed_sequence`` respectively.
#
# **Computation Graph:**
#
# 1) Convert word indexes to embeddings.
# 2) Pack padded batch of sequences for RNN module.
# 3) Forward pass through GRU.
# 4) Unpack padding.
# 5) Sum bidirectional GRU outputs.
# 6) Return output and final hidden state.
#
# **Inputs:**
#
# - ``input_seq``: batch of input sentences; shape=\ *(max_length,
#   batch_size)*
# - ``input_lengths``: list of sentence lengths corresponding to each
#   sentence in the batch; shape=\ *(batch_size)*
# - ``hidden``: hidden state; shape=\ *(n_layers x num_directions,
#   batch_size, hidden_size)*
#
# **Outputs:**
#
# - ``outputs``: output features from the last hidden layer of the GRU
#   (sum of bidirectional outputs); shape=\ *(max_length, batch_size,
#   hidden_size)*
# - ``hidden``: updated hidden state from GRU; shape=\ *(n_layers x
#   num_directions, batch_size, hidden_size)*
#

class EncoderRNN(nn.Module):
    def __init__(self, hidden_size, embedding, n_layers=1, dropout=0):
        super(EncoderRNN, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size
        self.embedding = embedding

        # Initialize GRU; the input_size and hidden_size parameters are both set to 'hidden_size'
        # because our input size is a word embedding with number of features == hidden_size
        self.gru = nn.GRU(hidden_size, hidden_size, n_layers,
                          dropout=(0 if n_layers == 1 else dropout), bidirectional=True)

    def forward(self, input_seq, input_lengths, hidden=None):
        # Convert word indexes to embeddings
        embedded = self.embedding(input_seq)
        # Pack padded batch of sequences for RNN module
        packed = nn.utils.rnn.pack_padded_sequence(embedded, input_lengths)
        # Forward pass through GRU
        outputs, hidden = self.gru(packed, hidden)
        # Unpack padding
        outputs, _ = nn.utils.rnn.pad_packed_sequence(outputs)
        # Sum bidirectional GRU outputs
        outputs = outputs[:, :, :self.hidden_size] + outputs[:, :, self.hidden_size:]
        # Return output and final hidden state
        return outputs, hidden
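

######################################################################
# A hypothetical shape walkthrough for one encoder forward pass (assuming
# ``hidden_size=500``, ``n_layers=2``, a batch of 5 sentences, and a max
# length of 9):
#
# .. code-block:: python
#
#    # input_seq: (9, 5)            word indexes
#    # embedded:  (9, 5, 500)       after the embedding layer
#    # outputs:   (9, 5, 500)       forward and backward outputs summed
#    # hidden:    (2 * 2, 5, 500)   (n_layers x num_directions, batch, hidden)
#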


######################################################################
# Decoder
# ~~~~~~~
#
# The decoder RNN generates the response sentence in a token-by-token
# fashion. It uses the encoder's context vectors and internal hidden
# states to generate the next word in the sequence. It continues
# generating words until it outputs an *EOS_token*, representing the end
# of the sentence. A common problem with a vanilla seq2seq decoder is that
# if we rely solely on the context vector to encode the entire input
# sequence's meaning, it is likely that we will have information loss.
# This is especially the case when dealing with long input sequences,
# greatly limiting the capability of our decoder.
#
# To combat this, `Bahdanau et al. <https://arxiv.org/abs/1409.0473>`__
# created an "attention mechanism" that allows the decoder to pay
# attention to certain parts of the input sequence, rather than using the
# entire fixed context at every step.
#
# At a high level, attention is calculated using the decoder's current
# hidden state and the encoder's outputs. The output attention weights
# have the same shape as the input sequence, allowing us to multiply them
# by the encoder outputs, giving us a weighted sum which indicates the
# parts of encoder output to pay attention to. `Sean
# Robertson's <https://github.com/spro>`__ figure describes this very
# well:
#
# .. figure:: /_static/img/chatbot/attn2.png
#    :align: center
#    :alt: attn2
#
# `Luong et al. <https://arxiv.org/abs/1508.04025>`__ improved upon
# Bahdanau et al.'s groundwork by creating "Global attention". The key
# difference is that with "Global attention", we consider all of the
# encoder's hidden states, as opposed to Bahdanau et al.'s "Local
# attention", which only considers the encoder's hidden state from the
# current time step. Another difference is that with "Global attention",
# we calculate attention weights, or energies, using the hidden state of
# the decoder from the current time step only. Bahdanau et al.'s attention
# calculation requires knowledge of the decoder's state from the previous
# time step. Also, Luong et al. provide various methods to calculate the
# attention energies between the encoder output and decoder output, which
# are called "score functions":
#
# .. figure:: /_static/img/chatbot/scores.png
#    :width: 60%
#    :align: center
#    :alt: scores
#
# where :math:`h_t` = current target decoder state and :math:`\bar{h}_s` =
# all encoder states.
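#
# In equation form, the three score functions from Luong et al. shown in
# the figure above are:
#
# .. math::
#
#    \text{score}(h_t, \bar{h}_s) =
#    \begin{cases}
#    h_t^\top \bar{h}_s & \text{dot} \\
#    h_t^\top W_a \bar{h}_s & \text{general} \\
#    v_a^\top \tanh\left(W_a [h_t ; \bar{h}_s]\right) & \text{concat}
#    \end{cases}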
#
# Overall, the Global attention mechanism can be summarized by the
# following figure. Note that we will implement the "Attention Layer" as a
# separate ``nn.Module`` called ``Attn``. The output of this module is a
# softmax normalized weights tensor of shape *(batch_size, 1,
# max_length)*.
#
# .. figure:: /_static/img/chatbot/global_attn.png
#    :align: center
#    :width: 60%
#    :alt: global_attn
#

# Luong attention layer
class Attn(nn.Module):
    def __init__(self, method, hidden_size):
        super(Attn, self).__init__()
        self.method = method
        if self.method not in ['dot', 'general', 'concat']:
            raise ValueError(self.method, "is not an appropriate attention method.")
        self.hidden_size = hidden_size
        if self.method == 'general':
            self.attn = nn.Linear(self.hidden_size, hidden_size)
        elif self.method == 'concat':
            self.attn = nn.Linear(self.hidden_size * 2, hidden_size)
            self.v = nn.Parameter(torch.FloatTensor(hidden_size))

    def dot_score(self, hidden, encoder_output):
        return torch.sum(hidden * encoder_output, dim=2)

    def general_score(self, hidden, encoder_output):
        energy = self.attn(encoder_output)
        return torch.sum(hidden * energy, dim=2)

    def concat_score(self, hidden, encoder_output):
        energy = self.attn(torch.cat((hidden.expand(encoder_output.size(0), -1, -1), encoder_output), 2)).tanh()
        return torch.sum(self.v * energy, dim=2)

    def forward(self, hidden, encoder_outputs):
        # Calculate the attention weights (energies) based on the given method
        if self.method == 'general':
            attn_energies = self.general_score(hidden, encoder_outputs)
        elif self.method == 'concat':
            attn_energies = self.concat_score(hidden, encoder_outputs)
        elif self.method == 'dot':
            attn_energies = self.dot_score(hidden, encoder_outputs)

        # Transpose max_length and batch_size dimensions
        attn_energies = attn_energies.t()

        # Return the softmax normalized probability scores (with added dimension)
        return F.softmax(attn_energies, dim=1).unsqueeze(1)
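

######################################################################
# A hypothetical shape check for the ``dot`` method (assuming
# ``hidden_size=500``, a batch of 64, and an input length of 10):
#
# .. code-block:: python
#
#    # hidden:          (1, 64, 500)   current decoder GRU output
#    # encoder_outputs: (10, 64, 500)
#    # dot_score ->     (10, 64); after .t() and .unsqueeze(1) -> (64, 1, 10)
#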


######################################################################
# Now that we have defined our attention submodule, we can implement the
# actual decoder model. For the decoder, we will manually feed our batch
# one time step at a time. This means that our embedded word tensor and
# GRU output will both have shape *(1, batch_size, hidden_size)*.
#
# **Computation Graph:**
#
# 1) Get embedding of current input word.
# 2) Forward through unidirectional GRU.
# 3) Calculate attention weights from the current GRU output from (2).
# 4) Multiply attention weights to encoder outputs to get new "weighted sum" context vector.
# 5) Concatenate weighted context vector and GRU output using Luong eq. 5.
# 6) Predict next word using Luong eq. 6 (without softmax).
# 7) Return output and final hidden state.
#
# **Inputs:**
#
# - ``input_step``: one time step (one word) of input sequence batch;
#   shape=\ *(1, batch_size)*
# - ``last_hidden``: final hidden layer of GRU; shape=\ *(n_layers x
#   num_directions, batch_size, hidden_size)*
# - ``encoder_outputs``: encoder model's output; shape=\ *(max_length,
#   batch_size, hidden_size)*
#
# **Outputs:**
#
# - ``output``: softmax normalized tensor giving probabilities of each
#   word being the correct next word in the decoded sequence;
#   shape=\ *(batch_size, voc.num_words)*
# - ``hidden``: final hidden state of GRU; shape=\ *(n_layers x
#   num_directions, batch_size, hidden_size)*
#

class LuongAttnDecoderRNN(nn.Module):
    def __init__(self, attn_model, embedding, hidden_size, output_size, n_layers=1, dropout=0.1):
        super(LuongAttnDecoderRNN, self).__init__()

        # Keep for reference
        self.attn_model = attn_model
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers
        self.dropout = dropout

        # Define layers
        self.embedding = embedding
        self.embedding_dropout = nn.Dropout(dropout)
        self.gru = nn.GRU(hidden_size, hidden_size, n_layers, dropout=(0 if n_layers == 1 else dropout))
        self.concat = nn.Linear(hidden_size * 2, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)

        self.attn = Attn(attn_model, hidden_size)

    def forward(self, input_step, last_hidden, encoder_outputs):
        # Note: we run this one step (word) at a time
        # Get embedding of current input word
        embedded = self.embedding(input_step)
        embedded = self.embedding_dropout(embedded)
        # Forward through unidirectional GRU
        rnn_output, hidden = self.gru(embedded, last_hidden)
        # Calculate attention weights from the current GRU output
        attn_weights = self.attn(rnn_output, encoder_outputs)
        # Multiply attention weights to encoder outputs to get new "weighted sum" context vector
        context = attn_weights.bmm(encoder_outputs.transpose(0, 1))
        # Concatenate weighted context vector and GRU output using Luong eq. 5
        rnn_output = rnn_output.squeeze(0)
        context = context.squeeze(1)
        concat_input = torch.cat((rnn_output, context), 1)
        concat_output = torch.tanh(self.concat(concat_input))
        # Predict next word using Luong eq. 6
        output = self.out(concat_output)
        output = F.softmax(output, dim=1)
        # Return output and final hidden state
        return output, hidden


######################################################################
# Define Training Procedure
# -------------------------
#
# Masked loss
# ~~~~~~~~~~~
#
# Since we are dealing with batches of padded sequences, we cannot simply
# consider all elements of the tensor when calculating loss. We define
# ``maskNLLLoss`` to calculate our loss based on our decoder's output
# tensor, the target tensor, and a binary mask tensor describing the
# padding of the target tensor. This loss function calculates the average
# negative log likelihood of the elements that correspond to a *1* in the
# mask tensor.
#

def maskNLLLoss(inp, target, mask):
    nTotal = mask.sum()
    crossEntropy = -torch.log(torch.gather(inp, 1, target.view(-1, 1)).squeeze(1))
    loss = crossEntropy.masked_select(mask).mean()
    loss = loss.to(device)
    return loss, nTotal.item()
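

######################################################################
# A toy example of the masked loss, with made-up numbers and a vocabulary
# of size 3:
#
# .. code-block:: python
#
#    inp = torch.tensor([[0.1, 0.6, 0.3],
#                        [0.4, 0.4, 0.2]])   # decoder output for one time step
#    target = torch.tensor([1, 0])
#    mask = torch.tensor([True, False])      # second sequence is padding here
#    loss, nTotal = maskNLLLoss(inp, target, mask)
#    # loss == -log(0.6), averaged over the single unmasked element; nTotal == 1
#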


######################################################################
# Single training iteration
# ~~~~~~~~~~~~~~~~~~~~~~~~~
#
# The ``train`` function contains the algorithm for a single training
# iteration (a single batch of inputs).
#
# We will use a couple of clever tricks to aid in convergence:
#
# - The first trick is using **teacher forcing**. This means that at some
#   probability, set by ``teacher_forcing_ratio``, we use the current
#   target word as the decoder's next input rather than using the
#   decoder's current guess. This technique acts as training wheels for
#   the decoder, aiding in more efficient training. However, teacher
#   forcing can lead to model instability during inference, as the
#   decoder may not have a sufficient chance to truly craft its own
#   output sequences during training. Thus, we must be mindful of how we
#   are setting the ``teacher_forcing_ratio``, and not be fooled by fast
#   convergence.
#
# - The second trick that we implement is **gradient clipping**. This is
#   a commonly used technique for countering the "exploding gradient"
#   problem. In essence, by clipping or thresholding gradients to a
#   maximum value, we prevent the gradients from growing exponentially
#   and either overflowing (NaN) or overshooting steep cliffs in the cost
#   function.
#
# .. figure:: /_static/img/chatbot/grad_clip.png
#    :align: center
#    :width: 60%
#    :alt: grad_clip
#
# Image source: Goodfellow et al. *Deep Learning*. 2016. https://www.deeplearningbook.org/
#
# **Sequence of Operations:**
#
# 1) Forward pass entire input batch through encoder.
# 2) Initialize decoder inputs as SOS_token, and hidden state as the encoder's final hidden state.
# 3) Forward input batch sequence through decoder one time step at a time.
# 4) If teacher forcing: set next decoder input as the current target; else: set next decoder input as current decoder output.
# 5) Calculate and accumulate loss.
# 6) Perform backpropagation.
# 7) Clip gradients.
# 8) Update encoder and decoder model parameters.
#
#
# .. Note ::
#
#   PyTorch's RNN modules (``RNN``, ``LSTM``, ``GRU``) can be used like any
#   other non-recurrent layers by simply passing them the entire input
#   sequence (or batch of sequences). We use the ``GRU`` layer like this in
#   the ``encoder``. The reality is that under the hood, there is an
#   iterative process looping over each time step calculating hidden states.
#   Alternatively, you can run these modules one time-step at a time. In
#   this case, we manually loop over the sequences during the training
#   process like we must do for the ``decoder`` model. As long as you
#   maintain the correct conceptual model of these modules, implementing
#   sequential models can be very straightforward.
#


def train(input_variable, lengths, target_variable, mask, max_target_len, encoder, decoder, embedding,
          encoder_optimizer, decoder_optimizer, batch_size, clip, max_length=MAX_LENGTH):

    # Zero gradients
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    # Set device options
    input_variable = input_variable.to(device)
    target_variable = target_variable.to(device)
    mask = mask.to(device)
    # Lengths for RNN packing should always be on the CPU
    lengths = lengths.to("cpu")

    # Initialize variables
    loss = 0
    print_losses = []
    n_totals = 0

    # Forward pass through encoder
    encoder_outputs, encoder_hidden = encoder(input_variable, lengths)

    # Create initial decoder input (start with SOS tokens for each sentence)
    decoder_input = torch.LongTensor([[SOS_token for _ in range(batch_size)]])
    decoder_input = decoder_input.to(device)

    # Set initial decoder hidden state to the encoder's final hidden state
    decoder_hidden = encoder_hidden[:decoder.n_layers]

    # Determine if we are using teacher forcing this iteration
    use_teacher_forcing = True if random.random() < teacher_forcing_ratio else False

    # Forward batch of sequences through decoder one time step at a time
    if use_teacher_forcing:
        for t in range(max_target_len):
            decoder_output, decoder_hidden = decoder(
                decoder_input, decoder_hidden, encoder_outputs
            )
            # Teacher forcing: next input is current target
            decoder_input = target_variable[t].view(1, -1)
            # Calculate and accumulate loss
            mask_loss, nTotal = maskNLLLoss(decoder_output, target_variable[t], mask[t])
            loss += mask_loss
            print_losses.append(mask_loss.item() * nTotal)
            n_totals += nTotal
    else:
        for t in range(max_target_len):
            decoder_output, decoder_hidden = decoder(
                decoder_input, decoder_hidden, encoder_outputs
            )
            # No teacher forcing: next input is decoder's own current output
            _, topi = decoder_output.topk(1)
            decoder_input = torch.LongTensor([[topi[i][0] for i in range(batch_size)]])
            decoder_input = decoder_input.to(device)
            # Calculate and accumulate loss
            mask_loss, nTotal = maskNLLLoss(decoder_output, target_variable[t], mask[t])
            loss += mask_loss
            print_losses.append(mask_loss.item() * nTotal)
            n_totals += nTotal

    # Perform backpropagation
    loss.backward()

    # Clip gradients: gradients are modified in place
    _ = nn.utils.clip_grad_norm_(encoder.parameters(), clip)
    _ = nn.utils.clip_grad_norm_(decoder.parameters(), clip)

    # Adjust model weights
    encoder_optimizer.step()
    decoder_optimizer.step()

    return sum(print_losses) / n_totals


######################################################################
# Training iterations
# ~~~~~~~~~~~~~~~~~~~
#
# It is finally time to tie the full training procedure together with the
# data. The ``trainIters`` function is responsible for running
# ``n_iteration`` iterations of training given the passed models,
# optimizers, data, etc. This function is quite self-explanatory, as we
# have done the heavy lifting with the ``train`` function.
#
# One thing to note is that when we save our model, we save a tarball
# containing the encoder and decoder ``state_dicts`` (parameters), the
# optimizers' ``state_dicts``, the loss, the iteration, etc. Saving the model
# in this way will give us the ultimate flexibility with the checkpoint.
# After loading a checkpoint, we will be able to use the model parameters
# to run inference, or we can continue training right where we left off.
#

def trainIters(model_name, voc, pairs, encoder, decoder, encoder_optimizer, decoder_optimizer, embedding, encoder_n_layers, decoder_n_layers, save_dir, n_iteration, batch_size, print_every, save_every, clip, corpus_name, loadFilename):

    # Load batches for each iteration
    training_batches = [batch2TrainData(voc, [random.choice(pairs) for _ in range(batch_size)])
                        for _ in range(n_iteration)]

    # Initializations
    print('Initializing ...')
    start_iteration = 1
    print_loss = 0
    if loadFilename:
        # ``checkpoint`` is defined globally when loading from a file (see below)
        start_iteration = checkpoint['iteration'] + 1

    # Training loop
    print("Training...")
    for iteration in range(start_iteration, n_iteration + 1):
        training_batch = training_batches[iteration - 1]
        # Extract fields from batch
        input_variable, lengths, target_variable, mask, max_target_len = training_batch

        # Run a training iteration with batch
        loss = train(input_variable, lengths, target_variable, mask, max_target_len, encoder,
                     decoder, embedding, encoder_optimizer, decoder_optimizer, batch_size, clip)
        print_loss += loss

        # Print progress
        if iteration % print_every == 0:
            print_loss_avg = print_loss / print_every
            print("Iteration: {}; Percent complete: {:.1f}%; Average loss: {:.4f}".format(iteration, iteration / n_iteration * 100, print_loss_avg))
            print_loss = 0

        # Save checkpoint
        if (iteration % save_every == 0):
            directory = os.path.join(save_dir, model_name, corpus_name, '{}-{}_{}'.format(encoder_n_layers, decoder_n_layers, hidden_size))
            if not os.path.exists(directory):
                os.makedirs(directory)
            torch.save({
                'iteration': iteration,
                'en': encoder.state_dict(),
                'de': decoder.state_dict(),
                'en_opt': encoder_optimizer.state_dict(),
                'de_opt': decoder_optimizer.state_dict(),
                'loss': loss,
                'voc_dict': voc.__dict__,
                'embedding': embedding.state_dict()
            }, os.path.join(directory, '{}_{}.tar'.format(iteration, 'checkpoint')))


######################################################################
# Define Evaluation
# -----------------
#
# After training a model, we want to be able to talk to the bot ourselves.
# First, we must define how we want the model to decode the encoded input.
#
# Greedy decoding
# ~~~~~~~~~~~~~~~
#
# Greedy decoding is the decoding method that we use during training when
# we are **NOT** using teacher forcing. In other words, for each time
# step, we simply choose the word from ``decoder_output`` with the highest
# softmax value. This decoding method is optimal on a single time-step
# level.
#
# To facilitate the greedy decoding operation, we define a
# ``GreedySearchDecoder`` class. When run, an object of this class takes
# an input sequence (``input_seq``) of shape *(input_seq length, 1)*, a
# scalar input length (``input_length``) tensor, and a ``max_length`` to
# bound the response sentence length.
# The input sentence is evaluated using the following computational graph:
#
# **Computation Graph:**
#
# 1) Forward input through encoder model.
# 2) Prepare encoder's final hidden layer to be first hidden input to the decoder.
# 3) Initialize decoder's first input as SOS_token.
# 4) Initialize tensors to append decoded words to.
# 5) Iteratively decode one word token at a time:
#    a) Forward pass through decoder.
#    b) Obtain most likely word token and its softmax score.
#    c) Record token and score.
#    d) Prepare current token to be next decoder input.
# 6) Return collections of word tokens and scores.
#

class GreedySearchDecoder(nn.Module):
    def __init__(self, encoder, decoder):
        super(GreedySearchDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, input_seq, input_length, max_length):
        # Forward input through encoder model
        encoder_outputs, encoder_hidden = self.encoder(input_seq, input_length)
        # Prepare encoder's final hidden layer to be first hidden input to the decoder
        decoder_hidden = encoder_hidden[:self.decoder.n_layers]
        # Initialize decoder input with SOS_token
        decoder_input = torch.ones(1, 1, device=device, dtype=torch.long) * SOS_token
        # Initialize tensors to append decoded words to
        all_tokens = torch.zeros([0], device=device, dtype=torch.long)
        all_scores = torch.zeros([0], device=device)
        # Iteratively decode one word token at a time
        for _ in range(max_length):
            # Forward pass through decoder
            decoder_output, decoder_hidden = self.decoder(decoder_input, decoder_hidden, encoder_outputs)
            # Obtain most likely word token and its softmax score
            decoder_scores, decoder_input = torch.max(decoder_output, dim=1)
            # Record token and score
            all_tokens = torch.cat((all_tokens, decoder_input), dim=0)
            all_scores = torch.cat((all_scores, decoder_scores), dim=0)
            # Prepare current token to be next decoder input (add a dimension)
            decoder_input = torch.unsqueeze(decoder_input, 0)
        # Return collections of word tokens and scores
        return all_tokens, all_scores


######################################################################
# Evaluate my text
# ~~~~~~~~~~~~~~~~
#
# Now that we have our decoding method defined, we can write functions for
# evaluating a string input sentence. The ``evaluate`` function manages
# the low-level process of handling the input sentence. We first format
# the sentence as an input batch of word indexes with *batch_size==1*. We
# do this by converting the words of the sentence to their corresponding
# indexes, and transposing the dimensions to prepare the tensor for our
# models. We also create a ``lengths`` tensor which contains the length of
# our input sentence. In this case, ``lengths`` is scalar because we are
# only evaluating one sentence at a time (batch_size==1). Next, we obtain
# the decoded response sentence tensor using our ``GreedySearchDecoder``
# object (``searcher``). Finally, we convert the response's indexes to
# words and return the list of decoded words.
#
# ``evaluateInput`` acts as the user interface for our chatbot. When
# called, an input text field will spawn in which we can enter our query
# sentence. After typing our input sentence and pressing *Enter*, our text
# is normalized in the same way as our training data, and is ultimately
# fed to the ``evaluate`` function to obtain a decoded output sentence. We
# loop this process, so we can keep chatting with our bot until we enter
# either "q" or "quit".
#
# Finally, if a sentence is entered that contains a word that is not in
# the vocabulary, we handle this gracefully by printing an error message
# and prompting the user to enter another sentence.
#

def evaluate(encoder, decoder, searcher, voc, sentence, max_length=MAX_LENGTH):
    ### Format input sentence as a batch
    # words -> indexes
    indexes_batch = [indexesFromSentence(voc, sentence)]
    # Create lengths tensor
    lengths = torch.tensor([len(indexes) for indexes in indexes_batch])
    # Transpose dimensions of batch to match models' expectations
    input_batch = torch.LongTensor(indexes_batch).transpose(0, 1)
    # Use appropriate device
    input_batch = input_batch.to(device)
    lengths = lengths.to("cpu")
    # Decode sentence with searcher
    tokens, scores = searcher(input_batch, lengths, max_length)
    # indexes -> words
    decoded_words = [voc.index2word[token.item()] for token in tokens]
    return decoded_words


def evaluateInput(encoder, decoder, searcher, voc):
    input_sentence = ''
    while(1):
        try:
            # Get input sentence
            input_sentence = input('> ')
            # Check if it is quit case
            if input_sentence == 'q' or input_sentence == 'quit': break
            # Normalize sentence
            input_sentence = normalizeString(input_sentence)
            # Evaluate sentence
            output_words = evaluate(encoder, decoder, searcher, voc, input_sentence)
            # Format and print response sentence
            output_words[:] = [x for x in output_words if not (x == 'EOS' or x == 'PAD')]
            print('Bot:', ' '.join(output_words))

        except KeyError:
            print("Error: Encountered unknown word.")


######################################################################
# Run Model
# ---------
#
# Finally, it is time to run our model!
#
# Regardless of whether we want to train or test the chatbot model, we
# must initialize the individual encoder and decoder models. In the
# following block, we set our desired configurations, choose to start from
# scratch or set a checkpoint to load from, and build and initialize the
# models. Feel free to play with different model configurations to
# optimize performance.
#

# Configure models
model_name = 'cb_model'
attn_model = 'dot'
# ``attn_model = 'general'``
# ``attn_model = 'concat'``
hidden_size = 500
encoder_n_layers = 2
decoder_n_layers = 2
dropout = 0.1
batch_size = 64

# Set checkpoint to load from; set to None if starting from scratch
loadFilename = None
checkpoint_iter = 4000

#############################################################
# Sample code to load from a checkpoint:
#
# .. code-block:: python
#
#    loadFilename = os.path.join(save_dir, model_name, corpus_name,
#                                '{}-{}_{}'.format(encoder_n_layers, decoder_n_layers, hidden_size),
#                                '{}_checkpoint.tar'.format(checkpoint_iter))

# Load model if a ``loadFilename`` is provided
if loadFilename:
    # If loading on the same machine the model was trained on
    checkpoint = torch.load(loadFilename)
    # If loading a model trained on GPU to CPU
    # checkpoint = torch.load(loadFilename, map_location=torch.device('cpu'))
    encoder_sd = checkpoint['en']
    decoder_sd = checkpoint['de']
    encoder_optimizer_sd = checkpoint['en_opt']
    decoder_optimizer_sd = checkpoint['de_opt']
    embedding_sd = checkpoint['embedding']
    voc.__dict__ = checkpoint['voc_dict']


print('Building encoder and decoder ...')
# Initialize word embeddings
embedding = nn.Embedding(voc.num_words, hidden_size)
if loadFilename:
    embedding.load_state_dict(embedding_sd)
# Initialize encoder & decoder models
encoder = EncoderRNN(hidden_size, embedding, encoder_n_layers, dropout)
decoder = LuongAttnDecoderRNN(attn_model, embedding, hidden_size, voc.num_words, decoder_n_layers, dropout)
if loadFilename:
    encoder.load_state_dict(encoder_sd)
    decoder.load_state_dict(decoder_sd)
# Use appropriate device
encoder = encoder.to(device)
decoder = decoder.to(device)
print('Models built and ready to go!')


######################################################################
# Run Training
# ~~~~~~~~~~~~
#
# Run the following block if you want to train the model.
#
# First we set training parameters, then we initialize our optimizers, and
# finally we call the ``trainIters`` function to run our training
# iterations.
#

# Configure training/optimization
clip = 50.0
teacher_forcing_ratio = 1.0
learning_rate = 0.0001
decoder_learning_ratio = 5.0
n_iteration = 4000
print_every = 1
save_every = 500

# Ensure dropout layers are in train mode
encoder.train()
decoder.train()

# Initialize optimizers
print('Building optimizers ...')
encoder_optimizer = optim.Adam(encoder.parameters(), lr=learning_rate)
decoder_optimizer = optim.Adam(decoder.parameters(), lr=learning_rate * decoder_learning_ratio)
if loadFilename:
    encoder_optimizer.load_state_dict(encoder_optimizer_sd)
    decoder_optimizer.load_state_dict(decoder_optimizer_sd)

# If you have CUDA, move any optimizer state tensors to the GPU
if USE_CUDA:
    for state in encoder_optimizer.state.values():
        for k, v in state.items():
            if isinstance(v, torch.Tensor):
                state[k] = v.cuda()

    for state in decoder_optimizer.state.values():
        for k, v in state.items():
            if isinstance(v, torch.Tensor):
                state[k] = v.cuda()

# Run training iterations
print("Starting Training!")
trainIters(model_name, voc, pairs, encoder, decoder, encoder_optimizer, decoder_optimizer,
           embedding, encoder_n_layers, decoder_n_layers, save_dir, n_iteration, batch_size,
           print_every, save_every, clip, corpus_name, loadFilename)


######################################################################
# Run Evaluation
# ~~~~~~~~~~~~~~
#
# To chat with your model, run the following block.
#

# Set dropout layers to ``eval`` mode
encoder.eval()
decoder.eval()

# Initialize search module
searcher = GreedySearchDecoder(encoder, decoder)

# Begin chatting (uncomment and run the following line to begin)
# evaluateInput(encoder, decoder, searcher, voc)


######################################################################
# Conclusion
# ----------
#
# That's all for this one, folks. Congratulations, you now know the
# fundamentals of building a generative chatbot model! If you're
# interested, you can try tailoring the chatbot's behavior by tweaking the
# model and training parameters and customizing the data that you train
# the model on.
#
# Check out the other tutorials for more cool deep learning applications
# in PyTorch!
#