SentencePiece and Byte Pair Encoding
Introduction to Tokenization
In order to process text in neural network models, we first need to encode the text as numbers with ids (such as the embedding vectors we've been using in the previous assignments), since tensor operations act on numbers. Finally, if the output of the network is text, we need to decode the predicted token ids back into text.
To encode text, the first decision that has to be made is the level of granularity at which we consider the text, because ultimately features are going to be created from these tokens. Many different experiments have been carried out using words, morphological units, phonemic units, and characters.
But how to identify these units, such as words, is largely determined by the language they come from. For example, in many European languages a space is used to separate words, while in some Asian languages there are no spaces between words. Compare English and Mandarin.
Tokens are tricky. (original sentence)
令牌很棘手 (Mandarin)
Lìng pái hěn jí shǒu (pinyin)
令牌 很 棘手 (Mandarin with spaces)
So the ability to tokenize, i.e. to split text into meaningful fundamental units, is not always straightforward.
Also, there are practical issues of how large our vocabulary of words, vocab_size, should be, considering memory limitations vs. coverage. A compromise needs to be made between the finest-grained models employing characters, which can be memory efficient, and the more computationally efficient subword units such as n-grams or larger units.
In SentencePiece, unicode characters are grouped together using either a unigram language model (used in this week's assignment) or BPE, byte-pair encoding. We will discuss BPE, since BERT and many of its variants use a modified version of BPE, and its pseudocode is easy to implement and understand... hopefully!
SentencePiece Preprocessing
NFKC Normalization
Unsurprisingly, even using unicode to initially tokenize text can be ambiguous, e.g.,
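For instance, here is a minimal sketch (the accented character is an assumed illustrative example) showing how the same visible character can be built from different unicode code points:

```python
# Two visually identical strings built from different code points.
eaccent = '\u00E9'            # 'é' as a single precomposed code point
e_accent = '\u0065\u0301'     # 'e' followed by a combining acute accent

print(f'{eaccent} = {e_accent} : {eaccent == e_accent}')  # looks equal, compares False
```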
SentencePiece uses the Unicode standard Normalization form, NFKC, so this isn't an issue. Looking at our example from above again with normalization:
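A minimal sketch of applying NFKC to the two strings above, using Python's unicodedata module (SentencePiece applies this normalization internally; using unicodedata here is our own choice for illustration):

```python
import unicodedata

eaccent = '\u00E9'
e_accent = '\u0065\u0301'

norm_eaccent = unicodedata.normalize('NFKC', eaccent)
norm_e_accent = unicodedata.normalize('NFKC', e_accent)

print(f'{norm_eaccent} = {norm_e_accent} : {norm_eaccent == norm_e_accent}')  # now True

# Inspect the underlying code points before and after normalization
print(' '.join(hex(ord(c)) for c in e_accent), '->',
      ' '.join(hex(ord(c)) for c in norm_e_accent))
```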
Normalization has actually changed the unicode code point (unicode unique id) for one of these two characters.
This normalization has other side effects which may be considered useful, such as converting curly quotes “ to their ASCII equivalent ". (Although we now lose the directionality of the quote...)
Lossless Tokenization*
SentencePiece also ensures that when you tokenize your data and then detokenize it, the original positions of white space are preserved. (However, tabs and newlines are converted to spaces; please try this experiment yourself later below.)
To ensure this lossless tokenization, it replaces white space with ▁ (U+2581), so that a simple join of the tokens, replacing the underscores with spaces, can restore the white space, even if there are consecutive white space symbols. But remember to first normalize and then replace spaces with ▁ (U+2581), as the following example shows.
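Here is a small sketch of that comparison (the sample sentence is an arbitrary choice); it prints the code points for both orderings so you can compare them yourself:

```python
import unicodedata

def get_hex_encoding(s):
    # Show the unicode code point of every character in the string
    return ' '.join(hex(ord(c)) for c in s)

s = 'Tokenization is hard.'

# Replace spaces with the special underscore first, then normalize
replaced_then_normalized = unicodedata.normalize('NFKC', s.replace(' ', '\u2581'))

# Normalize first, then replace spaces with the special underscore
normalized_then_replaced = unicodedata.normalize('NFKC', s).replace(' ', '\u2581')

print(get_hex_encoding(replaced_then_normalized))
print(get_hex_encoding(normalized_then_replaced))
```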
So when we replace spaces first and then normalize, the special unicode underscore is replaced by its ASCII equivalent. Reversing the order (normalize first, then replace), we see that the special unicode underscore is retained.
BPE Algorithm
Now that we have discussed the preprocessing that SentencePiece performs, we will get our data, preprocess it, and apply the BPE algorithm. We will show how this reproduces the tokenization produced by training SentencePiece on our example dataset (from this week's assignment).
Preparing our Data
First, we get our SQuAD data and process it as above.
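The exact loading code depends on how the assignment stores the examples; the following is only a sketch, assuming a hypothetical squad-example.jsonl file with one record per line, each containing a 'text' field:

```python
import ast
import unicodedata

def convert_json_examples_to_text(filepath):
    # Each line is assumed to hold a dict-like record with a 'text' field.
    examples = [ast.literal_eval(line) for line in open(filepath)]
    texts = [ex['text'].decode('utf-8') if isinstance(ex['text'], bytes) else ex['text']
             for ex in examples]
    text = '\n\n'.join(texts)
    # Apply the same NFKC normalization discussed above
    text = unicodedata.normalize('NFKC', text)
    with open('example.txt', 'w') as fw:
        fw.write(text)
    return text

# 'squad-example.jsonl' is a hypothetical path; substitute the assignment's data file.
text = convert_json_examples_to_text('squad-example.jsonl')
```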
In the algorithm the vocab variable is actually a frequency dictionary of the words. Further, those words have been prepended with an underscore to indicate that they are the beginning of a word. Finally, the characters have been delimited by spaces so that the BPE algorithm can group the most common characters together in the dictionary in a greedy fashion. We will see how that is exactly done shortly.
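A sketch of how such a frequency dictionary can be built from the normalized text above (the variable names are our own choices):

```python
from collections import Counter

# Count word frequencies, marking the start of each word with the special underscore
word_counts = Counter('\u2581' + word for word in text.split())

# Space-delimit the characters of every word so BPE can merge adjacent symbols
vocab = {' '.join(word): freq for word, freq in word_counts.items()}

# Each key now looks like '▁ w o r d', mapped to that word's frequency.
```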
We check the size of the vocabulary (frequency dictionary) because this is the one hyperparameter that BPE crucially depends on: it controls how far a word is broken up into SentencePieces. It turns out that, for the model trained on our small dataset, performing a number of merges of the most frequent character pairs equal to 60% of the 455-entry vocabulary reproduces the tokenization of the model trained with an upper limit of a 32K vocab_size over the entire corpus of examples.
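In code, that relationship might look like the following (the 0.60 fraction and the size of 455 come from the text above and are specific to this dataset):

```python
vocab_size = len(vocab)      # 455 for this small example corpus (per the text above)
frac_merges = 0.60           # fraction of merges found to reproduce the trained model here
num_merges = int(frac_merges * vocab_size)
print(f'{vocab_size} words in the vocab, performing {num_merges} merges')
```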
BPE Algorithm
Directly from the BPE paper we have the following algorithm.
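The two helper functions below follow the code published in the BPE paper (Sennrich et al., 2016); the get_sentence_piece_vocab wrapper is our own sketch of the function described next, written to match that description (its exact signature is an assumption):

```python
import re, collections

def get_stats(vocab):
    # Count the frequency of each adjacent pair of symbols across all words
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    # Merge the given pair (remove the space between its symbols) in every word
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out

def get_sentence_piece_vocab(vocab, frac_merges=0.60):
    # Repeatedly merge the most frequent pair of symbols
    sp_vocab = vocab.copy()
    num_merges = int(len(sp_vocab) * frac_merges)
    for _ in range(num_merges):
        pairs = get_stats(sp_vocab)
        best = max(pairs, key=pairs.get)
        sp_vocab = merge_vocab(best, sp_vocab)
    return sp_vocab
```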
To understand what's going on, first take a look at the third function, get_sentence_piece_vocab. It takes in the current vocab word-frequency dictionary and the fraction of the total vocab_size used to determine num_merges, the number of times to merge characters in the words of the dictionary. Then for each merge operation it calls get_stats to count how many of each pair of character sequences there are. It takes the most frequent pair of symbols as the best pair. Then it merges that pair of symbols (removes the space between them) in each word in the vocab that contains this best (= pair). Consequently, merge_vocab creates a new vocab, v_out. This process is repeated num_merges times and the result is the set of SentencePieces (the keys of the final sp_vocab).
Please feel free to skip the below if the above description was enough.
In a little more detail then, we can see that in get_stats we initially create a list of bigram frequencies (two-character sequences) from our vocabulary. As merging proceeds, the paired symbols grow longer (trigrams, quadgrams, etc.). Note that the key of the pairs frequency dictionary is actually a 2-tuple, which is just Python's shorthand notation for a pair.
In merge_vocab we take in an individual pair (of character sequences; note this is the most frequent best pair) and the current vocab as v_in. We create a new vocab, v_out, from the old by joining together the characters in the pair (removing the space between them) wherever the pair is present in a word of the dictionary. Warning: the expression (?<!\S) means that the bigram is either preceded by a whitespace character or by nothing at all (the beginning of the word); similarly, (?!\S) means that the bigram is either followed by whitespace or by nothing (the end of the word).
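A small illustration of those lookarounds (the example word is arbitrary): the pair ('s', 't') is only merged when the 's' stands alone, not when it is already part of a previously merged symbol such as 'es':

```python
import re

pair = ('s', 't')
bigram = re.escape(' '.join(pair))
p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')

print(p.sub(''.join(pair), 'n e w e s t'))   # -> 'n e w e st'  ('s' stands alone, so it merges)
print(p.sub(''.join(pair), 'n e w es t'))    # -> 'n e w es t'  ('s' is inside 'es', so no merge)
```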
Train SentencePiece BPE Tokenizer on Example Data
Explore SentencePiece Model
First let us explore the SentencePiece model provided with this week's assignment. Remember you can always use Python's built-in help function to see the documentation for any object or method.
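A minimal sketch for loading it (the filename 'sentencepiece.model' is an assumption; use the model file shipped with the assignment):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load('sentencepiece.model')   # hypothetical filename; substitute the assignment's model file

help(sp.encode_as_pieces)        # Python's built-in help works on any object or method
```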
Let's work with the first sentence of our example text.
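For example (the sentence below is a stand-in; replace it with the first line of example.txt):

```python
# A stand-in sentence; use the first sentence of the example text here.
s0 = 'Beginners BBQ Class Taking Place in Missoula!'

print(sp.encode_as_pieces(s0))   # the subword pieces
print(sp.encode_as_ids(s0))      # the corresponding integer ids
```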
Notice how SentencePiece breaks the words into seemingly odd parts, but we've seen something similar in our work with BPE. But how close were we to this model, trained on the whole corpus of examples with a vocab_size of 32,000 instead of 455? Here you can also test what happens to white space, like '\n'.
But first let us note that SentencePiece encodes the SentencePieces, the tokens, as ids, and has reserved some of the ids, as can be seen in this week's assignment.
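These reserved ids can be queried directly from the processor; for example (the specific values depend on how the model was trained):

```python
# Reserved/special ids; the exact values depend on the training configuration.
print(f'<pad> id: {sp.pad_id()}')
print(f'<s>   id: {sp.bos_id()}')
print(f'</s>  id: {sp.eos_id()}')
print(f'<unk> id: {sp.unk_id()}')
```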
We can also check the ids for the first and last parts of the vocabulary.
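For example, something along these lines (using the processor loaded above):

```python
vocab_size = sp.get_piece_size()   # total number of pieces in the model
print('Id -> SentencePiece')
for piece_id in list(range(5)) + list(range(vocab_size - 5, vocab_size)):
    print(f'{piece_id} -> {sp.id_to_piece(piece_id)}')
```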
Train SentencePiece BPE model with our example.txt
Finally, let's train our own BPE model directly with the SentencePiece library and compare it to the results of our implementation of the algorithm from the BPE paper itself.
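A sketch of that training call, assuming the normalized text was written to example.txt earlier; the vocab_size of 450 is an assumed value chosen to roughly match our small example vocabulary:

```python
import sentencepiece as spm

# vocab_size=450 is an assumption chosen to be close to the size of our small vocab.
spm.SentencePieceTrainer.train(
    '--input=example.txt --model_prefix=example_bpe --vocab_size=450 --model_type=bpe')

sp_bpe = spm.SentencePieceProcessor()
sp_bpe.load('example_bpe.model')

print(sp_bpe.encode_as_pieces(s0))   # compare with the pieces from our own BPE implementation
```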
Our implementation of the BPE code from the paper matches up pretty well with the library itself! Differences are probably accounted for by the vocab_size. There is also another technical difference: the SentencePiece implementation of BPE uses a priority queue to more efficiently keep track of the best pairs. Actually, there is a priority queue in the Python standard library called heapq if you would like to give that a try below!
For a more extensive example consider looking at the SentencePiece repo. The last section of this code is repurposed from that tutorial. Thanks for your participation!