Path: blob/master/examples/nlp/ipynb/masked_language_modeling.ipynb
End-to-end Masked Language Modeling with BERT
Author: Ankur Singh
Date created: 2020/09/18
Last modified: 2024/03/15
Description: Implement a Masked Language Model (MLM) with BERT and fine-tune it on the IMDB Reviews dataset.
Introduction
Masked Language Modeling is a fill-in-the-blank task, where a model uses the context words surrounding a mask token to try to predict what the masked word should be.
For an input that contains one or more mask tokens, the model will generate the most likely substitution for each.
Example:
Input: "I have watched this [MASK] and it was awesome."
Output: "I have watched this movie and it was awesome."
Masked language modeling is a great way to train a language model in a self-supervised setting (without human-annotated labels). Such a model can then be fine-tuned to accomplish various supervised NLP tasks.
This example teaches you how to build a BERT model from scratch, train it with the masked language modeling task, and then fine-tune this model on a sentiment classification task.
We will use the Keras TextVectorization and MultiHeadAttention layers to create a BERT Transformer-Encoder network architecture.
Note: This example should be run with tf-nightly.
Setup
Install tf-nightly via pip install tf-nightly.
Set-up Configuration
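For reference, a minimal configuration sketch along these lines could look as follows; the exact hyperparameter values below are illustrative assumptions rather than prescribed settings.

```python
from dataclasses import dataclass


@dataclass
class Config:
    MAX_LEN = 256        # maximum sequence length fed to the model
    BATCH_SIZE = 32      # training batch size
    LR = 0.001           # learning rate
    VOCAB_SIZE = 30000   # size of the TextVectorization vocabulary
    EMBED_DIM = 128      # token embedding dimension
    NUM_HEAD = 8         # attention heads in the BERT encoder
    FF_DIM = 128         # hidden units of the feed-forward block
    NUM_LAYERS = 1       # number of encoder blocks


config = Config()
```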
Load the data
We will first download the IMDB data and load it into a Pandas dataframe.
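A rough sketch of this step, assuming the standard aclImdb_v1.tar.gz archive layout; the helper names below are illustrative.

```python
import glob
import os
import tarfile
import urllib.request

import pandas as pd

URL = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

# Download and extract the archive once; it creates an ./aclImdb directory.
if not os.path.exists("aclImdb"):
    urllib.request.urlretrieve(URL, "aclImdb_v1.tar.gz")
    with tarfile.open("aclImdb_v1.tar.gz") as tar:
        tar.extractall()


def get_text_list_from_files(files):
    """Read a list of text files into a list of strings."""
    texts = []
    for name in files:
        with open(name, encoding="utf-8") as f:
            texts.append(f.read())
    return texts


def get_data_from_text_files(folder_name):
    """Load one split (train/test) into a shuffled DataFrame with labels."""
    pos_files = glob.glob(f"aclImdb/{folder_name}/pos/*.txt")
    neg_files = glob.glob(f"aclImdb/{folder_name}/neg/*.txt")
    df = pd.DataFrame(
        {
            "review": get_text_list_from_files(pos_files)
            + get_text_list_from_files(neg_files),
            "sentiment": [1] * len(pos_files) + [0] * len(neg_files),
        }
    )
    return df.sample(frac=1.0).reset_index(drop=True)


train_df = get_data_from_text_files("train")
test_df = get_data_from_text_files("test")
all_data = pd.concat([train_df, test_df]).reset_index(drop=True)
```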
Dataset preparation
We will use the TextVectorization layer to vectorize the text into integer token ids. It transforms a batch of strings into either a sequence of token indices (one sample = 1D array of integer token indices, in order) or a dense representation (one sample = 1D array of float values encoding an unordered set of tokens).
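To make the two output modes concrete, here is a small, self-contained illustration; the toy sentences and settings are just for demonstration.

```python
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

texts = tf.constant(["I have watched this movie", "this movie was awesome"])

# "int" mode: each sample becomes an ordered 1D array of integer token ids.
int_vectorizer = TextVectorization(
    max_tokens=1000, output_mode="int", output_sequence_length=8
)
int_vectorizer.adapt(texts)
print(int_vectorizer(texts))  # shape (2, 8), zero-padded to the sequence length

# "multi_hot" mode: each sample becomes an unordered bag-of-tokens encoding.
binary_vectorizer = TextVectorization(max_tokens=1000, output_mode="multi_hot")
binary_vectorizer.adapt(texts)
print(binary_vectorizer(texts))  # shape (2, 1000)
```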
Below, we define 3 preprocessing functions.
- The get_vectorize_layer function builds the TextVectorization layer.
- The encode function encodes raw text into integer token ids.
- The get_masked_input_and_labels function masks input token ids. It masks 15% of all input tokens in each sequence at random (a sketch of this step follows the list).
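Following the list above, here is a rough sketch of the masking logic in get_masked_input_and_labels; the mask_token_id and vocab_size arguments, and the assumption that ids 0–2 are padding/special tokens, are illustrative choices rather than the example's exact code.

```python
import numpy as np


def get_masked_input_and_labels(encoded_texts, mask_token_id, vocab_size):
    """BERT-style masking: pick ~15% of tokens per sequence as prediction targets."""
    # Select 15% of positions at random, skipping padding/special tokens (ids 0-2 here).
    inp_mask = np.random.rand(*encoded_texts.shape) < 0.15
    inp_mask[encoded_texts <= 2] = False

    # Targets: the original id at masked positions, -1 (ignored) elsewhere.
    labels = -1 * np.ones(encoded_texts.shape, dtype=int)
    labels[inp_mask] = encoded_texts[inp_mask]

    masked_inputs = np.copy(encoded_texts)
    # Of the selected positions, 90% become the [mask] token (10% stay unchanged)...
    set_to_mask = inp_mask & (np.random.rand(*encoded_texts.shape) < 0.90)
    masked_inputs[set_to_mask] = mask_token_id
    # ...and 1/9 of those (i.e. 10% of the selected positions) become a random token.
    set_to_random = set_to_mask & (np.random.rand(*encoded_texts.shape) < 1 / 9)
    masked_inputs[set_to_random] = np.random.randint(3, vocab_size, set_to_random.sum())

    # Sample weights so the loss only counts the masked positions.
    sample_weights = np.ones(labels.shape)
    sample_weights[labels == -1] = 0
    return masked_inputs, np.copy(encoded_texts), sample_weights
```

This follows the usual BERT recipe: of the selected positions, roughly 80% become the mask token, 10% become a random token, and 10% are left unchanged, while the sample weights restrict the loss to those positions.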
Create BERT model (Pretraining Model) for masked language modeling
We will create a BERT-like pretraining model architecture using the MultiHeadAttention layer. It will take token ids as inputs (including masked tokens) and it will predict the correct ids for the masked input tokens.
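A compact sketch of such an architecture, assuming illustrative hyperparameters and a simple learned token-plus-position embedding; this is a simplified stand-in, not the example's exact code.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative hyperparameters (the example's Config values may differ).
VOCAB_SIZE, MAX_LEN, EMBED_DIM, NUM_HEADS, FF_DIM, NUM_LAYERS = 30000, 256, 128, 8, 128, 1


class TokenAndPositionEmbedding(layers.Layer):
    """Sum of learned token embeddings and learned position embeddings."""

    def __init__(self, vocab_size, max_len, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_emb = layers.Embedding(vocab_size, embed_dim)
        self.pos_emb = layers.Embedding(max_len, embed_dim)

    def call(self, x):
        positions = tf.range(start=0, limit=tf.shape(x)[-1], delta=1)
        return self.token_emb(x) + self.pos_emb(positions)


inputs = layers.Input(shape=(MAX_LEN,), dtype="int64")
x = TokenAndPositionEmbedding(VOCAB_SIZE, MAX_LEN, EMBED_DIM)(inputs)

for _ in range(NUM_LAYERS):
    # Self-attention block with residual connection and layer normalization.
    attn = layers.MultiHeadAttention(
        num_heads=NUM_HEADS, key_dim=EMBED_DIM // NUM_HEADS
    )(x, x)
    x = layers.LayerNormalization(epsilon=1e-6)(x + layers.Dropout(0.1)(attn))
    # Position-wise feed-forward block.
    ffn = layers.Dense(EMBED_DIM)(layers.Dense(FF_DIM, activation="relu")(x))
    x = layers.LayerNormalization(epsilon=1e-6)(x + layers.Dropout(0.1)(ffn))

# MLM head: a softmax over the vocabulary at every position.
outputs = layers.Dense(VOCAB_SIZE, activation="softmax")(x)
mlm_model = keras.Model(inputs, outputs, name="masked_bert_model")
```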
Train and Save
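Continuing the sketches above, one simplified way to train and save the model is to pass the per-position sample weights to fit, so that only masked positions contribute to the loss; x_masked, y_labels, and sample_weights are assumed to come from the earlier masking sketch, and the filename is illustrative.

```python
mlm_model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss="sparse_categorical_crossentropy",
    weighted_metrics=["sparse_categorical_accuracy"],
)

# The 2D sample weights zero out the loss everywhere except the masked tokens.
mlm_model.fit(
    x_masked,
    y_labels,
    sample_weight=sample_weights,
    batch_size=32,
    epochs=5,
)

# Save the pretrained model so it can be reloaded for fine-tuning.
mlm_model.save("bert_mlm_imdb.keras")
```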
Fine-tune a sentiment classification model
We will fine-tune our self-supervised model on a downstream task of sentiment classification. To do this, let's create a classifier by adding a pooling layer and a Dense layer on top of the pretrained BERT features.
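A minimal sketch of such a classifier, assuming the sequence features can be taken from the layer just before the MLM head of the mlm_model sketched earlier.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Reuse the pretrained encoder: everything up to (but not including) the MLM head.
pretrained_features = keras.Model(
    mlm_model.inputs, mlm_model.layers[-2].output, name="pretrained_encoder"
)
pretrained_features.trainable = False  # optionally freeze for the first epochs

inputs = layers.Input(shape=(MAX_LEN,), dtype="int64")
x = pretrained_features(inputs)
x = layers.GlobalMaxPooling1D()(x)      # pool the sequence features
x = layers.Dense(64, activation="relu")(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

classifier = keras.Model(inputs, outputs, name="sentiment_classifier")
classifier.compile(
    optimizer=keras.optimizers.Adam(),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
```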
Create an end-to-end model and evaluate it
When you want to deploy a model, it's best if it already includes its preprocessing pipeline, so that you don't have to reimplement the preprocessing logic in your production environment. Let's create an end-to-end model that incorporates the TextVectorization layer, and let's evaluate it. We will pass raw strings as input.
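A sketch of the wrapper, assuming vectorize_layer is the adapted TextVectorization layer from the preprocessing step, classifier is the fine-tuned model sketched above, and test_df comes from the data-loading step.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Raw strings in, sentiment probability out.
raw_inputs = layers.Input(shape=(1,), dtype="string")
token_ids = vectorize_layer(raw_inputs)
predictions = classifier(token_ids)
end_to_end_model = keras.Model(raw_inputs, predictions, name="end_to_end_model")
end_to_end_model.compile(
    optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"]
)

# Evaluate directly on raw review strings and their integer labels.
reviews = test_df["review"].values.reshape(-1, 1)
labels = test_df["sentiment"].values
end_to_end_model.evaluate(reviews, labels, batch_size=32)
```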