Automatic Speech Recognition using CTC
Authors: Mohamed Reda Bouadjenek and Ngoc Dung Huynh
Date created: 2021/09/26
Last modified: 2021/09/26
Description: Training a CTC-based model for automatic speech recognition.
Introduction
Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition or speech to text (STT). It incorporates knowledge and research in the computer science, linguistics and computer engineering fields.
This demonstration shows how to combine a 2D CNN, RNN and a Connectionist Temporal Classification (CTC) loss to build an ASR model. CTC is an algorithm used to train deep neural networks in speech recognition, handwriting recognition and other sequence problems. CTC is used when we don’t know how the input aligns with the output (how the characters in the transcript align to the audio). The model we create is similar to DeepSpeech2.
We will use the LJSpeech dataset from the LibriVox project. It consists of short audio clips of a single speaker reading passages from 7 non-fiction books.
We will evaluate the quality of the model using Word Error Rate (WER). WER is obtained by adding up the substitutions, insertions, and deletions that occur in a sequence of recognized words, and dividing that total by the number of words that were originally spoken. To get the WER score you need to install the jiwer package. You can use the following command line:
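```
pip install jiwer
```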
References:
LJSpeech Dataset: https://keithito.com/LJ-Speech-Dataset/
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin: https://arxiv.org/abs/1512.02595
Setup
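Below is a minimal set of imports that the rest of this example assumes: pandas and NumPy for handling the metadata, TensorFlow/Keras for the model, matplotlib and IPython for visualization and audio playback, and jiwer for the WER metric. The exact setup cell may differ slightly in your environment.

```python
import pandas as pd
import numpy as np
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers

import matplotlib.pyplot as plt
from IPython import display
from jiwer import wer
```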
Load the LJSpeech Dataset
Let's download the LJSpeech Dataset. The dataset contains 13,100 audio files as wav files in the /wavs/ folder. The label (transcript) for each audio file is a string given in the metadata.csv file. The fields are:
ID: this is the name of the corresponding .wav file
Transcription: words spoken by the reader (UTF-8)
Normalized transcription: transcription with numbers, ordinals, and monetary units expanded into full words (UTF-8).
For this demo we will use the "Normalized transcription" field.
Each audio file is a single-channel 16-bit PCM WAV with a sample rate of 22,050 Hz.
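As a sketch, the dataset can be downloaded and the metadata loaded as follows, assuming the archive is fetched from the LJSpeech project's download location at data.keithito.com; adjust the URL and paths if you host the data elsewhere.

```python
# Download and extract the LJSpeech archive.
data_url = "https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2"
data_path = keras.utils.get_file("LJSpeech-1.1", data_url, untar=True)
wavs_path = data_path + "/wavs/"
metadata_path = data_path + "/metadata.csv"

# metadata.csv is pipe-separated: ID | transcription | normalized transcription.
metadata_df = pd.read_csv(metadata_path, sep="|", header=None, quoting=3)
metadata_df.columns = ["file_name", "transcription", "normalized_transcription"]
metadata_df = metadata_df[["file_name", "normalized_transcription"]]
# Shuffle the rows so the train/validation split below is random.
metadata_df = metadata_df.sample(frac=1).reset_index(drop=True)
metadata_df.head(3)
```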
We now split the data into training and validation sets.
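A simple way to do this, using an illustrative 90/10 split on the shuffled metadata frame from above:

```python
# Use 90% of the clips for training and the remaining 10% for validation.
split = int(len(metadata_df) * 0.90)
df_train = metadata_df[:split]
df_val = metadata_df[split:]

print(f"Size of the training set: {len(df_train)}")
print(f"Size of the validation set: {len(df_val)}")
```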
Preprocessing
We first prepare the vocabulary to be used.
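One common way to do this is with keras.layers.StringLookup, which maps characters to integer IDs and back. The character set below (lowercase letters, apostrophe, question mark, exclamation mark and space) is an assumption that matches the normalized transcriptions.

```python
# The set of characters accepted in the transcription.
characters = [x for x in "abcdefghijklmnopqrstuvwxyz'?! "]
# Map characters to integer IDs.
char_to_num = keras.layers.StringLookup(vocabulary=characters, oov_token="")
# Map integer IDs back to characters, for decoding predictions.
num_to_char = keras.layers.StringLookup(
    vocabulary=char_to_num.get_vocabulary(), oov_token="", invert=True
)

print(
    f"The vocabulary is: {char_to_num.get_vocabulary()} "
    f"(size = {char_to_num.vocabulary_size()})"
)
```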
Next, we create the function that describes the transformation that we apply to each element of our dataset.
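Here is a sketch of such a function: it reads a wav file, computes a magnitude spectrogram with tf.signal.stft, normalizes it, and encodes the label with the char_to_num lookup defined above. The frame_length, frame_step and fft_length values are illustrative.

```python
# STFT parameters (illustrative values).
frame_length = 256  # window length in samples
frame_step = 160  # hop size in samples
fft_length = 384  # FFT size


def encode_single_sample(wav_file, label):
    # 1. Read and decode the wav file.
    file = tf.io.read_file(wavs_path + wav_file + ".wav")
    audio, _ = tf.audio.decode_wav(file)
    audio = tf.squeeze(audio, axis=-1)
    audio = tf.cast(audio, tf.float32)
    # 2. Compute the spectrogram and keep only the magnitude.
    spectrogram = tf.signal.stft(
        audio, frame_length=frame_length, frame_step=frame_step, fft_length=fft_length
    )
    spectrogram = tf.abs(spectrogram)
    spectrogram = tf.math.pow(spectrogram, 0.5)
    # 3. Normalize each feature dimension.
    means = tf.math.reduce_mean(spectrogram, 1, keepdims=True)
    stddevs = tf.math.reduce_std(spectrogram, 1, keepdims=True)
    spectrogram = (spectrogram - means) / (stddevs + 1e-10)
    # 4. Lowercase, split into characters and map the label to integer IDs.
    label = tf.strings.lower(label)
    label = tf.strings.unicode_split(label, input_encoding="UTF-8")
    label = char_to_num(label)
    return spectrogram, label
```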
Creating Dataset objects
We create a tf.data.Dataset object that yields the transformed elements, in the same order as they appeared in the input.
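A sketch of building these pipelines from the data frames above; the batch size of 32 is illustrative, and padded batching lets us batch spectrograms and labels of different lengths together.

```python
batch_size = 32

# Training dataset.
train_dataset = tf.data.Dataset.from_tensor_slices(
    (list(df_train["file_name"]), list(df_train["normalized_transcription"]))
)
train_dataset = (
    train_dataset.map(encode_single_sample, num_parallel_calls=tf.data.AUTOTUNE)
    .padded_batch(batch_size)
    .prefetch(buffer_size=tf.data.AUTOTUNE)
)

# Validation dataset.
validation_dataset = tf.data.Dataset.from_tensor_slices(
    (list(df_val["file_name"]), list(df_val["normalized_transcription"]))
)
validation_dataset = (
    validation_dataset.map(encode_single_sample, num_parallel_calls=tf.data.AUTOTUNE)
    .padded_batch(batch_size)
    .prefetch(buffer_size=tf.data.AUTOTUNE)
)
```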
Visualize the data
Let's visualize an example in our dataset, including the audio clip, the spectrogram and the corresponding label.
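One possible way to do this, plotting the spectrogram and the raw waveform of the first training example and playing the clip in the notebook (the plotting details are a matter of taste):

```python
for batch in train_dataset.take(1):
    spectrogram = batch[0][0].numpy()
    label = batch[1][0]
    # Decode the integer label back to text for the plot title.
    label = tf.strings.reduce_join(num_to_char(label)).numpy().decode("utf-8")

    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 5))
    # Spectrogram: time on the x-axis, frequency bins on the y-axis.
    ax1.imshow(spectrogram.T, aspect="auto", origin="lower")
    ax1.set_title(label)
    # Raw waveform of the same clip.
    file = tf.io.read_file(wavs_path + list(df_train["file_name"])[0] + ".wav")
    audio, _ = tf.audio.decode_wav(file)
    audio = audio.numpy().squeeze()
    ax2.plot(audio)
    ax2.set_title("Signal Wave")
    ax2.set_xlim(0, len(audio))
    plt.tight_layout()
    plt.show()
    # Play the clip (LJSpeech audio is sampled at 22,050 Hz).
    display.display(display.Audio(audio, rate=22050))
```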
Model
We first define the CTC Loss function.
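A sketch built on keras.backend.ctc_batch_cost, which needs the length of every prediction sequence and every label in the batch:

```python
def CTCLoss(y_true, y_pred):
    # Compute the training-time CTC loss value.
    batch_len = tf.cast(tf.shape(y_true)[0], dtype="int64")
    input_length = tf.cast(tf.shape(y_pred)[1], dtype="int64")
    label_length = tf.cast(tf.shape(y_true)[1], dtype="int64")

    # Every sample in the (padded) batch shares the same length here.
    input_length = input_length * tf.ones(shape=(batch_len, 1), dtype="int64")
    label_length = label_length * tf.ones(shape=(batch_len, 1), dtype="int64")

    loss = keras.backend.ctc_batch_cost(y_true, y_pred, input_length, label_length)
    return loss
```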
We now define our model. We will define a model similar to DeepSpeech2.
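Here is a sketch of a DeepSpeech2-style architecture: two 2D convolutions over the spectrogram, a stack of bidirectional GRUs, and a softmax over the vocabulary plus the extra CTC "blank" token. The filter sizes, number of RNN layers and unit counts below are illustrative choices.

```python
def build_model(input_dim, output_dim, rnn_layers=5, rnn_units=128):
    """Model similar to DeepSpeech2."""
    # Model's input: spectrograms of shape (time, input_dim).
    input_spectrogram = layers.Input((None, input_dim), name="input")
    # Expand to 4D so we can use 2D convolutions.
    x = layers.Reshape((-1, input_dim, 1), name="expand_dim")(input_spectrogram)
    # Convolution block 1.
    x = layers.Conv2D(32, [11, 41], strides=[2, 2], padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    # Convolution block 2.
    x = layers.Conv2D(32, [11, 21], strides=[1, 2], padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    # Collapse the frequency and channel axes before the RNN stack.
    x = layers.Reshape((-1, x.shape[-2] * x.shape[-1]))(x)
    # Bidirectional GRU layers.
    for i in range(1, rnn_layers + 1):
        recurrent = layers.GRU(units=rnn_units, return_sequences=True, name=f"gru_{i}")
        x = layers.Bidirectional(recurrent, merge_mode="concat")(x)
        if i < rnn_layers:
            x = layers.Dropout(rate=0.5)(x)
    # Dense block.
    x = layers.Dense(units=rnn_units * 2)(x)
    x = layers.ReLU()(x)
    x = layers.Dropout(rate=0.5)(x)
    # Classification layer: one unit per character plus the CTC blank token.
    output = layers.Dense(units=output_dim + 1, activation="softmax")(x)
    model = keras.Model(input_spectrogram, output, name="DeepSpeech_2")
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4), loss=CTCLoss)
    return model


model = build_model(
    input_dim=fft_length // 2 + 1,
    output_dim=char_to_num.vocabulary_size(),
    rnn_units=512,
)
model.summary(line_length=110)
```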
Training and Evaluating
Let's start the training process.
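Below is a sketch of a training loop: a greedy CTC decoder built on keras.backend.ctc_decode, a callback that prints the validation WER and a couple of sample transcriptions after each epoch, and the call to model.fit. The helper names (decode_batch_predictions, CallbackEval) are illustrative, and the number of epochs is kept small here; see the conclusion for what to use in practice.

```python
def decode_batch_predictions(pred):
    # Greedy CTC decoding; beam search can give slightly better results.
    input_len = np.ones(pred.shape[0]) * pred.shape[1]
    results = keras.backend.ctc_decode(pred, input_length=input_len, greedy=True)[0][0]
    # Map the integer sequences back to text.
    output_text = []
    for result in results:
        result = tf.strings.reduce_join(num_to_char(result)).numpy().decode("utf-8")
        output_text.append(result)
    return output_text


class CallbackEval(keras.callbacks.Callback):
    """Displays the WER and a few transcriptions after every epoch."""

    def __init__(self, dataset):
        super().__init__()
        self.dataset = dataset

    def on_epoch_end(self, epoch, logs=None):
        predictions = []
        targets = []
        for batch in self.dataset:
            X, y = batch
            predictions.extend(decode_batch_predictions(model.predict(X)))
            for label in y:
                targets.append(
                    tf.strings.reduce_join(num_to_char(label)).numpy().decode("utf-8")
                )
        print("-" * 100)
        print(f"Word Error Rate: {wer(targets, predictions):.4f}")
        print("-" * 100)
        for i in np.random.randint(0, len(predictions), 2):
            print(f"Target    : {targets[i]}")
            print(f"Prediction: {predictions[i]}")
            print("-" * 100)


epochs = 1  # In practice, train for around 50 epochs or more.
history = model.fit(
    train_dataset,
    validation_data=validation_dataset,
    epochs=epochs,
    callbacks=[CallbackEval(validation_dataset)],
)
```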
Inference
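To inspect what the trained model produces, one can reuse the decoding helper above on the validation set, for example:

```python
# Decode one batch of validation samples and compare them to the targets.
predictions = []
targets = []
for batch in validation_dataset.take(1):
    X, y = batch
    predictions.extend(decode_batch_predictions(model.predict(X)))
    for label in y:
        targets.append(
            tf.strings.reduce_join(num_to_char(label)).numpy().decode("utf-8")
        )

print(f"Word Error Rate on this batch: {wer(targets, predictions):.4f}")
for target, prediction in list(zip(targets, predictions))[:5]:
    print("-" * 100)
    print(f"Target    : {target}")
    print(f"Prediction: {prediction}")
```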
Conclusion
In practice, you should train for around 50 epochs or more. Each epoch takes approximately 5-6 minutes using a GeForce RTX 2080 Ti GPU. The model we trained for 50 epochs has a Word Error Rate (WER) of approximately 16% to 17%.
Some of the transcriptions around epoch 50:
Audio file: LJ017-0009.wav
Audio file: LJ003-0340.wav
Audio file: LJ011-0136.wav
Example available on HuggingFace.