English speaker accent recognition using Transfer Learning
Author: Fadi Badine
Date created: 2022/04/16
Last modified: 2022/04/16
Description: Training a model to classify UK & Ireland accents using feature extraction from Yamnet.
Introduction
The following example shows how to use feature extraction in order to train a model to classify the English accent spoken in an audio wave.
Instead of training a model from scratch, transfer learning enables us to take advantage of existing state-of-the-art deep learning models and use them as feature extractors.
Our process:
Use a TF Hub pre-trained model (Yamnet) and apply it as part of the tf.data pipeline which transforms the audio files into feature vectors.
Train a dense model on the feature vectors.
Use the trained model for inference on a new audio file.
Note:
We need to install TensorFlow IO in order to resample audio files to 16 kHz as required by the Yamnet model.
In the test section, ffmpeg is used to convert the mp3 file to wav.
You can install TensorFlow IO with the following command:
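For example, in a notebook environment this would be something like:

```
pip install -U tensorflow_io
```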
Configuration
Imports
Yamnet Model
Yamnet is an audio event classifier trained on the AudioSet dataset to predict audio events from the AudioSet ontology. It is available on TensorFlow Hub.
Yamnet accepts a 1-D tensor of audio samples with a sample rate of 16 kHz. As output, the model returns a 3-tuple:
Scores of shape (N, 521) representing the scores of the 521 classes.
Embeddings of shape (N, 1024).
The log-mel spectrogram of the entire audio frame.
We will use the embeddings, which are the features extracted from the audio samples, as the input to our dense model.
For more detailed information about Yamnet, please refer to its TensorFlow Hub page.
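As a sketch of how the feature extractor is obtained (the TF Hub handle below is Yamnet's public release; the variable name is ours):

```python
import tensorflow_hub as hub

# Load the pre-trained Yamnet model from TensorFlow Hub.
yamnet_model = hub.load("https://tfhub.dev/google/yamnet/1")

# Yamnet takes a 1-D float32 waveform sampled at 16 kHz and returns
# (scores, embeddings, log_mel_spectrogram):
#   scores:     (N, 521) class scores, one row per ~0.96 s frame
#   embeddings: (N, 1024) feature vectors that we will train on
```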
Dataset
The dataset used is the Crowdsourced high-quality UK and Ireland English Dialect speech data set which consists of a total of 17,877 high-quality audio wav files.
This dataset includes over 31 hours of recording from 120 volunteers who self-identify as native speakers of Southern England, Midlands, Northern England, Wales, Scotland and Ireland.
For more info, please refer to the above link or to the following paper: Open-source Multi-speaker Corpora of the English Accents in the British Isles
Download the data
Load the data in a Dataframe
Of the 3 columns (ID, filename and transcript), we are only interested in the filename column in order to read the audio file. We will ignore the other two.
Let's now preprocess the dataset by:
Adjusting the filename (removing a leading space & adding the ".wav" extension).
Creating a label using the first 2 characters of the filename which indicate the accent.
Shuffling the samples.
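A minimal pandas sketch of these three preprocessing steps (the CSV file name and column names are assumptions based on the description above):

```python
import pandas as pd

# The line-index file has three columns: ID, filename and transcript.
dataframe = pd.read_csv(
    "line_index_all.csv", names=["id", "filename", "transcript"]
)

# Remove the leading space and add the ".wav" extension.
dataframe["filename"] = dataframe["filename"].str.strip() + ".wav"

# The first 2 characters of the filename encode the accent (e.g. "ir" for Irish).
dataframe["label"] = dataframe["filename"].str[:2]

# Shuffle the samples.
dataframe = dataframe.sample(frac=1.0, random_state=1337).reset_index(drop=True)
```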
Prepare training & validation sets
Let's split the samples into training and validation sets.
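For example, a simple 80/20 split over the shuffled dataframe (the ratio here is an assumption):

```python
# Keep 80% of the (already shuffled) samples for training and 20% for validation.
split = int(len(dataframe) * 0.8)
train_df = dataframe[:split].reset_index(drop=True)
valid_df = dataframe[split:].reset_index(drop=True)

print(f"Training samples: {len(train_df)} / Validation samples: {len(valid_df)}")
```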
Prepare a TensorFlow Dataset
Next, we need to create a tf.data.Dataset. This is done by creating a dataframe_to_dataset function that does the following:
Create a dataset using filenames and labels.
Get the Yamnet embeddings by calling another function filepath_to_embeddings.
Apply caching, reshuffling and setting batch size.
The filepath_to_embeddings function does the following:
Load audio file.
Resample audio to 16 kHz.
Generate scores and embeddings from Yamnet model.
Since Yamnet generates multiple samples for each audio file, this function also duplicates the accent label for all generated samples whose top-scoring class is class 0 (speech), and sets the label of the remaining samples to 'other', indicating that the segment is not speech and should not be labeled as one of the accents.
The load_16k_audio_file function below is copied from the tutorial Transfer learning with YAMNet for environmental sound classification.
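Putting these pieces together, here is a condensed sketch of what the pipeline could look like (the batch size, shuffle buffer, class list, label_id column and the map + unbatch flattening are assumptions, not the exact original code; yamnet_model, train_df and valid_df come from the sketches above):

```python
import tensorflow as tf
import tensorflow_io as tfio

# Assumed accent codes (first 2 characters of the filenames) plus the extra 'other' class.
class_names = ["ir", "mi", "no", "sc", "so", "we", "other"]


def load_16k_audio_file(filename):
    # Read the wav file and decode it to a mono float32 waveform in [-1, 1].
    file_contents = tf.io.read_file(filename)
    audio, sample_rate = tf.audio.decode_wav(file_contents, desired_channels=1)
    audio = tf.squeeze(audio, axis=-1)
    # Resample to the 16 kHz rate expected by Yamnet.
    sample_rate = tf.cast(sample_rate, dtype=tf.int64)
    return tfio.audio.resample(audio, rate_in=sample_rate, rate_out=16000)


def filepath_to_embeddings(filename, label):
    # Run Yamnet on the whole file: it returns one row per ~0.96 s frame.
    audio = load_16k_audio_file(filename)
    scores, embeddings, _ = yamnet_model(audio)
    # Keep the accent label for frames whose top class is 0 ("Speech"),
    # and map every other frame to the extra 'other' class.
    is_speech = tf.argmax(scores, axis=1) == 0
    labels = tf.where(is_speech, tf.cast(label, tf.int32), len(class_names) - 1)
    return embeddings, tf.one_hot(labels, len(class_names))


def dataframe_to_dataset(dataframe, batch_size=64):
    # "label_id" is assumed to hold the integer-encoded accent label.
    ds = tf.data.Dataset.from_tensor_slices(
        (dataframe["filename"].values, dataframe["label_id"].values)
    )
    ds = ds.map(filepath_to_embeddings, num_parallel_calls=tf.data.AUTOTUNE)
    # Each file yields a variable number of frames, so flatten before batching.
    return ds.unbatch().cache().shuffle(10_000).batch(batch_size).prefetch(tf.data.AUTOTUNE)


train_ds = dataframe_to_dataset(train_df)
valid_ds = dataframe_to_dataset(valid_df)
```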
Build the model
The model that we use consists of:
An input layer which is the embedding output of the Yamnet classifier.
4 dense hidden layers and 4 dropout layers.
An output dense layer.
The model's hyperparameters were selected using KerasTuner.
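A sketch of such a model in the Keras functional API (the layer widths, dropout rate and learning rate below are illustrative, not the KerasTuner-selected values):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Yamnet embeddings are 1024-dimensional; 7 classes = 6 accents + 'other'.
inputs = keras.Input(shape=(1024,), name="embedding")
x = inputs
for units in (256, 256, 192, 192):  # 4 dense hidden layers, each followed by dropout
    x = layers.Dense(units, activation="relu")(x)
    x = layers.Dropout(0.3)(x)
outputs = layers.Dense(len(class_names), activation="softmax", name="accent")(x)

model = keras.Model(inputs, outputs, name="accent_recognition")
model.compile(
    optimizer=keras.optimizers.Adam(1e-4),
    loss=keras.losses.CategoricalCrossentropy(),
    metrics=["accuracy", keras.metrics.AUC(name="auc")],
)
model.summary()
```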
Class weights calculation
Since the dataset is quite unbalanced, we will use the class_weight argument during training.
Getting the class weights is a little tricky because even though we know the number of audio files for each class, it does not represent the number of samples for that class since Yamnet transforms each audio file into multiple audio samples of 0.96 seconds each. So every audio file will be split into a number of samples that is proportional to its length.
Therefore, to get those weights, we have to calculate the number of samples for each class after preprocessing through Yamnet.
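A sketch of that computation, assuming train_ds is the dataset built above and yields one-hot labels:

```python
import numpy as np

# Count, per class, the number of 0.96 s Yamnet frames in the training set.
class_counts = np.zeros(len(class_names), dtype=np.int64)
for _, labels in train_ds:
    class_counts += np.sum(labels.numpy(), axis=0).astype(np.int64)

total = class_counts.sum()
# Standard balanced weighting: total / (num_classes * count_for_this_class).
class_weight = {
    i: total / (len(class_names) * max(int(count), 1))
    for i, count in enumerate(class_counts)
}
print(class_weight)
```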
Callbacks
We use Keras callbacks in order to:
Stop whenever the validation AUC stops improving.
Save the best model.
Call TensorBoard in order to later view the training and validation logs.
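For instance (the checkpoint path, patience and log directory are assumptions):

```python
from tensorflow import keras

callbacks = [
    # Stop training once the validation AUC stops improving.
    keras.callbacks.EarlyStopping(
        monitor="val_auc", mode="max", patience=10, restore_best_weights=True
    ),
    # Keep only the best model seen so far.
    keras.callbacks.ModelCheckpoint(
        "best_accent_model.h5", monitor="val_auc", mode="max", save_best_only=True
    ),
    # Write logs that TensorBoard can visualize later.
    keras.callbacks.TensorBoard(log_dir="logs"),
]
```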
Training
Results
Let's plot the training and validation AUC and accuracy.
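Assuming history is the object returned by model.fit (and that the AUC metric was named "auc" when compiling, as in the sketch above), a minimal plotting sketch:

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
for ax, metric in zip(axes, ["auc", "accuracy"]):
    ax.plot(history.history[metric], label="training")
    ax.plot(history.history["val_" + metric], label="validation")
    ax.set_xlabel("Epoch")
    ax.set_title(metric)
    ax.legend()
plt.show()
```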
Evaluation
Let's compare our model's performance to Yamnet's using one of Yamnet's own metrics (d-prime). Yamnet achieved a d-prime value of 2.318. Let's check our model's performance.
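d-prime can be derived from AUC with the inverse of the standard normal CDF; a small helper following that convention:

```python
import numpy as np
from scipy import stats


def d_prime(auc):
    # d' = sqrt(2) * Z(AUC), where Z is the inverse standard normal CDF.
    return np.sqrt(2.0) * stats.norm().ppf(auc)


# Using the AUC values reported in the table below.
print(f"train d-prime: {d_prime(0.91):.3f}")
print(f"validation d-prime: {d_prime(0.89):.3f}")
```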
We can see that the model achieves the following results:
| Results  | Training | Validation |
|----------|----------|------------|
| Accuracy | 54%      | 51%        |
| AUC      | 0.91     | 0.89       |
| d-prime  | 1.882    | 1.740      |
Confusion Matrix
Let's now plot the confusion matrix for the validation dataset.
The confusion matrix lets us see, for every class, not only how many samples were correctly classified, but also which other classes were the samples confused with.
It allows us to calculate the precision and recall for every class.
Precision & recall
For every class:
Recall is the ratio of correctly classified samples, i.e. it shows how many samples of this specific class the model is able to detect. It is the ratio of the diagonal element to the sum of all elements in the corresponding row.
Precision shows how accurate the classifier's predictions are for this class: the proportion of correctly predicted samples among all samples classified as belonging to this class. It is the ratio of the diagonal element to the sum of all elements in the corresponding column.
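A sketch of computing the confusion matrix and the per-class precision and recall on the validation set (class_names, valid_ds and model come from the sketches above):

```python
import numpy as np
import tensorflow as tf

# Collect true and predicted class indices over the validation set.
y_true, y_pred = [], []
for embeddings, labels in valid_ds:
    y_true.append(np.argmax(labels.numpy(), axis=1))
    y_pred.append(np.argmax(model.predict(embeddings, verbose=0), axis=1))
y_true, y_pred = np.concatenate(y_true), np.concatenate(y_pred)

confusion = tf.math.confusion_matrix(
    y_true, y_pred, num_classes=len(class_names)
).numpy()

# Recall: diagonal / row sum; precision: diagonal / column sum.
recall = np.diag(confusion) / confusion.sum(axis=1)
precision = np.diag(confusion) / confusion.sum(axis=0)
for name, p, r in zip(class_names, precision, recall):
    print(f"{name}: precision={p:.2f}, recall={r:.2f}")
```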
Run inference on test data
Let's now run a test on a single audio file, using this example from The Scottish Voice.
We will:
Download the mp3 file.
Convert it to a 16k wav file.
Run the model on the wav file.
Plot the results.
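A sketch of these steps (the mp3 URL and file names are placeholders, and ffmpeg must be available on the system; here we simply average the per-frame predictions instead of plotting them):

```python
import subprocess
import numpy as np
import tensorflow as tf

# 1. Download the mp3 file (placeholder URL for the Scottish Voice sample).
mp3_path = tf.keras.utils.get_file(
    "scottish_sample.mp3", "https://example.com/scottish_sample.mp3"
)

# 2. Convert it to a mono 16 kHz wav file with ffmpeg.
wav_path = "scottish_sample.wav"
subprocess.run(
    ["ffmpeg", "-y", "-i", mp3_path, "-ac", "1", "-ar", "16000", wav_path],
    check=True,
)

# 3. Run Yamnet and then our dense model on the resulting waveform.
audio = load_16k_audio_file(wav_path)
scores, embeddings, mel_spectrogram = yamnet_model(audio)
predictions = model.predict(embeddings, verbose=0)

# 4. Average the per-frame predictions to get an overall accent estimate.
mean_prediction = predictions.mean(axis=0)
print("Predicted accent:", class_names[int(np.argmax(mean_prediction))])
```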
The yamnet_class_names_from_csv function below was copied and very slightly changed from this Yamnet notebook.
Let's run the model on the audio file:
Listen to the audio
The below function was copied from this Yamnet notebook and adjusted to our needs.
This function plots the following:
Audio waveform
Mel spectrogram
Predictions for every time step