Speaker Recognition
Author: Fadi Badine
Date created: 14/06/2020
Last modified: 19/07/2023
Description: Classify speakers using Fast Fourier Transform (FFT) and a 1D Convnet.
Introduction
This example demonstrates how to create a model to classify speakers from the frequency domain representation of speech recordings, obtained via Fast Fourier Transform (FFT).
It shows the following:
How to use tf.data to load, preprocess and feed audio streams into a model
How to create a 1D convolutional network with residual connections for audio classification
Our process:
We prepare a dataset of speech samples from different speakers, with the speaker as label.
We add background noise to these samples to augment our data.
We take the FFT of these samples.
We train a 1D convnet to predict the correct speaker given a noisy FFT speech sample.
Note:
This example should be run with TensorFlow 2.3 or higher, or tf-nightly.
The noise samples in the dataset need to be resampled to a sampling rate of 16000 Hz before using the code in this example. In order to do this, you will need to have ffmpeg installed.
Setup
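Below is a minimal sketch of the imports and constants used throughout this example. The constant names and values (DATASET_ROOT, SAMPLING_RATE, VALID_SPLIT, and so on) are illustrative choices, not requirements of the dataset.

```python
import os
import shutil
from pathlib import Path

import numpy as np
import tensorflow as tf
from tensorflow import keras

# Where the dataset was unpacked; adjust to your local path.
DATASET_ROOT = "16000_pcm_speeches"

# The folders into which we will sort speech and noise samples.
AUDIO_SUBFOLDER = "audio"
NOISE_SUBFOLDER = "noise"

DATASET_AUDIO_PATH = os.path.join(DATASET_ROOT, AUDIO_SUBFOLDER)
DATASET_NOISE_PATH = os.path.join(DATASET_ROOT, NOISE_SUBFOLDER)

# Fraction of samples held out for validation.
VALID_SPLIT = 0.1

# Seed for reproducible shuffling.
SHUFFLE_SEED = 43

# All audio is (re)sampled at 16000 Hz; each sample is 1 second long.
SAMPLING_RATE = 16000

# Proportion at which noise is mixed into the speech samples.
SCALE = 0.5

BATCH_SIZE = 128
EPOCHS = 100
```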
Data preparation
The dataset is composed of 7 folders, divided into 2 groups:
Speech samples, with 5 folders for 5 different speakers. Each folder contains 1500 audio files, each 1 second long and sampled at 16000 Hz.
Background noise samples, with 2 folders and a total of 6 files. These files are longer than 1 second (and originally not sampled at 16000 Hz, but we will resample them to 16000 Hz). We will use those 6 files to create 354 1-second-long noise samples to be used for training.
Let's sort these 2 categories into 2 folders:
An audio folder which will contain all the per-speaker speech sample folders
A noise folder which will contain all the noise samples
Before sorting the audio and noise categories into 2 folders, we have the following directory structure:
After sorting, we end up with the following structure:
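A sketch of the sorting step is shown below. It assumes the two noise folders are named other and _background_noise_, and treats every remaining folder under the dataset root as a per-speaker folder; adjust the names to match your copy of the dataset.

```python
# Assumed layout before sorting (speaker folder names are illustrative):
# main_directory/
# ...speaker_a/ ... speaker_e/    <- speech, one folder per speaker
# ...other/                       <- noise
# ..._background_noise_/          <- noise
#
# Target layout after sorting:
# main_directory/
# ...audio/    <- all speaker folders
# ...noise/    <- all noise folders

os.makedirs(DATASET_AUDIO_PATH, exist_ok=True)
os.makedirs(DATASET_NOISE_PATH, exist_ok=True)

for folder in os.listdir(DATASET_ROOT):
    if os.path.isdir(os.path.join(DATASET_ROOT, folder)):
        if folder in [AUDIO_SUBFOLDER, NOISE_SUBFOLDER]:
            # Target folder already exists; skip it.
            continue
        elif folder in ["other", "_background_noise_"]:
            # Noise folders go under noise/.
            shutil.move(
                os.path.join(DATASET_ROOT, folder),
                os.path.join(DATASET_NOISE_PATH, folder),
            )
        else:
            # Everything else is a per-speaker folder; move it under audio/.
            shutil.move(
                os.path.join(DATASET_ROOT, folder),
                os.path.join(DATASET_AUDIO_PATH, folder),
            )
```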
Noise preparation
In this section:
We load all noise samples (which should have been resampled to 16000 Hz)
We split those noise samples into chunks of 16000 samples, each corresponding to 1 second of audio
Resample all noise samples to 16000 Hz
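The sketch below covers all three steps. The helper names (resample_to_16k, load_noise_sample) are illustrative: resampling shells out to ffmpeg, and any file whose sampling rate still isn't 16000 Hz is skipped.

```python
import subprocess

def resample_to_16k(path):
    # Re-encode a wav file to 16000 Hz using ffmpeg (must be on the PATH).
    # ffmpeg cannot write to its own input, so go through a temp file.
    tmp_path = path + ".tmp.wav"
    subprocess.run(
        ["ffmpeg", "-hide_banner", "-loglevel", "error", "-y",
         "-i", path, "-ar", str(SAMPLING_RATE), tmp_path],
        check=True,
    )
    shutil.move(tmp_path, path)

def load_noise_sample(path):
    # Decode the wav file, keeping a single channel.
    sample, sampling_rate = tf.audio.decode_wav(
        tf.io.read_file(path), desired_channels=1
    )
    if sampling_rate == SAMPLING_RATE:
        # Split the (longer than 1 second) clip into 1-second chunks.
        slices = int(sample.shape[0] / SAMPLING_RATE)
        return tf.split(sample[: slices * SAMPLING_RATE], slices)
    print(f"Sampling rate for {path} is incorrect. Ignoring it")
    return None

# Gather every 1-second noise chunk into a single tensor.
noise_paths = [str(p) for p in Path(DATASET_NOISE_PATH).glob("**/*.wav")]
noises = []
for path in noise_paths:
    sample = load_noise_sample(path)
    if sample:
        noises.extend(sample)
noises = tf.stack(noises)
```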
Dataset generation
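A condensed sketch of the dataset-building helpers follows. The function names are illustrative and mirror the steps described in the introduction: decode each wav file, pair it with its speaker label, mix in a randomly chosen noise chunk, and keep the magnitudes of the positive FFT frequencies. It builds on the setup constants and the noises tensor from the previous step.

```python
def path_to_audio(path):
    # Read and decode a 1-second, 16000 Hz wav file (mono).
    audio, _ = tf.audio.decode_wav(tf.io.read_file(path), 1, SAMPLING_RATE)
    return audio

def paths_and_labels_to_dataset(audio_paths, labels):
    # Pair each decoded audio tensor with its speaker label.
    path_ds = tf.data.Dataset.from_tensor_slices(audio_paths)
    audio_ds = path_ds.map(path_to_audio)
    label_ds = tf.data.Dataset.from_tensor_slices(labels)
    return tf.data.Dataset.zip((audio_ds, label_ds))

def add_noise(audio, noises=None, scale=0.5):
    if noises is not None:
        # Pick one random noise chunk per sample in the batch.
        tf_rnd = tf.random.uniform(
            (tf.shape(audio)[0],), 0, noises.shape[0], dtype=tf.int32
        )
        noise = tf.gather(noises, tf_rnd, axis=0)
        # Scale the noise so its amplitude is proportional to the speech.
        prop = tf.math.reduce_max(audio, axis=1) / tf.math.reduce_max(noise, axis=1)
        prop = tf.repeat(tf.expand_dims(prop, axis=1), tf.shape(audio)[1], axis=1)
        audio = audio + noise * prop * scale
    return audio

def audio_to_fft(audio):
    # FFT of the real signal; keep only the positive-frequency magnitudes.
    audio = tf.squeeze(audio, axis=-1)
    fft = tf.signal.fft(
        tf.cast(tf.complex(real=audio, imag=tf.zeros_like(audio)), tf.complex64)
    )
    fft = tf.expand_dims(fft, axis=-1)
    return tf.math.abs(fft[:, : (audio.shape[1] // 2), :])
```

A training pipeline would then shuffle and batch the path/label dataset and map these helpers over the batches, e.g. train_ds = train_ds.map(lambda x, y: (audio_to_fft(add_noise(x, noises, scale=SCALE)), y)).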
Model Definition
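One plausible realization of such a network is sketched below: each residual block pairs a stack of Conv1D layers with a 1x1 shortcut convolution, followed by max pooling. The depths and filter counts are illustrative choices.

```python
def residual_block(x, filters, conv_num=3, activation="relu"):
    # Shortcut branch: a 1x1 convolution to match the filter count.
    s = keras.layers.Conv1D(filters, 1, padding="same")(x)
    for _ in range(conv_num - 1):
        x = keras.layers.Conv1D(filters, 3, padding="same")(x)
        x = keras.layers.Activation(activation)(x)
    x = keras.layers.Conv1D(filters, 3, padding="same")(x)
    # Residual connection, then downsample.
    x = keras.layers.Add()([x, s])
    x = keras.layers.Activation(activation)(x)
    return keras.layers.MaxPool1D(pool_size=2, strides=2)(x)

def build_model(input_shape, num_classes):
    inputs = keras.layers.Input(shape=input_shape, name="input")
    x = residual_block(inputs, 16, 2)
    x = residual_block(x, 32, 2)
    x = residual_block(x, 64, 3)
    x = residual_block(x, 128, 3)
    x = residual_block(x, 128, 3)
    x = keras.layers.AveragePooling1D(pool_size=3, strides=3)(x)
    x = keras.layers.Flatten()(x)
    x = keras.layers.Dense(256, activation="relu")(x)
    x = keras.layers.Dense(128, activation="relu")(x)
    outputs = keras.layers.Dense(num_classes, activation="softmax", name="output")(x)
    return keras.models.Model(inputs=inputs, outputs=outputs)

# Speaker names are the folder names under audio/ (their order defines the labels).
class_names = os.listdir(DATASET_AUDIO_PATH)

# The input is the positive half of the FFT: SAMPLING_RATE // 2 bins, 1 channel.
model = build_model((SAMPLING_RATE // 2, 1), len(class_names))
model.compile(
    optimizer="Adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
```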
Training
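A sketch of the training loop, assuming train_ds and valid_ds were built as outlined under Dataset generation; the callback settings are illustrative.

```python
# Stop early when validation performance plateaus, and keep the best weights.
model_save_filename = "model.keras"
earlystopping_cb = keras.callbacks.EarlyStopping(
    patience=10, restore_best_weights=True
)
mdlcheckpoint_cb = keras.callbacks.ModelCheckpoint(
    model_save_filename, monitor="val_accuracy", save_best_only=True
)

history = model.fit(
    train_ds,
    epochs=EPOCHS,
    validation_data=valid_ds,
    callbacks=[earlystopping_cb, mdlcheckpoint_cb],
)
```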
Evaluation
We get ~98% validation accuracy.
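For instance, a quick check on the validation set (assuming valid_ds from the training step):

```python
# Returns [loss, accuracy] on the validation set.
print(model.evaluate(valid_ds))
```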
Demonstration
Let's take some samples and:
Predict the speaker
Compare the prediction with the real speaker
Listen to the audio to see that despite the samples being noisy, the model is still pretty accurate
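A sketch of that demonstration follows. It assumes valid_audio_paths and valid_labels (the validation file paths and speaker indices) are available from the dataset step, and uses IPython's Audio widget for playback in the notebook.

```python
from IPython.display import Audio, display

SAMPLES_TO_DISPLAY = 10

# Rebuild a noisy test batch from the validation paths and labels.
test_ds = paths_and_labels_to_dataset(valid_audio_paths, valid_labels)
test_ds = test_ds.shuffle(buffer_size=BATCH_SIZE * 8, seed=SHUFFLE_SEED).batch(
    BATCH_SIZE
)
test_ds = test_ds.map(lambda x, y: (add_noise(x, noises, scale=SCALE), y))

for audios, labels in test_ds.take(1):
    # Predict the speaker from the FFT of the noisy audio.
    ffts = audio_to_fft(audios)
    y_pred = np.argmax(model.predict(ffts), axis=-1)
    # Pick a few random samples to inspect.
    rnd = np.random.randint(0, BATCH_SIZE, SAMPLES_TO_DISPLAY)
    audios = audios.numpy()[rnd, :, :]
    labels = labels.numpy()[rnd]
    y_pred = y_pred[rnd]
    for index in range(SAMPLES_TO_DISPLAY):
        # Compare the true speaker with the prediction.
        print(
            "Speaker: {} - Predicted: {}".format(
                class_names[labels[index]], class_names[y_pred[index]]
            )
        )
        # Play back the noisy sample.
        display(Audio(audios[index, :, :].squeeze(), rate=SAMPLING_RATE))
```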