"""
2
Title: English speaker accent recognition using Transfer Learning
3
Author: [Fadi Badine](https://twitter.com/fadibadine)
4
Date created: 2022/04/16
5
Last modified: 2022/04/16
6
Description: Training a model to classify UK & Ireland accents using feature extraction from Yamnet.
7
Accelerator: GPU
8
"""
9
10
"""
11
## Introduction
12
13
The following example shows how to use feature extraction in order to
14
train a model to classify the English accent spoken in an audio wave.
15
16
Instead of training a model from scratch, transfer learning enables us to
17
take advantage of existing state-of-the-art deep learning models and use them as feature extractors.
18
19
Our process:
20
21
* Use a TF Hub pre-trained model (Yamnet) and apply it as part of the tf.data pipeline which transforms
22
the audio files into feature vectors.
23
* Train a dense model on the feature vectors.
24
* Use the trained model for inference on a new audio file.
25
26
Note:
27
28
* We need to install TensorFlow IO in order to resample audio files to 16 kHz as required by Yamnet model.
29
* In the test section, ffmpeg is used to convert the mp3 file to wav.
30
31
You can install TensorFlow IO with the following command:
32
"""
33
34
"""shell
35
pip install -U -q tensorflow_io
36
"""
37
38
"""
39
## Configuration
40
"""
41
42
SEED = 1337
43
EPOCHS = 100
44
BATCH_SIZE = 64
45
VALIDATION_RATIO = 0.1
46
MODEL_NAME = "uk_irish_accent_recognition"
47
48
# Location where the dataset will be downloaded.
49
# By default (None), keras.utils.get_file will use ~/.keras/ as the CACHE_DIR
50
CACHE_DIR = None
51
52
# The location of the dataset
53
URL_PATH = "https://www.openslr.org/resources/83/"
54
55
# List of datasets compressed files that contain the audio files
56
zip_files = {
57
0: "irish_english_male.zip",
58
1: "midlands_english_female.zip",
59
2: "midlands_english_male.zip",
60
3: "northern_english_female.zip",
61
4: "northern_english_male.zip",
62
5: "scottish_english_female.zip",
63
6: "scottish_english_male.zip",
64
7: "southern_english_female.zip",
65
8: "southern_english_male.zip",
66
9: "welsh_english_female.zip",
67
10: "welsh_english_male.zip",
68
}
69
70
# We see that there are 2 compressed files for each accent (except Irish):
71
# - One for male speakers
72
# - One for female speakers
73
# However, we will be using a gender agnostic dataset.
74
75
# List of gender agnostic categories
76
gender_agnostic_categories = [
77
"ir", # Irish
78
"mi", # Midlands
79
"no", # Northern
80
"sc", # Scottish
81
"so", # Southern
82
"we", # Welsh
83
]
84
85
class_names = [
86
"Irish",
87
"Midlands",
88
"Northern",
89
"Scottish",
90
"Southern",
91
"Welsh",
92
"Not a speech",
93
]
94
95
"""
96
## Imports
97
"""
98
99
import os
100
import io
101
import csv
102
import numpy as np
103
import pandas as pd
104
import tensorflow as tf
105
import tensorflow_hub as hub
106
import tensorflow_io as tfio
107
from tensorflow import keras
108
import matplotlib.pyplot as plt
109
import seaborn as sns
110
from scipy import stats
111
from IPython.display import Audio
112
113
114
# Set all random seeds in order to get reproducible results
115
keras.utils.set_random_seed(SEED)
116
117
# Where to download the dataset
118
DATASET_DESTINATION = os.path.join(CACHE_DIR if CACHE_DIR else "~/.keras/", "datasets")
119
120
"""
121
## Yamnet Model
122
123
Yamnet is an audio event classifier trained on the AudioSet dataset to predict audio
124
events from the AudioSet ontology. It is available on TensorFlow Hub.
125
126
Yamnet accepts a 1-D tensor of audio samples with a sample rate of 16 kHz.
127
As output, the model returns a 3-tuple:
128
129
* Scores of shape `(N, 521)` representing the scores of the 521 classes.
130
* Embeddings of shape `(N, 1024)`.
131
* The log-mel spectrogram of the entire audio frame.
132
133
We will use the embeddings, which are the features extracted from the audio samples, as the input to our dense model.
134
135
For more detailed information about Yamnet, please refer to its [TensorFlow Hub](https://tfhub.dev/google/yamnet/1) page.
136
"""
137
138
yamnet_model = hub.load("https://tfhub.dev/google/yamnet/1")
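"""
To make the 3-tuple described above concrete, here is a quick, optional sanity check
that runs Yamnet on one second of silence (an illustrative input, not part of the dataset)
and prints the shapes of the returned scores, embeddings and log-mel spectrogram:
"""

_scores, _embeddings, _spectrogram = yamnet_model(tf.zeros(16000, dtype=tf.float32))
print(
    "scores:", _scores.shape,
    "embeddings:", _embeddings.shape,
    "log-mel spectrogram:", _spectrogram.shape,
)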
"""
## Dataset

The dataset used is the
[Crowdsourced high-quality UK and Ireland English Dialect speech data set](https://openslr.org/83/)
which consists of a total of 17,877 high-quality audio wav files.

This dataset includes over 31 hours of recordings from 120 volunteers who self-identify as
native speakers of Southern England, Midlands, Northern England, Wales, Scotland and Ireland.

For more information, please refer to the above link or to the following paper:
[Open-source Multi-speaker Corpora of the English Accents in the British Isles](https://aclanthology.org/2020.lrec-1.804.pdf)
"""

"""
## Download the data
"""

# CSV file that contains information about the dataset. For each entry, we have:
# - ID
# - wav file name
# - transcript
line_index_file = keras.utils.get_file(
    fname="line_index_file", origin=URL_PATH + "line_index_all.csv"
)

# Download the compressed files that contain the audio wav files
for i in zip_files:
    fname = zip_files[i].split(".")[0]
    url = URL_PATH + zip_files[i]

    zip_file = keras.utils.get_file(fname=fname, origin=url, extract=True)
    os.remove(zip_file)

"""
## Load the data in a DataFrame

Of the 3 columns (ID, filename and transcript), we are only interested in the `filename` column in order to read the audio file.
We will ignore the other two.
"""

dataframe = pd.read_csv(
    line_index_file, names=["id", "filename", "transcript"], usecols=["filename"]
)
dataframe.head()
"""
187
Let's now preprocess the dataset by:
188
189
* Adjusting the filename (removing a leading space & adding ".wav" extension to the
190
filename).
191
* Creating a label using the first 2 characters of the filename which indicate the
192
accent.
193
* Shuffling the samples.
194
"""
195
196
197
# The purpose of this function is to preprocess the dataframe by applying the following:
198
# - Cleaning the filename from a leading space
199
# - Generating a label column that is gender agnostic i.e.
200
# welsh english male and welsh english female for example are both labeled as
201
# welsh english
202
# - Add extension .wav to the filename
203
# - Shuffle samples
204
def preprocess_dataframe(dataframe):
205
# Remove leading space in filename column
206
dataframe["filename"] = dataframe.apply(lambda row: row["filename"].strip(), axis=1)
207
208
# Create gender agnostic labels based on the filename first 2 letters
209
dataframe["label"] = dataframe.apply(
210
lambda row: gender_agnostic_categories.index(row["filename"][:2]), axis=1
211
)
212
213
# Add the file path to the name
214
dataframe["filename"] = dataframe.apply(
215
lambda row: os.path.join(DATASET_DESTINATION, row["filename"] + ".wav"), axis=1
216
)
217
218
# Shuffle the samples
219
dataframe = dataframe.sample(frac=1, random_state=SEED).reset_index(drop=True)
220
221
return dataframe
222
223
224
dataframe = preprocess_dataframe(dataframe)
225
dataframe.head()
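"""
Before splitting the data, it can help to glance at how many audio files each accent has.
This is an optional check (not part of the original preprocessing), and it already hints at
the class imbalance that we will compensate for later with class weights:
"""

print(dataframe["label"].map(lambda label: class_names[label]).value_counts())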
"""
## Prepare training & validation sets

Let's split the samples to create the training and validation sets.
"""

split = int(len(dataframe) * (1 - VALIDATION_RATIO))
train_df = dataframe[:split]
valid_df = dataframe[split:]

print(
    f"We have {train_df.shape[0]} training samples & {valid_df.shape[0]} validation ones"
)
"""
242
## Prepare a TensorFlow Dataset
243
244
Next, we need to create a `tf.data.Dataset`.
245
This is done by creating a `dataframe_to_dataset` function that does the following:
246
247
* Create a dataset using filenames and labels.
248
* Get the Yamnet embeddings by calling another function `filepath_to_embeddings`.
249
* Apply caching, reshuffling and setting batch size.
250
251
The `filepath_to_embeddings` does the following:
252
253
* Load audio file.
254
* Resample audio to 16 kHz.
255
* Generate scores and embeddings from Yamnet model.
256
* Since Yamnet generates multiple samples for each audio file,
257
this function also duplicates the label for all the generated samples
258
that have `score=0` (speech) whereas sets the label for the others as
259
'other' indicating that this audio segment is not a speech and we won't label it as one of the accents.
260
261
The below `load_16k_audio_file` is copied from the following tutorial
262
[Transfer learning with YAMNet for environmental sound classification](https://www.tensorflow.org/tutorials/audio/transfer_learning_audio)
263
"""
264
265
266
@tf.function
def load_16k_audio_wav(filename):
    # Read file content
    file_content = tf.io.read_file(filename)

    # Decode audio wave
    audio_wav, sample_rate = tf.audio.decode_wav(file_content, desired_channels=1)
    audio_wav = tf.squeeze(audio_wav, axis=-1)
    sample_rate = tf.cast(sample_rate, dtype=tf.int64)

    # Resample to 16k
    audio_wav = tfio.audio.resample(audio_wav, rate_in=sample_rate, rate_out=16000)

    return audio_wav


def filepath_to_embeddings(filename, label):
    # Load 16k audio wave
    audio_wav = load_16k_audio_wav(filename)

    # Get audio embeddings & scores.
    # The embeddings are the audio features extracted using transfer learning,
    # while the scores will be used to identify the time slots that are not speech,
    # which will then be gathered into the extra 'Not a speech' category
    scores, embeddings, _ = yamnet_model(audio_wav)

    # Number of embeddings, in order to know how many times to repeat the label
    embeddings_num = tf.shape(embeddings)[0]
    labels = tf.repeat(label, embeddings_num)

    # Change the labels of the time slots that are not speech into the extra
    # 'Not a speech' category (the last class)
    labels = tf.where(tf.argmax(scores, axis=1) == 0, labels, len(class_names) - 1)

    # Using one-hot in order to use AUC
    return (embeddings, tf.one_hot(labels, len(class_names)))


def dataframe_to_dataset(dataframe, batch_size=64):
    dataset = tf.data.Dataset.from_tensor_slices(
        (dataframe["filename"], dataframe["label"])
    )

    dataset = dataset.map(
        lambda x, y: filepath_to_embeddings(x, y),
        num_parallel_calls=tf.data.AUTOTUNE,
    ).unbatch()

    return dataset.cache().batch(batch_size).prefetch(tf.data.AUTOTUNE)


train_ds = dataframe_to_dataset(train_df)
valid_ds = dataframe_to_dataset(valid_df)
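"""
As an optional check (not part of the original pipeline), we can take a single batch from
the training dataset to confirm that it yields `(batch, 1024)` Yamnet embeddings together
with `(batch, 7)` one-hot labels. Note that this triggers the audio preprocessing for that batch:
"""

for batch_embeddings, batch_labels in train_ds.take(1):
    print("embeddings:", batch_embeddings.shape, "labels:", batch_labels.shape)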
"""
## Build the model

The model that we use consists of:

* An input layer, which is the embedding output of the Yamnet classifier.
* 4 dense hidden layers and 4 dropout layers.
* An output dense layer.

The model's hyperparameters were selected using
[KerasTuner](https://keras.io/keras_tuner/).
"""

keras.backend.clear_session()


def build_and_compile_model():
    inputs = keras.layers.Input(shape=(1024,), name="embedding")

    x = keras.layers.Dense(256, activation="relu", name="dense_1")(inputs)
    x = keras.layers.Dropout(0.15, name="dropout_1")(x)

    x = keras.layers.Dense(384, activation="relu", name="dense_2")(x)
    x = keras.layers.Dropout(0.2, name="dropout_2")(x)

    x = keras.layers.Dense(192, activation="relu", name="dense_3")(x)
    x = keras.layers.Dropout(0.25, name="dropout_3")(x)

    x = keras.layers.Dense(384, activation="relu", name="dense_4")(x)
    x = keras.layers.Dropout(0.2, name="dropout_4")(x)

    outputs = keras.layers.Dense(len(class_names), activation="softmax", name="output")(
        x
    )

    model = keras.Model(inputs=inputs, outputs=outputs, name="accent_recognition")

    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=1.9644e-5),
        loss=keras.losses.CategoricalCrossentropy(),
        metrics=["accuracy", keras.metrics.AUC(name="auc")],
    )

    return model


model = build_and_compile_model()
model.summary()
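"""
The original KerasTuner search that produced the hyperparameters above is not included in
this example. As a purely illustrative sketch (the search space, ranges and tuner settings
below are assumptions, not the ones actually used), a comparable search could be set up
like this:
"""


def build_hypermodel(hp):
    # `hp` is a `keras_tuner.HyperParameters` object supplied by the tuner.
    inputs = keras.layers.Input(shape=(1024,), name="embedding")
    x = inputs
    for i in range(hp.Int("num_layers", min_value=2, max_value=5)):
        x = keras.layers.Dense(
            hp.Int(f"units_{i}", min_value=128, max_value=512, step=64),
            activation="relu",
        )(x)
        x = keras.layers.Dropout(hp.Float(f"dropout_{i}", 0.1, 0.4, step=0.05))(x)
    outputs = keras.layers.Dense(len(class_names), activation="softmax")(x)
    model = keras.Model(inputs, outputs)
    model.compile(
        optimizer=keras.optimizers.Adam(
            hp.Float("learning_rate", 1e-5, 1e-3, sampling="log")
        ),
        loss=keras.losses.CategoricalCrossentropy(),
        metrics=["accuracy", keras.metrics.AUC(name="auc")],
    )
    return model


# Example usage (requires `pip install keras-tuner`):
# import keras_tuner
#
# tuner = keras_tuner.RandomSearch(
#     build_hypermodel,
#     objective=keras_tuner.Objective("val_auc", direction="max"),
#     max_trials=20,
# )
# tuner.search(train_ds, validation_data=valid_ds, epochs=10)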
"""
## Class weights calculation

Since the dataset is quite unbalanced, we will use the `class_weight` argument during training.

Getting the class weights is a little tricky because even though we know the number of
audio files for each class, it does not represent the number of samples for that class,
since Yamnet transforms each audio file into multiple audio samples of 0.96 seconds each.
So every audio file will be split into a number of samples that is proportional to its length.

Therefore, to get those weights, we have to calculate the number of samples for each class
after preprocessing through Yamnet.
"""

class_counts = tf.zeros(shape=(len(class_names),), dtype=tf.int32)

for x, y in iter(train_ds):
    class_counts = class_counts + tf.math.bincount(
        tf.cast(tf.math.argmax(y, axis=1), tf.int32), minlength=len(class_names)
    )

class_weight = {
    i: tf.math.reduce_sum(class_counts).numpy() / class_counts[i].numpy()
    for i in range(len(class_counts))
}

print(class_weight)
"""
397
## Callbacks
398
399
We use Keras callbacks in order to:
400
401
* Stop whenever the validation AUC stops improving.
402
* Save the best model.
403
* Call TensorBoard in order to later view the training and validation logs.
404
"""
405
406
early_stopping_cb = keras.callbacks.EarlyStopping(
407
monitor="val_auc", patience=10, restore_best_weights=True
408
)
409
410
model_checkpoint_cb = keras.callbacks.ModelCheckpoint(
411
MODEL_NAME + ".h5", monitor="val_auc", save_best_only=True
412
)
413
414
tensorboard_cb = keras.callbacks.TensorBoard(
415
os.path.join(os.curdir, "logs", model.name)
416
)
417
418
callbacks = [early_stopping_cb, model_checkpoint_cb, tensorboard_cb]
419
420
"""
421
## Training
422
"""
423
424
history = model.fit(
    train_ds,
    epochs=EPOCHS,
    validation_data=valid_ds,
    class_weight=class_weight,
    callbacks=callbacks,
    verbose=2,
)

"""
## Results

Let's plot the training and validation AUC and accuracy.
"""
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(14, 5))

# Early stopping may end training before EPOCHS epochs, so plot against the
# number of epochs that were actually run.
epochs_range = range(len(history.history["accuracy"]))

axs[0].plot(epochs_range, history.history["accuracy"], label="Training")
axs[0].plot(epochs_range, history.history["val_accuracy"], label="Validation")
axs[0].set_xlabel("Epochs")
axs[0].set_title("Training & Validation Accuracy")
axs[0].legend()
axs[0].grid(True)

axs[1].plot(epochs_range, history.history["auc"], label="Training")
axs[1].plot(epochs_range, history.history["val_auc"], label="Validation")
axs[1].set_xlabel("Epochs")
axs[1].set_title("Training & Validation AUC")
axs[1].legend()
axs[1].grid(True)

plt.show()

"""
## Evaluation
"""
train_loss, train_acc, train_auc = model.evaluate(train_ds)
valid_loss, valid_acc, valid_auc = model.evaluate(valid_ds)

"""
Let's compare our model's performance to Yamnet's, using one of Yamnet's own metrics (d-prime).
Yamnet achieved a d-prime value of 2.318.
Let's check our model's performance.
"""


# The following function calculates the d-prime score from the AUC
def d_prime(auc):
    standard_normal = stats.norm()
    d_prime = standard_normal.ppf(auc) * np.sqrt(2.0)
    return d_prime


print(
    "train d-prime: {0:.3f}, validation d-prime: {1:.3f}".format(
        d_prime(train_auc), d_prime(valid_auc)
    )
)
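"""
As a quick sanity check on this formula (not part of the original example), an AUC of 0.5,
i.e. chance level, should map to a d-prime of 0:
"""

print("d-prime at chance level (AUC = 0.5): {0:.3f}".format(d_prime(0.5)))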
"""
We can see that the model achieves the following results:

Results  | Training | Validation
---------|----------|-----------
Accuracy | 54%      | 51%
AUC      | 0.91     | 0.89
d-prime  | 1.882    | 1.740

"""

"""
## Confusion Matrix

Let's now plot the confusion matrix for the validation dataset.

The confusion matrix lets us see, for every class, not only how many samples were correctly classified,
but also which other classes the samples were confused with.

It allows us to calculate the precision and recall for every class.
"""

# Create x and y tensors
x_valid = None
y_valid = None

for x, y in iter(valid_ds):
    if x_valid is None:
        x_valid = x.numpy()
        y_valid = y.numpy()
    else:
        x_valid = np.concatenate((x_valid, x.numpy()), axis=0)
        y_valid = np.concatenate((y_valid, y.numpy()), axis=0)

# Generate predictions
y_pred = model.predict(x_valid)

# Calculate the confusion matrix
confusion_mtx = tf.math.confusion_matrix(
    np.argmax(y_valid, axis=1), np.argmax(y_pred, axis=1)
)

# Plot the confusion matrix
plt.figure(figsize=(10, 8))
sns.heatmap(
    confusion_mtx, xticklabels=class_names, yticklabels=class_names, annot=True, fmt="g"
)
plt.xlabel("Prediction")
plt.ylabel("Label")
plt.title("Validation Confusion Matrix")
plt.show()
"""
537
## Precision & recall
538
539
For every class:
540
541
* Recall is the ratio of correctly classified samples i.e. it shows how many samples
542
of this specific class, the model is able to detect.
543
It is the ratio of diagonal elements to the sum of all elements in the row.
544
* Precision shows the accuracy of the classifier. It is the ratio of correctly predicted
545
samples among the ones classified as belonging to this class.
546
It is the ratio of diagonal elements to the sum of all elements in the column.
547
"""
548
549
for i, label in enumerate(class_names):
550
precision = confusion_mtx[i, i] / np.sum(confusion_mtx[:, i])
551
recall = confusion_mtx[i, i] / np.sum(confusion_mtx[i, :])
552
print(
553
"{0:15} Precision:{1:.2f}%; Recall:{2:.2f}%".format(
554
label, precision * 100, recall * 100
555
)
556
)
557
558
"""
559
## Run inference on test data
560
561
Let's now run a test on a single audio file.
562
Let's check this example from [The Scottish Voice](https://www.thescottishvoice.org.uk/home/)
563
564
We will:
565
566
* Download the mp3 file.
567
* Convert it to a 16k wav file.
568
* Run the model on the wav file.
569
* Plot the results.
570
"""
571
572
filename = "audio-sample-Stuart"
573
url = "https://www.thescottishvoice.org.uk/files/cm/files/"
574
575
if os.path.exists(filename + ".wav") == False:
576
print(f"Downloading {filename}.mp3 from {url}")
577
command = f"wget {url}{filename}.mp3"
578
os.system(command)
579
580
print(f"Converting mp3 to wav and resampling to 16 kHZ")
581
command = (
582
f"ffmpeg -hide_banner -loglevel panic -y -i {filename}.mp3 -acodec "
583
f"pcm_s16le -ac 1 -ar 16000 {filename}.wav"
584
)
585
os.system(command)
586
587
filename = filename + ".wav"
588
589
590
"""
591
The below function `yamnet_class_names_from_csv` was copied and very slightly changed
592
from this [Yamnet Notebook](https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/yamnet.ipynb).
593
"""
594
595
596
def yamnet_class_names_from_csv(yamnet_class_map_csv_text):
    """Returns list of class names corresponding to score vector."""
    yamnet_class_map_csv = io.StringIO(yamnet_class_map_csv_text)
    yamnet_class_names = [
        name for (class_index, mid, name) in csv.reader(yamnet_class_map_csv)
    ]
    yamnet_class_names = yamnet_class_names[1:]  # Skip CSV header
    return yamnet_class_names


yamnet_class_map_path = yamnet_model.class_map_path().numpy()
yamnet_class_names = yamnet_class_names_from_csv(
    tf.io.read_file(yamnet_class_map_path).numpy().decode("utf-8")
)


def calculate_number_of_non_speech(scores):
    number_of_non_speech = tf.math.reduce_sum(
        tf.where(tf.math.argmax(scores, axis=1, output_type=tf.int32) != 0, 1, 0)
    )

    return number_of_non_speech


def filename_to_predictions(filename):
    # Load 16k audio wave
    audio_wav = load_16k_audio_wav(filename)

    # Get audio embeddings & scores.
    scores, embeddings, mel_spectrogram = yamnet_model(audio_wav)

    print(
        "Out of {} samples, {} are not speech".format(
            scores.shape[0], calculate_number_of_non_speech(scores)
        )
    )

    # Predict the output of the accent recognition model with embeddings as input
    predictions = model.predict(embeddings)

    return audio_wav, predictions, mel_spectrogram
"""
640
Let's run the model on the audio file:
641
"""
642
643
audio_wav, predictions, mel_spectrogram = filename_to_predictions(filename)
644
645
infered_class = class_names[predictions.mean(axis=0).argmax()]
646
print(f"The main accent is: {infered_class} English")
647
648
"""
649
Listen to the audio
650
"""
651
652
Audio(audio_wav, rate=16000)
653
654
"""
655
The below function was copied from this [Yamnet notebook](tinyurl.com/4a8xn7at) and adjusted to our need.
656
657
This function plots the following:
658
659
* Audio waveform
660
* Mel spectrogram
661
* Predictions for every time step
662
"""
663
664
plt.figure(figsize=(10, 6))
665
666
# Plot the waveform.
667
plt.subplot(3, 1, 1)
668
plt.plot(audio_wav)
669
plt.xlim([0, len(audio_wav)])
670
671
# Plot the log-mel spectrogram (returned by the model).
672
plt.subplot(3, 1, 2)
673
plt.imshow(
674
mel_spectrogram.numpy().T, aspect="auto", interpolation="nearest", origin="lower"
675
)
676
677
# Plot and label the model output scores for the top-scoring classes.
678
mean_predictions = np.mean(predictions, axis=0)
679
680
top_class_indices = np.argsort(mean_predictions)[::-1]
681
plt.subplot(3, 1, 3)
682
plt.imshow(
683
predictions[:, top_class_indices].T,
684
aspect="auto",
685
interpolation="nearest",
686
cmap="gray_r",
687
)
688
689
# patch_padding = (PATCH_WINDOW_SECONDS / 2) / PATCH_HOP_SECONDS
690
# values from the model documentation
691
patch_padding = (0.025 / 2) / 0.01
692
plt.xlim([-patch_padding - 0.5, predictions.shape[0] + patch_padding - 0.5])
693
# Label the top_N classes.
694
yticks = range(0, len(class_names), 1)
695
plt.yticks(yticks, [class_names[top_class_indices[x]] for x in yticks])
696
_ = plt.ylim(-0.5 + np.array([len(class_names), 0]))
697
698