CoCalc -- uk_ireland_accent

GitHub Repository: keras-team/keras-io
Path: blob/master/examples/audio/uk_ireland_accent_recognition.py
³⁵⁰⁷ views
1
"""
2
Title: English speaker accent recognition using Transfer Learning
3
Author: [Fadi Badine](https://twitter.com/fadibadine)
4
Date created: 2022/04/16
5
Last modified: 2022/04/16
6
Description: Training a model to classify UK & Ireland accents using feature extraction from Yamnet.
7
Accelerator: GPU
8
"""
9

10
"""
11
## Introduction
12

13
The following example shows how to use feature extraction in order to
14
train a model to classify the English accent spoken in an audio wave.
15

16
Instead of training a model from scratch, transfer learning enables us to
17
take advantage of existing state-of-the-art deep learning models and use them as feature extractors.
18

19
Our process:
20

21
* Use a TF Hub pre-trained model (Yamnet) and apply it as part of the tf.data pipeline which transforms
22
the audio files into feature vectors.
23
* Train a dense model on the feature vectors.
24
* Use the trained model for inference on a new audio file.
25

26
Note:
27

28
* We need to install TensorFlow IO in order to resample audio files to 16 kHz as required by Yamnet model.
29
* In the test section, ffmpeg is used to convert the mp3 file to wav.
30

31
You can install TensorFlow IO with the following command:
32
"""
33

34
"""shell
35
pip install -U -q tensorflow_io
36
"""
37

38
"""
39
## Configuration
40
"""
41

42
SEED = 1337
43
EPOCHS = 100
44
BATCH_SIZE = 64
45
VALIDATION_RATIO = 0.1
46
MODEL_NAME = "uk_irish_accent_recognition"
47

48
# Location where the dataset will be downloaded.
49
# By default (None), keras.utils.get_file will use ~/.keras/ as the CACHE_DIR
50
CACHE_DIR = None
51

52
# The location of the dataset
53
URL_PATH = "https://www.openslr.org/resources/83/"
54

55
# List of datasets compressed files that contain the audio files
56
zip_files = {
57
    0: "irish_english_male.zip",
58
    1: "midlands_english_female.zip",
59
    2: "midlands_english_male.zip",
60
    3: "northern_english_female.zip",
61
    4: "northern_english_male.zip",
62
    5: "scottish_english_female.zip",
63
    6: "scottish_english_male.zip",
64
    7: "southern_english_female.zip",
65
    8: "southern_english_male.zip",
66
    9: "welsh_english_female.zip",
67
    10: "welsh_english_male.zip",
68
}
69

70
# We see that there are 2 compressed files for each accent (except Irish):
71
# - One for male speakers
72
# - One for female speakers
73
# However, we will be using a gender agnostic dataset.
74

75
# List of gender agnostic categories
76
gender_agnostic_categories = [
77
    "ir",  # Irish
78
    "mi",  # Midlands
79
    "no",  # Northern
80
    "sc",  # Scottish
81
    "so",  # Southern
82
    "we",  # Welsh
83
]
84

85
class_names = [
86
    "Irish",
87
    "Midlands",
88
    "Northern",
89
    "Scottish",
90
    "Southern",
91
    "Welsh",
92
    "Not a speech",
93
]
94

95
"""
96
## Imports
97
"""
98

99
import os
100
import io
101
import csv
102
import numpy as np
103
import pandas as pd
104
import tensorflow as tf
105
import tensorflow_hub as hub
106
import tensorflow_io as tfio
107
from tensorflow import keras
108
import matplotlib.pyplot as plt
109
import seaborn as sns
110
from scipy import stats
111
from IPython.display import Audio
112

113

114
# Set all random seeds in order to get reproducible results
115
keras.utils.set_random_seed(SEED)
116

117
# Where to download the dataset
118
DATASET_DESTINATION = os.path.join(CACHE_DIR if CACHE_DIR else "~/.keras/", "datasets")
119

120
"""
121
## Yamnet Model
122

123
Yamnet is an audio event classifier trained on the AudioSet dataset to predict audio
124
events from the AudioSet ontology. It is available on TensorFlow Hub.
125

126
Yamnet accepts a 1-D tensor of audio samples with a sample rate of 16 kHz.
127
As output, the model returns a 3-tuple:
128

129
* Scores of shape `(N, 521)` representing the scores of the 521 classes.
130
* Embeddings of shape `(N, 1024)`.
131
* The log-mel spectrogram of the entire audio frame.
132

133
We will use the embeddings, which are the features extracted from the audio samples, as the input to our dense model.
134

135
For more detailed information about Yamnet, please refer to its [TensorFlow Hub](https://tfhub.dev/google/yamnet/1) page.
136
"""
137

138
yamnet_model = hub.load("https://tfhub.dev/google/yamnet/1")
139

140
"""
141
## Dataset
142

143
The dataset used is the
144
[Crowdsourced high-quality UK and Ireland English Dialect speech data set](https://openslr.org/83/)
145
which consists of a total of 17,877 high-quality audio wav files.
146

147
This dataset includes over 31 hours of recording from 120 volunteers who self-identify as
148
native speakers of Southern England, Midlands, Northern England, Wales, Scotland and Ireland.
149

150
For more info, please refer to the above link or to the following paper:
151
[Open-source Multi-speaker Corpora of the English Accents in the British Isles](https://aclanthology.org/2020.lrec-1.804.pdf)
152
"""
153

154
"""
155
## Download the data
156
"""
157

158
# CSV file that contains information about the dataset. For each entry, we have:
159
# - ID
160
# - wav file name
161
# - transcript
162
line_index_file = keras.utils.get_file(
163
    fname="line_index_file", origin=URL_PATH + "line_index_all.csv"
164
)
165

166
# Download the list of compressed files that contain the audio wav files
167
for i in zip_files:
168
    fname = zip_files[i].split(".")[0]
169
    url = URL_PATH + zip_files[i]
170

171
    zip_file = keras.utils.get_file(fname=fname, origin=url, extract=True)
172
    os.remove(zip_file)
173

174
"""
175
## Load the data in a Dataframe
176

177
Of the 3 columns (ID, filename and transcript), we are only interested in the filename column in order to read the audio file.
178
We will ignore the other two.
179
"""
180

181
dataframe = pd.read_csv(
182
    line_index_file, names=["id", "filename", "transcript"], usecols=["filename"]
183
)
184
dataframe.head()
185

186
"""
187
Let's now preprocess the dataset by:
188

189
* Adjusting the filename (removing a leading space & adding ".wav" extension to the
190
filename).
191
* Creating a label using the first 2 characters of the filename which indicate the
192
accent.
193
* Shuffling the samples.
194
"""
195

196

197
# The purpose of this function is to preprocess the dataframe by applying the following:
198
# - Cleaning the filename from a leading space
199
# - Generating a label column that is gender agnostic i.e.
200
#   welsh english male and welsh english female for example are both labeled as
201
#   welsh english
202
# - Add extension .wav to the filename
203
# - Shuffle samples
204
def preprocess_dataframe(dataframe):
205
    # Remove leading space in filename column
206
    dataframe["filename"] = dataframe.apply(lambda row: row["filename"].strip(), axis=1)
207

208
    # Create gender agnostic labels based on the filename first 2 letters
209
    dataframe["label"] = dataframe.apply(
210
        lambda row: gender_agnostic_categories.index(row["filename"][:2]), axis=1
211
    )
212

213
    # Add the file path to the name
214
    dataframe["filename"] = dataframe.apply(
215
        lambda row: os.path.join(DATASET_DESTINATION, row["filename"] + ".wav"), axis=1
216
    )
217

218
    # Shuffle the samples
219
    dataframe = dataframe.sample(frac=1, random_state=SEED).reset_index(drop=True)
220

221
    return dataframe
222

223

224
dataframe = preprocess_dataframe(dataframe)
225
dataframe.head()
226

227
"""
228
## Prepare training & validation sets
229

230
Let's split the samples creating training and validation sets.
231
"""
232

233
split = int(len(dataframe) * (1 - VALIDATION_RATIO))
234
train_df = dataframe[:split]
235
valid_df = dataframe[split:]
236

237
print(
238
    f"We have {train_df.shape[0]} training samples & {valid_df.shape[0]} validation ones"
239
)
240

241
"""
242
## Prepare a TensorFlow Dataset
243

244
Next, we need to create a `tf.data.Dataset`.
245
This is done by creating a `dataframe_to_dataset` function that does the following:
246

247
* Create a dataset using filenames and labels.
248
* Get the Yamnet embeddings by calling another function `filepath_to_embeddings`.
249
* Apply caching, reshuffling and setting batch size.
250

251
The `filepath_to_embeddings` does the following:
252

253
* Load audio file.
254
* Resample audio to 16 kHz.
255
* Generate scores and embeddings from Yamnet model.
256
* Since Yamnet generates multiple samples for each audio file,
257
this function also duplicates the label for all the generated samples
258
that have `score=0` (speech) whereas sets the label for the others as
259
'other' indicating that this audio segment is not a speech and we won't label it as one of the accents.
260

261
The below `load_16k_audio_file` is copied from the following tutorial
262
[Transfer learning with YAMNet for environmental sound classification](https://www.tensorflow.org/tutorials/audio/transfer_learning_audio)
263
"""
264

265

266
@tf.function
267
def load_16k_audio_wav(filename):
268
    # Read file content
269
    file_content = tf.io.read_file(filename)
270

271
    # Decode audio wave
272
    audio_wav, sample_rate = tf.audio.decode_wav(file_content, desired_channels=1)
273
    audio_wav = tf.squeeze(audio_wav, axis=-1)
274
    sample_rate = tf.cast(sample_rate, dtype=tf.int64)
275

276
    # Resample to 16k
277
    audio_wav = tfio.audio.resample(audio_wav, rate_in=sample_rate, rate_out=16000)
278

279
    return audio_wav
280

281

282
def filepath_to_embeddings(filename, label):
283
    # Load 16k audio wave
284
    audio_wav = load_16k_audio_wav(filename)
285

286
    # Get audio embeddings & scores.
287
    # The embeddings are the audio features extracted using transfer learning
288
    # while scores will be used to identify time slots that are not speech
289
    # which will then be gathered into a specific new category 'other'
290
    scores, embeddings, _ = yamnet_model(audio_wav)
291

292
    # Number of embeddings in order to know how many times to repeat the label
293
    embeddings_num = tf.shape(embeddings)[0]
294
    labels = tf.repeat(label, embeddings_num)
295

296
    # Change labels for time-slots that are not speech into a new category 'other'
297
    labels = tf.where(tf.argmax(scores, axis=1) == 0, label, len(class_names) - 1)
298

299
    # Using one-hot in order to use AUC
300
    return (embeddings, tf.one_hot(labels, len(class_names)))
301

302

303
def dataframe_to_dataset(dataframe, batch_size=64):
304
    dataset = tf.data.Dataset.from_tensor_slices(
305
        (dataframe["filename"], dataframe["label"])
306
    )
307

308
    dataset = dataset.map(
309
        lambda x, y: filepath_to_embeddings(x, y),
310
        num_parallel_calls=tf.data.experimental.AUTOTUNE,
311
    ).unbatch()
312

313
    return dataset.cache().batch(batch_size).prefetch(tf.data.AUTOTUNE)
314

315

316
train_ds = dataframe_to_dataset(train_df)
317
valid_ds = dataframe_to_dataset(valid_df)
318

319
"""
320
## Build the model
321

322
The model that we use consists of:
323

324
* An input layer which is the embedding output of the Yamnet classifier.
325
* 4 dense hidden layers and 4 dropout layers.
326
* An output dense layer.
327

328
The model's hyperparameters were selected using
329
[KerasTuner](https://keras.io/keras_tuner/).
330
"""
331

332
keras.backend.clear_session()
333

334

335
def build_and_compile_model():
336
    inputs = keras.layers.Input(shape=(1024), name="embedding")
337

338
    x = keras.layers.Dense(256, activation="relu", name="dense_1")(inputs)
339
    x = keras.layers.Dropout(0.15, name="dropout_1")(x)
340

341
    x = keras.layers.Dense(384, activation="relu", name="dense_2")(x)
342
    x = keras.layers.Dropout(0.2, name="dropout_2")(x)
343

344
    x = keras.layers.Dense(192, activation="relu", name="dense_3")(x)
345
    x = keras.layers.Dropout(0.25, name="dropout_3")(x)
346

347
    x = keras.layers.Dense(384, activation="relu", name="dense_4")(x)
348
    x = keras.layers.Dropout(0.2, name="dropout_4")(x)
349

350
    outputs = keras.layers.Dense(len(class_names), activation="softmax", name="ouput")(
351
        x
352
    )
353

354
    model = keras.Model(inputs=inputs, outputs=outputs, name="accent_recognition")
355

356
    model.compile(
357
        optimizer=keras.optimizers.Adam(learning_rate=1.9644e-5),
358
        loss=keras.losses.CategoricalCrossentropy(),
359
        metrics=["accuracy", keras.metrics.AUC(name="auc")],
360
    )
361

362
    return model
363

364

365
model = build_and_compile_model()
366
model.summary()
367

368
"""
369
## Class weights calculation
370

371
Since the dataset is quite unbalanced, we will use `class_weight` argument during training.
372

373
Getting the class weights is a little tricky because even though we know the number of
374
audio files for each class, it does not represent the number of samples for that class
375
since Yamnet transforms each audio file into multiple audio samples of 0.96 seconds each.
376
So every audio file will be split into a number of samples that is proportional to its length.
377

378
Therefore, to get those weights, we have to calculate the number of samples for each class
379
after preprocessing through Yamnet.
380
"""
381

382
class_counts = tf.zeros(shape=(len(class_names),), dtype=tf.int32)
383

384
for x, y in iter(train_ds):
385
    class_counts = class_counts + tf.math.bincount(
386
        tf.cast(tf.math.argmax(y, axis=1), tf.int32), minlength=len(class_names)
387
    )
388

389
class_weight = {
390
    i: tf.math.reduce_sum(class_counts).numpy() / class_counts[i].numpy()
391
    for i in range(len(class_counts))
392
}
393

394
print(class_weight)
395

396
"""
397
## Callbacks
398

399
We use Keras callbacks in order to:
400

401
* Stop whenever the validation AUC stops improving.
402
* Save the best model.
403
* Call TensorBoard in order to later view the training and validation logs.
404
"""
405

406
early_stopping_cb = keras.callbacks.EarlyStopping(
407
    monitor="val_auc", patience=10, restore_best_weights=True
408
)
409

410
model_checkpoint_cb = keras.callbacks.ModelCheckpoint(
411
    MODEL_NAME + ".h5", monitor="val_auc", save_best_only=True
412
)
413

414
tensorboard_cb = keras.callbacks.TensorBoard(
415
    os.path.join(os.curdir, "logs", model.name)
416
)
417

418
callbacks = [early_stopping_cb, model_checkpoint_cb, tensorboard_cb]
419

420
"""
421
## Training
422
"""
423

424
history = model.fit(
425
    train_ds,
426
    epochs=EPOCHS,
427
    validation_data=valid_ds,
428
    class_weight=class_weight,
429
    callbacks=callbacks,
430
    verbose=2,
431
)
432

433
"""
434
## Results
435

436
Let's plot the training and validation AUC and accuracy.
437
"""
438

439
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(14, 5))
440

441
axs[0].plot(range(EPOCHS), history.history["accuracy"], label="Training")
442
axs[0].plot(range(EPOCHS), history.history["val_accuracy"], label="Validation")
443
axs[0].set_xlabel("Epochs")
444
axs[0].set_title("Training & Validation Accuracy")
445
axs[0].legend()
446
axs[0].grid(True)
447

448
axs[1].plot(range(EPOCHS), history.history["auc"], label="Training")
449
axs[1].plot(range(EPOCHS), history.history["val_auc"], label="Validation")
450
axs[1].set_xlabel("Epochs")
451
axs[1].set_title("Training & Validation AUC")
452
axs[1].legend()
453
axs[1].grid(True)
454

455
plt.show()
456

457
"""
458
## Evaluation
459
"""
460

461
train_loss, train_acc, train_auc = model.evaluate(train_ds)
462
valid_loss, valid_acc, valid_auc = model.evaluate(valid_ds)
463

464
"""
465
Let's try to compare our model's performance to Yamnet's using one of Yamnet metrics (d-prime)
466
Yamnet achieved a d-prime value of 2.318.
467
Let's check our model's performance.
468
"""
469

470

471
# The following function calculates the d-prime score from the AUC
472
def d_prime(auc):
473
    standard_normal = stats.norm()
474
    d_prime = standard_normal.ppf(auc) * np.sqrt(2.0)
475
    return d_prime
476

477

478
print(
479
    "train d-prime: {0:.3f}, validation d-prime: {1:.3f}".format(
480
        d_prime(train_auc), d_prime(valid_auc)
481
    )
482
)
483

484
"""
485
We can see that the model achieves the following results:
486

487
Results    | Training  | Validation
488
-----------|-----------|------------
489
Accuracy   | 54%       | 51%
490
AUC        | 0.91      | 0.89
491
d-prime    | 1.882     | 1.740
492

493
"""
494

495
"""
496
## Confusion Matrix
497

498
Let's now plot the confusion matrix for the validation dataset.
499

500
The confusion matrix lets us see, for every class, not only how many samples were correctly classified,
501
but also which other classes were the samples confused with.
502

503
It allows us to calculate the precision and recall for every class.
504
"""
505

506
# Create x and y tensors
507
x_valid = None
508
y_valid = None
509

510
for x, y in iter(valid_ds):
511
    if x_valid is None:
512
        x_valid = x.numpy()
513
        y_valid = y.numpy()
514
    else:
515
        x_valid = np.concatenate((x_valid, x.numpy()), axis=0)
516
        y_valid = np.concatenate((y_valid, y.numpy()), axis=0)
517

518
# Generate predictions
519
y_pred = model.predict(x_valid)
520

521
# Calculate confusion matrix
522
confusion_mtx = tf.math.confusion_matrix(
523
    np.argmax(y_valid, axis=1), np.argmax(y_pred, axis=1)
524
)
525

526
# Plot the confusion matrix
527
plt.figure(figsize=(10, 8))
528
sns.heatmap(
529
    confusion_mtx, xticklabels=class_names, yticklabels=class_names, annot=True, fmt="g"
530
)
531
plt.xlabel("Prediction")
532
plt.ylabel("Label")
533
plt.title("Validation Confusion Matrix")
534
plt.show()
535

536
"""
537
## Precision & recall
538

539
For every class:
540

541
* Recall is the ratio of correctly classified samples i.e. it shows how many samples
542
of this specific class, the model is able to detect.
543
It is the ratio of diagonal elements to the sum of all elements in the row.
544
* Precision shows the accuracy of the classifier. It is the ratio of correctly predicted
545
samples among the ones classified as belonging to this class.
546
It is the ratio of diagonal elements to the sum of all elements in the column.
547
"""
548

549
for i, label in enumerate(class_names):
550
    precision = confusion_mtx[i, i] / np.sum(confusion_mtx[:, i])
551
    recall = confusion_mtx[i, i] / np.sum(confusion_mtx[i, :])
552
    print(
553
        "{0:15} Precision:{1:.2f}%; Recall:{2:.2f}%".format(
554
            label, precision * 100, recall * 100
555
        )
556
    )
557

558
"""
559
## Run inference on test data
560

561
Let's now run a test on a single audio file.
562
Let's check this example from [The Scottish Voice](https://www.thescottishvoice.org.uk/home/)
563

564
We will:
565

566
* Download the mp3 file.
567
* Convert it to a 16k wav file.
568
* Run the model on the wav file.
569
* Plot the results.
570
"""
571

572
filename = "audio-sample-Stuart"
573
url = "https://www.thescottishvoice.org.uk/files/cm/files/"
574

575
if os.path.exists(filename + ".wav") == False:
576
    print(f"Downloading {filename}.mp3 from {url}")
577
    command = f"wget {url}{filename}.mp3"
578
    os.system(command)
579

580
    print(f"Converting mp3 to wav and resampling to 16 kHZ")
581
    command = (
582
        f"ffmpeg -hide_banner -loglevel panic -y -i {filename}.mp3 -acodec "
583
        f"pcm_s16le -ac 1 -ar 16000 {filename}.wav"
584
    )
585
    os.system(command)
586

587
filename = filename + ".wav"
588

589

590
"""
591
The below function `yamnet_class_names_from_csv` was copied and very slightly changed
592
from this [Yamnet Notebook](https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/yamnet.ipynb).
593
"""
594

595

596
def yamnet_class_names_from_csv(yamnet_class_map_csv_text):
597
    """Returns list of class names corresponding to score vector."""
598
    yamnet_class_map_csv = io.StringIO(yamnet_class_map_csv_text)
599
    yamnet_class_names = [
600
        name for (class_index, mid, name) in csv.reader(yamnet_class_map_csv)
601
    ]
602
    yamnet_class_names = yamnet_class_names[1:]  # Skip CSV header
603
    return yamnet_class_names
604

605

606
yamnet_class_map_path = yamnet_model.class_map_path().numpy()
607
yamnet_class_names = yamnet_class_names_from_csv(
608
    tf.io.read_file(yamnet_class_map_path).numpy().decode("utf-8")
609
)
610

611

612
def calculate_number_of_non_speech(scores):
613
    number_of_non_speech = tf.math.reduce_sum(
614
        tf.where(tf.math.argmax(scores, axis=1, output_type=tf.int32) != 0, 1, 0)
615
    )
616

617
    return number_of_non_speech
618

619

620
def filename_to_predictions(filename):
621
    # Load 16k audio wave
622
    audio_wav = load_16k_audio_wav(filename)
623

624
    # Get audio embeddings & scores.
625
    scores, embeddings, mel_spectrogram = yamnet_model(audio_wav)
626

627
    print(
628
        "Out of {} samples, {} are not speech".format(
629
            scores.shape[0], calculate_number_of_non_speech(scores)
630
        )
631
    )
632

633
    # Predict the output of the accent recognition model with embeddings as input
634
    predictions = model.predict(embeddings)
635

636
    return audio_wav, predictions, mel_spectrogram
637

638

639
"""
640
Let's run the model on the audio file:
641
"""
642

643
audio_wav, predictions, mel_spectrogram = filename_to_predictions(filename)
644

645
infered_class = class_names[predictions.mean(axis=0).argmax()]
646
print(f"The main accent is: {infered_class} English")
647

648
"""
649
Listen to the audio
650
"""
651

652
Audio(audio_wav, rate=16000)
653

654
"""
655
The below function was copied from this [Yamnet notebook](tinyurl.com/4a8xn7at) and adjusted to our need.
656

657
This function plots the following:
658

659
* Audio waveform
660
* Mel spectrogram
661
* Predictions for every time step
662
"""
663

664
plt.figure(figsize=(10, 6))
665

666
# Plot the waveform.
667
plt.subplot(3, 1, 1)
668
plt.plot(audio_wav)
669
plt.xlim([0, len(audio_wav)])
670

671
# Plot the log-mel spectrogram (returned by the model).
672
plt.subplot(3, 1, 2)
673
plt.imshow(
674
    mel_spectrogram.numpy().T, aspect="auto", interpolation="nearest", origin="lower"
675
)
676

677
# Plot and label the model output scores for the top-scoring classes.
678
mean_predictions = np.mean(predictions, axis=0)
679

680
top_class_indices = np.argsort(mean_predictions)[::-1]
681
plt.subplot(3, 1, 3)
682
plt.imshow(
683
    predictions[:, top_class_indices].T,
684
    aspect="auto",
685
    interpolation="nearest",
686
    cmap="gray_r",
687
)
688

689
# patch_padding = (PATCH_WINDOW_SECONDS / 2) / PATCH_HOP_SECONDS
690
# values from the model documentation
691
patch_padding = (0.025 / 2) / 0.01
692
plt.xlim([-patch_padding - 0.5, predictions.shape[0] + patch_padding - 0.5])
693
# Label the top_N classes.
694
yticks = range(0, len(class_names), 1)
695
plt.yticks(yticks, [class_names[top_class_indices[x]] for x in yticks])
696
_ = plt.ylim(-0.5 + np.array([len(class_names), 0]))
697

698
Product

Resources

Company