CoCalc -- melgan_spectrogram

GitHub Repository: keras-team/keras-io
Path: blob/master/examples/audio/melgan_spectrogram_inversion.py
³⁵⁰⁷ views
1
"""
2
Title: MelGAN-based spectrogram inversion using feature matching
3
Author: [Darshan Deshpande](https://twitter.com/getdarshan)
4
Date created: 02/09/2021
5
Last modified: 15/09/2021
6
Description: Inversion of audio from mel-spectrograms using the MelGAN architecture and feature matching.
7
Accelerator: GPU
8
"""
9

10
"""
11
## Introduction
12

13
Autoregressive vocoders have been ubiquitous for a majority of the history of speech processing,
14
but for most of their existence they have lacked parallelism.
15
[MelGAN](https://arxiv.org/abs/1910.06711) is a
16
non-autoregressive, fully convolutional vocoder architecture used for purposes ranging
17
from spectral inversion and speech enhancement to present-day state-of-the-art
18
speech synthesis when used as a decoder
19
with models like Tacotron2 or FastSpeech that convert text to mel spectrograms.
20

21
In this tutorial, we will have a look at the MelGAN architecture and how it can achieve
22
fast spectral inversion, i.e. conversion of spectrograms to audio waves. The MelGAN
23
implemented in this tutorial is similar to the original implementation with only the
24
difference of method of padding for convolutions where we will use 'same' instead of
25
reflect padding.
26
"""
27

28
"""
29
## Importing and Defining Hyperparameters
30
"""
31

32
"""shell
33
pip install -qqq tensorflow_addons
34
pip install -qqq tensorflow-io
35
"""
36

37
import tensorflow as tf
38
import tensorflow_io as tfio
39
from tensorflow import keras
40
from tensorflow.keras import layers
41
from tensorflow_addons import layers as addon_layers
42

43
# Setting logger level to avoid input shape warnings
44
tf.get_logger().setLevel("ERROR")
45

46
# Defining hyperparameters
47

48
DESIRED_SAMPLES = 8192
49
LEARNING_RATE_GEN = 1e-5
50
LEARNING_RATE_DISC = 1e-6
51
BATCH_SIZE = 16
52

53
mse = keras.losses.MeanSquaredError()
54
mae = keras.losses.MeanAbsoluteError()
55

56
"""
57
## Loading the Dataset
58

59
This example uses the [LJSpeech dataset](https://keithito.com/LJ-Speech-Dataset/).
60

61
The LJSpeech dataset is primarily used for text-to-speech and consists of 13,100 discrete
62
speech samples taken from 7 non-fiction books, having a total length of approximately 24
63
hours. The MelGAN training is only concerned with the audio waves so we process only the
64
WAV files and ignore the audio annotations.
65
"""
66

67
"""shell
68
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
69
tar -xf /content/LJSpeech-1.1.tar.bz2
70
"""
71

72
"""
73
We create a `tf.data.Dataset` to load and process the audio files on the fly.
74
The `preprocess()` function takes the file path as input and returns two instances of the
75
wave, one for input and one as the ground truth for comparison. The input wave will be
76
mapped to a spectrogram using the custom `MelSpec` layer as shown later in this example.
77
"""
78

79
# Splitting the dataset into training and testing splits
80
wavs = tf.io.gfile.glob("LJSpeech-1.1/wavs/*.wav")
81
print(f"Number of audio files: {len(wavs)}")
82

83

84
# Mapper function for loading the audio. This function returns two instances of the wave
85
def preprocess(filename):
86
    audio = tf.audio.decode_wav(tf.io.read_file(filename), 1, DESIRED_SAMPLES).audio
87
    return audio, audio
88

89

90
# Create tf.data.Dataset objects and apply preprocessing
91
train_dataset = tf.data.Dataset.from_tensor_slices((wavs,))
92
train_dataset = train_dataset.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
93

94
"""
95
## Defining custom layers for MelGAN
96

97
The MelGAN architecture consists of 3 main modules:
98

99
1. The residual block
100
2. Dilated convolutional block
101
3. Discriminator block
102

103
![MelGAN](https://i.imgur.com/ZdxwzPG.png)
104
"""
105

106
"""
107
Since the network takes a mel-spectrogram as input, we will create an additional custom
108
layer
109
which can convert the raw audio wave to a spectrogram on-the-fly. We use the raw audio
110
tensor from `train_dataset` and map it to a mel-spectrogram using the `MelSpec` layer
111
below.
112
"""
113

114
# Custom keras layer for on-the-fly audio to spectrogram conversion
115

116

117
class MelSpec(layers.Layer):
118
    def __init__(
119
        self,
120
        frame_length=1024,
121
        frame_step=256,
122
        fft_length=None,
123
        sampling_rate=22050,
124
        num_mel_channels=80,
125
        freq_min=125,
126
        freq_max=7600,
127
        **kwargs,
128
    ):
129
        super().__init__(**kwargs)
130
        self.frame_length = frame_length
131
        self.frame_step = frame_step
132
        self.fft_length = fft_length
133
        self.sampling_rate = sampling_rate
134
        self.num_mel_channels = num_mel_channels
135
        self.freq_min = freq_min
136
        self.freq_max = freq_max
137
        # Defining mel filter. This filter will be multiplied with the STFT output
138
        self.mel_filterbank = tf.signal.linear_to_mel_weight_matrix(
139
            num_mel_bins=self.num_mel_channels,
140
            num_spectrogram_bins=self.frame_length // 2 + 1,
141
            sample_rate=self.sampling_rate,
142
            lower_edge_hertz=self.freq_min,
143
            upper_edge_hertz=self.freq_max,
144
        )
145

146
    def call(self, audio, training=True):
147
        # We will only perform the transformation during training.
148
        if training:
149
            # Taking the Short Time Fourier Transform. Ensure that the audio is padded.
150
            # In the paper, the STFT output is padded using the 'REFLECT' strategy.
151
            stft = tf.signal.stft(
152
                tf.squeeze(audio, -1),
153
                self.frame_length,
154
                self.frame_step,
155
                self.fft_length,
156
                pad_end=True,
157
            )
158

159
            # Taking the magnitude of the STFT output
160
            magnitude = tf.abs(stft)
161

162
            # Multiplying the Mel-filterbank with the magnitude and scaling it using the db scale
163
            mel = tf.matmul(tf.square(magnitude), self.mel_filterbank)
164
            log_mel_spec = tfio.audio.dbscale(mel, top_db=80)
165
            return log_mel_spec
166
        else:
167
            return audio
168

169
    def get_config(self):
170
        config = super().get_config()
171
        config.update(
172
            {
173
                "frame_length": self.frame_length,
174
                "frame_step": self.frame_step,
175
                "fft_length": self.fft_length,
176
                "sampling_rate": self.sampling_rate,
177
                "num_mel_channels": self.num_mel_channels,
178
                "freq_min": self.freq_min,
179
                "freq_max": self.freq_max,
180
            }
181
        )
182
        return config
183

184

185
"""
186
The residual convolutional block extensively uses dilations and has a total receptive
187
field of 27 timesteps per block. The dilations must grow as a power of the `kernel_size`
188
to ensure reduction of hissing noise in the output. The network proposed by the paper is
189
as follows:
190

191
![ConvBlock](https://i.imgur.com/sFnnsCll.jpg)
192
"""
193

194
# Creating the residual stack block
195

196

197
def residual_stack(input, filters):
198
    """Convolutional residual stack with weight normalization.
199

200
    Args:
201
        filters: int, determines filter size for the residual stack.
202

203
    Returns:
204
        Residual stack output.
205
    """
206
    c1 = addon_layers.WeightNormalization(
207
        layers.Conv1D(filters, 3, dilation_rate=1, padding="same"), data_init=False
208
    )(input)
209
    lrelu1 = layers.LeakyReLU()(c1)
210
    c2 = addon_layers.WeightNormalization(
211
        layers.Conv1D(filters, 3, dilation_rate=1, padding="same"), data_init=False
212
    )(lrelu1)
213
    add1 = layers.Add()([c2, input])
214

215
    lrelu2 = layers.LeakyReLU()(add1)
216
    c3 = addon_layers.WeightNormalization(
217
        layers.Conv1D(filters, 3, dilation_rate=3, padding="same"), data_init=False
218
    )(lrelu2)
219
    lrelu3 = layers.LeakyReLU()(c3)
220
    c4 = addon_layers.WeightNormalization(
221
        layers.Conv1D(filters, 3, dilation_rate=1, padding="same"), data_init=False
222
    )(lrelu3)
223
    add2 = layers.Add()([add1, c4])
224

225
    lrelu4 = layers.LeakyReLU()(add2)
226
    c5 = addon_layers.WeightNormalization(
227
        layers.Conv1D(filters, 3, dilation_rate=9, padding="same"), data_init=False
228
    )(lrelu4)
229
    lrelu5 = layers.LeakyReLU()(c5)
230
    c6 = addon_layers.WeightNormalization(
231
        layers.Conv1D(filters, 3, dilation_rate=1, padding="same"), data_init=False
232
    )(lrelu5)
233
    add3 = layers.Add()([c6, add2])
234

235
    return add3
236

237

238
"""
239
Each convolutional block uses the dilations offered by the residual stack
240
and upsamples the input data by the `upsampling_factor`.
241
"""
242

243
# Dilated convolutional block consisting of the Residual stack
244

245

246
def conv_block(input, conv_dim, upsampling_factor):
247
    """Dilated Convolutional Block with weight normalization.
248

249
    Args:
250
        conv_dim: int, determines filter size for the block.
251
        upsampling_factor: int, scale for upsampling.
252

253
    Returns:
254
        Dilated convolution block.
255
    """
256
    conv_t = addon_layers.WeightNormalization(
257
        layers.Conv1DTranspose(conv_dim, 16, upsampling_factor, padding="same"),
258
        data_init=False,
259
    )(input)
260
    lrelu1 = layers.LeakyReLU()(conv_t)
261
    res_stack = residual_stack(lrelu1, conv_dim)
262
    lrelu2 = layers.LeakyReLU()(res_stack)
263
    return lrelu2
264

265

266
"""
267
The discriminator block consists of convolutions and downsampling layers. This block is
268
essential for the implementation of the feature matching technique.
269

270
Each discriminator outputs a list of feature maps that will be compared during training
271
to compute the feature matching loss.
272
"""
273

274

275
def discriminator_block(input):
276
    conv1 = addon_layers.WeightNormalization(
277
        layers.Conv1D(16, 15, 1, "same"), data_init=False
278
    )(input)
279
    lrelu1 = layers.LeakyReLU()(conv1)
280
    conv2 = addon_layers.WeightNormalization(
281
        layers.Conv1D(64, 41, 4, "same", groups=4), data_init=False
282
    )(lrelu1)
283
    lrelu2 = layers.LeakyReLU()(conv2)
284
    conv3 = addon_layers.WeightNormalization(
285
        layers.Conv1D(256, 41, 4, "same", groups=16), data_init=False
286
    )(lrelu2)
287
    lrelu3 = layers.LeakyReLU()(conv3)
288
    conv4 = addon_layers.WeightNormalization(
289
        layers.Conv1D(1024, 41, 4, "same", groups=64), data_init=False
290
    )(lrelu3)
291
    lrelu4 = layers.LeakyReLU()(conv4)
292
    conv5 = addon_layers.WeightNormalization(
293
        layers.Conv1D(1024, 41, 4, "same", groups=256), data_init=False
294
    )(lrelu4)
295
    lrelu5 = layers.LeakyReLU()(conv5)
296
    conv6 = addon_layers.WeightNormalization(
297
        layers.Conv1D(1024, 5, 1, "same"), data_init=False
298
    )(lrelu5)
299
    lrelu6 = layers.LeakyReLU()(conv6)
300
    conv7 = addon_layers.WeightNormalization(
301
        layers.Conv1D(1, 3, 1, "same"), data_init=False
302
    )(lrelu6)
303
    return [lrelu1, lrelu2, lrelu3, lrelu4, lrelu5, lrelu6, conv7]
304

305

306
"""
307
### Create the generator
308
"""
309

310

311
def create_generator(input_shape):
312
    inp = keras.Input(input_shape)
313
    x = MelSpec()(inp)
314
    x = layers.Conv1D(512, 7, padding="same")(x)
315
    x = layers.LeakyReLU()(x)
316
    x = conv_block(x, 256, 8)
317
    x = conv_block(x, 128, 8)
318
    x = conv_block(x, 64, 2)
319
    x = conv_block(x, 32, 2)
320
    x = addon_layers.WeightNormalization(
321
        layers.Conv1D(1, 7, padding="same", activation="tanh")
322
    )(x)
323
    return keras.Model(inp, x)
324

325

326
# We use a dynamic input shape for the generator since the model is fully convolutional
327
generator = create_generator((None, 1))
328
generator.summary()
329

330
"""
331
### Create the discriminator
332
"""
333

334

335
def create_discriminator(input_shape):
336
    inp = keras.Input(input_shape)
337
    out_map1 = discriminator_block(inp)
338
    pool1 = layers.AveragePooling1D()(inp)
339
    out_map2 = discriminator_block(pool1)
340
    pool2 = layers.AveragePooling1D()(pool1)
341
    out_map3 = discriminator_block(pool2)
342
    return keras.Model(inp, [out_map1, out_map2, out_map3])
343

344

345
# We use a dynamic input shape for the discriminator
346
# This is done because the input shape for the generator is unknown
347
discriminator = create_discriminator((None, 1))
348

349
discriminator.summary()
350

351
"""
352
## Defining the loss functions
353

354
**Generator Loss**
355

356
The generator architecture uses a combination of two losses
357

358
1. Mean Squared Error:
359

360
This is the standard MSE generator loss calculated between ones and the outputs from the
361
discriminator with _N_ layers.
362

363
<p align="center">
364
<img src="https://i.imgur.com/dz4JS3I.png" width=300px;></img>
365
</p>
366

367
2. Feature Matching Loss:
368

369
This loss involves extracting the outputs of every layer from the discriminator for both
370
the generator and ground truth and compare each layer output _k_ using Mean Absolute Error.
371

372
<p align="center">
373
<img src="https://i.imgur.com/gEpSBar.png" width=400px;></img>
374
</p>
375

376
**Discriminator Loss**
377

378
The discriminator uses the Mean Absolute Error and compares the real data predictions
379
with ones and generated predictions with zeros.
380

381
<p align="center">
382
<img src="https://i.imgur.com/bbEnJ3t.png" width=425px;></img>
383
</p>
384
"""
385

386
# Generator loss
387

388

389
def generator_loss(real_pred, fake_pred):
390
    """Loss function for the generator.
391

392
    Args:
393
        real_pred: Tensor, output of the ground truth wave passed through the discriminator.
394
        fake_pred: Tensor, output of the generator prediction passed through the discriminator.
395

396
    Returns:
397
        Loss for the generator.
398
    """
399
    gen_loss = []
400
    for i in range(len(fake_pred)):
401
        gen_loss.append(mse(tf.ones_like(fake_pred[i][-1]), fake_pred[i][-1]))
402

403
    return tf.reduce_mean(gen_loss)
404

405

406
def feature_matching_loss(real_pred, fake_pred):
407
    """Implements the feature matching loss.
408

409
    Args:
410
        real_pred: Tensor, output of the ground truth wave passed through the discriminator.
411
        fake_pred: Tensor, output of the generator prediction passed through the discriminator.
412

413
    Returns:
414
        Feature Matching Loss.
415
    """
416
    fm_loss = []
417
    for i in range(len(fake_pred)):
418
        for j in range(len(fake_pred[i]) - 1):
419
            fm_loss.append(mae(real_pred[i][j], fake_pred[i][j]))
420

421
    return tf.reduce_mean(fm_loss)
422

423

424
def discriminator_loss(real_pred, fake_pred):
425
    """Implements the discriminator loss.
426

427
    Args:
428
        real_pred: Tensor, output of the ground truth wave passed through the discriminator.
429
        fake_pred: Tensor, output of the generator prediction passed through the discriminator.
430

431
    Returns:
432
        Discriminator Loss.
433
    """
434
    real_loss, fake_loss = [], []
435
    for i in range(len(real_pred)):
436
        real_loss.append(mse(tf.ones_like(real_pred[i][-1]), real_pred[i][-1]))
437
        fake_loss.append(mse(tf.zeros_like(fake_pred[i][-1]), fake_pred[i][-1]))
438

439
    # Calculating the final discriminator loss after scaling
440
    disc_loss = tf.reduce_mean(real_loss) + tf.reduce_mean(fake_loss)
441
    return disc_loss
442

443

444
"""
445
Defining the MelGAN model for training.
446
This subclass overrides the `train_step()` method to implement the training logic.
447
"""
448

449

450
class MelGAN(keras.Model):
451
    def __init__(self, generator, discriminator, **kwargs):
452
        """MelGAN trainer class
453

454
        Args:
455
            generator: keras.Model, Generator model
456
            discriminator: keras.Model, Discriminator model
457
        """
458
        super().__init__(**kwargs)
459
        self.generator = generator
460
        self.discriminator = discriminator
461

462
    def compile(
463
        self,
464
        gen_optimizer,
465
        disc_optimizer,
466
        generator_loss,
467
        feature_matching_loss,
468
        discriminator_loss,
469
    ):
470
        """MelGAN compile method.
471

472
        Args:
473
            gen_optimizer: keras.optimizer, optimizer to be used for training
474
            disc_optimizer: keras.optimizer, optimizer to be used for training
475
            generator_loss: callable, loss function for generator
476
            feature_matching_loss: callable, loss function for feature matching
477
            discriminator_loss: callable, loss function for discriminator
478
        """
479
        super().compile()
480

481
        # Optimizers
482
        self.gen_optimizer = gen_optimizer
483
        self.disc_optimizer = disc_optimizer
484

485
        # Losses
486
        self.generator_loss = generator_loss
487
        self.feature_matching_loss = feature_matching_loss
488
        self.discriminator_loss = discriminator_loss
489

490
        # Trackers
491
        self.gen_loss_tracker = keras.metrics.Mean(name="gen_loss")
492
        self.disc_loss_tracker = keras.metrics.Mean(name="disc_loss")
493

494
    def train_step(self, batch):
495
        x_batch_train, y_batch_train = batch
496

497
        with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
498
            # Generating the audio wave
499
            gen_audio_wave = generator(x_batch_train, training=True)
500

501
            # Generating the features using the discriminator
502
            real_pred = discriminator(y_batch_train)
503
            fake_pred = discriminator(gen_audio_wave)
504

505
            # Calculating the generator losses
506
            gen_loss = generator_loss(real_pred, fake_pred)
507
            fm_loss = feature_matching_loss(real_pred, fake_pred)
508

509
            # Calculating final generator loss
510
            gen_fm_loss = gen_loss + 10 * fm_loss
511

512
            # Calculating the discriminator losses
513
            disc_loss = discriminator_loss(real_pred, fake_pred)
514

515
        # Calculating and applying the gradients for generator and discriminator
516
        grads_gen = gen_tape.gradient(gen_fm_loss, generator.trainable_weights)
517
        grads_disc = disc_tape.gradient(disc_loss, discriminator.trainable_weights)
518
        gen_optimizer.apply_gradients(zip(grads_gen, generator.trainable_weights))
519
        disc_optimizer.apply_gradients(zip(grads_disc, discriminator.trainable_weights))
520

521
        self.gen_loss_tracker.update_state(gen_fm_loss)
522
        self.disc_loss_tracker.update_state(disc_loss)
523

524
        return {
525
            "gen_loss": self.gen_loss_tracker.result(),
526
            "disc_loss": self.disc_loss_tracker.result(),
527
        }
528

529

530
"""
531
## Training
532

533
The paper suggests that the training with dynamic shapes takes around 400,000 steps (~500
534
epochs). For this example, we will run it only for a single epoch (819 steps).
535
Longer training time (greater than 300 epochs) will almost certainly provide better results.
536
"""
537

538
gen_optimizer = keras.optimizers.Adam(
539
    LEARNING_RATE_GEN, beta_1=0.5, beta_2=0.9, clipnorm=1
540
)
541
disc_optimizer = keras.optimizers.Adam(
542
    LEARNING_RATE_DISC, beta_1=0.5, beta_2=0.9, clipnorm=1
543
)
544

545
# Start training
546
generator = create_generator((None, 1))
547
discriminator = create_discriminator((None, 1))
548

549
mel_gan = MelGAN(generator, discriminator)
550
mel_gan.compile(
551
    gen_optimizer,
552
    disc_optimizer,
553
    generator_loss,
554
    feature_matching_loss,
555
    discriminator_loss,
556
)
557
mel_gan.fit(
558
    train_dataset.shuffle(200).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE), epochs=1
559
)
560

561
"""
562
## Testing the model
563

564
The trained model can now be used for real time text-to-speech translation tasks.
565
To test how fast the MelGAN inference can be, let us take a sample audio mel-spectrogram
566
and convert it. Note that the actual model pipeline will not include the `MelSpec` layer
567
and hence this layer will be disabled during inference. The inference input will be a
568
mel-spectrogram processed similar to the `MelSpec` layer configuration.
569

570
For testing this, we will create a randomly uniformly distributed tensor to simulate the
571
behavior of the inference pipeline.
572
"""
573

574
# Sampling a random tensor to mimic a batch of 128 spectrograms of shape [50, 80]
575
audio_sample = tf.random.uniform([128, 50, 80])
576

577
"""
578
Timing the inference speed of a single sample. Running this, you can see that the average
579
inference time per spectrogram ranges from 8 milliseconds to 10 milliseconds on a K80 GPU which is
580
pretty fast.
581
"""
582
pred = generator.predict(audio_sample, batch_size=32, verbose=1)
583
"""
584
## Conclusion
585

586
The MelGAN is a highly effective architecture for spectral inversion that has a Mean
587
Opinion Score (MOS) of 3.61 that  considerably outperforms the Griffin
588
Lim algorithm having a MOS of just 1.57. In contrast with this, the MelGAN compares with
589
the state-of-the-art WaveGlow and WaveNet architectures on text-to-speech and speech
590
enhancement tasks on
591
the LJSpeech and VCTK datasets <sup>[1]</sup>.
592

593
This tutorial highlights:
594

595
1. The advantages of using dilated convolutions that grow with the filter size
596
2. Implementation of a custom layer for on-the-fly conversion of audio waves to
597
mel-spectrograms
598
3. Effectiveness of using the feature matching loss function for training GAN generators.
599

600
Further reading
601

602
1. [MelGAN paper](https://arxiv.org/abs/1910.06711) (Kundan Kumar et al.) to
603
understand the reasoning behind the architecture and training process
604
2. For in-depth understanding of the feature matching loss, you can refer to [Improved
605
Techniques for Training GANs](https://arxiv.org/abs/1606.03498) (Tim Salimans et
606
al.).
607
"""
608

609
Product

Resources

Company