Vocal Track Separation with Encoder-Decoder Architecture
Author: Joaquin Jimenez
Date created: 2024/12/10
Last modified: 2024/12/10
Description: Train a model to separate vocal tracks from music mixtures.
Introduction
In this tutorial, we build a vocal track separation model using an encoder-decoder architecture in Keras 3.
We train the model on the MUSDB18 dataset, which provides music mixtures and isolated tracks for drums, bass, other, and vocals.
Key concepts covered:
Audio data preprocessing using the Short-Time Fourier Transform (STFT).
Audio data augmentation techniques.
Implementing custom encoders and decoders specialized for audio data.
Defining appropriate loss functions and metrics for audio source separation tasks.
The model architecture is derived from the TFC_TDF_Net model described in:
W. Choi, M. Kim, J. Chung, D. Lee, and S. Jung, "Investigating U-Nets with various intermediate blocks for spectrogram-based singing voice separation," in Proceedings of the 21st International Society for Music Information Retrieval Conference (ISMIR), 2020.
For reference code, see: GitHub: ws-choi/ISMIR2020_U_Nets_SVS.
The data processing and model training routines are partly derived from: ZFTurbo/Music-Source-Separation-Training.
Setup
Install and import all the required dependencies.
Configuration
The following constants define configuration parameters for audio processing and model training, including dataset paths, audio chunk sizes, Short-Time Fourier Transform (STFT) parameters, and training hyperparameters.
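As an illustration, such a configuration cell might look like the following. The specific names and values here are assumptions for the sketch, not the tutorial's exact constants:

```python
# Hypothetical configuration constants (names and values are illustrative,
# not the tutorial's exact settings).
import os

# Dataset paths
DATASET_DIR = os.path.join("data", "musdb18")

# Audio processing
SAMPLE_RATE = 16000      # resampling target in Hz
CHUNK_DURATION_SEC = 3   # length of each training chunk in seconds
CHUNK_SIZE = SAMPLE_RATE * CHUNK_DURATION_SEC

# Short-Time Fourier Transform (STFT) parameters
N_FFT = 2048             # FFT window size
HOP_LENGTH = 512         # stride between consecutive STFT frames

# Training hyperparameters
BATCH_SIZE = 8
EPOCHS = 10
LEARNING_RATE = 1e-4
```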
MUSDB18 Dataset
The MUSDB18 dataset is a standard benchmark for music source separation, containing 150 full-length music tracks along with isolated drums, bass, other, and vocals. The dataset is stored in .mp4 format, and each .mp4 file includes multiple audio streams (mixture and individual tracks).
Download and Conversion
The following utility function downloads MUSDB18 and converts its .mp4 files to .wav files for each instrument track, resampled to 16 kHz.
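A minimal sketch of the conversion step, assuming `ffmpeg` is on the PATH and that the stem order follows the STEMS convention (the function names and stream indices below are assumptions, so verify them against your copy of the dataset):

```python
import subprocess

def ffmpeg_extract_command(mp4_path, stream_index, wav_path, sample_rate=16000):
    """Build an ffmpeg command that extracts one audio stream from a
    MUSDB18 .mp4 file and resamples it to a .wav file.

    The assumed stream layout (0=mixture, 1=drums, 2=bass, 3=other,
    4=vocals) follows the STEMS convention.
    """
    return [
        "ffmpeg", "-y",                 # overwrite existing output
        "-i", mp4_path,                 # input container
        "-map", f"0:a:{stream_index}",  # select one audio stream
        "-ar", str(sample_rate),        # resample to 16 kHz
        wav_path,
    ]

def convert_track(mp4_path, out_prefix):
    """Convert every stem of one .mp4 file to a separate .wav file."""
    stems = ["mixture", "drums", "bass", "other", "vocals"]
    for i, stem in enumerate(stems):
        cmd = ffmpeg_extract_command(mp4_path, i, f"{out_prefix}_{stem}.wav")
        subprocess.run(cmd, check=True)  # requires ffmpeg on PATH
```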
Custom Dataset
We define a custom dataset class to generate random audio chunks and their corresponding labels. The dataset does the following:
Selects a random chunk from a random song and instrument.
Applies optional data augmentations.
Combines isolated tracks to form new synthetic mixtures.
Prepares features (mixtures) and labels (vocals) for training.
This approach yields an effectively infinite variety of training examples through randomization and augmentation.
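The steps above can be sketched with plain NumPy. The helper names, augmentations, and chunking logic here are simplified assumptions, not the tutorial's exact dataset class:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_chunk(track, chunk_size):
    """Select a random fixed-length chunk from a (samples, channels) array."""
    start = rng.integers(0, track.shape[0] - chunk_size + 1)
    return track[start : start + chunk_size]

def augment(chunk):
    """Apply simple random augmentations: gain scaling and channel swap."""
    chunk = chunk * rng.uniform(0.5, 1.5)  # random gain
    if rng.random() < 0.5:
        chunk = chunk[:, ::-1]             # swap stereo channels
    return chunk

def make_example(stems, chunk_size):
    """Build one (mixture, vocals) training pair from isolated stems.

    `stems` maps instrument name -> (samples, channels) waveform array.
    Chunks are drawn independently per stem, so each mixture is a new
    synthetic recombination of tracks.
    """
    chunks = {
        name: augment(random_chunk(track, chunk_size))
        for name, track in stems.items()
    }
    mixture = sum(chunks.values())  # sum stems into a synthetic mixture
    return mixture, chunks["vocals"]
```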
Visualize a Sample
Let's visualize a random mixed audio chunk and its corresponding isolated vocals. This helps to understand the nature of the preprocessed input data.
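A waveform plot along these lines could be produced with Matplotlib (this is a generic sketch, not the tutorial's plotting cell; the function name and layout are assumptions):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; safe outside notebooks
import matplotlib.pyplot as plt

def plot_pair(mixture, vocals, sample_rate=16000, path="sample.png"):
    """Plot a mixture chunk above its corresponding isolated vocals chunk."""
    t = np.arange(mixture.shape[0]) / sample_rate
    fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(10, 4))
    ax1.plot(t, mixture[:, 0])
    ax1.set_title("Mixture")
    ax2.plot(t, vocals[:, 0])
    ax2.set_title("Vocals")
    ax2.set_xlabel("Time (s)")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)
    return path
```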
Model
Preprocessing
The model operates on STFT representations rather than raw audio. We define a preprocessing model to compute STFT and a corresponding inverse transform (iSTFT).
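The round trip can be demonstrated standalone with SciPy (the tutorial itself builds this as a Keras preprocessing model; the SciPy calls and parameter values below are assumptions for illustration):

```python
import numpy as np
from scipy.signal import stft, istft

SAMPLE_RATE = 16000  # assumed, matching the tutorial's 16 kHz resampling
N_FFT = 2048
HOP = 512

# One second of a synthetic 440 Hz tone standing in for an audio chunk.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
x = np.sin(2 * np.pi * 440.0 * t)

# Forward STFT: complex spectrogram of shape (N_FFT // 2 + 1, frames).
_, _, Z = stft(x, fs=SAMPLE_RATE, nperseg=N_FFT, noverlap=N_FFT - HOP)

# Inverse STFT (iSTFT) reconstructs the waveform; with a Hann window at
# 75% overlap the COLA condition holds, so reconstruction is near-exact.
_, x_rec = istft(Z, fs=SAMPLE_RATE, nperseg=N_FFT, noverlap=N_FFT - HOP)

print(Z.shape)
```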
Model Architecture
The model uses a custom encoder-decoder architecture with Time-Frequency Convolution (TFC) and Time-Distributed Fully Connected (TDF) blocks. These are grouped into a TimeFrequencyTransformBlock, i.e. "TFC_TDF" in the original paper by Choi et al.
We then define an encoder-decoder network with multiple scales. Each encoder scale applies TFC_TDF blocks followed by downsampling, while decoder scales apply TFC_TDF blocks over the concatenation of upsampled features and associated encoder outputs.
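The shape bookkeeping across scales can be traced with a small helper. The scale count, downsampling factor, and channel growth below are illustrative assumptions, not the paper's exact values:

```python
def unet_shapes(freq, time, channels, num_scales=3, growth=24):
    """Trace a (freq, time, channels) spectrogram through a U-Net-style
    encoder-decoder.

    Each encoder scale halves both spectrogram axes after its TFC_TDF
    blocks; each decoder scale upsamples back to the matching encoder
    resolution and concatenates that encoder output (the skip connection)
    before its own TFC_TDF blocks.
    """
    encoder_outputs = []
    shape = (freq, time, channels)
    for _ in range(num_scales):
        encoder_outputs.append(shape)            # saved for the skip connection
        f, t, c = shape
        shape = (f // 2, t // 2, c + growth)     # downsample, widen channels

    for skip in reversed(encoder_outputs):
        f, t, c = shape
        up = (skip[0], skip[1], c)               # upsample to encoder resolution
        shape = (up[0], up[1], up[2] + skip[2])  # concatenate skip features
    return shape
```

Note that the output spatial dimensions match the input spectrogram, while the channel count reflects the accumulated concatenations; a final projection layer would map this back to the target number of channels.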
Loss and Metrics
We define:
spectral_loss: Mean absolute error in the STFT domain.
sdr: Signal-to-Distortion Ratio, a common source separation metric.
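Both quantities can be sketched in NumPy. The single-FFT spectral loss below is a simplified stand-in for a full STFT-domain loss, and the SDR formula is the standard power ratio in decibels (the tutorial's exact implementations may differ):

```python
import numpy as np

def spectral_loss(y_true, y_pred, n_fft=2048):
    """Mean absolute error between frequency-domain representations
    (a single real FFT here, as a simplified stand-in for a full STFT)."""
    S_true = np.fft.rfft(y_true, n=n_fft)
    S_pred = np.fft.rfft(y_pred, n=n_fft)
    return np.mean(np.abs(S_true - S_pred))

def sdr(y_true, y_pred, eps=1e-8):
    """Signal-to-Distortion Ratio in dB: the power of the reference
    signal over the power of the residual error."""
    num = np.sum(y_true ** 2)
    den = np.sum((y_true - y_pred) ** 2)
    return 10.0 * np.log10((num + eps) / (den + eps))
```

A perfect prediction drives the error power toward zero, so SDR grows large; predicting silence against a unit-power reference gives roughly 0 dB.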
Training
Visualize Model Architecture
Compile and Train the Model
Evaluation
Evaluate the model on the validation dataset and visualize predicted vocals.
Conclusion
We built and trained a vocal track separation model using an encoder-decoder architecture with custom blocks applied to the MUSDB18 dataset. We demonstrated STFT-based preprocessing, data augmentation, and a source separation metric (SDR).
Next steps:
Train for more epochs and refine hyperparameters.
Separate multiple instruments simultaneously.
Enhance the model to handle instruments not present in the mixture.