Audio Modality

Codec : piece of hardware/software that compresses / decompresses digital data to reduce file size
FFT : fast fourier transform converts from time-amplitude domain to frequency-amplitude domain
Resampling : an audio was recorded at 44100 Hz, but we want to resample it to 16000 Hz, so that is called resampling
Spectrogram : converting from time-amplitude domain to frequency-amplitude domain
Mels : mel scale, approximates how humans percieve pitch and in this freq axis is converted to mel scale
Channel : no. of seperate audio how many microphones were used to record the audio
Mono channel : single sound , more like one headphone sound
Stereo channel : surround sounds , more like two headphones sound (tv , songs , youtube vids )
Sampling Rate : no. of sound point extracted from 1 sec of audio
Waveform : so this is audio plot , on x-axis we have time , on y-axis we have decibels, pitch . This is what we hear and what music players shows
Spectogram : so we convert from time domain to freq domain using FFT ,
Mel-spectogram : Inspired from how humans listen to sound, and we listen on a logscale so therefore mel-spectogram is made for humans to listen

Example :

SAMPLE_RATE = 16000
HOP_LENGTH = 256 # number of audio samples between spectrogram frames between 2 short time frame windows 
N_FFT = 1024 
MAX_MEL_FRAMES = 512 # no. of timestamps in a mel spectrogram

So in this audio is sampled at 16Khz N_FFT : tells in a 1 fft how many samples to analyse,

16000 samples = 1 second then, 1024 samples / 16000 ≈ 0.064 sec , so each fft is analysing 64ms of audio

1 FFT computation = 1 spectrogram frame Hop length tells how many samples to skip before we start analysing 2nd fft 256 / 16000 = 0.016 sec , 16ms

Frame 1
[0 ms -------- 64 ms]

Frame 2
     [16 ms -------- 80 ms]

Frame 3
          [32 ms -------- 96 ms]

No. of spectogram frames : frames ≈ (num_samples - N_FFT) / HOP_LENGTH + 1

ASR models

Input : waveform
Output : text
Loss : CTC loss
Dataset : (Audio, text) pairs

In this we have input as waveform and we try to get features out from this waveform to a latent space (using an CNN model) then we pass that features to transformer encoder for training to get samples out.

Generation model

Input : text
Output : spectrogram
Loss : original spectrogram with predicted spectrogram
Vocoder : converts from spectrogram to waveform

so this is more like a seq to seq problem , where in input we have text and in output we spectrogram values for that input

Architecture includes an full transformer model (encoder , decoder model) . This can be done using diffusion also

Audio Modality#

ASR models#

Generation model#

Audio Modality

ASR models

Generation model