Audio models for generation, ASR, Trigger word etc

Mon, 09 Mar 2026 00:00:00 +0000

Audio Modality

Codec : hardware or software that compresses / decompresses digital data to reduce file size
FFT : fast Fourier transform, converts a signal from the time domain (amplitude vs. time) to the frequency domain (amplitude vs. frequency)
Resampling : changing the sampling rate of an audio, e.g. an audio was recorded at 44100 Hz, but we want it at 16000 Hz
Spectrogram : a picture of how the frequency content of an audio changes over time (time on x-axis, frequency on y-axis, color for magnitude)
Mels : mel scale, approximates how humans perceive pitch. In a mel-spectrogram the freq axis is converted to the mel scale
Channel : no. of separate audio signals in a recording, e.g. how many microphones were used to record the audio
Mono channel : a single channel, more like one headphone sound
Stereo channel : two channels (left and right), more like two headphones sound (TV, songs, YouTube videos)
Sampling Rate : no. of samples (amplitude measurements) taken from 1 sec of audio, measured in Hz
Waveform : so this is the audio plot, on x-axis we have time, on y-axis we have amplitude. This is what music players show
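Spectrogram : so we slice the waveform into short overlapping windows, convert each one from time domain to freq domain using FFT, and stack the results side by side
Mel-spectrogram : a spectrogram with its freq axis on the mel scale. Inspired by how humans listen to sound: we perceive pitch roughly on a log scale, so the mel-spectrogram is closer to what we actually hear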
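
To make the FFT and mel entries above concrete, here is a minimal sketch using only the Python standard library. The 1000 Hz test tone, the helper names `fft` and `hz_to_mel`, and the choice of mel formula (one common variant, others exist) are assumptions made for illustration, not something taken from a specific library.

```python
import cmath
import math

SAMPLE_RATE = 16000  # Hz, assumed for this sketch
N_FFT = 1024         # samples per FFT window; must be a power of two here


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def hz_to_mel(freq_hz):
    """One common mel formula; other variants exist."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


# Hypothetical test signal: a pure 1000 Hz tone, one window long.
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(N_FFT)]

spectrum = fft(tone)
# Keep only bins 0..N_FFT/2 (the rest mirror them for a real signal).
magnitudes = [abs(c) for c in spectrum[: N_FFT // 2 + 1]]
peak_bin = max(range(len(magnitudes)), key=magnitudes.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / N_FFT  # convert bin index back to Hz

print(peak_hz)             # frequency of the strongest bin
print(hz_to_mel(peak_hz))  # same frequency on the mel scale
```

In practice you would probably use a library FFT instead of this hand-written one; the point here is only to show that the strongest bin lands on the tone's frequency and how a frequency in Hz maps to mels.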

Example :

SAMPLE_RATE = 16000
HOP_LENGTH = 256 # number of audio samples between the starts of 2 consecutive spectrogram frames (short time windows)
N_FFT = 1024 # number of audio samples in each FFT window
MAX_MEL_FRAMES = 512 # max no. of time frames in a mel spectrogram

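So here the audio is sampled at 16 kHz. N_FFT tells how many samples one FFT analyses (1024 samples, i.e. 64 ms at 16 kHz), and HOP_LENGTH tells how far the window moves between two consecutive frames (256 samples, i.e. 16 ms).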
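
The numbers these settings imply can be worked out with a few lines of arithmetic. This is a rough sketch: the helper `num_frames` is hypothetical and assumes no padding of the signal, while some libraries pad the audio before framing, which changes the count.

```python
SAMPLE_RATE = 16000
HOP_LENGTH = 256
N_FFT = 1024
MAX_MEL_FRAMES = 512


def num_frames(n_samples, n_fft=N_FFT, hop_length=HOP_LENGTH):
    """Count full windows that fit in the signal, assuming no padding."""
    if n_samples < n_fft:
        return 0
    return 1 + (n_samples - n_fft) // hop_length


window_ms = 1000 * N_FFT / SAMPLE_RATE    # length of one FFT window
hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE  # step between window starts
overlap = 1 - HOP_LENGTH / N_FFT          # fraction shared by neighbouring windows
# Approximate audio length covered by the maximum number of frames.
max_seconds = MAX_MEL_FRAMES * HOP_LENGTH / SAMPLE_RATE

print(window_ms, hop_ms, overlap, max_seconds)
print(num_frames(SAMPLE_RATE))  # frames in one second of audio
```

So with these values each frame looks at 64 ms of audio, neighbouring frames overlap by 75%, and 512 frames cover roughly 8 seconds.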

ASR on Mohit Dulani
