Audio models for generation, ASR, trigger word detection, etc.

Audio Modality
Codec : a piece of hardware or software that compresses and decompresses digital audio data to reduce file size
FFT : fast Fourier transform; converts a signal from the time-amplitude domain to the frequency-amplitude domain
Resampling : changing the sampling rate of a recording, e.g. audio recorded at 44100 Hz that we want at 16000 Hz
Spectrogram : a representation of how the frequency content of a signal changes over time (time on one axis, frequency on the other, magnitude as intensity)
Mels : the mel scale approximates how humans perceive pitch; in a mel spectrogram the frequency axis is converted to the mel scale
Channel : the number of separate audio signals in a recording, typically one per microphone used
Mono channel : a single channel, so the same sound plays in both ears
Stereo channel : two channels (left and right), as used in TV, songs and YouTube videos
Sampling Rate : the number of samples captured per second of audio, measured in Hz
Waveform : the plot of the audio signal, with time on the x-axis and amplitude on the y-axis (not decibels or pitch). This is the view music players show
Spectrogram : obtained by converting from the time domain to the frequency domain, applying the FFT to short overlapping windows of the waveform
Mel-spectrogram : inspired by how humans hear. We perceive pitch on a roughly logarithmic scale, so the mel spectrogram maps the frequency axis to the mel scale to match human perception
Example :
SAMPLE_RATE = 16000
HOP_LENGTH = 256 # number of audio samples between the starts of two consecutive spectrogram frames (short-time windows)
N_FFT = 1024
MAX_MEL_FRAMES = 512 # number of time steps (frames) in a mel spectrogram
Here the audio is sampled at 16 kHz.
N_FFT : the number of samples analysed in each FFT, ...
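To make the example parameters concrete, here is a minimal sketch (standard library only) of what they imply: how long each analysis window is, how wide each frequency bin is, and how many frames a clip produces. The helper names `num_frames` and `frame_magnitudes` are hypothetical, the frame count assumes no padding at the signal edges (some libraries pad, which changes the count), and reading MAX_MEL_FRAMES as a cap on clip length is an assumption rather than something the notes state.

```python
import cmath
import math

SAMPLE_RATE = 16000
HOP_LENGTH = 256
N_FFT = 1024
MAX_MEL_FRAMES = 512


def num_frames(num_samples, n_fft=N_FFT, hop_length=HOP_LENGTH):
    """Frame count when windows are not padded past the signal edges (assumption)."""
    if num_samples < n_fft:
        return 0
    return 1 + (num_samples - n_fft) // hop_length


def frame_magnitudes(frame):
    """Magnitudes of the non-negative frequency bins of one frame (naive DFT)."""
    n = len(frame)
    # Hann window reduces leakage caused by cutting the signal into a finite frame.
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, x in enumerate(frame)]
    # Precompute the complex exponentials once; index (k * i) % n selects each term.
    twiddles = [cmath.exp(-2j * math.pi * m / n) for m in range(n)]
    return [
        abs(sum(windowed[i] * twiddles[(k * i) % n] for i in range(n)))
        for k in range(n // 2 + 1)
    ]


window_seconds = N_FFT / SAMPLE_RATE       # 0.064 s of audio per FFT window
hop_seconds = HOP_LENGTH / SAMPLE_RATE     # 0.016 s between frame starts
bin_width_hz = SAMPLE_RATE / N_FFT         # 15.625 Hz per frequency bin
n_bins = N_FFT // 2 + 1                    # 513 bins from 0 Hz up to 8000 Hz
max_seconds = MAX_MEL_FRAMES * HOP_LENGTH / SAMPLE_RATE  # about 8.19 s of audio

# One frame of a 1 kHz sine tone: the energy should peak at bin 1000 / 15.625 = 64.
tone = [math.sin(2 * math.pi * 1000 * i / SAMPLE_RATE) for i in range(N_FFT)]
mags = frame_magnitudes(tone)
peak_bin = max(range(len(mags)), key=lambda k: mags[k])
peak_hz = peak_bin * bin_width_hz

print(num_frames(SAMPLE_RATE), n_bins, bin_width_hz, peak_bin, peak_hz)
```

A larger N_FFT gives narrower frequency bins but a longer window, so timing becomes blurrier; a smaller HOP_LENGTH gives more frames per second at the cost of more computation. In practice a library FFT would replace the naive DFT used here, which is kept only to avoid dependencies.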

March 9, 2026 · 3 min · Mohit Dulani