<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>ASR on Mohit Dulani</title>
    <link>https://complete-dope.github.io/codex/tags/asr/</link>
    <description>Recent content in ASR on Mohit Dulani</description>
    <generator>Hugo -- 0.146.0</generator>
    <language>en</language>
    <lastBuildDate>Mon, 09 Mar 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://complete-dope.github.io/codex/tags/asr/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Audio models for generation, ASR, Trigger word etc</title>
      <link>https://complete-dope.github.io/codex/posts/audio-models/</link>
      <pubDate>Mon, 09 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://complete-dope.github.io/codex/posts/audio-models/</guid>
      <description>&lt;h1 id=&#34;audio-modality&#34;&gt;Audio Modality&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Codec : hardware or software that compresses and decompresses digital data (such as audio) to reduce file size&lt;/li&gt;
&lt;li&gt;FFT : the fast Fourier transform, which converts a signal from the time domain (amplitude vs. time) to the frequency domain (amplitude vs. frequency)&lt;/li&gt;
&lt;li&gt;Resampling : changing the sampling rate of an audio signal, e.g. converting audio recorded at 44100 Hz to 16000 Hz&lt;/li&gt;
&lt;li&gt;Spectrogram : a time-frequency representation of audio, showing how the amplitude at each frequency changes over time&lt;/li&gt;
&lt;li&gt;Mels : the mel scale, which approximates how humans perceive pitch; the frequency axis is converted to this scale&lt;/li&gt;
&lt;li&gt;Channel : the number of separate audio signals in a recording, e.g. how many microphones were used to record the audio&lt;/li&gt;
&lt;li&gt;Mono channel : a single channel of audio; every speaker or earpiece plays the same signal&lt;/li&gt;
&lt;li&gt;Stereo channel : two channels (left and right), so each ear can receive a different signal (TV, songs, YouTube videos)&lt;/li&gt;
&lt;li&gt;Sampling Rate : the number of samples taken per second of audio, measured in Hz&lt;/li&gt;
&lt;li&gt;Waveform : the plot of the raw audio signal, with time on the x-axis and amplitude on the y-axis. This is the signal we actually hear, and what music players display&lt;/li&gt;
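&lt;li&gt;Spectrogram (how it is computed) : the waveform is split into short overlapping windows and the FFT is applied to each window, converting it from the time domain to the frequency domain&lt;/li&gt;
&lt;li&gt;Mel-spectrogram : a spectrogram whose frequency axis is mapped to the mel scale. Humans perceive pitch roughly on a log scale, so this representation is closer to how we hear sound&lt;/li&gt;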
&lt;/ul&gt;
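&lt;p&gt;A minimal Python sketch of two of the terms above, resampling and the mel scale. The function names (hz_to_mel, mel_to_hz, resampled_length) are illustrative and not taken from any library. The mel formula is the common 2595 * log10(1 + f / 700) variant, which is an assumption, since these notes do not say which variant they have in mind.&lt;/p&gt;

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (assumed HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel: convert mels back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def resampled_length(num_samples, orig_rate, target_rate):
    """Number of samples a clip has after resampling to target_rate."""
    return round(num_samples * target_rate / orig_rate)


# One second of audio recorded at 44100 Hz, resampled to 16000 Hz.
one_second = resampled_length(44100, 44100, 16000)

# The same 100 Hz step measured in mels at low and at high frequency.
low_step = hz_to_mel(200.0) - hz_to_mel(100.0)
high_step = hz_to_mel(8000.0) - hz_to_mel(7900.0)

print(one_second, round(low_step, 1), round(high_step, 1))
```

&lt;p&gt;Here one_second comes out as 16000, and low_step is larger than high_step: the same 100 Hz step spans more mels at low frequencies than at high ones, which is the log-like behaviour the mel scale is meant to capture.&lt;/p&gt;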
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Example :
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;SAMPLE_RATE&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;16000&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;HOP_LENGTH&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;256&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# number of audio samples between the starts of two consecutive short-time windows (spectrogram frames) &lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;N_FFT&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;1024&lt;/span&gt; 
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;MAX_MEL_FRAMES&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;512&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# maximum number of time frames in a mel spectrogram &lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;In this example the audio is sampled at 16 kHz.
&lt;code&gt;N_FFT&lt;/code&gt; : the number of samples analysed in a single FFT window; at 16 kHz, 1024 samples cover 64 ms of audio.&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
