Signal Processing / Audio ML
Mel Spectrogram —
Explained From First Principles
The Problem With Raw FFT Output
the main problem with raw FFT output is that it gives us bins which are linearly spaced. if you feed in N samples, the FFT gives you N bins, out of which (for a real signal) almost half are mirrored. but bins 0 and N/2 don't (the mirror rule is bin k mirrors bin N−k, so bins 0 and N/2 don't have a mirror brother).
imagine bin 0 sitting at the top of a circle, and bin N/2 at the bottom. when you fold the circle along this vertical diameter, 0 and N/2 are the only ones that don't have a mirror. so in total the part we need is N/2 + 1 bins (0, 1, 2, ..., N/2).
but hey, these bins are linearly spaced in frequency: bin 0 → bin 12 has the same gap as bin 100 → bin 112. our ears are not like that. we perceive a difference at the lower end as much bigger than the same difference at the higher end of the frequency spectrum. what i mean is, the jump from 100 Hz to 120 Hz sounds very big to our ears compared to the jump from 1500 Hz to 1520 Hz, although the difference is the same in both cases (20 Hz).
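you can see the linear spacing (and the N/2 + 1 count) for yourself in a few lines of numpy. the sample rate and FFT size here are just example values i picked:

```python
# an N-point FFT at sample rate sr has bins spaced sr/N apart,
# all the way up the spectrum — the gaps never change
import numpy as np

sr, N = 22050, 2048                 # example sample rate and FFT size
freqs = np.fft.rfftfreq(N, d=1/sr)  # frequencies of the useful bins

print(len(freqs))                   # N/2 + 1 = 1025 bins
print(freqs[12] - freqs[0])         # gap from bin 0 to bin 12
print(freqs[112] - freqs[100])      # exact same gap from bin 100 to bin 112
```

both gaps come out identical (12 · sr/N), which is exactly the mismatch with our ears.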
The Mel Scale — Matching Human Perception
The mel scale is exactly the tool that solves this problem. it was proposed in 1937 by Stanley Smith Stevens, John Volkmann, and Edwin Newman, designed as a perceptual scale of pitches that listeners judge to be equal in distance from one another.
so, this is the scale we actually hear in. if you look up 'pitch vs freq graph' online, you get a logarithmic curve. that graph means for a linear increase in pitch, the frequency has to increase logarithmically (obviously). along this line, the (common) formula for converting frequency to mel goes like this:

mel = 2595 · log10(1 + f / 700)

you can figure out the inverse (from mel to hz) yourself, but here it is:

f = 700 · (10^(mel / 2595) − 1)
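the two conversions as tiny numpy functions (a minimal sketch of the formulas above):

```python
import numpy as np

def hz_to_mel(f):
    # linear in pitch <-> logarithmic in frequency
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # exact inverse of hz_to_mel
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(hz_to_mel(1000))            # close to 1000 mel, by construction
print(mel_to_hz(hz_to_mel(440)))  # round-trips back to 440 Hz
```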
The Filterbank — Compressing 1025 Bins Into Useful Bands
so to convert any basic spectrogram (the one you get from STFT) into a mel spectrogram, first you need to construct the mel filterbank.
so you start with the min and max frequencies and convert them to min and max mel points using the above formula. then you decide how many mel bands you need for your case; usually i use 40, 70 or 120. let's call it n_mels. after that, you create n_mels + 2 equally spaced mel points between the min and max mel points, both inclusive. why n_mels + 2? because we are going to form triangular filters (left, peak and right), and each filter needs three consecutive mel points. out of the (n_mels + 2) points, each of the first n_mels points acts as the left edge of one filter, thus giving us n_mels filters. we choose triangular filters because they give a cheap weighted average of the energy within each band.
given that we already formed our n_mels + 2 equally spaced mel points, you can convert them back into their respective frequency points using the inverse formula from the previous section (a one-liner in numpy). now you can convert the frequency points to bin indices to see the gap for yourself. for a clip with a given sr (sample rate), let's say you pick a frame of N samples. then the bin spacing is sr/N, so for any bin k, the respective frequency is:

f(k) = k · sr / N

it's that simple. invert it (k = floor(f · N / sr)) to get the bin indices list.
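a quick sketch of the frequency → bin conversion, with a few made-up frequency points just to see the spacing:

```python
# bin spacing is sr/N, so bin index = floor(f * N / sr)
import numpy as np

sr, N = 22050, 2048
freq_points = np.array([200.0, 400.0, 1000.0, 4000.0])  # made-up Hz points
bin_indices = np.floor(freq_points * N / sr).astype(int)
print(bin_indices)
```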
so yeah, since we are dealing with triangles and we need n_mels filters, we can build them as a matrix of dim n_mels × 1025 (1025 being the N/2 + 1 deal, if you used a 2048-point FFT). each row is a filter. for example, to make the third filter, we use the 3rd, 4th and 5th points from the (n_mels + 2) points we created earlier:
- left → 3rd point
- peak → 4th point
- right → 5th point
we want a triangular filter, so for any bin j between left and peak (excluding peak), we give it a weight:

w(j) = (j − left) / (peak − left)

and for any bin j between peak and right (both inclusive):

w(j) = (right − j) / (right − peak)

as you can see, the weights go from 0 at left up to 1 at peak, and from 1 at peak back down to 0 at right. this is the triangle we talked about. we do the same for all the n_mels filters.
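putting the whole construction together, here is a sketch of the filterbank builder. the parameter defaults are arbitrary choices for illustration, not any library's standard:

```python
import numpy as np

def mel_filterbank(sr=22050, n_fft=2048, n_mels=40, fmin=0.0, fmax=None):
    if fmax is None:
        fmax = sr / 2
    n_bins = n_fft // 2 + 1                       # the N/2 + 1 useful bins

    # n_mels + 2 equally spaced points on the mel axis, then back to Hz
    mel_pts = np.linspace(2595 * np.log10(1 + fmin / 700),
                          2595 * np.log10(1 + fmax / 700), n_mels + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor(hz_pts * n_fft / sr).astype(int)  # Hz -> bin index

    fb = np.zeros((n_mels, n_bins))
    for m in range(n_mels):
        left, peak, right = bins[m], bins[m + 1], bins[m + 2]
        for j in range(left, peak):               # rising edge: 0 -> 1
            fb[m, j] = (j - left) / max(peak - left, 1)
        for j in range(peak, right + 1):          # falling edge: 1 -> 0
            fb[m, j] = (right - j) / max(right - peak, 1)
    return fb

fb = mel_filterbank()
print(fb.shape)   # (n_mels, n_bins) = (40, 1025)
```

the max(..., 1) guards are just there so very narrow triangles at the low end can't divide by zero.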
Applying the Filters — One Number Per Band
once we've constructed the filters, it's time to apply them to our spectrogram. this is a simple step, but it has a deeper meaning.
our spectrogram has a shape of (n_frames × 1025) and our filterbank has a shape of (n_mels × 1025). to apply every filter to all the 1025 bins, we multiply the spectrogram by the transpose of the filterbank (spec @ filterbank.T in numpy terms):
this results in a matrix whose shape is (n_frames × n_mels). what this means is we compressed the 1025 bins down to just n_mels. going by how matrix multiplication works, each of the 1025 weights in a filter gets multiplied with the 1025 bins of a frame and summed up: a weighted sum operation. the single number that results from this weighted sum represents the energy in that band.
if you thought for a moment, 'wait, aren't these filters of different widths?', yep, you're right and you have the right intuition. imagine rainfall across a city. you place a small bucket at one end of the city and a HUGE bucket at the other end. after the rainfall, if you check which bucket holds the most water, which would it be? obviously the big one. just because the bucket is big, it holds more water. likewise, just because a filter is wider doesn't mean more energy is actually there. we need to normalise for this, so we divide each filter by its width. that's it, we constructed the filters and applied them.
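the multiply-and-normalise step in numpy, with random data standing in for a real spectrogram and filterbank (shapes chosen to match the ones above):

```python
import numpy as np

rng = np.random.default_rng(0)
spec = rng.random((100, 1025))    # (n_frames, n_bins), fake STFT power
fb = rng.random((40, 1025))       # (n_mels, n_bins), fake filters

# divide each filter by its width (sum of its weights): the "bucket
# size" fix, so wide filters don't win just by being wide
fb = fb / fb.sum(axis=1, keepdims=True)

# one matrix multiply compresses 1025 bins down to 40 bands per frame
mel_spec = spec @ fb.T            # (n_frames, n_mels)
print(mel_spec.shape)             # (100, 40)
```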
Log Compression — Why We Take log10
there is this concept of dynamic range. it refers to the difference between the loudest and quietest parts of a clip. technically, it's the ratio between the largest and smallest measurable values of a signal: a whisper has a very small amplitude and a jet engine a very big one. the problem is this ratio can be billions to one, so we convert to a logarithmic scale. otherwise, dealing with such large and small numbers at the same time isn't ideal.
taking the log also stabilizes the features and reduces the variance. if the intensity of a sound doubles, our ears don't perceive it as twice as loud; we perceive loudness logarithmically. the Weber-Fechner idea is that perceived loudness is proportional to log(intensity).
also, for numerical reasons we add 1e-9 to x when calculating log(x). so we actually compute:

log10(x + 1e-9)

this is to avoid the log(0) problem. as x approaches 0 from above, log(x) approaches −∞. so what we essentially do is, when there is zero sound, we still treat it as a very, very tiny sound instead of a pure zero.
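a tiny demonstration of why the epsilon matters, with made-up mel energies:

```python
import numpy as np

mel_spec = np.array([[0.0, 1e-3, 2.5]])   # made-up band energies
log_mel = np.log10(mel_spec + 1e-9)       # epsilon avoids log10(0) = -inf
print(log_mel)                            # the zero becomes about -9, not -inf
```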
The Full Pipeline (Summary)
Audio signal
→ STFT (all frames) shape: (n_frames, n_bins)
→ Mel filterbank shape: (n_mels, n_bins), applied as its transpose [matrix multiply]
→ Mel spectrogram shape: (n_frames, n_mels)
→ Log compression (dB) shape: (n_frames, n_mels)
→ Plot: time × mel bands
- STFT is the algorithm that converts the signal from the time domain to the frequency domain. we go from amplitude vs time to frequency vs time, by taking numerous overlapping frames and calculating the FFT on each one of them.
- Mel filterbank is the tool that we use to apply mel filters so that the scale matches our hearing scale.
- Mel spectrogram is just applying the mel filterbank to the normal spectrogram (the output of STFT). but the implication is huge: this is what we feed to audio ml pipelines.
- Log compression is to make sure that we don't work with wildly different scales of values. fyi, the threshold of our hearing is 10⁻¹² W/m² and a painful sound is 1 W/m², a ratio of 10¹².
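the whole pipeline above, end to end, as a compact numpy sketch on a synthetic 440 Hz tone. parameters are arbitrary example choices, and the filterbank is built with np.interp as a compact equivalent of the loop-based construction described earlier:

```python
import numpy as np

sr, n_fft, hop, n_mels = 22050, 2048, 512, 40
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 440 * t)                       # 1 s test tone

# STFT: overlapping windowed frames -> power spectra
frames = [y[i:i + n_fft] * np.hanning(n_fft)
          for i in range(0, len(y) - n_fft + 1, hop)]
spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2       # (n_frames, 1025)

# mel filterbank: n_mels + 2 mel points -> Hz -> triangles over the bins
mels = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_mels + 2)
hz = 700 * (10 ** (mels / 2595) - 1)
bin_f = np.fft.rfftfreq(n_fft, 1 / sr)
fb = np.zeros((n_mels, n_fft // 2 + 1))
for m in range(n_mels):
    fb[m] = np.interp(bin_f, [hz[m], hz[m + 1], hz[m + 2]], [0, 1, 0])
    fb[m] /= fb[m].sum() + 1e-9                       # width normalisation

mel_spec = spec @ fb.T                                # (n_frames, n_mels)
log_mel = np.log10(mel_spec + 1e-9)                   # log compression
print(log_mel.shape)
```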
What I Got Wrong (Honest Section)
- the time vs frequency tradeoff took me the longest to digest and understand. this is the fundamental nature of sound: getting super clarity in frequency (more bins) means you get a blurry vision of time, and vice versa.
- i realised that just np.abs(fft) is not energy; rather it is np.abs(fft)**2. a minor mistake, but it means carrying a wrong fundamental.
- understanding the triangular filters was hell for me without proper materials. i spent like 6 hours spread across 2-3 days on this one.
- the why of 'N/2 + 1' was also hard for me because i imagined it the wrong way.
