Signal Processing / Audio ML
Mel Spectrogram —
Explained From First Principles
The Problem With Raw FFT Output
the main problem with raw FFT output is that it gives us bins which are linearly spaced. if you feed in N samples, the FFT gives you N bins, out of which (for a real signal) almost half are mirrored. but bins 0 and N/2 don't (the mirror rule is bin k mirrors bin N−k, so bins 0 and N/2 don't have a mirror brother).
imagine bin 0 sitting at the top of a circle, and bin N/2 at the bottom. when you fold the circle along this vertical diameter, 0 and N/2 are the only ones that don't have a mirror. so in total the part we need is N/2 + 1 bins (0, 1, 2, ..., N/2).
but hey, these bins are linearly spaced in frequency: bin 0 → bin 12 has the same gap as bin 100 → bin 112. our ears are not like that. we perceive a difference at the lower end as much bigger than the same difference at the higher end of the frequency spectrum. what i mean is, the jump from 100 Hz to 120 Hz sounds very big to our ears compared to the jump from 1500 Hz to 1520 Hz, although the difference is the same in both cases (20 Hz).
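you can see the linear spacing (and the N/2 + 1 count) for yourself in a few lines of numpy. the sample rate and FFT size here are just example values i picked:

```python
# an N-point FFT at sample rate sr has bins spaced sr/N apart,
# all the way up the spectrum — the gaps never change
import numpy as np

sr, N = 22050, 2048                 # example sample rate and FFT size
freqs = np.fft.rfftfreq(N, d=1/sr)  # frequencies of the useful bins

print(len(freqs))                   # N/2 + 1 = 1025 bins
print(freqs[12] - freqs[0])         # gap from bin 0 to bin 12
print(freqs[112] - freqs[100])      # exact same gap from bin 100 to bin 112
```

both gaps come out identical (12 · sr/N), which is exactly the mismatch with our ears.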
The Mel Scale — Matching Human Perception
The mel scale is exactly the tool that solves this problem. it was proposed in 1937 by Stanley Smith Stevens, John Volkmann, and Edwin Newman, designed as a perceptual scale of pitches that listeners judge to be equal in distance from one another.
so, this is the scale we actually hear in. if you look up 'pitch vs freq graph' online, you get a logarithmic curve. that graph means for a linear increase in pitch, the frequency has to increase logarithmically (obviously). along this line, the (common) formula for converting frequency to mel goes like this:

mel = 2595 · log10(1 + f / 700)

you can figure out the inverse (from mel to hz) yourself, but here it is:

f = 700 · (10^(mel / 2595) − 1)
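the two conversions as tiny numpy functions (a minimal sketch of the formulas above):

```python
import numpy as np

def hz_to_mel(f):
    # linear in pitch <-> logarithmic in frequency
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # exact inverse of hz_to_mel
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(hz_to_mel(1000))            # close to 1000 mel, by construction
print(mel_to_hz(hz_to_mel(440)))  # round-trips back to 440 Hz
```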
The Filterbank — Compressing 1025 Bins Into Useful Bands
so to convert any basic spectrogram (the one you get from STFT) into a mel spectrogram, first you need to construct the mel filterbank.
so you start with the min and max frequencies and convert them to min and max mel points using the above formula. then you decide how many mel bands you need for your case; usually i use 40, 70 or 120. let's call it n_mels. after that, you create n_mels + 2 equally spaced mel points between the min and max mel points, both inclusive. why n_mels + 2? because we are going to form triangular filters (left, peak and right), and each filter needs three consecutive mel points. out of the (n_mels + 2) points, each of the first n_mels points acts as the left edge of one filter, thus giving us n_mels filters. we choose triangular filters because they give a cheap weighted average of the energy within each band.
given that we already formed our n_mels + 2 equally spaced mel points, you can convert them back into their respective frequency points using the inverse formula from the previous section (a one-liner in numpy). now you can convert the frequency points to bin indices to see the gap for yourself. for a clip with a given sr (sample rate), let's say you pick a frame of N samples. then the bin spacing is sr/N, so for any bin k, the respective frequency is:

f(k) = k · sr / N

it's that simple. invert it (k = floor(f · N / sr)) to get the bin indices list.
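a quick sketch of the frequency → bin conversion, with a few made-up frequency points just to see the spacing:

```python
# bin spacing is sr/N, so bin index = floor(f * N / sr)
import numpy as np

sr, N = 22050, 2048
freq_points = np.array([200.0, 400.0, 1000.0, 4000.0])  # made-up Hz points
bin_indices = np.floor(freq_points * N / sr).astype(int)
print(bin_indices)
```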
so yeah, since we are dealing with triangles and we need n_mels filters, we can build them as a matrix of dim n_mels × 1025 (1025 being the N/2 + 1 deal, if you used a 2048-point FFT). each row is a filter. for example, to make the third filter, we use the 3rd, 4th and 5th points from the (n_mels + 2) points we created earlier:
- left → 3rd point
- peak → 4th point
- right → 5th point
we want a triangular filter, so for any bin j between left and peak (excluding peak), we give it a weight:

w(j) = (j − left) / (peak − left)

and for any bin j between peak and right (both inclusive):

w(j) = (right − j) / (right − peak)

as you can see, the weights go from 0 at left up to 1 at peak, and from 1 at peak back down to 0 at right. this is the triangle we talked about. we do the same for all the n_mels filters.
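putting the whole construction together, here is a sketch of the filterbank builder. the parameter defaults are arbitrary choices for illustration, not any library's standard:

```python
import numpy as np

def mel_filterbank(sr=22050, n_fft=2048, n_mels=40, fmin=0.0, fmax=None):
    if fmax is None:
        fmax = sr / 2
    n_bins = n_fft // 2 + 1                       # the N/2 + 1 useful bins

    # n_mels + 2 equally spaced points on the mel axis, then back to Hz
    mel_pts = np.linspace(2595 * np.log10(1 + fmin / 700),
                          2595 * np.log10(1 + fmax / 700), n_mels + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor(hz_pts * n_fft / sr).astype(int)  # Hz -> bin index

    fb = np.zeros((n_mels, n_bins))
    for m in range(n_mels):
        left, peak, right = bins[m], bins[m + 1], bins[m + 2]
        for j in range(left, peak):               # rising edge: 0 -> 1
            fb[m, j] = (j - left) / max(peak - left, 1)
        for j in range(peak, right + 1):          # falling edge: 1 -> 0
            fb[m, j] = (right - j) / max(right - peak, 1)
    return fb

fb = mel_filterbank()
print(fb.shape)   # (n_mels, n_bins) = (40, 1025)
```

the max(..., 1) guards are just there so very narrow triangles at the low end can't divide by zero.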
Applying the Filters — One Number Per Band
once we've constructed the filters, it's time to apply them to our spectrogram. this is a simple step, but it has a deeper meaning.
our spectrogram has a shape of (n_frames × 1025) and our filterbank has a shape of (n_mels × 1025). to apply every filter to all the 1025 bins, we multiply the spectrogram by the transpose of the filterbank (spec @ filterbank.T in numpy terms):
this results in a matrix whose shape is (n_frames × n_mels). what this means is we compressed the 1025 bins down to just n_mels. going by how matrix multiplication works, each of the 1025 weights in a filter gets multiplied with the 1025 bins of a frame and summed up: a weighted sum operation. the single number that results from this weighted sum represents the energy in that band.
if you thought for a moment, 'wait, aren't these filters of different widths?', yep, you're right and you have the right intuition. imagine rainfall across a city. you place a small bucket at one end of the city and a HUGE bucket at the other end. after the rainfall, if you check which bucket holds the most water, which would it be? obviously the big one. just because the bucket is big, it holds more water. likewise, just because a filter is wider doesn't mean more energy is actually there. we need to normalise for this, so we divide each filter by its width. that's it, we constructed the filters and applied them.
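the multiply-and-normalise step in numpy, with random data standing in for a real spectrogram and filterbank (shapes chosen to match the ones above):

```python
import numpy as np

rng = np.random.default_rng(0)
spec = rng.random((100, 1025))    # (n_frames, n_bins), fake STFT power
fb = rng.random((40, 1025))       # (n_mels, n_bins), fake filters

# divide each filter by its width (sum of its weights): the "bucket
# size" fix, so wide filters don't win just by being wide
fb = fb / fb.sum(axis=1, keepdims=True)

# one matrix multiply compresses 1025 bins down to 40 bands per frame
mel_spec = spec @ fb.T            # (n_frames, n_mels)
print(mel_spec.shape)             # (100, 40)
```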
Log Compression — Why We Take log10
there is this concept of dynamic range. it refers to the difference between the loudest and quietest parts of a clip. technically, it's the ratio between the largest and smallest measurable values of a signal: a whisper has a very small amplitude and a jet engine a very big one. the problem is this ratio can be billions to one, so we convert to a logarithmic scale. otherwise, dealing with such large and small numbers at the same time isn't ideal.
taking the log also stabilizes the features and reduces the variance. if the intensity of a sound doubles, our ears don't perceive it as twice as loud; we perceive loudness logarithmically. the Weber-Fechner idea is that perceived loudness is proportional to log(intensity).
also, for numerical reasons we add 1e-9 to x when calculating log(x). so we actually compute:

log10(x + 1e-9)

this is to avoid the log(0) problem. as x approaches 0 from above, log(x) approaches −∞. so what we essentially do is, when there is zero sound, we still treat it as a very, very tiny sound instead of a pure zero.
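a tiny demonstration of why the epsilon matters, with made-up mel energies:

```python
import numpy as np

mel_spec = np.array([[0.0, 1e-3, 2.5]])   # made-up band energies
log_mel = np.log10(mel_spec + 1e-9)       # epsilon avoids log10(0) = -inf
print(log_mel)                            # the zero becomes about -9, not -inf
```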
The Full Pipeline (Summary)
Audio signal
→ STFT (all frames) shape: (n_frames, n_bins)
→ Mel filterbank shape: (n_mels, n_bins), applied as its transpose [matrix multiply]
→ Mel spectrogram shape: (n_frames, n_mels)
→ Log compression (dB) shape: (n_frames, n_mels)
→ Plot: time × mel bands
- STFT is the algorithm that converts the signal from the time domain to the frequency domain. we go from amplitude vs time to frequency vs time, by taking numerous overlapping frames and calculating the FFT on each one of them.
- Mel filterbank is the tool that we use to apply mel filters so that the scale matches our hearing scale.
- Mel spectrogram is just applying the mel filterbank to the normal spectrogram (the output of STFT). but the implication is huge: this is what we feed to audio ml pipelines.
- Log compression is to make sure that we don't work with wildly different scales of values. fyi, the threshold of our hearing is 10⁻¹² W/m² and a painful sound is 1 W/m², a ratio of 10¹².
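the whole pipeline above, end to end, as a compact numpy sketch on a synthetic 440 Hz tone. parameters are arbitrary example choices, and the filterbank is built with np.interp as a compact equivalent of the loop-based construction described earlier:

```python
import numpy as np

sr, n_fft, hop, n_mels = 22050, 2048, 512, 40
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 440 * t)                       # 1 s test tone

# STFT: overlapping windowed frames -> power spectra
frames = [y[i:i + n_fft] * np.hanning(n_fft)
          for i in range(0, len(y) - n_fft + 1, hop)]
spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2       # (n_frames, 1025)

# mel filterbank: n_mels + 2 mel points -> Hz -> triangles over the bins
mels = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_mels + 2)
hz = 700 * (10 ** (mels / 2595) - 1)
bin_f = np.fft.rfftfreq(n_fft, 1 / sr)
fb = np.zeros((n_mels, n_fft // 2 + 1))
for m in range(n_mels):
    fb[m] = np.interp(bin_f, [hz[m], hz[m + 1], hz[m + 2]], [0, 1, 0])
    fb[m] /= fb[m].sum() + 1e-9                       # width normalisation

mel_spec = spec @ fb.T                                # (n_frames, n_mels)
log_mel = np.log10(mel_spec + 1e-9)                   # log compression
print(log_mel.shape)
```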
What I Got Wrong (Honest Section)
- the time vs frequency tradeoff took me the longest to digest and understand. this is the fundamental nature of sound: getting super clarity in frequency (more bins) means you get a blurry vision of time, and vice versa.
- i realised that just np.abs(fft) is not energy; rather it is np.abs(fft)**2. a minor mistake, but it means carrying a wrong fundamental.
- understanding the triangular filters was hell for me without proper materials. i spent like 6 hours spread across 2-3 days on this one.
- the why of 'N/2 + 1' was also hard for me because i imagined it the wrong way.
