[FFmpeg-user] resolution of the waterfall diagram of typical mp3 file

Mon Aug 8 01:48:06 EEST 2016

On 2016-08-07 14:28, Nicolas George wrote:
> 
> You can not compute the spectrum of a single sample, that does not make
> sense mathematically. The spectrum needs to be computed on the whole
> stream, or at least, if you want to observe how it evolves during time,
> over a window large enough.

Then it's my mistake. I'm explaining it wrong - sorry for that, and 
allow me to rephrase.

I start with a mono audio file containing one song - a few minutes of 
audio. Let's say the "quality" here is arbitrarily high, for simplicity.

Using python/numpy or some other tools, I calculate the spectrum of the 
whole song, either all at once if possible, or using a reasonably large, 
shifting time window.
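
To make that step concrete, here is a minimal numpy sketch of the
shifting-window approach. The function name, the Hann window, the window
size of 1024 and the hop of 512 are all placeholder choices of mine, not
values I'm committed to; the 1 kHz test tone just stands in for the song.

```python
import numpy as np

def spectrogram(samples, window_size=1024, hop=512):
    """Magnitude spectrogram: one row per time frame, one column per bin."""
    window = np.hanning(window_size)
    n_frames = 1 + (len(samples) - window_size) // hop
    frames = np.stack([
        samples[i * hop : i * hop + window_size] * window
        for i in range(n_frames)
    ])
    # rfft keeps only the non-negative frequencies of a real signal
    return np.abs(np.fft.rfft(frames, axis=1))

# Stand-in for the song: one second of a 1 kHz tone at 44.1 kHz (assumed).
sample_rate = 44100
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000.0 * t)

spec = spectrogram(tone)
peak_bin = int(spec.mean(axis=0).argmax())
peak_hz = peak_bin * sample_rate / 1024  # bin index -> frequency in Hz
print(spec.shape, peak_hz)
```

With these settings each bin is 44100 / 1024 Hz wide (about 43 Hz), and
there are 44100 / 512 rows per second (about 86).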

I store that spectrum in a matrix. In the time dimension, the matrix has
T rows per second (so the total row count depends on the length of the
song). In the frequency dimension, the matrix has F columns (frequency
buckets or bins). In each cell, I store one value using B bits (the color
of the waterfall, or the height of the 3D representation of the spectrum).
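
As an illustration of what "B bits per cell" means here, this is a rough
sketch of uniform quantization. The helper names are made up, the random
matrix is only a stand-in for real spectrum values, and a linear scale is
an assumption (a logarithmic/dB scale may well be the better choice).

```python
import numpy as np

def quantize(matrix, bits):
    """Map each cell to an integer code in 0 .. 2**bits - 1 (linear scale)."""
    levels = 2 ** bits - 1
    lo, hi = float(matrix.min()), float(matrix.max())
    scale = (hi - lo) or 1.0  # avoid dividing by zero on a flat matrix
    codes = np.round((matrix - lo) / scale * levels).astype(np.int64)
    return codes, lo, scale

def dequantize(codes, lo, scale, bits):
    """Approximate inverse of quantize()."""
    return codes / (2 ** bits - 1) * scale + lo

rng = np.random.default_rng(0)
cells = rng.random((4, 8))          # stand-in for spectrum magnitudes
codes, lo, scale = quantize(cells, bits=3)
restored = dequantize(codes, lo, scale, bits=3)
max_error = float(np.abs(restored - cells).max())
print(codes.min(), codes.max(), max_error)
```

The worst-case error per cell is half a quantization step, so every extra
bit roughly halves it.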

I then convert the matrix back into a PCM representation.
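
For the way back to PCM, this is the kind of round trip I have in mind,
sketched with plain numpy and overlap-add. Note the assumption: each cell
here holds the complex spectrum value (magnitude and phase), because a
magnitude-only matrix cannot be inverted this directly. The function
names, window size and hop are placeholders.

```python
import numpy as np

def stft(x, n=1024, hop=256):
    """Complex short-time spectrum, one row per frame."""
    w = np.hanning(n)
    count = 1 + (len(x) - n) // hop
    return np.stack([np.fft.rfft(w * x[i * hop : i * hop + n])
                     for i in range(count)])

def istft(S, n=1024, hop=256):
    """Weighted overlap-add inverse of stft() above."""
    w = np.hanning(n)
    length = n + hop * (len(S) - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    for i, row in enumerate(S):
        out[i * hop : i * hop + n] += w * np.fft.irfft(row, n)
        norm[i * hop : i * hop + n] += w ** 2
    # Divide by the accumulated window energy; guard the near-zero edges.
    return out / np.maximum(norm, 1e-8)

rng = np.random.default_rng(1)
pcm = rng.standard_normal(8192)     # stand-in for the song samples
rebuilt = istft(stft(pcm))
# Skip the first and last window, where the taper leaves too little overlap.
interior_ok = np.allclose(rebuilt[1024:-1024], pcm[1024:-1024])
print(interior_ok)
```

Without any quantization in between, the interior samples come back
essentially unchanged; any loss in my scheme would come from the choice
of T, F and B.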

I need to determine the matrix parameters T, F, and B, so that the final 
PCM file has about as much information (about the same "sound quality", 
however you want to define that) as if it was extracted from an MP3 
file, 44.1 kHz, 128 kbps CBR.
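
Put as arithmetic, the constraint I'm picturing is roughly
T * F * B <= 128000 bits per second for a mono stream. A tiny sketch, with
T = 75 and F = 512 as purely hypothetical values to show the trade-off
(they are not a claim about what MP3 does):

```python
def bits_per_cell(bitrate_bps, rows_per_second, freq_bins):
    """Average bits available per matrix cell for a given bitrate."""
    return bitrate_bps / (rows_per_second * freq_bins)

# Hypothetical example values: 75 rows per second, 512 frequency bins.
budget = bits_per_cell(128_000, rows_per_second=75, freq_bins=512)
print(f"{budget:.2f} bits per cell")  # about 3.33
```

Doubling either T or F halves the bits left for each cell, which is why I
need all three numbers together.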

I understand that the frequency bins do not have constant width, but 
rather their upper/lower frequency limits have constant ratio (similar 
to octaves on a keyboard, but different ratio here).
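
Here is what I mean by constant-ratio bins, as a small numpy sketch. The
20 Hz to 20 kHz range and the 30 bins are arbitrary example values, and
the constant-ratio layout itself is only my understanding; it may not
match what MP3 actually does.

```python
import numpy as np

# Assumed example values: audible range split into 30 bins.
f_low, f_high, n_bins = 20.0, 20000.0, 30

# n_bins bins need n_bins + 1 edges, spaced evenly on a log scale.
edges = np.geomspace(f_low, f_high, n_bins + 1)

# Every bin's upper/lower limit ratio is the same number.
ratio = (f_high / f_low) ** (1.0 / n_bins)
print(ratio, edges[:4])
```

With these values each bin spans a ratio of about 1.26, compared with 2
for an octave.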

The purpose of this whole exercise is to run some computations on the
full spectrum (the matrix). I need to minimize the size of the matrix
while keeping the time and frequency resolutions pretty decent. I've
decided that the "sound quality" of MP3 at 44.1 kHz, 128 kbps CBR is good
enough, so I'm trying to imitate the time and frequency resolutions that
MP3 uses at that setting.

I suspect the MP3 encoding algorithm is more complex than using a
fixed-size matrix, so I'm only asking for a rough approximation, like a
back-of-the-envelope estimate. How many rows per second, how many
frequency buckets, and how many bits per cell do I need so that the result
is not worse than that reference MP3 file (44.1 kHz, 128 kbps CBR)? It
doesn't have to be the exact same signal degradation; if it's
subjectively close, that's enough for me.

-- 
Florin Andrei
http://florin.myip.org/