[FFmpeg-devel] [PATCH] libavfilter: add atempo filter (revised patch v4)

Tue Jun 12 02:55:00 CEST 2012

On 6/11/2012 5:15 PM, Stefano Sabatini wrote:
> On date Sunday 2012-06-10 16:49:17 -0600, Pavel Koshevoy encoded:
[...]
>> +    if (tempo < 0.5 || tempo > 2.0) {
>> +        av_log(ctx, AV_LOG_ERROR, "Tempo value %f exceeds [0.5, 2.0] range\n",
>> +               tempo);
>> +        return AVERROR(EINVAL);
>> +    }
> Just out of curiosity: can you briefly explain what happens with
> out-of-range values? (In other words, why can't the algorithm work well?)

At 0.5 tempo there is already a slight echo.  At 2.0 tempo fast speech 
becomes too fast to understand.  The practically useful range of tempo 
values for this filter is about [0.8, 1.5].  If someone needs to step 
outside the [0.5, 2.0] range, I suggest they daisy-chain two or more 
atempo filters.
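
The daisy-chaining suggestion above can be sketched as follows.  This is 
only an illustration: build_atempo_chain() is a hypothetical helper, not 
part of the patch, and it assumes the stages are joined with the usual 
comma-separated filter chain syntax.  It splits a requested tempo into 
stages that each stay inside [0.5, 2.0] and whose product equals the 
requested tempo.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper, not part of the patch: write a chain of atempo
 * stages into buf so that every stage stays inside [0.5, 2.0] and the
 * product of all stages equals the requested tempo.
 * Returns the number of stages, or -1 on bad input or a too-small buffer. */
static int build_atempo_chain(double tempo, char *buf, size_t size)
{
    size_t used = 0;
    int stages = 0;

    if (!(tempo > 0.0) || !buf || size == 0)
        return -1;

    for (;;) {
        /* The last stage carries whatever tempo remains once it is in range. */
        int last = tempo >= 0.5 && tempo <= 2.0;
        double step = last ? tempo : (tempo > 2.0 ? 2.0 : 0.5);
        int n = snprintf(buf + used, size - used, "%satempo=%g",
                         stages ? "," : "", step);

        if (n < 0 || (size_t)n >= size - used)
            return -1;          /* output would not fit in buf */

        used += (size_t)n;
        stages++;
        if (last)
            break;
        tempo /= step;          /* tempo still to be applied by later stages */
    }
    return stages;
}
```

For a requested tempo of 4.0 this yields a two-stage chain, which could 
then be passed as an audio filter string, for example 
ffmpeg -i in.wav -af "atempo=2,atempo=2" out.wav.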

The algorithm works by blending together a successive sequence of audio 
fragments.

In this implementation the output fragments always overlap by 50%.  When 
tempo is 2 or greater the input fragments no longer overlap.  Actually, 
due to alignment correction, some input fragments may not overlap even 
when tempo is less than 2.  When input fragments do not overlap, the 
portion of the waveform between adjacent input fragments is skipped -- 
a definite loss of information.
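
The fragment geometry can be put into numbers with a small sketch.  It 
assumes a simplified model that is not taken from the patch: output 
fragments advance by window / 2 samples, input fragments advance by 
tempo * window / 2 samples, and alignment correction is ignored.  The 
function names are hypothetical.

```c
/* Assumed model: output fragments advance by window/2 (50% overlap), so
 * input fragments advance by tempo * window/2 before any alignment
 * correction is applied. */
static double input_hop(double tempo, int window)
{
    return tempo * (window / 2.0);
}

/* Positive result: number of samples shared by adjacent input fragments.
 * Zero: the fragments just touch.
 * Negative: that many input samples fall between fragments and are skipped. */
static double input_overlap(double tempo, int window)
{
    return window - input_hop(tempo, window);
}
```

With a 1024-sample window, tempo 1.0 leaves 512 shared samples, tempo 2.0 
leaves none, and tempo 2.5 skips 256 samples between fragments.  At 
tempo 0.5 adjacent input fragments share 768 samples, which is the 
repetition described below.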

When tempo is less than 1, portions of the input waveform are repeated in 
adjacent audio fragments.  When these fragments are aligned and blended 
for output, these repeated waveforms may match well (in which case you 
probably will not hear any artifacts) or may overlap an unrelated 
waveform feature.  Blending with a Hann window function is supposed to 
minimize the artifacts when aligned audio fragments do not match well.  
In practice this means that misaligned portions of the waveform may be 
output attenuated -- it may sound like an echo, or a ringing in the 
voice, as if someone is talking inside a large steel pipe.

How well successive input fragments align together depends on how the 
previous fragments were aligned.  Cumulative alignment correction 
(output fragment drift) is restricted to the [-window / 2, window / 2) 
range in order to avoid deviating from the target tempo.
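
The drift restriction can be sketched as a clamp on each proposed 
correction.  This is a hypothetical helper, not code from the patch; it 
assumes integer sample offsets and an even window size, so the half-open 
upper bound becomes window / 2 - 1.

```c
/* Hypothetical helper: limit a proposed alignment correction so that the
 * cumulative drift stays within [-window/2, window/2).
 * drift      - cumulative correction applied so far, in samples
 * correction - correction proposed for the current fragment, in samples
 * Returns the correction that may actually be applied. */
static int clamp_correction(int drift, int correction, int window)
{
    int lo = -(window / 2);
    int hi = window / 2 - 1;    /* half-open upper bound in whole samples */
    int target = drift + correction;

    if (target < lo)
        target = lo;
    if (target > hi)
        target = hi;

    return target - drift;
}
```

With a 1024-sample window and a drift of 500 samples already 
accumulated, a proposed correction of 100 is cut down to 11, which is 
how a fragment can end up in a less than optimal alignment.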

When tempo is near 1 or less than 1, the input audio fragments tend to 
re-align to their original position in the input waveform.  The 
cumulative alignment correction restricts the overlap range, eventually 
forcing the fragments into less optimal alignment.  In my experience 
this becomes more obvious as tempo decreases; however, I have never 
tried tempo less than 0.5.  It should work, but does anyone really need 
it?

I didn't know anything about WSOLA until about a month ago.  I actually 
wrote a test app to help me visualize the waveform and to experiment 
with the algorithm.  The source code is here in case anyone is 
interested:  http://aragog.com/~pavel/src/apprenticeaudio/

There are Windows binaries available (although I am not certain they use 
the latest alignment mechanism): 
http://aragog.com/~pavel/download/apprenticeaudio/

Sorry, this probably wasn't the short answer you were asking for.

     Pavel