[FFmpeg-devel] [PATCH] libavfilter: add atempo filter (revised patch v4)
Pavel Koshevoy
pkoshevoy at gmail.com
Tue Jun 12 02:55:00 CEST 2012
On 6/11/2012 5:15 PM, Stefano Sabatini wrote:
> On date Sunday 2012-06-10 16:49:17 -0600, Pavel Koshevoy encoded:
[...]
>> +    if (tempo < 0.5 || tempo > 2.0) {
>> +        av_log(ctx, AV_LOG_ERROR, "Tempo value %f exceeds [0.5, 2.0] range\n",
>> +               tempo);
>> +        return AVERROR(EINVAL);
>> +    }
> Just out of curiosity: can you tell shortly what happen with out of
> range values? (in other words, why the algorithm can't work well).
At 0.5 tempo there is already a slight echo. At 2.0 tempo fast speech
becomes too fast to understand. The practically useful range of tempo
values for this filter is about [0.8, 1.5]. If someone needs to step
outside the [0.5, 2.0] range, I suggest they daisy-chain two or more
atempo filters.
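
As a rough illustration of the daisy-chaining suggestion, here is a
minimal sketch that splits an arbitrary positive tempo into a chain of
factors, each inside [0.5, 2.0]. The helper name build_atempo_chain and
its greedy splitting strategy are assumptions made for this example;
they are not part of the patch.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper (not part of the patch): write a comma-separated
 * chain such as "atempo=2.000000,atempo=1.500000" into buf, so that the
 * product of the factors equals the requested tempo and every factor
 * lies inside [0.5, 2.0].
 * Returns the number of stages, or -1 on bad input or a too-small buffer. */
static int build_atempo_chain(double tempo, char *buf, size_t size)
{
    int stages = 0;
    size_t used = 0;

    if (!(tempo > 0.0) || size == 0)
        return -1;
    buf[0] = '\0';

    for (;;) {
        /* Clamp this stage to the range a single atempo instance accepts. */
        double factor = tempo > 2.0 ? 2.0 : tempo < 0.5 ? 0.5 : tempo;

        int n = snprintf(buf + used, size - used, "%satempo=%f",
                         stages ? "," : "", factor);
        if (n < 0 || (size_t)n >= size - used)
            return -1;
        used += (size_t)n;
        stages++;

        if (factor == tempo)   /* remaining tempo fit into this stage */
            break;
        tempo /= factor;       /* leave the rest for the next stage */
    }
    return stages;
}
```

For a tempo of 3.0 this yields two stages, 2.0 followed by 1.5. Each
stage still runs at the edge of the allowed range, so the artifacts
described in this message can be expected to accumulate.
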
The algorithm works by blending together a sequence of successive audio
fragments.
In this implementation the output fragments always overlap 50%. When
tempo is 2 or greater the input fragments no longer overlap. Actually,
due to alignment correction some input fragments may not overlap even
when tempo is less than 2. When input fragments do not overlap, a
portion of the waveform between adjacent input fragments is skipped --
a definite loss of information.
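
The overlap arithmetic above can be sketched as follows. This is a
simplified model that ignores alignment correction and assumes the
input position advances by tempo times the output hop; the function
names are made up for this example.

```c
/* Simplified model (assumption): output fragments advance by window / 2
 * samples (50% overlap), so input fragments advance by
 * tempo * window / 2 samples. Alignment correction is ignored. */
static double input_hop(double tempo, int window)
{
    return tempo * window / 2.0;
}

/* Samples shared by two adjacent input fragments (0 if none). */
static double input_overlap(double tempo, int window)
{
    double overlap = window - input_hop(tempo, window);
    return overlap > 0.0 ? overlap : 0.0;
}

/* Samples skipped between two adjacent input fragments
 * (0 if the fragments overlap or just touch). */
static double input_gap(double tempo, int window)
{
    double gap = input_hop(tempo, window) - window;
    return gap > 0.0 ? gap : 0.0;
}
```

With a 1024-sample window, tempo 1.0 gives a 512-sample input overlap,
tempo 2.0 gives none, and tempo 3.0 would skip 512 samples between
fragments.
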
When tempo is less than 1, portions of the input waveform are repeated
in adjacent audio fragments. When these fragments are aligned and
blended for output, these repeated waveforms may match well (in which
case you probably will not hear any artifacts) or may overlap an
unrelated waveform feature. Blending with a Hann window function is
supposed to minimize the artifacts when aligned audio fragments do not
match well. In practice this means that misaligned portions of the
waveform may be output attenuated -- it may sound like an echo, or a
ringing in the voice as if someone is talking inside a large steel pipe.
How well successive input fragments align together depends on how the
previous fragments were aligned. Cumulative alignment correction
(output fragment drift) is restricted to the [-window / 2, window / 2)
range in order to avoid deviating from the target tempo.
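
The drift restriction can be sketched like this. It only illustrates the
half-open range arithmetic; the helper names, and the idea of clamping a
proposed correction instead of narrowing the search range up front, are
assumptions for this example and do not describe how the patch does it.

```c
/* Hypothetical helper: given the correction accumulated so far (drift,
 * in samples), compute the inclusive range of corrections that keeps
 * the cumulative drift inside [-window / 2, window / 2).
 * window is assumed to be even and positive. */
static void correction_bounds(int drift, int window,
                              int *min_corr, int *max_corr)
{
    *min_corr = -(window / 2) - drift;    /* lower bound is inclusive  */
    *max_corr = window / 2 - 1 - drift;   /* upper bound is exclusive, */
                                          /* hence the "- 1"           */
}

/* Hypothetical helper: clamp a proposed correction to those bounds and
 * return the correction that may actually be applied. */
static int clamp_correction(int drift, int correction, int window)
{
    int min_corr, max_corr;

    correction_bounds(drift, window, &min_corr, &max_corr);
    if (correction < min_corr)
        correction = min_corr;
    if (correction > max_corr)
        correction = max_corr;
    return correction;
}
```

With a 1024-sample window and 500 samples of accumulated drift, only 11
more samples of positive correction remain available, so a fragment that
would align best 100 samples later has to settle for a worse match.
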
When tempo is near 1 or less than 1, the input audio fragments tend to
re-align to their original position in the input waveform. The
cumulative alignment correction restricts the overlap range, eventually
forcing the fragments into less optimal alignment. In my experience
this becomes more obvious as tempo decreases; however, I have never
tried tempo less than 0.5. It should work, but does anyone really need
it?
I didn't know anything about WSOLA until about a month ago. I actually
wrote a test app to help me visualize the waveform and to experiment
with the algorithm. The source code is here in case anyone is
interested: http://aragog.com/~pavel/src/apprenticeaudio/
There are Windows binaries available (although I am not certain they use
the latest alignment mechanism):
http://aragog.com/~pavel/download/apprenticeaudio/
Sorry, this probably wasn't the short answer you were asking for.
Pavel