[FFmpeg-devel] [PATCH] SSE optimization for DCA decoder

Alexander E. Patrakov patrakov
Fri Aug 29 06:55:53 CEST 2008

David Conrad wrote:

> Hi,
> Attached gives me about a 45% faster overall DCA decode on my penryn.
> Name suggestions for the function welcome.

Hi David,

I think that we should completely rewrite this QMF instead (at the very
least, in order to get rid of the name "subband_fir_noidea") and express it
in terms of convolution (or scalar product) combined with a 32-point DCT
(which is already available in the optimized form). When writing my DCA
encoder, I had to treat this QMF as a linear black-box instead of
understanding the existing code. The reason is that the code is almost a
verbatim copy from the spec, and the spec has been obviously obfuscated in
order to make writing the encoder harder. Thus, I want to share some
algebraic properties of the transform, in order to inspire you to write a
better C implementation of it (based on this understanding).

First, don't read the spec. Instead, treat the existing qmf_32_subbands()
function as a black box. You may want to copy it to another file for your
experiments. Pass the following samples to raXin[]:

First sample:
(1, 0, 0, 0, ..., 0) [i.e., "1" in the 0-th subband and "0" in all other
and then 31 copies of the following sample:
(0, 0, 0, 0, ..., 0) [i.e., "0" in all samples].

The black box will output 512 PCM samples. Let's put them into the "double
prototype_filter[512]" array, and call them "response of the black box to
0-th subband". Analogously, you can obtain responses to all other subbands.
So far, you have 32 512-element vectors to work with (as a mathematician,
not as a programmer).

Key points:

1) Divide (element by element) the response to the 1-st subband by the
response to 0-th subband, and use gnuplot to view the result. You'll obtain
a cosine wave.

2) Divide (element by element) the response to the 2-nd subband minus the
response to the 1-st subband by the 0-th subband, and view the result in
gnuplot again. You'll obtain another cosine wave, with half the period.

3) Repeat with other subbands (divide this_subband - previous_subband by
prototype_filter, plot the result), and you'll get an idea how to calculate
the responses to them starting from prototype_filter[]. Here is where the
DCT comes in.

4) [irrelevant to the decoder, vital for the not-yet-released encoder] 
Starting from the 32 response vectors, please apply a circular shift to
each of them (i.e., samples that are shifted through the right end of the
vector, reappear from the left) by amounts that are divisible by 32. At the
end, you'll end up with 512 vectors. Note (by calculating scalar products)
that they form the orthogonal basis.

Then, apply some math to calculate many scalar products at once. Here is how
the transform looks in my encoder (slightly edited for this post, sorry for
line wrapping):

extern const int32_t cos_table[2048];
/* full period with obvious contents,
   e.g. cos_table[0]==2147483647
   and  cos_table[1024]==-2147483647 */
static inline int32_t band_delta_factor(int band, int sample_num)
        int index = band * (2 * sample_num + 1);
        if (band == 0)
                return 0x07ffffff;
                return cos_table[(index & 127) << 4] >> 3;

/* Access a PCM sample in the interleaved multichannel container */
static inline const int32_t *pcm_sample(dcaenc_context c, const int32_t
*container, int sample, int channel)
        return &container[sample * c->channels + channel];

static inline int32_t mul32(int32_t a, int32_t b)
        int64_t r = (int64_t)a * b;
        return (r + 0x80000000) >> 32; /* rounding */

static void dcaenc_subband_transform(dcaenc_context c, const int32_t *input)
        int ch, subs, i, k, j;

        for (ch = 0; ch < c->fullband_channels; ch++) {
                /* History is copied because it is also needed for PSY */
                int32_t hist[512]; /* circular buffer */
                int hist_start = 0;

                for (i = 0; i < 512; i++)
                        hist[i] = c->pcm_history[i][ch];

                for (subs = 0; subs < 32; subs++) {
                        int32_t accum[64];
                        int32_t resp;
                        int band;

                        /* Calculate the convolutions at once */
                        for (i = 0; i < 64; i++)
                                accum[i] = 0;

                        for (k = 48, i = hist_start, j = 0; i < 512; k = (k
+ 1) & 63, i++, j++)
                                accum[k] += mul32(hist[i],
                        for (i = 0; i < hist_start; k = (k + 1) & 63, i++,
                                accum[k] += mul32(hist[i],

                        for (k = 0; k < 32; k++)
                                accum[k] += accum[63 - k];

                        /* This can obviously be optimized using a 32-point
DCT, because band_delta_factor is just some cosines */
                        resp = 0;
                        for (band = 0; band < 32; band++) {
                                for (i = 0; i < 32; i++)
                                        resp += mul32(accum[i],
band_delta_factor(band, i));

                                c->subband_samples[subs][band][ch] = (band &
2) ? (-resp) : resp;

                        /* Copy in 32 new samples from input */
                        for (i = 0; i < 32; i++)
                                hist[i + hist_start] = *pcm_sample(c, input,
subs * 32 + i, ch);

                        hist_start = (hist_start + 32) & 511;

Alexander E. Patrakov

More information about the ffmpeg-devel mailing list