[FFmpeg-devel] [PATCH] SSE dct32() [Was: r23095 - in trunk/libavcodec: ...]

Vitor Sessak vitor1001
Fri Jun 11 23:34:21 CEST 2010


On 06/08/2010 04:04 PM, Michael Niedermayer wrote:
> On Tue, Jun 08, 2010 at 12:56:16PM +0200, Vitor Sessak wrote:
>> On 06/08/2010 01:52 AM, Michael Niedermayer wrote:
>>> On Sat, Jun 05, 2010 at 07:35:29AM +0200, Vitor Sessak wrote:
>>>> Moving discussion to -devel...
>>>>
>>>> On 05/31/2010 09:59 PM, Vitor Sessak wrote:
>>>>> On 05/14/2010 05:52 PM, Michael Niedermayer wrote:
>>>>>> On Fri, May 14, 2010 at 08:39:48AM +0200, Vitor Sessak wrote:
>>>>>>> Michael Niedermayer wrote:
>>>>>>>> On Tue, May 11, 2010 at 03:56:45PM -0400, Alex Converse wrote:
>>>>>>>>> On Tue, May 11, 2010 at 3:52 PM, michael<subversion at mplayerhq.hu>
>>>>>>>>> wrote:
>>>>>>>>>> Author: michael
>>>>>>>>>> Date: Tue May 11 21:52:42 2010
>>>>>>>>>> New Revision: 23095
>>>>>>>>>>
>>>>>>>>>> Log:
>>>>>>>>>> float based mp1/mp2/mp3 decoders.
>>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>> :)
>>>>>>>> btw, any volunteers to try to hook it up to our split radix dct
>>>>>>>> and/or simd optimize it?
>>>>>>>
>>>>>>> Without rdft or dct simd, our split radix code is slower. Ugly hack
>>>>>>> to test it attached.
>>>>>>
>>>>>> if dct32() is faster then it should be used by our generic dct code.
>>>>>> at least for the plain C case
>>>>>
>>>>> I've tried writing an SSE dct32(). It is much faster than the current C
>>>>> code. The only problem is that the current code in mpegaudiodec.c expects
>>>>> two arguments, one input (which is destroyed) and one output. OTOH,
>>>>> ff_dct_calc() does everything in-place, which does not fit well with the
>>>>> current mpegaudiodec.c code. Can you (or anyone else who knows
>>>>> mpegaudiodec.c well) fix it?
>>>>
>>>> I tried making mpegaudiodec.c use the same buffer for dct input and
>>>> output, and it is not trivial. It is much easier (and has no measurable
>>>> slowdown) to make ff_dct_calc() take both an input and an output pointer,
>>>> as in the attached patch.
>>>>
>>>> -Vitor
>>>
>>>>    avfft.c     |    2 +-
>>>>    binkaudio.c |    2 +-
>>>>    dct.c       |   40 +++++++++++++++++++++++-----------------
>>>>    fft-test.c  |    6 ++----
>>>>    fft.h       |   11 +++++++++--
>>>>    wmavoice.c  |    4 ++--
>>>>    6 files changed, 38 insertions(+), 27 deletions(-)
>>>> 91cf0cde9a50a47a8df3fbd171b35535abe00d16  dct_inout.diff
>>>
>>> ok if tested and no slowdown is confirmed
>>
>> I retested carefully and found a 3% slowdown. It is due to aliasing, which
>> prevents the compiler from unrolling the loops. I tried unrolling the loops
>> by hand, and afterwards it is as fast as before.
>>
>> Are you ok with the patch as is or ok if I apply another patch afterwards
>> unrolling the loops?
>
> I think that a 3% speed loss is significant, so I am definitely not ok with
> something that leads to such a speed loss.
>
> Also, if you test this patch + unroll against svn, I wonder how
> svn+unroll performs,
> as well as what code cache effects the unroll actually has in actual use.

Ok, I took some time to test it really carefully, and I gave up on making the
code as fast as the in-place version (to begin with, gcc always gets
register-starved). So I propose the attached patch. At least the faster
code can be used by the common DCT framework, and it makes it easier to add
ASM optimisations.
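
To illustrate the aliasing issue mentioned earlier in the thread, here is a
minimal sketch. It is not the code from the attached patch: the function
names and the simple butterfly stage are hypothetical, chosen only to show
the difference between an in-place interface and one with separate input
and output pointers. With two unqualified pointers the compiler must assume
that a store to the output may change the input, which limits reordering
and unrolling. The C99 `restrict` qualifier is one way to promise that the
buffers do not overlap. Whether that would recover the measured 3% is not
something this thread establishes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration only; these are not FFmpeg functions. */

/* In-place variant: one buffer, inputs are read into locals before
 * either output is written. */
static void butterfly_inplace(float *buf, size_t n)
{
    for (size_t i = 0; i < n / 2; i++) {
        float a = buf[i];
        float b = buf[n - 1 - i];
        buf[i]         = a + b;
        buf[n - 1 - i] = a - b;
    }
}

/* Out-of-place variant: `restrict` tells the compiler that `out` and
 * `in` never overlap, so stores to `out` cannot modify `in`. */
static void butterfly_outofplace(float *restrict out,
                                 const float *restrict in, size_t n)
{
    for (size_t i = 0; i < n / 2; i++) {
        out[i]         = in[i] + in[n - 1 - i];
        out[n - 1 - i] = in[i] - in[n - 1 - i];
    }
}

/* Checks that both variants agree on a small input. */
static void butterfly_selfcheck(void)
{
    float src[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    float dst[4];

    butterfly_outofplace(dst, src, 4);
    butterfly_inplace(src, 4);

    for (size_t i = 0; i < 4; i++)
        assert(dst[i] == src[i]);
}
```

Calling `butterfly_outofplace()` with overlapping buffers would be undefined
behaviour, so a caller that needs in-place operation has to use the
in-place variant.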

-Vitor
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dct32_new.diff
Type: text/x-patch
Size: 18030 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20100611/79a6ef3a/attachment.bin>


