[FFmpeg-devel] [PATCH] SSE dct32() [Was: r23095 - in trunk/libavcodec: ...]

Fri Jun 18 15:27:26 CEST 2010

On 06/11/2010 11:34 PM, Vitor Sessak wrote:
> On 06/08/2010 04:04 PM, Michael Niedermayer wrote:
>> On Tue, Jun 08, 2010 at 12:56:16PM +0200, Vitor Sessak wrote:
>>> On 06/08/2010 01:52 AM, Michael Niedermayer wrote:
>>>> On Sat, Jun 05, 2010 at 07:35:29AM +0200, Vitor Sessak wrote:
>>>>> Moving discussion to -devel...
>>>>>
>>>>> On 05/31/2010 09:59 PM, Vitor Sessak wrote:
>>>>>> On 05/14/2010 05:52 PM, Michael Niedermayer wrote:
>>>>>>> On Fri, May 14, 2010 at 08:39:48AM +0200, Vitor Sessak wrote:
>>>>>>>> Michael Niedermayer wrote:
>>>>>>>>> On Tue, May 11, 2010 at 03:56:45PM -0400, Alex Converse wrote:
>>>>>>>>>> On Tue, May 11, 2010 at 3:52 PM, michael <subversion at mplayerhq.hu> wrote:
>>>>>>>>>>> Author: michael
>>>>>>>>>>> Date: Tue May 11 21:52:42 2010
>>>>>>>>>>> New Revision: 23095
>>>>>>>>>>>
>>>>>>>>>>> Log:
>>>>>>>>>>> float based mp1/mp2/mp3 decoders.
>>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>> :)
>>>>>>>>> btw, any volunteers to try to hook it up to our split radix dct and/or
>>>>>>>>> simd optimize it?
>>>>>>>>
>>>>>>>> Without rdft or dct simd, our split radix code is slower. Ugly hack to
>>>>>>>> test it attached.
>>>>>>>
>>>>>>> if dct32() is faster then it should be used by our generic dct code.
>>>>>>> at least for the plain C case
>>>>>>
>>>>>> I've tried writing an SSE dct32(). It is much faster than the current C
>>>>>> code. The only problem is that the current code in mpegaudiodec.c expects
>>>>>> two arguments, one input (which is destroyed) and one output. OTOH,
>>>>>> ff_dct_calc() does everything in-place, which does not fit well with the
>>>>>> current mpegaudiodec.c code. Can you (or anyone else who knows
>>>>>> mpegaudiodec.c well) fix it?
>>>>>
>>>>> I tried making mpegaudiodec.c use the same buffer for dct input and
>>>>> output, and it is not trivial. It is much easier (and has no measurable
>>>>> slowdown) to make ff_dct_calc() take both an input and an output pointer,
>>>>> as in the attached patch.
>>>>>
>>>>> -Vitor
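
To make the interface mismatch concrete, here is a minimal sketch of wrapping an in-place routine behind an input/output pointer pair. The names toy_butterfly_inplace() and toy_butterfly() are hypothetical and are not FFmpeg API; the sum/difference stage only stands in for a real DCT, and the copy-then-transform approach is an assumption for illustration, not what the attached patch does.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical in-place stage: sum/difference butterfly over n samples.
 * It overwrites its only buffer, like an in-place transform would. */
void toy_butterfly_inplace(float *data, int n)
{
    assert(n > 0 && n % 2 == 0);
    for (int i = 0; i < n / 2; i++) {
        float a = data[i];
        float b = data[n - 1 - i];
        data[i]         = a + b;
        data[n - 1 - i] = a - b;
    }
}

/* Hypothetical two-pointer wrapper. When out != in the input is copied
 * first and stays untouched; when out == in it degenerates to the
 * in-place call. Partially overlapping buffers are not supported. */
void toy_butterfly(float *out, const float *in, int n)
{
    if (out != in)
        memcpy(out, in, (size_t)n * sizeof(*out));
    toy_butterfly_inplace(out, n);
}
```

A caller that needs separate buffers passes two pointers; a caller that wants the old in-place behaviour passes the same pointer twice.
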
>>>>
>>>>> avfft.c | 2 +-
>>>>> binkaudio.c | 2 +-
>>>>> dct.c | 40 +++++++++++++++++++++++-----------------
>>>>> fft-test.c | 6 ++----
>>>>> fft.h | 11 +++++++++--
>>>>> wmavoice.c | 4 ++--
>>>>> 6 files changed, 38 insertions(+), 27 deletions(-)
>>>>> 91cf0cde9a50a47a8df3fbd171b35535abe00d16 dct_inout.diff
>>>>
>>>> ok if tested and no slowdown is confirmed
>>>
>>> I retested carefully and found a 3% slowdown. It is due to aliasing, which
>>> does not allow the compiler to unroll the loops. I tested unrolling the
>>> loops by hand, and afterwards it is as fast as before.
>>>
>>> Are you ok with the patch as is, or ok if I apply another patch afterwards
>>> unrolling the loops?
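
The aliasing problem mentioned above can be illustrated with a small sketch. With separate input and output pointers the compiler has to assume that a store through one may change what the other points to, which limits how freely it can reorder or unroll the loop. The C99 restrict qualifier is one way to promise that the buffers do not overlap. The function names below are hypothetical, and nothing here claims that restrict would recover the 3% measured in this thread.

```c
#include <assert.h>

/* Without restrict: after the store to out[i], the compiler must assume
 * in[] may have changed, so it reloads in[i] and in[n-1-i] for the
 * second statement. */
void toy_stage_may_alias(float *out, const float *in, int n)
{
    assert(n > 0 && n % 2 == 0);
    for (int i = 0; i < n / 2; i++) {
        out[i]         = in[i] + in[n - 1 - i];
        out[n - 1 - i] = in[i] - in[n - 1 - i];
    }
}

/* With restrict: the caller promises that out and in do not overlap,
 * so loaded values may stay in registers across the stores. Calling
 * this with overlapping buffers is undefined behaviour. */
void toy_stage_no_alias(float *restrict out, const float *restrict in, int n)
{
    assert(n > 0 && n % 2 == 0);
    for (int i = 0; i < n / 2; i++) {
        out[i]         = in[i] + in[n - 1 - i];
        out[n - 1 - i] = in[i] - in[n - 1 - i];
    }
}
```

Both functions give the same result for non-overlapping buffers; only the second lets the compiler rely on that.
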
>>
>> I think that a 3% speed loss is significant, so I am definitely not ok with
>> something that leads to such a speed loss.
>>
>> also, if you test this patch + unroll against svn, I wonder how svn+unroll
>> performs, as well as what code cache effects the unroll actually has in
>> actual use
>
> Ok, I took some time to test it really carefully and I gave up on making
> code as fast as the in-place version (to begin with, gcc always gets
> register-starved). So I propose the attached patch. At least the faster
> code can be used by the common DCT framework, and it makes it easier to add
> ASM optimisations.

Ping?

-Vitor