[FFmpeg-devel] Mixed data type in SIMD code?

Wed Mar 5 15:24:14 CET 2008

On Tue, Mar 04, 2008 at 07:36:42PM -0700, Loren Merritt wrote:
> On Tue, 4 Mar 2008, Michael Niedermayer wrote:
>> On Mon, Mar 03, 2008 at 04:30:08PM -0700, Loren Merritt wrote:
>>> On Mon, 3 Mar 2008, Michael Niedermayer wrote:
>>>>
>>>> Also i doubt we use or ever will use packed double.
>>>
>>> flac encoder does. Single isn't precise enough for a linear sum of up
>>> to 16k elements. Reordering the sum into a tree gave it half-way
>>> decent precision, but also made it as slow as double.
>>
>> What about something like:
>>
>> for(i=0; i<16000;){
>>    float sum=0;
>>    do{
>>        sum+= whatever[i++];
>>    }while(i&127);
>>    double_sum += sum;
>> }
>
> done.
>
> core2:
> 2039632 dezicycles in autocorr_double_c, 65536 runs, 0 skips
> 771026 dezicycles in autocorr_double_sse2, 65536 runs, 0 skips
> 524713 dezicycles in autocorr_float_sse1, 65536 runs, 0 skips
> 500609 dezicycles in autocorr_float_sse2, 65534 runs, 2 skips
> 432458 dezicycles in autocorr_float_ssse3, 65535 runs, 1 skips
> overall: 4.8%
>
> k8:
> 1776170 dezicycles in autocorr_double_c, 65534 runs, 2 skips
> 1062022 dezicycles in autocorr_double_sse2, 65535 runs, 1 skips
> 932452 dezicycles in autocorr_float_sse1, 65533 runs, 3 skips
> 911259 dezicycles in autocorr_float_sse2, 65534 runs, 2 skips
> overall: 2.5%

Very nice, especially for those without SSE2 :)

>
> Presumably a cpu without sse2 would gain more.
>
> cost: Some settings don't notice the reduced precision, some lose up to 
> .09% bitrate. This doesn't vary much with the length of the single 
> precision loop.

I think single precision should only be enabled for lower compression
levels, or where it doesn't lose any bitrate ...
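
For reference, the blocked accumulation from the pseudocode quoted earlier can
be written out in plain C roughly as follows. This is only a sketch of the
idea, not the code from the patch: blocked_sum and BLOCK_LEN are hypothetical
names, and 128 is simply the block length implied by the "i&127" test. Unlike
the pseudocode, it also handles lengths that are not a multiple of the block.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical block length: how many floats are summed in single
 * precision before the partial sum is folded into the double total. */
#define BLOCK_LEN 128

/* Sum n floats. Short runs are accumulated in single precision, and
 * each run's partial sum is then added to a double-precision total. */
static double blocked_sum(const float *x, size_t n)
{
    double total = 0.0;
    size_t i = 0;

    assert(x != NULL || n == 0);

    while (i < n) {
        /* End of the current block, clamped to the array length. */
        size_t end = (n - i > BLOCK_LEN) ? i + BLOCK_LEN : n;
        float partial = 0.0f;

        for (; i < end; i++)
            partial += x[i];

        total += partial; /* promote to double once per block */
    }
    return total;
}
```

The rounding error of each float partial sum grows with the block length, so
a shorter block trades speed for precision.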

[...]
> +#define float_iterations 64 // how long to accumulate single-precision before upgrading

This should be a doxygen comment, and the macro name should be uppercase.
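
Something along these lines, as a sketch: FLOAT_ITERATIONS is just the
uppercased form of the macro in the patch, and the comment wording is only a
suggestion.

```c
/**
 * Number of iterations accumulated in single precision before the
 * partial sum is added to the double-precision total.
 */
#define FLOAT_ITERATIONS 64
```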

[...]
> +#define CORR2_LOOP_SSE1(MULT,step) {\
> +    double s0=1.0, s1=1.0;\
> +    DECLARE_ALIGNED_8( struct {float x[4];}, sumf );\
> +    while(j<0) {\
> +        asm volatile(\
> +            OP2(xorps, 4,4, 5,5)\

> +            "1:                     \n\t"\
> +            MULT\
> +            OP2(addps, 1,4, 2,5)\
> +            "add $"#step"*16, %0    \n\t"\
> +            "sub $"#step", %1       \n\t"\
> +            "jg 1b                  \n\t"\

The "sub $"#step", %1" instruction can be avoided by
changing %0/%3/%4 appropriately.

> +            OP2(movhlps,  4,1, 5,2)\
> +            OP2(addps,    1,4, 2,5)\
> +            "movlps  %%xmm4,   %2   \n\t"\
> +            "movlps  %%xmm5, 8+%2   \n\t"\
> +            :"+&r"(j), "+&r"(k), "=m"(sumf)\
> +            :"r"(data1+len), "r"(data1+len-i)\
> +            :"xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5"\
> +        );\
> +        k = float_iterations;\
> +        s0 += (double)sumf.x[0] + (double)sumf.x[1];\
> +        s1 += (double)sumf.x[2] + (double)sumf.x[3];\
> +    }\
> +    autoc_buf[i] = s0;\
> +    autoc_buf[i+1] = s1;\

Instead of calculating lags x and x+1 over the whole array, it may be faster
to calculate all lags over one block first and then repeat for the next block,
that is, if the whole array is larger than the cache.
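
In scalar C the loop order I mean would look roughly like this. It is a sketch
under assumptions, not a replacement for the SIMD code: autocorr_blocked and
its block parameter are hypothetical, and it assumes the plain definition
autoc[lag] = sum of data[i]*data[i-lag] for i from lag to len-1, accumulated
in double throughout.

```c
#include <assert.h>
#include <stddef.h>

/* Autocorrelation for lags 0..max_lag, processed block by block.
 * For each block of the input, every lag is updated before moving
 * on, so the block is reused while it is still in cache.
 * autoc must have room for max_lag + 1 values. */
static void autocorr_blocked(const float *data, size_t len, size_t max_lag,
                             size_t block, double *autoc)
{
    assert(data != NULL && autoc != NULL && block > 0);

    for (size_t lag = 0; lag <= max_lag; lag++)
        autoc[lag] = 0.0;

    for (size_t start = 0; start < len; start += block) {
        /* End of the current block, clamped to the array length. */
        size_t end = (len - start > block) ? start + block : len;

        for (size_t lag = 0; lag <= max_lag; lag++) {
            /* Skip indices where data[i - lag] would be out of range. */
            size_t i = (start > lag) ? start : lag;
            double sum = 0.0;

            for (; i < end; i++)
                sum += (double)data[i] * data[i - lag];

            autoc[lag] += sum;
        }
    }
}
```

Each block also reads up to max_lag samples from just before it, so the block
size should leave room in the cache for those as well.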

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

I hate to see young programmers poisoned by the kind of thinking
Ulrich Drepper puts forward since it is simply too narrow -- Roman Shaposhnik