[FFmpeg-devel] Mixed data type in SIMD code?
Michael Niedermayer
michaelni
Wed Mar 5 15:24:14 CET 2008
On Tue, Mar 04, 2008 at 07:36:42PM -0700, Loren Merritt wrote:
> On Tue, 4 Mar 2008, Michael Niedermayer wrote:
>> On Mon, Mar 03, 2008 at 04:30:08PM -0700, Loren Merritt wrote:
>>> On Mon, 3 Mar 2008, Michael Niedermayer wrote:
>>>>
>>>> Also i doubt we use or ever will use packed double.
>>>
>>> flac encoder does. Single isn't precise enough for a linear sum of up
>>> to 16k elements. Reordering the sum to a tree made it half-way
>>> decent precision, but also made it as slow as double.
>>
>> What about something like:
>>
>> for(i=0; i<16000;){
>> float sum=0;
>> do{
>> sum+= whatever[i++];
>> }while(i&127);
>> double_sum += sum;
>> }
>
> done.
>
> core2:
> 2039632 dezicycles in autocorr_double_c, 65536 runs, 0 skips
> 771026 dezicycles in autocorr_double_sse2, 65536 runs, 0 skips
> 524713 dezicycles in autocorr_float_sse1, 65536 runs, 0 skips
> 500609 dezicycles in autocorr_float_sse2, 65534 runs, 2 skips
> 432458 dezicycles in autocorr_float_ssse3, 65535 runs, 1 skips
> overall: 4.8%
>
> k8:
> 1776170 dezicycles in autocorr_double_c, 65534 runs, 2 skips
> 1062022 dezicycles in autocorr_double_sse2, 65535 runs, 1 skips
> 932452 dezicycles in autocorr_float_sse1, 65533 runs, 3 skips
> 911259 dezicycles in autocorr_float_sse2, 65534 runs, 2 skips
> overall: 2.5%
Very nice, especially for those without sse2 :)
>
> Presumably a cpu without sse2 would gain more.
>
> cost: Some settings don't notice the reduced precision, some lose up to
> .09% bitrate. This doesn't vary much with the length of the single
> precision loop.
I think single precision should only be enabled for lower compression
levels or where it doesn't lose any bitrate ...
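
For reference, a plain C sketch of the accumulation scheme under discussion:
short runs are summed in single precision and each partial sum is folded into
a double. The names blocked_sum and block_len are made up for illustration and
are not from the patch, which does this in SSE asm with a fixed run length.

```c
#include <stddef.h>

/* Sum n floats: accumulate runs of at most block_len elements in single
 * precision, then add each partial sum to a double-precision total.
 * blocked_sum and block_len are hypothetical names, not from the patch. */
double blocked_sum(const float *v, size_t n, size_t block_len)
{
    double total = 0.0;
    size_t i = 0;

    if (block_len == 0)
        block_len = 1;              /* avoid an endless loop on bad input */

    while (i < n) {
        /* end of the current run, clamped to the array length */
        size_t end = (n - i < block_len) ? n : i + block_len;
        float partial = 0.0f;

        for (; i < end; i++)
            partial += v[i];        /* cheap single-precision inner loop */
        total += partial;           /* precision upgrade once per run */
    }
    return total;
}
```

A shorter run keeps the rounding error of each partial sum smaller but pays
for the float-to-double conversion more often.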
[...]
> +#define float_iterations 64 // how long to accumulate single-precision before upgrading
Should be a doxygen comment and uppercase
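
Something along these lines, as a sketch only; the name FLOAT_ITERATIONS is
just the existing one uppercased and the comment wording is a suggestion:

```c
/**
 * Number of samples to accumulate in single precision before the
 * partial sum is added to the double-precision total.
 */
#define FLOAT_ITERATIONS 64
```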
[...]
> +#define CORR2_LOOP_SSE1(MULT,step) {\
> + double s0=1.0, s1=1.0;\
> + DECLARE_ALIGNED_8( struct {float x[4];}, sumf );\
> + while(j<0) {\
> + asm volatile(\
> + OP2(xorps, 4,4, 5,5)\
> + "1: \n\t"\
> + MULT\
> + OP2(addps, 1,4, 2,5)\
> + "add $"#step"*16, %0 \n\t"\
> + "sub $"#step", %1 \n\t"\
> + "jg 1b \n\t"\
the "sub $"#step", %1 \n\t"\ can be avoided by
changing %0/%3/%4 appropriately
> + OP2(movhlps, 4,1, 5,2)\
> + OP2(addps, 1,4, 2,5)\
> + "movlps %%xmm4, %2 \n\t"\
> + "movlps %%xmm5, 8+%2 \n\t"\
> + :"+&r"(j), "+&r"(k), "=m"(sumf)\
> + :"r"(data1+len), "r"(data1+len-i)\
> + :"xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5"\
> + );\
> + k = float_iterations;\
> + s0 += (double)sumf.x[0] + (double)sumf.x[1];\
> + s1 += (double)sumf.x[2] + (double)sumf.x[3];\
> + }\
> + autoc_buf[i] = s0;\
> + autoc_buf[i+1] = s1;\
Instead of calculating lags x and x+1 over the whole array, maybe it's faster
to calculate all lags over one block first and then repeat for the next block.
That is, if the whole array is larger than the cache.
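
A rough scalar sketch of that loop order, to show the idea only. It assumes
float input and sums starting at 0.0, unlike the patch. The names
autocorr_by_lag, autocorr_by_block and block_len are invented here.

```c
#include <stddef.h>

/* Reference order: one full pass over the data for every lag. */
void autocorr_by_lag(const float *x, size_t len, size_t max_lag,
                     double *autoc)
{
    for (size_t lag = 0; lag <= max_lag; lag++) {
        double s = 0.0;
        for (size_t i = lag; i < len; i++)
            s += (double)x[i] * x[i - lag];
        autoc[lag] = s;
    }
}

/* Block order: finish every lag for one block before touching the next,
 * so the working set is roughly block_len + max_lag samples. */
void autocorr_by_block(const float *x, size_t len, size_t max_lag,
                       size_t block_len, double *autoc)
{
    if (block_len == 0)
        block_len = 1;              /* avoid an endless loop on bad input */

    for (size_t lag = 0; lag <= max_lag; lag++)
        autoc[lag] = 0.0;

    for (size_t start = 0; start < len; ) {
        size_t end = (len - start < block_len) ? len : start + block_len;

        for (size_t lag = 0; lag <= max_lag; lag++) {
            /* x[i - lag] may reach back into the previous block */
            size_t i = start > lag ? start : lag;
            double s = 0.0;
            for (; i < end; i++)
                s += (double)x[i] * x[i - lag];
            autoc[lag] += s;
        }
        start = end;
    }
}
```

Whether this wins depends on block_len fitting in cache together with the
max_lag samples of overlap; that would need benchmarking.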
[...]
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
I hate to see young programmers poisoned by the kind of thinking
Ulrich Drepper puts forward since it is simply too narrow -- Roman Shaposhnik