[Ffmpeg-devel] [PATCH] lowres chroma bug

Thu Feb 8 20:40:22 CET 2007

On Thu, 8 Feb 2007, Oleg Metelitsa wrote:
>>> Of course that would only be done for avg_h264_chroma_mc2_mmx2, not for
>>> avg_h264_chroma_mc{4,8}_mmx2.  Maybe this is faster than using the
>>> 16-bit move?  The same can be done for the put version too:
>>>
>>> @@ -1376,1 +1376,2 @@
>>> -#define H264_CHROMA_OP4(S,D,T)
>>> +#define H264_CHROMA_OP4(S,D,T) "movd 2+" #S ", " #T "\n\t"\
>>> +                               "punpcklwd " #T ", " #D "\n\t"
>
> Why not use one SSE integer instruction instead of two MMX
> instructions?
>
> So we will have:
>
> #define H264_CHROMA_OP2(S,D,T)   "pinsrw $1, 2+" #S ", " #D " \n\t"
>
> instead of
>
>>> +#define H264_CHROMA_OP2(S,D,T) "movd 2+" #S ", " #T "\n\t"\
>>> +                               "punpcklwd " #T ", " #D "\n\t"

Because pinsrw is slow. I haven't benchmarked that particular code, but
according to my AMD (K8) optimization manual,
pinsrw: latency at least 9 cycles, sometimes more
movd (destination = memory): latency 2
punpcklwd: latency 2
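
Taking the figures above at face value, the dependent movd + punpcklwd
pair costs roughly 2 + 2 = 4 cycles against at least 9 for pinsrw. The two
sequences are not identical, though: they agree only in the low 32 bits of
the destination, which is all a 2-pixel-wide chroma block needs. The
following is a minimal scalar sketch in C to illustrate that, not the
actual dsputil code. The mm_t type and the model_* helpers are hypothetical
names invented for this illustration; they model a 64-bit MMX register as
four 16-bit lanes.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of a 64-bit MMX register: four 16-bit lanes. */
typedef struct { uint16_t w[4]; } mm_t;

/* Models "pinsrw $1, 2+S, D": replace lane 1 of D with the word at S+2. */
static mm_t model_pinsrw1(mm_t d, const uint8_t *s)
{
    memcpy(&d.w[1], s + 2, 2);   /* reads exactly 2 bytes */
    return d;
}

/* Models "movd 2+S, T": load 32 bits at S+2 into lanes 0..1, zero the rest. */
static mm_t model_movd(const uint8_t *s)
{
    mm_t t = {{0, 0, 0, 0}};
    memcpy(&t.w[0], s + 2, 4);   /* reads 4 bytes, 2 more than pinsrw */
    return t;
}

/* Models "punpcklwd T, D": interleave low words, D = {D0, T0, D1, T1}. */
static mm_t model_punpcklwd(mm_t d, mm_t t)
{
    mm_t r = {{ d.w[0], t.w[0], d.w[1], t.w[1] }};
    return r;
}

/* Returns 1 when lanes 0 and 1 (the low 32 bits) of a and b match. */
static int low32_equal(mm_t a, mm_t b)
{
    return a.w[0] == b.w[0] && a.w[1] == b.w[1];
}
```

In this model, lanes 2 and 3 differ between the two variants (pinsrw
leaves them alone, punpcklwd overwrites them), and the movd variant reads
two bytes further into the source than pinsrw does.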

--Loren Merritt