[FFmpeg-devel] [patch][OpenHEVC]added ASM functions for epel + qpel

Sat Mar 8 13:54:02 CET 2014

Hi,

> +cglobal hevc_put_hevc_epel_hv12_8, 7, 11, 12 , dst, dststride, src, srcstride, height, mx, my, r3src, tsrc, rfilter
[..]
> +.loop
> +    EPEL_LOAD          8, srcq, 1, 12
> +    EPEL_COMPUTE       8, 6, m14, m15
> +    SWAP              m4, m0
> +    lea            tsrcq, [srcq + srcstrideq]
> +    EPEL_LOAD          8, tsrcq, 1, 12
> +    EPEL_COMPUTE       8, 6, m14, m15
> +    SWAP              m5, m0
> +    lea            tsrcq, [tsrcq + srcstrideq]
> +    EPEL_LOAD          8, tsrcq, 1, 12
> +    EPEL_COMPUTE       8, 6, m14, m15
> +    SWAP              m6, m0
> +    lea            tsrcq, [tsrcq + srcstrideq]
> +    EPEL_LOAD          8, tsrcq, 1, 12
> +    EPEL_COMPUTE       8, 6, m14, m15
> +    SWAP              m7, m0
> +    punpcklwd         m0, m4, m5
> +    punpckhwd         m1, m4, m5
> +    punpcklwd         m2, m6, m7
> +    punpckhwd         m3, m6, m7
> +    EPEL_COMPUTE      14, 8, m12, m13
> +    PEL_STORE8      dstq, m0, m1
[.. that again for next 4 pixels ..]
> +    LOOP_END         dst, dststride, src, srcstride
> +    RET

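So, this is going to be _hugely_ inefficient, right? You're basically
redoing all 4 horizontal passes for every single output line (i.e.
4 * n_lines horizontal passes in total), rather than 3 + n_lines, which is
all you need if you keep the previous 3 filtered rows around.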
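
To make the count concrete, here's a rough scalar C model of the two
approaches. It's only a sketch, not the asm from the patch: the function
names, the MAX_W limit, the h_passes counter and the element-based strides
are all made up for illustration. It assumes src has a 1-pixel margin on the
left/top and 2 pixels on the right/bottom, as a 4-tap filter needs.

```c
#include <assert.h>
#include <stdint.h>

enum { TAPS = 4, MAX_W = 64 };  /* EPEL is 4-tap; MAX_W is an arbitrary sketch limit */

static int h_passes;            /* counts horizontal row filters (illustration only) */

/* Horizontal 4-tap filter of one source row into 16-bit intermediates. */
static void filter_row_h(int16_t *tmp, const uint8_t *src, int width,
                         const int8_t *fh)
{
    for (int x = 0; x < width; x++) {
        int sum = 0;
        for (int k = 0; k < TAPS; k++)
            sum += fh[k] * src[x + k - 1];
        tmp[x] = (int16_t)sum;
    }
    h_passes++;
}

/* Vertical 4-tap filter over 4 cached rows; slot0 is the slot of the top row. */
static void filter_col_v(int16_t *dst, int16_t rows[TAPS][MAX_W], int slot0,
                         int width, const int8_t *fv)
{
    for (int x = 0; x < width; x++) {
        int sum = 0;
        for (int k = 0; k < TAPS; k++)
            sum += fv[k] * rows[(slot0 + k) % TAPS][x];
        dst[x] = (int16_t)(sum >> 6);  /* arithmetic shift assumed */
    }
}

/* What the quoted loop does: 4 horizontal passes per output line. */
static void epel_hv_naive(int16_t *dst, int dststride, const uint8_t *src,
                          int srcstride, int width, int height,
                          const int8_t *fh, const int8_t *fv)
{
    int16_t rows[TAPS][MAX_W];
    assert(width <= MAX_W);
    for (int y = 0; y < height; y++) {
        for (int k = 0; k < TAPS; k++)
            filter_row_h(rows[k], src + (y + k - 1) * srcstride, width, fh);
        filter_col_v(dst + y * dststride, rows, 0, width, fv);
    }
}

/* Keep the last 3 filtered rows: 3 priming passes + 1 pass per output line. */
static void epel_hv_cached(int16_t *dst, int dststride, const uint8_t *src,
                           int srcstride, int width, int height,
                           const int8_t *fh, const int8_t *fv)
{
    int16_t rows[TAPS][MAX_W];
    assert(width <= MAX_W);
    /* Source row r lives in slot (r + 1) % TAPS; prime rows -1, 0, 1. */
    for (int k = 0; k < TAPS - 1; k++)
        filter_row_h(rows[k], src + (k - 1) * srcstride, width, fh);
    for (int y = 0; y < height; y++) {
        /* Only the new bottom row (y + 2) needs a horizontal pass. */
        filter_row_h(rows[(y + 3) % TAPS], src + (y + 2) * srcstride, width, fh);
        filter_col_v(dst + y * dststride, rows, y % TAPS, width, fv);
    }
}
```

Both functions produce the same output; the first bumps h_passes
4 * height times, the second 3 + height times. In the asm, the "cache" would
be the registers holding the previous three EPEL_COMPUTE results instead of
a rows[] array.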

I can only imagine that you're doing that because you may not have enough
registers to cache 8+4 pixels (to make 12 in total), but really, if that's
the case, just write a C wrapper that calls the 8-wide and the 4-wide
functions. That'll be tons faster than this.
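
Something along these lines, as a sketch only. The put_epel_hv8 /
put_epel_hv4 names, the prototype and the element-based strides are
assumptions for illustration (loosely modelled on the cglobal argument
list), and the scalar ref_epel_hv with its coefficient table is just a
stand-in so the example compiles on its own; in practice the two callees
would be the existing 8- and 4-wide asm functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative 4-tap coefficient sets for the stand-in only,
 * indexed by mx - 1 / my - 1; not copied from the patch. */
static const int8_t example_filters[7][4] = {
    { -2, 58, 10, -2 }, { -4, 54, 16, -2 }, { -6, 46, 28, -4 },
    { -4, 36, 36, -4 }, { -4, 28, 46, -6 }, { -2, 16, 54, -4 },
    { -2, 10, 58, -2 },
};

/* Scalar stand-in for an hv kernel of arbitrary width (hypothetical). */
static void ref_epel_hv(int16_t *dst, ptrdiff_t dststride,
                        const uint8_t *src, ptrdiff_t srcstride,
                        int width, int height, int mx, int my)
{
    const int8_t *fh = example_filters[mx - 1];
    const int8_t *fv = example_filters[my - 1];
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int sum = 0;
            for (int j = 0; j < 4; j++) {
                int h = 0;
                for (int i = 0; i < 4; i++)
                    h += fh[i] * src[(y + j - 1) * srcstride + x + i - 1];
                sum += fv[j] * h;
            }
            dst[y * dststride + x] = (int16_t)(sum >> 6);  /* arithmetic shift assumed */
        }
    }
}

/* Stand-ins for the existing 8- and 4-pixel-wide functions (names assumed). */
static void put_epel_hv8(int16_t *dst, ptrdiff_t dststride,
                         const uint8_t *src, ptrdiff_t srcstride,
                         int height, int mx, int my)
{
    ref_epel_hv(dst, dststride, src, srcstride, 8, height, mx, my);
}

static void put_epel_hv4(int16_t *dst, ptrdiff_t dststride,
                         const uint8_t *src, ptrdiff_t srcstride,
                         int height, int mx, int my)
{
    ref_epel_hv(dst, dststride, src, srcstride, 4, height, mx, my);
}

/* The wrapper itself: 12 = 8 + 4, so do columns 0..7, then columns 8..11. */
static void put_epel_hv12(int16_t *dst, ptrdiff_t dststride,
                          const uint8_t *src, ptrdiff_t srcstride,
                          int height, int mx, int my)
{
    assert(height > 0 && mx >= 1 && mx <= 7 && my >= 1 && my <= 7);
    put_epel_hv8(dst,     dststride, src,     srcstride, height, mx, my);
    put_epel_hv4(dst + 8, dststride, src + 8, srcstride, height, mx, my);
}
```

The two calls touch disjoint output columns, so the result is the same as a
native 12-wide function, and each callee can keep its own filtered rows in
registers without needing room for all 12 pixels at once.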

> +cglobal hevc_put_hevc_epel_hv12_10, 7, 11, 12 , dst, dststride, src, srcstride, height, mx, my, r3src, tsrc, rfilter

Same comment for this one.

Ronald