[FFmpeg-devel] [patch][OpenHEVC]added ASM functions for epel + qpel

Tue Mar 4 13:29:53 CET 2014

Hi,

Continuing...

> +%define QPEL_H_LOAD4 QPEL_H_LOAD2
> +
> +%macro QPEL_H_LOAD8 0
> +    QPEL_H_LOAD2
> +    psrldq            m2, m0,4
> +    psrldq            m3, m0,6
> +%endmacro

These seem unused?

> > +%macro PEL_STORE2 3
> +    movd           [%1q], %2
> +%endmacro
> +%macro PEL_STORE4 3
> +    movq           [%1q], %2
> +%endmacro
> +%macro PEL_STORE6 3
> +    movq           [%1q], %2
> +    psrldq            %2, 8
> +    movd         [%1q+8], %2
> +%endmacro
> +%macro PEL_STORE8 3
> +    movdqa         [%1q], %2
> +%endmacro
> +%macro PEL_STORE12 3
> +    movdqa         [%1q], %2
> +    movq        [%1q+16], %3
> +%endmacro
> +%macro PEL_STORE16 3
> +    PEL_STORE8        %1, %2, %3
> +    movdqa      [%1q+16], %3
> +%endmacro

Most of these macros don't use %3, so make it take 2 arguments? Also I
wonder if you really need macros for some of this...

> +%macro LOOP_INIT 1
> +    pxor             m15, m15                    ; set register at zero

This comment is a little meh ("i++; // increment i").

> +%macro MC_PIXEL_COMPUTE2_8 0
> +    punpcklbw         m1, m0, m15
> +    psllw             m1, 6
> +%endmacro

Why are you unpacking into m1? You're not using m0 afterwards.

> +%macro MC_PIXEL_COMPUTE16_8 0
> +    MC_PIXEL_COMPUTE2_8
> +    punpckhbw         m2, m0, m15
> +    psllw             m2, 6
> +%endmacro
> +
> +%macro MC_PIXEL_COMPUTE2_10 0
> +    psllw             m1, m0, 4
> +%endmacro

So, these are only used for full-pixel MC, right? I think for these kind of
situations, you can see that splitting MC and weighting in two functions is
just not a good idea. I'm going to guess that the large majority of calls
doesn't actually use weighting in practice, so you're effectively (for e.g.
a still) unpacking to words, just to pack back to bytes afterwards.

I'd strongly advise to look into merging weighting and MC together. I
understand it's more code, but most of your code here is boilerplate and I
feel it can be simplified a lot further. A lot of the extra code is just
"stuff" that's not really needed.

> +%define MC_PIXEL_COMPUTE4_8 MC_PIXEL_COMPUTE2_8
> +%define MC_PIXEL_COMPUTE8_8 MC_PIXEL_COMPUTE2_8
[..]
> +%define MC_PIXEL_COMPUTE4_10 MC_PIXEL_COMPUTE2_10
> +%define MC_PIXEL_COMPUTE8_10 MC_PIXEL_COMPUTE2_10

E.g. this isn't needed.

> +%macro EPEL_COMPUTE 2

Here you start needing comments:

%macro EPEL_COMPUTE 2 ; # bitsize (8 or 10), width (4, 8, 16, etc.)

> +%if %1 == 8
> +    pmaddubsw         m0, m13
> +    pmaddubsw         m2, m14
> +    paddw             m0, m2
> +%if %2 == 16
> +    pmaddubsw         m1, m13
> +    pmaddubsw         m3, m14
> +    paddw             m1, m3
> +%endif
> +%else
> +    pmaddwd           m0, m13
> +    pmaddwd           m2, m14
> +    pmaddwd           m1, m13
> +    pmaddwd           m3, m14
> +    paddd             m0, m2
> +    paddd             m1, m3

Why not if %2 == 8 for the second part (i.e. unneeded if w=4)?

> +%if %1 == 10
> +    psrad             m0, 2
> +    psrad             m1, 2
> +%endif
> +%if %1 == 14
> +    psrad             m0, 6
> +    psrad             m1, 6
> +%endif

psrad mX, %1-8

> +%macro EPEL_COMPUTE_14 0
> +    pmaddwd           m0, m11
> +    pmaddwd           m1, m11
> +    pmaddwd           m2, m12
> +    pmaddwd           m3, m12
> +    paddd             m0, m2
> +    paddd             m1, m3
> +    psrad             m0, 6
> +    psrad             m1, 6
> +    packssdw          m0, m1
> +%endmacro

If w<=4, you can remove 3 instructions here. I think you can merge tis
macro with EPEL_COMPUTE.

> +%macro QPEL_COMPUTE 2

Isn't this mostly identical to EPEL_COMPUTE?

> +INIT_XMM sse4                                    ; adds ff_ and _sse4 to
function name

This looks ssse3? Are you sure it's sse4?

> +cglobal hevc_put_hevc_pel_pixels2_8, 8, 14, 0 , dst, dststride, src,
srcstride,width,height
> +    LOOP_INIT        pixels_2_8
> +    movdqu            m0, [srcq]            ; load data from source

movd, or even pinsrw, not movdqu. Same for other functions below (w=4 =>
movd, w=6/8 => movq).

> +cglobal hevc_put_hevc_pel_pixels24_8, 8, 14, 15, dst, dststride, src,
srcstride,width,height, tdst, tsrc
> +    LOOP_INIT        pixels_24_8
> +    movdqu            m0, [srcq]
      ; load data from source
> +    MC_PIXEL_COMPUTE16_8
> +    PEL_STORE16      dst, m1, m2
> +    lea            tsrcq, [srcq + 16]
> +    movdqu            m0, [tsrcq]

movq

> +    lea            tdstq, [dstq + 32]
> +    MC_PIXEL_COMPUTE8_8
> +    PEL_STORE8     tdstq, m1, m2
> +    LOOP_END  pixels_24_8, dst, dststride, src, srcstride
> +    RET

I don't really think you're taking advantage of the bigger width here; this
sequence of instructions is essentially identical to if you had a C wrapper
for all functions larger than w=16. Make your macros more configurable and
load directly from [srcq+16] instead of the intermediate leas, that also
avoids needing an extra register for tsrcq/tdstq (and all instructions to
set stuff like that up).

Or just use a C wrapper and avoid all this extra cruft in the assembly. I
think for fullpel MC, it's fine to do fullwidth in assembly, but do use it
to its fullest extent.

> +;10bit pel_pixels
> +
> +cglobal hevc_put_hevc_pel_pixels2_10, 8, 14, 0 , dst, dststride, src,
srcstride,width,height
> +    LOOP_INIT        pixels_2_10
> +    movdqu            m0, [srcq]            ; load data from source
> +    MC_PIXEL_COMPUTE2_10
> +    PEL_STORE2       dst, m1, m2
> +    LOOP_END  pixels_2_10, dst, dststride, src, srcstride
> +    RET
> +
> +cglobal hevc_put_hevc_pel_pixels4_10, 8, 14, 0 , dst, dststride, src,
srcstride,width,height
> +    LOOP_INIT        pixels_4_10
> +    movdqu            m0, [srcq]            ; load data from source
> +    MC_PIXEL_COMPUTE2_10

The funny thing is you're using all these defines further up to have
MC_PIXEL_COMPUTE<n>_10 (and _8), and you end up not actually using them.

My rule is generally: use what leads to smallest source code if it doesn't
affect code generation. So just remove the MC_PIXEL_COMPUTE<n>_10 (and _8)
ones you don't need.

> +cglobal hevc_put_hevc_pel_pixels12_10, 8, 14, 0 , dst, dststride, src,
srcstride,width,height, tdst, tsrc
> +    LOOP_INIT        pixels_12_10
> +    movdqu            m0, [srcq]            ; load data from source
> +    MC_PIXEL_COMPUTE8_10
> +    PEL_STORE8       dst, m1, m2
> +    lea            tsrcq, [srcq + 16]
> +    movdqu            m0, [tsrcq]
> +    lea            tdstq, [dstq + 16]

> +    MC_PIXEL_COMPUTE8_10
> +    PEL_STORE4     tdstq, m1, m2
> +    LOOP_END  pixels_12_10, dst, dststride, src, srcstride
> +    RET

Same as further up - remove tsrcq/tdstq, make macros take arguments to
define position within srcq/dstq, and make use of the fact that you have an
assembly-function for full-width instead of a C wrapper.

> +cglobal hevc_put_hevc_pel_pixels24_10, 8, 14, 15, dst, dststride, src,
srcstride,width,height, tdst, tsrc
> +    LOOP_INIT        pixels_24_10
> +movdqu                m0, [srcq]            ; load data from source

Indent.

> +cglobal hevc_put_hevc_pel_pixels32_10, 8, 14, 15, dst, dststride, src,
srcstride,width,height, tdst, tsrc
> +    LOOP_INIT        pixels_32_10
> +movdqu                m0, [srcq]            ; load data from source

Indent.

More follows later.

Ronald