[FFmpeg-devel] [patch][OpenHEVC]added ASM functions for epel + qpel
Ronald S. Bultje
rsbultje at gmail.com
Tue Mar 4 13:29:53 CET 2014
Hi,
Continuing...
> +%define QPEL_H_LOAD4 QPEL_H_LOAD2
> +
> +%macro QPEL_H_LOAD8 0
> + QPEL_H_LOAD2
> + psrldq m2, m0,4
> + psrldq m3, m0,6
> +%endmacro
These seem unused?
> > +%macro PEL_STORE2 3
> + movd [%1q], %2
> +%endmacro
> +%macro PEL_STORE4 3
> + movq [%1q], %2
> +%endmacro
> +%macro PEL_STORE6 3
> + movq [%1q], %2
> + psrldq %2, 8
> + movd [%1q+8], %2
> +%endmacro
> +%macro PEL_STORE8 3
> + movdqa [%1q], %2
> +%endmacro
> +%macro PEL_STORE12 3
> + movdqa [%1q], %2
> + movq [%1q+16], %3
> +%endmacro
> +%macro PEL_STORE16 3
> + PEL_STORE8 %1, %2, %3
> + movdqa [%1q+16], %3
> +%endmacro
Most of these macros don't use %3, so make it take 2 arguments? Also I
wonder if you really need macros for some of this...
> +%macro LOOP_INIT 1
> + pxor m15, m15 ; set register at zero
This comment is a little meh ("i++; // increment i").
> +%macro MC_PIXEL_COMPUTE2_8 0
> + punpcklbw m1, m0, m15
> + psllw m1, 6
> +%endmacro
Why are you unpacking into m1? You're not using m0 afterwards.
> +%macro MC_PIXEL_COMPUTE16_8 0
> + MC_PIXEL_COMPUTE2_8
> + punpckhbw m2, m0, m15
> + psllw m2, 6
> +%endmacro
> +
> +%macro MC_PIXEL_COMPUTE2_10 0
> + psllw m1, m0, 4
> +%endmacro
So, these are only used for full-pixel MC, right? I think for these kind of
situations, you can see that splitting MC and weighting in two functions is
just not a good idea. I'm going to guess that the large majority of calls
doesn't actually use weighting in practice, so you're effectively (for e.g.
a still) unpacking to words, just to pack back to bytes afterwards.
I'd strongly advise to look into merging weighting and MC together. I
understand it's more code, but most of your code here is boilerplate and I
feel it can be simplified a lot further. A lot of the extra code is just
"stuff" that's not really needed.
> +%define MC_PIXEL_COMPUTE4_8 MC_PIXEL_COMPUTE2_8
> +%define MC_PIXEL_COMPUTE8_8 MC_PIXEL_COMPUTE2_8
[..]
> +%define MC_PIXEL_COMPUTE4_10 MC_PIXEL_COMPUTE2_10
> +%define MC_PIXEL_COMPUTE8_10 MC_PIXEL_COMPUTE2_10
E.g. this isn't needed.
> +%macro EPEL_COMPUTE 2
Here you start needing comments:
%macro EPEL_COMPUTE 2 ; # bitsize (8 or 10), width (4, 8, 16, etc.)
> +%if %1 == 8
> + pmaddubsw m0, m13
> + pmaddubsw m2, m14
> + paddw m0, m2
> +%if %2 == 16
> + pmaddubsw m1, m13
> + pmaddubsw m3, m14
> + paddw m1, m3
> +%endif
> +%else
> + pmaddwd m0, m13
> + pmaddwd m2, m14
> + pmaddwd m1, m13
> + pmaddwd m3, m14
> + paddd m0, m2
> + paddd m1, m3
Why not if %2 == 8 for the second part (i.e. unneeded if w=4)?
> +%if %1 == 10
> + psrad m0, 2
> + psrad m1, 2
> +%endif
> +%if %1 == 14
> + psrad m0, 6
> + psrad m1, 6
> +%endif
psrad mX, %1-8
> +%macro EPEL_COMPUTE_14 0
> + pmaddwd m0, m11
> + pmaddwd m1, m11
> + pmaddwd m2, m12
> + pmaddwd m3, m12
> + paddd m0, m2
> + paddd m1, m3
> + psrad m0, 6
> + psrad m1, 6
> + packssdw m0, m1
> +%endmacro
If w<=4, you can remove 3 instructions here. I think you can merge tis
macro with EPEL_COMPUTE.
> +%macro QPEL_COMPUTE 2
Isn't this mostly identical to EPEL_COMPUTE?
> +INIT_XMM sse4 ; adds ff_ and _sse4 to
function name
This looks ssse3? Are you sure it's sse4?
> +cglobal hevc_put_hevc_pel_pixels2_8, 8, 14, 0 , dst, dststride, src,
srcstride,width,height
> + LOOP_INIT pixels_2_8
> + movdqu m0, [srcq] ; load data from source
movd, or even pinsrw, not movdqu. Same for other functions below (w=4 =>
movd, w=6/8 => movq).
> +cglobal hevc_put_hevc_pel_pixels24_8, 8, 14, 15, dst, dststride, src,
srcstride,width,height, tdst, tsrc
> + LOOP_INIT pixels_24_8
> + movdqu m0, [srcq]
; load data from source
> + MC_PIXEL_COMPUTE16_8
> + PEL_STORE16 dst, m1, m2
> + lea tsrcq, [srcq + 16]
> + movdqu m0, [tsrcq]
movq
> + lea tdstq, [dstq + 32]
> + MC_PIXEL_COMPUTE8_8
> + PEL_STORE8 tdstq, m1, m2
> + LOOP_END pixels_24_8, dst, dststride, src, srcstride
> + RET
I don't really think you're taking advantage of the bigger width here; this
sequence of instructions is essentially identical to if you had a C wrapper
for all functions larger than w=16. Make your macros more configurable and
load directly from [srcq+16] instead of the intermediate leas, that also
avoids needing an extra register for tsrcq/tdstq (and all instructions to
set stuff like that up).
Or just use a C wrapper and avoid all this extra cruft in the assembly. I
think for fullpel MC, it's fine to do fullwidth in assembly, but do use it
to its fullest extent.
> +;10bit pel_pixels
> +
> +cglobal hevc_put_hevc_pel_pixels2_10, 8, 14, 0 , dst, dststride, src,
srcstride,width,height
> + LOOP_INIT pixels_2_10
> + movdqu m0, [srcq] ; load data from source
> + MC_PIXEL_COMPUTE2_10
> + PEL_STORE2 dst, m1, m2
> + LOOP_END pixels_2_10, dst, dststride, src, srcstride
> + RET
> +
> +cglobal hevc_put_hevc_pel_pixels4_10, 8, 14, 0 , dst, dststride, src,
srcstride,width,height
> + LOOP_INIT pixels_4_10
> + movdqu m0, [srcq] ; load data from source
> + MC_PIXEL_COMPUTE2_10
The funny thing is you're using all these defines further up to have
MC_PIXEL_COMPUTE<n>_10 (and _8), and you end up not actually using them.
My rule is generally: use what leads to smallest source code if it doesn't
affect code generation. So just remove the MC_PIXEL_COMPUTE<n>_10 (and _8)
ones you don't need.
> +cglobal hevc_put_hevc_pel_pixels12_10, 8, 14, 0 , dst, dststride, src,
srcstride,width,height, tdst, tsrc
> + LOOP_INIT pixels_12_10
> + movdqu m0, [srcq] ; load data from source
> + MC_PIXEL_COMPUTE8_10
> + PEL_STORE8 dst, m1, m2
> + lea tsrcq, [srcq + 16]
> + movdqu m0, [tsrcq]
> + lea tdstq, [dstq + 16]
> + MC_PIXEL_COMPUTE8_10
> + PEL_STORE4 tdstq, m1, m2
> + LOOP_END pixels_12_10, dst, dststride, src, srcstride
> + RET
Same as further up - remove tsrcq/tdstq, make macros take arguments to
define position within srcq/dstq, and make use of the fact that you have an
assembly-function for full-width instead of a C wrapper.
> +cglobal hevc_put_hevc_pel_pixels24_10, 8, 14, 15, dst, dststride, src,
srcstride,width,height, tdst, tsrc
> + LOOP_INIT pixels_24_10
> +movdqu m0, [srcq] ; load data from source
Indent.
> +cglobal hevc_put_hevc_pel_pixels32_10, 8, 14, 15, dst, dststride, src,
srcstride,width,height, tdst, tsrc
> + LOOP_INIT pixels_32_10
> +movdqu m0, [srcq] ; load data from source
Indent.
More follows later.
Ronald
More information about the ffmpeg-devel
mailing list