[FFmpeg-devel] [patch][OpenHEVC]added ASM functions for epel + qpel

Wed Mar 5 13:57:43 CET 2014

> +cglobal hevc_put_hevc_epel_h2_8, 8, 14, 15 , dst, dststride, src,
srcstride, width, height, mx, my, r3src, rfilter

r3src, my, rfilter (if not PIC) and width are unused - remove them. You're
certainly not using 14 GPRs, so 8, 14 is way too agressive. Try to get
something like 6, 6 on no-pic and 6, 7 on pic. See below.

Why function name is like hevc_hevc?

> +    sub             srcq, 1

Why? Just subtract one from src when you dereference from it [srcq-1]
instead of [srcq]).

> +    EPEL_FILTER        8, mx
> +
> +    LOOP_INIT  epel_h_h_2_8

Labels with a dot are local and thus don't need full function name scope,
just loop instead of epel_h_h_2_8 is fine.

> +    EPEL_LOAD         8, src, 1
> +    EPEL_COMPUTE       8, 2
> +    PEL_STORE2       dst, m0, m1
> +    LOOP_END   epel_h_h_2_8, dst, dststride, src, srcstride
> +    RET

OK, so the actual code. For play, can you show the _actual disassembly_
that all these macros eventually got us to? I wonder what it actually gives.

Example:

> +cglobal hevc_put_hevc_epel_h4_8, 8, 14, 15 , dst, dststride, src,
srcstride, width, height, mx, my, r3src, rfilter
> +    sub             srcq, 1
> +    EPEL_FILTER        8, mx
> +
> +    LOOP_INIT  epel_h_h_4_8
> +
> +    EPEL_LOAD          8, src, 1
> +    EPEL_COMPUTE       8, 4
> +    PEL_STORE4       dst, m0, m1
> +    LOOP_END   epel_h_h_4_8, dst, dststride, src, srcstride
> +    RET

I expect to get out of this:

lea mxq, [mxq*8-8]
lea rfilterq, [address] # because I'm on mac64 - this shouldn't be here on
32bit or linux64 w/o pic
mova m4, [rfilterq+mxq*4]
mova m5, [rfilterq+mxq*4+16]
.loop:
movd m0, [srcq]
movd m1, [srcq+1]
movd m2, [srcq+2]
movd m3, [srcq+3]
add srcq, srcstrideq # or lea, don't care
punpcklbw m0, m1
punpcklbw m2, m3
pmaddubsw m0, m4
pmaddubsw m2, m5
paddw m0, m2
packuswb m0, m0
movd [dstq], m0
add dstq, dststrideq # or lea, don't care
dec heightd
jg .loop
ret

Note that it fits in 8byte registers and we only use 6, so we can actually
use MMX registers (still ssse3 instruction set, b/c pmaddubsw) instead of
XMM. Also note I didn't use r3src or my (and width isn't used anywhere, so
please remove that) and suddenly you're down to 6 GPRs (dst, dststr, src,
srcstr, h, mx - rfilterq is PIC-only so 64bit only) so this actually is
32bit compatible now.

> +cglobal hevc_put_hevc_epel_h24_8, 8, 14, 15 , dst, dststride, src,
srcstride, width, height, mx, my, r3src, tdst, tsrc, rfilter
> +    sub             srcq, 1
> +    EPEL_FILTER        8, mx
> +
> +    LOOP_INIT  epel_h_h_24_8
> +
> +    EPEL_LOAD          8, src, 1
> +    EPEL_COMPUTE       8, 16
> +    PEL_STORE16      dst, m0, m1
> +    lea            tsrcq, [srcq + 16]
> +    EPEL_LOAD          8, tsrcq, 1
> +    EPEL_COMPUTE       8, 8
> +    lea            tdstq, [dstq + 32]
> +    PEL_STORE8     tdstq, m0, m1
> +    LOOP_END   epel_h_h_24_8, dst, dststride, src, srcstride
> +    RET

As I said for the fullpel versions, not a big fan of this "let's split it
in 2" approach. Try to get rid of tsrcq/tdstq. I also still don't see r3src
being used.

> +cglobal hevc_put_hevc_epel_h2_10, 8, 14, 15 , dst, dststride, src,
srcstride, width, height, mx, my, r3src, rfilter
> +    sub             srcq, 2
> +    EPEL_FILTER       10, mx
> +
> +    LOOP_INIT  epel_h_h_2_10
> +
> +    EPEL_LOAD         10, src, 2
> +    EPEL_COMPUTE      10, 2
> +    PEL_STORE2       dst, m0, m1
> +    LOOP_END   epel_h_h_2_10, dst, dststride, src, srcstride
> +    RET

Is this different than the 8bit one? I imagine that now that you have the
macros, writing out the 8bit and 10bit should be identical and thus you can
do that using a macro.

%macro functions 1 ; bits
cglobal hevc_put_hevc_h2_%1, ..
..
RET

cglobal hevc_put_hevc_h4_%1, ..
..
RET

[etc.]
%endmacro

functions 8
functions 10

Saves half the file in source code, at least for the ones where it matches
(maybe it stops matching at 16pixels, didn't check closely).

> +cglobal hevc_put_hevc_epel_hv2_8, 8, 15, 12 , dst, dststride, src,
srcstride, width, height, mx, my, r3src, rfilter
[..]
> +    PEL_STORE6       dst, m0, m1

Ehm - 2 vs. 6?

> +    punpcklwd         m0, m4, m5
> +    punpcklwd         m2, m6, m7
> +    EPEL_COMPUTE_14

So, you have either (for half registers):

1a
punpcklwd
punpcklwd
pmaddwd
pmaddwd
paddd
packssdw

vs.

1b
pmullw
pmullw
pmullw
pmullw
paddw
paddw
paddw

And for full registers:

2a
punpcklwd
punpcklwd
punpckhwd
punpckhwd
pmaddwd
pmaddwd
pmaddwd
pmaddwd
paddd
paddd
packssdw

vs.

2b
pmullw
pmullw
pmullw
pmullw
paddw
paddw
paddw

I can understand the pmaddwd approach for second pass may be faster for
half-registers, since you fill the register up to full width and save one
instruction - but did you measure it?

Then, for second, you're just spending instructions shuffling. I don't
think 2a is faster than 2b, in fact I expect it to be significantly slower.

> +cglobal hevc_put_hevc_epel_hv64_8, 8, 15, 12 , dst, dststride, src,
srcstride, width, height, mx, my, r3src, tdst, tsrc, rfilter

These functions are getting massive. Can you measure if a simple C wrapper
around hv16 for larger sizes is not slower? It would significantly save in
binary size.

More (qpel, mostly) coming...

Ronald