[FFmpeg-devel] [PATCH] avfilter/x86/vf_blend.asm: add hardmix and phoenix sse2 SIMD
Ronald S. Bultje
rsbultje at gmail.com
Wed Oct 7 13:37:08 CEST 2015
Hi,
On Wed, Oct 7, 2015 at 5:38 AM, Paul B Mahol <onemda at gmail.com> wrote:
> Signed-off-by: Paul B Mahol <onemda at gmail.com>
> ---
> libavfilter/x86/vf_blend.asm | 62 +++++++++++++++++++++++++++++++++++++++++
> libavfilter/x86/vf_blend_init.c | 14 ++++++++++
> 2 files changed, 76 insertions(+)
>
> diff --git a/libavfilter/x86/vf_blend.asm b/libavfilter/x86/vf_blend.asm
> index 167e72b..7180817 100644
> --- a/libavfilter/x86/vf_blend.asm
> +++ b/libavfilter/x86/vf_blend.asm
> @@ -27,6 +27,8 @@ SECTION_RODATA
>
> pw_128: times 8 dw 128
> pw_255: times 8 dw 255
> +pb_128: times 16 db 128
> +pb_255: times 16 db 255
>
> SECTION .text
>
> @@ -273,6 +275,36 @@ cglobal blend_darken, 9, 10, 2, 0, top, top_linesize, bottom, bottom_linesize, d
> jg .nextrow
> REP_RET
>
> +cglobal blend_hardmix, 9, 10, 3, 0, top, top_linesize, bottom, bottom_linesize, dst, dst_linesize, width, start, end
> + add topq, widthq
> + add bottomq, widthq
> + add dstq, widthq
> + sub endq, startq
> + neg widthq
> +.nextrow:
> + mov r10q, widthq
> + %define x r10q
>
You're saying that you use 10 regs, but you're using r10, which is the
11th. Use r9 here, or specify that you use 11.
Now, more generally, you're using a lot of regs in all your simd, and some
aren't necessary, so here are some lessons about arguments. Most arguments
come on the stack. On x86-64, the first 4 (win64) or 6 (unix64) come in
registers, but the rest (here at least width, start and end) come on the
stack. On x86-32, all arguments come on the stack. So, if you get 9
arguments, you have at least 3 of them on the stack, including width. That
means you don't have to copy widthq into r10q; you can reload widthmp (the
stack version of this argument) into widthq at the start of each row, since
the system already put width on the stack for you. x86inc.asm moves it from
the stack into a register for you when the argument count you give to
cglobal is 7 or more (width is the 7th argument).

Then, you can also sub startmp from endq, which you can then store back
into endmp on x86-32, and suddenly on x86-32 you only need 7 regs (for
x86-64, you keep using endd since that's faster). And now, your simd works
on 32bit systems as well.
> + .loop:
> + movu m0, [topq + x]
> + movu m1, [bottomq + x]
> + mova m2, [pb_255]
> + psubusb m2, m1
pxor m1, [pb_255] should give the same result (255 - bottom, left in m1) as
mova reg, [pb_255] followed by psubusb reg, m1, in one instruction instead
of two.
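
To make the equivalence concrete, here is a minimal scalar sketch in C of
what the two sequences compute per byte. The function names are made up
for illustration and are not part of the patch; the model assumes psubusb
is an unsigned saturating subtract and pxor a plain bitwise xor.

```c
#include <assert.h>
#include <stdint.h>

/* Per-byte model of "mova reg, [pb_255]; psubusb reg, m1":
 * unsigned saturating subtract, i.e. 255 - x clamped at 0. */
static uint8_t invert_psubusb(uint8_t x)
{
    int d = 255 - (int)x;
    return (uint8_t)(d < 0 ? 0 : d);
}

/* Per-byte model of "pxor m1, [pb_255]": flip all eight bits. */
static uint8_t invert_pxor(uint8_t x)
{
    return (uint8_t)(x ^ 0xFF);
}

/* Exhaustively check that both forms agree for every byte value. */
static void check_invert_identity(void)
{
    for (int x = 0; x <= 255; x++)
        assert(invert_psubusb((uint8_t)x) == invert_pxor((uint8_t)x));
}
```

Since 255 - x never goes below 0 for a byte, the saturation never kicks in,
which is why the xor form is a drop-in replacement here.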
Now, you're using pb_255 a lot inside your inner loop, and with pxor, you
only use it non-destructively, so why not move it into a register (m3)
outside the loop so you only load it from mem once?
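
As a rough illustration of the hoisting idea, here is a sketch in C with
SSE2 intrinsics (with a scalar fallback if SSE2 is not available).
invert_bytes and check_invert_bytes are hypothetical helpers, not code from
the patch. A C compiler will usually hoist the constant by itself; in
hand-written asm you have to do it yourself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: dst[i] = 255 - src[i] for n bytes. */
static void invert_bytes(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    /* Set up the all-255 constant once, before the loop (the "m3" idea). */
    const __m128i all_255 = _mm_set1_epi8(-1);
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i)); /* movu */
        v = _mm_xor_si128(v, all_255);                           /* pxor */
        _mm_storeu_si128((__m128i *)(dst + i), v);
    }
#endif
    /* Scalar tail (or the whole buffer when SSE2 is not available). */
    for (; i < n; i++)
        dst[i] = (uint8_t)(src[i] ^ 0xFF);
}

/* Small self-check on a length that is not a multiple of 16. */
static void check_invert_bytes(void)
{
    uint8_t src[37], dst[37];
    for (size_t i = 0; i < sizeof(src); i++)
        src[i] = (uint8_t)(i * 7);
    invert_bytes(dst, src, sizeof(src));
    for (size_t i = 0; i < sizeof(src); i++)
        assert(dst[i] == 255 - src[i]);
}
```

The xor never modifies all_255, so it can stay in one register for the whole
call, which is the point of loading pb_255 into m3 outside the loop.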
Ronald