[FFmpeg-devel] [PATCH] x86: hevc: adding transform_add
James Almer
jamrial at gmail.com
Wed Jul 30 23:04:30 CEST 2014
On 30/07/14 10:33 AM, Pierre Edouard Lepere wrote:
> +%macro TR_ADD_INIT_SSE_8 2
> + movu m4, [r1]
> + movu m6, [r1+16]
> + movu m8, [r1+32]
> + movu m10, [r1+48]
You can use mova here, and probably in place of every other movu as well.
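To spell out the precondition: mova assembles to an aligned load (movdqa or movaps on xmm registers), which faults if the address is not a multiple of 16. Each row of eight 16-bit coefficients is exactly 16 bytes, so if the base pointer in r1 is 16-byte aligned, then r1+16, r1+32 and r1+48 are aligned too. Below is a minimal C sketch of that check. It is not part of the patch, the helper name is made up for illustration, and the alignment of the coefficient buffer is an assumption here.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if p is a multiple of 16, which is what
 * an aligned xmm load or store (mova) requires of its memory operand. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15) == 0;
}
```

With a 16-byte aligned array of int16_t, every eighth element starts a new aligned row, while an offset of a single element (2 bytes) does not.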
> + lea %1, [%2*3]
> + pxor m5, m5
> + psubw m5, m4
> + packuswb m4, m4
> + packuswb m5, m5
> + pxor m7, m7
> + psubw m7, m6
> + packuswb m6, m6
> + packuswb m7, m7
> + pxor m9, m9
> + psubw m9, m8
> + packuswb m8, m8
> + packuswb m9, m9
> + pxor m11, m11
> + psubw m11, m10
> + packuswb m10, m10
> + packuswb m11, m11
> +%endmacro
>
> +%macro TR_ADD_OP_SSE 4
> + %1 m0, [%2 ]
> + %1 m1, [%2+%3 ]
> + %1 m2, [%2+%3*2]
> + %1 m3, [%2+%4 ]
> + paddusb m0, m4
> + paddusb m1, m6
> + paddusb m2, m8
> + paddusb m3, m10
> + psubusb m0, m5
> + psubusb m1, m7
> + psubusb m2, m9
> + psubusb m3, m11
> + %1 [%2 ], m0
> + %1 [%2+%3 ], m1
> + %1 [%2+2*%3], m2
> + %1 [%2+%4 ], m3
> +%endmacro
You can use packuswb to pack two regs into one, like you did in TR_ADD_INIT_SSE_16.
Then you simply use movq+movhps to load and store data, like so:
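%macro TR_ADD_INIT_SSE_8 2
    mova      m4, [r1]       ; row 0 coefficients
    mova      m6, [r1+16]    ; row 1
    mova      m0, [r1+32]    ; row 2
    mova      m2, [r1+48]    ; row 3
    lea       %1, [%2*3]
    pxor      m5, m5
    psubw     m5, m4         ; m5 = -row 0
    pxor      m7, m7
    psubw     m7, m6         ; m7 = -row 1
    pxor      m1, m1
    psubw     m1, m0         ; m1 = -row 2
    packuswb  m4, m0         ; m4 = rows 0|2, positive parts as bytes
    packuswb  m5, m1         ; m5 = rows 0|2, negative magnitudes as bytes
    pxor      m3, m3
    psubw     m3, m2         ; m3 = -row 3
    packuswb  m6, m2         ; m6 = rows 1|3, positive parts as bytes
    packuswb  m7, m3         ; m7 = rows 1|3, negative magnitudes as bytes
%endmacro

%macro TR_ADD_OP_SSE 4
    movq      m0, [%2     ]  ; low half:  row 0
    movq      m1, [%2+%3  ]  ; low half:  row 1
    movhps    m0, [%2+%3*2]  ; high half: row 2
    movhps    m1, [%2+%4  ]  ; high half: row 3 (%4 is expected to hold 3*%3)
    paddusb   m0, m4
    paddusb   m1, m6
    psubusb   m0, m5
    psubusb   m1, m7
    movq      [%2     ], m0
    movq      [%2+%3  ], m1
    movhps    [%2+2*%3], m0
    movhps    [%2+%4  ], m1
%endmacro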
This not only reduces the instruction count, but also uses 8 xmm
regs instead of 12.
Reordering the instructions might break up some dependency chains as well.
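For anyone who wants to convince themselves that the two-rows-per-register layout gives the same pixels, here is a rough scalar model in C. It is only a sketch and not part of the patch: the function names are made up for illustration, each helper stands in for one byte or word lane of the corresponding instruction, and it assumes residuals never reach -32768 (where the psubw negation wraps around).

```c
#include <stdint.h>

/* Scalar stand-ins for one lane of the SSE2 instructions in the macros. */
static uint8_t packus(int16_t v)            /* packuswb */
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}
static uint8_t addus(uint8_t a, uint8_t b)  /* paddusb */
{
    int s = a + b;
    return s > 255 ? 255 : (uint8_t)s;
}
static uint8_t subus(uint8_t a, uint8_t b)  /* psubusb */
{
    return a > b ? (uint8_t)(a - b) : 0;
}

/* Reference result: add the residual and clip to 0..255. */
static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Hypothetical model of TR_ADD_INIT_SSE_8 + TR_ADD_OP_SSE as edited above:
 * adds four rows of eight residuals to dst, two rows per 16-byte register. */
static void tr_add8_rows4_model(uint8_t *dst, const int16_t *coeffs, int stride)
{
    for (int pair = 0; pair < 2; pair++) {
        /* pair 0 models m4/m5 (rows 0 and 2), pair 1 models m6/m7 (rows 1 and 3) */
        const int rows[2] = { pair, pair + 2 };
        uint8_t pos[16], neg[16], pix[16];

        for (int half = 0; half < 2; half++) {
            for (int x = 0; x < 8; x++) {
                int16_t c = coeffs[rows[half] * 8 + x];
                pos[half * 8 + x] = packus(c);
                neg[half * 8 + x] = packus((int16_t)-c);  /* pxor + psubw, then pack */
                pix[half * 8 + x] = dst[rows[half] * stride + x]; /* movq / movhps load */
            }
        }
        for (int i = 0; i < 16; i++)
            pix[i] = subus(addus(pix[i], pos[i]), neg[i]);
        for (int half = 0; half < 2; half++)              /* movq / movhps store */
            for (int x = 0; x < 8; x++)
                dst[rows[half] * stride + x] = pix[half * 8 + x];
    }
}
```

Running tr_add8_rows4_model on a 4x8 block and comparing every pixel against clip_u8(dst + coeff) should show no differences, since at most one of the positive and negative byte values is non-zero for any given coefficient.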
The TR_ADD_OP_SSE macro as edited above will not work for
hevc_transform_add16_8 anymore, so you will have to duplicate it.
I haven't looked at hevc_transform_add16_8, but I'm sure it can be done
with fewer than 14 xmm registers.