[FFmpeg-devel] [PATCH] lavfi/removegrain: add x86 and x86_64 SSE2 functions

Tue Jul 14 20:54:22 CEST 2015

On 2015-07-11 18:34, James Almer wrote:
> On 11/07/15 10:40 AM, James Darnley wrote:
>> @@ -566,7 +554,19 @@ static int filter_slice(AVFilterContext *ctx, void *arg, int jobnr, int nb_jobs)
>>          }
>>  
>>          *dst++ = *src++;
>> -        for (x = 1; x < s->planewidth[i] - 1; x++) {
>> +
>> +        if (s->fl[i]) {
>> +            int w_asm = (s->planewidth[i] - 2) & ~15;
> 
> I don't thinks this belongs here. If you add an AVX2 version as you intended,
> wouldn't this need to be 31?
> 
> A wrapper function to have this code in the x86 folder may be the best option.

I was considering having the init_x86 function set a variable in the
context to do this math based on how many pixels are actually done per
iteration, meaning:
- 8 for the unpacked words and sse2
- 16 for the packed bytes and see2 or future unpacked words and avx2
- 32 for the future packed bytes and avx2

That would let it be simply re-used if someone wanted to try writing a
line function in C or other platform assembly.

>> new file mode 100644
>> index 0000000..5e1feea
>> --- /dev/null
>> +++ b/libavfilter/x86/vf_removegrain.asm
>> @@ -0,0 +1,1215 @@
>> +;*****************************************************************************
>> +;* x86-optimized functions for yadif filter
>> +;*
>> +;* Copyright (C) 2015 James Darnley
>> +;*
>> +;* This file is part of FFmpeg.
>> +;*
>> +;* TODO: gpl text goes here.
>> +;*****************************************************************************
>> +
>> +; column: -1  0 +1
>> +; row -1: a1 a2 a3
>> +; row  0: a4  c a5
>> +; row +1: a6 a7 a8
>> +
>> +%include "libavutil/x86/x86util.asm"
>> +
>> +SECTION_RODATA
> 
> Declare it SECTION_RODATA 32 now so you don't forget when you add AVX2 versions.

Okay, I'm confused by this.  It doesn't align the constants you define
on 32 byte boundaries if you happen to define the data incorrectly so...
 Why does this matter?  Or what does it do?

>> +
>> +; The loop doesn't need to do all the iterations.  It could stop when the right
>> +; pixels are in the right registers.
>> +%macro SORT_SQUARE 0
>> +    %assign k 7
>> +    %rep 7
>> +        %assign i 1
>> +        %assign j 2
>> +        %rep k
>> +            SORT_PAIR m %+ i , m %+ j , m9
>> +            %assign i i+1
>> +            %assign j j+1
>> +        %endrep
>> +        %assign k k-1
>> +    %endrep
>> +%endmacro
>> +
>> +; %1 dest simd register
>> +; %2 source (simd register/memory)
>> +; %3 temp simd register
>> +%macro ABS_DIFF 3
>> +    mova %3, %2
>> +    psubusb %3, %1
>> +    psubusb %1, %2
>> +    por %1, %3
>> +%endmacro
>> +
>> +; %1 dest simd register
>> +; %2 source (simd register/memory)
>> +; %3 temp simd register
>> +%macro ABS_DIFF_W 3
>> +    mova %3, %2
>> +    psubusw %3, %1
>> +    psubusw %1, %2
>> +    por %1, %3
>> +%endmacro
> 
> No way to achieve this using the pabs* ssse3 instructions?

Maybe.  I looked them up and see that I would need a subtraction anyway.
 I wonder if I could measure a speed change.  I need to check the
behaviour of them together with the subtract instructions.

>> +; %1 simd register that hold the mask and will hold the result
>> +; %2 simd register that holds the "true" values
>> +; %3 location of the "false" values (simd register/memory)
>> +%macro BLEND 3 ; mask, true, false
>> +    pand %2, %1
>> +    pandn %1, %3
>> +    por %1, %2
>> +%endmacro
> 
> This doesn't follow the sse4 blend instructions' semantic, so it will make adding AVX2
> versions harder.
> 
> Try instead
> 
> %macro BLEND 3 ; false, true, mask
>     pand      %2, %3
>     pandn     %3, %1
>     por       %3, %2
>     SWAP      %1, %3
> %endmacro
> 
> Which will let you use "vpblendvb %1, %1, %2, %3" for the avx2 version.
> SSE4's pblendvb is kinda crappy, requiring the mask to be in xmm0, so
> adding an SSE4 version may not be worth it.

I applied your patches that you sent on IRC (this and the sort macro
one).  Do you want me to put your name in the copyright header?

>> diff --git a/libavfilter/x86/vf_removegrain_init.c b/libavfilter/x86/vf_removegrain_init.c
>> new file mode 100644
>> index 0000000..450185f
>> --- /dev/null
>> +++ b/libavfilter/x86/vf_removegrain_init.c
>> @@ -0,0 +1,128 @@
>> +#include "libavutil/attributes.h"
>> +#include "libavutil/cpu.h"
>> +#include "libavutil/x86/cpu.h"
>> +#include "libavfilter/removegrain.h"
>> +
>> +void ff_rg_fl_mode_1_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>> +void ff_rg_fl_mode_10_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>> +void ff_rg_fl_mode_11_12_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>> +void ff_rg_fl_mode_13_14_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>> +void ff_rg_fl_mode_19_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>> +void ff_rg_fl_mode_20_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>> +void ff_rg_fl_mode_21_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>> +void ff_rg_fl_mode_22_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>> +#if ARCH_X86_64
>> +void ff_rg_fl_mode_2_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>> +void ff_rg_fl_mode_3_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>> +void ff_rg_fl_mode_4_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>> +void ff_rg_fl_mode_5_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>> +void ff_rg_fl_mode_6_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>> +void ff_rg_fl_mode_7_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>> +void ff_rg_fl_mode_8_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>> +void ff_rg_fl_mode_9_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>> +void ff_rg_fl_mode_15_16_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>> +void ff_rg_fl_mode_17_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>> +void ff_rg_fl_mode_18_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>> +void ff_rg_fl_mode_23_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>> +void ff_rg_fl_mode_24_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>> +#endif

A general question.  Is it preferred to have the function declarations
and assignments out of order to reduce the number of #if ARCH_X86_64
conditions?

>> +av_cold void ff_removegrain_init_x86(RemoveGrainContext *rg)
>> +{
>> +    int cpu_flags = av_get_cpu_flags();
>> +    int i;
>> +
>> +    for (i = 0; i < rg->nb_planes; i++) {
>> +        switch (rg->mode[i]) {
>> +            case 1:
>> +                if (EXTERNAL_SSE2(cpu_flags))
> 
> I think it will be much cleaner and simpler if you put the switch inside a single
> EXTERNAL_SSE2 branch, rather than checking it for every mode.

You're probably right.  I was going to see what it looks like after AVX2
and clean it up then but I can squash some changes into this patch if
people prefer it.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 603 bytes
Desc: OpenPGP digital signature
URL: <http://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20150714/dd6f1561/attachment.sig>