[FFmpeg-devel] [PATCH]v5 Opus Pyramid Vector Quantization Search in x86 SIMD asm

Mon Jul 24 15:49:23 EEST 2017

On 7/23/17, Rostislav Pehlivanov <atomnuker at gmail.com> wrote:
> On 22 July 2017 at 12:18, Ivan Kalvachev <ikalvachev at gmail.com> wrote:
>
>> This patch is ready for review and inclusion.
>>
>>
>>
>>+%macro emu_blendvps 3 ; dst/src_a, src_b, mask
>>+%macro lea_pic_base 2; reg, const_label
> Capitalize macro names.

Done for the later.
About the former...

I do not like to use capitalization for the emulated instructions:

1. In C macros are always in capital letters to separate them from the
functions.
In ASM functions cannot be mistaken for macros.

2. All instructions are lowercase, even the ones that are overloaded
by x86asm.h macros.

3. There are already about 30 macros with lower cases in
libavcodec/x86. The rule is not absolute.

4. There are some emulated instructions in x86util, that are all upper
case names and
I do find them very ugly when I see them in the code.
This is why I went with "emu_<op>" route.
I actually want to encourage using the emu_<op> for them too (as
alternative names).

5. There is nothing guaranteeing that the assembler
should be case sensitive.
Actually nasm manual mentions that MASM (on which
it it is based on) is not case-sensitive (by default).
While nasm and yasm are case sensitive,
there are still %idefine and %imacro that
could create some confusion.

Anyway.

Would you accept a change where the emulation macros
are moved to a separate file and the macros are
renamed to EMU_<op> , aka EMU_blendvps ?
I'm thinking of "libavcodec/x86/x86emu.asm".

Or maybe they should be put in libavutil/x86/x86util.asm ,
however that could open a new can of worms.
AKA using the ugly uppercase only names.

>>+      %error sse41 blendvps uses xmm0 as default 3d operand, you used %3
> Remove this error

I'd like to keep that one.
It helps finding out why the code does not compile.

Sometimes it may not be obvious, since SWAP might
change the registers under the hood.

>
>>+    %error "blendvps" emulation needs SSE
> Definitely remove this error too.

Done

>>+        %error AVX1 does not support integer 256bit ymm operations
> And this one is particularly pointless...

On the contrary. AVX1 does support 256 bit ymm float point operations
but not integer ones. It is quite trivial mistake to use one with the other.

What is worse, without this the wrong code would compile
without any hint of error, because avx2 codes are
not overloaded by the current x86asm.h
so you won't get warning that you are using avx2 ops in avx1 section.

>>+      %error sse41 blendvps uses xmm0 as default 3 operand, you used %3
> Same...

Again, I'd like to keep that one.

>>+    %error "broadcastss" emulation needs SSE
> Same...

Done

>>+    %error "pbroadcastd" emulation needs SSE2
> Yep, the same...

Done

>
>>+    %error pick HSUMPS implementation
> Again...

I thought I had taken care of that.
Done

> All of these are pointless and unnecessary. Always assume at least SSE2.

NEVER assume anything (that you can check).
Making wrong assumptions is the easiest way
to make a mistakes that are
very hard to notice, find and correct.

>>+
>>+; 256bit version seems slower than the 128bit AVX on all current CPU's
>>+; So disable it for now and keep the code in case something changes in
> future.
>>+%if HAVE_AVX2_EXTERNAL > 0 && 0
>>+INIT_YMM avx2
>>+PVQ_FAST_SEARCH
>>+%endif
>
> Remove this altogether.

I think this would be a mistake.

BTW, would you do a proper benchmarks with the current
code and post the results. There is a chance that the
code has improved.
(Please, use a real audio sample, not generated noise).

I also asked you to try something (changing "cmpless" to "cmpps" ),
that you totally forgot to do.
(IACA says there is no penalty in my code for using this SSE op in AVX2 block,
but as I said, never assume.)

>>+%if 1
>>+    emu_pbroadcastd m1,   xmm1
> ...
>>+%else
>>+        ; Ryzen is the only CPU at the moment that doesn't have
>>+        ; write forwarding stall and where this code is faster
>>+        ; with about 7 cycles on avarage.
>>+        %{1}ps      m5,   mm_const_float_1
>>+        movss       [tmpY + r4q], xmm5      ; Y[max_idx] +-= 1.0;
>
> Remove the %else and always use the first bit of code.

I don't like removing code that is faster on a new CPU.
But since you insist.
Done.

>>+%if cpuflag(avx2) && 01
>>+%elif cpuflag(avx) && 01
>>+%if cpuflag(avx2) && 01
>
> Remove the && 01 in these 3 places.

Done

>>+;      vperm2f128   %1,   %1,   %1,   0 ; ymm, ymm, ymm     ; 2-3c 1/1
>
> Remove this commented out code.

Done

>
>>+cglobal pvq_search, 4, 5+num_pic_regs, 11, 256*4, inX, outY, K, N
>>+%if ARCH_X86_64 == 0    ;sbrdsp
>
> You're using more than 11 xmm regs, so the entire code is always going to
> need a 64 bit system.
> So wrap the entire file, from top to bottom after the %include in
> %if ARCH_X86_64
> <everything>
> %endif

I'm using exactly 8 registers on 32 bit arch.
I'm using 3 extra registers on 64 bit ones to hold some constants.

If you insist, I'll write some ifdef-ery to
always supply the correct number of registers.
>From what I've seen in x86asm, the >8 case is only checked on WIN64.

>>+SECTION_RODATA 64
>>+
>>+const_float_abs_mask:   times 8 dd 0x7fffffff
>>+const_align_abs_edge:   times 8 dd 0
>>+
>>+const_float_0_5:        times 8 dd 0.5
>>+const_float_1:          times 8 dd 1.0
>>+const_float_sign_mask:  times 8 dd 0x80000000
>>+
>>+const_int32_offsets:
>>+                        %rep 8
>>+                                dd $-const_int32_offsets
>>+                        %endrep
>>+SECTION .text
>
> This whole thing needs to go right at the very top after the %include. All
> macros must then follow it.
> Having read only sections in the middle of the file is very confusing and
> not at all what all of our asm code follows.

It's very simple:
  pre-processor, constants, code.
Your confusion comes from the fact that
the code templates are done by macros.

As I've told you before, the macros before
the constants belong to a separate file.

Also why are you so obsessed
with constants been the first thing in a file.
You are not supposed to mess with them
on regular basis.

Constant labels must be before the code that
uses them, but because of code templates
they could be literally few lines before the end of the file.

Best Regards.