[FFmpeg-devel] [PATCH]v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm

Ivan Kalvachev ikalvachev at gmail.com
Fri Aug 4 00:36:43 EEST 2017


On 8/2/17, Henrik Gramner <henrik at gramner.com> wrote:
> On Tue, Aug 1, 2017 at 11:46 PM, Ivan Kalvachev <ikalvachev at gmail.com>
> wrote:
>> On 7/31/17, Henrik Gramner <henrik at gramner.com> wrote:
>>> Use rN instead of rNq for numbered registers (q suffix is used for
>>> named args only due to preprocessor limitations).
>>
>> Is this documented?
>
> Not sure, but there's probably a bunch of small things like this that
> aren't really well documented.
>
>>> Use the same "standard" vertical alignment rules as most existing
>>> code, e.g. instructions indented by 4 spaces and operands aligned on
>>> the first comma.
>>
>> What do you mean by operands aligned on the first comma?
>> The longest operand I have is "xmm0"; counting the comma and space,
>> I get a 6-position alignment for operands.
>> Now, with "xm7" I can lower this to 5 positions. Should I do that?
>> (I don't have "xm15".)
>>
>
> 1234_____________1234_1234_123
>     VBROADCASTSS ym1, xm1
>     BLENDVPS      m1, m2, m3
>
> is the most commonly used alignment.

I see that a lot of .asm files use different alignments.
I'll try to pick something similar that looks good.

I also see that different functions/macros can have
different positions for the first comma.
This could be quite useful for visually separating
the macros.


>>> Unless aligning every single loop entry helps a lot I'd avoid it since
>>> it does waste a bit of icache.
>>
>> I'll redo my benchmarks, but I placed these aligns for a reason.
>> At some point, removing debug instructions started making my code
>> slower, so I placed aligns to avoid random slowdowns.
>
> Ok.
>
>>> Explicitly using the carry flag as a branch condition is a bit weird.
>>> Generally using jg/jl/jge/jle tends to be clearer.
>>
>> I use it to take advantage of so-called macro-fusion.
>> It allows cmp and jxx to be fused, as long as the branch
>> checks only one flag, so all signed branches are to be avoided
>> (as stated by the Intel manual).
>> The newer the CPU, the more opcodes (add/sub)
>> can be fused, with fewer limitations.
>
> Every µarch that can do macro-op fusion with add/sub can do so with
> both signed and unsigned branch conditions.
> There's actually only a single µarch that can fuse cmp with unsigned
> but not signed branch conditions and that's more than a decade old
> (Conroe), but if you want to check if (unsigned)a < (unsigned)b, 'jb'
> is preferred over 'jc' (both produce the exact same opcode).

One is reason enough for me :}
The same thing works just as well or better on newer CPUs.

Also, using above/below on arithmetic operations
is a lot more confusing than carry/borrow.
You can see that I use "jc" and "jnc" after "sub",
as in the sketch below.

I'll keep it as it is, unless you have a solid reason to change it.
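
A minimal illustration of the pattern (label and register names are
placeholders, not taken from the patch):

    sub     r4, mmsize  ; arithmetic op that sets CF on borrow
    jnc     .loop       ; checks a single flag, so it can fuse with
                        ; the sub on µarchs that fuse add/sub + jcc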

>>> Assembler-specific ifdeffery should be avoided in individual files.
>>> Something equivalent already exists in x86inc actually but I don't
>>> remember if it was merged to FFmpeg from upstream (x264) yet.
>>
>> Definitely not merged.
>>
>> I hear a lot about the improved x86inc;
>> maybe it is time for you to merge it into FFmpeg?
>
> Now that I think about it I believe that part wasn't merged because
> smartalign is broken on some older nasm versions (which is another
> reason for avoiding things like this in individual files) and FFmpeg
> doesn't enforce any specific version requirements. I guess it could be
> worked around with some version checking ifdeffery if we know which
> versions are affected.

I don't see anything relevant in here:
http://repo.or.cz/nasm.git/history/HEAD:/macros/smartalign.mac
I also didn't notice anything similar in the changelog.

I don't think this is a relevant reason for not merging avx2 support. ;)

>>> Isn't this just an overly complicated way of expressing "dd 0*4, 1*4,
>>> 2*4, 3*4, 4*4, 5*4, 6*4, 7*4"?
>>
>> Yes.
>> Do you know a way that works with "times 8"?
>> I've done it this way to make it easy to change the size of the constant.
>>
>> Do you request I change it?
>
> I'd prefer just using that simple one-liner I posted. Replacing the
> numbers if you want to change things later is trivial.

It might be trivial, but it also makes it easier to misspell something.
I'd prefer to keep it as it is.
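
For the record, NASM's preprocessor can also generate such a table
without spelling out each element; a sketch:

    %assign i 0
    %rep 8
            dd i*4          ; expands to dd 0*4 ... dd 7*4
    %assign i i+1
    %endrep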

>>> Is reordering moves for the non-AVX path worth the additional
>>> complexity? Microoptimizations that only affect legacy hardware are
>>> IMO a bit questionable.
>>
>> Yes, I'm on "legacy" hardware and the improvement is reliably measurable.
>
> Ok, in that case you should also use shufps/addps instead of
> vshufps/vaddps in the AVX branch for cosmetic reasons though (x86inc
> performs automatic VEX-encoding).

Only on instructions it knows about...
which brings us again to AVX2.

I had it mixed before, but I changed it to be consistent.
Some new AVX instructions are not overloaded or
available without their "v" prefix;
e.g. x86util uses vpbroadcastw in SPLATW.

It makes more sense to use the "v" prefix for
code that is guaranteed to be AVX.

>>> The AVX1 integer warning is kind of silly and doesn't really serve any
>>> purpose.
>>
>> As I've said in this thread before,
>> if you use avx1 ymm with this macro
>> you won't get any warning that an avx2 opcode has been generated,
>> because x86inc does not overload avx2.
>
> Sure, but using any other AVX2 instruction in AVX functions won't
> result in any warnings either, so it doesn't make much sense to do it
> in this specific case.

Isn't one of the selling points of x86inc that
it warns if unsupported instructions are used?

To be honest, at first I had that function use an
equivalent logical operation from the floating-point domain.
It does impose a domain-transition penalty, and
I do think that doing that behind the developer's back is a bad idea.

This is why I think that this error should remain.

At least until the use of avx2 in avx1 code gives errors.
At that point the error would be redundant and could
be removed.


>>> Use the existing x86util VBROADCASTSS macro. Modify it if it's lacking
>>> something.
>>
>> The x86util VBROADCASTSS macro implements only
>> the avx1 variant that goes from memory to register.
>> My variant emulates the avx2 variant,
>> which also does reg to reg.
>>
>> My variant also avoids the write dependency of "movss reg, reg".
>> ("movss reg, mem" clears the other elements, but
>> "movss reg, reg" preserves them. This creates a dependency on the old
>> values of the register.)
>>
>> And yes, I did ask if I should separate
>> these macros into another file and
>> if I should try to merge them into x86util.asm first.
>>
>> What do you think on the matter?
>
> Add your improvements to the existing x86util macro (can be done in
> the same patch) and then use that.

FFmpeg guidelines request that this be a separate patch in a separate mail.
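
For reference, an improved macro along these lines might look roughly
like the sketch below (modeled on the EMU_pbroadcastd macro quoted
next; the actual patch may differ, and a 256-bit avx1 destination
would additionally need a vinsertf128 step):

    %macro VBROADCASTSS 2 ; dst, src
    %if cpuflag(avx2)
            vbroadcastss %1,   %2            ; reg-reg and reg-mem forms
    %else
      %ifnum sizeof%2                        ; register source
            shufps       %1,   %2,   %2,   q0000 ; no movss reg,reg dependency
      %elif cpuflag(avx)                     ; avx1 memory source
            vbroadcastss %1,   %2
      %else                                  ; sse memory source
            movss        %1,   %2            ; movss reg, mem zero extends
            shufps       %1,   %1,   %1,   q0000
      %endif
    %endif
    %endmacro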

>>> +%macro EMU_pbroadcastd 2 ; dst, src
>>> +%if cpuflag(avx2)
>>> +       vpbroadcastd %1,   %2
>>> +%elif cpuflag(avx) && mmsize >= 32
>>> +        %error AVX1 does not support integer 256 bit ymm operations
>>> +%else
>>> +  %ifnum sizeof%2       ; sse2 register
>>> +        pshufd      %1,   %2,   q0000
>>> +  %else                 ; sse memory
>>> +        movd        %1,   %2            ; movd zero extends
>>> +        pshufd      %1,   %1,   0
>>> +  %endif
>>> +%endif
>>> +%endmacro
>>>
>>> Same as above, but using the SPLATD macro.
>>
>> Again, SPLATD does not handle avx2.
>> Also, I do think it is better to use the name of an existing
>> instruction instead of a made-up one.
>
> Same as above. Yes, I agree that the current naming is inconsistent
> (that macro was created before broadcast instructions existed) and I'm
> OK with renaming it, but if so it should be done in a separate patch.

Just renaming it is not enough, because SPLATD
works differently (it can broadcast the second, third or fourth element too)
and it is already used in existing code.
Maybe a PBROADCASTD macro could use SPLATD internally,
something like the sketch below.
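
A hypothetical wrapper (names and structure are illustrative only, and
it assumes the plain one-argument SPLATD form):

    %macro PBROADCASTD 2 ; dst, src
    %if cpuflag(avx2)
            vpbroadcastd %1,   %2
    %elifnum sizeof%2           ; sse2 register source
            pshufd       %1,   %2,   q0000
    %else                       ; sse memory source
            movd         %1,   %2   ; movd zero extends
            SPLATD       %1
    %endif
    %endmacro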

>>> Use the actual cmpss instruction instead of pseudo-instructions.
>>> x86inc doesn't support automatic VEX handling for every possible
>>> pseudo-instruction (mainly because there are so many of them) which will
>>> result in AVX state transitions if INIT_YMM is used.
>>
>> Not according to Intel Architecture Code Analyzer
>> and the benchmarks I've got.
>> The benchmarks showed that using cmpps is about 5% slower on Ryzen
>> and exactly the same speed on Skylake.
>> That's with avx2 ymm.
>
> cmpss, not cmpps. "cmpless a, b" is a pseudo-instruction which the
> assembler will convert into the actual instruction "cmpss a, b, 2".

"cmpps" is not scalar, but it is handled by x86inc macros and converted
to "vcmpps", so no SSE penalties are possible with it.

Read again what I wrote previously.


> One could of course argue whether or not x86inc should deal with
> pseudo-instructions but right now it doesn't, so it will cause AVX
> state transitions if INIT_YMM is used since it won't be VEX-encoded.

Let's just say that using a bare number to select the comparison mode
is NOT developer friendly.
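
To illustrate (in the SSE compare encoding, immediate 2 selects the
LE predicate):

    cmpless xmm0, xmm1      ; pseudo-instruction: predicate in the name
    cmpss   xmm0, xmm1, 2   ; actual instruction the assembler emits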


>>> This macro is only ever used once, remove and inline it to reduce
>>> complexity.
>>
>> Can I move it to x86util too?
>>
>> Unified handling of PIC mangling would be welcome.
>> I do think this is the right way. ;)
>> I think that at least one other file had something very similar.
>
> Doing PIC-mangling optimally will in many cases be context-dependent
> which is probably why it doesn't already exist. For example if you
> need to do multiple complex PIC-loads you should do a single lea and
> reuse it which is harder to convey as a generic macro.

I think that this macro can help with that too, e.g.:

    SET_PIC_BASE lea,  r5, const_1
    movaps        m1,  [pic_base_const_1 + r2]
    movaps        m2,  [pic_base_const_1 + const_2 - const_1 + r2]

I guess you might want a second macro, something like:

    %define pic_mangle(a,b) (pic_base_ %+ a + b - a)
    movaps        m2,  [pic_mangle(const_1, const_2)]

>>> IIRC the number of used mmregs is ignored in 32-bit mode so I believe
>>> you can unconditionally specify 11.
>>
>> Is that documented and guaranteed to not change in future?
>> I had it at 11 unconditionally. But then decided to play safe.
>
> There are no callee-saved xmm registers or anything else in x86-32
> that would require keeping track of how many of them are clobbered.

OK, will "simplify" it then.

>>> Potential use case for 2x FMA instead of 2x mul+add.
>>
>> FMA is newer than AVX1. To actually use the opcodes
>> I'd need a 4th version just for it, or use avx2.
>>
>> Without possibility of using FMA, the macro usage
>> would be just an obfuscation.
>>
>> Do you insist on it?
>
> I wouldn't bother with a separate FMA version but if you end up doing
> AVX2 it should definitely be done as part of it (AVX2 is a superset of
> FMA3).
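
For reference, the substitution would look roughly like this in an
eventual AVX2/FMA3 version (register names are placeholders, not taken
from the patch):

    vmulps      m2, m3, m4      ; m2 = m3 * m4
    vaddps      m1, m1, m2      ; m1 += m2
    ; the pair collapses into one FMA3 instruction:
    vfmadd231ps m1, m3, m4      ; m1 += m3 * m4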
>
>>>> +%%restore_sign_loop:
>>>> +        movaps      m0,   [tmpY + r4q]  ; m0 = Y[i]
>>>> +        movups      m1,   [inXq + r4q]  ; m1 = X[i]
>>>> +        andps       m1,   m2            ; m1 = sign(X[i])
>>>> +        orps        m0,   m1            ; m0 = Y[i]*sign
>>>> +        cvtps2dq    m3,   m0            ; m3 = (int)m0
>>>> +        movaps      [outYq + r4q], m3
>>>
>>> The orps instruction could be done using a memory operand instead of
>>> movaps+orps.
>>
>> I think I had tried that, but it was slower on my CPU.
>
> That's quite weird if it's the case. It's literally the same number of
> µops just with a tiny bit shorter code size, so I would probably chalk
> it up to random differences in code alignment or something. Either way
> it's really not important so feel free to keep it as it is.

I don't leave alignment to chance ;)

Keep in mind that µops have different latencies, so
the order still matters.

I'll keep it as is; for reference, the folded variant is sketched below.
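
What the memory-operand variant would have looked like (same registers
as the quoted loop; tmpY is aligned, so the load can fold into orps):

    movups      m1,   [inXq + r4q]  ; m1 = X[i]
    andps       m1,   m2            ; m1 = sign(X[i])
    orps        m1,   [tmpY + r4q]  ; m1 = Y[i]*sign, load folded
    cvtps2dq    m3,   m1            ; m3 = (int)m1
    movaps      [outYq + r4q], m3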

>>> The reg-reg movaps on x64 could be optimized away by reordering
>>> registers so that the return value ends up in m0 instead of m6 from
>>> the start.
>>
>> No, it cannot.
>> I need m0 for the blendv mask on sse4.
>
> Oh right, I'm dumb. Never mind.
>
>>> The code has a bunch of if conditions for ymm regs but ymm regs are
>>> never actually used?
>>
>> I was asked to remove the INIT_YMM avx2.
>>
>> The benchmarks showed that the resulting code is
>> slightly slower in the default case, so I had the code disabled.
>> But a lot of people cannot stand having disabled code around. :|
>>
>> If you want to help, you might try to find out why the code is slower.
>> My main theory is that 8 floats at once might be just too much;
>> 90% of the searches are with N < 32.
>>
>> I don't have an avx2-capable cpu to play with.
>
> Likely due to the cmpless AVX state transition I mentioned earlier.
> Could be worth re-testing once that is taken care of.

No, it is not.

The cmpps version definitely got converted to vcmpps,
so no SSE penalty could have been imposed on it.

The "vcmpps" version had exactly the same speed on Skylake.
On Ryzen it was slower.

The reason for the slowdown lies somewhere else.

For now, should I bring back the disabled "INIT_YMM avx2"
or leave it as it is?

Best Regards

