[FFmpeg-devel] [PATCH]v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm

Henrik Gramner henrik at gramner.com
Wed Aug 2 18:55:07 EEST 2017


On Tue, Aug 1, 2017 at 11:46 PM, Ivan Kalvachev <ikalvachev at gmail.com> wrote:
> On 7/31/17, Henrik Gramner <henrik at gramner.com> wrote:
>> Use rN instead of rNq for numbered registers (q suffix is used for
>> named args only due to preprocessor limitations).
>
> Is this documented?

Not sure, but there are probably a bunch of small things like this that
aren't really well documented.

>> Use the same "standard" vertical alignment rules as most existing
>> code, e.g. instructions indented by 4 spaces and operands aligned on
>> the first comma.
>
> What do you mean operands aligned on the first comma?
> The longest operand I had is "xmm0"; counting the comma and space
> I get 6-position alignment for operands.
> Now, with "xm7" I can lower this to 5 positions. Should I do that?
> (I don't have "xm15".)
>

1234_____________1234_1234_123
    VBROADCASTSS ym1, xm1
    BLENDVPS      m1, m2, m3

is the most commonly used alignment.

>> Unless aligning every single loop entry helps a lot I'd avoid it since
>> it does waste a bit of icache.
>
> I'll redo my benchmarks, but I have placed these aligns for a reason.
> At some point, removing debug instructions started making my code
> slower, so I placed the aligns to avoid random slowdowns.

Ok.

>> Explicitly using the carry flag as a branch condition is a bit weird.
>> Generally using jg/jl/jge/jle tends to be clearer.
>
> I use it to take advantage of so-called macro-op fusion.
> It allows cmp and jxx to be merged, as long as the branch
> checks only one flag, so all signed branches are to be avoided
> (as stated by the Intel manual).
> The newer the CPU, the more opcodes (add/sub)
> can be merged, and the fewer the limitations.

Every µarch that can do macro-op fusion with add/sub can do so with
both signed and unsigned branch conditions.

There's actually only a single µarch that can fuse cmp with unsigned
but not signed branch conditions, and that's more than a decade old
(Conroe). But if you want to check whether (unsigned)a < (unsigned)b,
'jb' is preferred over 'jc' (both produce the exact same opcode).
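
For illustration, a minimal sketch of the pattern in question (the
counter/bound registers here are made up for the example):

    cmp     r4d, r5d    ; unsigned compare of counter against bound
    jb      .loop       ; fuses with the cmp on µarchs with macro-op fusion;
                        ; 'jb' and 'jc' assemble to the same opcode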

>> Assembler-specific ifdeffery should be avoided in individual files.
>> Something equivalent already exists in x86inc actually but I don't
>> remember if it was merged to FFmpeg from upstream (x264) yet.
>
> Definitely not merged.
>
> I hear a lot about the improved x86inc,
> maybe it is time for you to merge it in FFmpeg?

Now that I think about it, I believe that part wasn't merged because
smartalign is broken on some older nasm versions (which is another
reason for avoiding things like this in individual files) and FFmpeg
doesn't enforce any specific version requirement. I guess it could be
worked around with some version-checking ifdeffery if we know which
versions are affected.
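
Something along these lines, as a sketch only (the 2.12 cutoff is a
placeholder rather than a verified minimum, and checking for
__NASM_VERSION_ID__ also conveniently excludes yasm):

    %ifdef __NASM_VERSION_ID__
        %if __NASM_VERSION_ID__ >= 0x020C0000 ; placeholder: nasm >= 2.12
            %use smartalign
            ALIGNMODE p6
        %endif
    %endif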

>> Isn't this just an overly complicated way of expressing "dd 0*4, 1*4,
>> 2*4, 3*4, 4*4, 5*4, 6*4, 7*4"?
>
> Yes.
> Do you know a way that works with "times 8"?
> I've done it this way to make it easy to change the size of the constant.
>
> Do you request I change it?

I'd prefer just using that simple one-liner I posted. Replacing the
numbers if you want to change things later is trivial.
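
For completeness, the macro-driven alternative being asked about would
look roughly like this (label name made up), though the plain dd
one-liner stays simpler:

    const_offsets:      ; same as: dd 0*4, 1*4, 2*4, 3*4, 4*4, 5*4, 6*4, 7*4
    %assign i 0
    %rep 8
            dd i*4
    %assign i i+1
    %endrep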

>> Is reordering moves for the non-AVX path worth the additional
>> complexity? Microoptimizations that only affect legacy hardware are
>> IMO a bit questionable.
>
> Yes, I'm on "legacy" hardware and the improvement is reliably measurable.

Ok, in that case you should also use shufps/addps instead of
vshufps/vaddps in the AVX branch, for cosmetic consistency, since
x86inc performs automatic VEX-encoding anyway.
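
That is, under an AVX INIT the plain mnemonics already come out
VEX-encoded; a tiny sketch:

    INIT_XMM avx
        shufps  m0, m1, q1032   ; x86inc emits vshufps m0, m0, m1, 0x4e
        addps   m0, m2          ; x86inc emits vaddps  m0, m0, m2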

>> The AVX1 integer warning is kind of silly and doesn't really serve any
>> purpose.
>
> As I've said in this thread before,
> if you use avx1 ymm with this macro
> you won't get any warning that an avx2 opcode has been generated,
> because x86inc does not overload avx2.

Sure, but using any other AVX2 instruction in AVX functions won't
result in any warnings either, so it doesn't make much sense to do it
in this specific case.

>> Use the existing x86util VBROADCASTSS macro. Modify it if it's lacking
>> something.
>
> The x86util VBROADCASTSS macro implements only
> the avx1 variant, which is memory to register.
> My variant emulates the avx2 variant,
> which also does reg to reg.
>
> My variant also avoids a write dependency on "movss reg, reg".
> ("movss reg, mem" clears the other elements, but
> "movss reg, reg" preserves them, which creates a dependency on the
> old values of the register.)
>
> And yes, I did ask if I should separate
> these macros in another file and
> if I should try to merge them in x86util.asm first.
>
> What do you think on the matter?

Add your improvements to the existing x86util macro (can be done in
the same patch) and then use that.
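
Something in this direction, say, as a sketch of the combined macro
(not necessarily the exact form it should take in x86util):

    %macro VBROADCASTSS 2 ; dst xmm/ymm, src m32/xmm
    %if cpuflag(avx2)
        vbroadcastss %1, %2              ; AVX2 also allows reg-to-reg broadcast
    %elif cpuflag(avx)
        %ifnum sizeof%2                  ; register src: AVX1 vbroadcastss is mem-only
            shufps       xmm%1, xmm%2, xmm%2, q0000
            %if sizeof%1 >= 32
                vinsertf128  %1, %1, xmm%1, 1
            %endif
        %else                            ; memory src
            vbroadcastss %1, %2
        %endif
    %else                                ; SSE
        %ifnum sizeof%2
            shufps       %1, %2, %2, q0000   ; no movss reg,reg write dependency
        %else
            movss        %1, %2          ; movss from memory zeroes the upper elements
            shufps       %1, %1, q0000
        %endif
    %endif
    %endmacro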

>> +%macro EMU_pbroadcastd 2 ; dst, src
>> +%if cpuflag(avx2)
>> +       vpbroadcastd %1,   %2
>> +%elif cpuflag(avx) && mmsize >= 32
>> +        %error AVX1 does not support integer 256 bit ymm operations
>> +%else
>> +  %ifnum sizeof%2       ; sse2 register
>> +        pshufd      %1,   %2,   q0000
>> +  %else                 ; sse memory
>> +        movd        %1,   %2            ; movd zero extends
>> +        pshufd      %1,   %1,   0
>> +  %endif
>> +%endif
>> +%endmacro
>>
>> Same as above, but using the SPLATD macro.
>
> Again, SPLATD does not handle avx2.
> Also, I do think it is better to use the name of an existing
> instruction instead of a made-up one.

Same as above. Yes, I agree that the current naming is inconsistent
(that macro was created before the broadcast instructions existed) and
I'm OK with renaming it, but if so that should be done in a separate
patch.

>> Use the actual cmpss instruction instead of pseudo-instructions.
>> x86inc doesn't support automatic VEX handling for every possible
>> pseudo-instruction (mainly because there's so many of them), which
>> will result in AVX state transitions if INIT_YMM is used.
>
> Not according to Intel Architecture Code Analyzer
> and the benchmarks I've got.
> The benchmarks showed that using cmpps is about 5% slower on Ryzen
> and absolutely same speed on SkyLake.
> That's in avx2 ymm.

cmpss, not cmpps. "cmpless a, b" is a pseudo-instruction which the
assembler will convert into the actual instruction "cmpss a, b, 2".

One could of course argue whether or not x86inc should deal with
pseudo-instructions, but right now it doesn't, so this will cause AVX
state transitions if INIT_YMM is used since it won't be VEX-encoded.
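
Concretely (a sketch; the immediate 2 is the "less-than-or-equal"
predicate):

    INIT_YMM avx2
        cmpless xm1, xm2     ; pseudo-instruction, not overloaded by x86inc:
                             ; legacy-SSE encoded -> AVX state transition
        cmpss   xm1, xm2, 2  ; same comparison; x86inc emits VEX-encoded vcmpss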

>> This macro is only ever used once, remove and inline it to reduce
>> complexity.
>
> Can I move it to x86util too?
>
> Unified handling of PIC mangling is something welcomed.
> I do think this is the right way. ;)
> I think that at least one other file had something very similar.

Doing PIC-mangling optimally will in many cases be context-dependent,
which is probably why a generic macro doesn't already exist. For
example, if you need to do multiple complex PIC loads you should do a
single lea and reuse it, which is harder to convey as a generic macro.
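
Roughly this kind of pattern (table and register names are made up for
the example):

    lea     r5, [const_table]       ; a single RIP-relative lea ("default rel")
    movaps  m3, [r5 + r4]           ; ...then reuse r5 as the base register
    addps   m3, [r5 + r4 + mmsize]  ; for every further access to the table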

>> IIRC the number of used mmregs is ignored in 32-bit mode so I believe
>> you can unconditionally specify 11.
>
> Is that documented and guaranteed not to change in the future?
> I had it at 11 unconditionally but then decided to play it safe.

There are no callee-saved xmm registers or anything else in x86-32
that would require keeping track of how many of them are clobbered.

>> Potential use case for 2x FMA instead of 2x mul+add.
>
> FMA is newer than AVX1. To actually use the opcodes
> I'd need to have a 4th version just for it, or use avx2.
>
> Without the possibility of using FMA, the macro usage
> would just be obfuscation.
>
> Do you insist on it?

I wouldn't bother with a separate FMA version but if you end up doing
AVX2 it should definitely be done as part of it (AVX2 is a superset of
FMA3).
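
In x86inc terms the avx2 cpuflag already implies fma3, so inside an
AVX2 version it could look roughly like this (register choice is
arbitrary here, and I'm going from memory on the operand order of
x86inc's fmaddps helper: dst = src1*src2 + src3):

    INIT_YMM avx2
        fmaddps m4, m4, m6, m5  ; m4 = m4*m6 + m5, in place of mulps+addps
        fmaddps m7, m7, m6, m3  ; m7 = m7*m6 + m3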

>>> +%%restore_sign_loop:
>>> +        movaps      m0,   [tmpY + r4q]  ; m0 = Y[i]
>>> +        movups      m1,   [inXq + r4q]  ; m1 = X[i]
>>> +        andps       m1,   m2            ; m1 = sign(X[i])
>>> +        orps        m0,   m1            ; m0 = Y[i]*sign
>>> +        cvtps2dq    m3,   m0            ; m3 = (int)m0
>>> +        movaps      [outYq + r4q], m3
>>
>> The orps instruction could be done using a memory operand instead of
>> movaps+orps.
>
> I think I had tried that, but it was slower on my CPU.

That's quite weird if that's the case. It's literally the same number
of µops, just with slightly shorter code size, so I would probably
chalk it up to random differences in code alignment or something.
Either way it's really not important, so feel free to keep it as it is.
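
For reference, the suggested variant was simply (tmpY is aligned, so
the SSE memory operand is fine):

        movups      m1,   [inXq + r4q]  ; m1 = X[i]
        andps       m1,   m2            ; m1 = sign(X[i])
        orps        m1,   [tmpY + r4q]  ; m1 = Y[i]*sign, movaps folded away
        cvtps2dq    m3,   m1            ; m3 = (int)m1
        movaps      [outYq + r4q], m3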

>> The reg-reg movaps on x64 could be optimized away by reordering
>> registers so that the return value ends up in m0 instead of m6 from
>> the start.
>
> No, it cannot.
> I need m0 for the blendv mask on sse4.

Oh right, I'm dumb. Never mind.

>> The code has a bunch of if conditions for ymm regs but ymm regs are
>> never actually used?
>
> I was asked to remove the INIT_YMM avx2.
>
> The benchmarks showed that the resulting code is
> slightly slower in the default case, so I had the code disabled.
> But a lot of people cannot stand having disabled code around. :|
>
> If you want to help, you might try to find out why the code is slower.
> My main theory is that 8 floats at once might just be too much;
> 90% of the searches are with N < 32.
>
> I don't have an avx2-capable cpu to play with.

Likely due to the cmpless AVX state transition I mentioned earlier.
Could be worth re-testing once that is taken care of.

