[FFmpeg-devel] [PATCHES] lpc_mmx: merge some asm blocks and add xmm registers to clobber list

Ramiro Polla ramiro.polla
Sun Oct 31 14:43:31 CET 2010


On Sun, Oct 31, 2010 at 10:39 AM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
> On Sat, Oct 30, 2010 at 3:46 PM, Ramiro Polla <ramiro.polla at gmail.com> wrote:
>> $subj, split in 2 patches
> [..]
>> --- a/libavcodec/x86/lpc_mmx.c
>> +++ b/libavcodec/x86/lpc_mmx.c
>> @@ -29,16 +29,15 @@ static void apply_welch_window_sse2(const int32_t *data, int len, double *w_data
>> ? ? ?x86_reg i = -n2*sizeof(int32_t);
>> ? ? ?x86_reg j = ?n2*sizeof(int32_t);
>> ? ? ?__asm__ volatile(
>> - ? ? ? ?"movsd ? %0, ? ? %%xmm7 ? ? ? ? ? ? ? ?\n\t"
>> + ? ? ? ?"movsd ? %4, ? ? %%xmm7 ? ? ? ? ? ? ? ?\n\t"
>> ? ? ? ? ?"movapd ?"MANGLE(ff_pd_1)", %%xmm6 ? ? \n\t"
>> ? ? ? ? ?"movapd ?"MANGLE(ff_pd_2)", %%xmm5 ? ? \n\t"
>> ? ? ? ? ?"movlhps %%xmm7, %%xmm7 ? ? ? ? ? ? ? ?\n\t"
>> ? ? ? ? ?"subpd ? %%xmm5, %%xmm7 ? ? ? ? ? ? ? ?\n\t"
>> ? ? ? ? ?"addsd ? %%xmm6, %%xmm7 ? ? ? ? ? ? ? ?\n\t"
>> - ? ? ? ?::"m"(c)
>> - ? ?);
>> + ? ? ? ?"bt ? ? ?$1, ? ? %5 ? ? ? ? ? ? ? ? ? ?\n\t"
>> + ? ? ? ?"jne ? ? 2f ? ? ? ? ? ? ? ? ? ? ? ? ? ?\n\t"
>> ?#define WELCH(MOVPD, offset)\
>> - ? ?__asm__ volatile(\
>> ? ? ? ? ?"1: ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\n\t"\
>> ? ? ? ? ?"movapd ? %%xmm7, ?%%xmm1 ? ? ? ? ? ? ?\n\t"\
>> ? ? ? ? ?"mulpd ? ?%%xmm1, ?%%xmm1 ? ? ? ? ? ? ?\n\t"\
>> @@ -55,13 +54,15 @@ static void apply_welch_window_sse2(const int32_t *data, int len, double *w_data
>> ? ? ? ? ?"sub ? ? ?$8, ? ? ?%1 ? ? ? ? ? ? ? ? ?\n\t"\
>> ? ? ? ? ?"add ? ? ?$8, ? ? ?%0 ? ? ? ? ? ? ? ? ?\n\t"\
>> ? ? ? ? ?"jl 1b ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \n\t"\
>> - ? ? ? ?:"+&r"(i), "+&r"(j)\
>> - ? ? ? ?:"r"(w_data+n2), "r"(data+n2)\
>> - ? ?);
>> - ? ?if(len&1)
>> +
>> ? ? ? ? ?WELCH("movupd", -1)
>> - ? ?else
>> + ? ? ? ?"jmp 3f ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\n\t"
>> + ? ? ? ?"2: ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\n\t"
>> ? ? ? ? ?WELCH("movapd", -2)
>> + ? ? ? ?"3: ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\n\t"
>> + ? ? ? ?:"+&r"(i), "+&r"(j)
>> + ? ? ? ?:"r"(w_data+n2), "r"(data+n2), "m"(c), "r"(len)
>> + ? ?);
>> ?#undef WELCH
>> ?}
>
> <nag> What I don't like here is that you're changing it in a hardcoded
> (conditional) jump, like this:
>
> test bla
> jne else
> ; stuff code
> jmp endif
> .else
> ; stuff2 code
> .endif
> ret
>
> Which means that both branches will always have a jmp. In the C case
> (at least with a good gcc version), if you check the disassembly, gcc
> will generate code that does this:
>
> test bla
> jne else
> ; stuff code
> ret
> .else
> ; stuff2 code
> ret

This function is inlined into the other one. What I got (with gcc 4.4.4) is

test bla
jne else
(code1)
.rest
(rest of function)
ret
.else
(code2)
jmp rest

So it should be about the same thing.

> And as you can see, the first half is actually jmp-less. I've tested
> this and it's quite a bit faster, just the lack of the jmp in one half
> of the runtime. Combined with stuff like predicting which of the two
> is most likely to happen in the statistical sense, this can have a
> significant impact on performance. In yasm, you'd simply code a RET at
> the end of each branch. Can we do something similar in inline asm?

Isn't an unconditional jump predictable by the CPU? And if we write
our own ret then we'd have to explicitly save the contents of xmm
registers, which would beat the purpose of these changes. Perhaps
someone wants to write this in .asm afterwards =)



More information about the ffmpeg-devel mailing list