[Ffmpeg-devel] [PATCH] lowres chroma bug

Thu Feb 8 14:32:58 CET 2007

On Thu, 8 Feb 2007, Michael Niedermayer wrote:
> On Wed, Feb 07, 2007 at 03:51:22PM -0800, Trent Piepho wrote:
> > >
> > > last time i compared hardcoded registers with gcc-chosen ones, the latter
> > > were slower (that was in cabac.h in case you want to prove me wrong, i'd
> > > be happy if we could get rid of the hardcoded registers there ...)
> >
> > It's going to depend a lot on how the code is used.  If your asm will only
> > appear in one place, i.e. it's neither a macro nor an inlined function nor
> > in an unrolled loop, etc., then you could just let gcc pick a register and
> > then go back and hardcode that same register.  That should generate the
> > exact same code.
>
> i agree, it should, but i am not so sure it really does if you need
> additional dummy variables for the gcc-chosen register case ...

You can see in the resulting code that gcc doesn't generate any loads or
stores to the dummy variable, or even allocate any stack space for it.
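
To make the "dummy variable" idea concrete, here is a rough sketch of the
two styles side by side.  This is not code from ffmpeg; the function names
add_hardcoded() and add_gcc_picks() are made up for illustration, and the
asm is x86-only (other targets get a plain C fallback so the file still
builds).  In the second function, tmp exists only so that gcc has an
operand to assign a scratch register to.

```c
/* Illustrative sketch only; both function names are hypothetical. */
#if defined(__i386__) || defined(__x86_64__)

/* Style 1: the scratch register is named in the asm text and listed
 * as a clobber, so gcc must keep eax free around every use. */
static inline int add_hardcoded(int x, int y)
{
    int res;
    /* __asm__ is the same as asm, but is also accepted with -std=c99 */
    __asm__("movl %1, %%eax\n\t"
            "addl %2, %%eax\n\t"
            "movl %%eax, %0"
            : "=r"(res)
            : "r"(x), "r"(y)
            : "eax", "cc");
    return res;
}

/* Style 2: tmp is a dummy output, so gcc chooses the scratch register.
 * "=&r" (earlyclobber) keeps tmp out of the registers holding x and y,
 * because tmp is written before both inputs have been read. */
static inline int add_gcc_picks(int x, int y)
{
    int res, tmp;
    __asm__("movl %2, %1\n\t"
            "addl %3, %1\n\t"
            "movl %1, %0"
            : "=r"(res), "=&r"(tmp)
            : "r"(x), "r"(y)
            : "cc");
    return res;
}

#else   /* non-x86: plain C so the example still compiles */

static inline int add_hardcoded(int x, int y) { return x + y; }
static inline int add_gcc_picks(int x, int y) { return x + y; }

#endif
```

Compiling both with -S and comparing the output is the quickest way to
check whether tmp ever gets a stack slot in a given context.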

> > The advantage comes when the code is a macro or inlined in multiple places.
> > With a hard coded register, the same register must be used each time.  If
> > you let gcc choose, it can pick different registers depending on the
> > context.  In this case, no matter what register you pick, you may do worse
> > than letting gcc pick.
>
> in theory yes, in practice i dont have that much faith in gcc's ability to
> select registers better than random assignment, and forcing
> input operands to be always in the same register compared to random ones
> can avoid some instructions

At least in simple cases, it is easy to see that gcc's register assignment
is much better than random.  Here's an example:
#include <string.h>     /* bzero() */
void bar(int);          /* defined elsewhere; only the call matters here */
int foo(void)
{
    int a, b;
    void *d, *s;

    asm("# a = %0, b = %1" : "=r"(a), "=r"(b));     /*block 1*/
    bar(a);
    asm("# read a = %0 b = %1" :: "r"(a), "r"(b));

    asm("# s = %0, d = %1" : "=r"(s), "=r"(d));     /*block 2*/
    bzero(d, 32);

    asm("# a = %0, b = %1" : "=r"(a), "=r"(b) : "r"(s)); /*block 3*/
    return a;
}

In block 1, a and b need to keep their values across the call to bar().
gcc generates:
        # a = %ebx, b = %esi    # a, b
        pushl   %ebx    # a
        call    bar     #
        # read a = %ebx b = %esi        # a, b

It chose ebx and esi because those are callee-saved registers: bar() has to
preserve them, so a and b do not need to be saved and re-loaded across the
call.  If the call to bar() is commented out, it will choose edx and eax
instead.

In block 2, gcc will emit an inline version of bzero using rep stosl, which
always stores to the address held in edi, and so gcc will assign edi to d.
Change the bzero to use s or a or b, and then that variable will be
assigned edi.
Comment out the bzero, and gcc will just use eax/edx.

In block 3, a is the return value of the function and so will be put in eax
since that's where the return value needs to go.  Change the function to
return b, and then b will get put in eax.

> > Like the inlined put_bits() function in bitstream.h, I think you would get
> > better code if the eax wasn't hardcoded.
>
> well benchmark it and send a patch if its faster

I have no idea how to benchmark that function.  Adding an rdtsc to the code
will totally change the register allocation since it clobbers eax and edx.
Also, better register allocation doesn't make the asm code itself any
faster; the instructions are the same no matter which register they use.
Rather, it makes the code around the asm block faster.  So, you would need
to benchmark all the code that put_bits() is inlined into.  How could that
be done?  You could benchmark the entire program, but I doubt slightly
better code in put_bits() would be measurable against everything else.
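
For reference, this is roughly what an rdtsc read looks like as gcc inline
asm, and why dropping one next to the code under test disturbs it.  The
instruction always returns its result in edx:eax, so the only constraints
that can describe it are "=a" and "=d"; gcc has no freedom here and must
move anything live out of those two registers first.  The name read_tsc()
is made up for this sketch, and non-x86 targets fall back to clock() just
so the example builds.

```c
#include <stdint.h>

#if defined(__i386__) || defined(__x86_64__)

/* Hypothetical helper: read the time stamp counter.
 * "=a" pins lo to eax and "=d" pins hi to edx, because rdtsc writes
 * those registers unconditionally. */
static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

#else   /* non-x86 fallback so the sketch still compiles */

#include <time.h>
static inline uint64_t read_tsc(void)
{
    return (uint64_t)clock();
}

#endif
```

Putting two of these around a small inlined function therefore measures
a differently allocated version of the surrounding code, not the code
that normally runs.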