[FFmpeg-devel] Patch: Inline asm fixes for Intel compiler on Windows

Sun Apr 6 06:29:07 CEST 2014

On 6 April 2014 00:47, Reimar Döffinger <Reimar.Doeffinger at gmx.de> wrote:

> On Sun, Apr 06, 2014 at 12:21:54AM +1100, Matt Oliver wrote:
> > > If you do not initialize "bit", making it an output-only paramenter,
> > > you should use =&q instead of +q as constraint.
> > > However neither is correct as-is, setae sets only a single byte,
> > > thus not initializing it will result in uninitialized data,
> > > IOW the code as is is broken.
> > > If you want to do that kind of change, you would have to use uint8_t
> > > instead as type.
> >
> >
> > Interestingly both compilers I tested with didnt show any errors and had
> > the upper bits zeroed. However as the function returns a 32b int and any
> > code that calls this function would also assume 32b then a xor to set to
> 0
> > initially is still probably the way to go. Using uint8 would save that
> > instruction but would add one later to clear the upper bytes should it
> ever
> > be used for anything larger than 1B. So fixed to add back the set to 0.
>
> I think it is better like this since it's an unrelated change.
> However when it comes to performance it might be better to use uint8_t,
> since this code as-is is very likely to cause partial register stalls
> as far as I can tell.
> A movzx would increase code size, but not increase number of
> instructions and avoid this issue.
> It would also free a register if it was moved last and = instead of =&
> used (if the compiler manages to figure it out).
> Of course only benchmarking would tell if it really is an improvement.
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel at ffmpeg.org
> http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>

The way it is currently (as of last patch) is 5.4% faster. However changing
it to a uint8 (and using = as suggested) is actually only 0.97% faster.
Whether a movzx is added explicitly to the asm or allowing the compiler to
do the convert on return results in identical performance.

The function could be changed to return a uint8 but on inspection only vp5
will benefit from this (it has a lot of if(rac_get_prob) that can work
directly on uint8) as all the others work directly on int's which would add
the conversion which benchmarking shows provided limited benefit. In fact I
tested it with the vp5 function and it didnt add any benefit over the
current patch.

So from the benchmark results it appears the patch as of the last update
provides the best performance.

On another note I was looking at the cabac functions in x86/cabac.h and at
present the x86 optimised get_cabac_inline is excluded by the
BROKEN_COMPILER define. However the functions in x86/h264_i386.h are not
excluded. I find this interesting as get_cabac_inline just has a single
implementation of the BRANCHLESS_GET_CABAC macro which is apparently
broken. While for instance the function decode_significance in h264_i386
has multiple BRANCHLESS_GET_CABAC's which I would have assumed would also
be broken. Should BROKEN_COMPILER also exclude these functions? icl is
broken under any function that calls BRANCHLESS_GET_CABAC so if the
existing BROKEN_COMPILER should also be used to exclude the functions in
h264_i386 then icl can just be added on to the existing define. otherwise a
new one must be added just for icl. Adding it to the existing one would be
cleaner and with the smallest amount of changes and I wonder if this is
actually the correct way anyway as I would assume the h264_i386 should be
marked as broken under those compilers anyway.