[FFmpeg-devel] Extend/optimize RGB to RGB conversions funcs into rgb2rgb.c

Mon Sep 10 03:29:37 CEST 2012

Here is the asm code for the inner loops of the original and modified
rgb32to24() functions, as generated by gcc with the -O9 and -S options:

rgb32to24_original:

.L4:
	movzbl	2(%ecx,%eax,4), %ebx
	movb	%bl, (%edx)
	movzbl	1(%ecx,%eax,4), %ebx
	movb	%bl, 1(%edx)
	movzbl	(%ecx,%eax,4), %ebx
	addl	$1, %eax
	movb	%bl, 2(%edx)
	addl	$3, %edx
	cmpl	%esi, %eax
	jne	.L4

rgb32to24_modified:

.L21:
	movzbl	2(%edx), %ecx
	movb	%cl, (%eax)
	movzbl	1(%edx), %ecx
	movb	%cl, 1(%eax)
	movzbl	(%edx), %ecx
	addl	$4, %edx
	movb	%cl, 2(%eax)
	addl	$3, %eax
	cmpl	%ebx, %eax
	jne	.L21
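
For reference, here is a minimal C sketch of the two loop forms that these listings correspond to, reconstructed from the diff quoted further down in this thread. The function names, the num_pixels parameter, the end-pointer loop condition and the self-check helper are assumptions made for illustration; this is not the actual rgb2rgb.c code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Indexed form (the "-" lines of the quoted diff): every access computes
 * its offset from the loop counter i. */
static void rgb32to24_indexed(const uint8_t *src, uint8_t *dst, int num_pixels)
{
    assert(num_pixels >= 0);
    for (int i = 0; i < num_pixels; i++) {
        dst[3 * i + 0] = src[4 * i + 2];
        dst[3 * i + 1] = src[4 * i + 1];
        dst[3 * i + 2] = src[4 * i + 0];
    }
}

/* Pointer form (the "+" lines of the quoted diff): two running pointers
 * and constant offsets. Looping until dst reaches an end pointer is an
 * assumption, chosen because the modified listing compares against %eax. */
static void rgb32to24_pointer(const uint8_t *src, uint8_t *dst, int num_pixels)
{
    assert(num_pixels >= 0);
    const uint8_t *psrc = src;
    const uint8_t *end  = dst + 3 * num_pixels;
    while (dst != end) {
        dst[0] = psrc[2];
        dst[1] = psrc[1];
        dst[2] = psrc[0];
        dst  += 3;
        psrc += 4;
    }
}

/* Hypothetical self-check: returns 1 when both forms write identical
 * output for a small synthetic buffer. */
static int rgb32to24_forms_agree(void)
{
    enum { N = 16 };
    uint8_t src[4 * N], a[3 * N], b[3 * N];
    for (int i = 0; i < 4 * N; i++)
        src[i] = (uint8_t)(i * 7 + 1);
    rgb32to24_indexed(src, a, N);
    rgb32to24_pointer(src, b, N);
    return memcmp(a, b, sizeof a) == 0;
}
```

Note that both listings above execute ten instructions per iteration: the scaled-index operand in the original, e.g. 2(%ecx,%eax,4), is part of the load's addressing mode and not a separate multiply instruction.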

The modified loop seems "less complex"/"more direct" than the original
(so, at first glance, it has a good chance of being faster than the original),
**but** the original loop always seems faster in my tests :(
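
The kind of test referred to here can be sketched as a small timing harness. This is only an illustration under stated assumptions: bench_ms, convert_fn and rgb32to24_ref are hypothetical names, not the code that produced the numbers in this thread, and clock() measures CPU time with fairly coarse resolution.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical common signature for the conversion functions under test. */
typedef void (*convert_fn)(const uint8_t *src, uint8_t *dst, int num_pixels);

/* Stand-in conversion so the harness is self-contained (indexed form). */
static void rgb32to24_ref(const uint8_t *src, uint8_t *dst, int num_pixels)
{
    for (int i = 0; i < num_pixels; i++) {
        dst[3 * i + 0] = src[4 * i + 2];
        dst[3 * i + 1] = src[4 * i + 1];
        dst[3 * i + 2] = src[4 * i + 0];
    }
}

/* Runs fn niters times over an npixels-wide buffer and returns the elapsed
 * CPU time in milliseconds, or -1.0 on bad arguments or allocation failure. */
static double bench_ms(convert_fn fn, int npixels, int niters)
{
    assert(fn != NULL);
    if (npixels <= 0 || niters <= 0)
        return -1.0;

    uint8_t *src = malloc(4 * (size_t)npixels);
    uint8_t *dst = malloc(3 * (size_t)npixels);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }
    for (int i = 0; i < 4 * npixels; i++)
        src[i] = (uint8_t)i;

    volatile uint8_t sink = 0;
    clock_t t0 = clock();
    for (int it = 0; it < niters; it++) {
        fn(src, dst, npixels);
        /* Read one output byte so the compiler must keep the stores. */
        sink ^= dst[it % (3 * npixels)];
    }
    clock_t t1 = clock();
    (void)sink;

    free(src);
    free(dst);
    return 1000.0 * (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

A call such as bench_ms(rgb32to24_ref, 1024, 65536) would match the npixels and niters values of the run quoted below. Since differences of a few percent are within the fluctuation reported in this thread, repeating each measurement several times and comparing the minimum is probably safer than trusting a single run.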

Does anyone have an idea why?
(I have only given the inner loops, but the setup and teardown of the two
functions look relatively similar, so I don't think the difference can be at
that level)

@+
Yannoo

Quoting yann.lepetitcorps at free.fr:

> With a larger number of tests/iterations, the results fluctuate much less
>
>
> RGB->RGBA and RGBA->RGB conversions tests (npixels=1024 niters=65536)
>
> Test original rgb24to32() func : 182 ms
> Test new rgb24to32_alpha() func : 177 ms
> Test original rgba32to24() func : 138 ms
> Test modified rgba32to24() func : 142 ms
>
> rgb24to32() : original=182ms modified=177ms (5ms 2.82%)
> rgba32to24() : original=138ms modified=142ms (-4ms -2.82%)
>
> The new rgb24to32_alpha() func is faster than the original rgb24to32(),
> with the alpha handling for free :)
>
> But conversely, the modified rgba32to24() is slightly slower than the
> original version :(
> => Tomorrow I will take a look at the asm output to understand exactly why ...
>
>
> @+
> Yannoo
>
>
> Quoting yann.lepetitcorps at free.fr:
>
> > Right, I have rerun the benchmark, this time with the -O9 option on GCC,
> > and the runtime difference between the original and new versions is
> > relatively small:
> >
> > Test original rgb24to32() func : 28 ms
> > Test new rgb24to32_alpha() func : 28 ms
> > Test original rgba32to24() func : 24 ms
> > Test modified rgba32to24() func : 23 ms
> >
> > rgb24to32() : original=28ms modified=28ms (0ms 0.00%)
> >
> > rgba32to24() : original=24ms modified=23ms (1ms 4.35%)
> >
> > Note that the results fluctuate quite a bit, with differences between
> > -15% and +15%
> > (the "new" rgba32to24() generally seems faster than the "old" one, but the
> > new rgb24to32_alpha() is regularly slower than rgb24to32() [but it handles
> > the alpha parameter, whereas rgb24to32() always sets the alpha to 255])
> >
> >
> > @+
> > Yannoo
> >
> >
> >
> > Quoting Loren Merritt <lorenm at u.washington.edu>:
> >
> > > On Mon, 10 Sep 2012, yann.lepetitcorps at free.fr wrote:
> > > > Quoting Reimar Döffinger <Reimar.Doeffinger at gmx.de>:
> > > >
> > > >> Though one thing I wonder is why exactly that is faster, and why your
> > > >> compiler can't figure out how to optimize it on its own.
> > > >> There is also the issue that, compared to NEON-optimizing the code,
> > > >> this is a rather minor optimization.
> > > >
> > > > I think it is a little faster because of this:
> > > >
> > > > -        dst[3 * i + 0] = src[4 * i + 2];
> > > > -        dst[3 * i + 1] = src[4 * i + 1];
> > > > -        dst[3 * i + 2] = src[4 * i + 0];
> > > >
> > > > +        dst[0] = psrc[2];
> > > > +        dst[1] = psrc[1];
> > > > +        dst[2] = psrc[0];
> > > >
> > > > => the copy is made with "direct" addressing, i.e. without
> > > > multiplications or additions inside the [] array indexing
> > > > (can the compiler automatically handle the * 3 multiplication for free?)
> > >
> > > It's not that a *3 is free, but rather that the addressing mode of the
> > > generated instructions doesn't have to be the same as the one in the
> > > source code. GCC is normally capable of switching from index variables to
> > > pointer incrementing or vice versa, though it doesn't always choose
> > > optimally when to do so.
> > >
> > > --Loren Merritt
> >
> >
> > _______________________________________________
> > ffmpeg-devel mailing list
> > ffmpeg-devel at ffmpeg.org
> > http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
> >