[FFmpeg-devel] [PATCH] Fix function parameters for rgb48 to YV12 functions.
Wed Feb 3 02:30:24 CET 2010
On Tue, Feb 2, 2010 at 5:42 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
> On Tue, Feb 02, 2010 at 08:21:15PM +0100, Reimar Döffinger wrote:
>> On Tue, Feb 02, 2010 at 08:01:26PM +0100, Michael Niedermayer wrote:
>> > On Tue, Feb 02, 2010 at 04:10:06PM -0200, Ramiro Polla wrote:
>> > > Hello Michael,
>> > >
>> > > On Sun, Jan 24, 2010 at 8:31 AM, Michael Niedermayer <michaelni at gmx.at> wrote:
>> > > > the gain happens when you change the variables used to calculate the index
>> > > > also to it. You could also try to make the index unsigned but make sure it
>> > > > cant be negative if you try this
>> > >
>> > > Sorry but I still don't understand how that will be of use here in
>> > > libswscale. I've tried forcing int32_t and int64_t for x86_64 in some
>> > > of those functions (some xxxTo(Y|UV), hScale and the fast bilinear
>> > > ones), in all C, MMX and MMX2. All I can see is the expansion from
>> > > 32-bit to 64-bit being changed from caller and callee. There is no
>> > > difference in the inner loop, nor in how gcc addresses the src and
>> > > dst arrays.
>> > maybe theres no gain for swscale, i cant say without looking at the asm
>> > gcc generates.
>> > i know that in h264 gcc filled some functions with 32->64 sign extension
>> > code in the inner loops.
>> Which compilation options have you been using?
> default of ffmpeg & gcc-4.4
> also a quick
> grep movslq libavcodec/h264_loopfilter.S | grep -v '('
>      68     136    1220
> and with -mtune=core2 -march=core2 -mcpu=core2
> grep movslq libavcodec/h264_loopfilter.S | grep -v '(' |wc
>      68     136    1226
> so no, its not helping it still does produce all the register-register
> sign extensions
Hmm, I think I understand now what you mean... This is what the asm of
some functions looks like when things get changed from long to int.
I'll put the sizes of some functions as in <name> <size with long>
<size with int> <int - long>, along with their differences (mostly
only prologues). All tested with gcc 4.4.1 from ubuntu 9.10:
nv12ToUV_MMX 77 87 10
BEToUV_MMX 84 90 6
and similar _MMX functions.
bgr24ToUV_half_3DNow 142 172 30
jle 1cd8a <bgr24ToUV_half_3DNow+0xaa>
jle 1cb1c <bgr24ToUV_half_3DNow+0x8c>
rgb32ToUV 139 143 4
all hyscale_fast functions have only one more movslq in the int version.
Then many have this difference where the int version uses sub and lea
while the long version uses either add %reg,%reg or shl $2, %reg.
abgrToA 37 45 8
BEToUV_C 51 51 0
nv12ToUV_C 48 48 0
rgb15ToUV 141 143 2
long uses rbx (as in it pushes and pops rbx) while the int version
doesn't, long accesses arrays with movzwl (%rdx,%rax,1),%r9d instead
of movzwl (%rdx),%ecx in the inner loop (I don't know what difference
this makes). long uses add %r8,%r8 instead of sub & lea.
rgb15ToUV_half 166 174 8
Very few functions are larger with long such as:
rgb15ToY 95 84 -11
int uses sub & lea instead of add. long uses more 64-bit registers so
the instructions are larger.
And on to the caller,
swScale_C 10319 10082 -237
long has 9 more movslq, uses more stack
I haven't checked all functions though.
The final size (with runtime cpudetect):
The number of movslq between registers:
$ objdump -d swscale_ints.o | grep movslq | grep -v "(" | wc -l
$ objdump -d swscale_longs.o | grep movslq | grep -v "(" | wc -l
No speed differences were ever noticed. Dark_Shikari tells me a movslq
between registers is 1uop...
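
To make the register-to-register movslq discussion concrete, here is a
minimal sketch of the kind of code involved. The two helpers below are
hypothetical and are not taken from libswscale. They differ only in the
type used for the index arithmetic: int in the first, a pointer-sized
type in the second (ptrdiff_t is used here as an assumption; on x86_64
Linux it has the same width as the long discussed above). Whether gcc
really emits a movslq for the int version depends on the compiler
version and options, so the comments describe what may happen, not what
must happen.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sums one column of an 8-bit plane.
 * The index y * stride + x is computed in 32-bit int, so on x86_64 it
 * has to be widened to 64 bits before it can be used in an address.
 * That widening is where a register-to-register movslq may show up. */
static unsigned sum_column_int(const uint8_t *src, int stride,
                               int height, int x)
{
    unsigned sum = 0;
    int y;
    for (y = 0; y < height; y++)
        sum += src[y * stride + x];
    return sum;
}

/* Same loop with pointer-sized index variables. The index is already
 * as wide as an address, so no widening is needed inside the loop.
 * Any widening of the arguments moves to the caller instead. */
static unsigned sum_column_wide(const uint8_t *src, ptrdiff_t stride,
                                ptrdiff_t height, ptrdiff_t x)
{
    unsigned sum = 0;
    ptrdiff_t y;
    for (y = 0; y < height; y++)
        sum += src[y * stride + x];
    return sum;
}
```

Both functions return the same result for the same input. Compiling
each with gcc -O2 -S and grepping the output for movslq, as in the
commands above, is one way to compare what the compiler generates.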
As for other architectures, the arm and ppc I have would have made no
difference since they're not 64-bit.
I've attached a patch which adds an array_index type, if that's what
you had in mind.
Otherwise I really don't know what to do. Long is being misused here,
and breaks compilation on mingw-w64.
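
For readers without the attachment, here is a rough sketch of what an
array_index type could look like. This is not the attached patch. The
typedef name comes from the paragraph above, but its definition, the
check and the copy_row helper are all assumptions made for
illustration. The background is that on 64-bit Windows long is 32 bits
wide while pointers are 64 bits wide, so long is not a portable
"pointer-sized integer".

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed definition: a signed integer type as wide as a pointer.
 * The real patch may define array_index differently. */
typedef ptrdiff_t array_index;

/* Compile-time check without C11: the array size is negative, and the
 * build fails, if array_index is not pointer-sized on this target. */
typedef char array_index_is_pointer_sized[
    sizeof(array_index) == sizeof(void *) ? 1 : -1];

/* Hypothetical helper showing the type used as a loop index. */
static void copy_row(uint8_t *dst, const uint8_t *src, array_index width)
{
    array_index i;
    for (i = 0; i < width; i++)
        dst[i] = src[i];
}
```

With a typedef like this the width of the index follows the target's
pointer size instead of whatever long happens to be on that platform.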
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 18604 bytes
Desc: not available