[Ffmpeg-devel] Native H.264 encoder (was: I'm giving up)
Panagiotis Issaris
takis.issaris
Mon Dec 11 16:37:07 CET 2006
Hi
On Mon, 2006-12-11 at 15:29 +0100, Michael Niedermayer wrote:
> > On Mon, 2006-12-11 at 13:24 +0100, Panagiotis Issaris wrote:
> > > [...]
> > > I reran the tests on a Pentium 4 CPU 3.20GHz and on that machine it
> > > appears to make a consistent difference of about 200 clock cycles.
> > >
> > > With the for loops:
> > > ...
> > > 1983 dezicycles in DCTFOR, 16775281 runs, 1935 skips
> > > frame= 101 q=-1.0 Lsize= 1652kB time=4.0 bitrate=3350.1kbits/s
> > > video:1652kB audio:0kB global headers:0kB muxing overhead 0.000000%
> > >
> > > Repeated runs gave: 1991, 1986, 1994, 1995, 1997, 2061
> > >
> > > Without the for loops:
> > > ...
> > > 1809 dezicycles in DCT, 16776700 runs, 516 skips
> > > frame= 101 q=-1.0 Lsize= 1652kB time=4.0 bitrate=3350.1kbits/s
> > > video:1652kB audio:0kB global headers:0kB muxing overhead 0.000000%
> > >
> > > Repeated runs gave: 1806, 1790, 1805, 1814, 1826, 1835
> > >
> > > So, on Athlon64 it appears to make no real difference, on P4 it does.
> > > I'll try and rewrite it a bit shorter using a macro.
> > Patch which uses two macros to shorten the DCT implementation attached.
> >
> > Any preference towards names such as TEMP|INTERMEDIATE and FINAL instead
> > of PART1 and PART2?
>
> i am fine with all of them ...
Okay.
> [...]
>
> > +#define H264_DCT_PART1(X) \
> > + a = block[0][X]+block[3][X]; \
> > + c = block[0][X]-block[3][X]; \
> > + b = block[1][X]+block[2][X]; \
> > + d = block[1][X]-block[2][X]; \
> > + pieces[0][X] = a+b; \
> > + pieces[2][X] = a-b; \
> > + pieces[1][X] = (c<<1)+d; \
> > + pieces[3][X] = c-(d<<1);
> > +
> > +#define H264_DCT_PART2(X) \
> > + a = pieces[X][0]+pieces[X][3]; \
> > + c = pieces[X][0]-pieces[X][3]; \
> > + b = pieces[X][1]+pieces[X][2]; \
> > + d = pieces[X][1]-pieces[X][2]; \
> > + block[0][X] = a+b; \
> > + block[2][X] = a-b; \
> > + block[1][X] = (c<<1)+d; \
> > + block[3][X] = c-(d<<1);
>
> actually the pieces array seems unneeded block could be used if its not
> slower ...
I'm not really sure, as the code was written a while ago, but at first
sight it appears that block[][] entries would be written to while they are
still needed afterward. I'll have a closer look at whether reordering the
operations would help prevent this.
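A minimal sketch of the aliasing question, not the code from the patch. It assumes DCTELEM is a 16-bit integer; the names BUTTERFLY, dct_with_pieces, dct_in_place and in_place_is_transpose are hypothetical. Because the quoted PART2 reads row X of pieces but writes column X of block, dropping pieces as-is would overwrite entries that later rows still read. Writing each row back to the same row avoids that, at the cost of producing the transposed layout:

```c
#include <stdint.h>

typedef int16_t DCTELEM; /* assumption: 16-bit coefficient type */

/* One 4-point butterfly with the same arithmetic as the quoted macros.
 * All four inputs are read into locals before any output is written,
 * so inputs and outputs may alias. 2*c replaces c<<1 so that negative
 * values stay well defined. */
#define BUTTERFLY(p0, p1, p2, p3, o0, o1, o2, o3) do { \
    int a = (p0) + (p3), c = (p0) - (p3);              \
    int b = (p1) + (p2), d = (p1) - (p2);              \
    (o0) = a + b;     (o2) = a - b;                    \
    (o1) = 2 * c + d; (o3) = c - 2 * d;                \
} while (0)

/* Two-buffer variant, following the quoted PART1/PART2 macros. */
static void dct_with_pieces(DCTELEM block[4][4])
{
    DCTELEM pieces[4][4];
    int x;
    for (x = 0; x < 4; x++) /* columns of block -> columns of pieces */
        BUTTERFLY(block[0][x], block[1][x], block[2][x], block[3][x],
                  pieces[0][x], pieces[1][x], pieces[2][x], pieces[3][x]);
    for (x = 0; x < 4; x++) /* rows of pieces -> columns of block */
        BUTTERFLY(pieces[x][0], pieces[x][1], pieces[x][2], pieces[x][3],
                  block[0][x], block[1][x], block[2][x], block[3][x]);
}

/* In-place variant: each pass writes back to the entries it just read. */
static void dct_in_place(DCTELEM block[4][4])
{
    int x;
    for (x = 0; x < 4; x++) /* column pass */
        BUTTERFLY(block[0][x], block[1][x], block[2][x], block[3][x],
                  block[0][x], block[1][x], block[2][x], block[3][x]);
    for (x = 0; x < 4; x++) /* row pass, written back to the same row */
        BUTTERFLY(block[x][0], block[x][1], block[x][2], block[x][3],
                  block[x][0], block[x][1], block[x][2], block[x][3]);
}

/* Returns 1 if the in-place output is the transpose of the two-buffer one. */
static int in_place_is_transpose(void)
{
    DCTELEM ref[4][4], inp[4][4];
    int i, j;
    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++)
            ref[i][j] = inp[i][j] = (DCTELEM)(i * 7 - j * 13 + (i ^ j) * 5 - 20);
    dct_with_pieces(ref);
    dct_in_place(inp);
    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++)
            if (inp[i][j] != ref[j][i])
                return 0;
    return 1;
}
```

Whether the transposed layout is acceptable depends on what the surrounding encoder code expects, so this only illustrates the dependency and is not a drop-in replacement.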
> and what about int a,b,c,d instead of DCTELEM? (benchmark ...)
With int a,b,c,d:
2077 dezicycles in dctint, 16775342 runs, 1874 skips
frame= 101 q=-1.0 Lsize= 5730kB time=4.0 bitrate=11619.0kbits/s
video:5730kB audio:0kB global headers:0kB muxing overhead 0.000000%
More runs: 2113, 2130
Compared to DCTELEM a,b,c,d:
1826 dezicycles in dctelem, 16776192 runs, 1024 skips
frame= 101 q=-1.0 Lsize= 5730kB time=4.0 bitrate=11619.0kbits/s
video:5730kB audio:0kB global headers:0kB muxing overhead 0.000000%
More runs: 1823, 1815
Compared to both int a,b,c,d and pieces being an int matrix:
1838 dezicycles in dctint, 16776610 runs, 606 skips
frame= 101 q=-1.0 Lsize= 5730kB time=4.0 bitrate=11619.0kbits/s
video:5730kB audio:0kB global headers:0kB muxing overhead 0.000000%
More runs: 1813, 1837
So using int instead of DCTELEM appears to bring no advantage.
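To make the comparison easier to read, here is a small sketch that averages the three runs reported above for each variant. It assumes "dezicycles" means tenths of a CPU cycle; the array and function names are hypothetical. Under that assumption the averages come out at roughly 211 cycles for int temporaries, 182 for DCTELEM, and 183 for int temporaries with an int pieces matrix.

```c
#include <stddef.h>

/* Values reported in this mail, in dezicycles (assumed: 1/10 of a cycle). */
static const int runs_int_temps[]  = { 2077, 2113, 2130 }; /* int a,b,c,d */
static const int runs_dctelem[]    = { 1826, 1823, 1815 }; /* DCTELEM a,b,c,d */
static const int runs_int_pieces[] = { 1838, 1813, 1837 }; /* int a,b,c,d + int pieces */

/* Sum of n reported values, in dezicycles. */
static long sum_runs(const int *runs, size_t n)
{
    long sum = 0;
    size_t i;
    for (i = 0; i < n; i++)
        sum += runs[i];
    return sum;
}

/* Mean over n runs, converted from dezicycles to cycles. */
static double mean_cycles(const int *runs, size_t n)
{
    return (double)sum_runs(runs, n) / (10.0 * (double)n);
}
```

With only three runs per variant, the gap of less than one cycle between the last two is well within the run-to-run spread.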
> and patch ok, feel free to commit
Thanks for reviewing!
With friendly regards,
Takis
--
vCard: http://www.issaris.org/pi.vcf
Public key: http://www.issaris.org/pi.key