[Ffmpeg-devel] Native H.264 encoder (was: I'm giving up)

Mon Dec 11 16:37:07 CET 2006

Hi

On Mon, 2006-12-11 at 15:29 +0100, Michael Niedermayer wrote:
> > On Mon, 2006-12-11 at 13:24 +0100, Panagiotis Issaris wrote:
> > > [...]
> > > I reran the tests on a Pentium 4 CPU 3.20GHz and on that machine it
> > > appears to make a consistent difference of about 200 clock cycles.
> > > 
> > > With the for loops:
> > > ...
> > > 1983 dezicycles in DCTFOR, 16775281 runs, 1935 skips
> > > frame=  101 q=-1.0 Lsize=    1652kB time=4.0 bitrate=3350.1kbits/s    
> > > video:1652kB audio:0kB global headers:0kB muxing overhead 0.000000%
> > > 
> > > Repeated runs gave: 1991, 1986, 1994, 1995, 1997, 2061
> > > 
> > > Without the for loops:
> > > ...
> > > 1809 dezicycles in DCT, 16776700 runs, 516 skips
> > > frame=  101 q=-1.0 Lsize=    1652kB time=4.0 bitrate=3350.1kbits/s    
> > > video:1652kB audio:0kB global headers:0kB muxing overhead 0.000000%
> > > 
> > > Repeated runs gave: 1806, 1790, 1805, 1814, 1826, 1835
> > > 
> > > So, on Athlon64 it appears to make no real difference, on P4 it does.
> > > I'll try and rewrite it a bit shorter using a macro.
> > Patch which uses two macros to shorten the DCT implementation attached.
> > 
> > Any preference towards names such as TEMP|INTERMEDIATE and FINAL instead
> > of PART1 and PART2?
> 
> i am fine with all of them ...
Okay.

> [...]
> 
> > +#define  H264_DCT_PART1(X) \
> > +         a = block[0][X]+block[3][X]; \
> > +         c = block[0][X]-block[3][X]; \
> > +         b = block[1][X]+block[2][X]; \
> > +         d = block[1][X]-block[2][X]; \
> > +         pieces[0][X] = a+b; \
> > +         pieces[2][X] = a-b; \
> > +         pieces[1][X] = (c<<1)+d; \
> > +         pieces[3][X] = c-(d<<1);
> > +
> > +#define  H264_DCT_PART2(X) \
> > +         a = pieces[X][0]+pieces[X][3]; \
> > +         c = pieces[X][0]-pieces[X][3]; \
> > +         b = pieces[X][1]+pieces[X][2]; \
> > +         d = pieces[X][1]-pieces[X][2]; \
> > +         block[0][X] = a+b; \
> > +         block[2][X] = a-b; \
> > +         block[1][X] = (c<<1)+d; \
> > +         block[3][X] = c-(d<<1);
> 
> actually the pieces array seems unneeded; block could be used if it's not
> slower ...
I'm not really sure, and the code was written a while ago, but at
first sight it appears that block[][] would be written to while its
values are still needed afterward. I'll have a closer look at whether
reordering the operations would help prevent this.
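
Reading the quoted macros, the hazard seems to sit in the second pass:
PART1 loads and stores the same column X, whereas PART2 loads row X of
pieces[][] but stores into column X of block[][], which would overwrite
rows that have not been processed yet. Below is a minimal sketch of an
in-place variant, not the code from the patch. It stores each
second-pass result back into the row it was loaded from, so its output
is the transpose of what the quoted macros produce. The function names,
the dct4_1d helper and the 16-bit typedef for DCTELEM are assumptions
made for illustration, and 2*c is written instead of c<<1 to avoid
shifting negative values.

```c
#include <stdint.h>

typedef int16_t DCTELEM; /* assumption: 16-bit coefficients */

/* One 4-point pass: all four inputs are read before any output is written. */
static void dct4_1d(const int in[4], int out[4])
{
    int a = in[0] + in[3];
    int c = in[0] - in[3];
    int b = in[1] + in[2];
    int d = in[1] - in[2];
    out[0] = a + b;
    out[2] = a - b;
    out[1] = 2 * c + d;   /* the patch writes (c<<1)+d */
    out[3] = c - 2 * d;   /* the patch writes c-(d<<1) */
}

/* Same data flow as the quoted macros: column pass into pieces[][],
 * then row X of pieces[][] is stored into column X of block[][]. */
static void dct4x4_with_pieces(DCTELEM block[4][4])
{
    DCTELEM pieces[4][4];
    int in[4], out[4], x, i;

    for (x = 0; x < 4; x++) {
        for (i = 0; i < 4; i++) in[i] = block[i][x];
        dct4_1d(in, out);
        for (i = 0; i < 4; i++) pieces[i][x] = out[i];
    }
    for (x = 0; x < 4; x++) {
        for (i = 0; i < 4; i++) in[i] = pieces[x][i];
        dct4_1d(in, out);
        for (i = 0; i < 4; i++) block[i][x] = out[i];
    }
}

/* Hypothetical in-place variant: each pass stores back to the positions
 * it loaded from, so no temporary matrix is needed. */
static void dct4x4_in_place(DCTELEM block[4][4])
{
    int in[4], out[4], x, i;

    for (x = 0; x < 4; x++) {          /* columns */
        for (i = 0; i < 4; i++) in[i] = block[i][x];
        dct4_1d(in, out);
        for (i = 0; i < 4; i++) block[i][x] = out[i];
    }
    for (x = 0; x < 4; x++) {          /* rows */
        for (i = 0; i < 4; i++) in[i] = block[x][i];
        dct4_1d(in, out);
        for (i = 0; i < 4; i++) block[x][i] = out[i];
    }
}
```

Whether the transposed layout is acceptable depends on how the
coefficients are consumed afterward, and whether it is any faster would
need a benchmark.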

> and what about int a,b,c,d instead of DCTELEM? (benchmark ...)
With int a,b,c,d:
2077 dezicycles in dctint, 16775342 runs, 1874 skips
frame=  101 q=-1.0 Lsize=    5730kB time=4.0 bitrate=11619.0kbits/s    
video:5730kB audio:0kB global headers:0kB muxing overhead 0.000000%
More runs: 2113, 2130

Compared to DCTELEM a,b,c,d:
1826 dezicycles in dctelem, 16776192 runs, 1024 skips
frame=  101 q=-1.0 Lsize=    5730kB time=4.0 bitrate=11619.0kbits/s    
video:5730kB audio:0kB global headers:0kB muxing overhead 0.000000%
More runs: 1823, 1815

With both int a,b,c,d and pieces as an int matrix:
1838 dezicycles in dctint, 16776610 runs, 606 skips
frame=  101 q=-1.0 Lsize=    5730kB time=4.0 bitrate=11619.0kbits/s    
video:5730kB audio:0kB global headers:0kB muxing overhead 0.000000%
More runs: 1813, 1837

So using int appears to bring no advantage: it is slower with only
a,b,c,d as int, and about the same when pieces is an int matrix too.
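
The variants differ in timing only; they should produce the same
coefficients. A minimal sketch to check that, assuming DCTELEM is a
16-bit type and the input residuals lie within [-255, 255]: each pass
grows a magnitude by at most a factor of 6 (from the 2, 1, -1, -2
weights), so results stay within 6*6*255 = 9180 and fit in 16 bits
either way. DEFINE_DCT4X4 and the two generated function names are
hypothetical, and 2*c stands in for the patch's c<<1.

```c
#include <stdint.h>

typedef int16_t DCTELEM; /* assumption: 16-bit coefficients */

/* Generates a 4x4 transform whose temporaries a, b, c, d have type T.
 * The data flow follows the quoted PART1/PART2 macros. */
#define DEFINE_DCT4X4(NAME, T)                          \
static void NAME(DCTELEM block[4][4])                   \
{                                                       \
    DCTELEM pieces[4][4];                               \
    T a, b, c, d;                                       \
    int x;                                              \
    for (x = 0; x < 4; x++) {                           \
        a = block[0][x] + block[3][x];                  \
        c = block[0][x] - block[3][x];                  \
        b = block[1][x] + block[2][x];                  \
        d = block[1][x] - block[2][x];                  \
        pieces[0][x] = a + b;                           \
        pieces[2][x] = a - b;                           \
        pieces[1][x] = 2 * c + d;                       \
        pieces[3][x] = c - 2 * d;                       \
    }                                                   \
    for (x = 0; x < 4; x++) {                           \
        a = pieces[x][0] + pieces[x][3];                \
        c = pieces[x][0] - pieces[x][3];                \
        b = pieces[x][1] + pieces[x][2];                \
        d = pieces[x][1] - pieces[x][2];                \
        block[0][x] = a + b;                            \
        block[2][x] = a - b;                            \
        block[1][x] = 2 * c + d;                        \
        block[3][x] = c - 2 * d;                        \
    }                                                   \
}

DEFINE_DCT4X4(dct4x4_int_tmp, int)       /* int a,b,c,d */
DEFINE_DCT4X4(dct4x4_elem_tmp, DCTELEM)  /* DCTELEM a,b,c,d */
```

Running both functions on the same block and comparing the outputs
element by element is enough to confirm the equivalence for a given
input; it says nothing about which one is faster.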

> and patch ok, feel free to commit
Thanks for reviewing!

With friendly regards,
Takis
-- 
vCard: http://www.issaris.org/pi.vcf
Public key: http://www.issaris.org/pi.key