[Ffmpeg-devel] Native H.264 encoder (was: I'm giving up)
Panagiotis Issaris
takis.issaris
Mon Dec 11 01:20:05 CET 2006
Hi Michael,
On Sat, Dec 09, 2006 at 02:47:02AM +0100, Michael Niedermayer wrote:
>[...]
> > + c = pieces[2][0]-pieces[2][3];
> > + b = pieces[2][1]+pieces[2][2];
> > + d = pieces[2][1]-pieces[2][2];
> > + block[0][2] = a+b;
> > + block[2][2] = a-b;
> > + block[1][2] = (c<<1)+d;
> > + block[3][2] = c-(d<<1);
> > +
> > + a = pieces[3][0]+pieces[3][3];
> > + c = pieces[3][0]-pieces[3][3];
> > + b = pieces[3][1]+pieces[3][2];
> > + d = pieces[3][1]-pieces[3][2];
> > + block[0][3] = a+b;
> > + block[2][3] = a-b;
> > + block[1][3] = (c<<1)+d;
> > + block[3][3] = c-(d<<1);
> > +}
>
> i assume that a for loop would slow this down significantly? if so a macro would
> make that much smaller without speed loss ...
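For reference, the macro idea could look roughly like the sketch below. This is only an illustration: the names DCT4_1D, FIRST_PASS, SECOND_PASS, dct4x4_unrolled and dct4x4_loop are made up here and are not taken from the patch, and DCTELEM is assumed to be a 16-bit integer. The same macro body serves both an unrolled and a looped variant, so the two can be checked against each other.

```c
#include <stdint.h>

/* Assumption: 16-bit coefficients, like libavcodec's DCTELEM at the time. */
typedef int16_t DCTELEM;

/* Hypothetical helper: one 4-point pass, reading s0..s3 and writing d0..d3.
 * x*2 is used instead of x<<1 so that negative values are well defined in C. */
#define DCT4_1D(d0, d1, d2, d3, s0, s1, s2, s3) do { \
        const int a_ = (s0) + (s3);                  \
        const int c_ = (s0) - (s3);                  \
        const int b_ = (s1) + (s2);                  \
        const int d_ = (s1) - (s2);                  \
        (d0) = (DCTELEM)(a_ + b_);                   \
        (d2) = (DCTELEM)(a_ - b_);                   \
        (d1) = (DCTELEM)(c_ * 2 + d_);               \
        (d3) = (DCTELEM)(c_ - d_ * 2);               \
    } while (0)

/* First pass: block -> pieces, one call per index i. */
#define FIRST_PASS(i)                                                  \
    DCT4_1D(pieces[0][i], pieces[1][i], pieces[2][i], pieces[3][i],    \
            block[0][i],  block[1][i],  block[2][i],  block[3][i])

/* Second pass: pieces -> block, one call per index i. */
#define SECOND_PASS(i)                                                 \
    DCT4_1D(block[0][i],  block[1][i],  block[2][i],  block[3][i],     \
            pieces[i][0], pieces[i][1], pieces[i][2], pieces[i][3])

/* Unrolled variant: eight macro invocations instead of 64 hand-written lines. */
void dct4x4_unrolled(DCTELEM block[4][4])
{
    DCTELEM pieces[4][4];

    FIRST_PASS(0);  FIRST_PASS(1);  FIRST_PASS(2);  FIRST_PASS(3);
    SECOND_PASS(0); SECOND_PASS(1); SECOND_PASS(2); SECOND_PASS(3);
}

/* Looped variant, using the same macro body for comparison. */
void dct4x4_loop(DCTELEM block[4][4])
{
    DCTELEM pieces[4][4];
    int i;

    for (i = 0; i < 4; i++)
        FIRST_PASS(i);
    for (i = 0; i < 4; i++)
        SECOND_PASS(i);
}
```

The timings that follow are for the plain for-loop variant, not for this macro sketch.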
I tested the for-loop variant like this:
    START_TIMER
    DCTELEM pieces[4][4];
    DCTELEM a, b, c, d;
    int i;

    for (i=0; i<4; i++)
    {
        a = block[0][i]+block[3][i];
        c = block[0][i]-block[3][i];
        b = block[1][i]+block[2][i];
        d = block[1][i]-block[2][i];
        pieces[0][i] = a+b;
        pieces[2][i] = a-b;
        pieces[1][i] = (c<<1)+d;
        pieces[3][i] = c-(d<<1);
    }

    for (i=0; i<4; i++)
    {
        a = pieces[i][0]+pieces[i][3];
        c = pieces[i][0]-pieces[i][3];
        b = pieces[i][1]+pieces[i][2];
        d = pieces[i][1]-pieces[i][2];
        block[0][i] = a+b;
        block[2][i] = a-b;
        block[1][i] = (c<<1)+d;
        block[3][i] = c-(d<<1);
    }
    STOP_TIMER("DCTFOR")
Resulting in:
...
924 dezicycles in DCTFOR, 8387443 runs, 1165 skips
frame= 1989 q=-1.0 Lsize= 11046kB time=66.4 bitrate=1363.5kbits/s
video:11020kB audio:0kB global headers:0kB muxing overhead 0.233141%
When using the DCT without loops:
...
914 dezicycles in DCT, 8387499 runs, 1109 skips
frame= 1989 q=-1.0 Lsize= 11046kB time=66.4 bitrate=1363.5kbits/s
video:11020kB audio:0kB global headers:0kB muxing overhead 0.233141%
However, the results varied from run to run by more than the difference shown
above. I also got runs of 924, 944 and more decicycles for the DCT without the
loops. The same goes for the DCT with the for loops: the decicycles spent in the
DCT varied from 910 to 980. So it appears to me that adding the loop doesn't
hurt much. The tests above were run on an Athlon64 X2 3800+. I will run the same
tests tomorrow on a P4 to see whether it makes a considerable difference on that
machine.
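One way to make the "within the noise" argument explicit is to compare the spread of the repeated runs rather than single numbers. The following is a minimal sketch under that assumption; the Range type and the functions timing_range and ranges_overlap are hypothetical helpers written for this mail, not part of FFmpeg.

```c
#include <stddef.h>

/* Hypothetical helper type: smallest and largest result of repeated runs. */
typedef struct {
    int min;
    int max;
} Range;

/* Returns the spread of n timing results; n must be at least 1.
 * For n == 0 an empty {0, 0} range is returned. */
Range timing_range(const int *samples, size_t n)
{
    Range r = {0, 0};
    size_t i;

    if (n == 0)
        return r;

    r.min = r.max = samples[0];
    for (i = 1; i < n; i++) {
        if (samples[i] < r.min)
            r.min = samples[i];
        if (samples[i] > r.max)
            r.max = samples[i];
    }
    return r;
}

/* Returns 1 when the two spreads overlap, i.e. the measurements do not
 * separate the two variants, and 0 otherwise. */
int ranges_overlap(Range a, Range b)
{
    return a.min <= b.max && b.min <= a.max;
}
```

Fed with the figures quoted above (914, 924 and 944 for the version without loops; 910, 924 and 980 for the version with loops), the two ranges overlap, which matches the conclusion that these runs do not show a measurable penalty for the loop.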
With friendly regards,
Takis