[FFmpeg-devel] hardware aided video decoding

Loren Merritt lorenm
Fri Jul 6 17:43:32 CEST 2007

On Fri, Jul 6, 2007, Attila Kinali wrote:
> The system should mostly work with MPEG-1/2/4, H.264 and VC-1,
> but should be generic enough to be able to support future
> codec systems.

My answers to all of the following are in the context of supporting 
mpeg-like codecs, including mpeg1/2/4, h264, vc1, and possibly others in 
the future. These are not the most efficient ways to implement just the 
specific functions we use now.
However, I'm not sure how useful it is to try to support "future codecs". 
A single unsupported feature in a given codec is enough to make all of the 
other accelerated algorithms unusable for it, unless you have a CPU on 
board that can take up the slack.

> On Fri, 6 Jul 2007, Michael Niedermayer wrote:
>> for mpeg1/2/4 bitstream parsing (vlc decoding and related stuff) takes
>> 1/3 of the cpu time last time i checked so gains with doing that on
>> the CPU and transferring to the card would be limited, also h.264
>> has significantly more complex bitstream parsing so i would guess
>> the gains are even smaller but instead of guessing i would suggest
>> that you try a profiler to get some exact answers (dont forget to
>> disable all inlining ...)

The rest of h264 got equivalently more complicated, so bitstream parsing 
is still about 1/3.

>> now if we look at just mpeg1/2/4 and the case that you dont want
>> to implement the whole decoder on the card ...
>> then the most obvious things to do are:
>> do the RLE + zigzag/alt scan decoding of coeffs and the IDCT on the card
>> if you do just the IDCT on the card then you have to transfer 3+ times
>> the data from the cpu to the card as IDCT coeffs are 16bit and there
>> are as many as pixels, if you do the RLE & zigzag stuff on the card too
>> then there would be significantly less data be transmitted as 95% or
>> so of the coeffs are 0 and as the coeffs are stored as vlc coded
>> zero run + sign + level + last_bit in the bitstream
> Why would you start at RLE and zigzag?

You can think of RLE/zigzag as just an easy way to compress the dct 
coefficients going over the bus. It happens to be similar to the way 
coefficients are stored in the bitstream too, but even if that weren't 
the case it still might be worthwhile.
Saving bus bandwidth is also the only reason you would want to implement 
h264 idct. idct only takes 2% of the decode time, so it's not like you'd 
be offloading much from the cpu.
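
To make the bandwidth argument concrete, here is a minimal sketch of 
zigzag + run-length coding of one 8x8 coefficient block. The (run, level) 
pair layout and the 3-bytes-per-pair packing are assumptions made for 
illustration, not a proposed bus format, and the function names are 
made up.

```python
def zigzag_order(n=8):
    """(row, col) positions of an n x n block in classic zigzag scan order."""
    def key(rc):
        r, c = rc
        s = r + c
        # Odd diagonals run top-right to bottom-left, even ones the reverse.
        return (s, r if s % 2 else -r)
    return sorted(((r, c) for r in range(n) for c in range(n)), key=key)

def rle_encode(block):
    """Return (zero_run, level) pairs; trailing zeros are simply dropped."""
    pairs, run = [], 0
    for r, c in zigzag_order(len(block)):
        v = block[r][c]
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    return pairs

def rle_decode(pairs, n=8):
    """Rebuild the dense n x n block from (zero_run, level) pairs."""
    block = [[0] * n for _ in range(n)]
    order = zigzag_order(n)
    pos = 0
    for run, level in pairs:
        pos += run
        r, c = order[pos]
        block[r][c] = level
        pos += 1
    return block

# A typical sparse block: a DC value plus a few low-frequency coefficients.
block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[1][0], block[2][1] = 52, -3, 2, 1

pairs = rle_encode(block)      # [(0, 52), (0, -3), (0, 2), (5, 1)]
raw_bytes = 64 * 2             # 64 coefficients at 16 bits each
rle_bytes = len(pairs) * 3     # assumed packing: 1 byte run + 2 bytes level
```

With this block the dense form is 128 bytes and the pair form is 12, 
which is the kind of saving the RLE/zigzag stage is there to buy.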

>> the next obvious step is to the motion compensation on the card too
>> for mpeg1/2 and simple profile mpeg4 this should be easy
>> mpeg4 ASP adds gmc/qpel which is much more complex
>> note! if you do not do MC on the card and the result of IDCT +
>> user provided MC frame ends in video memory then the CPU doing MC
>> of the next frame has to read from video mem somehow
> That's why i would rather start at MC than at RLE
>> now h.264 does not contain anything shareable with mpeg1/2/4
>> both idct and MC is different
> How much different are they? Can it be abstracted enough
> so that a common iDCT and MC could be used for both?

idct can be abstracted, but I'm not sure how much die space that saves.
Real (mpeg) idct, h264, and (I think) vc1 idct can use the same butterfly 
structure. They differ in the rotations: Real idct multiplies by big 
constants, vc1 multiplies by small constants, and h264 just bit shifts.
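
A rough sketch of what "same butterfly, different rotations" means, for 
the 4-point case only. The names butterfly4, rotate_h264 and rotate_real 
are made up for this example, and the real-valued version leaves out the 
overall normalization factor. A vc1-style rotation with small integer 
multipliers would slot into the same place.

```python
import math

def butterfly4(d, rotate, even_scale=1):
    """4-point inverse transform: fixed butterfly, pluggable odd-part rotation."""
    e0 = (d[0] + d[2]) * even_scale   # even part, shared structure
    e1 = (d[0] - d[2]) * even_scale
    e2, e3 = rotate(d[1], d[3])       # odd part, codec-specific
    return [e0 + e3, e1 + e2, e1 - e2, e0 - e3]

def rotate_h264(d1, d3):
    """h264 4x4 core transform: the rotation is just shifts and adds."""
    return (d1 >> 1) - d3, d1 + (d3 >> 1)

C = math.cos(math.pi / 8)
S = math.sin(math.pi / 8)

def rotate_real(d1, d3):
    """Real idct: the rotation multiplies by the actual cosine constants."""
    return S * d1 - C * d3, C * d1 + S * d3

# Same butterfly, two codecs.
h264_out = butterfly4([0, 2, 0, 0], rotate_h264)   # [2, 1, -1, -2]
real_out = butterfly4([10, 4, -2, 1], rotate_real, math.cos(math.pi / 4))
```

In hardware terms only the rotate stage (and the even-part scale) would 
differ between codecs; the adders of the butterfly are common.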

common mc:
The primitive operation of mc is a fir filter. Implement a 2/4/6/8-tap 
fir filter (applying to a block of pixels) with programmable coefficients 
and rounding modes, and allow the firs to be chained in arbitrary ways.
A generic fir filter could be used for wavelets too.
mpeg4 qpel also has some weirdness whereby it mirrors the block edges 
before sending them into the 8-tap.
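
As a sketch of the programmable fir idea: one routine parameterized by 
taps, shift and rounding, then chained for a 2D case. The function names 
are hypothetical, and edge handling (padding, or the qpel mirroring) is 
left to the caller here. The 6-tap coefficients are the h264 half-pel 
ones; the 2-tap case is plain bilinear averaging.

```python
def fir_row(pixels, taps, shift, rounding, clip=True):
    """Apply an n-tap fir along a row; emits len(pixels) - n + 1 samples."""
    n = len(taps)
    out = []
    for i in range(len(pixels) - n + 1):
        acc = sum(t * p for t, p in zip(taps, pixels[i:i + n]))
        v = (acc + rounding) >> shift
        out.append(min(255, max(0, v)) if clip else v)
    return out

BILINEAR = (1, 1)
H264_HALFPEL = (1, -5, 20, 20, -5, 1)

def halfpel_center(block, taps=H264_HALFPEL):
    """Chain two firs: unrounded horizontal pass, then vertical with final rounding."""
    rows = [fir_row(r, taps, 0, 0, clip=False) for r in block]
    cols = [fir_row(list(c), taps, 10, 512) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

# Same routine, different programming:
edge = fir_row([0, 0, 0, 100, 100, 100], H264_HALFPEL, 5, 16)   # [50]
avg_round_up = fir_row([0, 101], BILINEAR, 1, 1)                # [51]
avg_round_down = fir_row([0, 101], BILINEAR, 1, 0)              # [50]
center = halfpel_center([[10] * 6 for _ in range(6)])           # [[10]]
```

The two bilinear calls differ only in the rounding term, which is why 
the rounding mode needs to be programmable rather than fixed.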

>> also for h.264 doing just IDCT is likely not going to work, that is
>> having intra prediction done on the cpu which needs to read from the
>> previous 4x4 IDCT result is just going to be a nightmare
> Do i understand you correctly, that the IDCT results depend on
> the results of the previous block?
> (ok, i have to read some h.264 docu)

Decoding an h264 intra block in a software codec:
1. idct the residual of this block.
2. Predict the pixels of this block, using the decoded pixels of the 
   neighboring blocks (all neighbors: left, top-left, top, top-right), 
   using 1 of 22 prediction modes.
3. Add residual to prediction.
4. Use these newly decoded samples to predict the next block...
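
The serial dependency in the steps above can be sketched as follows. 
This is a toy, not a decoder: it assumes the residuals have already been 
through the idct, uses only DC prediction (one of the modes) from the top 
and left neighbors, and the function names are made up.

```python
def predict_dc(frame, bx, by, size=4):
    """DC prediction from already-decoded top/left neighbors; 128 if none exist."""
    top = [frame[by - 1][bx + i] for i in range(size)] if by > 0 else []
    left = [frame[by + i][bx - 1] for i in range(size)] if bx > 0 else []
    n = len(top) + len(left)
    if n == 0:
        return 128
    return (sum(top) + sum(left) + n // 2) // n

def decode_intra(residuals, width, height, size=4):
    """Decode blocks in raster order; each block needs its neighbors' final pixels."""
    frame = [[0] * width for _ in range(height)]
    for by in range(0, height, size):
        for bx in range(0, width, size):
            pred = predict_dc(frame, bx, by, size)   # reads decoded neighbors
            res = residuals[(bx, by)]                # idct output for this block
            for y in range(size):
                for x in range(size):
                    frame[by + y][bx + x] = min(255, max(0, pred + res[y][x]))
    return frame

# Two 4x4 blocks side by side: the second one's prediction depends on the
# first one's reconstructed pixels, not on its prediction alone.
residuals = {
    (0, 0): [[10] * 4 for _ in range(4)],
    (4, 0): [[0] * 4 for _ in range(4)],
}
frame = decode_intra(residuals, 8, 4)
```

Here the first block reconstructs to 138, and the second block, with a 
zero residual, also comes out as 138 because it predicts from the first. 
The prediction for block N cannot start until the residual of block N-1 
has been added.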

If you want to do the prediction in hardware without the idct, that's 

--Loren Merritt
