[FFmpeg-user] should I shoot the dog?

Mark Filipak (ffmpeg) markfilipak at bog.us
Tue Sep 29 17:48:42 EEST 2020


On 09/29/2020 09:20 AM, Devin Heitmueller wrote:
> Hi Mark,

Hi Devin. Thanks much!

Your response came in while I was composing my previous message. I see (below) that performance is a 
major issue. That absolutely makes sense because, after accuracy, speed is the next most important 
objective (and for some use cases, may actually be more important).

I imagine that format-to-format conversion is probably the most optimized code in ffmpeg. Is there a 
function library dedicated solely to format conversion? I ask so that, in what I write, I can assure 
users that the issues are known and addressed.
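
Partly answering my own question from poking at the source tree: libswscale appears to be that 
dedicated library; sws_getContext() and sws_scale() do the pixel-format (and colorspace) 
conversion. Here's a minimal sketch of how I understand it is used -- my own illustration, not 
code from ffmpeg; convert_frame() is a name I made up, and 'out' is assumed to be pre-allocated:

    #include <libavutil/frame.h>
    #include <libswscale/swscale.h>

    /* Sketch: convert a decoded frame to 'out's format and size.  'out'
     * must already have format, width & height set and its buffers
     * allocated (e.g. with av_frame_get_buffer()). */
    static int convert_frame(const AVFrame *in, AVFrame *out)
    {
        struct SwsContext *ctx = sws_getContext(
            in->width,  in->height,  in->format,    /* source geometry & format */
            out->width, out->height, out->format,   /* destination format       */
            SWS_BILINEAR, NULL, NULL, NULL);        /* scaler, no extra filters */
        if (!ctx)
            return -1;
        sws_scale(ctx, (const uint8_t *const *)in->data, in->linesize,
                  0, in->height, out->data, out->linesize);
        sws_freeContext(ctx);
        return 0;
    }
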

For my modest purposes, a sketch of planar v. packed is probably all that's needed. I think you've 
made "planar" clear. Thank you for that. I can imagine that the structure of packed is 
multitudinous. Why is it called "packed"? How is it packed? Are the luma & chroma mixed in one 
buffer (analogous to blocks in macroblocks) or split into discrete buffers? How are they spatially 
structured? Are there any special substructures (analogous to macroblocks in slices)? Are the 
substructures, if any, format dependent?
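
To make the question concrete, here is my current mental model in C, using two 8-bit formats as 
examples (AV_PIX_FMT_YUYV422 for packed, AV_PIX_FMT_YUV420P for planar). This is my own 
illustration of the addressing, not ffmpeg code -- please correct me if I have it wrong:

    #include <libavutil/frame.h>

    /* Packed (AV_PIX_FMT_YUYV422): one buffer, samples interleaved as
     * Y0 U Y1 V for each horizontal pair of pixels. */
    static uint8_t luma_packed(const AVFrame *f, int x, int y)
    {
        return f->data[0][y * f->linesize[0] + 2 * x];
    }

    /* Planar (AV_PIX_FMT_YUV420P): three separate buffers; each chroma
     * plane is half the width and half the height of the luma plane. */
    static uint8_t luma_planar(const AVFrame *f, int x, int y)
    {
        return f->data[0][y * f->linesize[0] + x];
    }

    static uint8_t cb_planar(const AVFrame *f, int x, int y)
    {
        return f->data[1][(y / 2) * f->linesize[1] + x / 2];
    }
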

> So when you talk about the decoded frames, there is no concept of
> macroblocks.  There are simply video frames with Y, Cb, Cr samples.
> How those samples are organized and their sizes are determined by the
> AVFrame format.
> 
>> "Packed" and "planar", eh? What evidence do you have? ...Share the candy!
>>
>> Now, I'm not talking about streams. I'm talking about after decoding. I'm talking about the buffers.
>> I would think that a single, consistent format would be used.
> 
> When dealing with typical consumer MPEG-2 or H.264 video, the decoded
> frames will typically have what's referred to as "4:2:0 planar"
> format.  This means that the individual Y/Cb/Cr samples are not
> contiguous.  If you look at the underlying data that makes up the
> frame as an array, you will typically have W*H Y values, followed by
> W*H/4 Cb values, and then there will be W*H/4 Cr values.  Note that I
> say "values" and not "bytes", as the size of each value may vary
> depending on the pixel format.
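
If I follow, then for an 8-bit 720x480 frame the arithmetic is (my own figures; a real AVFrame 
plane may also carry padding per its linesize):

    int w = 720, h = 480;
    int y_size  = w * h;        /* 345600 Y values  */
    int cb_size = y_size / 4;   /*  86400 Cb values */
    int cr_size = y_size / 4;   /*  86400 Cr values */
    /* total: 518400 values, i.e. 1.5 values per pixel */
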
> 
> Unfortunately there is no "single, consistent format" because of the
> variety of different ways in which the video can be encoded, as well
> as performance concerns.  Normalizing it to a single format can have a
> huge performance cost, in particular if the original video is in a
> different colorspace (e.g. the video is YUV and you want RGB).
> Generally speaking you can set up the pipeline to always deliver you a
> single format, and ffmpeg will automatically perform any
> transformations necessary to achieve that (e.g. convert from packed to
> planar, RGB to YUV, 8-bit to 10-bit, 4:2:2 to 4:2:0, etc.).  However
> this can have a severe performance cost and can result in quality loss
> depending on the transforms required.
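
That matches what I'd want to document: the conversion cost is paid only when the decoder's 
format differs from what downstream asks for (on the command line, I believe that's what e.g. 
"-pix_fmt yuv420p" or "-vf format=yuv420p" requests). A sketch of the idea, reusing my 
convert_frame() from above -- consume() is just a stand-in for whatever downstream does:

    /* Sketch: pay the swscale cost only when formats actually differ. */
    if (frame->format == AV_PIX_FMT_YUV420P) {
        consume(frame);              /* zero-copy path              */
    } else {
        convert_frame(frame, tmp);   /* conversion path, costs time */
        consume(tmp);
    }
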
> 
> The codec will typically specify its output format, largely dependent
> on the nature of the encoding, and then announce AVFrames that conform
> to that format.  Since you're largely dealing with MPEG-2 and H.264
> video, it's almost always going to be YUV 4:2:0 planar.  The filter
> pipeline can then do conversion if needed, either because you told it
> to or because you specified a filter pipeline in which an
> individual filter didn't support the format it was being given.
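
Good to know. So after decoding, a program can simply ask the frame what it got -- a sketch, 
using av_get_pix_fmt_name() from libavutil/pixdesc.h:

    #include <stdio.h>
    #include <libavutil/pixdesc.h>

    /* Sketch: report the format the codec actually announced. */
    printf("decoder delivered %s, %dx%d\n",
           av_get_pix_fmt_name(frame->format),
           frame->width, frame->height);
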
> 
> See libavutil/pixfmt.h for a list of all the possible formats in which
> AVFrames can be announced by a codec.
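
Handy. And libavutil/pixdesc.h appears to let a program walk that whole list, which could help 
my write-up -- a sketch (my code, not ffmpeg's; formats not flagged planar are packed or 
hardware formats):

    #include <stdio.h>
    #include <libavutil/pixdesc.h>

    /* Sketch: enumerate every pixel format ffmpeg knows and say
     * whether it is planar. */
    int main(void)
    {
        const AVPixFmtDescriptor *d = NULL;
        while ((d = av_pix_fmt_desc_next(d)))
            printf("%-16s %s\n", d->name,
                   (d->flags & AV_PIX_FMT_FLAG_PLANAR) ? "planar" : "packed/other");
        return 0;
    }
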
> 
> Devin

-- 
The U.S. political problem? Amateurs are doing the street fighting.
The Princeps Senatus and the Tribunus Plebis need their own armies.

