[FFmpeg-devel] RFC: new packed pixel formats (machine vision)

Diederick C. Niehorster dcnieho at gmail.com
Fri Oct 25 12:35:53 EEST 2024


On Wed, Oct 23, 2024 at 2:04 AM martin schitter <ms+git at mur.at> wrote:
>
>
>
> On 22.10.24 22:33, Diederick C. Niehorster wrote:
> > I am writing about machine vision, not machine learning or computer
> > vision. So there are no uncommon small bit sizes, we're dealing with
> > 8bit, 10bit, 12bit components here.
>
> Sorry -- I'm such a sloppy reader/writer -- especially when I'm in a hurry.
>
> > Where possible, I already map to the matching ffmpeg format; the
> > problem I am running into is that there isn't one for some of the
> > common machine vision pixel formats.
> > While this can be fixed with an encoder, that would complicate their
> > use in ffmpeg. Having them instead as pixel formats supported by
> > swscale as inputs makes them far more generally useful, and enables
> > easily passing these formats to many of ffmpeg's encoders using an
> > auto-negotiated/inserted scale filter.
>
> I'm not a big fan of this auto-negotiated format handling, because the
> actual code which handles this task looks utterly unreadable to me --
> full of exceptions and complicated switches in code flow, which also
> hinder more efficient processing.
>
> But OK, it is a comfortable and simple solution for simple demands, and
> it may even help to reduce bugs which would otherwise appear more
> frequently when writing new, more application/format-specific handlers.
>
> > In the previous discussion, Lynne also indicated that the inclusion of
> > such formats is in scope for ffmpeg, as there are also cinema cameras
> > that produce some of them.
>
> Yes, he pointed to the already available Bayer format entries.
>
> They obviously work differently from other pixel format description
> entries. I still don't really grasp how exactly they work.
>
> CFA sensor data is usually structured as a one-channel pixel matrix,
> similar to a monochrome image. The colors are only calculated later, in
> the debayer process. At the beginning you have only this one-channel
> matrix of values plus an additional description of the CFA arrangement
> used -- i.e. the location of the differently colored sensels in
> relation to each other.
>
> The colored graphics in your linked documents are therefore a little
> bit misleading, because if you really differentiated the colored
> sensels already at this stage, you would have to describe different
> data patterns for odd and even lines in the case of a typical image
> sensor...
>
> >>>> Example formats are 10 and 12 bit Bayer formats, where the 10 bit
> >>>> ones cannot currently be represented in AVPixFmtDescriptors, as the
> >>>> effective bit depth for the red and blue channels would be 2.5
> >>>> bits, but component depths should be integers.
>
> At least in the case of all ordinary pixel arrangement description
> entries, the values are not just useful metadata for further
> calculations, but real descriptions of where to find the actual data --
> i.e. which bytes/bits to pick out of the raw data stream.
>
> >> As bits will always be distinct entities, you don't need more than
> >> simple natural numbers to describe their placement and amount precisely.
>
> > An AVPixFmtDescriptor encodes the effective number of bits. Here is
> > the descriptor for the 8-bit Bayer formats already included with ffmpeg:
> > #define BAYER8_DESC_COMMON \
> >          .nb_components= 3, \
> >          .log2_chroma_w= 0, \
> >          .log2_chroma_h= 0, \
> >          .comp = {          \
> >              /* plane, step, offset, shift, depth */ \
> >              { 0, 1, 0, 0, 2 }, /* R */ \
> >              { 0, 1, 0, 0, 4 }, /* G */ \
> >              { 0, 1, 0, 0, 2 }, /* B */ \
> >          }
> > Note that the green component is denoted as having 4 bits, and the red
> > and blue as 2 bits. That is because there is only one blue and one red
> > sample per 4 pixels, and one green sample per 2 pixels, leading
> > to _effective bit depths_ of 8/4=2 for red and blue, and 8/2=4 for
> > green.
>
> Its definition is so different from the more ordinary pixel
> descriptions that I hardly understand why they are mixed together at all.
>
> An additional list with more RAW/CFA-specific description fields would
> IMHO be a much more suitable solution.
>
> >> ffmpeg already supports the AV_PIX_FMT_FLAG_BITSTREAM flag to switch
> >> some description fields from byte to bit values. That's enough to
> >> describe the layout of most pixel formats -- even those packed ones
> >> which are not aligned to byte or 32-bit borders. You just have to use
> >> bit size values for the step and offset struct members.
> >
> > Lynne indicated that AV_PIX_FMT_FLAG_BITSTREAM is only for 8-bit and
> > 32-bit aligned formats. Here I'm dealing with unaligned formats.
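(For reference, the existing bitstream descriptors do use bit units.
Paraphrased from memory of pixdesc.c, with the same five-field component
layout as the Bayer macro quoted above: monob stores one 1-bit sample
per pixel, with step and offset counted in bits:

    [AV_PIX_FMT_MONOBLACK] = {
        .name          = "monob",
        .nb_components = 1,
        .comp = {
            /* plane, step, offset, shift, depth */
            { 0, 1, 0, 7, 1 },
        },
        .flags = AV_PIX_FMT_FLAG_BITSTREAM,
    },

but formats like this still land on a byte boundary at the end of each
pixel group.)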
>
> I'm sure that Lynne is more familiar with this code base and knows it
> much better than I do, but I would guess that this limitation is more
> likely caused by the automatic unpacking mechanism and not so much by
> the pixel format description.
>
> An interesting target for some code contribution, to make the raw data
> reading even more complex, unreadable and a little bit slower again. ;)
>
> > An option could be to relax the restriction that
> > AV_PIX_FMT_FLAG_BITSTREAM formats need to be 8-bit or 32-bit aligned,
> > but that would be a backwards-incompatible change with significant
> > repercussions not only for the ffmpeg codebase, but also for user
> > code. It is better to have a new flag for the new situation.
>
> I don't know if this would cause that much trouble.
>
> > I think these are less common, the one exception being some GigE
>
> >> For the simple case of just separated MSb and LSb locations within an
> >> otherwise simply repeating group of pixel bits, it could be solved by
> >> extending the description in a similar way as the RGBALayout
> >> description sequence of MXF -- see G.2.40/p174 of
> >> https://pub.smpte.org/latest/st377-1/st377-1-2019.pdf
> >
> > This looks like a very flexible spec. It would however also require
> > totally overhauling/replacing AVPixFmtDescriptors, which is a no-go.
>
> I definitely do not want to suggest rewriting vital parts of ffmpeg in
> this manner, but it's important to also keep these more modern
> approaches in mind.
>
> Most of the more recently specified description schemes for wide ranges
> of uncompressed video image data use this kind of more complex,
> variable-length sequence of component description lists, instead of
> just the very simple traditional four-channel schema.
>
> >> I think swscale and the internal processing of ffmpeg should not
> >> support an endless number of arbitrary pixel formats, but be focused
> >> on a really useful minimal set of required base formats.
> >
> > As argued above, having native support for common pixel formats (of
> > which there are many) makes ffmpeg versatile, and enables most of
> > ffmpeg's functionality to be used with most of these pixel formats.
> > Having only a small set complicates the use of all the other formats.
>
> There are good arguments for both variants.
>
> I can only tell you how I think about this topic -- but I may be wrong!
>
> >> But in general, you would better describe byte/32-bit-aligned
> >> bitpacked formats by using explicit "fill" (X, etc.)
> >> pseudo-components; then you can simply indicate aligned and unaligned
> >> groups by the actual sum of defined bits, resp. the remainder of a
> >> division by the alignment bit size.
> >
> > I assume that with fill/X you mean padding, like some of the formats
> > in ffmpeg have.
>
> Isn't padding always used just at one end, touching the boundaries,
> while fill may be specified multiple times, anywhere?
>
> > That would not work here, as that would change the definition of a
> > component. gray10p (as I called it) only has one component, but in
> > this scheme it would have five pseudo-components (so five color
> > channels that would then have to be interleaved into one?), which 1)
> > isn't what components mean in an AVPixFmtDescriptor and 2) we can
> > only have up to 4.
>
> Yes -- this four-channel schema is indeed very limiting!
>
> I don't have a better solution, but additional, more specialized
> description lists for groups of similar structure (like RAW CFA data)
> could perhaps help. And I really think that more specific processing
> for entries described by those additional lists would also help to
> reduce the complexity of the affected code infrastructure and make the
> separate modules that are actually used more efficient.

Thanks for the additional discussion, Martin! The graphics on the
linked page do make sense: they show the pixel arrangement for a
specific color filter arrangement, as an example. And they do show
that different lines have different colors.
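To put that in code: assuming the common RGGB arrangement purely for
illustration, the color carried by a sensel follows from its
coordinates alone, while the sample values themselves form the single
one-channel matrix you describe:

  enum cfa_color { R, G, B };
  /* Even lines alternate R,G; odd lines alternate G,B. */
  static const enum cfa_color rggb[2][2] = { { R, G }, { G, B } };
  /* The sensel at (x, y) carries color rggb[y & 1][x & 1]. */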

The AVPixFmtDescriptor for Bayer formats makes sense to me, but you are
indeed right that the depth field is now purely informative, and that a
special case for Bayer is needed, as the different color components are
packed into different lines of the image.
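Spelled out for the 10-bit case, this is exactly what blocks adding
these formats today:

  green:      10 bits * 2 samples / 4 pixels = 5   bits effective
  red, blue:  10 bits * 1 sample  / 4 pixels = 2.5 bits effective

and 2.5 does not fit the integer depth field of AVComponentDescriptor.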

I think auto format negotiation, complex as it is, is great to
support. Should it go wrong, the user can always manually specify the
(correct) intermediate format.
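For example, something like the following would bypass a wrong
negotiation result (input and output file names are placeholders; the
format filter is what forces the intermediate):

  ffmpeg -i capture.nut -vf format=gray16le -c:v ffv1 out.mkv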

Adding a new flag for this non-aligned case avoids your worry about
more complex and slower code. All current code would be untouched;
there would only be a new branch in format conversion, plus the utility
functions I am proposing for reading/writing image lines of unaligned
bitpacked formats.
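To give an idea of the shape of such a utility function -- a minimal
sketch only, with placeholder names, and assuming MSB-first packing for
illustration (actual machine vision formats may pack LSB-first):

  #include <stdint.h>

  /* Unpack one line of tightly packed 10-bit gray samples into 16-bit
   * values, with no byte or 32-bit alignment assumed. */
  static void unpack_gray10p_line(const uint8_t *src, uint16_t *dst,
                                  int width)
  {
      uint32_t acc  = 0; /* bit accumulator */
      int      bits = 0; /* number of valid bits in acc */

      for (int x = 0; x < width; x++) {
          while (bits < 10) {      /* refill until a full sample is in */
              acc  = (acc << 8) | *src++;
              bits += 8;
          }
          bits  -= 10;             /* consume the oldest 10 bits */
          dst[x] = (acc >> bits) & 0x3FF;
      }
  }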

@Lynne, if you have time, what do you think of my proposal? I hope you
or others are able to give feedback at this design stage, before I
spend a lot of time implementing.

Cheers,
Dee

