[FFmpeg-devel] [PATCH 10/39] lavc/ffv1dec: move the bitreader to stack
Paul B Mahol
onemda at gmail.com
Thu Jul 18 18:35:14 EEST 2024
On Thu, Jul 18, 2024 at 5:31 PM Anton Khirnov <anton at khirnov.net> wrote:
> Quoting Michael Niedermayer (2024-07-18 16:48:06)
> > On Thu, Jul 18, 2024 at 11:08:59AM +0200, Anton Khirnov wrote:
> > > Quoting Michael Niedermayer (2024-07-18 00:42:05)
> > > > all this stuff should be put close together so it uses
> > > > CPU caches efficiently
> > >
> > > Which is why it shares its cacheline with PutBitContext, because
> > > the code benefits from having both in the cache, right? And the
> > > 4-byte hole in PutBitContext is there presumably to aerate the
> > > cache for smoother data streaming.
> >
> > thanks for spotting these, can you fix them?
>
> I have no interest in optimizing the performance of this code. My
> primary goal here is to remove FFV1-specific hacks from the frame
> threading code for patch 33/39, which is in turn needed for 38/39.
>
> As a public service, I also spent some effort on making the ffv1 code
> easier to understand, but if you insist on keeping the code as it is I
> can also just drop its non-compliant frame threading implementation.
>
> > >
> > > More seriously, this is not how caches work. Being close together
> > > matters mainly so long as your data fits in a cacheline, beyond that
> > > physical proximity matters little. On stack, the bitreader is likely to
> > > share the cacheline with other data that is currently needed, thus
> > > improving cache utilization.
> >
> > Caches are complex, and being close together does matter.
> > Having things in separate allocations risks hitting aliasing cases
> > (i.e. things that cannot be in the cache at the same time), so when
> > you already have the bitstream, the frame buffer, and the context in
> > 3 independent locations, adding a few more increases the risk of
> > hitting these.
> > Also, sequential memory access is faster than non-sequential, so it
> > makes sense to put things together in a few places rather than to
> > scatter them.
> >
> > It's been years since I've done hardcore optimization work, but I
> > don't think the principles have changed so much that random access
> > is now faster than sequential, or that caches work fundamentally
> > differently.
>
> I don't see how any of these arguments are relevant - I am not moving
> the bitreader to a new allocation, but to stack, which is already highly
> likely to be in cache.
>
> > >
> > > Another factor that matters in efficient cache use is e.g. not
> > > having multiple copies of the same constant data scattered around,
> > > which you're objecting to in my other patches.
> >
> > Copying the actually used small data together per slice, where it's
> > accessed per pixel, should improve the speed per pixel while making
> > the per-slice code a little slower. We have maybe 4 slices and
> > millions of pixels; that's why this can give an overall gain.
>
> This all sounds like premature optimization, AKA the root of all evil.
> As I said above, I intended to make this code more readable, not
> faster. Yet somehow it became faster anyway, which suggests this code
> is not very optimized. So arguing over whether this or that specific
> change adds or removes a few cycles per frame seems like a waste of
> time to me.
>
Will merge this into Librempeg if it does not get into FFmpeg.
>
> --
> Anton Khirnov