[FFmpeg-devel] gcc: Remove auto-vectorization limitation.

Wed May 21 21:12:58 EEST 2025

> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of Martin
> Storsjö
> Sent: Wednesday, May 21, 2025 14:22
> To: FFmpeg development discussions and patches <ffmpeg-devel at ffmpeg.org>
> Subject: Re: [FFmpeg-devel] gcc: Remove auto-vectorization limitation.
> 
> On Wed, 21 May 2025, Andreas Rheinhardt wrote:
> 
> > Martin Storsjö:
> >> On Wed, 21 May 2025, Andreas Rheinhardt wrote:
> >>
> >>> Jiawei:
> >>>> This patch modifies the FFmpeg build system to remove the explicit
> >>>> disabling
> >>>> of GCC's auto-vectorization feature.
> >>>>
> >>>> Modern GCC versions (>= 10.0) have demonstrated stable auto-
> >>>> vectorization
> >>>> capabilities through extensive optimizations in loop analysis and SIMD
> >>>> code generation. The explicit -fno-tree-vectorize flag originally added
> >>>> in commit 973859f (2009) to work around early GCC vectorization
> >>>> instability
> >>>> is no longer necessary.
> >>>>
> >>>> Key improvements justifying this change:
> >>>> 1. Enhanced heuristics for loop vectorization cost models
> >>>> 2. Mature handling of alignment and memory access patterns
> >>>> 3. Robust fallback mechanisms for unsupported architectures
> >>>>
> >>>> This change allows FFmpeg to benefit from automated SIMD optimizations
> >>>> when built with -O3 optimization level, particularly improving
> >>>> performance on x86_64 (AVX), ARM64 (SVE) and RISC-V (RVV) architectures.
> >>>>
> >>>> [1] https://git.ffmpeg.org/gitweb/ffmpeg.git/commit/973859f5230e77beea7bb59dc081870689d6d191
> >>>>
> >>>> ---
> >>>>  configure | 1 -
> >>>>  1 file changed, 1 deletion(-)
> >>>>
> >>>> diff --git a/configure b/configure
> >>>> index 3730b0524c..b9e95ce4ec 100755
> >>>> --- a/configure
> >>>> +++ b/configure
> >>>> @@ -7656,7 +7656,6 @@ if enabled icc; then
> >>>>              disable aligned_stack
> >>>>      fi
> >>>>  elif enabled gcc; then
> >>>> -    check_optflags -fno-tree-vectorize
> >>>>      check_cflags -Werror=format-security
> >>>>      check_cflags -Werror=implicit-function-declaration
> >>>>      check_cflags -Werror=missing-prototypes
> >>>
> >>> FYI: The last discussion about auto-vectorization is here:
> >>> https://ffmpeg.org/pipermail/ffmpeg-devel/2022-July/299405.html
> >>> It contains a report about a failing build with vectorization enabled:
> >>> https://ffmpeg.org/pipermail/ffmpeg-devel/2022-July/299421.html
> >>> I don't know whether this is still reproducible with the latest GCC.
> >>
> >> The issue which was reported last time, when compiling for i686 mingw32
> >> with --cpu=haswell, seems to have gone away in
> >> 182663a58a7a099e02e76da3b0f96d63e5c26a6d, where we made the whole
> >> problematic x86 inline cabac assembly noinline on i386. (That whole
> >> inline assembly block has been problematic in a large number of cases
> >> anyway.)
> >>
> >
> > So there are currently no known miscompilations due to vectorization
> > with GCC?
> 
> I'm not aware of any, but I haven't tested widely. It certainly is worth
> evaluating.
> 
> (From dav1d, I can anecdotally add that autovectorization does seem to
> help, somewhat, especially when there's not 100% assembly coverage for the
> use case. For some cases it makes things slower than without
> autovectorization, but generally the net result is positive.)
> 
> // Martin

Hi,

a few years ago, I spent days on this subject. Intel has some great
tools that allow precise analysis of how the compiler applies
vectorization and loop optimizations, and they also work on code
compiled with gcc, which is what I was investigating. The focus was
the code in the vf_tonemap filter; later I briefly confirmed my findings
by looking at some other examples. The platform was x86_64 only.

The outcome was that enabling tree-vectorize is beneficial, but combining
it with -O3 has adverse effects. Since then, we are using -O2 with 
tree-vectorization enabled on all platforms.
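
As a rough illustration of that setup (a minimal sketch, not code from
FFmpeg or vf_tonemap; the function name scale_bias and the file name
scale_bias.c are made up for this example), a simple per-element loop
like the one below is the kind of code GCC can typically auto-vectorize.
The command line in the comment combines -O2 with -ftree-vectorize and
uses GCC's -fopt-info-vec-optimized option to report which loops were
vectorized. Whether a given loop is actually vectorized depends on the
GCC version and the target.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical example, not taken from FFmpeg.
 *
 * Build with GCC and ask it to report the loops it vectorized:
 *   gcc -O2 -ftree-vectorize -fopt-info-vec-optimized -c scale_bias.c
 */
void scale_bias(float *restrict dst, const float *restrict src,
                float gain, float bias, size_t n)
{
    assert(dst != NULL && src != NULL);

    /* 'restrict' tells the compiler that dst and src do not overlap,
     * which makes this loop an easier candidate for vectorization. */
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * gain + bias;
}
```

Comparing the report and timings of such a build against an -O3 build is
one way to check whether the higher optimization level helps or hurts
for a given piece of code.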

For CPU tone mapping, I still ended up doing a SIMD implementation using 
Intel intrinsics 😊

Best
sw