[FFmpeg-devel] [PATCH] slicethread: Limit the automatic number of threads to 16

Tue Sep 6 22:50:21 EEST 2022

>Sent: Monday, 05 September 2022 at 21:58
>From: "Martin Storsjö" <martin at martin.st>
>To: ffmpeg-devel at ffmpeg.org
>Subject: Re: [FFmpeg-devel] [PATCH] slicethread: Limit the automatic number of threads to 16
>On Mon, 5 Sep 2022, Martin Storsjö wrote:
>
>> This matches a similar cap on the number of automatic threads
>> in libavcodec/pthread_slice.c.
>>
>> On systems with lots of cores, this does speed things up in
>> general (measurable on the level of the runtime of running
>> "make fate"), and fixes a couple fate failures in 32 bit mode on
>> such machines (where spawning a huge number of threads runs
>> out of address space).
>> ---
>
> On second thought - this observation that it speeds up "make -j$(nproc)
> fate" isn't surprising at all; as long as there are jobs to saturate all
> cores with the make level parallelism anyway, any threading within each
> job just adds extra overhead, nothing more.
>
> // Martin

Agreed, this observation from massively parallel test runs does not tell
us much about real-world performance.
There are really two separate issues here:

1. Running out of address space in 32-bit processes

It probably makes sense to limit auto threads to 16, but it should only
be done in 32-bit processes. A 64-bit process should never run out of
address space. We should not cripple high-end machines running
64-bit applications.
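
To make the suggestion concrete, here is a minimal sketch of a cap that
only applies to 32-bit builds. The helper name auto_thread_count() and
the MAX_AUTO_THREADS_32BIT macro are made up for illustration and are
not existing FFmpeg code; the pointer size is passed in as a parameter
only so that both cases can be exercised on one machine.

```c
#include <stddef.h>

/* Hypothetical cap, using the value from the proposed patch. */
#define MAX_AUTO_THREADS_32BIT 16

/*
 * Hypothetical helper: pick the "auto" thread count.
 * nb_cpus      - number of logical CPUs reported by the system
 * pointer_size - sizeof(void *) of the running process
 * The cap is applied only when pointers are 4 bytes or smaller,
 * i.e. when the process has a 32-bit address space.
 */
static int auto_thread_count(int nb_cpus, size_t pointer_size)
{
    if (nb_cpus < 1)
        nb_cpus = 1;
    if (pointer_size <= 4 && nb_cpus > MAX_AUTO_THREADS_32BIT)
        return MAX_AUTO_THREADS_32BIT;
    return nb_cpus;
}
```

In real code the call would presumably look something like
auto_thread_count(av_cpu_count(), sizeof(void *)), so a 64-bit build
always gets the full logical CPU count.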

Sidenotes about "it does not make sense to have more than 16 slices":

On 8K video (7680x4320), when using 32 threads, each thread will process
135 lines or about 1 MP (more than a full 720p frame!). Sure makes sense
to me. But even for sw decoding 4K video, having more than 16 threads on
a powerful machine makes sense.
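
For anyone who wants to check the per-thread workload, here is a small
sketch of the arithmetic. It assumes the frame is split into equal
horizontal bands, one per thread, and that 8K means 7680x4320; the
function names are made up for illustration, and the exact figures
depend on the frame size assumed.

```c
#include <stdint.h>

/* Rows handled by each thread, rounded up so every row is covered. */
static int rows_per_thread(int height, int nb_threads)
{
    return (height + nb_threads - 1) / nb_threads;
}

/* Pixels handled by each thread for a frame of width x height. */
static int64_t pixels_per_thread(int width, int height, int nb_threads)
{
    return (int64_t)width * rows_per_thread(height, nb_threads);
}
```

With these assumptions, pixels_per_thread(7680, 4320, 32) comes out at
1036800 pixels, and a whole 1280x720 frame is 921600 pixels.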

Intel's next desktop CPUs will have up to 24 physical cores. The
proposed change would limit them to use only 16 cores, even on 64-bit.

2. Spawning too many threads when "auto" is used in multiple places

This can indeed be an efficiency problem, although probably not a major one.
Since usually only one part of the pipeline is active at any time,
many of the threads will be sleeping, consuming very few resources.

The issue only affects certain scenarios. If someone has such
a scenario and wants to optimize, they could explicitly set threads to 
a lower value, and see if it helps.

Putting an arbitrary limit on threads would only "solve" this issue
for the biggest CPUs (which have more than enough power anyway),
at the cost of crippling their performance in other scenarios.

A "normal" <= 8 core CPU might still end up with 16 threads for
the decoder, 16 threads for effects and 16 threads for encoding,
with 2/3 of them sleeping at any time.

--> The issue affects only certain scenarios. The proposed fix only
    fixes it for a minority of all PCs, while it cripples performance
    of these PCs in other scenarios.

--> I do not think that this 16 threads limit is a good idea.
    IMHO "auto" should always use the logical CPU count,
    except for 32-bit applications.

The only true solution to this problem would be adding a shared
thread pool. The application would create the pool when it is started,
with the number of logical CPU cores as the maximum (maybe with a lower
limit in 32-bit processes). It would pass this pool to all created
decoders/encoders/filters. But doing this correctly is a major task, and
it would require major rework in all areas where multithreading is used
now. Not sure if the problem is really big enough to justify this effort.
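
To illustrate what I mean by a shared pool, here is a rough sketch using
plain pthreads. None of this is existing FFmpeg API: SharedPool,
shared_pool_create(), shared_pool_run() and shared_pool_free() are
hypothetical names, and a real design would need far more (priorities,
non-blocking submission, error handling, integration with frame
threading). It only shows the basic idea: one fixed set of workers, and
every component submits its slice jobs to it.

```c
#include <pthread.h>
#include <stdlib.h>

typedef void (*pool_job_fn)(void *arg, int jobnr);

/* Hypothetical shared pool: one set of workers reused by all components. */
typedef struct SharedPool {
    pthread_t      *workers;
    int             nb_workers;
    pthread_mutex_t run_lock;   /* serializes callers of shared_pool_run() */
    pthread_mutex_t lock;       /* protects the batch state below */
    pthread_cond_t  work_cond;  /* a new batch was posted, or shutdown */
    pthread_cond_t  done_cond;  /* the current batch has finished */
    pool_job_fn     func;
    void           *arg;
    int             nb_jobs, next_job, pending, shutdown;
} SharedPool;

static void *pool_worker(void *opaque)
{
    SharedPool *p = opaque;
    pthread_mutex_lock(&p->lock);
    for (;;) {
        /* Sleep until there is an unclaimed job or the pool shuts down. */
        while (!p->shutdown && p->next_job >= p->nb_jobs)
            pthread_cond_wait(&p->work_cond, &p->lock);
        if (p->shutdown)
            break;
        int job = p->next_job++;
        pool_job_fn func = p->func;
        void *arg = p->arg;
        pthread_mutex_unlock(&p->lock);
        func(arg, job);                 /* run the job without the lock */
        pthread_mutex_lock(&p->lock);
        if (--p->pending == 0)
            pthread_cond_signal(&p->done_cond);
    }
    pthread_mutex_unlock(&p->lock);
    return NULL;
}

/* Create a pool; nb_workers must be >= 1. Returns NULL on allocation failure. */
static SharedPool *shared_pool_create(int nb_workers)
{
    SharedPool *p = calloc(1, sizeof(*p));
    if (!p)
        return NULL;
    p->workers = calloc(nb_workers, sizeof(*p->workers));
    if (!p->workers) {
        free(p);
        return NULL;
    }
    pthread_mutex_init(&p->run_lock, NULL);
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->work_cond, NULL);
    pthread_cond_init(&p->done_cond, NULL);
    for (int i = 0; i < nb_workers; i++) {
        if (pthread_create(&p->workers[p->nb_workers], NULL, pool_worker, p))
            break;
        p->nb_workers++;
    }
    return p;
}

/* Run func(arg, 0..nb_jobs-1) on the shared workers and wait for all jobs. */
static void shared_pool_run(SharedPool *p, pool_job_fn func, void *arg, int nb_jobs)
{
    if (nb_jobs <= 0)
        return;
    pthread_mutex_lock(&p->run_lock);
    if (p->nb_workers == 0) {           /* no workers: run inline */
        for (int i = 0; i < nb_jobs; i++)
            func(arg, i);
    } else {
        pthread_mutex_lock(&p->lock);
        p->func = func;
        p->arg = arg;
        p->nb_jobs = p->pending = nb_jobs;
        p->next_job = 0;
        pthread_cond_broadcast(&p->work_cond);
        while (p->pending > 0)
            pthread_cond_wait(&p->done_cond, &p->lock);
        pthread_mutex_unlock(&p->lock);
    }
    pthread_mutex_unlock(&p->run_lock);
}

/* Stop and join all workers; must not be called while a run is in progress. */
static void shared_pool_free(SharedPool *p)
{
    pthread_mutex_lock(&p->lock);
    p->shutdown = 1;
    pthread_cond_broadcast(&p->work_cond);
    pthread_mutex_unlock(&p->lock);
    for (int i = 0; i < p->nb_workers; i++)
        pthread_join(p->workers[i], NULL);
    free(p->workers);
    free(p);
}

/* Example job: each job writes its own slot, so no extra locking is needed. */
static void square_job(void *arg, int jobnr)
{
    ((int *)arg)[jobnr] = jobnr * jobnr;
}
```

A decoder, a filter and an encoder would all call shared_pool_run() on
the same pool, so the process never has more worker threads than the
pool was created with. This sketch simply serializes batches from
different callers, which is the simplest thing that works and roughly
matches the "only one part of the pipeline is active at any time"
situation, but it is probably not what a production design would do.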