[FFmpeg-devel] swscale/arm/yuv2rgb: make the code bitexact with its aarch64 counter part

Fri Apr 1 17:31:11 CEST 2016

On Fri, Apr 1, 2016 at 4:15 PM, Matthieu Bouron <matthieu.bouron at gmail.com>
wrote:

>
>
> On Mon, Mar 28, 2016 at 9:12 PM, Matthieu Bouron <
> matthieu.bouron at gmail.com> wrote:
>
>>
>>
>> On Sun, Mar 27, 2016 at 5:58 PM, Matthieu Bouron <
>> matthieu.bouron at gmail.com> wrote:
>>
>>>
>>>
>>> On Fri, Mar 25, 2016 at 11:45 PM, Matthieu Bouron <
>>> matthieu.bouron at gmail.com> wrote:
>>>
>>>> The following patchset aims to make the yuv->rgba armv7 neon code path
>>>> bitexact with the aarch64 one. It also aims to make the two code bases as
>>>> close as possible.
>>>>
>>>> [PATCH 01/10] swscale/arm/yuv2rgb: remove 32bit code path
>>>>
>>>> The current 32bit code path, which is unused, is removed.
>>>>
>>>> [PATCH 06/10] swscale/arm/yuv2rgb: only process one line at a time
>>>>
>>>> The code processes only one line at a time for the yuv420p, nv12 and nv21
>>>> formats, with no performance regression observed on a rpi2 (I've even
>>>> observed a slight performance increase for the nv12 and nv21 formats).
>>>>
>>>> [PATCH 10/10] swscale/arm/yuv2rgb: make the code bitexact with its
>>>> aarch64 counter part
>>>>
>>>> The last patch of the series makes the code bitexact with the aarch64
>>>> version. The increase in precision (which introduces a performance loss) is
>>>> compensated by a refactor/optimisation that saves quite a few mov, vdup and
>>>> vqdmulh instructions.
>>>>
>>>> ./ffmpeg_g -nostats -f lavfi -i
>>>> testsrc2=1920x1080:d=5,format=nv12,bench=start,format=bgra,bench=stop -f
>>>> null -
>>>>
>>>> without patchset:
>>>> [bench @ 0x3eb6a0] t:0.020660 avg:0.020813 max:0.039399 min:0.020605
>>>>
>>>> with patchset:
>>>> [bench @ 0xe5f6a0] t:0.018924 avg:0.019075 max:0.037472 min:0.01884
>>>
>>>
>>> I've managed to run the code on a BeagleBone Black board, here are the
>>> results:
>>>
>>> nv12->bgra
>>> without patchset: [bench @ 0x1fc02d0] t:0.011618 avg:0.011743
>>> max:0.032600 min:0.011513
>>> with patches 01-06/10 applied: [bench @ 0x8052d0] t:0.013438
>>> avg:0.013659 max:0.034427 min:0.013411
>>> with patches 01-10/10 applied: [bench @ 0x1fbb2d0] t:0.012554
>>> avg:0.012751 max:0.034288 min:0.012523
>>>
>>> yuv420p->bgra
>>> without patchset: [bench @ 0x6d42d0] t:0.012954 avg:0.013159
>>> max:0.033866 min:0.012945
>>> with patches 01-06/10 applied: [bench @ 0x20172d0] t:0.015154
>>> avg:0.015358 max:0.036186 min:0.015134
>>> with patches 01-10/10 applied: [bench @ 0x1d162d0] t:0.014623
>>> avg:0.014784 max:0.035487 min:0.014568
>>>
>>> So it looks like processing one line at a time has a negative effect on
>>> performance on this board (as opposed to the rpi2). I'll try to keep the
>>> two-line processing code and post some results (so we can decide which
>>> version to choose).
>>>
>>
>> I've managed to update the patchset to keep processing two lines at a time
>> for the nv12, nv21 and yuv420p formats, here are the results:
>>
>> ./ffmpeg_g -nostats -f lavfi -i
>> testsrc2=1920x1080:d=5,format=nv12,bench=start,format=bgra,bench=stop -f
>> null -
>>
>> BeagleBone Black:
>> without patchset: [bench @ 0x1fc02d0] t:0.011618 avg:0.011743
>> max:0.032600 min:0.011513
>> with patchset v1: [bench @ 0x1fbb2d0] t:0.012554 avg:0.012751
>> max:0.034288 min:0.012523
>> with patchset v2: [bench @ 0x10f92d0] t:0.011239 avg:0.011408
>> max:0.032124 min:0.011202
>>
>> Nexus5:
>> without patchset: avg: ~2,869ms
>> with patchset v1: avg: ~3,008ms
>> with patchset v2: avg: ~2,702ms
>>
>> RPI2:
>> without patchset: [bench @ 0x3eb6a0] t:0.020660 avg:0.020813
>> max:0.039399 min:0.020605
>> with patchset v1:  [bench @ 0xe5f6a0] t:0.018924 avg:0.019075
>> max:0.037472 min:0.01884
>> with patchset v2: [bench @ 0xc1b6a0] t:0.020999 avg:0.021203 max:0.052184
>> min:0.020768
>>
>> Given these results, I will drop the current patchset and
>> submit another one (which keeps processing two lines at a time).
>>
>
> I will push the updated patchset (which takes into account Benoit's
> comments) in about one hour.
>

Pushed.
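
For anyone who wants to compare the bench lines quoted in this thread, here is
a minimal sketch that extracts the t/avg/max/min fields from one summary line
and computes the relative change of a run against a baseline. The helper names
(parse_bench, relative_change) are made up for illustration and are not part
of FFmpeg. It assumes each result has been joined back onto a single line,
since the archive wraps them.

```python
import re

# Matches the fields of a summary line as quoted in this thread, e.g.
# "[bench @ 0x3eb6a0] t:0.020660 avg:0.020813 max:0.039399 min:0.020605"
BENCH_RE = re.compile(
    r"t:(?P<t>[\d.]+)\s+avg:(?P<avg>[\d.]+)\s+"
    r"max:(?P<max>[\d.]+)\s+min:(?P<min>[\d.]+)"
)


def parse_bench(line):
    """Return the t/avg/max/min fields of one (unwrapped) bench line as floats."""
    m = BENCH_RE.search(line)
    if m is None:
        raise ValueError("not a bench summary line: %r" % line)
    return {name: float(value) for name, value in m.groupdict().items()}


def relative_change(baseline, candidate):
    """Change of candidate versus baseline in percent (negative means faster)."""
    return (candidate - baseline) / baseline * 100.0


if __name__ == "__main__":
    # nv12->bgra results on the BeagleBone Black, copied from the thread.
    runs = {
        "without patchset": "[bench @ 0x1fc02d0] t:0.011618 avg:0.011743 max:0.032600 min:0.011513",
        "with patchset v1": "[bench @ 0x1fbb2d0] t:0.012554 avg:0.012751 max:0.034288 min:0.012523",
        "with patchset v2": "[bench @ 0x10f92d0] t:0.011239 avg:0.011408 max:0.032124 min:0.011202",
    }
    base = parse_bench(runs["without patchset"])["avg"]
    for name, line in runs.items():
        avg = parse_bench(line)["avg"]
        print("%-18s avg=%.6f (%+.2f%%)" % (name, avg, relative_change(base, avg)))
```

Comparing the avg field keeps the comparison in line with the figures discussed
above. The min field could be used instead if run-to-run noise is a concern.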