[FFmpeg-devel] [PATCH 4/4] huffyuvencdsp: Add ff_diff_bytes_avx2

Henrik Gramner henrik at gramner.com
Mon Oct 19 22:56:35 CEST 2015

On Mon, Oct 19, 2015 at 10:00 PM, Timothy Gu <timothygu99 at gmail.com> wrote:
> About 16% faster on large clips (>1200px width), more than 2x slower on small clips
> (352px).

The reason for this is likely the fact that you fall back to scalar
as soon as you have less than 2*mmsize bytes left to process, which
leads to a larger portion being done in scalar with larger vector
registers.
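
As a rough illustration of that effect, here is a minimal sketch that
assumes the tail is simply w % (2*mmsize) bytes, i.e. it ignores any
alignment prologue the real loop might have. The helper name
scalar_tail_bytes is made up for this example and is not part of the patch.

```c
#include <stddef.h>

/* Hypothetical helper: number of bytes left for the scalar loop when the
 * SIMD loop only runs while at least 2*mmsize bytes remain.
 * Assumes no alignment prologue, which the real code may well have. */
size_t scalar_tail_bytes(size_t w, size_t mmsize)
{
    return w % (2 * mmsize);
}
```

Under that assumption a 352-byte row leaves 32 bytes for scalar with ymm
registers (mmsize = 32) but none with xmm registers (mmsize = 16).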

A possible workaround for this is to gradually decrease the amount you
process with SIMD when you're approaching the end, e.g. fall back to
using xmm registers, then half of an xmm register, and maybe even a
quarter of an xmm register (as always, benchmark to see what helps)
before doing scalar for the last few bytes.
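
A minimal sketch of that idea in C with SSE2 intrinsics, rather than the
x86inc assembly the patch itself uses. The function name diff_bytes_tiered
is hypothetical, the main loop here uses two xmm registers instead of ymm so
that it builds without AVX2, and the whole SIMD part is compiled out when
SSE2 is not available. It never reads or writes past w bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical example, not the patch's code: dst[i] = src1[i] - src2[i]
 * with a tail that steps down through smaller SIMD widths. */
void diff_bytes_tiered(uint8_t *dst, const uint8_t *src1,
                       const uint8_t *src2, size_t w)
{
    size_t i = 0;
#if defined(__SSE2__)
    /* Main loop: 2 * 16 bytes per iteration. */
    for (; i + 32 <= w; i += 32) {
        __m128i a0 = _mm_loadu_si128((const __m128i *)(src1 + i));
        __m128i b0 = _mm_loadu_si128((const __m128i *)(src2 + i));
        __m128i a1 = _mm_loadu_si128((const __m128i *)(src1 + i + 16));
        __m128i b1 = _mm_loadu_si128((const __m128i *)(src2 + i + 16));
        _mm_storeu_si128((__m128i *)(dst + i),      _mm_sub_epi8(a0, b0));
        _mm_storeu_si128((__m128i *)(dst + i + 16), _mm_sub_epi8(a1, b1));
    }
    /* One full xmm register (16 bytes). */
    if (i + 16 <= w) {
        __m128i a = _mm_loadu_si128((const __m128i *)(src1 + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(src2 + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_sub_epi8(a, b));
        i += 16;
    }
    /* Half of an xmm register (8 bytes, movq-style load/store). */
    if (i + 8 <= w) {
        __m128i a = _mm_loadl_epi64((const __m128i *)(src1 + i));
        __m128i b = _mm_loadl_epi64((const __m128i *)(src2 + i));
        _mm_storel_epi64((__m128i *)(dst + i), _mm_sub_epi8(a, b));
        i += 8;
    }
    /* Quarter of an xmm register (4 bytes, movd-style load/store). */
    if (i + 4 <= w) {
        int x, y, r;
        memcpy(&x, src1 + i, 4);
        memcpy(&y, src2 + i, 4);
        r = _mm_cvtsi128_si32(_mm_sub_epi8(_mm_cvtsi32_si128(x),
                                           _mm_cvtsi32_si128(y)));
        memcpy(dst + i, &r, 4);
        i += 4;
    }
#endif
    /* Scalar for the last few bytes (at most 3 when SSE2 is used). */
    for (; i < w; i++)
        dst[i] = (uint8_t)(src1[i] - src2[i]);
}
```

Whether the 8-byte and 4-byte steps actually pay off is something only a
benchmark on the target clips can tell.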

This is assuming that you cannot overread src and/or overwrite dst; if
you're allowed to do that, then it's a bit easier of course.
