[FFmpeg-devel] [PATCH 09/13] avcodec/svq1dec: clear MMX state after MB decode loop

Wed Oct 26 18:42:30 EEST 2016

On Wed, Oct 26, 2016 at 04:21:14PM +0200, Hendrik Leppkes wrote:
> On Wed, Oct 26, 2016 at 3:54 PM, Michael Niedermayer
> <michael at niedermayer.cc> wrote:
> > On Tue, Oct 25, 2016 at 12:00:01AM +0200, Hendrik Leppkes wrote:
> >> On Mon, Oct 24, 2016 at 10:31 PM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
> >> > Hi,
> >> >
> >> > On Mon, Oct 24, 2016 at 4:26 PM, Henrik Gramner <henrik at gramner.com> wrote:
> >> >
> >> >> On Mon, Oct 24, 2016 at 9:59 PM, Ronald S. Bultje <rsbultje at gmail.com>
> >> >> wrote:
> >> >> > Good idea to reference Henrik Gramner here, who keeps insisting we get rid
> >> >> > of all MMX code in ffmpeg (at least as an option) for future Intel CPUs in
> >> >> > which MMX will be deprecated.
> >> >>
> >> >> Replacing MMX with SSE2 is indeed the most "proper" fix in my opinion,
> >> >> but it's a fair amount of work and not done in an evening.
> >> >>
> >> >> The fact that a lot of assembly lacks unit tests is certainly not
> >> >> helping in that regard.
> >> >>
> >> >> Some MMX instructions are slower than the equivalent SSE2 code on
> >> >> Skylake. Intel hasn't officially commented on (as far as I know at
> >> >> least) if we should expect this trend to continue, but they certainly
> >> >> seem to treat MMX as legacy.
> >> >>
> >> >> I doubt they would completely remove support for it though, backwards
> >> >> compatibility is a big selling-point for x86.
> >> >
> >> >
> >> > Well, it gives us another way of fixing this issue (on x86-64 only): have
> >> > sse2 implementations for all code that has a mmx (register) path right now.
> >> >
> >>
> >> I don't think the argument for pre-sse2 CPUs is that strong on 32-bit
> >> systems, either.
> >
> > SSE2 was initially not faster than MMX as CPUs implemented it as 2
> > MMX operations internally not having a full width SIMD unit for SSE*
> > so there would be a performance loss on some x86-32 CPUs if MMX was
> > replaced by "half-width SSE2" there
> >
> 
> You can add "not caring about first-gen sse2 CPUs" to the list as

It's more like 3 or 4 generations than 1, according to the instruction
tables from Agner Fog.

Core 2 (Merom) seems to be the first that has partial full-width support;
shift/pack/unpack/shuffle are still faster as MMX there.
PM, P4 and P4E all seem to run SSE* at half the speed of MMX.

> well, if you want. Those are way old as well.

> There is going to be a performance loss either way, except that emms
> slows it down everywhere, while using sse2 is likely to be fine on

Minor detail: there is a factor of around
ten thousand in the speed loss between the 2 cases you compare
(0.001% vs maybe 50%).

Dropping MMX will make pre-SSE2 CPUs a lot slower, maybe half
speed overall or less, since they lose all SIMD optimizations. On older
SSE2 CPUs it's still going to be a hefty hit too.
Adding emms at the video frame or slice level, which is what the patches
posted do, has pretty much no real effect. But don't believe me, look at
the timings: the worst case I see in Agner's tables is 18 clock cycles,
which at 60 fps, 1 slice per frame and a slow 100 MHz CPU is 0.001%.
Even if there are 100 times more emms (due to slice-level EMMS) it is
still at the edge of being hard to measure. Doing EMMS per function call
is of course not practical.

There's an additional penalty for the first float instruction after emms
on some CPUs, 58 clock cycles (on P4), but that's still just 0.003% in the
example above.

Anyway, I wanted to stay out of this and I'll do that; I just wanted to
comment on the technical details.

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Those who are too smart to engage in politics are punished by being
governed by those who are dumber. -- Plato 