[Ffmpeg-devel] a little optim for a SSE version of H263_LOOP_FILTER

Sun Nov 5 19:06:28 CET 2006

Hi,

On 11/5/06, Michael Niedermayer <michaelni at gmx.at> wrote:
> Hi
>
> On Sun, Nov 05, 2006 at 04:50:10PM +0100, Guillaume POIRIER wrote:

[...]

> > Note that movq is very slow on P4, so any code that removes
> > mov(q|dqu|..) provides an interesting speed-up.
>
> why don't you try to replace all reg, reg movq by pshufw? if there's a
> speed up then we could make movq a macro which expands depending on
> cpu type to movq or pshufw $11100100b, ...

The P4 optimization manual actually advises using shuffle operations
instead of mov between vector registers.
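
To show why that particular immediate works as a plain copy, here is a
minimal sketch in portable C. It models pshufw in scalar code and shows
what such a macro could look like. pshufw_model, REG_MOVQ and
SKETCH_TUNE_P4 are names I made up for illustration; nothing like them
exists in FFmpeg as far as this thread goes.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of pshufw on one 64-bit MMX register: destination word i
 * is a copy of source word number ((imm >> (2*i)) & 3). */
uint64_t pshufw_model(uint64_t src, uint8_t imm)
{
    uint64_t dst = 0;
    for (int i = 0; i < 4; i++) {
        unsigned sel  = (imm >> (2 * i)) & 3;          /* 2-bit selector */
        uint64_t word = (src >> (16 * sel)) & 0xFFFF;  /* picked word    */
        dst |= word << (16 * i);
    }
    return dst;
}

/* Hypothetical macro (an assumption, not existing code): expands to the
 * text of a reg-to-reg copy for use inside an inline-asm string, with
 * AT&T operand order (source first, destination last). */
#ifdef SKETCH_TUNE_P4 /* hypothetical build-time switch */
#define REG_MOVQ(src, dst) "pshufw $0xE4, " src ", " dst " \n\t"
#else
#define REG_MOVQ(src, dst) "movq " src ", " dst " \n\t"
#endif

/* 11100100b == 0xE4 selects words 3,2,1,0 for positions 3,2,1,0,
 * so the shuffle leaves the value unchanged. */
void check_identity_imm(void)
{
    uint64_t x = 0x0123456789ABCDEFULL;
    assert(pshufw_model(x, 0xE4) == x);
}
```

With something like this, REG_MOVQ("%%mm0", "%%mm1") would be pasted
into an asm string wherever a reg, reg movq is written today, and only
the build-time switch decides which instruction gets emitted.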

However, unconditionally replacing movs with shuffles won't work. mov*
uses the FP_MOV unit, whereas *shuf* uses the MMX_SHIFT unit, which is
part of FP_EXECUTE (see the diagram here:
http://www.tommesani.com/P4MMX.html )

That means you'd put pressure on the FP_EXECUTE unit, on port 1 of
the micro-architecture, whereas FP_MOV is hooked up to port 0.

As I understand it, if FP_EXECUTE is not too crowded, you could gain
from using shuffle operations, but only in that case.
It's hard enough to guess which unit is busy at any given time in a
massively out-of-order CPU such as the P4 that I'm reluctant to spend
much time trying to see what works best.
Moreover, it would only help on the P4, which is the only CPU in the
x86 world with such peculiar instruction latencies.
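
To make the port-pressure argument a bit more concrete, here is a
back-of-the-envelope sketch. It is a toy model of my own, not a P4
simulator: it assumes each port issues one uop per cycle and ignores
latencies, dependencies and everything else the scheduler does. The
struct and function names are hypothetical.

```c
/* Toy model: number of vector uops queued on each issue port for one
 * iteration of some loop. */
struct port_load {
    int port0; /* FP_MOV side: reg-to-reg movs                  */
    int port1; /* FP_EXECUTE side: shuffles, MMX ALU ops, ...   */
};

/* Lower bound on issue cycles under the one-uop-per-port-per-cycle
 * assumption: the busier port is the bottleneck. */
int issue_bound(struct port_load l)
{
    return l.port0 > l.port1 ? l.port0 : l.port1;
}

/* Replace n reg-to-reg movs by shuffles: n uops leave port 0 and land
 * on port 1. The caller must keep n <= l.port0. */
struct port_load swap_movs_for_shuffles(struct port_load l, int n)
{
    l.port0 -= n;
    l.port1 += n;
    return l;
}
```

In this model, a loop with 6 uops on port 0 and 2 on port 1 goes from a
bound of 6 to a bound of 4 when two movs become shuffles, while a loop
with 2 on port 0 and 6 on port 1 goes from 6 to 8. That is the "only if
FP_EXECUTE is not too crowded" condition in its crudest form.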

On top of that, I don't even own a P4 ;-)

Guillaume
-- 
With DADVSI (http://en.wikipedia.org/wiki/DADVSI), France finally has
a lead on the USA in selling out individuals' rights to corporations!
Vive la France!