[FFmpeg-devel] [PATCH] vc1dsp: Port ff_vc1_put_ver_16b_shift2_mmx to yasm

Wed Oct 21 19:45:15 CEST 2015

Hi,

2015-10-18 2:47 GMT+02:00 Timothy Gu <timothygu99 at gmail.com>:
> This function is only used within other inline asm functions, hence the
> HAVE_MMX_INLINE guard. Per recent discussions, we should not worry about
> the performance of inline asm-only builds.

On a quick glance, looks good.

> The conversion process has to start _somewhere_...

True.

> +.loop:
> +    movh               m2, [srcq]
> +    add              srcq, strideq
> +    movh               m3, [srcq]
> +    punpcklbw          m2, m0
> +    punpcklbw          m3, m0
> +    SHIFT2_LINE         0, 1, 2, 3, 4
> +    SHIFT2_LINE        24, 2, 3, 4, 1
> +    SHIFT2_LINE        48, 3, 4, 1, 2
> +    SHIFT2_LINE        72, 4, 1, 2, 3
> +    SHIFT2_LINE        96, 1, 2, 3, 4
> +    SHIFT2_LINE       120, 2, 3, 4, 1
> +    SHIFT2_LINE       144, 3, 4, 1, 2
> +    SHIFT2_LINE       168, 4, 1, 2, 3
> +    sub              srcq, stride_9minus4
> +    add              dstq, 8
> +    dec                 i
> +        jnz         .loop

The following remarks are for potential later work and food for thought.

I'm the first offender, but that loop expands to ~100 instructions. I
don't know what others think about this, but that might be a tad
much. So maybe specializing for particular shift and round values (if
possible, I don't remember) would be better.
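
To make the specialization idea concrete, here is a rough C sketch, not
the asm from the patch. It assumes the vertical half-pel pass applies the
(-1, 9, 9, -1) taps and then adds a rounder and shifts right. All function
names, the macro, and the two (rnd, shift) pairs are hypothetical and only
there for illustration; whether a small fixed set of pairs covers all
callers would need checking.

```c
#include <stddef.h>
#include <stdint.h>

/* Generic vertical half-pel pass: rnd and shift are runtime values.
 * Assumes the (-1, 9, 9, -1) taps and relies on arithmetic right shift
 * of negative ints, which common compilers provide. */
static inline void ver_shift2_generic(int16_t *dst, ptrdiff_t dst_stride,
                                      const uint8_t *src, ptrdiff_t src_stride,
                                      int rnd, int shift, int width, int height)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int v = 9 * (src[x] + src[x + src_stride])
                  - src[x - src_stride] - src[x + 2 * src_stride];
            dst[x] = (int16_t)((v + rnd) >> shift);
        }
        src += src_stride;
        dst += dst_stride;
    }
}

/* Hypothetical specialization: rnd and shift become compile-time
 * constants, so the compiler can fold them into each variant. */
#define DEFINE_VER_SHIFT2(name, RND, SHIFT)                          \
static void name(int16_t *dst, ptrdiff_t dst_stride,                 \
                 const uint8_t *src, ptrdiff_t src_stride,           \
                 int width, int height)                              \
{                                                                    \
    ver_shift2_generic(dst, dst_stride, src, src_stride,             \
                       RND, SHIFT, width, height);                   \
}

/* Illustrative (rnd, shift) pairs, not taken from the patch. */
DEFINE_VER_SHIFT2(ver_shift2_r0_s1, 0, 1)
DEFINE_VER_SHIFT2(ver_shift2_r3_s3, 3, 3)
```

In asm the equivalent would be instantiating the loop body once per
(rnd, shift) pair, trading code size for fewer loads and shifts by
register inside the loop.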

Then there's the fact that the 16-wide blocks are currently handled as
2x8 (iirc), which would suggest doing part of this in C.
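
A minimal sketch of what "doing part of this in C" could mean: the split
of a 16-wide block into two 8-column halves lives in a C wrapper, and only
the 8-column kernel would need to be asm. The function-pointer type, the
names, and the kernel signature are assumptions made for this example, not
the actual FFmpeg entry points. A plain C stand-in replaces the kernel so
the snippet is self-contained.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical signature for an 8-column vertical kernel. */
typedef void (*shift2_8col_fn)(int16_t *dst, ptrdiff_t dst_stride,
                               const uint8_t *src, ptrdiff_t src_stride,
                               int rnd, int shift, int height);

/* Plain C stand-in for the 8-column kernel, assuming the
 * (-1, 9, 9, -1) taps followed by round and arithmetic shift. */
static void shift2_8col_ref(int16_t *dst, ptrdiff_t dst_stride,
                            const uint8_t *src, ptrdiff_t src_stride,
                            int rnd, int shift, int height)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < 8; x++) {
            int v = 9 * (src[x] + src[x + src_stride])
                  - src[x - src_stride] - src[x + 2 * src_stride];
            dst[x] = (int16_t)((v + rnd) >> shift);
        }
        src += src_stride;
        dst += dst_stride;
    }
}

/* 16-wide block handled as two independent 8-column halves. */
static void shift2_16col(shift2_8col_fn kernel,
                         int16_t *dst, ptrdiff_t dst_stride,
                         const uint8_t *src, ptrdiff_t src_stride,
                         int rnd, int shift, int height)
{
    kernel(dst,     dst_stride, src,     src_stride, rnd, shift, height);
    kernel(dst + 8, dst_stride, src + 8, src_stride, rnd, shift, height);
}
```

A caller would pass the asm kernel where `shift2_8col_ref` is used here,
so the 2x8 logic is written once and not duplicated per instruction set.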

On the other hand, idcts are not yet implemented, and there are h/w
decoders doing a better job of decoding vc1, so it may be a waste of
time (hence why I myself never did all of this).

-- 
Christophe