[FFmpeg-devel] [PATCH] vc1dsp: Port ff_vc1_put_ver_16b_shift2_mmx to yasm
christophe.gisquet at gmail.com
Wed Oct 21 19:45:15 CEST 2015
2015-10-18 2:47 GMT+02:00 Timothy Gu <timothygu99 at gmail.com>:
> This function is only used within other inline asm functions, hence the
> HAVE_MMX_INLINE guard. Per recent discussions, we should not worry about
> the performance of inline asm-only builds.
On a quick glance, looks good.
> The conversion process has to start _somewhere_...
> + movh m2, [srcq]
> + add srcq, strideq
> + movh m3, [srcq]
> + punpcklbw m2, m0
> + punpcklbw m3, m0
> + SHIFT2_LINE 0, 1, 2, 3, 4
> + SHIFT2_LINE 24, 2, 3, 4, 1
> + SHIFT2_LINE 48, 3, 4, 1, 2
> + SHIFT2_LINE 72, 4, 1, 2, 3
> + SHIFT2_LINE 96, 1, 2, 3, 4
> + SHIFT2_LINE 120, 2, 3, 4, 1
> + SHIFT2_LINE 144, 3, 4, 1, 2
> + SHIFT2_LINE 168, 4, 1, 2, 3
> + sub srcq, stride_9minus4
> + add dstq, 8
> + dec i
> + jnz .loop
The following remarks are for potential later work and food for thought.
I'm the first offender, but that loop expands to ~100 instructions. I
don't know what others think about this, but that might be a tad
much. So maybe specializing for particular shift and round
values (if possible, I don't remember) would be better.
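
To make the specialization idea a bit more concrete, here is a rough
scalar C model of what I understand the vertical shift2 pass to compute,
assuming the usual VC-1 half-pel taps (-1, 9, 9, -1). The function names,
the explicit width/height/dst_stride parameters and the fixed shift of 1
in the wrapper are all made up for illustration and are not taken from
the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar sketch of a vertical (-1, 9, 9, -1) filter with 16-bit output.
 * Hypothetical helper: name and parameter list are assumptions.
 * Relies on >> being an arithmetic shift for negative ints, which C
 * leaves implementation-defined but common compilers provide. */
static inline void put_ver_16b_shift2_ref(int16_t *dst, ptrdiff_t dst_stride,
                                          const uint8_t *src, ptrdiff_t src_stride,
                                          int width, int height,
                                          int rnd, int shift)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            /* Taps cover the rows at -1, 0, +1 and +2 relative to src. */
            int v = -src[x - src_stride]
                  + 9 * src[x]
                  + 9 * src[x + src_stride]
                  - src[x + 2 * src_stride];
            dst[x] = (int16_t)((v + rnd) >> shift);
        }
        src += src_stride;
        dst += dst_stride;
    }
}

/* Hypothetical specialization: with shift a compile-time constant, the
 * compiler (or a hand-written asm variant) can use an immediate shift. */
static void put_ver_16b_shift2_s1(int16_t *dst, ptrdiff_t dst_stride,
                                  const uint8_t *src, ptrdiff_t src_stride,
                                  int width, int height, int rnd)
{
    put_ver_16b_shift2_ref(dst, dst_stride, src, src_stride,
                           width, height, rnd, 1);
}
```

Something like this could also serve as a reference to check the asm
against, provided the caller passes the same rounding and shift.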
Then there's the fact that 16-wide blocks are currently handled as 2x8
(iirc), which would suggest doing part of this in C.
On the other hand, idcts are not yet implemented, and there are h/w
decoders doing a better job of decoding vc1, so it may be a waste of
time (hence why I myself never did all of this).