[FFmpeg-devel] [PATCH 5/7] ARM: NEON optimised H.264 8x8 and 16x16 qpel MC
Måns Rullgård
mans
Mon Dec 8 16:03:04 CET 2008
"Ian Caulfield" <ian.caulfield at gmail.com> writes:
> 2008/12/5 Mans Rullgard <mans at mansr.com>:
>
>> +
>> + vshl.i16 q3, q1, #4
>> + vshl.i16 q1, q1, #2
>> + vshl.i16 q15, q2, #2
>> + vadd.i16 q1, q1, q3
>> + vadd.i16 q2, q2, q15
>> +
>> + vshl.i16 q3, q9, #4
>> + vshl.i16 q9, q9, #2
>> + vshl.i16 q15, q10, #2
>> + vadd.i16 q9, q9, q3
>> + vadd.i16 q10, q10, q15
>> +
>> + vsub.i16 q1, q1, q2
>> + vsub.i16 q9, q9, q10
>
> Is this any faster? I don't know what the interlocking will be like,
> nor whether you have a spare register to hold the scalar... (or even
> if setting up the scalars would make it slower)
>
> vmul.i16 q1, q1, <scalar set to 6>
> vmul.i16 q9, q9, <scalar set to 6>
> vmls.i16 q1, q2, <scalar set to 3>
> vmls.i16 q9, q10, <scalar set to 3>
How is that equivalent? My code is doing q1 = q1*20 - q2*5. Yours
does q1 = q1*6 - q2*3.
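
The arithmetic can be checked one lane at a time in plain C. The sketch below is only an illustration and not part of the patch: the helper names are hypothetical, and unsigned 16-bit arithmetic is assumed as a stand-in for the wraparound of the .i16 lanes. It models the quoted shift-and-add sequence and compares it with a multiply / multiply-subtract pair for a given pair of constants.

```c
#include <assert.h>
#include <stdint.h>

/* One 16-bit lane of the shift-and-add sequence quoted from the patch.
 * Unsigned arithmetic models the modulo-2^16 wraparound of .i16 lanes. */
static uint16_t lane_shift_add(uint16_t a, uint16_t b)
{
    uint16_t a16 = (uint16_t)(a << 4);    /* vshl.i16 q3,  q1, #4  : 16*a */
    uint16_t a4  = (uint16_t)(a << 2);    /* vshl.i16 q1,  q1, #2  :  4*a */
    uint16_t b4  = (uint16_t)(b << 2);    /* vshl.i16 q15, q2, #2  :  4*b */
    uint16_t a20 = (uint16_t)(a4 + a16);  /* vadd.i16 q1,  q1, q3  : 20*a */
    uint16_t b5  = (uint16_t)(b + b4);    /* vadd.i16 q2,  q2, q15 :  5*b */
    return (uint16_t)(a20 - b5);          /* vsub.i16 q1,  q1, q2         */
}

/* One lane of a multiply followed by a multiply-subtract, with the
 * constants ca and cb (hypothetical helper, mirrors vmul + vmls). */
static uint16_t lane_mul_mls(uint16_t a, uint16_t b, uint16_t ca, uint16_t cb)
{
    return (uint16_t)((uint32_t)a * ca - (uint32_t)b * cb);
}

/* Number of sampled inputs on which the two forms disagree. */
static unsigned count_mismatches(uint16_t ca, uint16_t cb)
{
    unsigned bad = 0;
    for (uint32_t a = 0; a < 65536; a += 257)
        for (uint32_t b = 0; b < 65536; b += 251)
            if (lane_shift_add((uint16_t)a, (uint16_t)b) !=
                lane_mul_mls((uint16_t)a, (uint16_t)b, ca, cb))
                bad++;
    return bad;
}

/* Constants 20 and 5 reproduce the patch; 6 and 3 do not. */
static void self_check(void)
{
    assert(count_mismatches(20, 5) == 0);
    assert(count_mismatches(6, 3) != 0);
}
```

Calling self_check() passes only because the multiply form uses the constants 20 and 5; with 6 and 3 the results differ as soon as either input is non-zero.
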
As for scheduling, 16-bit VMUL on Q registers needs two issue cycles
and has a result latency of 6 cycles. The following vadd needs its
operands in N2, so the pipeline will stall for 3 cycles for a total of
11 cycles. If the constants can be set up conveniently it might just
be faster. I don't remember what spare registers are available here.
Maybe I should reevaluate my fear of multiplication.
--
Måns Rullgård
mans at mansr.com