[FFmpeg-devel] [PATCH] Fix mm_flags, mm_support for ARM
Måns Rullgård
mans
Mon Jun 30 17:24:17 CEST 2008
Siarhei Siamashka wrote:
> On Saturday 28 June 2008, M?ns Rullg?rd wrote:
>> Siarhei Siamashka <siarhei.siamashka at gmail.com> writes:
>> > On Saturday 28 June 2008, M?ns Rullg?rd wrote:
>> >> Michael Niedermayer <michaelni at gmx.at> writes:
>> >> > Do we have someone who has a arm cpu and can look into the above
>> >> > issue?
>> >>
>> >> I know exactly why it's different. In simple_idct.c, the column
>> >> transform contains these lines:
>> >>
>> >> /* XXX: I did that only to give same values as previous code */
>> >> a0 = W4 * (col[8*0] + ((1<<(COL_SHIFT-1))/W4));
>> >>
>> >> It's simpler to code that as a0 = W4 * col[0] + (1 << (COL_SHIFT-1)).
>> >> Thinking about it, it only takes one more instruction on NEON, and
>> >> I've fixed that in my tree. With a little luck, the extra instruction
>> >> can be dual-issued with something else.
>> >
>> > This part does not have any extra overhead in my finetuned version
>> > of ARMv5TE IDCT:
>> >
>> > ldr v1, xxx /* v1 = (((1<<(COL_SHIFT-1))/W4)*W4) */
>> > [some unrelated instructions to hide load latency]
>> > smlatt v2, a2, v4, v1 /* A0t = W4 * (col_t[0] +
>> > ((1<<(COL_SHIFT-1))/W4)) */
>> >
>> > There is no reason why ARMv6 or NEON should have overhead too. So getting
>> > bit-identical results to C simple_idct is possible without sacrificing
>> > performance.
>>
>> ((1<<(COL_SHIFT-1))/W4)*W4 doesn't fit in 16 bits, so that method
>> can't easily be used when everything else is in 16-bit vectors.
>
> But (1 << (COL_SHIFT-1)) does not fit it either. I don't see any significant
> difference between:
>
> a0 = W4 * col[0] + (1 << (COL_SHIFT-1));
>
> and
>
> a0 = W4 * col[0] + ((1<<(COL_SHIFT-1))/W4)*W4;
>
> Sure, the first one can be encoded as a constant immediate operand in ARM
> instruction and the second one can't, but that's not a big deal (it can be
> loaded from memory)
Loading from memory is slower than using an immediate operand, and being
wider than 16 bits it can't share a register with other values.
> and is unlikely to be a problem for your NEON code.
I've used one more instruction to get the right value. I doubt it has
any measurable impact on performance, and I can't be bothered to try
thinking of an alternate solution.
> In any case, you can't get away using only 16-bit values, multiplication
> results should be still accumulated somewhere. The difference between
> these two implementations is just an initial accumulator value. But of
> course, I may be missing something. It would be nice to have a look at your
> NEON code.
NEON multiplies use same-size operands, so a 16-bit vector can't be
multiplied with a 32-bit value. The accumulator can, however, be
32-bit.
My NEON code is available from git://git.mansr.com/ffmpeg.mru in
branch arm-neon.
>> Is your armv5te idct a total rewrite of what's in svn, or can the
>> changes be broken into sensible steps? If the latter, please send a
>> patch series.
>
> Do you remember our chat in IRC after you added your implementation of
> ARMv5TE IDCT to SVN? I said that it was a nice improvement over all the
> IDCT available at that point for ARM, but still far from optimal. You
> said that further improvements would be welcome. So here we are.
>
> I ended up mostly keeping all the code structure (I actually tried
> to modify code to process 1 column per iteration, but processing 2
> columns still was faster). But all the code had to be changed for
> better instructions scheduling and registers allocation.
>
> It can be artificially split into several patches, replacing all parts
> of your IDCT one at a time, but are these steps really sensible?
If the end result is a complete rewrite, there's little sense in that.
Had there been an overall resemblance, a patch series might have made
sense.
> This further optimization of ARMv5TE IDCT is a result of work approved
> and encouraged by you, so you should have some responsibility too ;)
>
> I wish we had a friendly community of ARM developers here, who would help
> each other. After all, am I the only one here who is interested in getting
> much faster ARMv5TE IDCT for FFmpeg?
I have faster ARM devices to play with now ;-)
> Could you or anybody else having compatible ARM device just do some
> benchmarking to confirm my results (I posted benchmarks here multiple
> times already). It would be a really good help. Because I feel that
> some people here still doubt that it provides a major performance
> improvement.
>
> In order to do this, you need to:
>
> 1. Download this source file and drop it over current 'simple_idct_armv5te.S':
> https://garage.maemo.org/plugins/scmsvn/viewcvs.php/trunk/libavcodec/armv4l/simple_idct_armv5te.S?root=mplayer&view=markup
> 2. Change MAX_NEG_CROP in 'simple_idct_armv5te.S' to 1024
> 3. Compile FFmpeg and do some benchmarks
>
> Once/if the performance improvement is confirmed, a help with integration
> would be really needed. That's not a joke, I really fail to see any problems
> with the "balign/ASMALIGN/stack alignment" stuff, so I can't fix them. A good
> example of a solution (a working patch) is very much welcome.
>
> Does NEON optimized code require more than 32-bit alignment for data and
> stack? If yes, I could just wait a bit and use your NEON code as an example
> how do this stuff in FFmpeg-approved way.
NEON supports both aligned (up to 32-byte) an unaligned memory access.
Aligned is obviously faster. My NEON IDCT doesn't use the stack at all,
so stack alignment is not an issue there.
>> >> > Ideally would be the authors who claimed the code to be identical to
>> >> > the C code ...
>> >>
>> >> I wrote the ARMv6 version, but I never made any such claim. In fact,
>> >> I believe I mentioned at the time that there was a slight difference.
>> >>
>> >> > If we have noone then we will likely have to disable these IDCTs. I do
>> >> > not want to create files that turn green and pink unless they are
>> >> > played on an ARM cpu ...
>> >>
>> >> I don't think the ARM CPUs where these apply will be used mostly for
>> >> playback, not encoding, and on those machines every cycle counts.
>> >
>> > Yes, that was one of the reasons why I did not strongly insist on
>> > disabling j_rev_dct_ARM that time (people could get a severe performance
>> > regressions and complain about it) :)
>> >
>> > In any case, ARMv6 idct still needs heavy optimizations, it is not very
>> > fast (on its target devices with ARM11 CPUs of course).
>>
>> Well, it's considerably faster than the C IDCT, but I'm not denying it
>> could be improved. Are you talking about sparse data handling, or
>> something else?
>
> It has quite a number of performance problems:
> 1. it's not using 64-bit load instructions (and dual loads are almost as
> useful as dual multiplies for IDCT)
Odd, I can't remember why I didn't use those.
> 2. heavy functions call/ret overhead (ordinary loops are a lot faster)
Why are function calls so slow?
> 3. sparse data handling
Yes, that is a bit lacking.
> It can be clearly seen if you try to benchmark it against optimized
> ARMv5TE IDCT on ARM11, and ARMv6 IDCT fails to provide any significant
> performance improvement (actually performance is roughly the same and
> I can't say which one is faster). Also ARMv5TE optimized IDCT can be
> easily modified to use ARMv6 saturation instructions instead of table
> lookups. That should save at least 64 cycles per _add/_put function
> and I suspect that this hacked ARMv5TE IDCT would easily leap ahead
> and outperform your ARMv6 IDCT.
>
> We have the following hardware features list:
> "dual loads", "dual multiplies", "saturation"
>
> Your IDCT uses only "dual multiplies" and "saturation"
>
> ARMv5TE IDCT uses only "dual loads", but can be trivially modified to use
> "saturation" too
How fast are the v5TE instructions on v6 compared to v6 SIMD instructions?
On Cortex-A8 they are slower than on ARM9, presumably because NEON is
the favoured way of doing things there.
> Optimal IDCT implementation should use them all, we don't have spare CPU
> cycles on embedded devices.
Indeed.
>> Did you have any patches?
>
> I just know how to fix it :) But we are not finished with ARMv5TE yet...
>
>> Remind me again what ARM devices you have. I recently got my hands on
>> a Cortex-A8 (TI OMAP3530), and it's more fun to play around with than
>> the Nokia tablets.
>
> Right now I only have ARM926EJ-S core in TI OMAP1710 (Nokia 770) and
> ARM1136JF-S in TI OMAP2420 (Nokia N800). But I hope to get this one soon
> (well, once they become available for general public): http://beagleboard.org
That's what I have too, including a Beagleboard rev B. I hardly touch
the 770, so if anyone wants it, give me an offer.
--
M?ns Rullg?rd
mans at mansr.com
More information about the ffmpeg-devel
mailing list