[FFmpeg-devel] [PATCH] Some IWMMXT functions for libavcodec #2

Mon May 19 20:22:46 CEST 2008

On Monday 19 May 2008, Dmitry Antipov wrote:
> Here you can find 15 WMMX versions of it: http://78.153.153.8/tmp/pix_sum.c
> And here is an example output: http://78.153.153.8/tmp/pix_sum.txt (it's an
> average results among a few tens runs).
>
> Some conclusions on top on this work:
>
>   - inserting some work (if exists) between loads is always good;

Yes, because the Intel WMMX2 optimization manual clearly states that loads
need to be separated from each other.

>   - backward loading (from addr + 8 first, then from addr) is not slower
> than forward loading (from addr first, then from addr + 8);

It is true for your test, because all of your test data is in cache. I only
said that forward loading is at least as fast as backward loading or faster.
If forward loading can be implemented without compromising anything else, it
is better to use it.

The attached code (test_cachemiss.c) demonstrates that ARM11 is faster when it
is not reading data backwards on cache misses:
testfunc (read forward) sum=-917504  : time=26.432s
testfunc (read backward) sum=-917504 : time=29.581s
testfunc (read forward) sum=-917504  : time=26.391s
testfunc (read backward) sum=-917504 : time=29.494s
testfunc (read forward) sum=-917504  : time=26.388s
testfunc (read backward) sum=-917504 : time=29.480s

You can try running it on your XScale CPU. If the times are the same for
both memory access patterns, it would mean that XScale does not use "critical
word first" cache linefill, and I will not complain about this backwards read
issue anymore :)
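
For readers without the attachment, here is a minimal sketch of the two access
patterns being compared. It is not the attached test_cachemiss.c; the function
names and the 16-byte block layout are assumptions made for illustration. The
volatile qualifier is only there to keep the compiler from reordering the two
loads within a block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: within each 16-byte block, read addr, then addr + 8. */
uint64_t sum_forward(const volatile uint64_t *buf, size_t n_words)
{
    uint64_t sum = 0;
    assert(n_words % 2 == 0);            /* whole 16-byte blocks only */
    for (size_t i = 0; i < n_words; i += 2) {
        uint64_t lo = buf[i];            /* load from addr first */
        uint64_t hi = buf[i + 1];        /* then from addr + 8 */
        sum += lo + hi;
    }
    return sum;
}

/* Hypothetical helper: within each 16-byte block, read addr + 8, then addr. */
uint64_t sum_backward(const volatile uint64_t *buf, size_t n_words)
{
    uint64_t sum = 0;
    assert(n_words % 2 == 0);
    for (size_t i = 0; i < n_words; i += 2) {
        uint64_t hi = buf[i + 1];        /* load from addr + 8 first */
        uint64_t lo = buf[i];            /* then from addr */
        sum += lo + hi;
    }
    return sum;
}
```

To see any cache-miss effect, both functions would have to be timed over a
buffer much larger than the data cache; on a small buffer that stays in cache
they should only differ by noise.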

>   - explicit prefetching with 'pld [addr]' introduces slowdowns both for
>     forward and backward loading (BTW, I want to spent some more time on
>     learning prefetching and other cache tricks);

Yes, because it is not free (takes 1 cycle) and is not needed here (for this
kind of synthetic test). I have no idea if it can help on real video encoding
though.
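
A minimal C sketch of where such a prefetch would sit in a pix_sum-style loop
is shown below. The function name pix_sum_c and the use_prefetch flag are
hypothetical, and __builtin_prefetch is the GCC/Clang builtin, which is only a
hint; on ARM targets that support it, it would normally be expected to become
a pld instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reference: sum of a 16x16 pixel block, optionally
 * prefetching the next row before summing the current one. */
int pix_sum_c(const uint8_t *pix, int line_size, int use_prefetch)
{
    int sum = 0;
    assert(line_size >= 16);
    for (int y = 0; y < 16; y++) {
        if (use_prefetch && y < 15)
            __builtin_prefetch(pix + line_size); /* hint only, no result */
        for (int x = 0; x < 16; x++)
            sum += pix[x];
        pix += line_size;
    }
    return sum;
}
```

The prefetch adds one instruction per row, so when the data is already in
cache it can only cost time, which fits the slowdown reported above.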

>   - WMMX2 pre-increment (wldrd wr0, [%0, %1]!) has no advantage over
> post-increment (wldrd wr0, [%0], %1) and both of them has no advantage over
> convenient explicit add if the latter may be done within wldr stall slot
> (wldrd wr0, [%0], then add %0, %0, %1);

You don't need explicit "add" because you can hide latencies using other
instructions and get much faster code.

>   - unrolling is good in general, but each level of unrolling needs to be
> precisely benchmarked. Note
> pix_sum_iwmmxt__loads_backward_postincrement_wsadb_unroll8 is a bit slower
> than pix_sum_iwmmxt__loads_backward_postincrement_wsadb_unroll4 (I suspect
> this is a nasty (mis?)feature of the core's instruction cache).

That difference is too small; it may be just a measurement error. Also, branch
prediction should work better for the unroll4 case.

For example, we get the following pattern for unroll4:
taken,taken,taken,not-taken,taken,taken,taken,not-taken

For unroll8, the branch pattern would be:
taken,not-taken,taken,not-taken,taken,not-taken,taken,not-taken

A test program that performs a branch prediction simulation is attached as
"branch_sim.c" (algorithm taken from the XScale manual). Running it
produces the following results:

unroll8 (100 simulated runs), total branch cycles=600, avg=3.000
unroll4 (100 simulated runs), total branch cycles=800, avg=2.000

The total branch overhead for unroll4 is still higher (it executes twice as
many branches), but the prediction efficiency is better (2 cycles per branch
instruction vs. 3 cycles). A branch predictor works best when the
probabilities of the "taken" and "not-taken" cases differ greatly.
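
A minimal sketch of this kind of simulation follows. It is not the attached
branch_sim.c. It assumes a 2-bit saturating counter per branch that starts in
the "strongly taken" state, 1 cycle for a correctly predicted branch and 5
cycles for a mispredicted one; the function name and these constants are
assumptions chosen for illustration.

```c
#include <assert.h>

/* Assumed costs, in cycles, per executed branch. */
enum { CORRECT_CYCLES = 1, MISPREDICT_CYCLES = 5 };

/* Hypothetical simulator. pattern[i] is 1 for "taken", 0 for "not-taken".
 * The pattern is replayed 'runs' times; returns total branch cycles. */
int simulate_branches(const int *pattern, int len, int runs)
{
    int state = 3;   /* 0 = strongly not-taken ... 3 = strongly taken */
    int cycles = 0;
    assert(len > 0 && runs >= 0);
    for (int r = 0; r < runs; r++) {
        for (int i = 0; i < len; i++) {
            int predicted_taken = (state >= 2);
            cycles += (predicted_taken == pattern[i]) ? CORRECT_CYCLES
                                                      : MISPREDICT_CYCLES;
            /* Move the saturating counter toward the actual outcome. */
            if (pattern[i]) {
                if (state < 3) state++;
            } else {
                if (state > 0) state--;
            }
        }
    }
    return cycles;
}
```

Under these assumptions, 100 runs of the pattern {1,1,1,0} (unroll4) give 800
cycles over 400 branches, and 100 runs of {1,0} (unroll8) give 600 cycles over
200 branches, which is in line with the averages of 2 and 3 cycles quoted
above.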

Knowing how the branch predictor works also helps when optimizing code. IIRC,
the same branch prediction algorithm was also used in the P1. Modern x86
desktop CPUs have much more advanced (and highly "secret") branch predictors,
but the idea is still the same.

-- 
Best regards,
Siarhei Siamashka
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test_cachemiss.c
Type: text/x-csrc
Size: 2875 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080519/f4e13471/attachment.c>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: branch_sim.c
Type: text/x-csrc
Size: 1444 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080519/f4e13471/attachment-0001.c>