[FFmpeg-devel] [RFC] optimize ff_emulated_edge_mc

Sun Jan 9 23:35:49 CET 2011

On Sun, Jan 09, 2011 at 03:51:07PM -0500, Ronald S. Bultje wrote:
> Hi,
> 
> On Mon, Jan 3, 2011 at 9:39 AM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
> > On Mon, Jan 3, 2011 at 8:59 AM, Michael Niedermayer <michaelni at gmx.at> wrote:
> >> On Sun, Jan 02, 2011 at 10:30:50PM -0500, Ronald S. Bultje wrote:
> >>> On Sun, Jan 2, 2011 at 5:59 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
> >>> > On Sun, Jan 02, 2011 at 01:05:43PM -0500, Ronald S. Bultje wrote:
> >>> >> On Thu, Dec 30, 2010 at 5:26 AM, Michael Niedermayer <michaelni at gmx.at> wrote:
> >>> >> > On Wed, Dec 29, 2010 at 10:03:04PM -0500, Ronald S. Bultje wrote:
> >>> >> >> On Wed, Dec 29, 2010 at 8:06 PM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
> >>> >> >> > emu_edge_mc looks optimizable and shows up in my profilings. A simple
> >>> >> >> > loop->memcpy makes things a lot faster already (see attached):
> >>> >> >> [..]
> >>> >> >> > after
> >>> >> >> [..]
> >>> >> >> > 6165 dezicycles in ff_emulated_edge_mc, 1048040 runs, 536 skips
> >>> >> >> > 6115 dezicycles in ff_emulated_edge_mc, 1048044 runs, 532 skips
> >>> >> >> > 6087 dezicycles in ff_emulated_edge_mc, 1048158 runs, 418 skips
> >>> >> >> >
> >>> >> >> > before
> >>> >> >> [..]
> >>> >> >> > 9104 dezicycles in ff_emulated_edge_mc, 1047805 runs, 771 skips
> >>> >> >> > 9131 dezicycles in ff_emulated_edge_mc, 1047866 runs, 710 skips
> >>> >> >> > 9097 dezicycles in ff_emulated_edge_mc, 1047874 runs, 702 skips
> >>> >> >> [..]
> >>> >> >>
> >>> >> >> Another few more changes attached, doing memcpy() on top/bottom edge
> >>> >> >> brings it to 540 cycles:
> >>> >> >>
> >>> >> >> 5414 dezicycles in ff_emulated_edge_mc, 1048331 runs, 245 skips
> >>> >> >>
> >>> >> >> and then reordering the left/right edge loop a little brings it to 520:
> >>> >> >>
> >>> >> >> 5186 dezicycles in ff_emulated_edge_mc, 1048288 runs, 288 skips
> >>> >> >>
> >>> >> >> I'm too lazy to run this multiple times.
> >>> >> >>
> >>> >> >> For the left/right edge fills, I tried using memset(), but that slows
> >>> >> >> it down considerably; it appears the compiler doesn't inline it. Jason said he
> >>> >> >> saw the same on some compilers with the memcpy() trick. Which makes me
> >>> >> >> think, maybe we can emulate the inline memset() trick with some more
> >>> >> >> elaborate C code? What I'm thinking is basically edge_val *=
> >>> >> >> 0x01010101U; while (to_write >= 4) write(edge_val); if (to_write&2)
> >>> >> >> write(edge_val); if (to_write & 1) write(edge_val); or so. Also, since
> >>> >> >> most time is spent in copying the blocks quite literally, the main
> >>> >> >> copy block could certainly use some optimizations, especially since
> >>> >> >> width is generally something like 16...
> >>> >> >>
> >>> >> >> Ronald
> >>> >> >
> >>> >> >>  dsputil.c |   22 ++++++++++------------
> >>> >> >>  1 file changed, 10 insertions(+), 12 deletions(-)
> >>> >> >> 6b5be1a69247178dd53af1f622a49750d231045d  emu_edge_mc.patch
> >>> >> >
> >>> >> > feel free to commit whatever makes ff_emulated_edge_mc() faster
> >>> >>
> >>> >> Attached is a more reviewable version. It contains basically similar
> >>> >> changes as above to the C version, plus I've added the function to
> >>> >> DSPContext and have all decoders use it. It's now (for VP8) down from
> >>> >> >1000 cycles (see above) to ~259 cycles, or 4x as fast as original and
> >>> >> about 2x as fast as the faster C variant in my original post. All this
> >>> >> on a Core i7, Elephants Dream sample on a Macbook Pro / OSX 10.6.
> >>> >>
> >>> >> Here's what it does different than the C version:
> >>> >> - memcpy-style copy of top/bottom edge and body uses movdqu and then
> >>> >> only mov for the remaining 8/4/2/1 bytes
> >>> >> - left/right edge writing decision is made once, and then the loop is
> >>> >> largely branchless - this could be done for the C version also perhaps
> >>> >> - the left/right edges are written two bytes at a time (makes a little
> >>> >> bit of a difference; I tried 4/8 bytes also but that's slower,
> >>> >> probably because we now need to ensure we write the correct amount of
> >>> >> bytes, whereas for 2, we can overwrite by one into the edge pixel
> >>> >> itself and then it doesn't matter). I like how you can mov %al, %ah
> >>> >> without destroying the lower 8 bits; unfortunate that that's not
> >>> >> possible for any other part of the general registers (or xmm/mmx
> >>> >> registers)...
> >>> >
> >>> > mov al,ah with *ax being used afterwards
> >>> > has speed issues on some cpus
> >>> > what about some *mul by 257 ?
> >>>
> >>> That was a lot slower than the mov al, ah. Do you know which CPUs it's
> >>> supposed to be slower at so I can test & compare and possibly set up a
> >>> compile-time variant of both?
> >>
> >> ppro/p2/p3 should have a 5+ cycle stall when writing into part of a register
> >> and then reading the whole, unless the whole was zeroed by xor before.
> >> on later intel cpus this isn't free either, but it is cheaper
> >
> > Does anyone have an SSH slot on a ppro/p2/p3 for me to test this on? :-).
> 
> Ping, anyone? Without an SSH slot to test, I'll want to apply what I
> have, since it's clearly faster on current generations of CPUs (a mul
> was several cycles slower than a mov al, ah).

I can't give you an account, as my gpg key is on that system and I am paranoid,
but I'll benchmark this code a bit
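
For reference, the "inline memset" idea quoted above (replicate the edge
byte with a multiply, then store 4, 2 and 1 bytes as needed) could look
roughly like the sketch below. This is only an illustration, not the code
from the patch: fill_edge() and its parameters are hypothetical names, and
the fixed-size memcpy() calls are used on the assumption that the compiler
turns them into single stores.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of FFmpeg): write n copies of val to dst. */
static void fill_edge(uint8_t *dst, uint8_t val, size_t n)
{
    /* Replicate the byte into every lane: 0xAB -> 0xABABABAB. */
    uint32_t v32 = val * 0x01010101U;
    /* Low half is val * 0x0101, i.e. the "mul by 257" form. */
    uint16_t v16 = (uint16_t)v32;

    /* Bulk of the fill: four bytes per iteration. */
    while (n >= 4) {
        memcpy(dst, &v32, 4);   /* fixed size, alignment-safe store */
        dst += 4;
        n   -= 4;
    }
    /* Tail: at most three bytes remain, handled without a loop. */
    if (n & 2) {
        memcpy(dst, &v16, 2);
        dst += 2;
    }
    if (n & 1)
        *dst = val;
}
```

Unlike the two-bytes-at-a-time variant described in the quoted mail, this
sketch never writes past dst + n, so it does not rely on being allowed to
overwrite the edge pixel. Whether it beats a plain byte loop or memset()
would have to be measured per compiler and CPU.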

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

No human being will ever know the Truth, for even if they happen to say it
by chance, they would not even know they had done so. -- Xenophanes