# [FFmpeg-devel] A further potential optimization of H264 deblocking

Guy Bonneau gbonneau
Tue May 29 19:58:08 CEST 2007

```
While studying the deblocking algorithm of the H.264 specification, I dug
into the FFmpeg MMX implementation to understand how it works. I had a hard
time understanding the optimization, so I worked through the binary math.
While doing the math, I think I may have found a small further optimization.

Let's start from the p0/q0 filter delta (before clipping):

(((q0-p0)<<2) + (p1-q1) + 4) >> 3          (1)

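Since (q0-p0)<<2 and 4 are both multiples of 4, the two least significant
bits of (p1-q1) never produce a carry into the higher bits, so they do not
affect the result and can be dropped (>> is an arithmetic, flooring shift
here). Dividing the remaining terms by 4, we can rewrite the equation as: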

((q0-p0) + ((p1-q1) >> 2) + 1) >> 1          (2)

We have the identity

(a-b) = a + (~b) + 1 - 256          (a and b are unsigned 8-bit values, so ~b = 255 - b)

Thus we can rewrite (2) as:

(((q0+~p0 + 1 - 256)) + ((p1+~q1 + 1 - 256) >> 2) + 1) >> 1

Using PAVGB(a,b) = (a+b+1) >> 1, so that (p1+~q1+1) >> 2 = PAVGB(p1,~q1) >> 1,
we can do some binary math:

((q0+~p0 + 1 - 256) + ((p1+~q1 + 1) >> 2) - 64 + 1) >> 1
((q0+~p0 + 1) + (PAVGB(p1,~q1) >> 1) - 256 - 64 + 1) >> 1
((q0+~p0 + 1) + ((PAVGB(p1,~q1) + 4) >> 1) - 256 - 64 - (4>>1) + 1) >> 1
((q0+~p0 + 1) + ((PAVGB(p1,~q1) + 3 + 1) >> 1) - 256 - 64 - 2 + 1) >> 1
((q0+~p0 + 1) + PAVGB(PAVGB(p1,~q1), 3) - 256 - 64 - 2 + 1) >> 1
(((q0+~p0 + 1) + PAVGB(PAVGB(p1,~q1), 3) + 1) >> 1) - 128 - 33

In the last step, -256 - 64 - 2 + 1 = +1 - 322, and since 322 is even it
comes out of the shift as -161 = -128 - 33.

Let val1 = PAVGB(PAVGB(p1,~q1), 3)
Let val2 = q0 + ~p0 + 1          (note that val2 ranges from 1 to 511)

Then (2) becomes

((val1 + val2 + 1) >> 1) - 161

that is,

PAVGB(q0+~p0+1, PAVGB(PAVGB(p1,~q1), 3)) - 161

I think it is possible to rewrite the code as:

/* in: mm0=p1, mm1=p0, mm2=q0, mm3=q1, mm7=tc */
#define H264_DEBLOCK_P0_Q0(pb_01, pb_3f)\
"pcmpeqb %%mm4              , %%mm4 \n\t"\
"pxor    %%mm4              , %%mm3 \n\t"\
"pavgb   %%mm0              , %%mm3 \n\t" /* (p1 - q1 + 256)>>1*/\
"pavgb   "MANGLE(ff_pb_3)"  , %%mm3 \n\t" /*(((p1 - q1 + 256)>>1)+4)>>1 = 64+2+(p1-q1)>>2*/\
"pxor    %%mm1              , %%mm4 \n\t" /*~p0*/\
"paddb   %%mm2              , %%mm4 \n\t" /* (q0 + ~p0 + 1)*/\
"pavgb   %%mm4              , %%mm3 \n\t" /* d+128+33*/\
"movq    "MANGLE(ff_pb_A1)" , %%mm6 \n\t"\
"psubusb %%mm3              , %%mm6 \n\t"\
"psubusb "MANGLE(ff_pb_A1)" , %%mm3 \n\t"\
"pminub  %%mm7              , %%mm6 \n\t"\
"pminub  %%mm7              , %%mm3 \n\t"\
"psubusb %%mm6              , %%mm1 \n\t"\
"psubusb %%mm3              , %%mm2 \n\t"\

That is a saving of 3 instructions out of 20.

Unfortunately, I cannot try it myself since I'm using Windows, and this is
purely theoretical. If someone can test it, I would be interested to know
whether it works or whether I missed something.

Thanks
Guy Bonneau

```
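The step from (1) to (2) and the two's-complement identity used above can be checked exhaustively in plain C, without MMX. This is a minimal sketch, not FFmpeg code: `floor_shr`, `delta_eq1`, `delta_eq2`, `eq1_equals_eq2` and `complement_identity_holds` are hypothetical helpers. It assumes that >> in the derivation means an arithmetic (flooring) shift, and writes that shift out explicitly because right-shifting a negative int is implementation-defined in C.

```c
#include <assert.h>

/* floor(x / 2^n): the arithmetic (flooring) right shift assumed in (1)
 * and (2), written out because >> on a negative int is
 * implementation-defined in C. */
static int floor_shr(int x, int n)
{
    assert(n > 0 && n < 16);
    return x >= 0 ? x >> n : -((-x + (1 << n) - 1) >> n);
}

/* Equation (1) in terms of dq = q0 - p0 and dp = p1 - q1.
 * dq * 4 is used instead of dq << 2, since left-shifting a negative
 * int is undefined behaviour in C. */
static int delta_eq1(int dq, int dp)
{
    return floor_shr(dq * 4 + dp + 4, 3);
}

/* Equation (2): the two low bits of dp dropped. */
static int delta_eq2(int dq, int dp)
{
    return floor_shr(dq + floor_shr(dp, 2) + 1, 1);
}

/* Returns 1 if (1) and (2) agree for every dq, dp in [-255, 255]. */
static int eq1_equals_eq2(void)
{
    for (int dq = -255; dq <= 255; dq++)
        for (int dp = -255; dp <= 255; dp++)
            if (delta_eq1(dq, dp) != delta_eq2(dq, dp))
                return 0;
    return 1;
}

/* Returns 1 if a - b == a + ~b + 1 - 256 for all bytes a, b,
 * with ~b taken as an 8-bit complement (255 - b). */
static int complement_identity_holds(void)
{
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            if (a - b != a + (~b & 0xFF) + 1 - 256)
                return 0;
    return 1;
}
```

Looping over dq = q0 - p0 and dp = p1 - q1 in [-255, 255] is enough, because (1) and (2) depend on the pixels only through those two differences.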
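The PAVGB chain that ends in PAVGB(val2, val1) - 161 can be checked the same way. The sketch below is again hypothetical (the helper names are illustrative only). It evaluates PAVGB(a,b) = (a+b+1) >> 1 in full-width int, so val2 = q0 + ~p0 + 1, which can reach 511, is not truncated, and it compares the result with equation (1) for every pair of differences.

```c
#include <assert.h>

/* PAVGB rounding average, (a + b + 1) >> 1, evaluated in int so that it
 * also accepts val2, which does not fit in a byte. a, b >= 0 here. */
static int pavgb(int a, int b)
{
    return (a + b + 1) >> 1;
}

/* floor(x / 2^n) without relying on >> of negative ints. */
static int floor_shr(int x, int n)
{
    assert(n > 0 && n < 16);
    return x >= 0 ? x >> n : -((-x + (1 << n) - 1) >> n);
}

/* Reference: equation (1). */
static int delta_ref(int p1, int p0, int q0, int q1)
{
    return floor_shr((q0 - p0) * 4 + (p1 - q1) + 4, 3);
}

/* Final form: PAVGB(val2, val1) - 161, in full-width int. */
static int delta_final(int p1, int p0, int q0, int q1)
{
    int val1 = pavgb(pavgb(p1, 255 - q1), 3); /* 255 - q1 is ~q1 as a byte */
    int val2 = q0 + (255 - p0) + 1;           /* q0 + ~p0 + 1, range 1..511 */
    return pavgb(val2, val1) - 161;
}

/* Both sides depend only on q0 - p0 and p1 - q1, so one pixel
 * quadruple per pair of differences is enough. */
static int final_form_matches(void)
{
    for (int dq = -255; dq <= 255; dq++) {
        int p0 = dq < 0 ? -dq : 0, q0 = p0 + dq;
        for (int dp = -255; dp <= 255; dp++) {
            int q1 = dp < 0 ? -dp : 0, p1 = q1 + dp;
            if (delta_final(p1, p0, q0, q1) != delta_ref(p1, p0, q0, q1))
                return 0;
        }
    }
    return 1;
}
```

With this full-width evaluation, d + 161 ranges from 2 (d = -159) to 320 (d = 159), so the upper part of that range does not fit in a byte.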
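The message asks whether the proposed sequence works, and the author cannot run it. One portable way to check it is a scalar C model of a single byte lane. The sketch below is hypothetical (`pavgb`, `paddb`, `proposed_delta_lane`, `delta_ref` and `count_mismatches` are not FFmpeg functions). It assumes mm0=p1, mm1=p0, mm2=q0, mm3=q1, as implied by the macro's comments, models PADDB as addition modulo 256 and PAVGB as (a+b+1) >> 1, and stops at the value commented "d+128+33". It only compares inputs where d + 161 fits in a byte.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar models of one byte lane of the MMX instructions used. */
static uint8_t pavgb(uint8_t a, uint8_t b) { return (uint8_t)((a + b + 1) >> 1); }
static uint8_t paddb(uint8_t a, uint8_t b) { return (uint8_t)(a + b); } /* wraps mod 256 */

/* floor(x / 2^n) without relying on >> of negative ints. */
static int floor_shr(int x, int n)
{
    assert(n > 0 && n < 16);
    return x >= 0 ? x >> n : -((-x + (1 << n) - 1) >> n);
}

/* Reference: equation (1). */
static int delta_ref(int p1, int p0, int q0, int q1)
{
    return floor_shr((q0 - p0) * 4 + (p1 - q1) + 4, 3);
}

/* One lane of the proposed macro up to the "d+128+33" value,
 * returned minus 0xA1 (161). */
static int proposed_delta_lane(uint8_t p1, uint8_t p0, uint8_t q0, uint8_t q1)
{
    uint8_t mm4 = 0xFF;                 /* pcmpeqb %%mm4, %%mm4      */
    uint8_t mm3 = (uint8_t)(q1 ^ mm4);  /* pxor    %%mm4, %%mm3: ~q1 */
    mm3 = pavgb(mm3, p1);               /* pavgb   %%mm0, %%mm3      */
    mm3 = pavgb(mm3, 3);                /* pavgb   ff_pb_3, %%mm3    */
    mm4 = (uint8_t)(p0 ^ mm4);          /* pxor    %%mm1, %%mm4: ~p0 */
    mm4 = paddb(mm4, q0);               /* paddb   %%mm2, %%mm4      */
    mm3 = pavgb(mm3, mm4);              /* pavgb   %%mm4, %%mm3      */
    return (int)mm3 - 0xA1;
}

/* Counts difference pairs where the lane model disagrees with (1),
 * skipping cases where d + 161 would not fit in a byte anyway. */
static int count_mismatches(void)
{
    int bad = 0;
    for (int dq = -255; dq <= 255; dq++) {
        int p0 = dq < 0 ? -dq : 0, q0 = p0 + dq;
        for (int dp = -255; dp <= 255; dp++) {
            int q1 = dp < 0 ? -dp : 0, p1 = q1 + dp;
            int ref = delta_ref(p1, p0, q0, q1);
            if (ref + 161 > 255)
                continue;
            if (proposed_delta_lane((uint8_t)p1, (uint8_t)p0,
                                    (uint8_t)q0, (uint8_t)q1) != ref)
                bad++;
        }
    }
    return bad;
}
```

For example, with p1 = p0 = q1 = 0 and q0 = 1, equation (1) gives 1 but the model returns -128: paddb computes q0 + ~p0 = 256, which wraps to 0, and the +1 of val2 is not added anywhere in the sequence. Since val2 = q0 + ~p0 + 1 ranges from 1 to 511, it cannot be held in a single byte as the derivation requires.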