[FFmpeg-devel] [PATCH] JPEG2000: SSE optimisation of DWT decoding
Ivan Kalvachev
ikalvachev at gmail.com
Fri Aug 11 18:32:37 EEST 2017
On 8/10/17, maxime taisant <maximetaisant at hotmail.fr> wrote:
>> From: Ivan Kalvachev <ikalvachev at gmail.com>
>> On 8/8/17, maxime taisant <maximetaisant at hotmail.fr> wrote:
>> > From: Maxime Taisant <maximetaisant at hotmail.fr>
>> >
>> > Hi,
>> >
>> > Here is some SSE optimisations for the dwt function used to decode JPEG2000.
>> > I tested this code by using the time command while reading a JPEG2000
>> > encoded video with ffmpeg and, on average, I observed a 4.05% general
>> > improvement, and a 12.67% improvement on the dwt decoding part alone.
BTW, forgot to tell you that FFmpeg has its own benchmarking macros
that counts the cpu cycles that it takes for a execution of block of C code.
Use it (in .c files) like this
#include "libavutil/timer.h"
{
START_TIMER
function();
STOP_TIMER("function")
}
The functions would output the results to stderr,
and they would use the name you provide to them.
(So you can benchmark more than one thing at a time).
Make sure the function(s) runs a lot per benchmark run.
The macro would show results at log2 measures.
Do more (3 or 5) separate benchmark runs, since
the final results always slightly differs. (function could
be interrupted, and if detected the measurement
would be discarded/skipped).
Also, try the macro without any function inside,
so you have the "NULL" function. The stop_timer
has an instruction fence opcode, that blocks until
all prior microcodes are executed and this could
take a while. Your benchmarks could never get
faster than the NULL function.
>> > In the nasm code, you can notice that the SR1DFLOAT macro appear
>> > twice. One version is called in the nasm code by the HORSD macro and
>> > the other is called in the C code of the dwt function, I couldn't
>> > figure out a way to make only one macro.
>>
>> You want to use the same macro at two locations or you want to have
>> 1 function and "call" it from 2 places?
>>
>> For the former, I'd guess that you might have been getting errors
>> about duplicated labels, since you use the local to the file form instead
>> local to the macro form. aka: ".loop" vs "%%loop".
>
> Currently I have one function declared with "cglobal" and called in the C
> code, and one macro with exactly the same behavior used in the nasm code.
> So I guess I would like to keep only one of the two and call it from both
> places. (Sorry if it's still not clear, English is not my native language).
Now I'm even more confused... ;)
Don't worry, I'm also not native English speaker.
>> > I also couldn't figure out a good way to optimize the VER_SD part, so
>> > that is why I left it unchanged, with just a SSE-optimized version of
>> > the SR_1D_FLOAT function.
>>
>> [...]
>> > +.extend:
>> > + shl i0d, 2
>> > + shl i1d, 2
>> > + mov j0q, i0q
>> > + mov j1q, i1q
>> > + movups m0, [lineq+j0q+4]
>> > + shufps m0, m0, 0x1B
>>
>> The x86inc provides with readable method for the shuffle constant.
>> qXXXX where X is index in the source reg.
>> Using q3210 would generate constant that leaves all elements at their
>> original places.
>> The 0x1B is q0123 , that is swap, isn't it?.
>>
>> Also, minor cosmetic nitpick.
>> usually the first parameters are placed so their commas are vertically
>> aligned.
>> This applies only when the parameter is register (so no jmp labels or []
>> addresses ).
>>
>
> Ok, I will change all that.
>
>> [...]
>> > + ;line{2*i,2*(i+1),2*(i+2),2*(i+3)} -=
>> > F_LFTG_DELTA*(line{2*i-1,2*(i+1)-1,2*(i+2)-1,2*(i+3)-
>> 1}+line{2*i+1,2*(
>> > i+1)+1,2*(i+2)+1,2*(i+3)+1})
>> > + movups m0, [lineq+2*j0q-28]
>> > + movups m4, [lineq+2*j0q-12]
>> > + movups m1, m0
>> > + shufps m0, m4, 0xDD
>> > + shufps m1, m4, 0x88
>>
>> The x86inc provides with a way to emulate 3 operand avx.
>> This means it hides one of the movaps (use 'a' for reg reg).
>> shufps m1, m0, m4, 0x88
>> shufps m0, m4, 0xDD
>
> I know, but I figured that I would do a sse version first and add avx
> support afterwards.
On 8/11/17, maxime taisant <maximetaisant at hotmail.fr> wrote:
>
>> From: Maxime Taisant <maximetaisant at hotmail.fr>
>>
>> > From: Ivan Kalvachev <ikalvachev at gmail.com>
>> >
>> > On 8/8/17, maxime taisant <maximetaisant at hotmail.fr> wrote:
>> > > From: Maxime Taisant <maximetaisant at hotmail.fr>
>> > >
>> > > + movups m2, [lineq+2*j0q-24]
>> > > + movups m5, [lineq+2*j0q-8]
>> > > + shufps m2, m5, 0xDD
>> > > + addps m2, m1
>> > > + mulps m2, m3
>> > > + subps m0, m2
>> > > + movups m4, m1
>> > > + shufps m1, m0, 0x44 ; 0100'0100 q1010
>> > Is that movlhps m1, m0 ?
>>
>> No, this command place the first two values of m1 in the last two
>> doublewords of m1, and the first two values of m0 in the first two
>> doublewords of m1.
>> Movhlps would simply replace the first two values of m1 by the ones
>> of m0.
>>
>
> Ok, so everything I said here is wrong...
> You were right, that IS movlhps m1, m0.
> Will change that, sorry.
I do not insist replacing the shufps with movlhps,
most modern CPU execute both for the same cycles.
(Well, some old AMD cpu do have movlhps faster and
it might execute at different subdomain...
Run some benchmarks.)
Just make sure your shuffles are correct.
Best Regards
More information about the ffmpeg-devel
mailing list