[FFmpeg-devel] [PATCH 2/2] Add hflip filter.

Mon Aug 16 15:25:59 CEST 2010

"Ronald S. Bultje" <rsbultje at gmail.com> writes:

> Hi,
>
> On Thu, Aug 12, 2010 at 2:35 PM, Stefano Sabatini
> <stefano.sabatini-lala at poste.it> wrote:
>> On date Thursday 2010-08-12 12:49:25 -0400, Ronald S. Bultje encoded:
>>> On Thu, Aug 12, 2010 at 12:39 PM, Stefano Sabatini
>>> <stefano.sabatini-lala at poste.it> wrote:
>>> > On date Wednesday 2010-08-04 14:23:49 +0200, Michael Niedermayer encoded:
>>> >> On Sat, Jul 31, 2010 at 02:07:29AM +0200, Stefano Sabatini wrote:
>>> > [...]
>>> >> > +static void draw_slice(AVFilterLink *inlink, int y, int h, int slice_dir)
>>> >> > +{
>>> >> > +    FlipContext *flip = inlink->dst->priv;
>>> >> > +    AVFilterPicRef *inpic  = inlink->cur_pic;
>>> >> > +    AVFilterPicRef *outpic = inlink->dst->outputs[0]->outpic;
>>> >> > +    uint8_t *inrow, *outrow;
>>> >> > +    int i, j, plane, step, hsub, vsub;
>>> >> > +
>>> >> > +    for (plane = 0; plane < 4 && inpic->data[plane]; plane++) {
>>> >> > +        step = flip->max_step[plane];
>>> >> > +        hsub = (plane == 1 || plane == 2) ? flip->hsub : 0;
>>> >> > +        vsub = (plane == 1 || plane == 2) ? flip->vsub : 0;
>>> >> > +
>>> >> > +        outrow = outpic->data[plane] + (y>>vsub) * outpic->linesize[plane];
>>> >> > +        inrow  = inpic ->data[plane] + (y>>vsub) * inpic ->linesize[plane] + ((inlink->w >> hsub) - 1) * step;
>>> >> > +        for (i = 0; i < h>>vsub; i++) {
>>> >> > +            for (j = 0; j < (inlink->w >> hsub); j++)
>>> >> > +                memcpy(outrow + j*step, inrow - j*step, step);
>>> >>
>>> >> variable length memcpy on a per pixel basis is slow
>>> >
>>> > Updated.
>>> >
>>> > I didn't manage to understand how bswap/dsputils may be used, I don't
>>> > know if that would improve it.
>>>
>>> You could create a VideoFilterDSPContext (or a
>>> HFlipVideoFilterDSPContext), add a function hflip to it, and then any
>>> one of us could optimize it. E.g. for RGBA32, where step is probably
>>> 4, we would read it as 8/16-bytes-at-once, flip them using e.g. pshufw
>>> or something, (do the same for the opposite pixels at the end of the
>>> row, ) and then write them out again -> you just did 2x 2/4 pixels at
>>> once. By using multiple registers and making sure there's enough
>>> padding (which I think is always the case), this'd get even faster,
>>> also because for at least the left read/write, we can use aligned r/w
>>> which is faster.
>>>
>>> Not sure if that's what Michael meant, but I guess it's sort of in the
>>> right direction.
>>
>> OK I see thanks, I suggest anyway to commit this simple variant, and
>> then work on the optimizations.
> [..]
>> +            case 3:
>> +            {
>> +                uint8_t *in  =  inrow;
>> +                uint8_t *out = outrow;
>> +                for (j = 0; j < (inlink->w >> hsub); j++, out += 3, in -= 3) {
>> +                    out[0] = in[0];
>> +                    out[1] = in[1];
>> +                    out[2] = in[2];
>> +                }
>> +            }
>> +            break;
>
> You can use a uint16_t + uint8_t write here instead of 3 uint8_t writes.

Better still, use AV_[RW]B24() or the bytestream macros, which will in
theory do the right thing.
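
Roughly something like the sketch below for the 3-byte case. This is
only an illustration, not a tested patch: RB24()/WB24() are local
stand-ins for AV_RB24()/AV_WB24() from libavutil/intreadwrite.h so
that it compiles on its own, and hflip_row24() is a made-up helper
name, not something in the tree.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for AV_RB24()/AV_WB24() (libavutil/intreadwrite.h),
 * defined here only so the sketch is self-contained. */
#define RB24(p) (((uint32_t)((const uint8_t *)(p))[0] << 16) | \
                 ((uint32_t)((const uint8_t *)(p))[1] <<  8) | \
                  (uint32_t)((const uint8_t *)(p))[2])

#define WB24(p, v) do {                              \
        uint32_t v_ = (v);                           \
        ((uint8_t *)(p))[0] = (uint8_t)(v_ >> 16);   \
        ((uint8_t *)(p))[1] = (uint8_t)(v_ >>  8);   \
        ((uint8_t *)(p))[2] = (uint8_t) v_;          \
    } while (0)

/* Hypothetical helper: mirror one row of w packed 3-byte pixels.
 * src and dst must not overlap. */
static void hflip_row24(uint8_t *dst, const uint8_t *src, int w)
{
    int j;

    assert(w > 0);
    for (j = 0; j < w; j++) {
        /* Output pixel j is input pixel (w - 1 - j), moved as one
         * 24-bit unit instead of three separate byte copies. */
        WB24(dst + 3 * j, RB24(src + 3 * (w - 1 - j)));
    }
}
```

In the filter, the body of the "case 3:" loop would then reduce to a
single AV_WB24(out, AV_RB24(in)) per pixel, leaving it to the macros
to pick the best access pattern for the target.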

-- 
Måns Rullgård
mans at mansr.com