[FFmpeg-devel] Indeo3 replacement, take 3

Mon Nov 2 03:08:34 CET 2009

Vitor Sessak wrote:
> [...]
>>
>> Note that since plane->pixels[] is not aligned, dst is not aligned
>> either. So I'd suggest something along the lines of
>>
>>> typedef struct Plane {
>>>     uint8_t         *buffers[2];
>>>     DECLARE_ALIGNED_16(uint8_t, *pixels[2]); ///< pointer to the actual pixel data of the buffers above
>>>     uint32_t        width;
>>>     uint32_t        height;
>>>     uint32_t        pitch;
>>> } Plane;
>
> Err, scrap that, I see that pixels[] are pointers to av_malloc'ed
> buffers, hence aligned. So no ideas here. Does anyone know the actual
> alignment requirements of dsp.put_no_rnd_pixels_tab? It is documented
> nowhere...
>

Two days with GDB and I've found out what's up! The problem is in the
function "put_pixels16_altivec()" from dsputils_ppc.c and is indeed
caused by wrong alignment, namely:

the AltiVec instruction "stvx vec, offset, addr" always stores a whole
16-byte vector, but it ignores the low four bits of the effective
address, so the store only lands where intended if the address is
aligned on a 16-byte boundary. The cells in indeo3 are only guaranteed
to be 8-byte aligned, therefore only those cells get FULLY COPIED whose
memory locations are aligned on 16-byte boundaries! All others end up
only partially copied!!!
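
To make the addressing rule concrete, here is a minimal portable C
model of such a store. It is only a sketch: emulated_stvx() and
store_displacement() are hypothetical names, not FFmpeg or AltiVec
functions, and the model assumes nothing beyond "the low four address
bits are dropped before the 16 bytes are written".

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of a 16-byte vector store that drops the low four address bits.
 * Hypothetical helper for illustration only. */
static uint8_t *emulated_stvx(uint8_t *addr, const uint8_t vec[16])
{
    uint8_t *ea = (uint8_t *)((uintptr_t)addr & ~(uintptr_t)15);
    memcpy(ea, vec, 16);
    return ea; /* where the data actually landed */
}

/* Stores 16 bytes at (16-byte-aligned base + 16 + offset) and reports
 * how many bytes before the requested address the store landed.
 * 0 means the store went exactly where it was asked to go. */
static unsigned store_displacement(unsigned offset)
{
    _Alignas(16) uint8_t buf[64];
    uint8_t vec[16];
    uint8_t *dst, *ea;
    int i;

    assert(offset <= 32); /* keeps the store inside buf */
    memset(buf, 0, sizeof(buf));
    for (i = 0; i < 16; i++)
        vec[i] = (uint8_t)(i + 1);

    dst = buf + 16 + offset;
    ea  = emulated_stvx(dst, vec);
    return (unsigned)(dst - ea);
}
```

In this model a destination at offset 0 or 16 is hit exactly, while a
destination at offset 8 receives the store 8 bytes too early, so the
last 8 bytes of the intended range are never written.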

I'm observing this behaviour only on PPC with AltiVec. MMX-optimized
code works well because it requires only 8-byte alignment.
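
One possible way to guard against this, sketched in plain C: take the
16-byte path only when both the destination and the stride are 16-byte
aligned, and otherwise fall back to two 8-byte-wide copies. This is an
assumption about how such a check could look, not the code from the
attached patch; copy16_aligned(), copy8() and copy_block16() are
hypothetical stand-ins for the real dsputil routines.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for a routine that needs a 16-byte-aligned destination. */
static void copy16_aligned(uint8_t *dst, const uint8_t *src, int stride, int h)
{
    int y;
    assert(((uintptr_t)dst & 15) == 0 && (stride & 15) == 0);
    for (y = 0; y < h; y++)
        memcpy(dst + y * stride, src + y * stride, 16);
}

/* Stand-in for a routine that has no 16-byte alignment requirement. */
static void copy8(uint8_t *dst, const uint8_t *src, int stride, int h)
{
    int y;
    for (y = 0; y < h; y++)
        memcpy(dst + y * stride, src + y * stride, 8);
}

/* Copy a 16-byte-wide block of h lines.
 * Returns 1 if the 16-byte path was taken, 0 if it fell back to 2 x 8. */
static int copy_block16(uint8_t *dst, const uint8_t *src, int stride, int h)
{
    if ((((uintptr_t)dst | (uintptr_t)stride) & 15) == 0) {
        copy16_aligned(dst, src, stride, h);
        return 1;
    }
    copy8(dst,     src,     stride, h);
    copy8(dst + 8, src + 8, stride, h);
    return 0;
}
```

The check costs one OR and one AND per block, which should be
negligible next to the copy itself.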

I attached a modified patch fixing this problem. Please take a look at
"copy_cell()"...

ANOTHER QUESTION: I found out that the 8xH block copying is done using
32-bit variables in "put_pixels8_c". That copy routine can be optimized
using one pair of FPU lfd and stfd instructions per line, just like MMX.
PPC's FPU is register-based, therefore it should be faster than the
32-bit version, though I haven't tested it. An asm example copying four
8-byte lines at once:

lfd     fp0, src
lfd     fp1, src + linesize
lfd     fp2, src + linesize * 2
lfd     fp3, src + linesize * 3
stfd    fp0, dst
stfd    fp1, dst + linesize
stfd    fp2, dst + linesize * 2
stfd    fp3, dst + linesize * 3

Of course, only copying can be optimized this way, not averaging! Would
such a patch make sense?
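
For comparison, the same idea can be written in portable C, moving each
8-byte line as one 64-bit value instead of two 32-bit ones. This is
only a sketch: copy_8xh_64bit() is a hypothetical name, and whether the
compiler turns the memcpy() pair into a single 64-bit load and store
(through the FPU or otherwise) depends on the target and is an
assumption, not something measured here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy an 8-byte-wide block of h lines, one 64-bit value per line.
 * Going through memcpy() keeps the code free of alignment and aliasing
 * problems and leaves the choice of instructions to the compiler. */
static void copy_8xh_64bit(uint8_t *dst, const uint8_t *src, int linesize, int h)
{
    int y;

    assert(linesize >= 8); /* lines must not overlap */
    for (y = 0; y < h; y++) {
        uint64_t line;
        memcpy(&line, src + y * linesize, sizeof(line));
        memcpy(dst + y * linesize, &line, sizeof(line));
    }
}
```

As with the asm version, this only helps plain copying; averaging still
needs per-byte arithmetic.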

Regards
maxim
-------------- next part --------------
A non-text attachment was scrubbed...
Name: indeo3.c
Type: text/x-csrc
Size: 49267 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20091102/6fb51a5c/attachment.c>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: indeo3data.h
Type: text/x-chdr
Size: 38594 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20091102/6fb51a5c/attachment.h>