[FFmpeg-devel] [PATCH] Altivec version of-altivec h264_h-v_loop_filter_luma
Guillaume POIRIER
poirierg
Fri May 11 23:18:15 CEST 2007
Hi,
On 5/11/07, Luca Barbato <lu_zero at gentoo.org> wrote:
> Guillaume POIRIER wrote:
> >
> > is that any better?
>
> yes, thank you
>
> > +/* A routine to read an unaligned vector. Thanks for the example code Apple */
> > +static inline vector unsigned char read_unaligned(int offset, uint8_t *src)
>
> I'd move to a common header with a doxy comment.
Ok, that can certainly be done.
> > +#define transpose4x16(r0, r1, r2, r3) { \
> > + register vec_u8_t r4; \
> > + register vec_u8_t r5; \
> > + register vec_u8_t r6; \
> > + register vec_u8_t r7; \
> > + \
> > + r4 = vec_mergeh(r0, r2); /*0, 2 set 0*/ \
> > + r5 = vec_mergel(r0, r2); /*0, 2 set 1*/ \
> > + r6 = vec_mergeh(r1, r3); /*1, 3 set 0*/ \
> > + r7 = vec_mergel(r1, r3); /*1, 3 set 1*/ \
> > + \
> > + r0 = vec_mergeh(r4, r6); /*all set 0*/ \
> > + r1 = vec_mergel(r4, r6); /*all set 1*/ \
> > + r2 = vec_mergeh(r5, r7); /*all set 2*/ \
> > + r3 = vec_mergel(r5, r7); /*all set 3*/ \
> > +}
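For anyone following the merge sequence, here is a scalar C model of the macro that can be checked on any machine. It is only a sketch: `vec16`, `mergeh`, `mergel` and `transpose4x16_model` are hypothetical stand-ins for vec_u8_t, vec_mergeh, vec_mergel and the macro itself, and it assumes big-endian element order (element 0 is the first byte in memory).

```c
#include <stdint.h>

/* Hypothetical stand-in for vec_u8_t: 16 bytes, element 0 first. */
typedef struct { uint8_t b[16]; } vec16;

/* Models vec_mergeh: interleave the first 8 bytes of a and b. */
static vec16 mergeh(vec16 a, vec16 b)
{
    vec16 r;
    for (int i = 0; i < 8; i++) {
        r.b[2 * i]     = a.b[i];
        r.b[2 * i + 1] = b.b[i];
    }
    return r;
}

/* Models vec_mergel: interleave the last 8 bytes of a and b. */
static vec16 mergel(vec16 a, vec16 b)
{
    vec16 r;
    for (int i = 0; i < 8; i++) {
        r.b[2 * i]     = a.b[8 + i];
        r.b[2 * i + 1] = b.b[8 + i];
    }
    return r;
}

/* Same two merge stages as the transpose4x16 macro, applied in place. */
static void transpose4x16_model(vec16 r[4])
{
    vec16 r4 = mergeh(r[0], r[2]);   /* rows 0,2, first half  */
    vec16 r5 = mergel(r[0], r[2]);   /* rows 0,2, second half */
    vec16 r6 = mergeh(r[1], r[3]);   /* rows 1,3, first half  */
    vec16 r7 = mergel(r[1], r[3]);   /* rows 1,3, second half */

    r[0] = mergeh(r4, r6);           /* columns 0..3   */
    r[1] = mergel(r4, r6);           /* columns 4..7   */
    r[2] = mergeh(r5, r7);           /* columns 8..11  */
    r[3] = mergel(r5, r7);           /* columns 12..15 */
}
```

In this model, byte 4*j+i of output vector k holds byte 4*k+j of input row i. Each output vector therefore carries four 4-byte columns, which is the layout write16x4 then scatters at one column per destination row.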
> > +
> > +static inline void write16x4(uint8_t *dst, int dst_stride,
> > + register vec_u8_t r0, register vec_u8_t r1,
> > + register vec_u8_t r2, register vec_u8_t r3) {
> > + DECLARE_ALIGNED_16(unsigned char, result[64]);
> > + uint32_t *src_int = (uint32_t *)result, *dst_int = (uint32_t *)dst;
> > + int int_dst_stride = dst_stride/4;
> > +
> > + vec_st(r0, 0, result);
> > + vec_st(r1, 16, result);
> > + vec_st(r2, 32, result);
> > + vec_st(r3, 48, result);
> > + /* there has to be a better way!!!! */
> > + *dst_int = *src_int;
> > + *(dst_int+ int_dst_stride) = *(src_int + 1);
> > + *(dst_int+ 2*int_dst_stride) = *(src_int + 2);
> > + *(dst_int+ 3*int_dst_stride) = *(src_int + 3);
> > + *(dst_int+ 4*int_dst_stride) = *(src_int + 4);
> > + *(dst_int+ 5*int_dst_stride) = *(src_int + 5);
> > + *(dst_int+ 6*int_dst_stride) = *(src_int + 6);
> > + *(dst_int+ 7*int_dst_stride) = *(src_int + 7);
> > + *(dst_int+ 8*int_dst_stride) = *(src_int + 8);
> > + *(dst_int+ 9*int_dst_stride) = *(src_int + 9);
> > + *(dst_int+10*int_dst_stride) = *(src_int + 10);
> > + *(dst_int+11*int_dst_stride) = *(src_int + 11);
> > + *(dst_int+12*int_dst_stride) = *(src_int + 12);
> > + *(dst_int+13*int_dst_stride) = *(src_int + 13);
> > + *(dst_int+14*int_dst_stride) = *(src_int + 14);
> > + *(dst_int+15*int_dst_stride) = *(src_int + 15);
> > +}
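On the "there has to be a better way" comment: one possible tidy-up of the scalar part is sketched below. It makes no claim to be faster. `write16x4_model` is a hypothetical name, and the 64-byte `result` array stands for the buffer the four vec_st calls fill. Using memcpy also drops the assumptions that dst is 4-byte aligned and that dst_stride is a multiple of 4, which the uint32_t casts and dst_stride/4 rely on.

```c
#include <stdint.h>
#include <string.h>

/* Copies 16 groups of 4 bytes from result to 16 consecutive rows of dst.
 * result holds r0..r3 stored back to back, as the vec_st calls leave it. */
static void write16x4_model(uint8_t *dst, int dst_stride,
                            const uint8_t result[64])
{
    for (int row = 0; row < 16; row++) {
        /* A fixed-size 4-byte memcpy; compilers commonly lower this to a
         * single 32-bit load/store when the target allows it. */
        memcpy(dst + row * dst_stride, result + 4 * row, 4);
    }
}
```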
> > +
> > +/* This function does an 6x16 transpose on data in src, and stores it in dst */
> > +#define readAndTranspose16x6(src, src_stride, r8, r9, r10, r11, r12, r13) {\
>
> won't be possible to factorize something in order to spare some lvsl ?
Maybe, I don't know how.
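One idea, as a sketch only: if src_stride is a multiple of 16 (an assumption, not something this patch guarantees), every row starts at the same offset within its 16-byte block, so the shift that vec_lvsl derives from the address could be computed once and reused for all rows. The scalar C model below uses hypothetical names (`row_misalignment`, `read_rows_one_shift`) and treats buf as starting on a 16-byte boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offset of a row's first byte within its 16-byte block. */
static unsigned row_misalignment(size_t start, size_t stride, int row)
{
    return (unsigned)((start + (size_t)row * stride) & 15);
}

/* Reads `rows` rows of 16 bytes, computing the shift a single time.
 * Assumption: buf is 16-byte aligned and stride is a multiple of 16. */
static void read_rows_one_shift(const uint8_t *buf, size_t start,
                                size_t stride, int rows, uint8_t out[][16])
{
    assert(stride % 16 == 0);
    const size_t shift = start & 15;      /* plays the role of the one lvsl */

    for (int r = 0; r < rows; r++) {
        /* Aligned block holding the row start; it stays aligned only
         * because the stride is a multiple of 16. */
        size_t aligned = start + (size_t)r * stride - shift;
        assert(aligned % 16 == 0);
        memcpy(out[r], buf + aligned + shift, 16);
    }
}
```

With a stride that is not a multiple of 16 the per-row shifts differ, and the permute mask has to be recomputed for each row as the patch does now.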
> > + register vec_u8_t r0 = read_unaligned(0, src);\
> > + register vec_u8_t r1 = read_unaligned( src_stride, src);\
> > + register vec_u8_t r2 = read_unaligned(2* src_stride, src);\
> > + register vec_u8_t r3 = read_unaligned(3* src_stride, src);\
> > + register vec_u8_t r4 = read_unaligned(4* src_stride, src);\
> > + register vec_u8_t r5 = read_unaligned(5* src_stride, src);\
> > + register vec_u8_t r6 = read_unaligned(6* src_stride, src);\
> > + register vec_u8_t r7 = read_unaligned(7* src_stride, src);\
> > + register vec_u8_t r14 = read_unaligned(14*src_stride, src);\
> > + register vec_u8_t r15 = read_unaligned(15*src_stride, src);\
> > + \
> > + r8 = read_unaligned( 8*src_stride, src); \
> > + r9 = read_unaligned( 9*src_stride, src); \
> > + r10 = read_unaligned(10*src_stride, src); \
> > + r11 = read_unaligned(11*src_stride, src); \
> > + r12 = read_unaligned(12*src_stride, src); \
> > + r13 = read_unaligned(13*src_stride, src); \
> > + \
> > +// out: o = |x-y| < a
> > +static inline vec_u8_t diff_lt_altivec (register vec_u8_t x,
> > + register vec_u8_t y,
> > + register vec_u8_t a) {
> > +
>
> There isn't a simpler way?
Maybe, I don't know how.
> > + register vec_u8_t diff = vec_subs(x, y);
> > + register vec_u8_t diffneg = vec_subs(y, x);
> > + register vec_u8_t o = vec_or(diff, diffneg); /* |x-y| */
> > + o = vec_cmplt(o, a);
> > + return o;
> > +}
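To show why OR-ing the two saturating differences gives |x-y|, here is a scalar per-byte model with hypothetical names. At most one of the two unsigned saturating subtractions is non-zero, so the OR picks the absolute difference. The max-minus-min form is a commonly used alternative that would map to vec_max, vec_min and vec_sub. Whether it counts as simpler is a judgment call: it is still three operations.

```c
#include <stdint.h>

/* Models one byte lane of vec_subs on unsigned chars: clamps at 0. */
static uint8_t subs_u8(uint8_t x, uint8_t y)
{
    return (uint8_t)(x > y ? x - y : 0);
}

/* |x-y| as in the patch: one of the two terms is always 0. */
static uint8_t absdiff_or(uint8_t x, uint8_t y)
{
    return (uint8_t)(subs_u8(x, y) | subs_u8(y, x));
}

/* |x-y| as max(x,y) - min(x,y); never underflows. */
static uint8_t absdiff_minmax(uint8_t x, uint8_t y)
{
    uint8_t hi = x > y ? x : y;
    uint8_t lo = x > y ? y : x;
    return (uint8_t)(hi - lo);
}

/* Models one lane of diff_lt_altivec: all ones if |x-y| < a, else 0. */
static uint8_t diff_lt_model(uint8_t x, uint8_t y, uint8_t a)
{
    return absdiff_or(x, y) < a ? 0xFF : 0x00;
}
```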
>
> I'm too tired to read further...
:-(
As I said, I submitted this patch so that PPC users get some
speed-up now, rather than waiting for hypothetical optimal code that
only appears once those of us who work on AltiVec sit down and write it.
I do think it's better to have faster code committed that leaves
room for improvement than the fastest code that never sees the light of day.
Guillaume
--
Rich, you're forgetting one thing here: *everybody* except you is
stupid.
Måns Rullgård