[Ffmpeg-devel] patch: altivec optimizations for h264 decoder
Mon Feb 6 19:56:58 CET 2006
Romain Dolbeau wrote:
> They probably do. It would be interesting to know what OS and compiler
> the author of the patches used (I don't have linux/ppc anymore).
I am using Mac OS X (Darwin Kernel Version 7.9.0) with gcc-3.3.3 on a G5.
> Patch 1 : nothing to add, except that gcc register allocator is probably
> going to hate ff_h264_idct_add_altivec_mat
Why do you think so? This algorithm uses more instructions than the
factorized matrix implemented in the C version, but it can take more
advantage of the AltiVec instructions by reducing the data
reorganization (matrix transposes and so on).
> Patch 2 : in PREFIX_h264_qpel4_hv_lowpass_altivec, why use
> VEC_LOAD_UNALIGNED_CHECK ? tmpbis is computed from tmp
> (comments -> assumed aligned) and tmpStride (comments ->
> multiple of 16), so it has to be aligned.
Well, the problem here is the h264_qpel4_mc22_altivec function, which
passes 4 as the stride of the tmp array to qpel4_hv_lowpass_altivec.
Because of that, I have to check and align the data when loading the
temporary results in the second part of h264_qpel4_hv_lowpass_altivec.
I agree with you that this is a lot of overhead. One way to eliminate
it is to change h264_qpel4_mc22_altivec so that it always passes 8 as
the stride of the tmp array, and to change the size of that array
accordingly. I think this stride can be 8 (to a pointer to vector
signed short) for all the mc22 functions: qpel16_mc22, qpel8_mc22 and
qpel4_mc22. That way there will be no alignment problems.
OPNAME ## h264_qpel ## SIZE ## _hv_lowpass_ ## CODETYPE(dst, tmp,
src, stride, SIZE, stride);
---> change the "SIZE" argument here to "8".
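
To illustrate the alignment argument, here is a minimal sketch. It
assumes the tmp buffer starts on a 16-byte boundary and that the stride
counts int16_t elements; how the stride is actually counted in the
patch is not confirmed here. The function name rows_are_vector_aligned
is hypothetical and is not part of the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the patch.
 * Assumes the base of the tmp buffer is 16-byte aligned and that
 * stride_elems counts int16_t elements.
 * Returns 1 if every row start is 16-byte aligned, 0 otherwise. */
int rows_are_vector_aligned(size_t stride_elems, size_t rows)
{
    for (size_t r = 0; r < rows; r++) {
        size_t byte_offset = r * stride_elems * sizeof(int16_t);
        if (byte_offset % 16 != 0)
            return 0;   /* this row would need an unaligned load */
    }
    return 1;
}
```

Under these assumptions a stride of 4 puts every odd row 8 bytes past a
vector boundary, while a stride of 8 makes each row exactly one 16-byte
vector, so every row start stays aligned.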
> Patch 4 : is put_pixels8_altivec really faster than the C
> version ? there's not computation whatsoever, and with the
> need to load the destination block to insert the new
> data, it may be slower to use AltiVec than regular C code.
I have not tested this; I only added put_pixels8_altivec because
put_h264_qpel8_mc00_altivec requires it. It may be slower than the C
version, I'm not sure. I am going to analyze this in more depth.
BTW, I was trying to implement put_pixels16_l2_altivec and
put_pixels8_l2_altivec using the vec_avg instruction, but I always found
evident artifacts in the resulting videos. Do you have any clue about that?
I think it is possible to achieve more speed-up by implementing
those functions in AltiVec.
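
One way to narrow down such artifacts is to compare the vector output
against a scalar reference. The sketch below is only that: the names
rnd_avg_u8 and ref_put_pixels_l2 and the parameter list are
hypothetical, and it does not diagnose the problem described above. It
relies on one known fact: on vector unsigned char, vec_avg computes the
rounded average (a + b + 1) >> 1 for each element.

```c
#include <stdint.h>

/* Rounded average of two unsigned bytes: (a + b + 1) >> 1.
 * This is the per-element result of vec_avg on vector unsigned char. */
uint8_t rnd_avg_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + (unsigned)b + 1u) >> 1);
}

/* Hypothetical scalar reference: averages two source blocks into dst.
 * Each block has its own stride, in bytes. */
void ref_put_pixels_l2(uint8_t *dst,
                       const uint8_t *src1, const uint8_t *src2,
                       int dst_stride, int src1_stride, int src2_stride,
                       int width, int height)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++)
            dst[x] = rnd_avg_u8(src1[x], src2[x]);
        dst  += dst_stride;
        src1 += src1_stride;
        src2 += src2_stride;
    }
}
```

Running both versions on the same blocks and comparing byte by byte
shows whether the differences come from the arithmetic or from the way
the sources and the destination are loaded, aligned and stored.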
Thanks for your comments.