[MPlayer-dev-eng] vo_gl PBO patch ..

Wed Apr 30 23:48:40 CEST 2008

On Wednesday 30 April 2008 14:44:40 Reimar Döffinger wrote:
> On Wed, Apr 30, 2008 at 02:05:57PM -0600, Sven Gothel wrote:
> > On Wednesday 30 April 2008 12:29:13 Reimar Döffinger wrote:
> > > On Tue, Apr 29, 2008 at 01:44:32PM -0600, Sven Gothel wrote:
> > > > version 4 .. reflecting Reimar's suggestions.
> > > 
> > > I still have some problems understanding the need for most of those
> > > changes.
> > > While it is really ugly for several reasons and I will have to clean it up
> > > once I have a bit more time,
> > Well, we have a different idea what is more complicated and what is not,
> > that's for sure :)
> 
> Well, the criteria I used is simple: my patch adds about 10 non-trivial
> lines of code, yours 50.
> 
> > My patch just adds a little PBO helper tool,
> > to ease the PBO usage, utilizing a bit OO style,
> > so we don't have to bother with all the floating around variables.
> 
> Reducing the number of global variables is a good goal, but doing this
> in a patch that fixes a performance issue makes it really hard to
> understand where the performance issue is, what causes it, and what is
> the simplest way to fix it.

It depends. Proper management for multiple PBO's indicates a little toolkit.

But .. maybe I am completly wrong, possible,
and the whole thing I did is not necessary - we will see.

> 
> > The new feature adds a ways to enable DMA xfers.
> > The first DMA is video -> PBO buffer, utilizing memcpy.
> > The 2nd is the GPU's DMA xfer using the PBO for glTexSubImage2D.
> 
> How do you define DMA? I learned DMA as data transfer between (usually)
> main memory and a peripheral device under the control of the device,
> thus leaving the CPU free.
> To me, the memcpy used in this case seems to fail all these criteria...

DMA can be any memory transfer not handled by the CPU in PIO mode.
On a system, there are many DMA controller available.

The GPU one usually is able to handle:
	GPUMem <-> SystemMem
	GPUMem <-> GPUMem
maybe even (AMD GPU's can do this for sure):
	SystemMem <-> SystemMem

The system memory controller under some architectures is able to do the same,
at least SystemMem <-> SystemMem.
This maybe utilized by memcpy, to my knowledge it is on ia/x86 platforms.

Even though memcpy would be performed in PIO mode,
using PBO's for glTexSubImage ensures at least this one is using DMA.

> 
> > The impact is a performance benefit especially for HD content,
> > from around 80% -> 20%, factor 4, on AMD drivers.
> 
> This only restates what the whole patch achieves, but it does not
> clarify _which parts_ of the patch are relevant for this.
Right, let's drop the details of the patch and
let's discuss the issue itself.

> E.g. the current PBO handling code uses 1 PBO for YV12 mode, your patch
> changes this to use 3. Does this have any effect on speed?
No.
It just ensures, that we setup the PBO only once.
Of course, instead of the 3 (full-size + 2 half-size),
you may create and use the fullsize only, since they are used sequentially.
That would be fine .. IMHO.

> And what about the changes to draw_slice? Do they improve speed?
Yes.
It ensures at least one guaranteed DMA transfer (see above).

> Was that code tested at all? May it actually be slower that the original
> code?

Actually this path is not that critical, at least not with my test cases.
Ie. I do not have HD mpeg content to decode with mpeg12, 
ffmpeg12 doesn't use slicing, I have observed.

Regarding the slicing part, I see that you always copy the whole video frame.
So we may can drop the previous slices/copies and do it only on the whole image,
ie when y+slice==h ?

> This case is less critical since the new code can be disabled, 
Right.

> but 
> having all these changes to code paths that are used in very different
> situations in one single patch is likely to find out where issues are if
> any arise.

I totally agree that we have to test and discuss all of this,
that's why we love code reviews.

And having more test data/information, ie HD codecs, etc,
I like to run a few test with those as well.

However, I believe my patch doesn't change anything fundamental/functional,
and it is optional anyway -> no impact with the default pbodma=0.
The latter stament should be verified first (IMHO),
because it is important that I don't break anything, of course.

And let's drive our performance discussion forward .. 
it's fun to do so.

> 
> > > could you please use attached patch and 
> > > tell me what it misses in functionality compared to yours?
> > 
> > I do.
> > 
> > As far I understand your patch,
> > you perform a memcpy, ie DMA xfer, to shape the mp_image_t, 
> > so you get rid of the GL_UNPACK_ROW_LENGTH trouble.
> > But this should be redundant, since this PBO case hits if you 
> > are already in charge of the texture and video buffer, ie. get_image.
> > 
> > Your patch would still use slow PIO in glTexSubImage,
> > because of the lack of direct PBO in get_image.
> > (see below)
> 
> I'd prefer if you test, because it should be exactly the other way
> round, 
Well, at least I have catched that 'GL_UNPACK_ROW_LENGTH fix' :)

> it should be faster (and with my nVidia card vo CPU usage does 
> drop from 14% to 6%) but it should not fix the GL_UNPACK_ROW_LENGTH
> issue (though it probably is trivial to fix).

I see.
But NVidia's driver is already using DMA xfer's for a simple texture upload,
even without PBO's.

The GL_UNPACK_ROW_LENGTH ATI bug currently only appears for NPOT strides
using PBO's.
Normal texture uploads just works.

> 
> > > Another thing is probably the GL_UNPACK_ROW_LENGTH support you said ATI drivers
> > > are lacking, if this still causes a problems with this patch I'd need a bit
> > > more information on this.
> > 
> > Well, it seems it is simply a bug in the current ATI driver,
> > which enforces us to use multiple memcpy, ie. shape the PBO memory
> > so no NPOT stride happen - even though the driver claims to support NPOT.
> 
> What is NPOT supposed to mean? My only interpretation of NPOT would be
> non-power-of-two 
right. Actually they use it for non power of two textures,
but we get the drift, right.

> but that makes no sense. Do you mean it requires 4-byte 
> alignment or something like that? 
Nope not a n-byte alignment, 
but a required dimension, where each side is = pow(2,n).
You know, the so very old texture requirement.

> In that case, the problem is more 
> likely to be in the GL_UNPACK_ALIGNMENT handling...

that's what I tried to communicate - but only related to PBO xfers.

This problem is only an issue using PBO's for tex uploads
and is 'just' a bug for ATI cards, hence the pbodma option
for multiple memcpy.

The root problem is the slow texture upload on, let's say,
non NVidia cards/driver, which just don't use DMA implicit.

Cheers, Sven

> 
> Greetings,
> Reimar Döffinger