[MPlayer-dev-eng] Re: Should I write Voodoo Banshee VIDIX driver?

Sat Mar 20 01:21:22 CET 2004

Hi Georgi Petrov,

on Fri, 19 Mar 2004 12:45:13 -0800 (PST) you wrote:

> Hurray! One man, who is ready to discuss this with me! Impossible! :)))
> 
> Please, be prepared for many questions in this post, because it's going
> to be loooooong. You're the only man, who can save me may one more week
> guessing what to do! I'll begin with commenting your previous post and
> then continue to ask what I don't exactly understand.
> 
> Please, don't blame me for my English - it's really bad.

np. Mine isn't that good either.

> The specs say, that 2x mode refreshes every pixel twice per clock tick.
> I understand it as double fps compared to 1x. While this is useful for
> some games, it requires computing power from Banshee's hardware, so
> there's not enough left for (the more complex) bilinear filtering.

OK, but then I'd prefer to make that settable from user space.

> 
> >Again, if it work and isn't making any pb (strange side effect
> >afterwards or the like) then a patch is welcome (i couldn't find the
> >one you are writing about). Anyway i'll wait until your driver is ready
> >and then i'll backport all improvment to tdfx_vid ;)
> 
> I doubt there will be any pb, because I've tried it and 1x mode doesn't
> change anything. It just gives our right to see quality resized picture
> :)

Great :)

> Weeeell, then pitch=stride=width * bpp? I've figured the same, but it
> doesn't work. I mean that's the way, but there are many places, where I
> can go wrong.

pitch = stride >= width * bpp (bpp counted in bytes per pixel here)
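
As a rough sketch of that rule (the helper names are mine, they are not
from tdfx_vid.c): the stride is the line size in bytes, rounded up to
whatever alignment the hardware wants, e.g. 4 bytes for a yuy2 overlay.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: smallest stride (in bytes) that holds one line of
 * `width` pixels and is a multiple of `align`. `align` must be a power
 * of two. */
size_t aligned_stride(size_t width, size_t bytes_per_pixel, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    size_t min_stride = width * bytes_per_pixel;
    return (min_stride + align - 1) & ~(align - 1);
}

/* Byte offset of pixel (x, y): lines are `stride` bytes apart, which may
 * be more than width * bytes_per_pixel because of padding. */
size_t pixel_offset(size_t x, size_t y, size_t stride, size_t bytes_per_pixel)
{
    return y * stride + x * bytes_per_pixel;
}
```

If the stride you report does not match the one the overlay is
programmed with, every line is shifted a bit more than the previous one
and the picture comes out skewed.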

> Let's see what I know and PLEASE, correct me where I'm wrong (it would
> be long):
> 
> There are two major colorspaces: RGB and YUV.
> 
> RGB is used from all monitors, TVs, etc. It's Red, Green, Blue, which
> mix together to produce other colors.
> 
> YUV is created from RGB by this way:
> 
> Y=the sum of R+G+B: It's called luminance (spelling?) and played
> standalone represents the gray-scale of the image (I have English
> difficulties here) - I mean like a old color-less TVs.
> 
> U and V are actually one color, substracted from luminance, eg. U=Y-Red
> and V=Y-Blue for example.

Basically, but there are various coefficients, so it's not that
straightforward.
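
To give an idea, here is a sketch with one common set of coefficients
(the BT.601 luma weights with full-range 8-bit output, as used in JFIF).
Y is a weighted sum, and U and V are scaled blue-minus-Y and red-minus-Y
differences. Video usually uses a limited-range variant, so take the
exact numbers and the function names as assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint8_t y, u, v; } yuv_pixel;

/* Round to nearest and clamp to the 0..255 range. */
static uint8_t clamp_u8(double x)
{
    if (x < 0.0) return 0;
    if (x > 255.0) return 255;
    return (uint8_t)(x + 0.5);
}

/* Hypothetical helper: full-range BT.601 RGB -> YUV for one pixel. */
yuv_pixel rgb_to_yuv_bt601_full(uint8_t r, uint8_t g, uint8_t b)
{
    double y = 0.299 * r + 0.587 * g + 0.114 * b; /* weighted sum, not R+G+B */
    yuv_pixel p;
    p.y = clamp_u8(y);
    p.u = clamp_u8(128.0 + 0.564 * (b - y));      /* blue difference */
    p.v = clamp_u8(128.0 + 0.713 * (r - y));      /* red difference */
    return p;
}

/* Usage sketch: a gray pixel carries no color, so U and V sit at 128. */
void check_gray_has_neutral_chroma(void)
{
    yuv_pixel p = rgb_to_yuv_bt601_full(128, 128, 128);
    assert(p.y == 128 && p.u == 128 && p.v == 128);
}
```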

> So, U and V are called chrominance - they represent "the color part" of
> the image. Since human eye is less sensitive to this information, it's
> resolution is halfed by by both horizontal and vertical (is the
> resolution or the depth per pixel halfed?). Then comes this 4:2:2 or
> 4:2:0 thing.
> 
> I understand it so: RGB is 4:4:4, because converted to YUV it uses 4
> bits(bits?) for each Y, U and V part. Then U and V can be "cut", so the
> YV12 format uses 4:2:0. Is it 4 bytes or what and how V part becomes 0?
> As a whole, what's this 4:4:4, 4:2:2 (YUY2) and 4:2:0 (YV12). Ohhh, I'm
> lost...
> 
> I know that YV12 uses 12 bits and YUY2 16. There are also something like
> YUYV and the like. What's that, please help!

You're missing an essential point: packed vs. planar. YV12 is planar but
YUY2 is packed. For YV12 you have the Y plane at full resolution, 8 bits
per pixel, and the U and V planes at half the resolution (half width,
half height). So you end up with an average of 12 bits per pixel.
YUY2 is packed, basically YUYVYUYV..., so Y is also at full resolution
but U and V have half the width. Average 16 bits per pixel.
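
The two averages fall out of the buffer sizes. A small sketch (function
names are mine, and it assumes tightly packed buffers with no stride
padding):

```c
#include <assert.h>
#include <stddef.h>

/* YV12 (planar): one full-size Y plane, then V and U planes that are
 * half the width and half the height, one byte per sample. */
size_t yv12_frame_bytes(size_t w, size_t h)
{
    size_t cw = (w + 1) / 2, ch = (h + 1) / 2;
    return w * h + 2 * cw * ch;
}

/* YUY2 (packed): every pair of pixels is stored as Y0 U Y1 V, 4 bytes. */
size_t yuy2_frame_bytes(size_t w, size_t h)
{
    return ((w + 1) / 2) * 4 * h;
}

/* Usage sketch: 640x480 gives the 12 and 16 bits per pixel quoted above. */
void check_bits_per_pixel(void)
{
    assert(yv12_frame_bytes(640, 480) * 8 / (640 * 480) == 12);
    assert(yuy2_frame_bytes(640, 480) * 8 / (640 * 480) == 16);
}
```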

> Because all codecs use YV12, there would be nice if the video card have
> some buffers, where you can write Y and U planes (for YV12) and V (for
> YUY2). 
> 
> Now comes one of my bigger questions: Banshee's specs say it supports
> only 4:2:2 (YUY2) and 4:1:1 (???) and NO 4:2:0 (YV12) - btw once again
> what's that x:y:z? As from what I understood since there's no hardware
> support for YV12, you have to do software YV12 -> YUY2 conversion and
> then write to Banshee YUY2 data. Is it so?

The Banshee can't use YV12 directly, but it has hardware to convert it
to YUY2. So there is one more step involved, but that is still faster
than any kind of software conversion.
The tdfx_vid_set_yuv function in tdfx_vid.c sets up the converter. It's
used in tdfxfb too.
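
Just to show what the converter has to produce, here is the same
planar-to-packed step written in plain C. This is NOT how the hardware
is programmed (for that look at tdfx_vid_set_yuv), and the function name
and signature are only an illustration. It assumes an even width.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Software illustration of YV12 -> YUY2. Each chroma line is shared by
 * two luma lines, each chroma sample by two horizontal pixels. */
void yv12_to_yuy2(const uint8_t *y, const uint8_t *u, const uint8_t *v,
                  size_t y_stride, size_t c_stride,
                  uint8_t *dst, size_t dst_stride,
                  size_t width, size_t height)
{
    assert(width % 2 == 0);
    for (size_t row = 0; row < height; row++) {
        const uint8_t *yl = y + row * y_stride;
        const uint8_t *ul = u + (row / 2) * c_stride;
        const uint8_t *vl = v + (row / 2) * c_stride;
        uint8_t *out = dst + row * dst_stride;
        for (size_t x = 0; x < width; x += 2) {
            out[2 * x + 0] = yl[x];      /* Y0 */
            out[2 * x + 1] = ul[x / 2];  /* U  */
            out[2 * x + 2] = yl[x + 1];  /* Y1 */
            out[2 * x + 3] = vl[x / 2];  /* V  */
        }
    }
}
```

Doing this per byte on the CPU for every frame is the work the hardware
converter takes over.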

> Ohhhhhh, btw, yes - what's the difference between packed and planar
> format?

See above. You probably want to read DOCS/tech/colorspaces.txt.

> tdfx_vid can cope with YV12, but it's so sloooow, that -vf yuy2 is
> magnitudes faster. Why does tdfx_vid accepts YV12, when Banshee doesn't
> support it? Does volib do some sloooow conversion internally to YV12,
> because -vf yuy2 is really the better variant.

Hmmm. I have read that several times, but for me it has never been the
case. Using xv, tdfxfb or tdfx_vid, -vf yuy2 or -vf scale, hw conversion
was still faster. You are running a Celeron, I was running a K6; that
might be the difference.

> Or Banshee has YV12 -> YUY2 converter, which I don't know about? If it
> has, why this converter is so slooow (2-3 fps)? When tdfx_vid accepts
> YV12, it playes very very slow the movie. -vf yuy2 gives waaaay faster
> results. XVideo also accepts YV12, but it's at normal speed. Why?

I never experienced anything similar. There must be a problem somewhere.

> >YV12 can only be handled using the planar 2 packed converter. It's
> >pretty simple to use. You set the stride and address of your target
> >buffer(where yuy2 data will be writen). Then just write Y, U and then V
> >to the converter address. The converter use a fixed address scheme,
> >each plane is 1MB big and have a stride of 1024 bytes. 
> >Luckily the AGP move function can use different input/output stride so
> >no need for slow loop wich copy line by line :)
> 
> My misunderstanding continues: planar -> packed (what are they)
> converter is in software, right? It's embedded in mplayer as a layer
> between codec and video card - you don't mean that video card converts,
> right?

Read vo_tdfx_vid.c; there is no converter there. It uses the hw thing.

> If it's so, in VIDIX you have no problems, because by rejecting the
> unspported YV12, mplayer automagically converts by software to YUY2 and
> gives me YUY2 data. I don't need to mess with the converter, because it
> converts before the VIDIX stuff and gives me the data already converted.
> That's if you're talking about the mplayer's software converter.
> 
> On the other side, if you're talking about a converter in the video card
> itself(YV12->YUY2), that's something I'm not aware about. It would be
> wonderful. Is it so??? (if it exists, is to slow?)

In my experience it always beats software conversion hands down.

> >This really sound like a stride problem. For the overlay you'll use
> >a buffer you put somewhere in the video mem. You can chosse any
> >stride but you probably want to use the orginal stride so it can
> >be copied "at once". You also have to be carefull as quiet a lot of
> >stuff must be aligned (overlay stride for yuy2 need 4 bytes aligned
> >stride and address for example).
> 
> Ohhhh, noooo, it gets even more complicated...
> 
> I know the overlay address. I mean - all my attempts to set mine failed,
> so now I just read from the videocard where it expects the information
> to begin (the Y, U, V planes) and I just pass a pointer to that memory
> to VIDIX core. Then I must give the VIDIX core more info (dimensions and
> stride) and it begins to write to that memory. All I have to do is give
> it the right info!!! :))) My driver actually doesn't need to do nothing
> more. Just set video card's registers the right way, give a pointer to
> overlay memory and give right stride, colorspace to VIDIX core. From
> then on, my "driver" sleeps and VIDIX core uses Banshee's memory to
> write frame after frame.

Sadly it's not that simple :( The overlay setup is a bit hairy. I
succeeded only because I had the xv driver source as a reference. The
spec sheets are sadly very obscure on the subject.
To set up the overlay you need to set quite a few registers; see in
tdfx_vid.c how it's done. To set the buffer address you must put your
address in the LEFTOVBUF and RIGHTOVBUF registers and then issue SWAP
commands. You can't directly write the address you want to
VIDCUROVRSTART.

> >The bottleneck is really when you transfer data to the card.
> 
> Yes, BUT also by the way you do it :) For example XVideo does one
> unneded memcpy more than required.

I would bet that it does way more than one unneeded memcpy. Anyway,
when you see how complex/bloated the whole shit is, you are only left
wondering how it can still perform that "fast" ;)

> libavcodec decodes to one external provided buffer. Then from this
> buffer into another and then to the video card.

No. If there is no filter and no DR then there is only 1 memcpy
"too much" (actually in most cases you really can't avoid it).
If you can do DR then there are 0 memcpys, as the codec decodes directly
into the vo-provided buffer. However, most codecs need a readable buffer
and reads from video memory are very slow. So you can't (i.e. it's
slower) DR to video memory in most cases.

> The VIDIX architecture
> allows this unneeded memcpy to be saved.

It's the way the video pipeline works which provides this. VIDIX is just
one of the many vos supported.

> That's not memory->video card
> problem, because the speed between them is constant (may be faster with
> AGP memcpy), but that's problem in what happends BEFORE the data is sent
> to video card. You can't do anything to improve RAM->video card speed,
> BUT you can write smarter code, which doesn't copy the same information
> twice in RAM!!! That was my motivation of writing VIDIX - to save one
> memcpy, which occupied my CPU before sending to video card.

AGP transfer makes the difference. Compare tdfxfb and tdfx_vid and see
how it helps ;) Anyway, you will have the exact same number of memcpys as
with tdfxfb. With tdfx_vid it's a bit different because the vo writes to
the AGP mem (either by DR or memcpy) and then it has to do the AGP move.
So it always does one more copy than tdfxfb.

> In XVideo (and tdfx_vid) I think that something's not right. For example
> when I use double buffering with XVideo+Direct Rendering, I get THE
> FASTEST POSSIBLE CONFIGURATION!!! Why does mplayer dropes more frames
> when double buffering is turned off??? I don't know.

Because then only 1 frame (out of 3) uses DR. Use -v and you'll see the
difference.

> Since I saw that I CAN CAN CAN CAN CAN CAN CAN have 23456 -> 67
> framedrops saved from one movie by just turning on XVideo's direct
> rendering+double buffering, I'm sure that there's something that's wrong
> and I CAN ACTUALLY earn more speed - it's software related, not
> hardware.

DR basically saves one memcpy with xv. This is possible because xv gives
us a buffer which is in RAM. It then internally copies that to the card.
DR can't be used with tdfxfb and tdfx_vid because they give memory
which is in video or AGP mem and you can't read from these.
However, be assured that you won't find anything in MPlayer's video
pipeline which does one memcpy too many with the video data.

> >BTW if you don't have the banshee specs, it's probably high time to
> >check that. If you have pb finding then just ask me i can send you the
> >stuff i have.
> 
> Yes, I have them and they have helped me a lot, but they are too
> incomplete for me! I mean - I want examples how to use something - I'm
> not that good to see that register x does y and memory address z is used
> for abc or whatever. It's just me, who's inexperienced :(

Don't worry, these are not really clear to me either.

	Albeu

-- 

Everything is controlled by a small evil group
to which, unfortunately, no one we know belongs.