[MPlayer-dev-eng] Re: Should I write Voodoo Banshee VIDIX driver?

Fri Mar 19 21:45:13 CET 2004

Hurray! Someone who is ready to discuss this with me! Impossible! :)))

Please be prepared for many questions in this post, because it's going to be
loooooong. You're the only one who can save me maybe one more week of guessing
what to do! I'll begin by commenting on your previous post and then continue
with what I don't exactly understand.

Please don't blame me for my English - it's really bad.

>the expand filter can draw the osd, just RTFM ;) Most vo just duplicate
>this code as they have been writen before libmpcodecs (and all it's
>filters). I was also hoping to be able to use some hw functions to have
>accelerated OSD :) I attempted to use some 3D stuff of the card but i
>failed miserably. That's probably possible but not while X is running
>(unless you disable all GL/drm stuff perhaps but i doubt).
>It's probably possible to use the blitter if you don't care of the alpha.
>Could be a good idea as using the osd functions would mean reading from
>video (or agp) mem and it's sloooooow :(

This was interesting, I didn't know it. Your attempt at accelerated subtitles
was a really good idea :) I think the same way - I want to do EVERYTHING I can
to unleash the Banshee's full potential. Well, I can't do AGP memcpy, but let's
see what will happen with the rest. I know, however, that bandwidth is the main
limitation :(

In the process of developing this driver I'll learn a lot, and that's my second
goal. My first is to watch MPEG4 movies without framedrops on my Celeron/300
MHz :(

>Wich i think is/was the correct thing to do as long as don't know
>exactly know what this 1x/2x mode do and if you can just set any time.

The specs say that 2x mode refreshes every pixel twice per clock tick. I
understand it as double the fps compared to 1x. While this is useful for some
games, it requires computing power from the Banshee's hardware, so there's not
enough left for (the more complex) bilinear filtering.

That's why bilinear isn't available in 2x mode. The solution? Turn 2x mode off
- you don't need faster refreshes of what's in video memory, because its
contents change more slowly (25/30 fps). 1x handles this frame rate very well.

Locate the following code in tdfx_vid.c: (comments are mine)

if (!(vidcfg | (1<<26))) //Do we have 1x or 2x already set?
   vidcfg |= (3<<16); //We have 1x => turn bilinear on
else
   vidcfg &= ~(3<<16); //We have 2x => turn bilinear off

It only checks the 1x/2x mode, it doesn't set it! I believe this is a mistake.
The code must be changed this way:

vidcfg &= ~(1<<26); //Turns 2X mode off (this allows bilinear filtering)
vidcfg |= (3<<16);  //Turns bilinear filtering on
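
To make the intent testable outside the driver, here is a minimal sketch of the
same two steps as a pure function. The macro and function names are my own
invention (they are not taken from tdfx_vid.c); only the bit positions (bit 26
for 2x mode, bits 16-17 for the filter) come from the snippet above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; the bit positions are the ones used in this mail. */
#define VIDCFG_2X_MODE     (1u << 26)  /* set = 2x mode, clear = 1x mode */
#define VIDCFG_FILTER_BITS (3u << 16)  /* both bits set = bilinear filtering */

/* Return the register value with 2x mode off and bilinear on.
 * All other bits are passed through unchanged. */
static uint32_t vidcfg_force_bilinear(uint32_t vidcfg)
{
    vidcfg &= ~VIDCFG_2X_MODE;     /* switch to 1x, whatever the mode was */
    vidcfg |= VIDCFG_FILTER_BITS;  /* then turn bilinear filtering on */

    assert((vidcfg & VIDCFG_2X_MODE) == 0);
    return vidcfg;
}
```

The point of the sketch is that the result does not depend on whether the card
was in 1x or 2x mode beforehand.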

We must not care whether the card is in 1x or 2x mode. What is important is to
switch to 1x (no need for 2x at 25/30 fps) and then turn bilinear on.

Trust me - I've tried it and that's the way :) Otherwise 2x is on by default,
and by simply checking it, bilinear will never be turned on. And, of course, we
don't need 2x :)

Well, I'm a little bit repetitive - I said one thing many times. Probably you
understood me one page earlier :)

>Again, if it work and isn't making any pb (strange side effect afterwards
>or the like) then a patch is welcome (i couldn't find the one you are
>writing about). Anyway i'll wait until your driver is ready and then i'll
>backport all improvment to tdfx_vid ;)

I doubt there will be any pb, because I've tried it and 1x mode doesn't change
anything. It just gives us the right to see a quality resized picture :)

When I finish my driver, it will be well documented (with a lot of comments),
so it will be easy to understand. However, I doubt you'll see anything useful
in there - it's actually a very, very easy piece of software. I program it by
looking at the other VIDIX drivers, and all of them have 95% in common. Your
design is way more complex than mine - I just follow the VIDIX specification
(can it be called that? :) I only give some addresses, set some registers and
then the VIDIX core does the hard work for me. It's a piece of cake :) The
problem is that I've only once messed with direct commands to hardware (it was
a CD-ROM), and that kind of programming is kind of "away" from me :)

>Well if doesn't display anything it's not really fair ;) But you better do
>speed benchmark (-nosound -benchmark) to compare. You can find those i did
>after writing tdfx_vid here:
>http://www1.mplayerhq.hu/pipermail/mplayer-dev-eng/2003-March/016971.html

Yes, it's so :) Actually it displays *something* - I can guess what the video
is, but it's far from complete. Here I have the following idea: I give VIDIX
only one address to write to and some dimensions, and leave the work to it.
Whether I've set everything right doesn't affect the speed of writing to this
memory. Only wrong (smaller) dimensions of what has to be written can result in
(false, of course) higher throughput. I'll surely make real benchmarks when
it's done.

You're right - now it's not fair ;)

>pitch and stride are the same thing. The distance in bytes betwen 2
>consecutive lines.
>The banshee mem layout is pretty simple. Basicaly there only 2 place where
>you write data. The framebuffer where packed data can be handled and the
>planar 2 packed converter (yv12 to yuy2).

Weeeell, then pitch=stride=width * bpp? I figured the same, but it doesn't
work. I mean that's the way, but there are many places where I can go wrong.

Let's see what I know and PLEASE correct me where I'm wrong (it will be
long):

There are two major colorspaces: RGB and YUV.

RGB is used by all monitors, TVs, etc. It's Red, Green and Blue, which mix
together to produce other colors.

YUV is created from RGB this way:

Y=the sum of R+G+B: it's called luminance and, shown on its own, it represents
the grayscale of the image - I mean like an old black-and-white TV.

U and V are each actually one color subtracted from the luminance, e.g.
U=Y-Red and V=Y-Blue.
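
As far as I can tell from the usual ITU-R BT.601 definition, my description is
only roughly right: Y is a weighted sum (green counts most), and the two color
parts are scaled differences B-Y and R-Y. A minimal sketch, assuming the
full-range 8-bit variant of the formula (the one used by JPEG); the struct and
function names are made up for illustration and have nothing to do with
mplayer's own converters:

```c
#include <assert.h>
#include <stdint.h>

struct ycbcr { uint8_t y, cb, cr; };

/* Round to nearest and clamp into the 0..255 range. */
static uint8_t clamp_u8(double v)
{
    if (v < 0.0)   return 0;
    if (v > 255.0) return 255;
    return (uint8_t)(v + 0.5);
}

/* Full-range BT.601 RGB -> YCbCr (an assumption; video often uses the
 * limited 16..235 range instead). */
static struct ycbcr rgb_to_ycbcr_bt601(uint8_t r, uint8_t g, uint8_t b)
{
    struct ycbcr out;
    out.y  = clamp_u8(        0.299    * r + 0.587    * g + 0.114    * b);
    out.cb = clamp_u8(128.0 - 0.168736 * r - 0.331264 * g + 0.5      * b);
    out.cr = clamp_u8(128.0 + 0.5      * r - 0.418688 * g - 0.081312 * b);
    return out;
}
```

Any gray input (R=G=B) gives Cb=Cr=128, which matches the idea that the Y
plane alone is the black-and-white picture.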

So U and V are called chrominance - they represent "the color part" of the
image. Since the human eye is less sensitive to this information, its
resolution is halved both horizontally and vertically (is it the resolution or
the depth per pixel that is halved?). Then comes this 4:2:2 or 4:2:0 thing.

I understand it so: RGB is 4:4:4, because converted to YUV it uses 4 bits
(bits?) for each of the Y, U and V parts. Then U and V can be "cut", so the
YV12 format uses 4:2:0. Is it 4 bytes or what, and how does the V part become
0? As a whole, what are these 4:4:4, 4:2:2 (YUY2) and 4:2:0 (YV12)? Ohhh, I'm
lost...

I know that YV12 uses 12 bits per pixel and YUY2 16. There is also something
like YUYV and the like. What's that, please help!
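
The 12 and 16 bits per pixel can at least be checked with arithmetic. A minimal
sketch, assuming the usual layouts: YV12 is planar 4:2:0 (a full-size Y plane
plus V and U planes at half width and half height), and YUY2 is packed 4:2:2
(two pixels share one U and one V, stored as Y0 U Y1 V). The function names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* YV12: w*h luma bytes plus two chroma planes of (w/2)*(h/2) bytes each. */
static size_t yv12_frame_bytes(size_t w, size_t h)
{
    assert(w % 2 == 0 && h % 2 == 0);  /* sketch handles even sizes only */
    return w * h + 2 * ((w / 2) * (h / 2));
}

/* YUY2: every pair of pixels takes 4 bytes, so 2 bytes per pixel. */
static size_t yuy2_frame_bytes(size_t w, size_t h)
{
    assert(w % 2 == 0);
    return w * h * 2;
}
```

For a 640x480 frame this gives 460800 bytes for YV12 (12 bits per pixel) and
614400 bytes for YUY2 (16 bits per pixel), so the same picture costs one third
more bandwidth when it is sent as YUY2.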

Because all codecs use YV12, it would be nice if the video card had some
buffers where you can write the Y and U planes (for YV12) and V (for YUY2).

Now comes one of my bigger questions: the Banshee's specs say it supports only
4:2:2 (YUY2) and 4:1:1 (???) and NO 4:2:0 (YV12) - btw, once again, what's that
x:y:z? From what I understood, since there's no hardware support for YV12, you
have to do a software YV12 -> YUY2 conversion and then write YUY2 data to the
Banshee. Is it so?

Ohhhhhh, btw, yes - what's the difference between packed and planar formats?
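
To show what I mean by a software YV12 -> YUY2 conversion, here is a minimal
sketch in plain C. It assumes that "planar" means three separate Y, U and V
buffers and "packed" means one buffer with the bytes interleaved as Y0 U Y1 V.
It is not mplayer's converter (that one uses MMX); the function name and
parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Interleave three planes into one packed YUY2 buffer.
 * Each chroma row is reused for two output rows (4:2:0 -> 4:2:2). */
static void yv12_to_yuy2(const uint8_t *y, const uint8_t *u, const uint8_t *v,
                         size_t y_stride, size_t c_stride,
                         uint8_t *dst, size_t dst_stride,
                         size_t width, size_t height)
{
    assert(width % 2 == 0 && height % 2 == 0);

    for (size_t row = 0; row < height; row++) {
        const uint8_t *yl = y + row * y_stride;
        const uint8_t *ul = u + (row / 2) * c_stride;
        const uint8_t *vl = v + (row / 2) * c_stride;
        uint8_t *out = dst + row * dst_stride;

        for (size_t x = 0; x < width; x += 2) {
            out[2 * x + 0] = yl[x];      /* Y0 */
            out[2 * x + 1] = ul[x / 2];  /* U shared by both pixels */
            out[2 * x + 2] = yl[x + 1];  /* Y1 */
            out[2 * x + 3] = vl[x / 2];  /* V shared by both pixels */
        }
    }
}
```

Note that all four strides are separate parameters, so the source planes and
the destination buffer do not need to have the same line length.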

tdfx_vid can cope with YV12, but it's so sloooow that -vf yuy2 is magnitudes
faster. Why does tdfx_vid accept YV12 when the Banshee doesn't support it? Does
libvo do some sloooow YV12 conversion internally? -vf yuy2 is really the better
variant.

Or does the Banshee have a YV12 -> YUY2 converter which I don't know about? If
it has, why is this converter so slooow (2-3 fps)? When tdfx_vid accepts YV12,
it plays the movie very, very slowly. -vf yuy2 gives waaaay faster results.
XVideo also accepts YV12, but it runs at normal speed. Why?

>YV12 can only be handled using the planar 2 packed converter. It's pretty
>simple to use. You set the stride and address of your target buffer
>(where yuy2 data will be writen). Then just write Y, U and then V to the
>converter address. The converter use a fixed address scheme, each plane
>is 1MB big and have a stride of 1024 bytes. 
>Luckily the AGP move function can use different input/output stride so
>no need for slow loop wich copy line by line :)

My misunderstanding continues: the planar -> packed (what are they?) converter
is in software, right? It's embedded in mplayer as a layer between the codec
and the video card - you don't mean that the video card converts, right?

If it's so, in VIDIX you have no problems, because when I reject the
unsupported YV12, mplayer automagically converts to YUY2 in software and gives
me YUY2 data. I don't need to mess with the converter, because it converts
before the VIDIX stuff and gives me the data already converted. That's if
you're talking about mplayer's software converter.

On the other hand, if you're talking about a converter in the video card itself
(YV12->YUY2), that's something I'm not aware of. It would be wonderful. Is it
so??? (If it exists, is it too slow?)
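
If the converter really is in the card, then this is how I read your
description of its fixed address scheme (each plane 1MB big, stride of 1024
bytes). It is only a sketch of my understanding: the names are hypothetical,
and the plane order Y, U, V inside the converter area is my assumption, taken
from the order in which you say the planes are written.

```c
#include <assert.h>
#include <stddef.h>

enum { CONV_PLANE_BYTES = 1024 * 1024, CONV_STRIDE = 1024 };

/* Assumed order of the planes inside the converter area. */
enum conv_plane { CONV_Y = 0, CONV_U = 1, CONV_V = 2 };

/* Byte offset, relative to the converter base address, of byte x in
 * line `row` of the given plane. */
static size_t conv_offset(enum conv_plane plane, size_t row, size_t x)
{
    assert(x < (size_t)CONV_STRIDE);
    assert(row < (size_t)(CONV_PLANE_BYTES / CONV_STRIDE));
    return (size_t)plane * CONV_PLANE_BYTES + row * CONV_STRIDE + x;
}
```

With a fixed stride of 1024 bytes, a source line that is shorter than 1024
bytes cannot be copied as one big block, which is probably why a copy with
different input and output strides matters here.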

>This really sound like a stride problem. For the overlay you'll use
>a buffer you put somewhere in the video mem. You can chosse any
>stride but you probably want to use the orginal stride so it can
>be copied "at once". You also have to be carefull as quiet a lot of stuff
>must be aligned (overlay stride for yuy2 need 4 bytes aligned stride
>and address for example).

Ohhhh, noooo, it gets even more complicated...
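
At least the stride part can be written down. A minimal sketch, assuming 2
bytes per pixel for YUY2 and the 4-byte stride alignment you mention; the
helper names are my own and do not come from VIDIX or tdfx_vid.

```c
#include <assert.h>
#include <stddef.h>

/* Round value up to the next multiple of alignment (a power of two). */
static size_t align_up(size_t value, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (value + alignment - 1) & ~(alignment - 1);
}

/* Stride in bytes of one YUY2 overlay line: width * 2, 4-byte aligned. */
static size_t yuy2_overlay_stride(size_t width)
{
    return align_up(width * 2, 4);
}
```

For even widths nothing changes (640 pixels give 1280 bytes), but an odd width
such as 321 gives 644 instead of 642, and a wrong value here would shift every
following line.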

I know the overlay address. I mean - all my attempts to set my own failed, so
now I just read from the video card where it expects the information to begin
(the Y, U, V planes) and pass a pointer to that memory to the VIDIX core. Then
I must give the VIDIX core more info (dimensions and stride) and it begins to
write to that memory. All I have to do is give it the right info!!! :))) My
driver actually doesn't need to do anything more. Just set the video card's
registers the right way, give a pointer to the overlay memory and give the
right stride and colorspace to the VIDIX core. From then on, my "driver" sleeps
and the VIDIX core uses the Banshee's memory to write frame after frame.

>Dunno. I have no 2.6 box and i'm not going to switch to 2.6 soon i think.
>So it's up to some 2.6 users imho. But don't worry it's really not
>that hard and it probably not very different from VIDIX coding.

Yes. It's so :)

>> Why not use -vf yuy2 always? At the moment when VIDIX tries to find
>> matching colorspace, I just reject YV12 and say one big YES to YUY2.
>
>You can do that. But on the box i used to write this driver (k6-2 333)
>i never found *any* case where software conversion was faster than using
>the hw stuff. If you prefer sw conversion you can force it anyway.

I don't understand again. Since libavcodec uses YV12 and the Banshee doesn't
support YV12, but only YUY2, how can you use the "hw stuff" without first
converting YV12 -> YUY2 in software? I mean, for the Banshee this conversion is
always needed. Or is there a way to write YV12 directly to the Banshee?

Ohhhh, once again, what are YV12 and YUY2? (sorry, I'm boring)

>And believe me tdfx_vid is faster than *any* other video output method ;)

I believe you. It's just fun to make this VIDIX driver work too. It would be
interesting for me. And of course I still have that little hope that mine will
be faster than yours :)))) (haha)

> Why don't just reject what's not supported by the driver and leave
> mplayer to do the hard work? For example since YV12 is the colorspace
> almost every codec uses, I just say NO to YV12 and the internal
> converter gives me YUY2 - converted using MMX and it's way faster than I
> could ever do it in the driver.
>
>You seems to miss the point that the card hw itself do the conversion.
>Look at the tdfx_vid code you won't find any code to convert betwen
>colorspace. All it can do is ask the card to copy with optional
>convertion/scaling some data wich is in his memory.

I understand that the Banshee does YUY2 -> RGB in hardware. Well, the whole
point is to use its hw capabilities and do as little as possible in sw, but how
can I give it YV12 data without converting it to YUY2, when it doesn't support
YV12? Ohhh, I'm stupid... Explain please.

>The bottleneck is really when you transfer data to the card.

Yes, BUT also by the way you do it :) For example, XVideo does one more memcpy
than required.

libavcodec decodes to one externally provided buffer. Then the data goes from
this buffer into another and then to the video card. The VIDIX architecture
allows this unneeded memcpy to be saved. That's not a memory->video card
problem, because the speed between them is constant (maybe faster with AGP
memcpy); the problem is in what happens BEFORE the data is sent to the video
card. You can't do anything to improve RAM->video card speed, BUT you can write
smarter code which doesn't copy the same information twice in RAM!!! That was
my motivation for writing a VIDIX driver - to save one memcpy, which occupied
my CPU before sending to the video card.

In XVideo (and tdfx_vid) I think that something's not right. For example, when
I use double buffering with XVideo+Direct Rendering, I get THE FASTEST POSSIBLE
CONFIGURATION!!! Why does mplayer drop more frames when double buffering is
turned off??? I don't know.

Since I saw that I CAN save framedrops in one movie (23456 -> 67) by just
turning on XVideo's direct rendering+double buffering, I'm sure that something
is wrong and I CAN ACTUALLY gain more speed - it's software related, not
hardware.

Of course with double buffering+direct rendering, subtitles get corrupted.
Ohhhh, this imperfect world.

Isn't double buffering supposed to prevent flicker and at the same time give
(S)LOWER performance? How can you explain that by turning it on I get WAY
better performance? And if it can be faster (XVideo+db+dr is actually faster
than tdfx_vid on my system, REALLY, I'm not joking - this IS tested MANY
times!), what's wrong? I'll tell you - something software related is wrong.

>> 
>> Thank you a lot. Without tdfx_vid, I would never succeed.
>
>It's a pleasure for me. You know after writing tdfx_vid i got nearly 0
>feedback. So even it's long after i really enjoy discussing this stuff.

No, I thank you.

>BTW if you don't have the banshee specs, it's probably high time to check
>that. If you have pb finding then just ask me i can send you the stuff
>i have.

Yes, I have them and they have helped me a lot, but they are too incomplete for
me! I mean - I want examples of how to use something - I'm not good enough to
work from "register x does y and memory address z is used for abc" or whatever.
It's just me who's inexperienced :(

>BTW2 my banshee lie unused atm so i can't do any testing. But i'll put it
>back in some box soon i think.

YES, YES, YES :)

Georgi

P.S. I wanted to write more, but let's leave it for next time - now I have a
lot of information to understand :)
