[FFmpeg-devel] transcoding on nvidia tesla

christophelorenz christophelorenz
Mon Feb 11 10:15:03 CET 2008

>>Total lost of time and 100x slower on gpu : (gpu probably has to 
>>emulates all the required bit functions and data impose a serial 
>>operation so no parallelisation is possible)
>>Bit stream parsing....
>Quite what I figured with only theory and some FX5200-level GPU
The 5200 is pretty old technology now.
On the 7800, the bit operations are emulated using floats.... no comment.

The GeForce 8800 now supports ints, which is a huge step forward in flexibility compared to a 5200.
It is now a real general-purpose processor with almost none of the previous limitations.
But it is still designed for massively parallel work, so bit-level operations will never be efficient unless they add a dedicated core for bit stream parsing.
They already did, but there is no public API and the implementation is broken on half of their chips.
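
To make the serial dependency concrete, here is a minimal sketch in C. It assumes unsigned Exp-Golomb codes as a stand-in for variable-length coding in general; the BitReader type and the function names are made up for this illustration and are not ffmpeg's API. The point is that the bit position of code k+1 is only known once code k has been fully decoded, so there is nothing to hand out to parallel threads.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical minimal MSB-first bit reader (illustrative names only). */
typedef struct {
    const uint8_t *buf;
    size_t size_bits;   /* number of valid bits in buf */
    size_t pos;         /* current bit position */
} BitReader;

static unsigned read_bit(BitReader *br)
{
    unsigned bit;
    if (br->pos >= br->size_bits)
        return 0;       /* reading past the end yields zeros */
    bit = (br->buf[br->pos >> 3] >> (7 - (br->pos & 7))) & 1u;
    br->pos++;
    return bit;
}

/* Unsigned Exp-Golomb: n zero bits, a one bit, then n payload bits. */
static unsigned read_ue(BitReader *br)
{
    unsigned zeros = 0, value = 1, i;
    /* The prefix length is data dependent (capped for malformed input). */
    while (zeros < 31 && br->pos < br->size_bits && read_bit(br) == 0)
        zeros++;
    for (i = 0; i < zeros; i++)
        value = (value << 1) | read_bit(br);
    return value - 1;
}

/* Decodes codes back to back. Each iteration starts where the previous
 * one stopped, which is what forces the loop to run serially. */
static size_t decode_ue_stream(const uint8_t *buf, size_t size_bits,
                               unsigned *out, size_t max_out)
{
    BitReader br = { buf, size_bits, 0 };
    size_t n = 0;
    while (n < max_out && br.pos < br.size_bits)
        out[n++] = read_ue(&br);
    return n;
}
```

For example, the 12 bits 1 010 011 00100 (bytes 0xA6 0x40) decode to the values 0, 1, 2, 3, and the only way to find where the fourth code starts is to walk through the first three.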

>>CUDA has a much better memory transfer performance than DirectX / 
>>OpenGL, examples show 3Gbytes/sec (up and down) but it vastly depends on 
>>motherboard used.
>>Anyhow, it is still a memory copy. If you need to do this often it will 
>>ruin performance.
>Hmm... I though when using things like PixelBuffers the mapped memory
>can (and if you are lucky will) be graphics memory (or at the very least
>directly DMA-capable), so no additional memcpy would be necessary if you
>write/read directly into/from that.
>There is still some additional latency though.
>And admittedly I never got it to work with anything besides RGB32 data...
You can do that, but it will stall the gpu processing because of the 
It is much more efficient to transfer to GPU memory first, then process 
from there, as the transfer operation doesn't block the GPU process queue.
This is realistic when using DirectX; with OpenGL, however, you don't really 
have control over the transfer (unless you force a costly buffer copy).
It might decide to delay the transfer until just before the texture is used, 
which might ruin the performance.
The big advantage of CUDA is that you control exactly when the transfer 
takes place, and it is optimised even for small amounts of data.
Using pixel buffers to transfer small amounts of data is very expensive.

Also, you cannot write to system memory using pixel buffers (render 
targets must be in GPU memory).
So you have to do the processing first, then copy the data back.

I think massive parallelisation will be the future of computers and 
programmers, whatever happens.
The challenge will be redesigning all the old algorithms that are 
serial by design to work in parallel.
Even basic things like an efficient generic sort are really challenging.
So I won't even talk about porting ffmpeg to it...
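
To illustrate the sorting point, here is a sketch of a bitonic sorting network, written as plain C on the CPU. This is my choice of example, not something benchmarked here: it is one of the classic sorts that maps onto parallel hardware, because the compare-exchange pattern is fixed in advance and does not depend on the data. The function name is made up, and the sketch assumes the element count is a power of two.

```c
#include <stddef.h>

/* Sorts a[0..n-1] in ascending order. Assumes n is a power of two. */
static void bitonic_sort(int *a, size_t n)
{
    size_t k, j, i;
    for (k = 2; k <= n; k <<= 1) {          /* size of the bitonic runs */
        for (j = k >> 1; j > 0; j >>= 1) {  /* compare distance */
            /* All iterations of this inner loop touch disjoint pairs,
             * so on a GPU each i could be handled by its own thread. */
            for (i = 0; i < n; i++) {
                size_t l = i ^ j;           /* partner index */
                if (l > i) {
                    int ascending = (i & k) == 0;
                    if ((ascending && a[i] > a[l]) ||
                        (!ascending && a[i] < a[l])) {
                        int tmp = a[i];
                        a[i] = a[l];
                        a[l] = tmp;
                    }
                }
            }
        }
    }
}
```

The price for that regular structure is extra work: the network performs on the order of n log^2 n comparisons instead of the n log n of a good serial sort, which is part of why an efficient generic parallel sort is hard.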

