A question to bridgman or airlied XD about Crossfire
Well, congrats to all on the progress in the OSS driver. Now I'm finally free of my FGLRX nightmare. I got so happy with my card that I want to try to contribute something XD. I have some free time this quarter, and low-level C dev is always fun to do at night.
I was thinking about something we talked about, iirc, some time ago: multi-GPU rendering in the OSS driver. I've been thinking about it and it doesn't seem that hard now that most of the basic code is there. XD
I think in my free time I could achieve something in that department (I need to study low-level C again XD, many years in C++, but that should be fast-ish), so here is my question:
Could you give me some directions and point me to the source files needed to do that? I ask because learning the entire driver will take me some time, so with some directions maybe I can focus on that specific section and lose less time.
Thanks for the hard work.
Last edited by jrch2k8; 01-21-2010 at 03:54 PM.
You're probably going to be doing Alternate Frame Rendering, which at a simplistic level will require:
- changes somewhere in the stack to buffer up rendering work from frame N so that the app can start giving you drawing commands for frame N+1 while frame N is still being drawn - I haven't looked at the current code enough to have a good feeling for whether this can all be done in the hw driver or whether some/all needs to be in the upper level mesa/libgl code
- possibly changes to the buffer swap code to support a queue of rendered frames
- changes to initialization code (and probably other places) to allocate two copies of each required buffer, one for each GPU, rather than one
- changes in the lower level code to direct HW drawing commands to the right GPU
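As a rough illustration of the four points above, here is a toy sketch in Python (not driver code). The `Gpu` class, `render_afr` and the buffer names are all made up for the example; it only models "frame N goes to GPU N mod 2, each GPU has its own buffer copies, and finished frames are queued in order".

```python
from collections import deque

NUM_GPUS = 2  # assumption: two identical GPUs


class Gpu:
    """Hypothetical stand-in for one GPU with its own private copy of every buffer."""

    def __init__(self, index):
        self.index = index
        # One copy of each required buffer per GPU, nothing shared between GPUs.
        self.buffers = {"color": None, "depth": None}

    def draw(self, frame_no, commands):
        # Pretend to execute the drawing commands into this GPU's own colour buffer.
        self.buffers["color"] = f"frame {frame_no}: {len(commands)} cmds on GPU{self.index}"
        return self.buffers["color"]


def render_afr(frames, gpus):
    """Send frame N to GPU N % len(gpus) and queue the results in frame order."""
    present_queue = deque()  # queue of rendered frames waiting for the buffer swap
    for frame_no, commands in enumerate(frames):
        gpu = gpus[frame_no % len(gpus)]  # direct this frame's commands to the right GPU
        present_queue.append(gpu.draw(frame_no, commands))
    return list(present_queue)


gpus = [Gpu(i) for i in range(NUM_GPUS)]
frames = [["clear", "draw"], ["clear", "draw", "draw"], ["clear"]]
presented = render_afr(frames, gpus)
```

In a real driver the hard part is the first bullet, letting the app get ahead by a frame; this sketch simply assumes the per-frame command lists are already available.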
My guess is that learning the whole HW driver is something you're going to have to do anyways.
If airlied says something different listen to him.
Well bridgman, I wasn't thinking about AFR per se, my idea is more like RQCPR (Random Queue Command Parallel Rendering) XD
I came up with this concept because of how painful it seems to be to make a game Crossfire/SLI aware. At least on Windows, it seems that no matter which technique you use, you always need software-side optimization to get performance to scale on multi-GPU systems. So I thought: if I can just parallelize the commands sent to the GPUs, instead of splitting the work by frame or with AA techniques, this should give noticeable performance scaling without needing that much software-side optimization.
Unlike CPU code, which I can parallelize on the software side with threads (OpenMP, POSIX threads, etc.), on the GPU I think that if I just make the driver aware of the other GPUs, then intercept the code where the commands are sent to the GPU and make it choose a GPU at random to process each command (adding some controls to avoid sending commands to a heavily loaded GPU when the other has less load, for example), it should behave like parallel code in theory. Once you reach that stage, performance should improve noticeably for demanding applications like games, because now you need to saturate not just one GPU but all of them at once to get a drop in FPS.
I chose a randomized algorithm instead of a synchronized one because, in my experience, random code tends to balance load better, with less of the overhead and latency present in synchronized code.
I'm aware too that only one GPU has to render the final frame to screen. If I'm not wrong, all GPUs can take part in the framebuffer composition stage without much trouble. This should be easy though: I just need to intercept the specific command that swaps the framebuffer to screen and force it to always go to GPU0.
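To make the selection policy described above concrete, here is a minimal Python sketch. Everything in it is hypothetical (`pick_gpu`, `dispatch`, the 0.9 load threshold, the load figures); it only models "pick a GPU at random, skip heavily loaded ones, pin the swap to GPU0" and says nothing about memory sharing or ordering between dependent commands.

```python
import random

PRESENT_GPU = 0  # the rule from the post: the final swap to screen always goes to GPU0


def pick_gpu(loads, rng, max_load=0.9):
    """Pick a GPU at random, skipping any whose load is above max_load.

    loads is a list of hypothetical per-GPU load figures in [0, 1].
    """
    candidates = [i for i, load in enumerate(loads) if load <= max_load]
    if not candidates:
        # Everything is saturated: fall back to the least-loaded GPU.
        return min(range(len(loads)), key=lambda i: loads[i])
    return rng.choice(candidates)


def dispatch(commands, loads, rng):
    """Return (command, gpu) pairs; the 'swap' command is pinned to PRESENT_GPU."""
    plan = []
    for cmd in commands:
        gpu = PRESENT_GPU if cmd == "swap" else pick_gpu(loads, rng)
        plan.append((cmd, gpu))
    return plan


rng = random.Random(0)  # seeded only to make the sketch repeatable
plan = dispatch(["draw", "draw", "draw", "swap"], loads=[0.95, 0.20], rng=rng)
```

With GPU0 at 95% load and GPU1 at 20%, every draw in this example lands on GPU1 and only the swap goes to GPU0.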
Now, it's also true that I'm not a GPU guru, so maybe my idea is too based on CPU-like coding and won't work, ofc.
Some concepts behind this technique (correct me if I'm wrong here, which is very probable):
1.) When cards are in Crossfire (just to give it a name), both GPUs can map the sum of the total memory of both GPUs, i.e. 2 Radeon HD 4870 512 MB cards would make a 1 GB framebuffer accessible from both GPUs.
2.) As far as I understand, if the memory is accessible to both GPUs it shouldn't matter which GPU processes what, because the result goes to memory, so a different GPU can access the result and process it again as needed (not sure in the case of serialized/synced instructions, because in OpenGL you don't specify which GPU should take the commands, but I can be wrong here).
3.) Unlike a CPU, I think I read that GPUs don't have a process ID system, at least for OpenGL rendering (or does Mesa do it internally? not sure either), so I assume that OpenGL just interacts with objects in the framebuffer in a serialized way to create a frame, so I should be able to bypass the sync buffer overhead by randomizing the algorithm that decides which GPU should process command X.
4.) I suppose there is a way to know how much load a specific GPU has at a given time (at least Catalyst for Windows reports it).
I was also thinking of another profile where you can choose what each GPU does, for example GPU0 only loads textures into memory and GPU1 handles all the processing, filters, etc. (sounds a little crazy but I think it's doable; the side effects could be nasty in some cases I can think of). This way you could improve the performance of video playback, for example, especially HD video or raw video.
So imagine video from a raw camera at HD quality, and you want to apply very heavy image filters to each frame, like those in the CImg library, plus deinterlacing and postprocessing in real time. Using that other profile you could really speed it up, using one GPU to load those huge textures into memory and one or more GPUs only processing the frames.
Well, thanks for listening to my crazy thoughts XD
I'm not experienced in GPU architecture, but AFAIK two GPUs cannot easily share memory. It's all about bandwidth and latency: neither the Crossfire/SLI link nor PCIe is capable of such large data transfers between cards. Doing so would decrease performance noticeably, even below single-card level. This is why the current NVidia/AMD binary drivers mirror data on all cards (making a 512M+512M combination as capable as a 512M one, not 1G) and reduce interaction between cards as much as possible.
This is only my thought, not sure if it is correct.
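The mirroring point can be put as a tiny Python sketch. The function name `usable_vram_mb` and the "pooled" mode are made up for illustration; the pooled case is the hypothetical shared-memory model, not something the hardware is claimed to do.

```python
def usable_vram_mb(card_sizes_mb, mirrored=True):
    """Memory an application can actually use across the cards, in MB."""
    if mirrored:
        # Every card holds its own copy of the data, so the smallest card is the limit.
        return min(card_sizes_mb)
    # Hypothetical pooled model, where both cards would share one address space.
    return sum(card_sizes_mb)


mirrored = usable_vram_mb([512, 512])
pooled = usable_vram_mb([512, 512], mirrored=False)
```

So two 512M cards give 512M of usable memory under mirroring, and 1G only in the pooled model.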
The two GPUs have separate address spaces for all practical purposes.
It is possible for one GPU to access memory on the other GPU but nowhere near fast enough for the approach you described. The video memory bandwidth on a high end card is over 100GB/s while the transfer rate of a Gen2 X16 PCIE bus is maybe 8GB/s.
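For a feel of what those two numbers mean, here is a back-of-the-envelope calculation in Python. The bandwidth figures are the round numbers quoted above; the 1920x1080, 4 bytes per pixel render target is just an assumption for the example.

```python
VRAM_GBPS = 100.0  # "over 100GB/s" from the post, taken as a round figure
PCIE_GBPS = 8.0    # "maybe 8GB/s" for Gen2 x16, from the post

# Assumption for illustration: one 1920x1080 render target at 4 bytes per pixel.
frame_bytes = 1920 * 1080 * 4


def transfer_ms(num_bytes, gb_per_s):
    """Milliseconds needed to move num_bytes at gb_per_s (1 GB = 1e9 bytes)."""
    return num_bytes / (gb_per_s * 1e9) * 1e3


ratio = VRAM_GBPS / PCIE_GBPS
vram_ms = transfer_ms(frame_bytes, VRAM_GBPS)
pcie_ms = transfer_ms(frame_bytes, PCIE_GBPS)
print(f"local VRAM is {ratio:.1f}x faster than the PCIe link")
print(f"one buffer copy: {vram_ms:.3f} ms over VRAM, {pcie_ms:.3f} ms over PCIe")
```

That is roughly 1 ms for a single pass over one buffer across PCIe, and a frame typically touches its buffers many times.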
Even the 100GB/s video memory isn't fast enough to keep up with a modern GPU on its own. Drawing is actually done from (texture) and to (render target) on-chip caches which are much faster than the video memory - and the cache on one GPU doesn't know about writes from the other GPU. If drawing were forced to go directly to video memory to ensure inter-GPU coherency the performance would drop by a factor of 20 or more.
Gimme an "A" !
Gimme an "F" !
Gimme an "R" !
Whaddya got ?
Seriously, if you combine the hardware issues above with ever-growing amounts of post-processing and the fact that current graphics APIs don't do a good job of identifying dependencies between operations (which is why you end up with the draw/flush/draw/flush model) AFR is still the most user- and developer-friendly approach.
Last edited by bridgman; 01-22-2010 at 12:04 PM.
Now, do these limitations affect X2 cards too?
Yes. Some cards have inter-GPU buses but those buses are nowhere near fast enough to provide the illusion of a single memory pool.
Accessing off card resources always comes at a performance penalty.
Or off-chip in the case of an X2 card.