I don't quite understand what you're trying to say.
Originally Posted by cruiseoveride
That's the point of benchmarking. When Crysis came out, nothing could run it, and it was used as a benchmark for this exact reason.
And since ATI dropped their Catalyst driver for cards before r600, Linux users can only use this open source driver. And for many of them, performance is important. They cry on the internet about how they want optimisations. That's what Michael is reporting on.
How do I test out this new 2.6.38 kernel? I'm on 2.6.37 right now (I compile it from source myself for my HTPC). I saw a version 2.6.37-git9 on kernel.org. Is that the 2.6.38 branch, or is that the 22.214.171.124 branch?
Ah, these results are much more in line with bridgman's comments about not understanding why the driver is not performing that great. Because hey, it turns out that it actually is performing very well for the older hardware. So for R300-R500 the objective of reaching a sizeable fraction of fglrx's performance is... basically done, I'd say.
Just out of curiosity: why are the OpenArena results that bad compared to fglrx?
You'll either need to fetch Linus's tree or wait until the merge window closes and RC1 is released, and download that.
Originally Posted by ntt2010
If you opt for the git tree, you'll find that it'll still compile as 2.6.37 until the Makefile is changed.
It can be confusing, and there have been many discussions about creating an RC0 as soon as a final version is tagged. These have always been rejected, however.
There's still lots of work to be done.
Originally Posted by yotambien
For one, fglrx is multi-threaded.
the card is starved waiting for the CPU. An even more extreme example is glxgears -- very simple scenes, any GPU can render them really fast, so the CPU performance of the driver becomes the bottleneck.
Originally Posted by yotambien
Basically, fglrx can push a lot more data to the card much faster. At realistic frame rates (around 50-60), it is not a very big factor, but if you're pushing 1080p at 300 fps, it is.
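To put rough numbers on that, here is a back-of-the-envelope sketch. The 2 ms of per-frame driver CPU time is a made-up figure for illustration, not a measurement of either driver, and the function name is hypothetical.

```python
def frame_budget_ms(fps: float) -> float:
    """Wall-clock time available for one frame, in milliseconds."""
    return 1000.0 / fps


# Assumption: a fixed 2 ms of driver CPU work per frame (illustrative only).
DRIVER_OVERHEAD_MS = 2.0

budget_60 = frame_budget_ms(60)     # roughly 16.7 ms per frame
budget_300 = frame_budget_ms(300)   # roughly 3.3 ms per frame

# Fraction of each frame's time budget eaten by the driver.
share_60 = DRIVER_OVERHEAD_MS / budget_60
share_300 = DRIVER_OVERHEAD_MS / budget_300

print(f"60 fps:  {share_60:.0%} of the frame is driver overhead")
print(f"300 fps: {share_300:.0%} of the frame is driver overhead")
```

The same fixed overhead that is a small slice of a 60 fps frame takes up most of a 300 fps frame, which is why CPU-side driver cost shows up so strongly in high-frame-rate benchmarks.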
I don't think it will improve performance much, surely not more than 1.5x. I'd say that a 1.2x or 1.3x speed-up is more likely to happen. The problem with multithreading is that each thread must keep its own copy of the driver state for it to be thread-safe, and duplicating such state may hurt performance so much that it becomes worse than with no multithreading.
Originally Posted by Drago
I think most of the speed-up will still come from doing clever CPU optimizations and rearchitecting the upper layers of Mesa rather than from "hacks" like multithreading.
Can't *all* the driver code run either on a separate thread when the gl* call permits it, or on the app's main thread when it does not? Can't we do without copying state information, just setting locks here and there?
It's not so simple. Note that in an ideal world, you never want to hit those locks, so that the consumer thread can be completely asynchronous. And if it's not asynchronous, not only won't you get any speed-up, it may even slow things down because of the overhead of the additional work required for multithreading. So staying asynchronous is damn important.
Originally Posted by Drago
Now if you have a closer look at the GL API, you'll realize that there are so many sync points like the glGet and glGen functions that will kill any kind of asynchronous processing. Also there are lots of places where data copies are needed for the consumer thread like glTexImage, glBufferData, gl*Pointer calls, and so on, which add additional overhead.
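A toy model of the two problems above, as a minimal sketch: `CommandStream` and its methods are hypothetical stand-ins, not real GL or Mesa entry points. Draw calls are queued and return immediately, a buffer upload has to snapshot the caller's data, and a glGet-style query has to wait for the consumer thread to drain the queue.

```python
import queue
import threading


class CommandStream:
    """Toy driver front end that runs GL-like calls on a consumer thread."""

    def __init__(self):
        self._commands = queue.Queue()
        self._draws = 0
        self.uploaded = []
        threading.Thread(target=self._consume, daemon=True).start()

    def _consume(self):
        # Consumer thread: execute queued commands in submission order.
        while True:
            func, args = self._commands.get()
            func(*args)
            self._commands.task_done()

    def _do_draw(self):
        self._draws += 1

    def draw(self):
        # Asynchronous: enqueue and return to the application right away.
        self._commands.put((self._do_draw, ()))

    def buffer_data(self, data):
        # The caller may overwrite `data` as soon as we return, so the
        # producer side has to pay for a copy before queueing the upload.
        snapshot = bytes(data)
        self._commands.put((self.uploaded.append, (snapshot,)))

    def get_draw_count(self):
        # Sync point: the answer depends on every queued command, so the
        # calling thread blocks until the consumer has caught up.
        self._commands.join()
        return self._draws


stream = CommandStream()
buf = bytearray(b"abc")
stream.buffer_data(buf)
buf[0] = ord("z")            # app reuses its buffer immediately
for _ in range(100):
    stream.draw()
count = stream.get_draw_count()
print(count, stream.uploaded)
```

Every call like `get_draw_count` collapses the pipeline back to synchronous behaviour, and every call like `buffer_data` adds a copy that the single-threaded path would not need.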
So basically doing multithreading at the GL function dispatching is a damn bad idea. Doing it deeper in the stack may turn out to be a lot better, but not all of the GL function overhead could be hidden then. It all comes down to how to design a working solution that's faster than what we have now, not slower. And making it *not slower* may turn out to be very hard.
Threading and locking hurt just as often as they help unless the workload is particularly well suited to threading and the algorithm implementing it does it intelligently instead of doing the brute-force, afterthought threading that Drago's proposing.
The cases where multi-threading is most likely to help are those where the driver is doing a lot of transformations on the data (such as converting an array of 64-bit doubles into 32-bit floats for hardware that doesn't support larger data sizes). If an application is working with a lot of these, then the threading can be a big win. If the application is mostly just submitting data that's getting copied to the GPU verbatim, then the threading has a much smaller effect. The command checking/verification maybe could use it, and if there's any blocking on the actual buffer submission then it could use it, but the overall gain will be low, and the potential for overall efficiency loss due to the locking is pretty high. The applications are submitting things sequentially and the driver is working sequentially because of it, so simply reducing the overhead is going to be worth a lot more than trying to spread the overhead around.
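For a feel of what that kind of transformation looks like, here is a minimal sketch of narrowing 64-bit doubles to 32-bit floats with Python's `struct` module. The helper name and the sample values are made up for illustration, and a real driver would do this in native code.

```python
import struct


def doubles_to_float32_bytes(values):
    """Pack Python floats (64-bit doubles) as little-endian 32-bit floats."""
    return struct.pack(f"<{len(values)}f", *values)


vertices = [0.5, -1.25, 2.0, 0.1]          # application data, as doubles
packed = doubles_to_float32_bytes(vertices)

# Each element shrinks from 8 bytes to 4, and values that are not exactly
# representable in 32 bits (like 0.1) lose a little precision on the way.
round_trip = struct.unpack(f"<{len(vertices)}f", packed)
print(len(packed), round_trip)
```

Each element is converted independently of the others, so the array can be cut into chunks and handed to separate workers, which is what makes this sort of work a comparatively good fit for threading.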
I'm still of the opinion that optimizations in the driver to work around application bugs or poor API usage should be avoided, especially given the limited manpower. Hell, even in the proprietary drivers, I kinda laugh a bit when I see release notes in Catalyst or the NVIDIA drivers that just work around some commercial game's well-known and unfixed bugs. I really do wonder how many of those tens of millions of lines of code AMD claims are in Catalyst are just a bunch of crap to optimize various poorly written games so that they get better performance scores in reviews. FOSS drivers that won't even ever be running those games don't need those workarounds, and FOSS drivers intended for a FOSS ecosystem of FOSS games should rely on documentation and education to improve client applications rather than hacks in the driver. If the driver is being a bottleneck because it's trying to fix application data before sending it to the GPU, file a bug with the application rather than trying to add a shitton of code to work around bad behavior.
We'd get a lot more performance from threading if Khronos would just ****ing give us a thread-able API that the client applications could use. This would necessitate thread-friendliness in the driver too, of course, but it's moot until the API exposes those capabilities. A good modern graphics engine (and physics engine and game logic/AI engine and so on) breaks its data set into independent islands/batches and distributes those across a number of pre-allocated/pre-created threads (say, one for each logical CPU). With OpenGL right now, the engine can do the object culling and sorting and other CPU-side transformations, but then when it comes time to build and populate the OpenGL buffer objects and to build the command lists for rendering, it has to serialize all that back onto the main thread, and then let all that get bottlenecked there (and a big bottleneck it is, even if the driver magically has 0% overhead).
Very large side note: a lot of amateurs... and even a lot of professionals... try to thread game engines _totally_ wrong. I see way too many people try to make a graphics thread, a physics thread, an AI thread, etc. This gains you almost nothing, and doesn't scale up past quad-core CPUs. They think that they can let the graphics thread do work while the physics thread does work. Sure, that helps a tiny bit. Problem is, though, that the graphics thread can still only really render a single _useful_ frame before it ends up needing to wait on the physics thread to feed it updated object locations/transforms. And the physics thread can only do one iteration of the simulation before it has to wait on the game logic thread to process collision events and feed forces and other object control back into the physics engine. So you can parallelize the processing but you're still stuck moving only as fast as the slowest stage/thread. Plus you're stuck with the overhead of moving the simulation state changes between the stages, because the game logic thread is trying to update state that the physics thread is reading, and the need for consistency during an iteration means that you have to buffer up the state changes from game logic so physics can grab it at the beginning of the iteration, and likewise graphics can get an atomic update of all object transforms from physics at the beginning of its rendering loop.
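A quick idealized model of that limit, as a sketch only: the stage timings are invented, and hand-off and buffering costs between threads are ignored, which flatters the thread-per-subsystem design.

```python
def serial_frame_ms(stage_ms):
    """One thread runs every stage back to back."""
    return sum(stage_ms.values())


def thread_per_subsystem_frame_ms(stage_ms):
    """One thread per stage: in steady state the pipeline can only
    advance as fast as its slowest stage."""
    return max(stage_ms.values())


# Hypothetical per-frame costs in milliseconds.
stages = {"logic": 2.0, "physics": 6.0, "graphics": 4.0}

serial = serial_frame_ms(stages)                    # 12.0 ms
pipelined = thread_per_subsystem_frame_ms(stages)   # 6.0 ms
speedup = serial / pipelined                        # 2.0x on three threads
print(serial, pipelined, speedup)
```

Even with perfectly balanced stages the speed-up can never exceed the number of stages, so adding cores beyond that buys nothing with this design.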
The entire game engine is a pipeline. Making each stage run in its own thread is at best giving you the same kind of "performance" that a deeply-pipelined, single-dispatch, single-core, narrow-width CPU would give (which is not much at all). Instead, create a pool of threads, and then the main thread runs each stage of the pipeline sequentially. The stages internally break their work into batches and submit those as jobs to the thread pool. This scales out to hexa-core and dodeca-core (12-core) CPUs and beyond, doesn't let any of the threads waste time (and energy and heat and battery life) rendering duplicate frames or garbage physics iterations, is easier and safer to develop as there aren't multiple threads trying to work with the same data in RW contexts simultaneously, avoids the need to duplicate simulation data multiple times in each subsystem's thread just to avoid lock contention, and doesn't require inefficient/expensive inter-thread message queues between the subsystems.
A lot of the pros do it wrong because that's what the original Xbox 360 communal wisdom told developers to do (it's a tri-core system with very predictable performance characteristics, so the "wrong way" works reliably okay-ish on that hardware), and a lot of the amateurs do it wrong because they (along with most other CS professionals) still haven't gotten a grasp on how to develop with scalable multi-processing. Really, the "right way" looks a lot like how a GPU works: you split your workload into jobs (graphics primitives) and submit them to a thread pool (GPU SPU clusters) in logical groups (draw calls with shared identical state).
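The thread-pool layout can be sketched in a few lines. This only shows the structure: the stage functions and the one-dimensional "objects" are invented for illustration, and in CPython the GIL means plain Python jobs like these will not actually run faster, so a real engine would do this in native code.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def split_into_batches(items, batch_size):
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def integrate_batch(batch, dt):
    # Physics job: reads its own batch only and returns new (pos, vel) pairs.
    return [(pos + vel * dt, vel) for pos, vel in batch]


def build_draw_batch(batch):
    # Graphics job: turns object positions into (fake) draw commands.
    return [f"draw@{pos:.1f}" for pos, _ in batch]


def run_stage(pool, job, batches, *args):
    # Fan the batches out to the pool, then wait for all of them. Collecting
    # every result acts as the barrier before the next stage starts.
    futures = [pool.submit(job, batch, *args) for batch in batches]
    return [future.result() for future in futures]


def run_frame(pool, objects, dt, batch_size=2):
    # The main thread walks the pipeline stage by stage.
    batches = split_into_batches(objects, batch_size)
    moved = run_stage(pool, integrate_batch, batches, dt)
    draws = run_stage(pool, build_draw_batch, moved)
    new_objects = [obj for batch in moved for obj in batch]
    commands = [cmd for batch in draws for cmd in batch]
    return new_objects, commands


with ThreadPoolExecutor(max_workers=os.cpu_count() or 1) as pool:
    objects = [(0.0, 1.0), (10.0, -2.0), (5.0, 0.0)]   # (position, velocity)
    objects, commands = run_frame(pool, objects, dt=0.5)
print(objects, commands)
```

No job ever touches another job's batch, so there are no locks inside a stage, and the pool size is the only thing that changes when the core count does.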
Likewise, a threaded GPU driver would be similar. You don't just push each call off to a thread or create threads for handling various parts of the driver. You batch up the API calls into jobs that can be handled as atomic units and farm those out to a thread pool that has an efficient barrier and synchronization infrastructure for ensuring that everything hits the GPU in the right order. That would probably require rewriting or at least rearchitecting a large amount of the Gallium code. Which is probably worth it... but only after all the other low-hanging fruit is taken care of (which it isn't... there's a banquet's worth of fruit a lone one-legged midget could pick off the Tree of FOSS Driver Optimization right now).
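A rough sketch of the "batch, farm out, submit in order" idea, under a big simplifying assumption: the batches here do not depend on each other's state, which real GL command streams often do. The call names and the encoding step are hypothetical and have nothing to do with actual Gallium interfaces.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(calls, batch_size):
    return [calls[i:i + batch_size] for i in range(0, len(calls), batch_size)]


def encode_batch(batch):
    # Stand-in for per-batch CPU work such as validation and translation
    # into hardware commands.
    return [f"{name}({arg})".encode() for name, arg in batch]


def submit_in_order(calls, batch_size=2, workers=4):
    hardware_queue = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map hands results back in input order even when workers
        # finish out of order, so the final stream keeps the API call order.
        for encoded in pool.map(encode_batch, make_batches(calls, batch_size)):
            hardware_queue.extend(encoded)
    return hardware_queue


calls = [("bind", 1), ("draw", 3), ("bind", 2), ("draw", 6), ("flush", 0)]
command_stream = submit_in_order(calls)
print(command_stream)
```

The ordered collection step plays the role of the barrier: the expensive per-batch work runs in parallel, but nothing reaches the (pretend) hardware queue out of sequence.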
(Copied what I just wrote above into a draft article I may flesh out at some point as an overview of threading games. Seems nobody else has written much on the topic, on the Web at least.)