LLVMpipe Scaling With Intel's Core i7 Gulftown
Phoronix: LLVMpipe Scaling With Intel's Core i7 Gulftown
When we found out that an Intel Core i7 970 "Gulftown" CPU was on the way, with its six physical cores plus another six logical cores via Hyper-Threading, the first thing that came to mind was trying out this latest Intel 32nm processor with the Gallium3D LLVMpipe driver. There's a lot to love about Gallium3D when it comes to open-source Linux graphics drivers: the different state trackers open up new possibilities (such as native Direct3D 11 support on Linux), and the hardware drivers themselves are more advanced, easier to write, and should eventually be much faster than the classic Mesa drivers for Linux. One driver that has been of particular interest is LLVMpipe, an attempt to finally make a useful CPU-based software rasterizer for Linux by leveraging the Low-Level Virtual Machine infrastructure. We have an introductory article on LLVMpipe, and even with a Core i7 "Bloomfield" processor the driver is very demanding; with Intel's Gulftown, however, the results are somewhat surprising as we experiment with how this CPU-based driver scales up to twelve threads.
Given that graphics is an "embarrassingly parallel" problem, shouldn't it be possible -- theoretically -- to achieve very nearly linear scaling with the number of CPU cores? I'm not saying it would be easy, or that llvmpipe is flawed if it doesn't -- just asking whether, theoretically, it's within the realm of possibility to achieve.
Though I guess one complicating factor here is that it's not just the graphics, but also the normal game logic itself which is running on the CPU at the same time. Have you guys considered trying some kind of purely-graphics benchmark to try and isolate that factor?
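To put a rough number on the "theoretically linear" question: one common back-of-the-envelope model is Amdahl's law, which caps the speedup by whatever fraction of the frame time stays serial. The sketch below is only that model; the function name is made up for illustration and the 10% serial fraction is an assumed figure, not something measured on LLVMpipe.

```python
def amdahl_speedup(serial_fraction, workers):
    """Ideal speedup under Amdahl's law.

    serial_fraction: share of the single-threaded run time that cannot be
                     parallelized (0.0 .. 1.0). This is an assumed input,
                     not a measured LLVMpipe figure.
    workers:         number of cores/threads doing the parallel part.
    """
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Hypothetical 10% serial share: print the ideal speedup per core count.
    for n in (1, 2, 4, 6, 12):
        print(n, round(amdahl_speedup(0.10, n), 2))
```

With a serial share of zero the model gives perfectly linear scaling, so near-linear is within the realm of possibility only to the extent that the serial part (game logic, object submission, and so on) is kept tiny. With the assumed 10% serial share, six cores come out at roughly 4x rather than 6x.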
So going by these test results, it seems that adding the 6 logical (HT) cores to the physical cores is actually a hindrance to performance at low resolutions, and becomes beneficial only at high resolutions, and even then only minimally, at least as far as LLVMpipe is concerned.
Is this a joke? A $1K CPU to use as a soft renderer being able to play games only @800x600.
I don't understand the meaning of this article. To show that LLVMpipe scales well? But who's gonna use it anyway?
"The performance improvement seen is very application-dependent, however when running two programs that require full attention of the processor it can actually seem like one or both of the programs slows down slightly when Hyper Threading Technology is turned on."
Originally Posted by sirdilznik
Is it known that current mainstream rendering techniques are embarrassingly parallel? I haven't studied the algorithms to any real detail, but it would surprise me if they are (I'd expect some issues with Z-sort and overlapping fragments, at least). Surely some important parts of it are, but that's different from the whole pipeline scaling ideally.
Originally Posted by illissius
In the last year, ATI got nearly double the performance going from 160 to 320 execution cores, so yes, 3D rendering is very definitely embarrassingly parallel.
With the current accepted rendering algorithms, Z-sort doesn't need to always happen. Only for transparent rendering you need to sort, and then, only that which is in the tile frustum.
Not to mention that CPUs themselves do not scale linearly either as each core is going to be sharing L2 cache and main memory bandwidth.
Originally Posted by Ex-Cyber
A summary of sort-of typical rendering in 3D (without considering the actual game logic):
Order notation used.
1. Determine view frustum - O(1) - Serial
2. Determine objects in frustum - O(log n) - Somewhat parallel, but not great
3. Roughly sort opaque objects from front to back - O(log n) - Mostly serial
4. Emit every object - O(n) - serial
4.1 Where surface is split into tiles - almost O(n) parallelization: (reasonable gain here)
4.1.1 Throw away if unneeded in tile - cheap, early exit point
4.2 Emit each part of object - O(n) - serial
4.2.1 compute render region - O(1) - serial
4.2.2 for each pixel under region - stupidly parallel (most of gain here)
4.2.2.1 test if visible - O(1) - cheap, early exit point
4.2.2.2 render - O(1)
5 & 6. More or less the same as 3 & 4, but transparent objects are sorted back to front. Sorting here can be more expensive, and early exit points are much less used
7. For each post processing: - O(n) - serial
7.1 For each pixel: - stupidly parallel (most of gain here)
7.1.1 Do something
Um, I think that is about it?
Of course, limits such as cache hits, bandwidth, unbalanced workload, etc... all contribute to slow it down.
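For anyone who wants to see the per-tile and per-pixel steps from the summary above in code, here is a toy sketch. Everything in it is made up for illustration (the 8x8 framebuffer, the rectangle "objects", the names `render_tile` and `render`); it is not how LLVMpipe is implemented. Each tile is independent, so the tiles can be handed to a worker pool, and inside a tile the depth test is the cheap early exit.

```python
from concurrent.futures import ThreadPoolExecutor

WIDTH, HEIGHT, TILE = 8, 8, 4  # hypothetical tiny framebuffer and tile size

# Hypothetical scene: axis-aligned rectangles (x0, y0, x1, y1, depth, colour).
# x1/y1 are exclusive; a smaller depth is closer to the camera.
SCENE = [
    (0, 0, 6, 6, 5.0, "A"),
    (2, 2, 8, 8, 3.0, "B"),
    (1, 5, 4, 8, 9.0, "C"),
]


def render_tile(tx, ty, objects):
    """Render one tile and return {(x, y): colour} for its covered pixels."""
    x_hi, y_hi = min(tx + TILE, WIDTH), min(ty + TILE, HEIGHT)
    depth, colour = {}, {}
    # Rough front-to-back order, so hidden pixels fail the depth test early.
    for x0, y0, x1, y1, z, c in sorted(objects, key=lambda o: o[4]):
        # Throw the object away if it does not touch this tile at all.
        if x1 <= tx or x0 >= x_hi or y1 <= ty or y0 >= y_hi:
            continue
        # Render region = object rectangle clipped to the tile.
        for y in range(max(y0, ty), min(y1, y_hi)):
            for x in range(max(x0, tx), min(x1, x_hi)):
                # Per-pixel visibility test, then "render".
                if z < depth.get((x, y), float("inf")):
                    depth[(x, y)] = z
                    colour[(x, y)] = c
    return colour


def render(objects, workers=1):
    """Render all tiles, one task per tile, and merge them into one frame."""
    tiles = [(tx, ty)
             for ty in range(0, HEIGHT, TILE)
             for tx in range(0, WIDTH, TILE)]
    frame = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for part in pool.map(lambda t: render_tile(t[0], t[1], objects), tiles):
            frame.update(part)  # tiles never overlap, so merging is trivial
    return frame
```

Note that threads in standard CPython will not actually speed this up because of the global interpreter lock; the point is only that the tiles share no state, which is what makes this stage easy to spread over cores. The object submission loop and the final merge stay serial, which is where the limits mentioned above start to bite.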
I think the issue here is that while graphics still has a big chunk of embarrassingly parallel work, the individual tasks are extremely small. For real scalability you either need some hardware scheduling (like a GPU has) or you need to design the software renderer from day one around the idea of having a very large number of cores/threads (as was attempted with the Larrabee renderer).
AFAIK the LLVMpipe renderer was designed for "one to a small number" of threads... I'm pretty impressed with how well it scales.
I'm only looking at the results from 1 core to 6 cores, since the jump from 6 to 12 isn't really bringing more cores on stream, just more threads per core.