It uses a bunch of complicated heuristics for each GPU that tell it whether to use the 3D engine, the 2D engine, or the CPU to do specific bits of rendering the fastest way, and then it combines everything together.
Just tested SNA: kernel 3.5rc4, xserver 1.12.1, intel 2:2.19.0-3 (Debian). With KDE 4.8 I only need to press Alt+Shift+F12 a few times and the xserver crashes. Very useful...
2. Everyone who's ever written an acceleration architecture has at least hoped that it would be a "unified acceleration architecture". The fact that none of them have worked out to actually be universal (except for EXA which, for a certain period of time, worked for ATI, Intel and Nvidia cards) shows that individual cards have diverging hardware, making it difficult to create one efficient architecture for all cards.
IMHO the biggest discrepancy between cards is the memory model. You have at least these memory models:
1. Discrete GPUs have their own VRAM (usually a LOT of it), which is blazingly fast when accessed by the GPU but painfully slow when read by the CPU.
2. IGPs on the motherboard use system RAM, but since integrated graphics from before the advent of processor graphics (Sandy Bridge and later) was very slow, you can't make many assumptions about IGP performance at all.
3. Processor-based graphics such as AMD Fusion and Intel's Sandy/Ivy Bridge have very low-latency reads and writes to system memory, but unfortunately, system memory is much slower (lower bandwidth) than VRAM.
4. Hybrid models such as Nvidia Optimus and LucidLogix Virtu present their own performance characteristics.
Not only does the memory model require vastly different programming at the driver level to enable the functionality, it also affects the "cost" of certain operations. So, for a discrete GPU that is reading and writing between VRAM and the 3D engine without much interaction with the CPU, that's going to be REAL fast, because the GPU can access its VRAM faster than almost any other operation on your system except the CPU's L1/L3 cache. BUT, if your acceleration architecture ever makes a discrete GPU copy some data from its VRAM back to the CPU, that can be a significant cost. IGPs and processor-based GPUs, on the other hand, wouldn't mind reading from "VRAM" (really, system RAM) at all. Better still, for Sandy Bridge and later, the memory controller itself is on the processor!
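To make the "cost" point a bit more concrete, here is a rough sketch in Python. The numbers are made-up relative units chosen only to mirror the argument above, not measurements of any real hardware, and the names (`TRANSFER_COST`, `plan_cost`) are hypothetical. All it is meant to show is how a single CPU readback can dominate on a discrete card while barely registering on an integrated one.

```python
# Illustrative only: relative cost per operation, in arbitrary units.
# These values are assumptions picked to match the reasoning in the post,
# not benchmarks of any real GPU.
TRANSFER_COST = {
    "discrete":   {"gpu_vram": 1, "cpu_vram": 50},  # GPU<->VRAM cheap, CPU<->VRAM expensive
    "integrated": {"gpu_vram": 4, "cpu_vram": 4},   # both about the same, neither very fast
}


def plan_cost(memory_model, gpu_ops, cpu_readbacks):
    """Total relative cost of a rendering plan.

    gpu_ops:       number of operations where the GPU touches "VRAM"
    cpu_readbacks: number of times the CPU has to read from "VRAM"
    """
    costs = TRANSFER_COST[memory_model]
    return gpu_ops * costs["gpu_vram"] + cpu_readbacks * costs["cpu_vram"]


if __name__ == "__main__":
    for model in ("discrete", "integrated"):
        print(model,
              "no readback:", plan_cost(model, 10, 0),
              "one readback:", plan_cost(model, 10, 1))
```

With these assumed weights, one readback turns a cheap plan into an expensive one on the discrete card, while the integrated plan hardly changes. The real ratios depend entirely on the hardware.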
I think it makes sense to have, at a bare minimum, two acceleration architectures:
1. One that's optimized for discrete GPUs where GPU<->VRAM is cheap but CPU<->VRAM is expensive;
2. One that's optimized for integrated GPUs where GPU<->VRAM is about the same cost as CPU<->VRAM but neither one is fast enough to really compete with a discrete GPU.
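As a sketch of what that split might look like in practice, the snippet below maps the four memory models listed earlier onto the two proposed architectures. Everything here is hypothetical (the `MemoryModel` enum, `pick_architecture`, and the string labels are invented for illustration and do not correspond to anything in the X server or any driver), and the hybrid case is deliberately left undecided because the proposal above doesn't cover it.

```python
from enum import Enum


class MemoryModel(Enum):
    DISCRETE = "discrete"      # own VRAM, slow for the CPU to read
    IGP = "igp"                # motherboard graphics on system RAM
    PROCESSOR = "processor"    # on-die graphics (Fusion, Sandy/Ivy Bridge)
    HYBRID = "hybrid"          # Optimus, Virtu and similar


def pick_architecture(model):
    """Return which of the two proposed architectures fits a memory model.

    Returns None for hybrid setups: the two-architecture proposal does not
    say how those should be handled, so this sketch does not guess.
    """
    if model is MemoryModel.DISCRETE:
        # GPU<->VRAM is cheap, CPU<->VRAM is expensive: avoid readbacks.
        return "discrete-optimized"
    if model in (MemoryModel.IGP, MemoryModel.PROCESSOR):
        # GPU and CPU see roughly the same memory cost.
        return "integrated-optimized"
    return None


if __name__ == "__main__":
    for m in MemoryModel:
        print(m.value, "->", pick_architecture(m))
```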