Phoronix: Intel Mesa Driver Gets HiZ Support For Haswell
If running the latest stable components powering the Intel Linux graphics driver (namely the Linux kernel, Mesa, and xf86-video-intel), the open-source graphics support for the forthcoming Haswell processors should be in fairly good shape. However, like Sandy Bridge and Ivy Bridge, it will take some time before the Linux graphics driver is fully-optimized. Fortunately, there's another newly-enabled Haswell feature to report within Mesa...
Hierarchical Z isn't in itself about culling pixels by depth early. That's generally called "early Z." The idea is that a fragment shader can be expensive: it takes processing time calculating color values and eats up texture bandwidth by reading from samplers, only to find out that the fragment is behind what's already there. Most shaders don't do anything special to the depth value, so the value calculated up front by the triangle rasterizer is what's tested against the depth buffer; there's no reason to run the shader to find the depth value. Some shaders do modify depth and hence switch off early Z, so writing depth from a shader is a bad idea if you don't _really_ need custom depth calculations.
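To make the early Z idea concrete, here's a toy software model in Python. It's only a sketch of the concept, not how any real GPU or the Mesa driver does it; the function names, the buffer layout, and the "smaller depth is closer" convention are all my own assumptions for illustration.

```python
# Toy software model of early Z: the depth test runs before the (expensive)
# fragment shader, so occluded fragments never get shaded.
# Assumed convention: smaller depth = closer to the camera.

def make_buffers(width, height):
    """Create a depth buffer cleared to 'infinitely far' and an empty color buffer."""
    depth = [[float("inf")] * width for _ in range(height)]
    color = [[None] * width for _ in range(height)]
    return depth, color


def draw_fragments(fragments, depth_buffer, color_buffer, shade):
    """fragments: iterable of (x, y, depth) as a rasterizer would produce them.
    shade: stand-in for the fragment shader, called as shade(x, y).
    Returns how many times the shader actually ran."""
    shaded = 0
    for x, y, depth in fragments:
        # Early Z: test the rasterizer's depth before running the shader.
        if depth >= depth_buffer[y][x]:
            continue  # behind what's already there, so skip the shader
        depth_buffer[y][x] = depth
        color_buffer[y][x] = shade(x, y)
        shaded += 1
    return shaded


depth, color = make_buffers(4, 4)
near = [(x, y, 0.2) for y in range(4) for x in range(4)]
far = [(x, y, 0.8) for y in range(4) for x in range(4)]

near_count = draw_fragments(near, depth, color, lambda x, y: "near")
far_count = draw_fragments(far, depth, color, lambda x, y: "far")
print(near_count, far_count)  # 16 0
```

The far layer is drawn second and costs zero shader runs, because every one of its fragments fails the depth test before the shader would have been called.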
Hierarchical Z is essentially a form of mipmapping for the Z buffer. You don't just set one depth value in an area of the screen but a whole group of contiguous values representing some triangle, and triangles tend to form larger shapes like walls that all have a smooth depth range filling up a solid rectangle-ish area. It's hence likely that if one fragment is at a certain depth then all the nearby fragments are at a nearby depth. If you take a region and store the furthest depth in it for that whole rectangle whenever you update any individual depth value, then you can let early Z do its thing without querying a large depth texture for each fragment: anything further away than the region's stored value is behind everything in that region. Querying a (cached) hierarchical value means that less memory bandwidth is needed by early Z.
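Here's a rough sketch of that "furthest depth per region" idea, again as a toy Python model. The class name `HiZBuffer`, the 4x4 tile size, and the `fine_reads` counter are things I made up for illustration; real hardware keeps this data in its own formats and doesn't rescan a tile the way this does.

```python
class HiZBuffer:
    """Toy depth buffer with one coarse 'furthest depth' value per tile.
    Assumed convention: smaller depth = closer; 1.0 = cleared (far plane)."""

    def __init__(self, width, height, tile=4):
        self.tile = tile
        self.depth = [[1.0] * width for _ in range(height)]
        tiles_x = (width + tile - 1) // tile
        tiles_y = (height + tile - 1) // tile
        self.tile_max = [[1.0] * tiles_x for _ in range(tiles_y)]
        self.fine_reads = 0  # how often the full-resolution buffer was read

    def test_and_write(self, x, y, depth):
        """Return True if the fragment is visible (and record its depth)."""
        tx, ty = x // self.tile, y // self.tile
        # Coarse test: not closer than the furthest depth in the tile means
        # the fragment is behind every pixel there, so reject it cheaply.
        if depth >= self.tile_max[ty][tx]:
            return False
        # Otherwise fall back to the ordinary per-pixel depth test.
        self.fine_reads += 1
        if depth >= self.depth[y][x]:
            return False
        self.depth[y][x] = depth
        self._refresh_tile(tx, ty)
        return True

    def _refresh_tile(self, tx, ty):
        # Naive rescan of the tile to find its new furthest depth.
        t = self.tile
        rows = self.depth[ty * t:(ty + 1) * t]
        self.tile_max[ty][tx] = max(max(row[tx * t:(tx + 1) * t]) for row in rows)


hiz = HiZBuffer(8, 8)
wall = [(x, y, 0.3) for y in range(8) for x in range(8)]
behind = [(x, y, 0.7) for y in range(8) for x in range(8)]

wall_passed = sum(hiz.test_and_write(*f) for f in wall)
reads_after_wall = hiz.fine_reads
behind_passed = sum(hiz.test_and_write(*f) for f in behind)
reads_for_behind = hiz.fine_reads - reads_after_wall
print(wall_passed, behind_passed, reads_for_behind)  # 64 0 0
```

Once the wall has filled each tile, every fragment of the object behind it is thrown out by the coarse check alone, without touching the full-resolution depth values. That's the bandwidth saving in miniature.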
A further trick is to do front-to-back sorting of coarse objects. Combined with early Z, this means that the closest object is drawn first and later objects are culled by early Z, so you only eat up the texture bandwidth, fill rate, and processing time for each fragment once in the case of overlapping objects. Without said sorting, your objects may inadvertently be rendered in painter's algorithm order (back to front), which completely undermines early Z optimizations. (Unfortunately most kinds of translucency effects essentially also undermine early Z, which is why they're rare in games and other real-time rendering apps.)
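A quick way to see how much the draw order matters is to count shader runs for the same scene in both orders. This is a made-up Python sketch: the "objects" are just full-screen layers with a single depth each, and the names and depths are hypothetical test data.

```python
def count_shaded(objects, width=8, height=8):
    """objects: list of (name, depth) full-screen layers, drawn in list order.
    Returns the number of fragment shader runs with early Z enabled.
    Assumed convention: smaller depth = closer to the camera."""
    depth_buffer = [[float("inf")] * width for _ in range(height)]
    shaded = 0
    for _name, depth in objects:
        for y in range(height):
            for x in range(width):
                # Early Z: only fragments that pass the depth test get shaded.
                if depth < depth_buffer[y][x]:
                    depth_buffer[y][x] = depth
                    shaded += 1
    return shaded


scene = [("wall", 0.5), ("skybox", 0.9), ("player", 0.1)]

# Painter's algorithm order: furthest first, so everything gets shaded.
back_to_front = sorted(scene, key=lambda obj: obj[1], reverse=True)
# Front-to-back order: nearest first, so later layers are culled.
front_to_back = sorted(scene, key=lambda obj: obj[1])

painter_cost = count_shaded(back_to_front)
sorted_cost = count_shaded(front_to_back)
print(painter_cost, sorted_cost)  # 192 64
```

With three fully overlapping layers, the back-to-front order shades every pixel three times while the front-to-back order shades each pixel once. Real scenes overlap less neatly, so the gap is usually smaller, but the direction is the same.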
A consequence is that an app must be designed to really take advantage of early Z (and hence hierarchical Z). A naive renderer will get much less benefit from them, and what it does get is essentially by chance. A good renderer takes them into account and gets maximum benefit. A very poorly designed renderer can effectively disable early Z entirely for all scenes, even those that could really use it. Hence yet another reason that driver updates can have a huge impact on some games and no impact on others; a fancy driver optimization that a game essentially turns off doesn't help any, while another game written assuming that feature is there can see a huge performance jump from just that one new feature. Well-written AAA games/engines/renderers all assume early Z since it's ubiquitous in PC/console hardware. Unfortunately early Z is not a part of either the GL or D3D specs, so its implementation and specific behavior are up to each hardware manufacturer, and the things that can disable early Z can differ from one vendor to the next. Writing to the depth value in fragment shaders is a sure way to do it, though.
This is all irrelevant to some mobile hardware (which is getting rarer), which often uses deferred rendering (NOT the same as deferred shading; even many graphics books written by very experienced folks get these terms mixed up, alas). Desktop hardware typically renders triangles as they're fed into the GPU. Deferred rendering GPUs take a list of all triangles in a scene, break them into regions, then sort them. Each region is rendered in order, meaning that only a small amount of scratch space is needed for an otherwise large scene. Because the triangles are implicitly sorted, there's far less reason to sort them in the app itself or worry about making the best use of early Z, which doesn't even exist on this hardware.
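The "break them into regions" step can be sketched as binning triangles into screen tiles. This Python snippet is only my illustration of the general idea, assuming non-negative screen-space coordinates and a conservative bounding-box overlap test; the function name `bin_triangles` and the 16-pixel tile size are hypothetical, and actual deferred rendering GPUs do this in hardware with their own rules.

```python
from collections import defaultdict


def bin_triangles(triangles, tile_size):
    """triangles: list of triangles, each a list of three (x, y) screen-space
    vertices with non-negative coordinates (an assumption of this sketch).
    Returns {(tile_x, tile_y): [triangle indices]}.
    Uses the triangle's bounding box, so a tile may list a triangle that
    only overlaps it by bounding box and not by actual coverage."""
    bins = defaultdict(list)
    for index, tri in enumerate(triangles):
        xs = [vertex[0] for vertex in tri]
        ys = [vertex[1] for vertex in tri]
        tx0, tx1 = int(min(xs)) // tile_size, int(max(xs)) // tile_size
        ty0, ty1 = int(min(ys)) // tile_size, int(max(ys)) // tile_size
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(index)
    return dict(bins)


triangles = [
    [(1, 1), (6, 2), (3, 7)],        # fits inside tile (0, 0)
    [(10, 10), (30, 12), (20, 28)],  # bounding box spans four tiles
]
bins = bin_triangles(triangles, tile_size=16)

# Each tile can now be processed on its own with only its short triangle list.
for tile in sorted(bins):
    print(tile, bins[tile])
```

Once every tile has its own list, the GPU only needs enough scratch space for one tile at a time, which is the point made above about needing only a small amount of memory for a large scene.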
Overall, hierarchical Z shouldn't see particularly huge benefits to performance since its only purpose over early Z is to save a small bit of texture bandwidth during early Z checks. AMD's HyperZ is their marketing term for the combination of early and hierarchical Z, so it makes sense that enabling "just" HyperZ would have a large impact. For Intel, if they already had early Z in the driver, then this won't be as big a boon. I don't know if Intel's driver already had early Z or not.