Quote Originally Posted by RealNC View Post
I don't view it that way. If you're gonna have, say, 8MB cache on 4 cores, it's better to make it shared rather than 2MB per core. That way, on loads that involve fewer cores the cache increases (on a two-thread load you have 4MB per core).

But of course that view comes from someone who doesn't know the details behind CPU cache memory :-P
The issue is not how you do it; it's that smaller caches at the lower levels are not simply worse than one big combined cache. Having a big shared cache is theoretically better, but in a multi-threaded context it can make things slower. Take a typical case: you run 'make -j9' (meaning 9 processes/jobs will try to stress all 8 of your CPU cores). With multiple small per-core caches, every time a process switch happens, the new process simply blanks that core's cache. This is bad, but not that bad: the compiler's working set may fit in that cache fairly well, and even when the blanking does happen, only one core's cache is lost, for a theoretical worst-case slowdown of around 12.5 percent (one core out of eight) in memory access.

Now take a 'make -j8' with a shared cache: processes 1 through 8 start and overwrite one another's cache lines, and just when some of their data starts to be common, one process finishes and the next job, process 9, clears the shared cache once again, wasting all the "locality" the cache had built up.
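To make that concrete, here is a minimal sketch (Linux/GCC; the buffer sizes are assumptions, and it assumes cores 0 and 1 share the last-level cache) of what the arriving "process 9" does to a thread whose working set was happily cache-resident: one thread times passes over a small hot buffer while a second thread streams through a buffer bigger than the shared cache.

```c
/* Sketch of shared-cache eviction by a freshly started job.
 * Compile: gcc -O2 -pthread evict.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define HOT_SZ   (512 * 1024)        /* small enough to stay cache-resident */
#define EVICT_SZ (32 * 1024 * 1024)  /* larger than a typical shared L3 */

static volatile long sink;           /* keeps the reads from being optimized out */

static void pin(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* Plays the freshly started job: streams through a buffer bigger than
 * the shared cache, evicting everyone else's lines. */
static void *evictor(void *arg) {
    char *big = malloc(EVICT_SZ);
    (void)arg;
    memset(big, 1, EVICT_SZ);        /* fault the pages in for real */
    pin(1);
    for (;;) {
        long s = 0;
        for (size_t i = 0; i < EVICT_SZ; i += 64)  /* one read per cache line */
            s += big[i];
        sink = s;
    }
    return NULL;
}

static void timed_passes(const char *label, char *hot) {
    for (int pass = 0; pass < 5; pass++) {
        double t0 = now();
        long s = 0;
        for (int rep = 0; rep < 100; rep++)
            for (size_t i = 0; i < HOT_SZ; i += 64)
                s += hot[i];
        sink = s;
        printf("%s pass %d: %.3f ms\n", label, pass, (now() - t0) * 1e3);
    }
}

int main(void) {
    char *hot = malloc(HOT_SZ);
    pthread_t t;
    memset(hot, 1, HOT_SZ);
    pin(0);
    timed_passes("alone  ", hot);              /* hot set stays resident */
    pthread_create(&t, NULL, evictor, NULL);   /* "process 9" arrives */
    timed_passes("evicted", hot);              /* same loop, now slower */
    return 0;
}
```

On a CPU with a shared L3, the "evicted" passes should come out noticeably slower than the "alone" ones, which is exactly the locality loss described above.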
Another issue is how caches are built: the L1 is closest (in both distance and CPU cycles) to the math units, the memory unit, and so on; the L2 is a bit further away; and the L3 (what we think of as the shared cache) is the slowest, because the electrons have to "walk more" to get the data from the cache to any specific core. So the solution most people would propose is 1 MB of L1 for every core (the opposite of the 8 MB shared L3). But that would make synchronization between the cores impossible (I mean possible, as on the Athlon X4, which has no L3, but slower, because the L3 is also used for syncing data between cores).
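You can see that latency ladder for yourself with a dependent pointer chase. Here is a rough sketch (the sizes are assumptions for a typical part) where each load must complete before the next can start, so the ns/load figure approximates the raw latency of whichever level the working set fits in:

```c
/* Pointer-chase latency ladder. Compile: gcc -O2 chase.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    /* Working sets picked to land in L1, L2, L3 and RAM on a typical CPU. */
    size_t sizes[] = {16 << 10, 256 << 10, 8 << 20, 64 << 20};

    for (int s = 0; s < 4; s++) {
        size_t n = sizes[s] / sizeof(void *);
        void **ring = malloc(n * sizeof(void *));

        /* Sattolo's algorithm: shuffle the identity mapping into one big
         * random cycle, so the chase visits every slot and the hardware
         * prefetcher cannot guess the next address. */
        for (size_t i = 0; i < n; i++) ring[i] = &ring[i];
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            void *tmp = ring[i];
            ring[i] = ring[j];
            ring[j] = tmp;
        }

        /* Each load depends on the previous one: no overlap possible. */
        void **p = (void **)ring[0];
        long loads = 20 * 1000 * 1000;
        double t0 = now();
        for (long i = 0; i < loads; i++)
            p = (void **)*p;
        double ns = (now() - t0) / loads * 1e9;

        printf("%6zu KB working set: %5.1f ns/load\n", sizes[s] >> 10, ns);
        if (p == NULL) return 1;   /* keep the chase observable */
        free(ring);
    }
    return 0;
}
```

The jump in ns/load between the 256 KB and 8 MB rows is the electrons "walking more" from the L2 out to the L3.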
Getting the cache hit/miss ratio and the branch prediction right (on a cache miss, the predictor has to keep making computations, speculatively and in advance, during the time you wait for memory) is very hard, and succeeding takes work on two fronts: how the software is written (on Bulldozer, your multi-threaded program should try to keep each core's most-used logic within the module's 2 MB cache) and how to stay out of the architecture's bottlenecks.
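On the software side, the classic way to make the hot logic fit in the cache is blocking/tiling. A minimal sketch (the matrix size and BLOCK are assumptions to tune per CPU) using a matrix transpose: both functions perform the same stores, but the blocked one reorders them so each tile is reused while it is still resident, keeping the working set well under that 2 MB:

```c
/* Naive vs. cache-blocked matrix transpose. Compile: gcc -O2 tile.c */
#include <stdio.h>
#include <stdlib.h>

#define N     2048
#define BLOCK 64   /* 64*64*8 B = 32 KB per tile; two tiles fit easily in 2 MB */

/* Naive version: dst is walked column-wise, so once the row stride
 * exceeds the cache, nearly every store misses. */
static void transpose_naive(double *dst, const double *src) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            dst[j * N + i] = src[i * N + j];
}

/* Blocked version: the same stores, reordered into BLOCK x BLOCK tiles
 * that stay cache-resident while they are being reused. */
static void transpose_blocked(double *dst, const double *src) {
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int jj = 0; jj < N; jj += BLOCK)
            for (int i = ii; i < ii + BLOCK; i++)
                for (int j = jj; j < jj + BLOCK; j++)
                    dst[j * N + i] = src[i * N + j];
}

int main(void) {
    double *a = malloc((size_t)N * N * sizeof(double));
    double *b = malloc((size_t)N * N * sizeof(double));
    for (long i = 0; i < (long)N * N; i++) a[i] = (double)i;

    transpose_naive(b, a);
    transpose_blocked(b, a);

    printf("%f\n", b[N + 1]);  /* keep the results observable */
    return 0;
}
```

Running each version separately under `perf stat -e cache-references,cache-misses` shows the difference in the miss ratio directly.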