Thread: Multi-Core Scaling Performance Of AMD's Bulldozer

  1. #1
    Join Date
    Jan 2007
    Posts
    15,080

    Default Multi-Core Scaling Performance Of AMD's Bulldozer

    Phoronix: Multi-Core Scaling Performance Of AMD's Bulldozer

    There has been a lot of discussion over the past two weeks concerning AMD's new FX-Series processors and the Bulldozer architecture. In particular, the Bulldozer architecture consists of "modules," each of which contains two x86 engines that share much of the rest of the processing pipeline with their sibling; as such, the eight-core AMD FX-8150 has only four modules. This article looks at how well Bulldozer's multi-core performance scales when toggling these modules on and off. The multi-core scaling is compared against AMD's Shanghai and Intel's Gulftown and Sandy Bridge processors.

    http://www.phoronix.com/vr.php?view=16589

  2. #2
    Join Date
    Oct 2008
    Posts
    3,173

    Default I think it might also have been interesting to enable the cores together in modules

    rather than separately, one per module.

    That might allow them to share cached data more efficiently between threads?

    And I think at lower core counts it could enable more aggressive turbo frequencies.

    But this was an interesting test as well.
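
    Something like the following is what I mean, as a minimal sketch. It assumes a Linux box where the two cores of a module show up as adjacent CPU ids (0 and 1, 2 and 3, and so on, which is the usual enumeration for an FX-8150, but worth verifying on your own system):

    Code:
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* Each worker pins itself to the CPU id it is handed, then
     * reports where it actually ended up running. */
    static void *worker(void *arg)
    {
        int cpu = *(int *)arg;
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        printf("pinned to CPU %d, running on CPU %d\n", cpu, sched_getcpu());
        return NULL;
    }

    int main(void)
    {
        /* "Together": the two sibling cores of module 0.
         * "Separate" (what the article tested) would be e.g. {0, 2},
         * one core from each of two different modules. */
        int cpus[2] = { 0, 1 };
        pthread_t t[2];

        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, worker, &cpus[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

    Build with gcc -pthread and run the same benchmark under both pinnings; the difference between the two runs is exactly the shared-module effect I'm wondering about.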

  3. #3
    Join Date
    Nov 2007
    Posts
    1,024

    Default

    Quote Originally Posted by smitty3268 View Post
    rather than separately, one per module.

    That might allow them to share cached data more efficiently between threads?
    You don't generally _want_ data to be shared between threads. That would just mean your threading architecture is all wrong and is being hamstrung by data dependencies/locking.

    Shared caches save money. They don't improve speed. (generally speaking, of course)
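
    A concrete example of what I mean: the classic way threads end up "sharing" without intending to is false sharing, where two threads hammer counters that sit in the same cache line and the line ping-pongs between cores. A hypothetical micro-benchmark sketch (the 64-byte line size is an assumption, though it does hold for Bulldozer):

    Code:
    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERS 100000000UL

    /* Two counters in the SAME cache line: every write invalidates
     * the other core's copy of that line. */
    static struct { volatile unsigned long a, b; } same_line;

    /* Two counters padded out to separate 64-byte lines. */
    struct padded {
        volatile unsigned long n;
        char pad[64 - sizeof(unsigned long)];
    };
    static struct padded own_line[2];

    static void *bump_a(void *arg)
    {
        (void)arg;
        for (unsigned long i = 0; i < ITERS; i++) same_line.a++;
        return NULL;
    }

    static void *bump_b(void *arg)
    {
        (void)arg;
        for (unsigned long i = 0; i < ITERS; i++) same_line.b++;
        return NULL;
    }

    static void *bump_own(void *arg)
    {
        struct padded *p = arg;
        for (unsigned long i = 0; i < ITERS; i++) p->n++;
        return NULL;
    }

    /* Time a pair of threads running to completion. */
    static double run_pair(void *(*f1)(void *), void *a1,
                           void *(*f2)(void *), void *a2)
    {
        struct timespec t0, t1;
        pthread_t x, y;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        pthread_create(&x, NULL, f1, a1);
        pthread_create(&y, NULL, f2, a2);
        pthread_join(x, NULL);
        pthread_join(y, NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void)
    {
        printf("same line:      %.2f s\n",
               run_pair(bump_a, NULL, bump_b, NULL));
        printf("separate lines: %.2f s\n",
               run_pair(bump_own, &own_line[0], bump_own, &own_line[1]));
        return 0;
    }

    When the two threads land on different cores, the padded version is typically several times faster; the gap is pure cache-coherence traffic, not useful work.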

  4. #4
    Join Date
    Jul 2008
    Location
    Greece
    Posts
    3,798

    Default

    Quote Originally Posted by elanthis View Post
    You don't generally _want_ data to be shared between threads. That would just mean your threading architecture is all wrong and is being hamstrung by data dependencies/locking.

    Shared caches save money. They don't improve speed. (generally speaking, of course)
    I don't view it that way. If you're gonna have, say, 8MB of cache for 4 cores, it's better to make it shared rather than give each core 2MB. That way, on loads that involve fewer cores, each active core effectively gets more cache (on a two-thread load you have 4MB per core).

    But of course that view comes from someone who doesn't know the details behind CPU cache memory :-P

  5. #5
    Join Date
    Oct 2011
    Posts
    2

    Default

    Hello.

    Very nice test suite, but I would propose some changes:
    1) To judge an architecture's scaling efficiency, features like Turbo should be disabled. With Turbo enabled it is only natural that the scaling to more threads looks worse.
    2) I would change the graphs so that they are easier to interpret at a glance (so that linear scaling actually looks linear). For example, the x-axis should be linear if the y-axis is linear, not like you have it now, where the distance from 1 to 2 is the same as from 2 to 4. It just looks weird.


    @RealNC: That is not the case with BD modules. Whether one core per module is active or two, all of them share the whole L3 cache. And if only one core per module is activated, that core gets the 2MB of L2 to itself, which it would otherwise have to share with its sibling.
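
    If you don't want to take that layout on faith, Linux exposes the cache topology under sysfs; here is a small sketch that dumps the level, type, size, and sharing list for every cache CPU 0 sees (the paths are assumed to exist on any reasonably recent kernel):

    Code:
    #include <stdio.h>

    /* Print one attribute file of a cpu0 cache index, if present. */
    static void show(const char *idx, const char *attr)
    {
        char path[128], buf[64];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu0/cache/%s/%s", idx, attr);
        f = fopen(path, "r");
        if (!f)
            return;
        if (fgets(buf, sizeof(buf), f))
            printf("  %s: %s", attr, buf); /* buf keeps its newline */
        fclose(f);
    }

    int main(void)
    {
        /* index0/index1 are usually the L1 data/instruction caches,
         * index2 the L2, index3 the L3. */
        const char *indexes[] = { "index0", "index1", "index2", "index3" };

        for (int i = 0; i < 4; i++) {
            printf("%s\n", indexes[i]);
            show(indexes[i], "level");
            show(indexes[i], "type");
            show(indexes[i], "size");
            show(indexes[i], "shared_cpu_list");
        }
        return 0;
    }

    On an FX-8150 with the usual core enumeration, I would expect index2 (L2) to report shared_cpu_list 0-1 (the two cores of the first module) and index3 (L3) to report 0-7.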

  6. #6
    Join Date
    Jul 2008
    Location
    Greece
    Posts
    3,798

    Default

    Quote Originally Posted by ifkopifko View Post
    @RealNC: That is not the case with BD modules. Whether one core per module is active or two, all of them share the whole L3 cache. And if only one core per module is activated, that core gets the 2MB of L2 to itself, which it would otherwise have to share with its sibling.
    I'm afraid I didn't understand the above.

    In my thinking, it seems better to have a larger, shared cache rather than multiple smaller, non-shared ones.

  7. #7
    Join Date
    Nov 2009
    Location
    Madrid, Spain
    Posts
    398

    Default

    Quote Originally Posted by RealNC View Post
    I don't view it that way. If you're gonna have, say, 8MB of cache for 4 cores, it's better to make it shared rather than give each core 2MB. That way, on loads that involve fewer cores, each active core effectively gets more cache (on a two-thread load you have 4MB per core).

    But of course that view comes from someone who doesn't know the details behind CPU cache memory :-P
    The issue is not how you would like it to work: a big shared cache is theoretically better, but in a multi-threaded context it can make things slower. Take a typical case, a 'make -j9' (meaning nine jobs trying to stress all eight CPU cores). With small per-core caches, every time a process switch happens the incoming process effectively blanks that core's cache. That is bad, but not that bad: the compiler's working set may fit the cache fairly well, and even when the blanking happens it only hits one core's cache, for a theoretical worst-case slowdown of around 12.5 percent (one cache out of eight) on memory accesses. Now take a 'make -j8' with one shared cache: processes 1 through 8 start out evicting each other's data, and just as some of it becomes common and useful, one process ends and a new process 9 clears the shared cache yet again, wasting all the "locality" the cache had built up.
    Another issue is how the caches are built. L1 is closest (in distance and in CPU cycles) to the execution units, L2 is a bit further away, and L3 (what we think of as the shared cache) is the slowest: the signals simply have to travel farther to move data from the cache to any specific core. The solution most people would propose, 1MB of L1 for every core (the opposite of 8MB of shared L3), would make synchronization between cores impractical; possible, as on the Athlon X4, but slower, since L3 is also used for syncing data between the cores.
    Getting the cache hit/miss ratio and branch prediction right (on a cache miss, the predictor has to keep useful computation going during the wait for memory) is very hard, and succeeding takes work on two fronts: how the software is written (on Bulldozer, your multi-threaded process should try to keep its most-used logic within the 2MB per module) and how to stay clear of the architecture's bottlenecks.
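
    You can actually see that L1/L2/L3 distance with a pointer chase: walk a randomly shuffled linked list, so every load depends on the previous one, and the time per load jumps each time the working set outgrows a cache level. A rough sketch (the working-set sizes are my assumption, picked around a Bulldozer-like 16K L1 / 2M L2 / 8M L3 hierarchy):

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* One cache line per node, so each step touches a fresh line. */
    struct node {
        struct node *next;
        char pad[64 - sizeof(struct node *)];
    };

    /* Average ns per dependent load over a random cyclic walk. */
    static double chase(size_t nodes, size_t steps)
    {
        struct node *buf = malloc(nodes * sizeof(*buf));
        size_t *order = malloc(nodes * sizeof(*order));
        struct timespec t0, t1;
        volatile struct node *p;

        /* Crude shuffle: a random cyclic permutation defeats the
         * hardware prefetcher. */
        for (size_t i = 0; i < nodes; i++)
            order[i] = i;
        for (size_t i = nodes - 1; i > 0; i--) {
            size_t j = rand() % (i + 1);
            size_t tmp = order[i]; order[i] = order[j]; order[j] = tmp;
        }
        for (size_t i = 0; i < nodes; i++)
            buf[order[i]].next = &buf[order[(i + 1) % nodes]];

        clock_gettime(CLOCK_MONOTONIC, &t0);
        p = &buf[order[0]];
        for (size_t i = 0; i < steps; i++)
            p = p->next;
        clock_gettime(CLOCK_MONOTONIC, &t1);

        free(buf);
        free(order);
        return ((t1.tv_sec - t0.tv_sec) * 1e9 +
                (t1.tv_nsec - t0.tv_nsec)) / steps;
    }

    int main(void)
    {
        /* Working sets meant to land in L1, L2, L3, and main memory. */
        size_t kb[] = { 8, 512, 4096, 65536 };

        for (int i = 0; i < 4; i++) {
            size_t nodes = kb[i] * 1024 / sizeof(struct node);
            printf("%6zu KiB: %5.1f ns per load\n",
                   kb[i], chase(nodes, 10000000));
        }
        return 0;
    }

    Compile with -O2; the volatile pointer keeps the loop from being optimized away, and the printed staircase is exactly the distance effect I described above.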

  8. #8

    Default

    Quote Originally Posted by elanthis View Post
    You don't generally _want_ data to be shared between threads. That would just mean your threading architecture is all wrong and is being hamstrung by data dependencies/locking.

    Shared caches save money. They don't improve speed. (generally speaking, of course)
    So what we really want is tests of both, to see which is faster, and how Bulldozer behaves with fewer cores.

    And both AMD's 6-core chips and Intel's 2600 (Edit: the 2630QM has Hyper-Threading, so it probably works as a substitute) are really missing here for the full picture.
    Last edited by AnonymousCoward; 10-26-2011 at 07:07 AM.

  9. #9
    Join Date
    Sep 2011
    Posts
    29

    Default

    Good article. You did a great job of showing the difference between eight semi-real cores and Hyper-Threading.

  10. #10
    Join Date
    Dec 2008
    Location
    San Bernardino, CA
    Posts
    232

    Default

    Quote Originally Posted by nepwk View Post
    Good article. You did a great job of showing the difference between eight semi-real cores and Hyper-Threading.
    Agreed. Thank you Michael, this was a very informative article!
