
Thread: 12-Core ARM Cluster Benchmarked Against Atom, Ivy Bridge, Fusion

  1. #11
    Join Date
    Jan 2012
    Posts
    151

    Here are my 2 cents:

    First, Michael, could you please also create normalized versions of the graphs on pages 4 and 5 that show how the PandaBoard cluster scales? That would be most helpful, since the cluster's parallelization would be more apparent in that format.
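
    One way to produce such a normalized view is to convert each result into a speedup and a parallel efficiency relative to the smallest core count. The following is only a sketch: the function name and the sample runtimes are hypothetical and are not taken from the article, and it assumes the measurements are runtimes in seconds (for throughput scores, where higher is better, the ratio would be inverted).

```python
def normalize_scaling(runtimes_by_cores):
    """Return {cores: (speedup, efficiency)} relative to the smallest core count."""
    base_cores = min(runtimes_by_cores)
    base_time = runtimes_by_cores[base_cores]
    result = {}
    for cores, seconds in sorted(runtimes_by_cores.items()):
        speedup = base_time / seconds   # how much faster than the baseline run
        ideal = cores / base_cores      # speedup under perfect linear scaling
        result[cores] = (speedup, speedup / ideal)
    return result


# Hypothetical runtimes in seconds, NOT measurements from the article.
example = {2: 600.0, 4: 320.0, 8: 180.0, 12: 150.0}
for cores, (speedup, efficiency) in normalize_scaling(example).items():
    print(f"{cores:2d} cores: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

    An efficiency that drops quickly as cores are added would point to poor parallelization or an interconnect bottleneck.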

    Second, looking at the Fusion benchmarks (this is just the CPU I happened to pick for a quick analysis): while the PandaBoard cluster is indeed ~3x faster, I think there is more to the story. What if we had a similar cluster composed of 4 AMD Fusion systems?

    Looking at the costs on eBay, we could build a bare-bones cluster:
    - AMD Fusion E-350 + Asus E35M1 PRO motherboard: $120 (there's a $20 rebate, but I'm leaving it out, so it could potentially be $100)
    - 4 GB RAM: $50
    - 64 GB OCZ SSD: $60

    TOTAL: $230. Four of these would cost $920 and put the cluster's throughput above that of the PandaBoard cluster. However, let's suppose that the parallelization is quite poor and the Fusion cluster only reaches the exact same throughput as the PandaBoard cluster (this makes the following calculation easier).

    The difference in cost between the two systems is $280, and for the NAS Parallel Benchmarks EP.C test the AMD systems would draw 180 W while the PandaBoard cluster drew 30 W, a difference of 150 W. How long would you have to run these systems continuously before it makes sense to buy the PandaBoard cluster (supposing 20c/kWh)?

    280 * 100 * 1000 / (20 * 150) = 9333 hrs ~ 389 days.
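
    The same break-even arithmetic, written out as a small sketch so the inputs can be varied. The function name is made up for illustration; the numbers are the ones used above ($280 price difference, 180 W vs. 30 W, 20c/kWh).

```python
def break_even_hours(extra_cost_usd, power_saved_watts, price_cents_per_kwh):
    """Hours of continuous operation until the energy savings repay the extra cost."""
    # Dollars saved per hour = kW saved * dollars per kWh.
    savings_usd_per_hour = (power_saved_watts / 1000.0) * (price_cents_per_kwh / 100.0)
    return extra_cost_usd / savings_usd_per_hour


hours = break_even_hours(extra_cost_usd=280, power_saved_watts=180 - 30,
                         price_cents_per_kwh=20)
print(f"{hours:.0f} hours, about {hours / 24:.0f} days")
```

    Halving the electricity price doubles the break-even time, so the result is quite sensitive to local power costs.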

    I'm not trying to say that the PandaBoard is better or worse than the other systems. I'm really only trying to show that in some cases the cost of becoming efficient outweighs the gains from the efficiency, and for me this is also a very important quantity.

    I don't know how the PowerVR graphics compare to the Fusion graphics, but if you were doing OpenCL/GPU computations, this would be another factor in deciding which system to go for.

  2. #12
    Join Date
    Jan 2007
    Posts
    459

    Quote Originally Posted by FourDMusic View Post
    I don't know how the PowerVR graphics compare to the Fusion graphics, but if you were doing OpenCL/GPU computations, this would be another factor in deciding which system to go for.
    If you want to compare OpenCL/GPU compute, then at the very least you need to source and use evaluation boards with the current 4212 (previously called 4412, and then finally renamed the Exynos Quad) 1.4GHz A9 with the ARM Midgard Mali T-604 (with 4 cores, not 8 yet), since only the current ARM Midgard architecture covers the full OpenCL and other GPU compute specs.

    Come the 3rd quarter of 2012, I believe the Exynos Quad (and quad-core parts from other vendors) will also ship at 1.6+GHz (perhaps even with 8+ Midgard graphics cores) alongside the current 1.4GHz, so by then there will of course also be clock parity with the lower-power x86 offerings.
    Last edited by popper; 06-15-2012 at 12:20 PM.

  3. #13
    Join Date
    Dec 2008
    Location
    San Bernardino, CA
    Posts
    231

    Quote Originally Posted by AJenbo View Post
    You should be able to get a lower power usage by using a single power converter rather than one for each.
    Absolutely, one power supply powering the whole shebang vs. a separate power converter for each will certainly increase efficiency.

  4. #14
    Join Date
    Oct 2011
    Posts
    32

    > While that's true, of course, it seems you will have to wait until around the 3rd quarter for these Exynos 5 developer boards to appear in bulk

    Q3 is only 2 weeks away, and those chips (at least Samsung's version) have already been available for board development for 6 months. There are also rumors of a Galaxy Note 2 with a Cortex-A15 in October. But the most important point is that a Cortex-A15 @ 2GHz is twice as fast as a Cortex-A9 @ 1.2GHz and consumes less power. Tests from a few months ago showed that Samsung's Cortex-A9 was already far faster than the PandaBoard and every available Atom chip. The Phoronix Test Suite was not optimized at all for ARM, and Samsung is far more active in Linux development for its products.

  5. #15
    Join Date
    Sep 2008
    Posts
    989

    How many ARM processors would it take to equal the throughput of a 3770K at 4.2 GHz turbo mode with 1666 MHz DDR3?

    Answer: More than there are for sale in your local Verizon store.

    Now try comparing ARM processors to the Xeon E5-2687W... ho boy, look out. It may be Sandy Bridge, but it's a BEAST.

  6. #16
    Join Date
    Jul 2009
    Posts
    220

    Quote Originally Posted by FourDMusic View Post
    Looking at the costs on eBay, we could build a bare-bones cluster:
    - AMD Fusion E-350 + Asus E35M1 PRO motherboard: $120 (there's a $20 rebate, but I'm leaving it out, so it could potentially be $100)
    - 4 GB RAM: $50
    - 64 GB OCZ SSD: $60

    TOTAL: $230. Four of these would cost $920 and put the cluster's throughput above that of the PandaBoard cluster. However, let's suppose that the parallelization is quite poor and the Fusion cluster only reaches the exact same throughput as the PandaBoard cluster (this makes the following calculation easier).

    The difference in cost between the two systems is $280, and for the NAS Parallel Benchmarks EP.C test the AMD systems would draw 180 W while the PandaBoard cluster drew 30 W, a difference of 150 W. How long would you have to run these systems continuously before it makes sense to buy the PandaBoard cluster (supposing 20c/kWh)?

    280 * 100 * 1000 / (20 * 150) = 9333 hrs ~ 389 days.

    I'm not trying to say that the PandaBoard is better or worse than the other systems. I'm really only trying to show that in some cases the cost of becoming efficient outweighs the gains from the efficiency, and for me this is also a very important quantity.
    This is a relevant cost comparison, except that I don't think anybody would seriously use fully equipped 2-core boards. The ARM chips themselves are cheap (I heard prices of $5-7 somewhere; I'm not certain about it). Stripping off a lot of the other stuff reduces cost and power consumption per CPU. So I think the ARM server boards, when they arrive, will be a lot better than this cluster in terms of power consumption and price. Ivy Bridge may not look so pretty then, even without A15.

  7. #17

    Michael, you should try to include the High Performance Linpack benchmark (the one used for the TOP500) for MPI-related runs. Unfortunately this is not trivial. The ATLAS library must be recompiled so that it is tuned for the hardware (otherwise you can see a very big performance loss, like 50%); alternatively, a hardware-specific BLAS implementation must be used (for Intel hardware there is MKL, but for ARM I don't think there is one). The Ethernet connection and the amount of RAM might be bottlenecks. The more RAM is available, the more efficient HPL is. It is not the best benchmark for FLOPS, but it is the sad standard.

    Anyway, it would be very nice to have HPL included in PTS if possible. The ATLAS compilation might just be added to the beginning of the test. It takes hours, but it is the only way to get decent performance.
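
    To illustrate why the amount of RAM matters for HPL: the benchmark factors a dense N x N matrix of 8-byte doubles, and a common rule of thumb is to size N so that the matrix fills roughly 80% of the cluster's total memory. The sketch below applies that rule. The 80% fraction, the block size of 128, and the example of 6 nodes with 1 GiB each are assumptions for illustration, not figures from the article.

```python
import math


def hpl_problem_size(total_ram_bytes, fraction=0.8, block_size=128):
    """Rule-of-thumb HPL problem size N for a given amount of cluster memory."""
    # The N x N matrix of doubles needs 8 * N^2 bytes; use only `fraction` of RAM.
    n = int(math.sqrt(fraction * total_ram_bytes / 8))
    # Round down to a multiple of the block size (NB in HPL.dat).
    return n - n % block_size


# Assumed example: 6 nodes with 1 GiB of RAM each.
print(hpl_problem_size(6 * 2**30))
```

    A larger N gives each node more computation per byte communicated, which is why small-memory boards on plain Ethernet tend to score poorly.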

  8. #18
    Join Date
    Feb 2009
    Posts
    9

    Pandas as a NAS is nonsense

    Hello,
    honestly speaking, you need SATA/SSD for a NAS and not just slow SDHC cards. It's a pity you don't have the option to try a cluster of, for example, Freescale i.MX3 Quick Start Boards or the new i.MX6. That may be better than Pandas, IMHO.
    Karel

  9. #19
    Join Date
    Apr 2012
    Posts
    5

    On ARM's side, a special multicore chip could be developed, say a hypothetical 12-core A9, which might outperform this cluster. The most specialized version I know of is the Calxeda EnergyCore: 4-core A9 chips with 4 MB of L2 cache, with 4 chips on a board (16 cores per board).

    On the cost and x86 side, a 3770K + Z77 board + 4 GB RAM + PSU costs around $470, so you could buy two 3770K systems. These can be undervolted to around 0.9 V and overclocked to 3.9 GHz at that voltage. I guess the ARM system would need a lot of time to make up for the difference in compute capacity through its low power usage, as FourDMusic states.

    On the other hand, you could have used a Celeron G530 system on a cheap H61 board, where an underclocked G530 consumes as little as 34 W under full load (with Linpack). That could have been more interesting.

    The only place where it is reasonable to use ARM cores is at high core counts (for which you need a special hard macro). Special 16-32 core ARM chips would have been more efficient and powerful.

  10. #20
    Join Date
    Nov 2012
    Posts
    1

    how to build

    Does anyone know what cluster implementation Michael was running?
    Kerrighed? OpenSSI?
    I bought myself a mini cluster based on the Parallella boards on Kickstarter...
    Not sure what I'm going to run on them cluster-wise.
