Here are my 2 cents:
First, Michael, could you please also create normalized versions of the graphs on pages 4 and 5 showing how the PandaBoard cluster scales? This would be most helpful, since the cluster's parallelization would be more apparent in this format.
Second, looking at the Fusion benchmarks (this is just the CPU I happened to choose for a quick analysis), while the PandaBoard cluster is indeed ~3x faster, I think there is more to the story. What if we had a similar cluster composed of 4 AMD Fusion systems?
Looking at the costs on eBay, we could build a bare-bones cluster:
- AMD Fusion E-350 + Asus E35M1 PRO motherboard: $120 (there's a $20 rebate, but I'm leaving this out, so it could potentially be $100)
- 4 GB RAM: $50
- 64 GB OCZ SSD: $60
TOTAL: $230. Four of these would cost $920 and put the cluster's throughput above the PandaBoard cluster's. However, let's suppose that the parallelization is quite poor and scales to exactly the same throughput as the PandaBoard cluster (this makes the following calculations easier).
The PandaBoard cluster costs $280 more than the AMD systems, and for the NAS parallel EPC benchmark the AMD systems would be at 180W while the PandaBoard cluster was at 30W, a difference of 150W. How long would you have to run these systems continuously before it makes sense to buy the PandaBoard cluster (supposing 20c/kWh)?
280 * 100 * 1000 / (20 * 150) = 9333 hrs ~ 389 days.
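The same break-even arithmetic can be written as a small Python sketch, which makes it easy to plug in other prices or power figures. The function name and parameter names are made up for illustration; the numbers are the ones from the post ($280 price difference, 150W power difference, 20c/kWh).

```python
def break_even_hours(extra_cost_dollars, power_saving_watts, cents_per_kwh):
    """Hours of continuous running before the power savings repay the extra cost.

    Mirrors the formula in the post: dollars -> cents (x100), watts -> kW (x1000
    in the numerator), divided by the cents saved per kWh times the watts saved.
    """
    return extra_cost_dollars * 100 * 1000 / (cents_per_kwh * power_saving_watts)


# Figures taken from the post above.
hours = break_even_hours(extra_cost_dollars=280, power_saving_watts=150, cents_per_kwh=20)
days = hours / 24
print(f"{hours:.0f} hrs ~ {days:.0f} days")
```

Note that this only counts electricity at the wall; cooling, idle time, and hardware resale value are left out, just as in the original estimate.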
I'm not trying to say that the PandaBoard is better/worse than the other systems. I'm really only trying to show that in some cases the cost to become efficient outweighs the gains from the efficiency, and this, for me, is also a very important quantity.
I don't know how the PowerVR graphics compare to the Fusion graphics, but if you were doing OpenCL/GPU computations, this would add another factor in deciding which system to go for.
If you are looking to compare OpenCL/GPU computation, then you need to source and use, at the very least, evaluation boards with the current 4212 (previously called 4412, and then finally renamed the Exynos Quad) 1.4 GHz A9 with the ARM Midgard Mali T-604 (with 4 cores, not 8 yet), as only the current ARM Midgard architecture covers the full OpenCL and other GPU compute specs.
Originally Posted by FourDMusic
Come the 3rd quarter of 2012, I believe the Exynos Quad (and quad-core chips from other vendors) will also come in 1.6+ GHz versions (perhaps even with 8+ Midgard graphics cores) alongside the current 1.4 GHz, so of course there will also be clock parity with the lower-power x86 offerings by then.
Last edited by popper; 06-15-2012 at 01:20 PM.
Absolutely, one power supply powering the whole shebang vs. a separate power converter for each will certainly increase efficiency.
Originally Posted by AJenbo
> While that's true, of course you will have to wait until around the 3rd quarter for these Exynos 5 developer boards to appear in bulk, it seems.
Q3 is only 2 weeks away, and those chips (at least the Samsung version) have already been available for motherboard development for 6 months. There are also rumors of a Galaxy Note 2 in October with a Cortex A15. But the most important point is that a Cortex A15 at 2 GHz is twice as fast as a Cortex A9 at 1.2 GHz and consumes less power. Some showed in the last A9 test a few months ago that the Samsung Cortex A9 was already far faster than the PandaBoard and every Atom chip available. The Phoronix Test Suite was not optimized at all for ARM, and Samsung is far more active in Linux development for their products.
How many ARM processors would it take to equal the throughput of a 3770K at 4.2 GHz turbo mode with 1666 MHz DDR3?
Answer: More than there are for sale in your local Verizon store.
Now try comparing ARM processors to the Xeon E5-2687W... ho boy, look out. It may be Sandy Bridge, but it's a BEAST.
This is a cost comparison, which is relevant, except that I don't think anybody would seriously use 2-core, fully equipped boards. The ARM chips themselves are cheap (I heard prices of $5-7 somewhere; not certain about it). Stripping a lot of the other stuff off reduces cost and power consumption per CPU. So I think the ARM server boards, when they arrive, will be a lot better than this cluster in terms of power consumption and price. Ivy Bridge may not look so pretty then, even without A15.
Originally Posted by FourDMusic
Michael, you should try to include the High Performance Linpack benchmark (the one used in the Top 500) for MPI-related runs. Unfortunately this is not trivial. The ATLAS library must be recompiled to be tuned for the hardware (otherwise you can have a very big performance loss, like 50%), or else a hardware-specific BLAS implementation must be used (for Intel hardware there is MKL, but for ARM I don't think there is one). The Ethernet connection and the amount of RAM might be a bottleneck. The more RAM is available, the more efficient HPL is. It is not the best benchmark for FLOPS, but it is the sad standard.
Anyway, it would be very nice to have HPL included in PTS if possible. The ATLAS compilation might just be added to the beginning of the test. It takes hours, but it is the only way to get decent performance.
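To illustrate why the amount of RAM matters for HPL: the benchmark factors a dense N x N matrix of doubles spread across all nodes, and a larger N generally gives better efficiency. A minimal Python sketch of the usual sizing estimate follows. The 80% memory fraction and the block size of 128 are common rules of thumb assumed here, not values from this thread, and the function name is made up for illustration. The example uses the four 4 GB nodes from the hypothetical AMD cluster discussed earlier.

```python
import math


def hpl_problem_size(total_ram_bytes, mem_fraction=0.8, block_size=128):
    """Estimate the HPL problem size N for a given amount of cluster RAM.

    Assumption: the N x N matrix of 8-byte doubles should fill about
    mem_fraction of total RAM, leaving the rest for the OS and MPI buffers.
    N is rounded down to a multiple of block_size (the HPL 'NB' parameter).
    """
    usable_bytes = total_ram_bytes * mem_fraction
    n = int(math.sqrt(usable_bytes / 8))  # 8 bytes per double-precision value
    return n - n % block_size


# Four nodes with 4 GiB each, as in the bare-bones AMD cluster above.
nodes = 4
ram_per_node = 4 * 1024**3
n = hpl_problem_size(nodes * ram_per_node)
print(f"Suggested N for {nodes} nodes: {n}")
```

Doubling the RAM only raises N by a factor of about 1.4, since the matrix grows with the square of N.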
Pandas as a NAS is nonsense
Honestly speaking, you need SATA/SSD for a NAS and not just slow SDHC cards. It's a pity you don't have the option to try a cluster of, for example, Freescale i.MX3 Quick Start Boards or the new i.MX6. That may be better than Pandas, IMHO.
On ARM's side, a special multicore chip could be developed, say a hypothetical 12-core A9, which might outperform this cluster. The most specialized version I know of is the Calxeda EnergyCore: 4-core A9 chips with 4 MB of L2 cache, 4 chips per board (16 cores per board).
On the cost and x86 side, a 3770K + Z77 board + 4 GB RAM + PSU costs around $470, so you can buy two 3770K systems. These can be undervolted to around 0.9 V and still be clocked at 3.9 GHz at that voltage. I guess an ARM system would need a lot of time to make up for its lower compute capacity with its low power usage, as FourDMusic states.
On the other hand, you could have used a Celeron G530 system on a cheap H61 board, where an underclocked G530 consumes as little as 34W under full load (with Linpack). So that could have been more interesting.
The only place where it is reasonable to use ARM cores is with a high number of cores (for which you need a special hard macro). Special 16-32 core ARM chips would have been more efficient and powerful.
how to build
Does anyone know what cluster implementation Michael was running?
I bought myself a mini cluster based on the Parallella boards on Kickstarter...
Not sure what I'm going to run on them cluster-wise.