Creator of ZFS, Jeff Bonwick, said that Linux scales badly. Many Unix people (including Kebabbert) say that Linux scales badly.

Linux supporters say that Linux scales excellently; they say Linux scales to 1000s of cores. So what is the deal, does Linux scale badly or not?

The thing is, Linux scales excellently on HPC servers (a big cluster, a bunch of PCs sitting on a fast network). Everybody says this, including Unix people. No one has denied that Linux scales excellently on a cluster. It is well known that most supercomputers run Linux. And those large supercomputers with 1000s of cores are always a big cluster. Google runs Linux on a cluster of 900,000 servers.




The famous Linux SGI Altix server is such an HPC clustered server: it has many thousands of cores and runs a single Linux kernel image:
http://www.sgi.com/products/servers/altix/uv/
Support for up to 16TB of global shared memory in a single system image enables Altix UV to remain highly efficient at scale for applications ranging from in-memory databases to a diverse set of data and compute-intensive HPC applications.
http://www.hpcprojects.com/products/...product_id=941
CESCA, the Catalonia Supercomputing Centre, has chosen the SGI Altix UV 1000 for its HPC system



Here we have another Linux HPC server. It has from 4,096 up to 8,192 cores and runs a single Linux image, just like the SGI Altix server:
http://www.theregister.co.uk/2011/09..._amd_opterons/
The vSMP hypervisor that glues systems together is not for every workload, but on workloads where there is a lot of message passing between server nodes – financial modeling, supercomputing, data analytics, and similar parallel workloads. Shai Fultheim, the company's founder and chief executive officer, says ScaleMP has over 300 customers now. "We focused on HPC as the low-hanging fruit,"
A programmer writes about this server:
I tried running a nicely parallel shared memory workload (75% efficiency on 24 cores in a 4 socket opteron box) on a 64 core ScaleMP box with 8 2-socket boards linked by infiniband. Result: horrible. It might look like a shared memory, but access to off-board bits has huge latency.
Thus, we see that these servers are just HPC servers. And if we look at the workload benchmarks for, say, the SGI Altix server, they are all parallel benchmarks. Clustered workloads.
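To put the programmer's number in perspective, here is a small sketch of how parallel efficiency relates to speedup. The function names are my own, not from any of the quoted sources, and the only input taken from the quote is "75% efficiency on 24 cores".

```python
def speedup_from_efficiency(efficiency: float, cores: int) -> float:
    """Speedup over a single core, given parallel efficiency (0..1) and core count."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return efficiency * cores


def efficiency_from_speedup(speedup: float, cores: int) -> float:
    """Parallel efficiency (0..1), given measured speedup and core count."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return speedup / cores


if __name__ == "__main__":
    # The figure from the quote: 75% efficiency on a 24 core Opteron box.
    s = speedup_from_efficiency(0.75, 24)
    print(f"75% efficiency on 24 cores is a {s:.1f}x speedup over one core")
```

The quote gives no number for the 64 core ScaleMP box, only "horrible", so nothing is computed for it here.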




However, let us talk about SMP servers. An SMP server is basically a single big fat server that has up to 64 cpus, weighs 1,000 kg and costs many millions of USD. For instance, the IBM z10 Mainframe has 64 cpus. The newest IBM Mainframe, the z196, has 24 cpus. The biggest IBM Unix server, the P795, has 32 cpus. The earlier IBM P595 used for the TPC-C world record cost 35 million USD list price. The biggest HP-UX server, the Superdome, has 32 cpus or so. The biggest Solaris server today has 64 cpus.

Why doesn't IBM just insert another 64 cpus and reclaim all world records from Oracle? Or another 128 cpus? Or even 512, or 1024? Because it is not possible. There are big scalability problems already at 64 cpus when you use an SMP server.
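One simple way to see why just adding cpus stops paying off is Amdahl's law. This is only a textbook model that I am bringing in as an illustration, not a measurement of any server mentioned here, and the 5% serial fraction below is a made-up number chosen for the example.

```python
def amdahl_speedup(serial_fraction: float, cpus: int) -> float:
    """Ideal speedup on `cpus` processors when `serial_fraction` of the work cannot be parallelized."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cpus)


if __name__ == "__main__":
    SERIAL_FRACTION = 0.05  # hypothetical value, not taken from any benchmark
    for cpus in (8, 32, 64, 128, 512, 1024):
        speedup = amdahl_speedup(SERIAL_FRACTION, cpus)
        print(f"{cpus:5d} cpus: {speedup:6.2f}x speedup")
```

With a 5% serial part the speedup can never exceed 20x in this model, no matter how many cpus you insert. Real SMP servers have more going on than this, so treat it as a rough picture only.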

On this kind of SMP server, Linux scales badly. The biggest Linux SMP servers today have 8 cpus; these are the normal x86 servers that, for instance, Oracle and HP sell. On an SMP server, Linux has severe problems scaling. The reason is explained by Ext4 creator Ted Ts'o:
thunk.org/tytso/blog/2010/11/01/i-have-the-money-shot-for-my-lca-presentation/
...Ext4 was always designed for the “common case Linux workloads/hardware”, and for a long time, 48 cores/CPU’s and large RAID arrays were in the category of “exotic, expensive hardware”, and indeed, for much of the ext2/3 development time, most of the ext2/3 developers didn’t even have access to such hardware. One of the main reasons why I am working on scalability to 32-64 nodes is because such 32 cores/socket will become available Real Soon Now...



We also see this in the SAP benchmarks. Linux used slightly faster CPUs and slightly faster DIMM memory sticks, and still Solaris got a 23% higher score. The Linux cpu utilization was 87%, which is quite bad on 48 cores. Solaris had 99% cpu utilization on 48 cores, because Solaris is targeted at big SMP servers. Solaris uses the cores better, and that is the reason Solaris got the higher benchmark score.

Linux, 48 core, AMD Opteron 2.8 GHz
download.sap.com/download.epd?context=40E2D9D5E00EEF7CCDB0588464276DE2F0B2EC7F6C1CB666ECFCA652F4AD1B4C

Solaris, 48 core, AMD Opteron 2.6GHz
http://download.sap.com/download.epd...11DE75E0922A14

Solaris used 256 GB RAM, and Linux used 128 GB RAM. The reason is that if the Linux HP server wants to use 256 GB RAM, then it must use slower DRAM memory sticks. So HP chose fast DRAM memory sticks which had lower storage capacity. But SAP benchmarks require only 48 GB RAM or so, so it does not matter whether a server uses 128 GB, 256 GB or 512 GB RAM.
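As a back-of-envelope check on the utilization numbers from the two SAP results, here is a small sketch that converts cpu utilization into "busy core equivalents". It uses only the 48 cores, 87% and 99% figures quoted above. The function name is my own, and treating busy core time as comparable between the two machines is my simplification, since the clock speeds differ (2.8 GHz vs 2.6 GHz).

```python
def busy_cores(cores: int, utilization: float) -> float:
    """Number of cores' worth of time actually spent working, given utilization (0..1)."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return cores * utilization


if __name__ == "__main__":
    linux = busy_cores(48, 0.87)    # Linux result: 87% utilization on 48 cores
    solaris = busy_cores(48, 0.99)  # Solaris result: 99% utilization on 48 cores
    print(f"Linux busy cores:   {linux:.2f}")
    print(f"Solaris busy cores: {solaris:.2f}")
    print(f"Solaris/Linux busy core ratio: {solaris / linux:.3f}")
```

The ratio comes out at roughly 1.14, so under this simplification Solaris kept about 14% more core time busy than Linux on the same number of cores.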




There are no big SMP Linux servers for sale on the market. Thus, the Linux kernel developers have no hardware to develop on, just as Ted Ts'o explained.

Thus, Linux scales excellently on clustered HPC servers. But Linux does not scale too well on SMP servers. Thus, Bonwick is right when he says that Linux scales badly on SMP servers. There are no big SMP Linux servers for sale on the market: there are no 16 cpu Linux servers, nor 24 cpu servers.

There are Linux SMP servers (4, 6 and 8 cpu servers), and then there are Linux HPC servers (4096 cores or more). But where are the 32 cpu Linux SMP servers? No one sells them. They are very difficult to make, because of big scalability problems. You need to rewrite the OS and build specialized hardware that costs millions of USD.