
Thread: Linux vs Solaris - scalability, etc


  1. #1
    Join Date
    Nov 2008
    Posts
    418

    Default Linux vs Solaris - scalability, etc

    Creator of ZFS, Jeff Bonwick, said that Linux scales badly. Many Unix people (including Kebabbert) say that Linux scales badly.

    Linux supporters say that Linux scales excellently; they say Linux scales to thousands of cores. So what is the deal - does Linux scale badly or not?

    The thing is, Linux scales excellently on HPC servers (a big cluster: a bunch of PCs sitting on a fast network). Everybody says this, including Unix people. No one has denied that Linux scales excellently on a cluster. It is well known that most supercomputers run Linux, and those large supercomputers with thousands of cores are always big clusters. Google runs Linux on a cluster of 900,000 servers.




    The famous Linux SGI Altix server with thousands of cores is such an HPC clustered server running a single Linux kernel image. This SGI Altix server has many thousands of cores:
    http://www.sgi.com/products/servers/altix/uv/
    Support for up to 16TB of global shared memory in a single system image enables Altix UV to remain highly efficient at scale for applications ranging from in-memory databases to a diverse set of data and compute-intensive HPC applications.
    http://www.hpcprojects.com/products/...product_id=941
    CESCA, the Catalonia Supercomputing Centre, has chosen the SGI Altix UV 1000 for its HPC system



    Here we have another Linux HPC server. It has 4,096 to 8,192 cores and runs a single Linux image, just like the SGI Altix server:
    http://www.theregister.co.uk/2011/09..._amd_opterons/
    The vSMP hypervisor that glues systems together is not for every workload, but on workloads where there is a lot of message passing between server nodes – financial modeling, supercomputing, data analytics, and similar parallel workloads. Shai Fultheim, the company's founder and chief executive officer, says ScaleMP has over 300 customers now. "We focused on HPC as the low-hanging fruit,"
    A programmer writes about this server:
    I tried running a nicely parallel shared memory workload (75% efficiency on 24 cores in a 4 socket opteron box) on a 64 core ScaleMP box with 8 2-socket boards linked by infiniband. Result: horrible. It might look like a shared memory, but access to off-board bits has huge latency.
    Thus, we see that these servers are just HPC servers. And if we look at the workload benchmarks for the SGI Altix server, for instance, they are all parallel benchmarks. Clustered workloads.
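
    Just to be clear about what "75% efficiency" in the quote above means: parallel efficiency is simply the speedup divided by the number of cores. A minimal sketch of that arithmetic (the timings are made-up placeholders, not measurements from the quoted box):

    Code:
    #include <stdio.h>

    /* Parallel efficiency = (T_serial / T_parallel) / ncores.
     * The timings below are hypothetical, for illustration only. */
    int main(void) {
        double t_serial   = 100.0;  /* seconds on 1 core (hypothetical)   */
        double t_parallel = 5.55;   /* seconds on 24 cores (hypothetical) */
        int    ncores     = 24;

        double speedup    = t_serial / t_parallel;
        double efficiency = speedup / ncores;

        printf("speedup: %.1fx, efficiency: %.0f%%\n", speedup, efficiency * 100.0);
        return 0;
    }

    With these placeholder numbers the program prints a speedup of about 18x, i.e. roughly 75% efficiency on 24 cores; on the ScaleMP box the off-board latency drags that number way down.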




    However, let us talk about SMP servers. An SMP server is basically a single big fat server with up to 64 CPUs, weighing 1,000 kg and costing many millions of USD. For instance, the IBM Mainframe z10 has 64 CPUs. The newest IBM z196 Mainframe has 24 CPUs. The biggest IBM Unix server, the P795, has 32 CPUs. The earlier IBM P595 used for the TPC-C world record cost 35 million USD list price. The biggest HP-UX Superdome server has 32 CPUs or so. The biggest Solaris server today has 64 CPUs.

    Why doesn't IBM just add another 64 CPUs and reclaim all the world records from Oracle? Or add another 128 CPUs? Or even 512, or 1,024? No, it is not possible. There are big scalability problems already at 64 CPUs when you use an SMP server.

    On these kinds of SMP servers, Linux scales badly. The biggest Linux SMP servers today have 8 CPUs; these are the normal x86 servers that, for instance, Oracle and HP sell. On an SMP server, Linux has severe problems scaling. The reason is explained by ext4 creator Ted Ts'o:
    thunk.org/tytso/blog/2010/11/01/i-have-the-money-shot-for-my-lca-presentation/
    ...Ext4 was always designed for the “common case Linux workloads/hardware”, and for a long time, 48 cores/CPU’s and large RAID arrays were in the category of “exotic, expensive hardware”, and indeed, for much of the ext2/3 development time, most of the ext2/3 developers didn’t even have access to such hardware. One of the main reasons why I am working on scalability to 32-64 nodes is because such 32 cores/socket will become available Real Soon Now...



    We also see this in the SAP benchmarks. Linux used slightly faster CPUs and slightly faster DIMM memory sticks, and still Solaris got a 23% higher score. The Linux CPU utilization was 87%, which is quite bad on 48 cores. Solaris had 99% CPU utilization on 48 cores, because Solaris is targeted at big SMP servers. Solaris uses the cores better, and that is the reason Solaris got the higher benchmark score.

    Linux, 48 core, AMD Opteron 2.8 GHz
    download.sap.com/download.epd?context=40E2D9D5E00EEF7CCDB0588464276DE2F0B2EC7F6C1CB666ECFCA652F4AD1B4C

    Solaris, 48 core, AMD Opteron 2.6GHz
    http://download.sap.com/download.epd...11DE75E0922A14

    Solaris used 256 GB RAM, and Linux used 128 GB RAM. The reason is that if the Linux HP server had wanted to use 256 GB RAM, it would have had to use slower DRAM memory sticks, so HP chose faster memory sticks with lower capacity. But the SAP benchmark only needs about 48 GB RAM, so it does not matter whether a server uses 128 GB, 256 GB or 512 GB RAM.
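
    To make the "Solaris uses the cores better" point concrete, here is a minimal sketch that normalizes the two results by core count and clock speed. Only the 23% score gap and the 2.8/2.6 GHz clocks come from the benchmark discussion above; the absolute score is an arbitrary baseline, not a real SAPS number:

    Code:
    #include <stdio.h>

    /* Rough per-core, per-GHz comparison of the SAP results discussed above.
     * The baseline score is arbitrary; only the 23% gap and the clock
     * speeds are taken from the posts. */
    int main(void) {
        double linux_score   = 100.0;   /* arbitrary baseline           */
        double solaris_score = 123.0;   /* 23% higher, as stated above  */
        double linux_ghz = 2.8, solaris_ghz = 2.6;
        int    cores = 48;

        double linux_norm   = linux_score   / (cores * linux_ghz);
        double solaris_norm = solaris_score / (cores * solaris_ghz);

        printf("score per core-GHz: Linux %.3f, Solaris %.3f (+%.0f%%)\n",
               linux_norm, solaris_norm,
               (solaris_norm / linux_norm - 1.0) * 100.0);
        return 0;
    }

    With those inputs the per-core-GHz gap works out to roughly 32%; the 87% vs 99% CPU utilization figures quoted above account for part of that difference.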




    There are no big SMP Linux servers for sale on the market. Thus, the Linux kernel developers have no hardware to develop on, just as Ted Ts'o explained.

    Thus, Linux scales excellently on clustered HPC servers, but Linux does not scale too well on SMP servers. Thus, Bonwick is right when he says that Linux scales badly on SMP servers. There are no big SMP Linux servers for sale on the market; there are no 16-CPU Linux servers, nor 24-CPU servers.

    There are Linux SMP servers with 4, 6 or 8 CPUs, and then there are Linux HPC servers with 4,096 cores or more. But where are the 32-CPU Linux SMP servers? No one sells them. They are very difficult to make. Big scalability problems. You need to rewrite the OS and build specialized hardware that costs millions of USD.

  2. #2

    Default

    Ok, answer to my comments here.

  3. #3
    Join Date
    Jun 2010
    Posts
    71

    Default

    Quote Originally Posted by kraftman View Post
    Ok, answer to my comments here.
    Let the wars begin!

  4. #4
    Join Date
    Nov 2008
    Location
    Germany
    Posts
    5,411

    Default

    Most people cannot tell the difference between file-system performance and kernel performance.

    Solaris has the most advanced file-system, and the best implementation of that file-system.

    If you only benchmark single, raw kernel functions, Linux beats Solaris.
    Linux beats all other operating systems in this regard.

    your "SAP" benchmark is more complex and its also a file-system benchmark.

    BTRFS is designed to compete with ZFS, but it is not ready yet; for example, there is no checkdisk right now.

    Linux will be ready to compete with ZFS through BTRFS in a year or two.

  5. #5

    Default

    Kebabbert, you appear to be quite confused when it comes to terminology, and hence your posts on multiprocessor scalability tend to read like complete nonsense. You always state that you "just want to learn and be correct", so let me try to clear things up for you before we even get into the discussion of performance or scaling numbers.
    First, you seem to think that HPC and SMP are comparable and opposing terms. They are not - SMP (Symmetric Multi-Processing) describes a specific architecture for multiprocessor systems, while HPC (High Performance Computing) is a _usage scenario_ for large multiprocessor systems. HPC can be done on SMP systems, on distributed clusters, or more recently via GPGPU - it all depends on the specific HPC task.

    Also, "SMP" in and of itself is today a somewhat "fuzzy" term, since there are barely any modern server system on the market nowadays that really fit the hard "SMP" criteria. The large "SMP" systems are almost all NUMA designs.

    In any case, the distinction you're trying to make when you mistakenly contrast "HPC" vs. "SMP" is really between _shared memory multiprocessor systems_ versus _distributed memory multiprocessor systems_.

    The systems that you like to refer to as "large SMP" are shared memory systems. The "cluster of small systems on a fast network" is a distributed memory system. And then, thanks to the wonders of hypervisors and hardware virtualization, we now have "hybrid" systems like ScaleMP, which essentially create a virtual shared memory system on top of a distributed memory cluster. In these systems, the "single OS image" is really a virtual machine, and the "shared memory" is "faked" via a software layer in the hypervisor. These virtual shared memory systems can theoretically run applications written for shared memory systems, but as shown by the link in your post they won't perform well, since essentially they behave like a NUMA machine with an absolutely ATROCIOUS remote-to-local (R:L) NUMA factor. On top of that, cache coherency has to be done in _software_. I don't even like to think about the performance implications of that...

    Now, once we have these distinctions cleared up, let's just lay down some basic facts:

    1. The SGI Altix UV and Altix 4000/3000 systems are NOT distributed memory systems, they are shared memory systems. They are the same "class" of multiprocessor servers as what you (Kebabbert) would refer to as "large enterprise SMP systems", such as the HP SuperDome, Oracle T & M-series enterprise servers, etc.
    2. The fact that the Altix UV systems add processor nodes in blade format doesn't change fact 1, since what the blades plug into is not a network but a NUMA interconnect - the equivalent in the current Oracle servers would be the Jupiter interconnect used for the crossbars. I've seen some people argue that the M9000 isn't NUMA since it's a crossbar architecture - more on that in the notes below.
    3. The relatively recent Altix ICE systems ARE clusters. These did not exist until 2010 - every large multiprocessor system sold by SGI prior to Altix ICE has been a shared memory NUMA system.
    4. Even a single AMD Opteron 6100-series CPU contains two NUMA nodes, which really underlines just how clouded the line between SMP and NUMA systems is these days.


    Now some notes on the above - I've seen it argued in a few similar discussions that the M9000 is not a NUMA system, and Sun/Oracle likes to present it as such, and this is not as dishonest as it might initially seem. In the vast majority of NUMA systems, each CPU or CPU node has its own local memory with its own local memory controller, and CPUs from other nodes need to "traverse" that CPU's memory controller to access the remote memory. The M9000 uses a crossbar architecture that differs from this layout, which would make at least each CPU board "non-NUMA", since each CPU on a board has an "equal" distance to memory. BUT - the boards then connect to a system crossbar, and while that crossbar is very, very quick, there IS a NUMA factor for cross-CPU-board memory access, making the system NUMA after all. However, the NUMA factor is apparently so low that NUMA optimizations are not always beneficial (ref: http://kevinclosson.wordpress.com/20...-gives-part-ii). Perhaps they should have called this beast FloraServer and marketed it with the "I Can't Believe It's Not SMP!" slogan.
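
    As an aside, the "NUMA factor" being discussed here is just the ratio between remote and local memory access cost, and on Linux you can see the kernel's own view of it through libnuma. A minimal sketch, assuming the libnuma development headers are installed (compile with -lnuma):

    Code:
    #include <numa.h>
    #include <stdio.h>

    /* Print the node distance matrix the kernel exposes (from the ACPI SLIT
     * table). Local access is reported as 10, so a remote distance of 21
     * corresponds to a NUMA factor of roughly 2.1. */
    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }
        int nodes = numa_max_node() + 1;
        for (int i = 0; i < nodes; i++) {
            for (int j = 0; j < nodes; j++)
                printf("%4d", numa_distance(i, j));
            printf("\n");
        }
        return 0;
    }

    On a multi-socket Opteron or Westmere-EX box the off-diagonal entries will be larger than the local value of 10, which is exactly the "NUMA factor" being discussed.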


    Anyway, now that that's out of the way, let's at least touch on some of your other misconceptions, which seem to have grown out of this lack of basic understanding of different multiprocessor architectures.

    1. "No large Linux SMP systems" - maybe technically true since there aren't any large "true" SMP systems today at all. Even the "big" commodity boxes from Dell, HP and IBM are NUMA systems, since both the Opteron and the Westmere EX are NUMA designs in multi-socket configurations.

    2. But obviously, the biggest shared memory multiprocessor systems sold today are, in fact, "Linux servers".

    3. Item 2 has been true since 2006, when SGI's Origin 3000 series of NUMA machines was discontinued and the Altix systems became officially supported in 1024-CPU configurations.

    4. SGI has always been WAY ahead of SUN when it comes to massive shared memory multiprocessor systems:
    SGI's first SMP server was launched in 1988 and supported up to 8 CPUs. Sun did not launch its first SMP server until 1991, and it only did 4-way SMP at the time.
    In 1993, SGI launched a 36-way SMP server. SUN was at 20-way max.
    In 1996, SGI introduced its first NUMA systems, the Origin 2000, available in up to 128-way configurations that year. Sun introduced the E10000, a 64-way server.
    In 2000, when the Origin 2000 was superseded by the Origin 3000, the max configuration was 512 CPUs. SUN's biggest box was still the E10k.
    In 2001 the Origin 3000 still scaled to 512 CPUs (it did so from the get-go), and Sun introduced the Sun Fire 15K - maxing out at 106 CPUs, still less than SGI's initial Origin 2000 offering in 1996.
    In 2003 SGI announced its first Linux-based large scale Itanium-based NUMA systems, the Altix 3000 series. The initial max supported configuration was "only" 64 CPUs, and SUN had the E25K, with 72 CPUs and 144 "threads". While SGI was way ahead of SUN on hardware scaling, there's also no question that Solaris was ahead of Linux at this point. Nonetheless - already back in 2003, there was a large scale multiprocessor server with 64 CPUs being sold. And...
    In 2006, when the Origin 3000 series was discontinued, the Altix 4000 had already been introduced and scaled to 512 CPUs. SUN's biggest was the M9000, which at the time topped out at 128 real cores, or the E25K with its 72 cores and 144 threads.

    So as you can see, SGI has been making large scale shared memory multiprocessing and SMP systems for longer than SUN, and SGI had all of this knowledge in-house already when they went to work on making Linux scale well on huge numbers of CPUs.

    I'll get into the whole "but how WELL does Linux vs. Solaris scale on large shared memory multiprocessor systems" in a later post - I think this is TL;DRish enough.

  6. #6
    Join Date
    Nov 2008
    Posts
    418

    Default

    Quote Originally Posted by TheOrqwithVagrant View Post
    Lots of interesting stuff.


    1. The SGI Altix UV and Altix 4000/3000 systems are NOT distributed memory systems, they are shared memory systems. They are the same "class" of multiprocessor servers as what you (Kebabbert) would refer to as "large enterprise SMP systems", such as the HP SuperDome, Oracle T & M-series enterprise servers, etc.
    Ok, this is interesting. Forgive me for saying this, but I am just used to debating with Kraftman, and as you have seen, those "debates" tend to be quite... strange. I am not used to debating with sane Linux supporters. I have some questions about your interesting and informative post:

    1) You talk about the SGI Altix systems and call them SMP servers. Why don't people run typical SMP workloads on them, in that case, such as databases? They mostly run HPC workloads.

    2) Why do lots of links refer to the Altix systems as HPC servers?

    3) I read a link about these SGI Altix servers, and the engineers at SGI said "it was easier to construct the hardware than to convince Linus Torvalds to allow Linux to scale to the Altix's 2048 cores". If it is that easy to construct a true SMP server - is this not strange? I know it is easy to construct a cluster. But to construct an SMP server in less time than it takes to convince Linus Torvalds?!

    4) I suspect the Altix systems are cheaper than an IBM P795 AIX server. Why don't people buy a cheap 2048-core Altix server to run SMP workloads, instead of buying a 32-CPU IBM P795 server to run SMP workloads? The old 32-CPU IBM P595 server used for the old TPC-C record cost 35 million USD, list price. What do the Altix servers cost?

    5) Why don't IBM, Oracle and HP just insert 16,384 CPUs into their SMP servers, if it is as easy as you claim to build large SMP servers? Why are all of them still stuck at 32-64 CPUs? SGI has gone to thousands of cores, but everyone else with their mature enterprise Unix systems is stuck at 64 CPUs. Why is that?
    Last edited by kebabbert; 11-15-2011 at 08:49 AM.

  7. #7

    Default

    Quote Originally Posted by kebabbert View Post
    Ok, this is interesting. Forgive me for saying this, but I am just used to debating with Kraftman, and as you have seen, those "debates" tend to be quite... strange. I am not used to debating with sane Linux supporters. I have some questions about your interesting and informative post:

    1) You talk about the SGI Altix systems and call them SMP servers. Why don't people run typical SMP workloads on them, in that case, such as databases? They mostly run HPC workloads.

    2) Why do lots of links refer to the Altix systems as HPC servers?
    For the same reason Oracle/SUN are _not_ referring to their servers as HPC servers, even though they could be (and have been) used for that - it's not the company's primary market. SGI has traditionally had two primary markets: HPC and graphics. The graphics side has largely disappeared because commodity graphics cards have become so insanely powerful that there really is no market left for specialized million-dollar visualization systems. Likewise, their HPC business took a big hit because for the majority of HPC workloads, a distributed memory cluster is often the much better value proposition. However, there are still some HPC workloads out there that simply don't scale well on distributed memory systems, where a huge multi-terabyte dataset needs to be kept in memory all at once and be accessible to all threads. This is the rather tiny niche market for giant shared memory HPC systems like this, but fortunately for SGI, they are the ONLY ones to go to for this type of system, which means they are pretty much guaranteed to sell a few 1000+ core megasystems to entities such as the DoD, NSA and NASA - and a few of those systems sold is enough to recoup development costs and keep the business afloat.

    Quote Originally Posted by kebabbert View Post
    3) I read a link about these SGI Altix servers, and the engineers at SGI said "it was easier to construct the hardware than to convince Linus Torvalds to allow Linux to scale to the Altix's 2048 cores". If it is that easy to construct a true SMP server - is this not strange? I know it is easy to construct a cluster. But to construct an SMP server in less time than it takes to convince Linus Torvalds?!

    If you look at the little SGI history section in my previous post, you'll realize that by the time SGI was building the Altix version that scaled to 2048 CPUs, they already had 10+ years of experience with the NUMAlink architecture. Scaling the Altix up from 512 to 2048 CPUs probably wasn't all that hard compared to, say, designing the first Origin 2000 systems, which were truly groundbreaking. On the other hand, Linus is very protective of the Linux "core", and whenever a company tries to merge changes to the core that benefit no one but them, they are going to have to make damn sure that those changes don't _hurt_ any other use case before they get merged into mainline. In the case of scaling to enormous CPU counts, there was a lot of (legitimate) concern, and there were flaws in the early patches which hurt performance on "regular size" servers and embedded systems. Another example of an "epic struggle" with Linus is the Xen project's effort to get dom0 support into mainline Linux, which took 3+ years before the patches were of sufficient quality to be allowed in. Also, Linus's position nowadays, in his own words, is to be "the guy who says 'No'" - basically, he's the quality control for patches to the core of the Linux kernel. He does very little new development himself these days.

    Quote Originally Posted by kebabbert View Post
    4) I suspect the Altix systems are cheaper than an IBM P795 AIX server. Why don't people buy a cheap 2048-core Altix server to run SMP workloads, instead of buying a 32-CPU IBM P795 server to run SMP workloads? The old 32-CPU IBM P595 server used for the old TPC-C record cost 35 million USD, list price. What do the Altix servers cost?
    It's hard to find pricing information for the Altix UV, but as an example, a 1152-CPU configuration cost about $3.5 million. It might be "cheap" per CPU compared to a P795 (which I believe is about $4.5 million fully decked out with 256 cores), but that's to be expected, since the Altix UV uses commodity Xeon CPUs whereas the P795 has POWER7 CPUs. "Non-HPC" workloads like databases, Java app servers and the like are unlikely to be tuned to scale beyond the largest SMP configurations sold by the standard enterprise vendors, so a system bigger than 256 CPUs might simply be wasted on "standard" enterprise SMP workloads. That said, SGI does seem to be cautiously trying to "branch out" their market with the publication of the SPECjbb benchmarks and by becoming an officially supported platform for Oracle 11g R2 last year. I haven't seen any database benchmarks yet, but I'll be very curious to see those numbers.

    Quote Originally Posted by kebabbert View Post
    5) Why don't IBM, Oracle and HP just insert 16,384 CPUs into their SMP servers, if it is as easy as you claim to build large SMP servers? Why are all of them still stuck at 32-64 CPUs? SGI has gone to thousands of cores, but everyone else with their mature enterprise Unix systems is stuck at 64 CPUs. Why is that?
    It's "easy" for SGI because they've been doing it for a long time, and they designed their entire architecture towards the goal of being able to scale up almost indefinitely. This was their target market, and they knew at the time of these decisions that they already had customers for systems of that size. The market for extremely large shared memory systems is not really growing, hence it's not money-well-spent for Oracle, HP or IBM to change their existing architectures toward this goal and try to "muscle in" on the niche that's pretty much owned by SGI. Most of the other "big iron" vendors are increasing their "Max cores per server" largely as a side effect of the trend towards fitting more cores/threads on one CPU die than changing their large servers to support more CPU sockets, hence they get their upward scaling "for free" due to the trends in CPU manufacturing and design.
    Second - there's a very good chance that SGI owns a bucketload of patents on this type of scalable architecture, and if Oracle or IBM tried to design a modular, "infinitely expandable" shared memory design like SGI's, they'd be walking into a patent minefield.

    Finally - from what I see at actual customer sites, the large systems these days are VERY rarely used to run a single OS image; most are partitioned, or run hypervisors with a ton of VMs. Hence, the market for single-OS-image servers larger than what the M9000 and P795 can provide is, outside of HPC, vanishingly small. That said, the Altix UV (as opposed to the older Itanium-based Altixes, which were MUCH more expensive) does put SGI in a position where it can actually be competitive on the enterprise side of that market, and it will be interesting to see whether they have any success with this.

  8. #8

    Default

    Quote Originally Posted by devsk View Post
    Let the wars begin!
    Damn right, but so far I can't see Kebabbert's response. I hope you will reply to my last comment in this thread, Kebb. If you replied somewhere else, it would be great if you just copied and pasted your answer here.

  9. #9
    Join Date
    Nov 2008
    Location
    Germany
    Posts
    5,411

    Default

    Quote Originally Posted by kebabbert View Post
    On these kinds of SMP servers, Linux scales badly. The biggest Linux SMP servers today have 8 CPUs; these are the normal x86 servers that, for instance, Oracle and HP sell. On an SMP server, Linux has severe problems scaling. The reason is explained by ext4 creator Ted Ts'o:
    thunk.org/tytso/blog/2010/11/01/i-have-the-money-shot-for-my-lca-presentation/
    again and again and again you don't get the point.
    you just don't get it... it's not the Solaris kernel vs the Linux kernel, it's only ZFS vs ext4.
    and then you show us a benchmark with a file-system bottleneck ("SAP"),
    but you don't get it: any benchmark with a file-system bottleneck is invalid for testing the speed of a kernel!

    run your Solaris on a FAT32 file-system and you will cry!

    but it doesn't matter; they are working on BTRFS and they are working on ext4.

    and in the future Solaris will lose on this front, I'm sure.

    and then you will get nice "SAP" benchmarks on Linux too.

  10. #10
    Join Date
    Nov 2008
    Posts
    418

    Default

    Quote Originally Posted by Qaridarium View Post
    again and again and again you don't get the point.
    you just don't get it... it's not the Solaris kernel vs the Linux kernel, it's only ZFS vs ext4.
    and then you show us a benchmark with a file-system bottleneck ("SAP"),
    but you don't get it: any benchmark with a file-system bottleneck is invalid for testing the speed of a kernel!

    run your Solaris on a FAT32 file-system and you will cry!
    These benchmarks are not bottlenecked by the filesystem. One of the reasons these machines have a lot of RAM is so they can cache a lot. Have you heard about the disk cache?



    You talk about "ZFS being the fastest filesystem", but that is not true. ZFS is the most _advanced and secure_ filesystem, but ZFS is actually quite slow. The reason is that ZFS protects your data and does checksum calculations all the time, and it burns a lot of CPU to do that. If you do an MD5 or SHA-256 checksum on every block, then everything will be slow, yes? Have you ever done an MD5 checksum on a file? It takes time, yes?
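
    If you want to see for yourself how much CPU per-block checksumming costs, here is a minimal sketch that hashes a buffer in 128 KB chunks with OpenSSL's SHA-256 and prints the single-core throughput. The 128 KB chunk size is just an assumption meant to resemble a ZFS-style record size, and this only mimics the checksum part, not everything else ZFS does (compile with -lcrypto):

    Code:
    #include <openssl/evp.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    /* Hash 1 GB worth of 128 KB blocks with SHA-256 on one core and report
     * the throughput. Compare the printed MB/s figure against what your
     * disks can deliver to judge the per-block checksumming overhead. */
    int main(void) {
        const size_t block = 128 * 1024;
        const size_t total = 1UL << 30;        /* 1 GB */
        unsigned char *buf = malloc(block);
        unsigned char md[EVP_MAX_MD_SIZE];
        unsigned int mdlen;
        struct timespec t0, t1;

        memset(buf, 0xab, block);
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t done = 0; done < total; done += block)
            EVP_Digest(buf, block, md, &mdlen, EVP_sha256(), NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("SHA-256 on one core: %.0f MB/s\n", (total / 1048576.0) / secs);
        free(buf);
        return 0;
    }

    Whatever number this prints on your machine is an upper bound on how fast a single core can verify SHA-256-checksummed blocks; spread the work across many disks and many CPUs and the cost is hidden, which is exactly the point about ZFS scaling made further down.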



    Ext4 and other filesystems do not protect your data this way; they skip those calculations and are very fast on a few disks. As ext4 creator Ted Ts'o explains:
    http://phoronix.com/forums/showthrea...904#post181904
    Quote Originally Posted by Ted Tso
    In the case of reiserfs, Chris Mason submitted a patch 4 years ago to turn on barriers by default, but Hans Reiser vetoed it. Apparently, to Hans, winning the benchmark demolition derby was more important than his user's data. (It's a sad fact that sometimes the desire to win benchmark competition will cause developers to cheat, sometimes at the expense of their users.)
    ...
    In the case of ext3, it's actually an interesting story. Both Red Hat and SuSE turn on barriers by default in their Enterprise kernels. SuSE, to its credit, did this earlier than Red Hat. We tried to get the default changed in ext3, but it was overruled by Andrew Morton, on the grounds that it would represent a big performance loss, and he didn't think the corruption happened all that often --- despite the fact that Chris Mason had developed a python program that would reliably corrupt an ext3 file system


    Here is another cheat that Linux developers have used to win benchmarks:
    http://kerneltrap.org/mailarchive/li...6359123/thread
    "This is really scary. I wonder how many developers knew about it especially when coding for Linux when data safety was paramount. Sometimes it feels that some Linux developers are coding to win benchmarks and do not necessarily care about data safety, correctness and standards like POSIX."



    One person said that when he does an XFS fsck of 16 TB of data, it takes 20 minutes or so. That is extremely quick; ZFS takes many hours to check 16 TB. Thus, XFS is much faster. But if XFS does an fsck of 16 TB in 20 minutes, it means XFS checks about 13,000 MB/sec. That is not possible on a small 16 TB array with 8 disks, because one typical disk gives 100 MB/sec. The only conclusion is that XFS does not check all the data, only some of it. XFS fsck skips a lot of data checks; that is the only way XFS can reach 13 GB/sec on 8 disks. Thus, XFS is fast, but not safe.
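
    The arithmetic behind that conclusion, as a minimal sketch (the 16 TB, 20 minutes, 8 disks and 100 MB/sec figures are the ones from the paragraph above):

    Code:
    #include <stdio.h>

    /* Back-of-the-envelope check of the XFS fsck claim discussed above. */
    int main(void) {
        double data_mb      = 16.0 * 1000.0 * 1000.0; /* 16 TB in MB      */
        double fsck_seconds = 20.0 * 60.0;            /* 20 minutes       */
        double disk_mb_s    = 100.0;                  /* one typical disk */
        int    disks        = 8;

        double implied_mb_s = data_mb / fsck_seconds;
        double array_mb_s   = disks * disk_mb_s;

        printf("implied fsck rate: %.0f MB/sec, array can deliver: %.0f MB/sec\n",
               implied_mb_s, array_mb_s);
        return 0;
    }

    This prints an implied rate of about 13,333 MB/sec against roughly 800 MB/sec of raw disk bandwidth, which is the gap the argument above relies on.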



    A Linux user writes:
    http://hydra.geht.net/tino/howto/zfs/
    ZFS uses ...loads of CPU: If you have more than one CPU, expect to dedicate 1 CPU to ZFS, when your programs do IO on ZFS! If you have a single CPU, expect 95% CPU spent for ZFS when copying. (This disadvantage is due to internal compression, bookkeeping, checksum and error correction of the ZFS code. It cannot be evaded.)
    Now ZFS has been improved, so it does not burn as many CPU cycles for a simple copy, but ZFS still burns a lot of CPU cycles because of the data protection calculations.



    There are data protection calculations going on all the time, on every block. The CPU burns a lot of cycles checking that all data is correct. If you do a SHA-256 checksum on every block that is read, it burns CPU cycles; I hope you know how slow checksum calculations are?
    http://arstechnica.com/staff/fatbits/2005/12/2049.ars
    ZFS trades CPU cycles for peace of mind. Every block is checksummed after it arrives from the disk (or network or whatever).

    One thing Jeff Bonwick's post doesn't go into, however, is the actual cost in CPU cycles of all this checksumming....is it well suited to more traditional desktop CPUs with one or two cores? How much CPU overhead is acceptable for a desktop file system driver?

    My only regret is that I have but one CPU core to give for my data...


    Sure, with many disks ZFS is faster, because it scales better to 128 disks and more. Solaris has no problem using many CPUs, and 128 disks burn a lot of CPU, but that load gets distributed across many CPUs. Thus, you need many disks and a lot of CPU for ZFS to be the fastest.

    On a single disk or a few disks, I expect most filesystems to be faster, because other filesystems don't burn lots of CPU cycles protecting your data. And if we look at the benchmarks, ZFS is never the fastest on a single disk.

    The point of using ZFS is that it protects your data. What do you prefer: a slow but reliable filesystem, or a fast but unsafe filesystem that might corrupt your data? All the fancy functionality of ZFS is secondary to me. I mean it. If ZFS lacked all those functions, I would still use ZFS, because I want to protect my data. That is what matters most to me. I don't mind if it is a bit slower, as long as it is safe.

    Because of all the CPU cycles that ZFS burns as soon as any block is read, Solaris is punished in these benchmarks. Thus, the Solaris benchmark numbers should actually be even higher.
