For desktop use, I am happy with ext4. If distros decide to enable btrfs by default later, that's fine too. None of the desktop systems I've built use RAID; if a machine has two hard disks, I just use them as two independent disks. I usually have a different operating system on each anyway, which makes RAIDing them impossible (in the case of a Windows/Linux dual boot). Sure, my desktop does a lot of I/O, but what I really care about is read performance (for load times); the only really write-heavy work I do is installing software, which is a one-time hit per application. For the smaller writes that occur as part of normal application usage (such as maintaining configuration settings in a SQLite database), what I really want is responsiveness: blocking I/O on these small writes should cost the app as little time as possible, so it can continue with what it's doing.
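As an illustration of the small-write case (this sketch is mine, not from any particular app; the file name and settings are hypothetical): SQLite's WAL journal mode is one way an application keeps its config writes from blocking, since commits append to a write-ahead log instead of rewriting journal state in the main database file.

```python
# Hypothetical example: a desktop app storing settings in SQLite,
# tuned so small writes block the app as little as possible.
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "settings.db")
conn = sqlite3.connect(db_path)

# Default rollback-journal mode does journal bookkeeping on the main
# database file at every commit; WAL mode appends to a separate
# write-ahead log, so small commits return to the app sooner.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]

# Relax fsync behaviour: durable enough for settings, less blocking.
conn.execute("PRAGMA synchronous=NORMAL")

conn.execute(
    "CREATE TABLE IF NOT EXISTS settings (key TEXT PRIMARY KEY, value TEXT)"
)
conn.execute("INSERT OR REPLACE INTO settings VALUES ('theme', 'dark')")
conn.commit()

value = conn.execute(
    "SELECT value FROM settings WHERE key='theme'"
).fetchone()[0]
print(mode, value)
conn.close()
```

Whether the filesystem underneath is ext4 or btrfs, this kind of tuning is what makes those tiny writes feel invisible to the user.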

I tend to back up the data I really care about, so industrial-strength data integrity is less important. But read performance... man... I boot and reboot all the time, testing things and installing things that require a reboot. Loading apps grinds the disk like crazy. Booting my main Ubuntu desktop (GNOME) takes almost as long as loading the desktop on my loaded-down Windows 7 machine. This is a simple consequence of having a very large number of programs installed, some of which start things when you log in to your session, and all those icons have to be read from disk, and so on. I want the best read performance for the desktop, and it looks like ext4 is it.

I also run a dedicated server with four 1.5TB HDDs. I have been seriously debating which file system (and indeed, which operating system) to use for this server. For isolation and security, my policy is an absolutely minimal host OS on the server. The host OS should be as clean and reliable as possible: it needs to set up networking, bring up SSH, and start the guest OSes that contain the actual services. It's a multi-purpose server, with different people sometimes in control of an entire guest instance, but I have a fairly good idea of exactly what software runs in each guest, even when it's under someone else's control.

The server has a Core i7 975 @ 3.33 GHz, 12GB DDR3, the aforementioned 4 x 1.5TB 7200rpm HDDs, and a 100Mbps symmetrical uplink sitting on a Level3 Frankfurt, Germany backbone.

Currently, the host OS is Fedora 13, but with most of the default programs purged. For security and bug fixes, I have periodic planned downtimes where I update the core packages on the host from the Fedora repos and compile the latest stable Linux kernel. I have a .config that I migrate from version to version, carefully considering each option to cut down on the number of modules that are built, disable features that are potential security risks (like Kprobes and unrestricted /dev/mem access), and so on.
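The per-release .config migration looks roughly like this (a sketch; the paths and version numbers here are hypothetical, and the option names are from mainline Kconfig):

```shell
# Migrate last release's .config into a freshly unpacked stable kernel.
cd /usr/src/linux-2.6.35
cp /root/kernel-configs/config-2.6.34 .config

# Ask only about options that are new in this release; everything
# carried over from the old .config is kept as-is.
make oldconfig

# Sanity-check the result before building, e.g. confirm that the
# risky features stayed off (Kprobes) or on (/dev/mem restrictions).
grep -E 'CONFIG_KPROBES|CONFIG_STRICT_DEVMEM' .config
```

The grep at the end is just a habit; reviewing the diff between the old and new .config works too.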

Since some of my users want to run services that require very fast packet I/O (real-time FPS gaming), I use a voluntary preemption model, but I stick with the 100 Hz timer. I haven't had any complaints about responsiveness with these settings.
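For reference, those two choices correspond to this fragment of the kernel configuration (standard Kconfig option names):

```
# Voluntary preemption: the kernel yields at explicit preemption
# points, giving good latency without full kernel preemption overhead.
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set

# 100 Hz periodic timer tick.
CONFIG_HZ_100=y
CONFIG_HZ=100
# CONFIG_HZ_1000 is not set
```

Many distros ship 250 or 1000 Hz kernels, but with voluntary preemption doing the latency work, 100 Hz has been enough for the gaming traffic.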

For guest isolation I use Linux-VServer. Obviously this means I have to patch my kernel with the latest Linux-VServer patches. Therefore I can't update my kernel until upstream Linux-VServer maintainers release the VServer patch against the latest stable kernel. The wait is usually not long.

The interesting part about Linux-VServer is that it has essentially zero guest overhead, because it's just a container solution. All the guests run on the same kernel as the host, but each guest environment is "private": the host OS can poke into the guests, but guests can't poke at each other. This is true even though all the guests share the same filesystem (as far as the host and the disk are concerned). Something similar to chroot is used for filesystem isolation, but presumably more robust. Guests can't load kernel modules, but I don't care; they don't need to. The biggest advantage for me is that a single filesystem serves all the guests, which is faster than running a filesystem on top of another filesystem (as you normally do with full virtualization: the guest's image is stored as a file on the host filesystem, and within that image is the guest filesystem).

I also run a KVM (Kernel Virtual Machine) guest, Windows Server 2008 R2 Standard. This is to support one of my users whose in-house server software is currently not ported to run on Linux. Obviously this is slower than Linux-VServer, but the performance is acceptable for the apps running in there.

Now, for the relevant stuff regarding this article.

Currently, I am using the Linux MD (Multiple Devices) RAID5 subsystem, with LVM2 on top of that, and ext4 on top of that. Reads are fast, as expected, but write speed is terrible! More importantly, though, the CPU usage from the md0_raid5 kernel thread is simply out of this world when any significant amount of writing is taking place. And since it occurs in a kernel thread, it basically blocks everything else out while the writes take place.
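For anyone unfamiliar with the stack, it is assembled roughly like this (a sketch only; the device names, volume names and sizes are hypothetical, not my actual layout):

```shell
# MD RAID5 across the four disks (one disk's worth of capacity
# goes to parity)...
mdadm --create /dev/md0 --level=5 --raid-devices=4 \
    /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

# ...LVM2 on top of the array...
pvcreate /dev/md0
vgcreate vg0 /dev/md0
lvcreate -L 500G -n guests vg0

# ...and ext4 on top of the logical volume.
mkfs.ext4 /dev/vg0/guests
```

Every write that reaches md0 triggers parity computation in the kernel thread, which is exactly where the pain described above comes from.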

This isn't a problem for small writes, but copying a 5GB file can effectively suspend all networking and make the entire box unresponsive (yes, even from a guest) while the RAID5 thread calculates all the parity. That makes me really uneasy about the stability of the system, even though very large writes don't happen often enough to make a difference in practice.

I have tried many MD tweaks to resolve this, but it seems to be an inherent misfeature of Linux MD RAID5.
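To give a flavour of the tweaks I mean (the values are illustrative, and md0 stands in for whatever your array is called):

```shell
# Bigger stripe cache, so more parity work is batched per pass
# (costs memory: entries * 4KB * number of member disks).
echo 8192 > /sys/block/md0/md/stripe_cache_size

# Cap resync/rebuild bandwidth (KB/s) so it can't starve normal I/O.
echo 50000 > /proc/sys/dev/raid/speed_limit_max

# Keep less dirty data queued in the page cache, so writeback
# happens in smaller, less disruptive bursts.
sysctl -w vm.dirty_ratio=5 vm.dirty_background_ratio=2
```

These soften the symptoms a little, but none of them change the fundamental cost of computing parity for a multi-gigabyte write in a single kernel thread.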

Right now, I don't think I am in a position to move to OpenSolaris (because I am not confident in Oracle's willingness to continue supporting it), nor to FreeBSD (because I am not confident in FreeBSD's ability to act as a virtualization host for *any* full OS virtualization solution). So, although I really want the data integrity features and predictable performance of ZFS, I can't switch to an operating system that supports ZFS at a production-quality level (ZFS-FUSE is not production quality).

That leaves btrfs. I know it doesn't support RAID5; maybe I can use its RAID1+0 instead. But btrfs is still experimental! I would be somewhat timid about putting it on my server now. I consider myself a fairly agile sysadmin; I'm not the sort who runs RHEL and only uses the filesystem RHEL recommends by default. But I am also not going to take chances with my data. If the btrfs authors themselves don't claim their filesystem is stable, how can I deploy it on my server?
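If I did take the plunge, the RAID1+0 idea would look roughly like this (a sketch assuming a btrfs-progs recent enough to offer the raid10 profile; device names and mount point are hypothetical):

```shell
# btrfs managing all four disks itself, with both data (-d) and
# metadata (-m) striped and mirrored -- no MD layer underneath,
# and no parity computation on writes.
mkfs.btrfs -d raid10 -m raid10 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

# Any member device can be named at mount time; btrfs finds the rest.
mount /dev/sda1 /srv/guests
```

Usable capacity drops from 4.5TB (RAID5) to 3TB, but the write path no longer funnels through a parity thread, and I get btrfs's checksumming into the bargain.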

My opinion on this seems to change every week. One week, I get a swell of confidence in OpenSolaris, and say my plan is to migrate to OSOL as my host, use VirtualBox or Xen to virtualize Windows, and use Zones to replace Linux-VServer. The next week, I say FreeBSD is the best, because it supports ZFS just as well as OSOL does, and at least the people maintaining it are still around. Then again, I come back and ask: what if I ever need to deploy binary-only Linux software, or open source software that depends on something unique to the Linux kernel? So then I think maybe I should stay with Linux; but I really, really dislike Multiple Devices, and btrfs isn't ready.

So, I have been in a holding pattern for about six months on this. I know that I really need to get away from Linux MD; that I can (and should) move to RAID 1+0; and that my safest bet is probably to wait for btrfs to mature, since that will require the least reconfiguration and learning of new things. And I can continue to reap the benefits of the Linux kernel's performance, as demonstrated in benchmark after benchmark comparing Linux with BSD and Solaris. I also don't want to give up Linux-VServer, since it has served me very well to date, and KVM really speeds up network performance in the Windows guest thanks to the virtio-net drivers.

I'm still very undecided, and I'd love to know what others think about my predicament. My inaction will probably end in one of two ways: the RAID array degrades and I suffer a catastrophic data loss, or I keep dealing with repeated instances (every few months) of the server appearing to "go down" whenever some user decides to copy a multi-gigabyte file. What I want to do is jump ship just before the ship I'm on sinks, and land feet first on a much sturdier one. Timing this right is critical: if I migrate too soon, I might end up with an unsupported OS (if Oracle pulls the plug on OpenSolaris), or a buggy filesystem that's either too slow or loses data (if I use btrfs before the kinks are worked out). And if I go with FreeBSD, I'm really venturing into the unknown, because I have no idea how I would support containers there, or how I would virtualize Windows there. Not to mention that my only substantial sysadmin experience so far has been on OpenSolaris, Fedora and Ubuntu; I've barely touched any BSD system.