Btrfs RAID: Built-In/Native RAID vs. Mdadm
Phoronix: Btrfs RAID: Built-In/Native RAID vs. Mdadm
Last month on Phoronix I posted some dual-HDD Btrfs RAID benchmarks and that was followed by Btrfs RAID 0/1/5/6/10 testing on four Intel solid-state drives. In still testing the four Intel Series 530 SSDs in a RAID array, the new benchmarks today are a comparison of the performance when using Btrfs' built-in RAID capabilities versus setting up a Linux 3.18 software RAID with Btrfs on the same hardware/software using mdadm.
These benchmarks are very interesting, but the inevitable "why?" question arises. What causes read performance under mdadm to be better than native? Why isn't mdadm write performance as good as native?
A good journalist should reach out to the experts to explain behaviors (FS developers etc). This provides more detailed/interesting information in your articles and improves the credibility of the article.
Often this information is buried on page 4 of the article's comments if at all. It would be valuable to have this information in the article from a cited source(s) for your readers.
I don't have the time or resources to reach out and do all of that: I'm the only one running all of the tests and writing them up, and it's tough enough as it is to make a living. This has long been explained. But as is often the case, now that all of the data is out there, anyone is free to bring it to the appropriate mailing lists or contacts. Anyone can also reproduce the tests themselves using OpenBenchmarking.org / Phoronix Test Suite.
Originally Posted by mufasa72
Basically this is a good test showing where the btrfs implementation still needs optimization. Most notably the raid 1 read speed.
This was the test I was looking to see. I hope that the btrfs folks close the outstanding gaps going forward, but it looks very promising. Now, this has been done on SSDs; does it hold true on traditional spindles? (That would be interesting to see too.)
What may be interesting to see going forward is btrfs raid lined up against ext4+mdadm, xfs+mdadm (that's what I use), etc. A fairer comparison, though, may actually be btrfs raid vs. ext4+LVM+mdadm vs. xfs+LVM+mdadm, etc., through the different raid levels. I don't think raid 0 is particularly relevant, though, IMO. (I personally use the latter, so I would be interested to see how it stacks up.) If I understand correctly, btrfs actually contains LVM-type functionality.
For those looking to try out btrfs themselves, btrfs-native raid also has a feature benefit: transparent correction of bad data. When blocks are read in, checksums are verified. If there are any errors, Btrfs tries to read from an alternate copy and will repair the broken copy if the alternate copy verifies. mdadm doesn't provide that functionality.
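To make the read path described above a bit more concrete, here is a minimal Python sketch of a checksummed two-copy mirror. It is only an illustration, not how btrfs is actually implemented: the `Mirror` class is hypothetical, the "devices" are plain Python lists, and `zlib.crc32` stands in for whatever checksum the filesystem really uses.

```python
import zlib


class Mirror:
    """Toy model: two copies of every block plus one stored checksum per block."""

    def __init__(self, blocks):
        # Two independent copies, standing in for two devices (an assumption).
        self.copies = [list(blocks), list(blocks)]
        # Checksums are recorded once, when the data is written.
        self.checksums = [zlib.crc32(b) for b in blocks]

    def read(self, index):
        """Return verified data for a block, repairing any bad copy on the way."""
        expected = self.checksums[index]
        for copy in self.copies:
            data = copy[index]
            if zlib.crc32(data) == expected:
                # Found a good copy: overwrite every copy that fails verification.
                for other in self.copies:
                    if zlib.crc32(other[index]) != expected:
                        other[index] = data
                return data
        raise IOError(f"block {index}: no copy matches its checksum")


mirror = Mirror([b"alpha", b"bravo"])
mirror.copies[0][1] = b"brav0"      # simulate silent corruption on one device
print(mirror.read(1))               # served from the copy that still verifies
print(mirror.copies[0][1])          # the damaged copy has been rewritten
```

The point of the sketch is the stored checksum: without it, a mirror that sees two different copies has no way to decide which one to trust.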
I am usually very quick to say not to use btrfs, because my only good experience was on a desktop; in all other cases btrfs corrupted itself in such a way that we just had to delete 1...4TB of data and start all over.
But once btrfs is really trustworthy I would run it instead of mdadm.
Why? With raid1 mdadm is incapable of telling which disk is correct, short of looking at the event counters and the bitmap.
Btrfs *knows* which one is correct, because the file data is checksummed.
Another note is that btrfs raid1 on 4 disks keeps each piece of data on 2 disks, any 2 disks (placement is per chunk rather than per whole disk).
The disks are not in sync at block level, the disks are in sync at data level. And each disk has data checksums on their copy of the file.
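As a rough illustration of "two copies on any two of the disks", the sketch below places raid1 chunks across four devices. It assumes a simplified allocator that always picks the two devices with the most free space; that matches my understanding of the general idea, but the function name, the device names, and the sizes are all made up for the example, and the real allocator has more rules than this.

```python
def allocate_raid1_chunks(free_space, chunk_size, num_chunks):
    """Place each chunk's two copies on the two devices with the most free space.

    free_space: dict of device name -> free units (hypothetical numbers).
    Returns a list of (device_a, device_b) pairs, one per chunk.
    """
    free = dict(free_space)  # work on a copy so the caller's dict is untouched
    placement = []
    for _ in range(num_chunks):
        # Most free space first; ties broken by name so the result is deterministic.
        candidates = sorted(free, key=lambda dev: (-free[dev], dev))
        first, second = candidates[0], candidates[1]
        if free[second] < chunk_size:
            raise RuntimeError("need two devices with room for a chunk")
        free[first] -= chunk_size
        free[second] -= chunk_size
        placement.append((first, second))
    return placement


layout = allocate_raid1_chunks({"sda": 3, "sdb": 2, "sdc": 2, "sdd": 1}, 1, 4)
for number, pair in enumerate(layout):
    print(f"chunk {number}: copies on {pair[0]} and {pair[1]}")
```

Different chunks end up on different pairs of devices, which is why no two disks in such an array are block-for-block mirrors of each other.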
mdraid can notice during scrubbing that the copies differ, but it cannot tell which copy is the bad one.
btrfs can detect file checksum errors, and repair to use the good copy.
btrfs can have a mix of raid profiles (raid6, 5, 1) for data and/or metadata on the same disks.
So with btrfs you know your data is correct.
But my reluctance to use btrfs has to do with btrfs corrupting its own metadata, and oopsing (null pointer dereference) on corrupt metadata. And that is with "vanilla" btrfs: metadata dup, and plain data.
No way to just cut-off the corrupted metadata and continue with whatever is available.
No way to fsck your filesystem. I had a btrfs.fsck run for over 4 months on 250GB of metadata with 12GB of RAM and 400GB of dedicated raid1 swap in the machine just to check if it would succeed, and it finally bailed out on out-of-memory.
It would have been interesting to see a benchmark on normal disks instead of SSDs, as RAID was created for them more than for flash memory (think about the lack of trim on raid setups, for example).
I used to have the same problem, where btrfs would accidentally corrupt its superblock - I believe the last time it occurred was about a year ago.
Originally Posted by Ardje
The workaround is to only use it for RAID arrays (incl. RAID0), since it stores a copy of its superblock on both drives. That way, if it refused to mount, you could always mount it via the other drive and it would fix itself.
It's also worth noting there are several ways to fix btrfs errors, and fsck.btrfs is not one of them as it is often symlinked to /bin/true.
The following are techniques for fixing/recovering from errors:
- use btrfsck/'btrfs check' --repair. NOTE: This makes no changes without the --repair flag.
- mount with '-o recovery'. (Good for open_ctree errors.)
- mount with '-o ro' (Useful when mounting results in the kernel attempting to do some cleanup requiring non-existent free space)
- mount with '-o skip_balance' (Prevents resumption of balance operation)
- use 'btrfs restore' to restore files/snapshots directly from the drive. This is a read-only operation, and does not require mounting.
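The techniques listed above can be lined up as a plan of commands to try. The Python sketch below only builds and prints the command lines; it does not run anything. The ordering (read-only steps first, `--repair` last) is my own assumption rather than something stated above, and the device, mount point, and rescue directory are placeholders.

```python
import shlex


def recovery_plan(device, mountpoint, rescue_dir):
    """Return (description, argv) pairs for the recovery options listed above.

    Ordering is an assumption: steps that do not write to the device come first.
    """
    return [
        ("inspect without making changes", ["btrfs", "check", device]),
        ("mount read-only", ["mount", "-o", "ro", device, mountpoint]),
        ("mount with the recovery option", ["mount", "-o", "recovery", device, mountpoint]),
        ("mount without resuming a balance", ["mount", "-o", "skip_balance", device, mountpoint]),
        ("copy files out without mounting", ["btrfs", "restore", device, rescue_dir]),
        ("attempt a repair (writes to the device)", ["btrfs", "check", "--repair", device]),
    ]


# Placeholder paths for illustration only.
plan = recovery_plan("/dev/sdX", "/mnt/rescue", "/srv/rescued-files")
for description, argv in plan:
    print(f"{description}: {shlex.join(argv)}")
```

Keeping the commands as argument lists means they could later be handed to `subprocess.run` one at a time, after a human has looked at the output of the previous step.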
As an avid btrfs user (it's on pretty much every disk I own, so when something goes wrong I dump it and submit the bugs upstream), I think nobody can practically use btrfs until you can say "if something goes wrong, the fsck will fix it, and you don't need to do anything" or "nothing ever goes wrong".
How often is ext4 metadata corrupted anymore? The journal? Rarely, if ever. And in theory btrfs should be even better than ext4 at these things, because its COW nature means that if something fails it should just hold references to the original instance, and it should have redundant copies because it's supposed to be next gen.
I have a feeling that with the Facebook work going into it, 2015 might be the year of the btrfs desktop. I imagine by 3.21, and after the millions of openSUSE users run 13.2 through the grinder, it should come out pretty stable by year's end.