It Turns Out The Btrfs RAID 5/6 Issue Isn't Completely Fixed

Written by Michael Larabel in Linux Storage on 19 November 2016 at 08:20 AM EST.
Earlier this week we reported that the Btrfs RAID5/RAID6 code had been fixed, or at least it appeared so. However, the Btrfs developers have now clarified that the situation isn't entirely resolved.

This all stems from the problem discovered months ago, when the Btrfs RAID 5/6 code was found to be unsafe. Btrfs contributor Zygo Blaxell wrote to clarify the situation:
with headlines like "btrfs RAID5/RAID6 support is finally fixed" when that's very much not the case. Only one bug has been removed for the key use case that makes RAID5 interesting, and it's just the first of many that still remain in the path of a user trying to recover from a normal disk failure.

Admittedly this is Michael's (Phoronix's) problem more than Qu's, but it's important to always be clear and _complete_ when stating bug status because people quote statements out of context. When the article quoted the text

"it's not a timed bomb buried deeply into the RAID5/6 code, but a race condition in scrub recovery code"

the commenters on Phoronix are clearly interpreting this to mean "famous RAID5/6 scrub error" had been fixed *and* the issue reported by Goffredo was the time bomb issue. It's more accurate to say something like

"Goffredo's issue is not the time bomb buried deeply in the RAID5/6 code, but a separate issue caused by a race condition in scrub recovery code"

Reading the Phoronix article, one might imagine RAID5 is now working as well as RAID1 on btrfs. To be clear, it's not--although the gap is now significantly narrower.
So pardon the confusion: the RAID5/6 code has improved, but not all is dandy.

He also commented, "There are multiple bugs in the stress + remove device case. Some are quite easy to isolate. They range in difficulty from simple BUG_ON instead of error returns to finally solving the RMW update problem...To be able to use a RAID5 in production it must be possible to recover from one normal disk failure without being stopped by *any* bug in most cases. Until that happens, users should be aware that recovery doesn't work yet."

So for now it's probably best not to use Btrfs' native RAID 5/6 code in production.