
Thread: EXT4 Data Corruption Bug Hits Stable Linux Kernels

  1. #21
    Join Date
    Apr 2010
    Posts
    100

    Default

    Here's a response with some more comments from one of the affected users for this EXT4 data corruption bug in the stable kernel.
    That's the response from the user who reported it; as far as I can tell from that thread, it's the only user.

    That's where things are at right now in the mailing list thread for this serious EXT4 data corruption issue that reached the stable Linux kernel.
    It happens if you mount/umount really fast at least twice in a row, when in both cases the journal is not empty and no fsck has run before the second mount. I really doubt many normal users will ever encounter this.

  2. #22
    Join Date
    Oct 2012
    Posts
    8

    Default Not quite right.

    This article is more than slightly inaccurate, I'm afraid. (I'm the poor bloody reporter originally afflicted by this, though it seems it has hit someone else's USB key. I'd rather have a USB affected than my home directory, I must say, but one does not choose such things.)

    - Ted did not 'bisect' the kernel to find the bug; he looked at the ext4-affecting patches in the affected range to find those that seemed suspicious. Indeed, presently his fix is based on an educated guess, and is not known to work (or not to work).

    - The new patch is at least in part a *debugging* patch, and may or may not work: I have to get up the gumption to reboot into it and risk my home directory again before it's possible to say anything one way or the other.

    There is definitely a bug; Ted may have fixed it; all further is unknown at present.

    (The vicious attacks on various people for allowing this bug through are wholly unjustified. If this bug really does affect you only after multiple mounts, it's quite hard to spot in most regression test mechanisms, particularly given that the corruption only affects lightly-loaded, not heavily-loaded nor idle filesystems, and only seemingly a subset of those. It's a tricky bug. Bugs happen. No blame attaches.)

  3. #23
    Join Date
    Oct 2012
    Posts
    8

    Default

    Quote Originally Posted by Xilanaz View Post
    That's the response from the user who reported it; as far as I can tell from that thread, it's the only user.
    There is at least one other person who's seen it, on a USB key rather than his home directory. 3.6.3 has only been out for a couple of days: there hasn't been time for many people to have run into it!
    It happens if you mount/umount really fast at least twice in a row, when in both cases the journal is not empty and no fsck has run before the second mount. I really doubt many normal users will ever encounter this.
    You just need two dirty shutdowns and at least one lightly-loaded filesystem. Since the fs is marked clean when this bug hits, the filesystem will not normally be fscked...
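    Since a clean-marked filesystem skips the boot-time check, the only way to catch this kind of damage is to force a full fsck yourself. A minimal sketch, assuming e2fsprogs is installed, using a throwaway image file so no real disk is touched:

```shell
#!/bin/sh
# Because the corruption reportedly leaves the filesystem marked "clean",
# the boot-time fsck skips it; a full check has to be forced by hand.
# This uses a scratch image file, so no real device is modified.
IMG=$(mktemp /tmp/ext4-scratch-XXXXXX.img)

dd if=/dev/zero of="$IMG" bs=1M count=16 2>/dev/null
mkfs.ext4 -q -F "$IMG"

# Superblock state: a filesystem hit by this bug would still say "clean".
STATE=$(dumpe2fs -h "$IMG" 2>/dev/null | grep "Filesystem state")
echo "$STATE"

# -f forces a full check despite the clean flag; -n answers "no" to all
# repair prompts, so the check is read-only.
fsck.ext4 -f -n "$IMG"

rm -f "$IMG"
```

    On a real system you would point dumpe2fs and fsck.ext4 at the unmounted device (e.g. the affected USB stick) instead of an image file.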

  4. #24
    Join Date
    Jul 2008
    Posts
    1,718

    Default

    Emm... 3.4 is NOT nearing end-of-life. It is a long-term-support kernel, so it will hang around for a while.

  5. #25
    Join Date
    Jul 2012
    Location
    SuperUserLand
    Posts
    535

    Default

    "This article is more than slightly inaccurate,"

    welcome to phoronix


    uname -r
    3.6.1-1.fc18.x86_64
    [dixnasty@localhost ~]$



    suck it bitches
    Last edited by Pallidus; 10-24-2012 at 01:56 PM.

  6. #26
    Join Date
    Feb 2009
    Posts
    22

    Default

    It looks like my original analysis may not have been correct. At least, Eric and I haven't been able to figure out a way to trigger the problem based on my hypothesis of what had been going wrong. Still, the commit in question *does* change things, and so it's still the most likely culprit. (There were no ext4-related changes between v3.6..v3.6.1 and v3.6.1..v3.6.2, and I've looked at all of the changes between v3.6.2 and v3.6.3; all of the other changes look innocuous.)

    I have a patch (sent around 1:23 am Eastern on Wed., Oct. 24th to the ext4 list on the relevant mail thread) which should revert the problematic change in behavior, as well as put in a check which looks for the original conditions that might have triggered the problem, and prints a warning plus a stack trace so we can really understand what is going on. I don't want to consider this fixed until we have a reproduction case, so we can state with 100% certainty that we understand how it was triggered, and so we know that the proposed patch really does fix things.

    That being said, please note that Fedora 17 is apparently on 3.6.2, and so far we only have two users who have reported the problem (or more specifically, both have reproduced file system corruptions with very similar symptoms, one running v3.6.2 and one running v3.6.3). The fact that they have reported the problem on very different hardware (one using a USB stick, the other using a Software RAID-5 setup) means it's not likely a hardware-induced problem. However, this could potentially just be bad luck, since the fs corruption that was reported could have been explained by a random hardware glitch. With two users reporting it, though, we have to treat it as potentially a real bug, and so I've gone back and re-audited all of the ext4-related commits that went into the v3.6.x stable kernel series.

    If you think you have a related, similar bug, please check which kernel version you are using, and get the EXT4 error messages from the syslogs, and report it to me and the ext4 list. And if you can reproduce it reliably, I definitely want to hear from you. :-)
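    For anyone collecting the information Ted asks for, the relevant lines can be filtered out of the kernel log with a simple grep. A minimal sketch; the sample log lines below are invented for illustration (on a live machine you would feed `dmesg` or your syslog through the same filter):

```shell
#!/bin/sh
# Filter EXT4 error/warning lines out of kernel-log text. The sample
# input is made up for illustration; on a real system you would run:
#   dmesg | grep -E 'EXT4-fs (error|warning)'
sample_log() {
cat <<'EOF'
[  120.001] usb 1-1: new high-speed USB device
[  130.002] EXT4-fs (sdb1): mounted filesystem with ordered data mode
[  131.003] EXT4-fs error (device sdb1): ext4_mb_generate_buddy:741: group 2, 15 clusters in bitmap, 20 in gd
[  132.004] EXT4-fs warning (device sdb1): ext4_end_bio:250: I/O error writing to inode 12
EOF
}

# Keep only the lines a bug report would actually need.
ERRORS=$(sample_log | grep -E 'EXT4-fs (error|warning)')
echo "$ERRORS"
```

    Including your exact kernel version (`uname -r`) alongside these lines, as Ted requests, makes the report much more useful.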

    Thanks!!

    -- Ted

  7. #27
    Join Date
    Oct 2012
    Posts
    8

    Default

    Quote Originally Posted by tytso View Post
    one using a USB stick, the other using a Software RAID-5 setup),
    You misspoke I think: it's hardware RAID-5 (an Areca ARC-1210 caching RAID controller). This is a good thing from my perspective, since it removes md from consideration.

    Rebooting to test your warning patch (and Trond's unrelated NFS lockd crash patch which caused the frequent reboots in the first place) shortly.

    However, this could potentially just be bad luck, since the fs corruption that was reported could have been explained by a random hardware glitch.
    Not likely. It happens with 3.6.3 and never with 3.6.1, on three-year-old hardware that has never experienced fs-related problems before. Hardware bugs are rarely so specific, unless the changes are themselves tickling a hardware bug!

  8. #28
    Join Date
    Aug 2012
    Posts
    292

    Default

    Haha, I'm still using 3.6.0 RC6!!!!

    LOL beta/alpha kernel is more stable than the stable kernel!

  9. #29
    Join Date
    Jul 2012
    Location
    SuperUserLand
    Posts
    535

    Default

    Quote Originally Posted by necro-lover View Post
    Haha, I'm still using 3.6.0 RC6!!!!

    LOL beta/alpha kernel is more stable than the stable kernel!

    The LOL is on you, because you could be running stable 3.6.1 and be unaffected as well.


    PROTIP: wait for kernels to mature, even the stable ones, for at least 15 days before upgrading to them. PROTIP


    "Still, the commit in question *does* change things, and so it's still the most likely culprit."


    name and shame plox

  10. #30
    Join Date
    Feb 2011
    Posts
    48

    Default

    Awww, on stable? That's evil. The last time I lost data on ext4 was on rc-kernels, at least...
    Glad I'm on btrfs now for all my filesystems, although that'll probably blow up any second now :-P
