"XFS, NTFS, ext3, ReiserFS and JFS have. . . failure policies that are often inconsistent, sometimes buggy, and generally inadequate in their ability to recover from partial disk failures."
I would not want to put my data on a fast but unreliable storage solution. Data safety is what matters most to me, not speed.
There are plenty of cases where speed might be more important than data integrity. Look at Google, for instance. They might not care if a rare (and it really must be rare, of course) error occurs as long as they can get superior speed out of the filesystem. After all, they presumably have data integrity anyway across many different disks in their network, and they're constantly rewriting the data, which would flush out any transient errors. A problem with speed, on the other hand, directly affects the user experience and requires Google to spend more money on hardware to make up for it.
Not that I'm saying that's typical, just that sweeping statements like yours can be just as wrong. No single scenario covers 100% of people.
Why would you want to run a fast filesystem? I don't get it.
The most important thing for a filesystem is data integrity.
For most people, yes. But some big setups find performance more important than the reliability of a single disk, achieving integrity through massive redundancy of the data.
That's not to say that an outright-flaky filesystem is acceptable. But as long as you know when failures occur and they don't occur often, it may be cheaper to tolerate those failures than to use a different filesystem that never loses data but is significantly slower.
So why is nobody using Reiser4? It is fast AND cares about data.
And why is no other FS able to do the same? Why the same shit with extX and Btrfs? A filesystem that might lose data WHICH WAS ALREADY ON THE PLATTER is braindead. And no matter how often the devs write "it was never guaranteed"... FUCK YOU.
Just look at the Btrfs FAQ. A simple rename can result in two empty files. BULLSHIT.
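For what it's worth, the crash-safe way to replace a file's contents is to write a temporary file, force it to stable storage, and only then rename it over the original, so a crash leaves either the old contents or the new, never an empty file. A minimal sketch (file names are made up for illustration, and it assumes a coreutils `sync` new enough to accept a file argument, i.e. 8.24 or later):

```shell
# Write the new contents to a temporary file first.
printf 'new contents\n' > target.tmp

# Force the temporary file's data down to the platter before renaming.
sync target.tmp

# Atomic rename: "target" now refers to data that is actually on disk.
mv target.tmp target
```

In a real program you would call fsync(2) on the file descriptor (and ideally on the containing directory) before rename(2); the shell version above is just the same pattern spelled with coreutils.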
Why don't you look at the graph in the linked blog post? There isn't much difference between ext4 with the journal and ext4 without it.
Yes, XFS is still faster in the tested scenario. The point is that ext4 got a huge speed boost on systems with many cores/threads and now gets close to XFS, which had previously been all alone in that segment.
If people are fixated on "win-loss" graphs, see the "random write" and "mail server" workloads, where ext4 actually does better than XFS. But really, I'm not seeing this primarily as an ext4 vs. XFS thing. If I had, I would have pointed at those graphs instead, and done the fanboy "Nyah, nyah" thing.
We benchmark ourselves against XFS as a mark of respect. XFS has been optimized for HPC workloads where they are often writing large files to large RAID arrays from big systems. (For SGI, 48 cores is a small system.) So the "large file create" workload, on the given hardware configuration, is basically on XFS's home ground, and as that graph shows, we still have more work to do.
Some people like to treat file system benchmarks as a competition, and want to score wins and losses. That's not the way I look at it. I hack file systems because I'm passionate about working on that technology. I'm more excited about how I can make ext4 better, and not whether I can "beat down" some other file system. That's not what it's all about.
What's the point of a journaling filesystem like ext4 without active journaling? And I think the difference with many threads is still huge.
There are two reasons why I asked Eric to benchmark ext4 in no journal mode.
First of all, you might be using ext4 as the object store for a cluster filesystem, where you have hundreds of servers, with perhaps thousands of disks, and where each file is composed of "shards" replicated on multiple servers for redundancy in case a server dies (maybe the hard drive craps out, or a power supply explodes, etc.; when you have that many servers, the probability of some machine failing approaches 100%). In that scenario, the journal is overhead that's not worth the cost.
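For anyone who wants to try no-journal mode, something along these lines should do it (a sketch: `/dev/sdX` is a placeholder for a scratch device you can afford to wipe, and the filesystem must be unmounted):

```shell
# Create a fresh ext4 filesystem without a journal...
mkfs.ext4 -O ^has_journal /dev/sdX

# ...or strip the journal from an existing, unmounted ext4 filesystem.
tune2fs -O ^has_journal /dev/sdX

# Confirm that "has_journal" no longer appears in the feature list.
dumpe2fs -h /dev/sdX | grep -i features
```

Obviously this trades away crash consistency for the single disk, which is exactly the bet described above: the cluster layer, not the local filesystem, is what keeps the data safe.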
(Note, by the way, that the "large file creates" workload is not a metadata-heavy workload, so the effect of the journal is not that pronounced. The "mail server" workload has many more metadata changes per transaction, and we see a much more pronounced difference between the journal and no-journal modes with 1 thread. The fact that the difference falls off at 48 and 192 threads is because we still have scalability bottlenecks in the journal code that I need to fix up.)
The second reason why I asked Eric to benchmark ext4 in journal and no journal mode is that it helps me to see where potential bottlenecks are in ext4, so I know what is most profitable to tackle next. The main thrust of my LCA talk is actually about how to decide what optimizations to do next to improve a kernel system's scalability. If you look at the three benchmark reports which Eric produced for the work that I did during 2.6.34, 2.6.35, and 2.6.36, the lockstat report showed me where the ext4 code was hitting bottlenecks, and that told me what I should do next in order to make ext4 more scalable.
It would be awfully nice if the Phoronix benchmarks actually gathered information using perf and lockstat, since that is what we kernel developers need in order to see how to improve the benchmarks. The graphs are pretty, but they are not what we need to understand how we can improve the filesystem. The graphs are what ESPN shows in 15-second clips of receivers catching touchdown passes; that may drive advertising revenue, but it doesn't help the football players improve their game. For that we need to study the game films, in slow motion, from multiple angles, and we need to study a lot more than just the receiver catching the touchdown pass. How the offensive and defensive linemen react to the play, etc., is far more important.
It is not about a disk crashing or something similar. It is about getting back the same data you put on the disk. Imagine you put "1234567890" on disk, but a corruption occurred, so you got back "2234567890", and the hardware did not even notice the data got corrupted. This is called silent corruption, and it occurs all the time.
Now, imagine you have a fast filesystem, but there is silent corruption now and then. You can NOT trust the data you get back. As I have shown in the links, this happens to XFS, ReiserFS, JFS, ext3, etc. It even happens to hardware RAID, all the time.
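Catching that kind of flip is exactly what end-to-end checksums are for: the drive doesn't notice the corrupted byte, but a digest of the data does. A trivial demonstration using `sha256sum` from GNU coreutils, with the example strings from above:

```shell
# One corrupted byte ("1" -> "2") produces a completely different digest.
printf '1234567890' | sha256sum
printf '2234567890' | sha256sum
```

This is the basic idea behind filesystems that checksum every block: on read, the stored checksum is recomputed and compared, so corruption is detected instead of silently handed back to the application.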
CERN did a test: their 3,000 Linux storage servers showed hundreds of instances of silent corruption. (CERN wrote a known bit pattern to the disks, compared the result, and found differences.) CERN cannot trust the data on disk. CERN is therefore now migrating to ZFS (which is actually the only modern solution designed from scratch to protect against silent corruption).
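A toy version of that methodology, against a file instead of a raw device (a sketch; the file name and 4 MiB size are arbitrary):

```shell
# Write a known pattern: 4 MiB of the byte 0xAA (octal 252).
dd if=/dev/zero bs=1M count=4 2>/dev/null | tr '\0' '\252' > pattern.bin

# Record the expected digest of the pattern.
sha256sum pattern.bin > pattern.sha

# ... later, after the data has sat on disk, read it back and compare.
sha256sum -c pattern.sha   # prints "pattern.bin: OK" while the data is intact
```

CERN did essentially this at scale against raw disks: any run where the read-back pattern no longer matched the one written was a silent corruption, since the hardware had reported no error.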
I don't get it: who wants a fast filesystem that gives you back false data?