*Safer, as in journaled or CoW (or something similar), so you can get the data back when errors knock the fs out of whack.
Trust me, the one thing reiser4 is really good at is compressing zeroes. :)
In fact, for a random-write workload, CoW is pretty much the ideal file system layout, because it turns random writes into sequential ones.
" real-world benchmarks " == oxymoron
I, and I think anyone else, will agree that using zeros is not 'real-world', of course, but it is nevertheless a baseline, which I think is what the author/tester was aiming for (despite the tests being run on an unstable kernel, unstable btrfs, and ext4, which IMO has dubious stability).
Maybe the compression test(s), at least, could be better. It would be more constructive, I think, to suggest how to achieve something closer to end-user (desktop & server) usage rather than waste bandwidth discussing effectively dead or old filesystems that have neither journal nor CoW safety nets.
I'm Chris Mason, one of the btrfs developers. Thanks for taking the time to benchmark these filesystems!
Someone forwarded me the iozone parameters used, and it looks like they have iozone doing 1K writes, which is less than the Linux page size (4k on x86/x86-64 systems).
One way that btrfs is different from most other filesystems is that we never change pages while data is being written to the disk. When the application is doing 1k writes, each page is modified 4 times.
If the kernel decides to write the page somewhere in the middle of those four writes, ext4 will just change the page while it is being written. This happens often as the kernel tries to find free pages by writing dirty pages.
Btrfs will wait for the write to complete, and then because btrfs does copy on write, it will allocate a new block for the new write and write to the new location. This means that we are slow because we're waiting for writes and we're slow because we fragment the file more.
On my test machine, switching from 1k writes to 4k writes increases btrfs write throughput from 72MB/s to 85MB/s.
Numbers from another tester, all btrfs:
iozone -r1 (1k writes) 20MB/s
iozone -r4 (4k writes) 64MB/s
iozone -r64 (64k writes) 84MB/s
In practice, most people doing streaming writes like this use much larger buffer sizes (1MB or more). They often also use O_DIRECT.
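The sub-page-write effect is easy to reproduce with dd. This is just a sketch: the temp file and sizes are illustrative, and the point is only that the same data can be written with sub-page or page-sized records.

```shell
# Sketch: write the same 4 MB with sub-page (1k) vs page-sized (4k) records.
# With 1k records, each 4k page is dirtied four times, so a writeback that
# lands mid-sequence forces btrfs to CoW the page again; with 4k records
# each page is dirtied exactly once.
f=$(mktemp)
dd if=/dev/zero of="$f" bs=1k count=4096 2>/dev/null   # each 4k page written in four pieces
size_1k=$(stat -c %s "$f")
dd if=/dev/zero of="$f" bs=4k count=1024 2>/dev/null   # each 4k page written once
size_4k=$(stat -c %s "$f")
rm -f "$f"
```

Both runs produce an identical 4 MB file; only the record size, and therefore how often each page is dirtied, differs. For streaming workloads, bs=1M (optionally with oflag=direct) is closer to what real applications do.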
[QUOTE=sektion31;108954]oh thanks for the clarification. i read that reiser4 and btrfs are more similar to each other than to ext3/4, so i assumed they have a similar design idea.
Just to clarify, the big thing that I took from reiserfs (actually reiserv3, which was the one I worked on) was the idea of key/item storage. The btrfs btree uses a very similar key structure to order the items in the btree.
This is different from ext*, which tends to have specialized block formats for different types of metadata. Btrfs just tosses things into a btree and lets the btree index them for searching.
I'm Ric, one of the users excited about the btrfs filesystem (as geeky as that is).
Thank you for taking the time for development!
I ran IOzone while setting up a new SAS2 RAID adapter (LSI 9211) and disks (Hitachi C10K300) on a dual-socket-940 Opteron (285s) system running openSUSE Linux with a 2.6.32 kernel. The initial purpose was to use md, so an md RAID 0 was created and then stress tested; IOzone was one of the tools used. An 8GB file is used to overcome the effects of the installed 4GB RAM.
/usr/lib/iozone/bin/iozone -L64 -S1024 -a -+u -i0 -i1 -s8G -r64 -M -f /SAS600RAID/iozoneTESTFILE -Rb /tmp/iozone_[openSUSE_2.6.32]_[md0-RAID0]_[btrfs]_[9211-8i].xls
(8,388,608 kB file)
The others, ext3, ext4, & JFS, fared about the same, except that READ was faster and, more importantly, faster than WRITE, as would be expected.
I was a bit short on time then, so I just now ran it with the same IOzone parameters but using the 9211's "Integrated RAID" RAID 0 on a different kernel.
File size set to 8388608 KB
Record Size 64 KB
Machine = Linux sm.linuXwindows.hom 2.6.31-desktop-1mnb #1 SMP Tue Dec 8
Excel chart generation enabled
Command line used: iozone -L64 -S1024 -a -+u -i0 -i1 -s8G -r64 -M -f /mnt/CLONE/SAS600/iozoneTESTFILE -Rb /tmp/iozone_[Mandriva_2.6.31]_[btrfs]_[9211-8i_RAID-0].xls
Output is in Kbytes/sec :
As you can see, same thing: WRITE is faster than READ even on the IR RAID.
Something weird is going on... Perhaps it is an IOzone & btrfs issue? If so, the IOzone tests are skewed (the wrong way :)). I'd blame it on this test, but none of the other filesystems had faster WRITEs than READs in the results.
I have not tried it on an Intel Nehalem platform yet, but I thought maybe you should know something odd was occurring that is not exhibited by the other filesystems.
I don't need an explanation or anything like that but would be good to know you got the post if you have the time. I do have the excel files if needed.
PS: This is not the first time I've found md to be faster than an HBA or RAID card's RAID. Distressing, but also very good for us Linux geeks. ...wish it (md) was cross-platform.
Btrfs verifies crcs after reading, and sometimes it needs a larger readahead window to perform as well as the other filesystems. You could confirm this by turning crcs off (mount -o nodatasum).
Linux uses a bdi (backing dev info) to collect readahead and a few other device statistics. Btrfs creates a virtual bdi so that it can easily manage multiple devices. Sometimes it doesn't pick the right read ahead values for faster raid devices.
In /sys/class/bdi you'll find directories named btrfs-N where N is a number (1,2,3) for each btrfs mount. So /sys/class/bdi/btrfs-1 is the first btrfs filesystem. /sys/class/bdi/btrfs-1/read_ahead_kb can be used to boost the size of the kernel's internal read ahead buffer. Triple whatever is in there and see if your performance changes.
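The tripling could be scripted along these lines. This is a sketch: the helper function name is mine, and the btrfs-1 path in the usage comment assumes a single btrfs mount; check /sys/class/bdi for the actual btrfs-N entries on your system.

```shell
# Sketch: triple the kernel's readahead window for a btrfs bdi.
# The argument is the bdi directory, e.g. /sys/class/bdi/btrfs-1
# (an assumption; your mount may be btrfs-2, btrfs-3, etc.).
triple_readahead() {
    local bdi="$1"
    local cur
    cur=$(cat "$bdi/read_ahead_kb")
    echo $((cur * 3)) > "$bdi/read_ahead_kb"
}
# Usage (as root): triple_readahead /sys/class/bdi/btrfs-1
```

With the common default of 4096 KB, this sets read_ahead_kb to 12288.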
If that doesn't do it, just let me know. Most of the filesystems scale pretty well on streaming reads and writes to a single file, so we should be pretty close on this system.
Thanks for the explanation and suggestion.
Before seeing it, I did try an older parallel SCSI card, an LSI MegaRAID 320-2x with some Fujitsu U320 disks in RAID 0. The card has 512MB of BBU cache; there's no way I know of to adjust that. [...unless you meant a kernel adjustment?]
The results were as strikingly different as before, but more so:
Then I found and used your suggestions: nodatasum, and changing the 4096 value to 12288 for the readahead [*SAS600 type btrfs (rw,noatime,nodatasum)]. That looks like it definitely improved the WRITE-faster-than-READ oddity for the 9211-8i SAS2 card. (It has no buffer/cache onboard, but the HDDs have 64MB, and the adapter is set so the disk cache is on.)
Command line used: iozone -L64 -S1024 -a -+u -i0 -i1 -s8G -r64 -M -f /mnt/CLONE/SAS600/iozoneTESTFILE -Rb /tmp/iozone_[Mandriva_2.6.31]_[btrfs]_[9211-8i_RAID-0]-[nodatasum_12288_readahead].xls
Still slower READ but not nearly as dramatic.
The MegaRAID mount was also changed [PAS320RAID0 type btrfs (rw,noatime,nodatasum)], but the results did not show improvement; WRITE is still testing as ~4x faster.
Command line used: iozone -L64 -S1024 -a -+u -i0 -i1 -s8G -r64 -M -f /mnt/PAS320RAID0/iozoneTESTFILE -Rb /tmp/iozone_[Mandriva_2.6.31]_[btrfs]_[PAS320_MEGARAID-0]-[nodatasum_12288_readahead].xls
The adapters are a lot different: the MegaRAID is a RAID card for U320 parallel SCSI with a large cache & BBU, while the 9211 is an HBA for the SAS2 (6Gbps) interface with no cache or BBU. I'd like to say it is the adapters and SAS-vs-SCSI, but the ext4 results indicate otherwise.
Last week's test,
iozone -L64 -S1024 -a -+u -i0 -i1 -s8G -r64 -M -f /mnt/linux/PAS320RAID0/iozoneTESTFILE -Rb /tmp/iozone_[openSUSE_2.6.32]_[320-2x_RAID0]_[ext4]-2.xls
It is a bit slower for READ too... but no drama.
Like most everybody else, I won't be using PAS disks much longer, so I put those numbers up there just as information for you, in case they're needed.
On the up side, man, look at those numbers. The btrfs just walloped ext4 for this test! :)
That 490,296 kBps is the fastest I've ever seen here for a WRITE. By all means, please keep up the good work!
I'll look at the buffering, but with the 9211 HBA there's not much to do for it. Perhaps the buffering with the disks' cache got turned off between Linux and the MS OS somehow. It shouldn't have, as it is an adapter setting, but the LSI2008/LSI2108 kernel module (mpt2sas) is relatively new. ...it'll take a while to get the software running to find out.
On raid cards with writeback cache (and sometimes even single drives with writeback cache), the cache may allow the card to process writes faster than it can read. This is because the cache gives the drive the chance to stage the IO and perfectly order it, while reads must be done more or less immediately. Good cards have good readahead logic, but this doesn't always work out.
So, now that we have the kernel readahead tuned (btw, you can try larger numbers in the bdi read_ahead_kb field), the next step is to make sure the kernel is using the largest possible requests on the card.
cd /sys/block/xxxx/queue, where xxxx is the device for your drive. You want the physical device, and if you're using MD you want to do this to each drive in the MD raid set (example: cd /sys/block/sda/queue).
echo deadline > scheduler
echo 2048 > nr_requests
cat max_hw_sectors_kb > max_sectors_kb
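Put together, the three steps above could be wrapped in a small helper and run against each member drive. A sketch; the function name is mine and the sda/sdb device names in the usage comment are placeholders.

```shell
# Sketch: apply the queue tuning above to one drive's queue directory.
# Pass the queue path, e.g. /sys/block/sda/queue (device names are placeholders).
tune_queue() {
    local q="$1"
    echo deadline > "$q/scheduler"                      # switch the IO scheduler
    echo 2048 > "$q/nr_requests"                        # allow more queued requests
    cat "$q/max_hw_sectors_kb" > "$q/max_sectors_kb"    # use the largest request size the card supports
}
# For an MD RAID set, run it on every member drive (as root):
# for dev in sda sdb; do tune_queue "/sys/block/$dev/queue"; done
```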
Switching to deadline may or may not make a difference; the other two are very likely to help.