
Thread: Btrfs Battles EXT4 With The Linux 2.6.33 Kernel

  1. #11
    Join Date
    May 2007
    Location
    Third Rock from the Sun
    Posts
    6,582

    Default

    Quote Originally Posted by intgr View Post
    Ironically enough, compressed reiser4 would blow everything else out of the water in these benchmarks.
    Well, I don't know about that. I've been doing a Java port of my company's IDE and language, working with a 4.9 GB database (real customer shipping data), and here, at least, whether on an SSD or a regular HD, btrfs seems to be edging out r4.

  2. #12
    Join Date
    Jan 2010
    Location
    NE AR
    Posts
    22

    Default

    Quote Originally Posted by intgr View Post
    [...] compressed reiser4 would blow everything else out of the water in these benchmarks.
    None of these safer* fs "blow everything else out of the water" ... ext2 would probably come the closest, but who wants to race boats without at least a lifejacket? When one gets tossed into the water, there should at least be a chance of survival.


    *safer as in journaled or CoW (or something) to get the data back when errors knock the fs out of whack.

  3. #13
    Join Date
    Dec 2009
    Posts
    18

    Default

    Quote Originally Posted by deanjo View Post
    Well, I don't know about that. I've been doing a Java port of my company's IDE and language, working with a 4.9 GB database (real customer shipping data), and here, at least, whether on an SSD or a regular HD, btrfs seems to be edging out r4.
    I'm not talking about real-world benchmarks; I'm talking about the synthetic benchmarks Phoronix used in this article, which only write a bunch of zeroes to the disk. They just aren't adequate for benchmarking compressed file systems (neither reiser4 nor btrfs).

    Trust me, the one thing reiser4 is really good at is compressing zeroes.
    Quote Originally Posted by fhj52 View Post
    None of these safer* fs "blow everything else out of the water" ... ext2 would prbly come the closest
    (Even though you completely missed my point.) You would think so, but most of the time it's not actually true. Modern journaling file systems are much better tuned than old unsafe file systems (ext2, UFS).

    In fact, for a random-write workload, CoW is pretty much the ideal file system layout, because it turns random writes into sequential ones.
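
    To see this concretely (a sketch, not from the thread: iozone's -i2 test exercises random read/write, and -i0 creates the test file first; the path is a placeholder):

        # sequential write pass to create the file, then random read/write
        iozone -i0 -i2 -s1G -r4 -f /mnt/test/iozonefile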

  4. #14
    Join Date
    Jan 2010
    Location
    NE AR
    Posts
    22

    Default

    " real-world benchmarks " == oxymoron



    I, and I think anyone else, will agree that using zeros is not 'real-world', of course, but it is nevertheless a baseline, which I think is what the author/tester was aiming for (despite the tests being run on an unstable kernel, unstable btrfs, and an ext4 with, IMO, dubious stability).

    Maybe the compression test(s), at least, could be better. It would be more constructive, I think, to suggest how to achieve something closer to end-user (desktop & server) usage than to waste bandwidth discussing effectively dead or old fs that have neither journal nor CoW safety nets.

    ...

  5. #15

    Default iozone write parameters

    I'm Chris Mason, one of the btrfs developers. Thanks for taking the time to benchmark these filesystems!

    Someone forwarded me the iozone parameters used, and it looks like they have iozone doing 1K writes, which is less than the Linux page size (4K on x86 and x86-64 systems).
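
    (As a quick check, a sketch not from the original post; getconf is standard on Linux:)

        getconf PAGESIZE    # typically prints 4096 on x86/x86-64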

    One way that btrfs is different from most other filesystems is that we never change pages while data is being written to the disk. When the application is doing 1k writes, each page is modified 4 times.

    If the kernel decides to write the page somewhere in the middle of those four writes, ext4 will just change the page while it is being written. This happens often as the kernel tries to find free pages by writing dirty pages.

    Btrfs will wait for the write to complete, and then because btrfs does copy on write, it will allocate a new block for the new write and write to the new location. This means that we are slow because we're waiting for writes and we're slow because we fragment the file more.

    On my test machine, switching from 1K writes to 4K writes increases btrfs write throughput from 72MB/s to 85MB/s.

    Numbers from another tester, all btrfs:

    iozone -r1 (1K writes): 20MB/s
    iozone -r4 (4K writes): 64MB/s
    iozone -r64 (64K writes): 84MB/s
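
    (A sketch to reproduce that comparison, assuming iozone is installed; the mount point is a placeholder:)

        # write test (-i0) at 1K, 4K, and 64K record sizes
        for r in 1 4 64; do iozone -i0 -s1G -r$r -f /mnt/btrfs/testfile; done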

    In practice, most people doing streaming writes like this use much larger buffer sizes (1MB or more). They often also use O_DIRECT.

    -chris

  6. #16

    Default

    Quote Originally Posted by sektion31 View Post
    oh thanks for clarification. i read that reiser4 and btrfs are more similar to each other than to ext3/4, so i assumed they have a similar design idea.

    Just to clarify, the big thing that I took from reiserfs (actually reiserv3, which was the one I worked on) was the idea of key/item storage. The btrfs btree uses a very similar key structure to order the items in the btree.

    This is different from ext*, which tends to have specialized block formats for different types of metadata. Btrfs just tosses things into a btree and lets the btree index them for searching.

    -chris

  7. #17
    Join Date
    Jan 2010
    Location
    NE AR
    Posts
    22

    Default

    Hi Chris,
    I'm Ric, one of the users excited about the btrfs fs (as geeky as that is).
    Thank you for taking the time to develop it!

    I ran IOzone while setting up a new SAS2 RAID adapter (LSI 9211) and disks (Hitachi C10K300) on a dual-socket-940 Opteron (285s) system running openSUSE Linux with the 2.6.32 kernel. The initial purpose was to use md, so an md RAID 0 was created and then stress-tested. IOzone was one of the tools used. An 8GB file is used to overcome the effects of the installed 4GB RAM.
    :

    /usr/lib/iozone/bin/iozone -L64 -S1024 -a -+u -i0 -i1 -s8G -r64 -M -f /SAS600RAID/iozoneTESTFILE -Rb /tmp/iozone_[openSUSE_2.6.32]_[md0-RAID0]_[btrfs]_[9211-8i].xls

    (8,388,608 kB file)
    Writer Report
    421,398 kBps

    Re-Writer Report
    424,017 kBps

    Reader Report
    321,558 kBps

    Re-Reader Report
    324,612 kBps

    Others (ext3, ext4, & JFS) fared about the same, but their READs were faster and, more importantly, faster than their WRITEs, as would be expected.

    I was a bit short on time then, so I just now ran it with the same IOzone parameters but using the 9211's "Integrated RAID" RAID 0 on a different kernel.
    :
    File size set to 8388608 KB
    Record Size 64 KB
    Machine = Linux sm.linuXwindows.hom 2.6.31.6-desktop-1mnb #1 SMP Tue Dec 8 15:
    Excel chart generation enabled
    Command line used: iozone -L64 -S1024 -a -+u -i0 -i1 -s8G -r64 -M -f /mnt/CLONE/SAS600/iozoneTESTFILE -Rb /tmp/iozone_[Mandriva_2.6.31]_[btrfs]_[9211-8i_RAID-0].xls
    Output is in Kbytes/sec :

    "Writer report"
    "64"
    "8388608" 412,433

    "Re-writer report"
    "64"
    "8388608" 417,586

    "Reader report"
    "64"
    "8388608" 391,542

    "Re-Reader report"
    "64"
    "8388608" 393,962

    As you can see, same thing: WRITE is faster than READ even on the IR RAID.
    Something weird is going on ... Perhaps it is an IOzone & btrfs issue(?). If so, the IOzone tests are skewed (... the wrong way). I'd blame it on this test, but none of the other fs had faster WRITEs than READs in the results.

    I have not tried it on an Intel Nehalem platform yet, but I thought maybe you should know something odd was occurring (that is not exhibited by the other fs).

    I don't need an explanation or anything like that, but it would be good to know you got the post, if you have the time. I do have the Excel files if needed.

    -Ric

    PS: This is not the first time I've found md to be faster than an HBA or RAID card's RAID. Distressing, but also very good for us Linux geeks. ...wish it (md) were cross-platform.

  8. #18

    Default

    Quote Originally Posted by fhj52 View Post
    Hi Chris, [...] Others (ext3, ext4, & JFS) fared about the same, but their READs were faster and, more importantly, faster than their WRITEs, as would be expected.
    Thanks for giving btrfs a try. Usually when read results are too low, it is because there isn't enough readahead being done. The two easy ways to control readahead are to use a much larger buffer size (10MB, for example) or to tune the bdi parameters.

    Btrfs does crcs after reading, and sometimes it needs a larger readahead window to perform as well as the other filesystems. You could confirm this by turning crcs off (mount -o nodatasum).
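
    (For instance, a sketch; the device and mount point are placeholders:)

        # mount with data checksums disabled, for testing only
        mount -o nodatasum /dev/sdb1 /mnt/btrfs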

    Linux uses a bdi (backing dev info) to collect readahead and a few other device statistics. Btrfs creates a virtual bdi so that it can easily manage multiple devices. Sometimes it doesn't pick the right read ahead values for faster raid devices.

    In /sys/class/bdi you'll find directories named btrfs-N where N is a number (1,2,3) for each btrfs mount. So /sys/class/bdi/btrfs-1 is the first btrfs filesystem. /sys/class/bdi/btrfs-1/read_ahead_kb can be used to boost the size of the kernel's internal read ahead buffer. Triple whatever is in there and see if your performance changes.
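
    (Concretely, a sketch; btrfs-1 and the values are examples, run as root:)

        cat /sys/class/bdi/btrfs-1/read_ahead_kb             # default is often 4096
        echo 12288 > /sys/class/bdi/btrfs-1/read_ahead_kb    # triple it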

    If that doesn't do it, just let me know. Most of the filesystems scale pretty well on streaming reads and writes to a single file, so we should be pretty close on this system.

    -chris

  9. #19
    Join Date
    Jan 2010
    Location
    NE AR
    Posts
    22

    Default

    Hi Chris,
    Thanks for the explanation and suggestion.
    Before seeing it, I did try an older parallel SCSI card, an LSI MegaRAID 320-2x with some Fujitsu U320 disks in RAID 0. The card has 512MB of BBU cache ... no way I know of to adjust that. [ ...unless you meant a kernel adjustment(?) ]
    The results were strikingly different, as before, but even more so:
    "Writer report"
    "64"
    "8388608" 244679

    "Re-writer report"
    "64"
    "8388608" 231935

    "Reader report"
    "64"
    "8388608" 51755

    "Re-Reader report"
    "64"
    "8388608" 50160

    Then I found & used your suggestion of nodatasum and changed the 4096 value to 12288 for the readahead [ *SAS600 type btrfs (rw,noatime,nodatasum) ], and that definitely looks like it improved the WRITE-faster-than-READ oddity for the 9211-8i SAS2 card. (It has no buffer/cache onboard, but the HDDs have 64MB and the adapter is set so disk cache is on.)

    Command line used: iozone -L64 -S1024 -a -+u -i0 -i1 -s8G -r64 -M -f /mnt/CLONE/SAS600/iozoneTESTFILE -Rb /tmp/iozone_[Mandriva_2.6.31.12]_[btrfs]_[9211-8i_RAID-0]-[nodatasum_12288_readahead].xls

    "Writer report"
    "64"
    "8388608" 490296

    "Re-writer report"
    "64"
    "8388608" 470194

    "Reader report"
    "64"
    "8388608" 462138

    "Re-Reader report"
    "64"
    "8388608" 458668

    READ is still slower, but not nearly as dramatically.

    The MegaRAID mount was also changed [ PAS320RAID0 type btrfs (rw,noatime,nodatasum) ], but the results did not show improvement. WRITE is still testing as ~4x faster.
    Command line used: iozone -L64 -S1024 -a -+u -i0 -i1 -s8G -r64 -M -f /mnt/PAS320RAID0/iozoneTESTFILE -Rb /tmp/iozone_[Mandriva_2.6.31.12]_[btrfs]_[PAS320_MEGARAID-0]-[nodatasum_12288_readahead].xls

    "Writer report"
    "64"
    "8388608" 232943

    "Re-writer report"
    "64"
    "8388608" 230301

    "Reader report"
    "64"
    "8388608" 52251

    "Re-Reader report"
    "64"
    "8388608" 51795

    ...
    The adapters are a lot different. The MegaRAID is a RAID card for the U320 PAS interface with a large cache & BBU, while the 9211 is an HBA for the SAS2 (6Gbps) interface with no cache or BBU. I'd like to say it is the adapters and SAS-vs-SCSI, but the ext4 results indicate otherwise.
    Last week's test,
    iozone -L64 -S1024 -a -+u -i0 -i1 -s8G -r64 -M -f /mnt/linux/PAS320RAID0/iozoneTESTFILE -Rb /tmp/iozone_[openSUSE_2.6.32]_[320-2x_RAID0]_[ext4]-2.xls
    :
    Writer Report
    64
    8388608 229468
    Re-writer Report
    64
    8388608 233403
    Reader Report
    64
    8388608 208436
    Re-reader Report
    64
    8388608 210758

    It is a bit slower too for READ ... but no drama.
    Like most everybody else, I won't be using PAS disks much longer, so I put those numbers up there just as information for you, in case they're needed.
    ...

    On the up side, man, look at those numbers. btrfs just walloped ext4 in this test!
    That 490,296 kBps is the fastest WRITE I've ever seen here. By all means, please keep up the good work!


    I'll look at the buffering, but with the 9211 HBA there's not much to do for it. Perhaps the buffering with the disks' cache somehow got turned off between Linux and the MS OS. It should not have, as it is an adapter setting, but the LSI2008 and LSI2108 kernel module (mpt2sas) is relatively new. ...it'll take a while to get the SW running to find out.


    -Ric

  10. #20

    Default

    Quote Originally Posted by fhj52 View Post
    Hi Chris,
    Thanks for the explanation and suggestion. [...] Then I found & used your suggestion of nodatasum and changed the 4096 value to 12288 for the readahead [...] READ is still slower, but not nearly as dramatically. [...] The MegaRAID mount was also changed, but the results did not show improvement. WRITE is still testing as ~4x faster. [...]
    Thanks for trying this out. nodatasum will improve both writes and reads, because checksums are no longer computed during writes or verified during reads.

    On raid cards with writeback cache (and sometimes even single drives with writeback cache), the cache may allow the card to process writes faster than it can read. This is because the cache gives the drive the chance to stage the IO and perfectly order it, while reads must be done more or less immediately. Good cards have good readahead logic, but this doesn't always work out.

    So, now that we have the kernel readahead tuned (btw, you can try larger numbers in the bdi read_ahead_kb field), the next step is to make sure the kernel is using the largest possible requests on the card.

    cd /sys/block/xxxx/queue, where xxxx is the device for your drive. You want the physical device, and if you're using MD you want to do this for each drive in the MD raid set (example: cd /sys/block/sda/queue).

    echo deadline > scheduler
    echo 2048 > nr_requests
    cat max_hw_sectors_kb > max_sectors_kb

    Switching to deadline may or may not make a difference; the others are very likely to help.
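
    (Putting that together, a sketch; sdb and sdc are placeholders for your member drives, run as root:)

        for dev in sdb sdc; do
            q=/sys/block/$dev/queue
            echo deadline > $q/scheduler                    # switch the I/O scheduler
            echo 2048 > $q/nr_requests                      # deepen the request queue
            cat $q/max_hw_sectors_kb > $q/max_sectors_kb    # allow maximum-size requests
        done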

    -chris
