First, the bonnie++ benchmark is nonsense. I downloaded the benchmark suite, and pts/test-resources/bonnie/install.sh makes a bonnie script that will run

./bonnie_/sbin/bonnie++ -d scratch_dir/ -s $2 > $LOG_FILE 2>&1

-s controls the size of the big file used in the sequential write/rewrite/read and lseek tests, and has no impact on the multiple file creation/read/deletion test. The default for that is -n 10:0:0:0, IIRC, which means bonnie++ creates 10 * 1024 empty files in the scratch directory. That mostly tests the kernel's in-memory cache structures, since it isn't big enough to fill up memory, so you're not waiting for anything to happen on disk. The deletion does have to happen on disk for anything that made it to disk before being deleted, and that can be a bottleneck.
-n 30:50000:200:8 would probably be a more interesting test (30*1024 files, with sizes between 200B and 50kB (not kiB), spread over 8 subdirectories).
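For example, running it by hand with those parameters would look something like this (the scratch directory and file size are placeholders; -u is only needed if you're running it as root):

bonnie++ -d /mnt/scratch -s 4096 -n 30:50000:200:8 -u nobody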
A few people have pointed out that XFS has stupid defaults, but nobody posted a good recommendation. I've played with XFS extensively and benchmarked a few different kinds of workloads on HW RAID5 and on single disks. And I've been using it on my desktop for several years now. For general purpose use, I would recommend:
mkfs.xfs -l lazy-count=1,size=128m -L yourlabel /dev/yourdisk
mount with -o noatime,logbsize=256k (put that in /etc/fstab)

lazy-count: don't keep the counters in the superblock up to date all the time, since there's enough info elsewhere. Fewer writes = good.
-l size=128m: XFS likes to have big logs, and this is the max size.
mount -o logbsize=256k: that's log buffer size = 256kiB (of kernel memory). The default (and the max with v1 logs) is 32kiB. This makes a factor of > 2 performance difference on a lot of small-file workloads. I think logbufs=8 has a similar effect (the default is 2 log buffers of 32k each). I haven't tested logbufs=8,logbsize=256k. The XFS devs frequently recommend logbsize=256k to people asking about perf tuning on the mailing list, but they don't mention increasing logbufs too.
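For reference, a hypothetical /etc/fstab line with those mount options might look like this (the label and mount point are placeholders):

LABEL=yourlabel  /data  xfs  noatime,logbsize=256k  0  2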
If you have an older mkfs.xfs, get the latest xfsprogs; 2.10.1 has better defaults for mkfs (e.g. agcount=4 unless you set RAID stripe params, which is about as much parallelism as a single disk can give you anyway. The old default was a much higher agcount, which could slow things down once the disk started to get full.)
Or just use your old mkfs.xfs and specify agcount:
mkfs.xfs -l lazy-count=1,size=128m -L label /dev/disk -d agcount=4 -i attr=2
If you want to start tuning, read up on XFS a bit: http://oss.sgi.com/projects/xfs/ (unfortunately, there's no good tuning guide anywhere obvious on the web site). Read the mkfs.xfs man page too.
You can't change the number of allocation groups without a fresh mkfs, but you can enable version 2 logs, and lazy-count, without one: xfs_admin -j -c 1 will switch to v2 logs with lazy-count enabled. xfs_growfs says growing the log isn't supported, which is a problem if your log is smaller than the max size of 128MB, since XFS loves large logs. A bigger log lets XFS keep more metadata ops in flight, instead of being forced to write them out sooner.
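A minimal sketch of that conversion, assuming you can unmount the filesystem first (device and mount point are placeholders):

umount /mnt/data
xfs_admin -j -c 1 /dev/yourdisk    # switch to v2 logs and turn on lazy-count
mount /mnt/data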
If your FS is bigger than 1TB, you should mount with -o inode64, too. Note that, contrary to the docs, noikeep is the default; I checked the kernel sources, and that's been the case for a while, I think. Otherwise I would recommend noikeep to reduce fragmentation.
If you're making a filesystem of only a couple GB, like a root fs, a 128MB log will take a serious chunk of the available space. You might be better off with JFS. I'm currently benchmarking XFS with tons of different option combinations for use as a root fs (XFS block size, log size, lazy-count=0/1, mount -o logbsize=, and block dev readahead and I/O elevator).
I use LVM for /usr, /home, /var/tmp (includes /var/cache and /usr/local/src), so my root FS currently is a 1.5GB JFS filesystem that is 54% full. It's on a software RAID1.
Since I run Ubuntu, my /var/lib/dpkg/info has 9373 files out of the total 20794 regular files (27687 inodes) on the filesystem, most of them small.
find / -xdev -type f -ls | sort -n -k7 | less -S
Then look at the % in less's status line, or type 50% to jump to the 50% point of the file. (A scripted version of the same tally is sketched after the list below.)
<= 1k: 45%
<= 2k: 52%
<= 3k: 58% (mostly /var/lib/dpkg/info)
<= 4k: 59%
<= 6k: 62%
<= 8k: 64%
<= 16k: 71% (a lot of kernel modules...)
<= 32k: 85%
<= 64k: 93%
<= 128k: 96%
> 1M: 0.2% (57 files)
(I started doing this with find without -type f, and there are lots of small directories (that don't need any blocks outside the inode): < 1k: 59%; < 2k: 64%; < 3k: 68%)
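Here's roughly how to reproduce that tally as a script (just a sketch; it assumes GNU find's -printf, and the size cutoffs are the ones listed above):

sizes=$(mktemp)
find / -xdev -type f -printf '%s\n' | sort -n > "$sizes"
total=$(wc -l < "$sizes")
for lim in 1024 2048 3072 4096 6144 8192 16384 32768 65536 131072; do
    n=$(awk -v l="$lim" '$1 <= l' "$sizes" | wc -l)
    printf '<= %6d bytes: %d%%\n' "$lim" $(( 100 * n / total ))
done
rm -f "$sizes"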
Every time dpkg upgrades a package, or I even run dpkg -S, it reads /var/lib/dpkg/info/*.list (and maybe more) (although dlocate usually works as a replacement for dpkg -S). This usually takes several seconds when the cache is cold on my current JFS filesystem, which I created ~2 years ago when I installed the system. That's what I currently notice as slow on my root filesystem. JFS is fine with hot caches, e.g. for /lib, /etc, /bin, and so on. But dpkg is always very slow the first time.
Those small files are probably pretty scattered by now, and probably not stored in anything like readdir() order or alphabetical order. I'm hoping XFS will do better than JFS at keeping fragmentation down, although it probably won't. It writes files created at the same time all nearby (it actually tries to make contiguous writes out of dirty data), but AFAIK it doesn't look at where old files in the same directory are stored when deciding where to put new files, so I'll probably still end up with scattered files. At least with XFS's batched writeout, mkdir info.new; cp -a info/* info.new; mv ...; rm -r ...; will make a defragged copy of the directory and the files in it (sketched below). To defrag just the directory, mkdir info.new; ln info/* info.new/ is enough, and that can make readdir order = alphabetical order. Note the use of *, which expands to a sorted list, instead of a plain cp -a, which would operate in readdir order. dpkg doesn't read in readdir order anyway; it goes (mostly?) alphabetically by package name (based on its status file).
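Spelled out, the copy-based defrag would go something like this (just a sketch of the steps; the info.old name is a placeholder, and obviously dpkg must be idle while you do it):

cd /var/lib/dpkg
mkdir info.new
cp -a info/* info.new/    # * expands to a sorted list, so files get written in alphabetical order
mv info info.old
mv info.new info
rm -r info.old            # only after checking the new copy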
Anyway, I'm considering using a smaller data block size, like -b size=2k or size=1k (but -n size=8k; I definitely don't want smaller blocks for directories. There are a lot of tiny directories, but they won't waste 8k, because there's room in the inode for their data; see directory sizes with e.g. ls -ld. Larger directory block sizes help reduce directory fragmentation, and most of the directories on my root filesystem that aren't tiny are fairly large. xfs_bmap -v works on directories too, BTW). XFS is extent-based, so a small block size doesn't create huge block bitmaps even for large files.
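Concretely, the mkfs line I have in mind for a small root fs would be something like this (a sketch; the label and device are placeholders):

mkfs.xfs -b size=2k -n size=8k -l lazy-count=1 -L root /dev/yourdisk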
I think I was finding that smaller data block sizes use more CPU than the default 4k (= max = page size) in hot-cache situations. Comparing some results I've already generated, 1k or 2k does seem slightly faster for: untarring the whole FS; drop_caches; tar c | wc -c (so stat + read); drop_caches; untar again (overwrite); drop_caches; read some more, timing each component of that. My desktop has been in single-user mode for 1.5 days running these tests. I should post my results somewhere when I'm done... And I need to find a good way to explore the 5 (or more) dimensional data (time as a function of block size, log size, logbuf size, lazy-count=0/1, deadline vs. cfq, and blockdev --setra 256, 512, or 1024 if I let my tests run that long...).
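Each cold-cache step in that cycle looks roughly like this (a sketch; the mount point and tarball names are placeholders, and drop_caches needs root):

sync; echo 3 > /proc/sys/vm/drop_caches
time sh -c 'tar -c -C /mnt/testfs . | wc -c'    # stat + read everything
sync; echo 3 > /proc/sys/vm/drop_caches
time tar -xf rootfs.tar -C /mnt/testfs          # untar again (overwrite)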
BTW, JFS is good, and does use less CPU. That won't reduce CPU wakeups to save power, though: FS code mostly runs when called by processes doing a read(2), an open(2), or whatever. Filesystems do usually start a thread to do async tasks, but those threads shouldn't be waking up at all when there's no I/O going on.
I decided to use JFS for my root FS a couple of years ago after reading http://www.sabi.co.uk/blog/anno05-4th.html#051226b. I probably would have used XFS, but I hadn't realized that to work around the grub-install issue you just have to boot GRUB from a USB stick or whatever, and type root (hd0,0); setup (hd0). I recently set up a bioinformatics cluster using XFS for root and all other filesystems. It works fine, except that getting GRUB installed is a hassle.
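In the legacy GRUB shell that's just (assuming /boot is on the first partition of the first disk; adjust (hd0,0) accordingly):

grub> root (hd0,0)
grub> setup (hd0)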
Also BTW, there's a lot of good reading on www.sabi.co.uk, e.g. suggestions for setting up software RAID (http://www.sabi.co.uk/blog/0802feb.html#080217), and lots of filesystem stuff.
XFS is wonderful for large files, and has some other neat features. If you download torrents, you usually get fragmented files, because they start sparse and are written in whatever order the blocks come in. XFS can preallocate space without actually writing it, so you end up with a minimally-fragmented file. Azureus has an option to use xfs_io's resvsp command. Linux now has an fallocate(2) system call which should work for XFS and ext4, and posix_fallocate(3) should use it. I'm not sure if fallocate is actually implemented for XFS yet, but I would hope so, since its semantics are the same as resvsp's. And I don't know which glibc version includes an fallocate(2) backend for posix_fallocate(3).
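For example, reserving space for a download by hand looks like this (a sketch; the file name and size are placeholders):

xfs_io -f -c "resvsp 0 700m" some-download.iso    # reserve 700MB from offset 0 without writing any data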
And XFS has nice tools, like xfs_bmap, which shows you the fragmentation of any file.
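e.g. (the path is just an example):

xfs_bmap -v /var/lib/dpkg/info/coreutils.list    # -v shows each extent's block range and allocation group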