Sorry for the late response, I have only just noticed this thread. I believe some results of this test set are incorrect, and I would like to comment on them.
The first 9 results look fine:
1. LZMA Compression - not really an I/O test, as can be seen from the almost equal results.
2. Gzip compression - same.
3. Compile bench - not sure what exactly this test does, but OK. For non-threaded I/O, NCQ may give slightly lower performance at the drive firmware level.
4. Postmark - a 44% benefit under parallel load is normal for CAM ATA because of NCQ.
5. Unpacking kernel - unpacking is a single-threaded process with a lot of flushing. The reason for the small slowdown may be the same as in 3.
6. Write in 8 threads - CAM with NCQ won a bit, OK.
7/8. Write in 16/32 threads - increasing the number of threads makes the access pattern more random, which penalizes legacy ATA, while NCQ in CAM probably compensates for it.
9. Write in 32 threads by 128MB - I can't explain why the results are slightly better than in 8, but CAM with NCQ still wins.
But the rest are not good:
10. Random write in 8 threads - for random tests tiobench uses 4K blocks. No desktop drive (and especially no laptop one) can do more than 200-300 random I/Os per second. As a result, the best this test should show is about 1MB/s. Instead we see about 49MB/s in both cases. The explanation is trivial - all the data fit into ZFS caches and were written almost sequentially on file close. This is just not a disk subsystem test.
11. Random write in 16 threads - because of the larger active data set, caching works worse. As a result we see lower speeds. The speeds are still higher than physically possible, though, which means caching is still actively used.
12. Random write in 4 threads by 128MB - as I have said, 25MB/s with legacy ATA can't be explained by anything except caching. Random write in 4 threads just can't be faster than random write in 16 threads in 11. This result is wrong by definition. Most probably something affected the cache hit ratio between tests.
13/14. Reading in 16 threads by 64MB and 256MB - the only reason the results of these two tests could differ is cache hits.
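The arithmetic behind item 10 can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the 4K block size and the 200-300 random I/Os per second are the figures quoted above, not measurements. The small difference between MB and MiB is ignored.

```python
# Rough ceiling for random-I/O throughput of a rotating disk.
# Assumptions (taken from the post, not measured here): tiobench random
# tests use 4K blocks; the drive manages 200-300 random I/Os per second.

BLOCK_BYTES = 4 * 1024
MIB = 1024 * 1024


def max_random_throughput_mib_s(iops, block_bytes=BLOCK_BYTES):
    """Upper bound on throughput if every request really hits the disk."""
    return iops * block_bytes / MIB


def exceeds_disk_ceiling(measured_mib_s, iops_ceiling=300, block_bytes=BLOCK_BYTES):
    """True if a reported result is above what the drive could do uncached."""
    return measured_mib_s > max_random_throughput_mib_s(iops_ceiling, block_bytes)


if __name__ == "__main__":
    for iops in (200, 300):
        print(iops, "IOPS ->", round(max_random_throughput_mib_s(iops), 2), "MiB/s")
    # 49 and 25 are the approximate figures discussed in items 10 and 12.
    for reported in (49, 25):
        print(reported, "MB/s exceeds ceiling:", exceeds_disk_ceiling(reported))
```

Under these assumptions the ceiling comes out at roughly 0.8 to 1.2 MiB/s, so any random-write figure in the tens of MB/s is far above what the drive alone could deliver.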
So my conclusion: these tests did not take cache effects into account. If that was intentional, then this is not a comparison of ATA subsystems but of cache effectiveness. If it happened accidentally, then these results just do not mean anything.
To further support my point, here are some of my own benchmark results. They were obtained on i386 9-CURRENT with 2GB of RAM. This memory-limited configuration was chosen intentionally to minimize cache effects and really compare the disk subsystems.
To compare disk subsystem performance independently of file systems, here are benchmarks of legacy and CAM ATA with random read, write and mixed I/O requests of different sizes to the raw disk: http://people.freebsd.org/~mav/TEST.raidtest
Here you can see an almost twofold speedup on read requests. Write requests do not benefit because they are already covered by the enabled drive write cache.
To Phoronix: In my tests I always try to validate and explain every aspect of the results. Until you do the same in your reviews, they won't be worth much.
PS: Note that this system was not really suitable for ZFS, so the numbers can be compared only with special care and understanding. I had no goal of comparing UFS and ZFS directly. They were made for completely different environments, and each has its own benefits and requirements.