Yeah, while the Phoronix Test Suite framework itself is fine, the choice of benchmarks is questionable at best.
Let's have a look at the "popular" C-Ray 1.1 benchmark. It can be downloaded from http://www.phoronix-test-suite.com/b...ray-1.1.tar.gz
It is typically run as "./c-ray-mt -t 32 -s 1600x1200 -r 8 -i sphfract -o output.ppm", but changing 1600x1200 to 160x120 lets it run for seconds instead of hundreds of seconds on ARM. Profiling the code compiled with gcc 4.7.0 shows the following:
Code:
./c-ray-mt -t 32 -s 160x120 -r 8 -i sphfract -o output.ppm
samples  %        image name    symbol name
28459    51.8672  c-ray-mt      shade
17869    32.5667  c-ray-mt      ray_sphere
4110      7.4906  c-ray-mt      trace
3185      5.8047  c-ray-mt      render_scanline
319       0.5814  libm-2.13.so  __ieee754_pow
194       0.3536  libm-2.13.so  powl
136       0.2479  libm-2.13.so  __exp1
108       0.1968  libc-2.13.so  memcpy
78        0.1422  c-ray-mt      get_primary_ray
73        0.1330  c-ray-mt      get_sample_pos
59        0.1075  libm-2.13.so  isnanl
42        0.0765  vmlinux       __do_softirq
36        0.0656  vmlinux       __schedule
35        0.0638  libm-2.13.so  checkint
31        0.0565  libc-2.13.so  fputc
18        0.0328  libm-2.13.so  __mul
4         0.0073  c-ray-mt      main
And this reveals a major performance problem: function call overhead is insane. ray_sphere shows up as a separate symbol with about a third of all samples, so it is not being inlined into its caller. Just making sure that the ray_sphere function gets inlined improves performance significantly. As a workaround, the -finline-limit=100000 option can be added for more aggressive inlining. The results of "./c-ray-mt -t 32 -s 160x120 -r 8 -i sphfract -o output.ppm" on a 1.2GHz ARM Cortex-A9, compiled with gcc 4.7.0:
Rendering took: 6 seconds (6685 milliseconds) for CFLAGS="-O3 -ffast-math"
Rendering took: 5 seconds (5436 milliseconds) for CFLAGS="-O3 -ffast-math -finline-limit=100000"
But the real fix is to use "static inline" for the performance-critical functions. Whoever developed C-Ray apparently has no clue about performance optimization. Or maybe it was done on purpose to make the job harder for the compilers. Compilers that are configured to use aggressive inlining by default are going to win by a huge margin on this test (trading it for larger binary sizes, because there are no free cookies).
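To illustrate what I mean by "static inline", here is a minimal sketch. It is not the actual C-Ray source: the struct layouts and the names dot, ray_hits_sphere and count_hits are made up for this example, and the intersection test is simplified to a yes/no answer. The point is only that the small hot function is defined as "static inline" in the same translation unit as the loop that calls it, so the compiler is free to inline it.

```c
/* Hypothetical, simplified types; not copied from the C-Ray sources. */
struct vec3   { double x, y, z; };
struct ray    { struct vec3 orig, dir; };
struct sphere { struct vec3 pos; double rad; };

/* "static inline": internal linkage plus a hint that the compiler
 * should inline the body at call sites in this translation unit. */
static inline double dot(struct vec3 a, struct vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* Returns 1 if the ray hits the sphere in front of its origin, else 0.
 * Solves a*t^2 + b*t + c = 0 only as far as the signs are needed. */
static inline int ray_hits_sphere(const struct sphere *sph,
                                  const struct ray *ray)
{
    struct vec3 oc = { ray->orig.x - sph->pos.x,
                       ray->orig.y - sph->pos.y,
                       ray->orig.z - sph->pos.z };
    double a = dot(ray->dir, ray->dir);
    double b = 2.0 * dot(ray->dir, oc);
    double c = dot(oc, oc) - sph->rad * sph->rad;
    double d = b * b - 4.0 * a * c;

    if (d < 0.0)
        return 0;               /* no real roots: the ray misses */
    /* The larger root is positive iff b < 0 or c < 0 (given a > 0). */
    return (b < 0.0 || c < 0.0) ? 1 : 0;
}

/* Hot loop: one intersection test per object, with no call overhead
 * once ray_hits_sphere has been inlined. */
int count_hits(const struct sphere *objs, int n, const struct ray *ray)
{
    int hits = 0;
    for (int i = 0; i < n; i++)
        hits += ray_hits_sphere(&objs[i], ray);
    return hits;
}
```

Keep in mind that "inline" is only a hint. If gcc still refuses to inline a function, __attribute__((always_inline)) is the heavier hammer, and the profile will show whether the symbol has actually disappeared.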
Generally, I get the impression that this selection of Phoronix benchmarks has been made on purpose. Surely, with compiler optimizations disabled, or when benchmarking poorly written code such as C-Ray, the difference between the results from different compilers may be quite significant (and mostly random). Benchmarking properly written code with properly selected optimization options is surely boring, because it is less likely to show surprising wins or sensations.