Quick, overall system performance suite?
mendieta
04-10-2009, 06:48 PM
Hi
I am getting a new computer soon, and of course I am drooling about overclocking and stuff ;-)
Long story short, I installed PTS on my old machine to start playing. I think the easy access to Global and the way you can compare against online results is incredibly cool and useful.
What I found lacking is usability for someone who wants a quick test. Has anyone used Geekbench? It is a pleasure. Click, download, run, and you get a score for your machine. And you get it in a minute or two on a slow machine. Granted, it lacks disk and graphics tests, but I think we could use something like this. I have some ideas, and I am planning to post them in the sticky thread for this forum (PTS). But I wonder whether such a quick, to-the-point suite already exists? I saw a bunch of "sys performance" suites, but they involve 100 MB or more of downloads and many dozens of minutes of runtime. Am I missing something?
Thanks so much Michael and all for the great work!
Michael
04-10-2009, 07:01 PM
There is currently no system-quick suite, but if you propose a suite with what tests to include and such, I would be happy to make it. Or if you run: phoronix-test-suite build-suite you can create the suite and I would be glad to push it upstream.
mendieta
04-10-2009, 07:27 PM
Great, Michael. What I had in mind was a bit more involved. I think we could aggregate results with some sort of an average, maybe geometric mean:
http://en.wikipedia.org/wiki/Geometric_mean
What we would average is the score of each component:
CPU_S = Single Core CPU, indirectly measures memory speed/bandwidth
CPU_M = Multiple Core CPU, also measures memory
DISK = hard drive read/write performance.
GRAPH_2D = 2D Performance
GRAPH_3D = 3D Performance
Is something like this doable in this framework?
In this mixture you get contributions from the speed of each processor per se (useful when you run single-threaded apps), multithreaded speed, and so forth. I thought memory is already included in the two CPU tests, so adding a memory test separately would give memory itself too much weight.
For tests measured as "lower is better", we would output as a score the inverse of this number. For instance, if the output is execution time, we would output 1 divided by that time (a "frequency").
Also, for the numbers to make sense, scores would need to be normalized to some benchmark machine. In that machine, the score is 1. Maybe a single core older machine you have hanging around ;-)
A natural byproduct would be to have a system-quick-cli with the first three contributions and a system-quick-gui with the other two.
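To make the proposal above concrete, here is a minimal Python sketch of the scoring idea: invert "lower is better" results, normalize each component against a baseline machine, and take the geometric mean. The function names and all the numbers are made up for illustration; none of this is PTS code.

```python
import math

def component_score(raw, baseline, lower_is_better=False):
    """Normalize a raw result so that the baseline machine scores 1.0."""
    if lower_is_better:
        # e.g. run time in seconds: invert so that faster means a higher score
        raw, baseline = 1.0 / raw, 1.0 / baseline
    return raw / baseline

def composite(scores):
    """Geometric mean of the normalized component scores."""
    return math.prod(scores) ** (1.0 / len(scores))

# Hypothetical results (this machine, baseline machine), for illustration only.
scores = [
    component_score(400.0, 200.0),                      # CPU_S, higher is better
    component_score(30.0, 60.0, lower_is_better=True),  # CPU_M, seconds
    component_score(80.0, 40.0),                        # DISK, MB/s
]
print(composite(scores))
```

In this made-up case every component is twice as fast as the baseline, so the composite comes out at about 2.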
Michael
04-10-2009, 07:58 PM
Ah, okay.
Yes, it can be done (well, it needs to be implemented within pts-core, but I should be able to fit it into PTS 2.0). Should all of them be weighted the same then? If we can start a discussion and get others involved in this thread to provide their thoughts and feedback, it would be great so we can settle for a fair, standard composite scoring system.
If you want to start by proposing some tests, that would be good, etc. Well, for CPU_M the best multicore CPU test in my opinion is graphics-magick. For DISK, IOzone is probably the best but that takes a while to run. So perhaps one of the compression suites.
As soon as the scoring is settled, I can then work on the needed support within the framework to offer this.
mendieta
04-10-2009, 09:25 PM
Thanks a lot Michael!
Yes, getting more people involved is important. Maybe you can make this thread sticky until this is settled?
Scoring: correct, in principle we don't need to change the weights. If we wanted to, the natural way is to use exponents. For instance, to make the disk twice as important, we would square the disk score in the product, and then raise the whole thing to the power 1/6 instead of 1/5.
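A small Python sketch of the exponent idea, assuming five components with normalized, strictly positive scores. The helper name and the numbers are hypothetical.

```python
import math

def weighted_score(scores, weights=None):
    """Weighted geometric mean: each score is raised to its weight and the
    product is raised to 1 / (sum of the weights). Scores must be > 0."""
    weights = weights or {}
    total_weight = 0.0
    log_sum = 0.0
    for name, score in scores.items():
        w = weights.get(name, 1.0)  # components not listed keep weight 1
        log_sum += w * math.log(score)
        total_weight += w
    return math.exp(log_sum / total_weight)

# Hypothetical normalized scores for the five components.
scores = {"CPU_S": 2.0, "CPU_M": 3.0, "DISK": 1.5, "GRAPH_2D": 2.5, "GRAPH_3D": 4.0}

equal = weighted_score(scores)                   # 5th root of the plain product
disk_x2 = weighted_score(scores, {"DISK": 2.0})  # DISK squared, then 6th root
print(equal, disk_x2)
```

Working with logarithms is just a convenience here; it gives the same value as multiplying the scores and taking the root directly.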
Composition: we can change the number of tests, what components are tested, etc.
The goal: something that runs in a couple of minutes on a 2 GHz single processor, and perhaps a couple more minutes for download and installation of packages (assuming a broadband connection). Again, this is flexible, but the idea is to allow people to quickly get a number. For detailed analysis we have lots of tests already, and people will keep adding good stuff.
Test candidates:
* For CPU_S I would say Super-Pi, it is the most popular test, fast and single threaded.
* For CPU_M: I tried graphics-magick. Just downloading and installing the test took about 15 minutes in my Sempron 2400+. Way too long for this, we'll need something else.
* Disk: what happened to bonnie? Was it any good? (I never tried it and it's not in PTS 1.8). I agree about IOZone, unless we can call it with arguments to make it faster. And just one iteration, even if we lose some accuracy. People can run the whole test a few times if they want an average, but most people won't care for a quick number.
For graphics I still couldn't find good candidates. Except perhaps 2D performance: would one of the gtkperf tests be good for that? I think they measure 2D performance mostly, no?
If things start shaping up I'll keep a clean first post in this thread summarizing the progress.
Michael
04-10-2009, 09:33 PM
Stickied.
One thing that would be nice is for the selected tests to work on Linux, Mac OS X, OpenSolaris, and ideally BSD too.
- Super-Pi. I am not too fond of super-pi. Additionally, the license of super-pi is not clear and it's binary-only. Maybe scimark2 or something similar? Check out the computational suite.
- bonnie disappeared due to a parsing bug I haven't gotten around to fixing. IOzone really isn't accurate though unless the tested size is greater than the system memory size, which ends up needing options or to use some very large size default. As a result, maybe a compression test might end up working better.
gtkperf is good. Or maybe qgears2.
mendieta
04-10-2009, 10:55 PM
One thing that would be nice is for the selected tests to work on Linux, Mac OS X, OpenSolaris, and ideally BSD too.
Good point. Also, open source if at all possible. Maybe these core tests could be distributed (the sources) with the PTS, so there is no risk of some of the servers holding the tests being down or slow. Not sure about this, just a thought.
I'll look at the tests you suggested and other tests, and see if other people bring ideas/insight over the next few days. I'll also clean up the original post.
Michael
04-10-2009, 11:06 PM
Maybe these core tests could be distributed (the sources) with the PTS, so there is no risk of some of the servers holding the tests being down or slow. Not sure about this, just a thought.
Nope, won't happen. I will not begin distributing tests with PTS. However, with PTS Linux Live that is an option...
mendieta
04-11-2009, 09:57 AM
Yeah, you are right, it's much better to keep the test small and download stuff on demand.
mendieta
04-12-2009, 12:43 PM
I am looking into this.
* For 2D, the Circles test in gtkperf seems good; PixBufs seems good and fast too. Of course these are not "real world" tests, but I doubt we can get real-world tests in graphics. Qgears2 didn't run here!
* For 3D I am looking at the GL suites, because real-world tests demand downloading large games. GLMark dumped core here. I'll keep looking.
* For disk, all I've seen take a long time so far, including fio. Maybe we can test disk and multicore CPU with a compile (it exercises lots of disk reads and writes). The issue is the time. Compiling Apache (the fastest compile test I've seen) takes 1 minute. Can we do just one compilation instead of 4?
* Single processor: scimark2 depends on Java. That's a biggie, and it can also have issues if your system doesn't have a JIT compiler for Java. But I think we'll find a good single-threaded test which mostly exercises CPU and memory. This way, CPU_S would be mostly 1CPU + MEM, and CPU_M would be MultiCPU + Disk. Seems reasonable. Maybe single-threaded music/video encoding would be good for this.
Michael
04-12-2009, 12:48 PM
There's two versions of scimark2: scimark2 and java-scimark2. Only the latter should depend upon Java.
ffmpeg is a nice encoding test.
mendieta
04-12-2009, 01:24 PM
Scimark2: sounds good then, I'll look a bit more. Ffmpeg is nice but the download is 10 minutes @ 186 K/s.
Apache: can we run it only once? (the compilation). It seems like a nice disk+multicore test.
mendieta
04-12-2009, 01:53 PM
Does the composite test in scimark2 run all the scimark2 tests sequentially? It seems like a nice test if so. Can we run it just once?
Also: all encoding tests seem to trigger the same data download:
=========================================
Downloading Files For: timed-audio-encode
Estimated Download Size: 74.99 MB
=========================================
I am really tempted to use scimark2 for CPU_S. It is also very portable (I looked at the code, and it seems to be a bunch of self-contained C code ...)
mendieta
04-12-2009, 09:44 PM
For 3D, would trislam be a good candidate for this? It installed fine on my two machines, but when it runs it opens a window and doesn't draw anything in it (it always looks black), then it reports the run time. Is that the way it is supposed to be?
Michael
04-12-2009, 09:57 PM
No, trislam wouldn't really be good for being a standard. The Perl OpenGL libraries are not too common on most Linux desktops and trislam isn't too real world representative.
Yes, that does sound about right for that test profile.
mendieta
04-12-2009, 11:08 PM
The Unigine tests seem pretty interesting: realistic, but also a manageable download size. But they are crashing (dumping core) on my machine. I wonder how easy it is to get them going on all the platforms of interest ... tests in this suite should be pretty solid. Any suggestions for 3D?
mendieta
04-13-2009, 08:10 AM
The Unigine tests seem pretty interesting: realistic, but also a manageable download size. But they are crashing (dumping core) on my machine. I wonder how easy it is to get them going on all the platforms of interest ... tests in this suite should be pretty solid. Any suggestions for 3D?
It seems like unigine really doesn't run in older cards. We need a more universal 3D test:
http://www.phoronix.com/forums/showthread.php?p=70198#post70198
Michael
04-13-2009, 08:34 AM
Unigine would be an excellent 3D test, especially as Unigine Corp is involved with PTS.
mendieta
04-13-2009, 08:47 AM
Unigine would be an excellent 3D test, especially as Unigine Corp is involved with PTS.
Sure, but what do we do about older cards? Something that could be done is to just ignore the 3D test for them. Would that make sense? Can we ask Unigine Corp why they fail with a fatal error if that extension is not found (or if there is a workaround for older cards)?
mendieta
04-14-2009, 08:50 AM
Michael: let's summarize where we are, and there are a few questions that you haven't seen above:
CPU_S = Single Core CPU plus RAM. Test: Scimark2, Composite test [1]
CPU_M = Multiple Core CPU plus disk. Test: build-apache, one pass [2]
GUI_2D = 2D Performance. Test: gtkperf draw circles. [3]
GUI_3D = 3D Performance. Test: Unigine Sanctuary [4]
The global test scores would be as follows:
system-quick = power(CPU_S*CPU_M*GUI_2D*GUI_3D, 1/4)
system-quick-cli = power(CPU_S*CPU_M, 1/2)
system-quick-gui = power(GUI_2D*GUI_3D, 1/2)
Note that the score of system-quick is also the geometric mean of the other two. When you run a test, besides the global score we can show the individual scores, like in geekbench: a big number, and details beneath.
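The claim that system-quick is the geometric mean of the other two scores can be checked numerically. A quick Python sketch with made-up component scores:

```python
import math

def gmean(*values):
    """Geometric mean: n-th root of the product of n values."""
    return math.prod(values) ** (1.0 / len(values))

# Hypothetical normalized component scores.
cpu_s, cpu_m, gui_2d, gui_3d = 1.8, 2.6, 1.2, 3.1

system_quick = gmean(cpu_s, cpu_m, gui_2d, gui_3d)
system_quick_cli = gmean(cpu_s, cpu_m)
system_quick_gui = gmean(gui_2d, gui_3d)

# sqrt(sqrt(a*b) * sqrt(c*d)) is the same as the 4th root of a*b*c*d
print(math.isclose(system_quick, gmean(system_quick_cli, system_quick_gui)))
```

This only works out so neatly because the cli and gui groups have the same number of components (two each).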
Questions:
[1] Does the composite aggregate all the individual tests of scimark2? That would be best.
[2] The regular test for build-apache builds it 3 times, way too long for this. Can we build it just once in this test?
[3] Draw circle may be limited. Can we run all "draw" tests sequentially and aggregate? (maybe not)
[4] What do we do in cases where Unigine fails? (older cards). Maybe we should only show the cpu score in that case.
I think we are getting there. Best!
grigi
04-14-2009, 10:46 AM
I think compiling is a terrible way to test disk scores. You need a way to test the pure disk performance.
For normal usage random read/write is more important than sequential. So could you limit IOZone to only test 4K random reads/writes? According to the interesting article written by Anand from Anandtech, random write performance is almost the most noticeable feature of your disk subsystem. So we could give random writes more importance.
I also think using power values doesn't seem quite right. When one compares a score of 3000 to a score of 6000, the "6000" PC is roughly double as fast. Linear scale makes sense...
A Score of:
* Fastest thread
* Total processing power
* Random Disk performance
* 2D (Gui) performance
* 3D performance
It might be useful to split 3D into 2 parts, e.g. Simple 3D and Advanced 3D.
Then Unigine can be used for Advanced 3D, and if it fails for whatever reason a score of 0 is acceptable.
It would be great to have one number, but the problem is that a single number is very misleading. So we need to show the sub-values (e.g. 6 of them) quite prominently.
mendieta
04-14-2009, 11:18 AM
Thanks for the input, grigi. We need more people to jump in!
I think compiling is a terrible way to test disk scores. You need a way to test the pure disk performance.
For normal usage random read/write is more important than sequential. So could you limit IOZone to only test 4K random reads/writes? According to the interesting article written by Anand from Anandtech, random write performance is almost the most noticeable feature of your disk subsystem. So we could give random writes more importance.
That's an interesting idea (adding a fast, dedicated disk efficiency test). We are trying to focus on real-world (as opposed to synthetic) measurement, though. Is there any "real world" test that could serve? Otherwise, we may just want to do what you are proposing.
Regarding the average, did you read the third post on the 1st page of the thread in detail (and the Wikipedia article)? I think a geometric mean makes a lot of sense in this case. If all the components are 3 times faster than the baseline system, your score is 3. For systems much faster than the baseline, the geometric mean will more evenly consider all components. Consider 4 components with scores 3, 8, 9, 10 (one subsystem is much slower). The geom-mean is 6.82 and the arith-mean 7.50. If you speed up the slow component from 3 to 6, your score jumps 19% in the geom-mean, to 8.11, but only 10% in the arith-mean (to 8.25). Still, we could use arith-mean if everyone feels it's best.
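The four-component example above can be reproduced with Python's standard library (a quick sketch; `statistics.geometric_mean` needs Python 3.8 or newer):

```python
from statistics import geometric_mean, mean

before = [3, 8, 9, 10]  # one subsystem is much slower
after = [6, 8, 9, 10]   # the slow subsystem is now twice as fast

for label, avg in (("geometric", geometric_mean), ("arithmetic", mean)):
    old, new = avg(before), avg(after)
    gain = 100.0 * (new / old - 1.0)
    print(f"{label:10s} {old:.2f} -> {new:.2f} (+{gain:.0f}%)")
```

The geometric mean goes from 6.82 to 8.11 (about 19%), while the arithmetic mean goes from 7.50 to 8.25 (10%), so fixing the bottleneck is rewarded more under the geometric mean.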
Coming up with a single number is always arbitrary, but useful to have an idea of where your system stands. Regardless, yes, the idea is to show all the individual scores.
One of the goals is to keep the whole test as fast as possible, two 3D tests would probably be overkill, no?
Many thanks!
mendieta
04-14-2009, 11:36 AM
Just to be very specific:
I also think using power values doesn't seem quite right. When one compares a score of 3000 to a score of 6000, the "6000" PC is roughly double as fast. Linear scale makes sense...
The geometric mean does scale linearly if all components scale linearly. Say each factor is twice as fast: your score is multiplied by power(2^n, 1/n) = 2. With four components, the new score (with all components twice as fast) is multiplied by the fourth root of 2^4, that is, 2. Maybe the notation is not clear; the Wikipedia article is nicer :-)
http://en.wikipedia.org/wiki/Geometric_mean
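The linear-scaling point can also be checked in a few lines of Python (the component scores are arbitrary, picked only for illustration):

```python
import math

def gmean(values):
    """Geometric mean: n-th root of the product of n values."""
    return math.prod(values) ** (1.0 / len(values))

base = [3.0, 8.0, 9.0, 10.0]       # hypothetical component scores
doubled = [2.0 * s for s in base]  # every component twice as fast

ratio = gmean(doubled) / gmean(base)
print(ratio)  # 2, up to floating-point rounding
```

So a machine that is twice as fast in every component gets twice the composite score, which is the "3000 vs 6000" behaviour grigi asked for.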
grigi
04-14-2009, 11:55 AM
Eh, sorry. I missed that. A geometric mean makes more sense, since we want to favour lower scores (lower scores tend to indicate some bottleneck, and us users notice the bottlenecks)
The reason I'm thinking of two 3D tests is that Unigine doesn't run on any of the open-source drivers at the moment, but the overall gaming experience isn't too bad. Maybe we should get one 3D app that can fall back to fewer features, but not at the expense of quality (not going to happen).
Michael
04-14-2009, 01:08 PM
Questions:
[1] Does the composite aggregate all the individual tests of scimark2? That would be best.
[2] The regular test for build-apache builds it 3 times, way too long for this. Can we build it just once in this test?
[3] Draw circle may be limited. Can we run all "draw" tests sequentially and aggregate? (maybe not)
[4] What do we do in cases where Unigine fails? (older cards). Maybe we should only show the cpu score in that case.
I think we are getting there. Best!
1. Yes, well, internally it does that I believe. The scimark2 composite option is within the Scimark2 program itself, but I believe that's how it roughly behaves.
2. I could add in a force option quite easily, but I'll need to think whether it's the right thing to do since just one run could be inaccurate in some cases.
4. Fallback to reporting 0 for graphics or something.
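One caveat on point 4, sketched in Python: because the proposed composite is a product of the component scores, a literal 0 for graphics would pull the whole score to 0. One possible alternative (an assumption for illustration, not something settled in this thread) is to leave the failed component out and average only what actually ran, which is close to the "only show the cpu score" idea in question [4]. Names and numbers are hypothetical.

```python
import math

def composite(scores):
    """Geometric mean over the components that produced a result.
    A component that failed to run is passed as None and is skipped."""
    valid = [s for s in scores.values() if s is not None]
    if not valid:
        raise ValueError("no component produced a score")
    return math.prod(valid) ** (1.0 / len(valid))

# Hypothetical normalized scores; the 3D test failed on this machine.
with_zero = {"CPU_S": 2.0, "CPU_M": 3.0, "GUI_2D": 1.5, "GUI_3D": 0.0}
skipped = {"CPU_S": 2.0, "CPU_M": 3.0, "GUI_2D": 1.5, "GUI_3D": None}

print(composite(with_zero))  # 0.0: a single zero factor wipes out the product
print(composite(skipped))    # cube root of 2.0 * 3.0 * 1.5
```

A score computed this way is not directly comparable with a full four-component score, so it would need to be labeled as partial when shown.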
mendieta
04-14-2009, 01:30 PM
The reason I'm thinking of two 3D tests is that Unigine doesn't run on any of the open-source drivers at the moment, but the overall gaming experience isn't too bad. Maybe we should get one 3D app that can fall back to fewer features, but not at the expense of quality (not going to happen).
I agree, somewhere in the thread I proposed something similar. If we go the way you propose, we should have a "normal" score (using Unigine for 3D) and a "legacy" score when using the legacy 3D test. Do you have any suggestions for a legacy test (the lighter and quicker the better)?
mendieta
04-14-2009, 01:36 PM
For normal usage random read/write is more important than sequential. So could you limit IOZone to only test 4K random reads/writes? According to the interesting article written by Anand from Anandtech, random write performance is almost the most noticeable feature of your disk subsystem. So we could give random writes more importance.
Do you have a link to the article? Would be useful! It strikes me that lots of small random r/w will use mostly disk cache for the writes (not for the reads) ... or is it that if you push the disk to the limit it is unable to use the cache?
Also: I am by no means a guru. But in my overclocking experience a few years back with my current system I used a compilation test to measure progress, and it seemed clear to me that the disk was the bottleneck. Of course you are reading/writing files all the time in a build, but it may be that all these read/writes are mostly using the pretty fast cache of the disk. In the end, we care about the disk in terms of how it slows down loading a game, booting up, compiling (if it has an effect), etc ...
Again, we might as well use a synthetic test for disk. The discussion itself is fun anyways :-)
mendieta
04-14-2009, 03:34 PM
A little more info on compilation and disk performance. This guy finds a 20% speedup by compiling in RAM:
http://techblog.tomfanning.eu/2008/06/compiling-c-code-in-ram-disk.html
The disk clearly speeds up compilation but maybe not that much (RAM is like an infinitely fast disk, and it only gives you around 20%)
grigi
04-15-2009, 01:57 AM
But that is the point: disk may be a bottleneck if it is TOO SLOW, but once it is fast enough (or the CPU is slow enough) it doesn't matter anymore.
Hence the compiling test does not scale with disk performance past a certain point.
A "recommended" benchmark should be able to scale indefinitely.
grigi
04-15-2009, 01:59 AM
Anandtech article:
http://www.anandtech.com/storage/showdoc.aspx?i=3531
It is a very long article, but very educational. Read it all.
mendieta
04-15-2009, 07:56 AM
Anandtech article:
http://www.anandtech.com/storage/showdoc.aspx?i=3531
It is a very long article, but very educational. Read it all.
Great Read, I really need an ssd in my new desktop :p ! I'll wait for prices to drop though!
But that is the point: disk may be a bottleneck if it is TOO SLOW, but once it is fast enough (or the CPU is slow enough) it doesn't matter anymore.
Hence the compiling test does not scale with disk performance past a certain point.
Agreed!
So, we need to measure disk better, and an alternative for 3D when Unigine fails. Any thoughts, Michael?
Michael
04-15-2009, 09:46 AM
So, we need to measure disk better, and an alternative for 3D when Unigine fails. Any thoughts, Michael?
Instead of using Unigine, another alternative could be to just use Nexuiz. But even there that usually runs slow (or not at all) with the Mesa stack.
grigi
04-15-2009, 12:45 PM
Nexuiz runs on both the Intel and ATI Mesa stack (unbelievably slow on the Intel one, but very playable on the Mobility X1600), so that is probably a very valid benchmark.
I wonder if we could get a nexuiz download to be smaller than the 600-odd megabyte that the current one is?
Maybe ask the nexuiz guys if they can make a smaller benchmark distribution?
mendieta
04-15-2009, 02:28 PM
Nexuiz runs on both the Intel and ATI Mesa stack (unbelievably slow on the Intel one, but very playable on the Mobility X1600), so that is probably a very valid benchmark.
I wonder if we could get a nexuiz download to be smaller than the 600-odd megabyte that the current one is?
Maybe ask the nexuiz guys if they can make a smaller benchmark distribution?
Yes, that's my biggest concern with nexuiz. All we need is a small demo, really.
mendieta
04-17-2009, 08:07 AM
We have a test candidate !
http://global.phoronix-test-suite.com/?k=profile&u=mendieta-4549-6954-342
If anyone could run their systems against this, it would be great. Just do as below, and please post a link to the results here
phoronix-test-suite benchmark mendieta-4549-6954-342
In particular, I'd like to set a benchmark so we can also compute machine scores according to the algorithm in the first post (we can do it manually for now). Maybe an Atom-based netbook (I think you need at least a 9-inch display for the 3D test to run). Can someone please run it on such a system?
It's not the last word (we can still change/modify/improve the test), but here are my thoughts:
* Added the fio "server load" test, which does lots of random reads/writes all over the place. This should give the disk a run for its money :-) I think it's along the lines of grigi's suggestion.
* Since we have a separate test for the disk, I used openssl for the multicore CPU test. It runs faster, it's a small download, and it scales very well with CPU speed and number of processors.
* For 3D I am using Norsetto-Shadow, because it's a small download. It won't push the newest cards to the limit, but it should run on pretty much anything, and should scale with GPU speed and bandwidth (the size of the GPU memory won't matter). I think it's a lot better than glxgears, but small/fast enough to fit the goal of this test.
* For 2D I think the gtkperf combobox tests are more stable and scale better than the gtkperf "draw" tests, so I used one of these.
Michael
04-17-2009, 09:03 AM
http://global.phoronix-test-suite.com/?k=profile&u=phoronix-11759-5235-18676
Norsetto Shadows didn't work on it. That's really not a good test.
mendieta
04-17-2009, 09:24 AM
Thanks for running this!
http://global.phoronix-test-suite.com/?k=profile&u=phoronix-11759-5235-18676
Norsetto Shadows didn't work on it. That's really not a good test.
Ok, why don't we do it the other way around: find some not-too-large 3D test that runs on the mini in a reasonable amount of time? Also: I had to symlink libGL.so to libGl.so.1.2 in my machine to be able to compile Norsetto.
What happened with fio? Do you have little space left in the disk?
I wonder why gtkperf was so close in both tests. I thought the Radeon HD 3200 would beat the mini's Intel embedded, maybe it's ok.
nickgretsky
05-11-2009, 12:05 PM
There is currently no system-quick suite, but if you propose a suite with what tests to include and such, I would be happy to make it. Or if you run: phoronix-test-suite build-suite you can create the suite and I would be glad to push it upstream.
kdlucas
06-22-2009, 04:15 PM
Has anyone considered using UnixBench 5.1.2? Ian Smith updated the older UnixBench in Dec, 2007, and added in some graphics testing as well.
mo0n_sniper
06-22-2009, 05:18 PM
Michael: let's summarize where we are, and there are a few questions that you haven't seen above:
CPU_S = Single Core CPU plus RAM. Test: Scimark2, Composite test [1]
CPU_M = Multiple Core CPU plus disk. Test: build-apache, one pass [2]
GUI_2D = 2D Performance. Test: gtkperf draw circles. [3]
GUI_3D = 3D Performance. Test: Unigine Sanctuary [4]
The global test scores would be as follows:
system-quick = power(CPU_S*CPU_M*GUI_2D*GUI_3D, 1/4)
system-quick-cli = power(CPU_S*CPU_M, 1/2)
system-quick-gui = power(GUI_2D*GUI_3D, 1/2)
Note that the score of system-quick is also the geometric mean of the other two. When you run a test, besides the global score we can show the individual scores, like in geekbench: a big number, and details beneath.
Questions:
[1] Does the composite aggregate all the individual tests of scimark2? That would be best.
[2] The regular test for build-apache builds it 3 times, way too long for this. Can we build it just once in this test?
[3] Draw circle may be limited. Can we run all "draw" tests sequentially and aggregate? (maybe not)
[4] What do we do in cases where Unigine fails? (older cards). Maybe we should only show the cpu score in that case.
I think we are getting there. Best!
The direction looks good.
keep it up :)
Yomp!!
07-31-2009, 12:15 AM
We have a test candidate !
http://global.phoronix-test-suite.com/?k=profile&u=mendieta-4549-6954-342
If anyone could run their systems against this, it would be great. Just do as below, and please post a link to the results here
phoronix-test-suite benchmark mendieta-4549-6954-342
In particular, I'd like to set a benchmark so we can also compute machine scores according to the algorithm in the first post (we can do it manually for now). Maybe an Atom-based netbook (I think you need at least a 9-inch display for the 3D test to run). Can someone please run it on such a system?
Don't know if you care to see more tests on that, but I did one as well, on specs similar to yours: 2.4 GHz Phenom X3, 9600 GT, 4 GB of RAM and Ubuntu 9.04 x86_64. I did another with the processor overclocked to 3.0 GHz.
http://global.phoronix-test-suite.com/index.php?k=profile&u=kyle-713-17058-31074
fxfuji
08-15-2009, 03:21 PM
I'm probably coming in too late into this discussion, but it seems to me that one test with 'real-world' utility would be to measure the load (CPU and RAM) imposed on a system playing Flash videos on the YouTube, Hulu, etc. websites (using Adobe's Flash plugin inside a browser, that is).
I'm particularly interested in whether Atom-based netbooks are up to the task... even my 2 GHz dual-core has trouble with this task sometimes. :(
Does such a test exist in the PTS (or could one be added from elsewhere)?
prhone
08-20-2009, 05:39 PM
It might be a good idea to have a test of disk performance that is relatively independent of the CPU or other hardware, a sort of cross-device standard. For this to work, it would be necessary to define a ratio such as (disk performance)/(CPU or system benchmark), which has a value of 100 for a specific hardware setup.
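One way to read that ratio, as a rough Python sketch: divide the disk result by the CPU (or system) score, then scale so that a chosen reference setup comes out at exactly 100. The function name, the units and the reference values are assumptions made up for illustration.

```python
def disk_index(disk_perf, cpu_score, ref_disk_perf, ref_cpu_score):
    """Disk-to-CPU ratio, scaled so the reference setup scores exactly 100."""
    ratio = disk_perf / cpu_score
    ref_ratio = ref_disk_perf / ref_cpu_score
    return 100.0 * ratio / ref_ratio

# Hypothetical reference setup: 80 MB/s disk result, CPU score of 400.
print(disk_index(80.0, 400.0, 80.0, 400.0))   # the reference machine itself
print(disk_index(160.0, 400.0, 80.0, 400.0))  # same CPU, disk result doubled
```

Whether dividing by the CPU score really removes the CPU's influence would have to be checked against real results; the sketch only shows the normalization step.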