Also, if you look at the timestamps of the posts, getting a fix for this simple NBody program took two months (!). And of course, maybe tomorrow someone will port it to OpenCL and it will run 10 times faster than the assembly code while using less CPU in the process: http://developer.apple.com/library/m...ion/Intro.html
Quote:
Thanks for any help in optimizing. Seems that I am
bad at assembler ;)
edit: I have followed the Fortran program and rearranged
the loops so that the magnitude calculation can be vectorized,
and now the program executes in 16 seconds, which is
still two seconds behind Fortran ;)
edit2: did some minor rearrangements of the code, so now it executes
at Intel Fortran's speed.
Or this video: https://www.youtube.com/watch?v=r1sN1ELJfNo
What would stop you from adding that extra float to make the multiplications easy on SSE (at least where it shows up in profiling)?
I mean: why not help the compiler a bit where you know it is weaker? If you know that C++ cannot inline across object files, you shouldn't define your getters anywhere other than in headers, right? But if you follow that logic, why not use assembly to improve things further? I think the reason is that tomorrow ARM may be popular, or even an Intel-compatible processor may need a different instruction sequence (AMD and Intel often have different caches and latencies, which means the compiler can help a lot when tuning for one CPU or another).
I want to make it clear: there are cases where, as you said, a loop takes even more than 5% of CPU time and assembly seems to be the answer, but far more often it is an issue of application design. I remember again from my past that when I did OpenGL, I was pushing triangle by triangle (as the mainstream tutorials of the 2000s showed), never using VBOs, glMatrixLoad, etc., and a lot of the processing was done on the CPU. Today, if you know you have a lot of processing, you may want to write it in Java and execute it distributed in the cloud, which will give you the answer correctly and fast, or use OpenCL, or use all cores, and you don't care which core does what.
Quote:
also about cache
true that a compiler respects cache lines and L2 cache size, but it also gives out a long unwound bunch of machine code
also there is no specification of cache sizes (-mcpu doesn't help, since it's a flag for an architecture, and one architecture can come with different cache sizes)
i think i can, and in at least one case i did
took me longer than it would in a higher-level language, but that loop was running 5% of total cpu time and i had nothing better to do
I know that NBody (in the case you showed) is not multi-core aware, but just last month at work I had some native code that had to do a massive processing task (basically cracking passwords), and moving to multi-core was much easier with Java; in the end, basically just by using "java -server" and all cores, it ran about 6-8 times faster. Of course, that was 1 C++ core vs. 4 HT (4 x 2) Java cores. I also think Java optimized the hot code path of that specific project very well, so I wouldn't draw a conclusion like "Java is faster than C++" or anything of the sort; C++ and Java were also using different frameworks for the password checking. I am sure you could point out that many inefficiencies could be rewritten in assembly and would run 3x faster than Java, and 20x faster than the C++ framework, for that specific task. But of course no sane developer will rewrite the big C++ library in assembly, optimize it, and make it multi-core just for my sake. And even if there were a C++ multi-core version (there is none for now, for my specific problem), I would still prefer Java, even if C++ were, say, 10% faster and the task is time-consuming, for the reason I gave in a previous post: Java 8 will get some performance updates, and if not, Java 9 with Project Jigsaw will. With a compiled C++ version, I'm stuck (I would have to recompile every time). Lastly, Java would let me move my code into the cloud, so I can have "infinite scalability". Do you know any cloud that lets you run assembly code?
Lastly, optimization is often misdirected, as this guy put it fairly nicely three years ago: