
Thread: Is Assembly Still Relevant To Most Linux Software?

  1. #13
    Join Date: Nov 2009
    Location: Madrid, Spain
    Posts: 400

    Quote: Originally posted by gens
    i started this hobby in the time of... i guess gcc 4.6.something
    gcc has changed since then and i didnt look at disassembly's in a while

    well anyway
    memcpy in gcc is builtin, meaning it will just copy a template function
    (...)
    So, to be clear: you accept that writing your own memcpy-like function in the past was a bad idea, because today the builtin functions do better, right? Even if you had written your code as a near-expert back in the glory days of the 32-bit Intel Pentium 1, the version that GCC provides today may still be better.
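
A minimal sketch of the point about builtin copies, in C. The function names `copy_bytes_naive` and `copy_bytes_libc` are made up for illustration; only `memcpy` comes from the standard library. GCC treats `memcpy` as a builtin, so depending on size, target and flags it may expand the call inline or call the library routine, while the hand-written loop stays whatever it was on the day it was written.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hand-written byte copy: the kind of loop one might once have tuned by hand.
 * (Hypothetical helper, not taken from the thread.) */
void copy_bytes_naive(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}

/* Same job through the standard call. The compiler and C library are free
 * to pick whatever copy strategy suits the target CPU. */
void copy_bytes_libc(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, n);
}
```

Both functions produce the same bytes; the difference is only in who maintains the fast path. Comparing the output of `gcc -O2 -S` for the two is an easy way to see what the compiler does on a given machine.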
    Quote: Originally posted by gens
    one example is here, where the author asked for help and ended with code twice as fast as fortran
    Yes, but it also shows how good the compiler was:
    Thank for any help in optimizing. Seems that I am
    bad at assembler Wink

    edit: I have followed fortran program and rearranged
    loops so that magnitude calculation can be vectorized
    and now program executes at 16 seconds which is
    still two seconds behind fortran Wink

    edit2: did some minor arrangement's of code so now it executes
    at Intel Fortran's speed.
    Also, if you look at the dates of the posts, it took 2 months (!) to arrive at a solution for this simple NBody program. Of course, maybe tomorrow someone will port it to OpenCL and it will run 10 times faster than the assembly code while using less CPU in the process: http://developer.apple.com/library/m...ion/Intro.html

    Or this video: https://www.youtube.com/watch?v=r1sN1ELJfNo

    Quote: Originally posted by gens
    about matrix multiply
    i did write a 3x3 matrix with 1x3 matrix multiply, albeit in intrinsics
    problem was sse processes 4 floats at a time and theres lots of 3x3 matrices
    my solution was to load an extra number from the next matrix (with shuffles) and do 4 matrices in a loop (4x3=12, that is dividable by 4 giving 3 steppes)
    idk how a compiler can come up with this solution, especially since it dosent know that that loop will process thousands of matrices
    funny thing is i had a lot more problems with pointers in C++ (im bad at C++) and then the hard drive failed
    So you had a 3x3 matrix, and you understood that an SSE register packs 4 floats (32 bits each).

    What would stop you from adding an extra float as padding, so that the multiplications map easily onto SSE? (At least if it shows up in profiling.)
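
A rough sketch of that padding idea in plain C, with no intrinsics. The types `Mat3Padded` and `Vec3Padded` and the function `mat3_mul_vec3` are hypothetical names for illustration, not code from the thread. Each row carries a fourth float fixed at 0, so every row is exactly one 4-float group, which is the shape an SSE register holds. Whether the compiler actually vectorizes the inner loop depends on the compiler, flags and target.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout: each row of a 3x3 matrix padded to 4 floats. */
typedef struct {
    float row[3][4];   /* row[r][3] is padding and is expected to be 0.0f */
} Mat3Padded;

typedef struct {
    float v[4];        /* v[3] is padding */
} Vec3Padded;

/* out = m * in. The inner loop always runs over 4 floats, so each row is a
 * single 4-wide multiply plus a horizontal sum. */
void mat3_mul_vec3(const Mat3Padded *m, const Vec3Padded *in, Vec3Padded *out)
{
    assert(m != NULL && in != NULL && out != NULL && in != out);
    for (int r = 0; r < 3; ++r) {
        float acc = 0.0f;
        for (int c = 0; c < 4; ++c)
            acc += m->row[r][c] * in->v[c];
        out->v[r] = acc;
    }
    out->v[3] = 0.0f;  /* keep the padding lane clean */
}

/* Process many matrix/vector pairs in one call, so the loop over thousands
 * of matrices is visible to the compiler. */
void mat3_mul_vec3_batch(const Mat3Padded *m, const Vec3Padded *in,
                         Vec3Padded *out, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        mat3_mul_vec3(&m[i], &in[i], &out[i]);
}
```

The cost is 25% more memory per matrix, which is why it only makes sense if profiling shows this loop matters.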

    I mean: why not help the compiler a bit where you know it is weaker? If you know that the C++ compiler cannot inline across object files, you should not define lots of getters anywhere other than in headers, right? But then you may think: why not use assembly to improve things? I think the reason is that tomorrow ARM may be popular, or that even on an Intel-compatible processor the best instruction sequence may have to be different (AMD and Intel often have different caches and latencies, so the compiler can help a lot by tuning for one CPU or another).

    also about cache
    true that a compiler respects cache lines and L2 cache size, but it also gives out a long unwound bunch of machine code
    also there is no specification about cache sizes (-mcpu dosent help since its a flag for an architecture, and one can have different cache sizes for one architecture)

    i think i can and in at least one case i did
    took me longer that it would in higher level languages, but that loop was running 5% of total cpu time and i had nothing better to do
    I want to make it clear: there are cases where, as you said, a loop takes even more than 5% of CPU time and assembly seems justified, but far more often it is an issue of application design. I remember from my own past that when I did OpenGL, I was pushing triangle by triangle (as the mainstream tutorials of the 2000s showed), never using VBOs, glMatrixLoad, etc., and a lot of processing was done on the CPU. Today, if you know you have a lot of processing to do, you may want to write it in Java and run it distributed in the cloud, which will give you the answer correctly and fast, or use OpenCL, or use all cores, and you don't care which core does what.

    I know that NBody (in the case you showed) is not multi-core aware. But just last month at work I had some native code that had to do a massive processing task (basically cracking passwords), and moving to multi-core was much easier with Java code. In the end, just by using "java -server" and all cores, it ran 6-8 times faster. Of course, that compares C++ on 1 core against Java on 4 hyper-threaded cores (4 x 2 threads). I also think the JVM heavily optimized the code paths that happened to run in that specific project, so I would not draw a conclusion like "Java is faster than C++" or anything of the sort. The C++ and Java versions also used different frameworks for the password checking.

    I am sure you could point out many inefficiencies that, rewritten in assembly, would run 3 times faster than Java and 20 times faster than the C++ framework for that specific task. But of course no sane developer will rewrite a big C++ library in assembly, optimize it and make it multi-core just for my sake. Even if there were a multi-core C++ version (there is none for now, for my specific problem), I would still prefer Java, even if C++ were, let's say, 10% faster and even though the task is time consuming. The reason is the one I gave in a previous post: Java 8 will get some performance updates, and if not, Java 9 will with Project Jigsaw. With a compiled C++ version I'm stuck (I would have to recompile every time). Finally, Java would allow me to move my code into the cloud, so I can have "infinite scalability". Do you know any cloud that lets you run assembly code?

    Finally, optimization is often misdirected, as this guy explained fairly nicely three years ago:
    http://pl.atyp.us/wordpress/index.ph...because-its-c/
    Last edited by ciplogic; 04-05-2013 at 06:48 PM.
