
Thread: Is Assembly Still Relevant To Most Linux Software?

  1. #51
    Join Date
    Oct 2012
    Location
    Cologne, Germany
    Posts
    308

    Cool BareMetal OS

    Quote Originally Posted by TemplarGR View Post
    This made my day...

    BTW, since when was Roller Coaster Tycoon written in assembly? Any proof?
    Even though I also don't think big projects should be written in ASM, there is a parallel-computing OS written entirely in ASM: BareMetal OS.

    To be honest, it is not much compared to the Linux kernel, but it is a very interesting project!

  2. #52
    Join Date
    Sep 2012
    Posts
    650

    Default

    Quote Originally Posted by Obscene_CNN View Post
    I didn't suggest that Windows 8 be written in assembly, just that they abandon the crappy philosophy that ease of development far outweighs all other considerations.
    In the real world, a program has specifications. You take the fastest and cheapest route to meet them. You don't overachieve the specs, because that's worthless. So yes, ease of development far outweighs all other considerations, at least as long as you can still meet your specs that way.
    And to measure that, you use profiling; you don't just say "wow, I could optimize that" at random.

    Quote Originally Posted by Obscene_CNN View Post
    Hell, if they would just abandon the use of C++ templates, it would cut the size to about a quarter of what it is now.
    OT, but I'm curious: what would you replace templates with, and at what cost?

  3. #53
    Join Date
    Nov 2009
    Location
    Madrid, Spain
    Posts
    398

    Default

    Quote Originally Posted by wargames View Post
    AFAIK compilers don't produce optimized assembly using SIMD instructions, which is the only "corner case" that comes to mind for using assembly, along with kernel code that directly interfaces with the hardware.
    I'm 99% sure that they do: look up auto-vectorization in GCC or LLVM. Even Visual Studio (2012) does it, to say nothing of Intel's compilers.

    In fact, if you write a matrix multiply properly, you will find that with -O3 in release mode GCC generates essentially the same code that you would have written by hand in assembly.

    Auto-vectorization depends mostly on loop unrolling to work smoothly, and lower optimization levels don't unroll extensively because of code bloat, so without some flag mining you sometimes cannot get the best performance even out of the best compilers.
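
    For instance, a loop like the one below (a minimal sketch; the file and function names are mine) is the kind of thing GCC vectorizes at -O3. You can check the result by emitting the assembly with -S.
    Code:
        /* matmul.c -- a loop GCC will typically auto-vectorize at -O3: the inner
           loop is unit-stride and the restrict qualifiers rule out aliasing.
           c[] is assumed to start zeroed. Inspect the generated code with:
               gcc -O3 -march=native -S matmul.c                                */
        #define N 256

        void matmul(const float *restrict a, const float *restrict b,
                    float *restrict c)
        {
            for (int i = 0; i < N; i++)
                for (int k = 0; k < N; k++) {
                    float aik = a[i * N + k];
                    for (int j = 0; j < N; j++)
                        c[i * N + j] += aik * b[k * N + j];
                }
        }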

    Assembly code cannot keep getting more optimized (and stay maintainable), while compiled code can and will. If you know you're writing a ray tracer (or something that really requires a fast executable), you can use PGO to profile your application, and the compiler will inline and optimize your code selectively based on your actual usage. Using LTO makes inlining decisions easier. It is humanly impossible to trace a huge chunk of code by hand, but a compiler can do it. It may require many MB to track all your variables, but I digress: why should you care whether that happens or not?
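
    As a rough illustration of the PGO + LTO workflow (the flags are GCC's real ones, the toy program and file name are just mine):
    Code:
        /* ray.c -- toy program to show the build steps:
               gcc -O3 -flto -fprofile-generate ray.c -o ray
               ./ray                      (run a typical workload, writes *.gcda)
               gcc -O3 -flto -fprofile-use ray.c -o ray
           With the profile, the compiler knows which branches and calls are hot
           and can inline and optimize them selectively.                        */
        #include <stdio.h>

        static float shade(float hit)          /* stand-in for per-ray work */
        {
            return hit > 0.0f ? hit * 0.5f : 0.0f;
        }

        int main(void)
        {
            float sum = 0.0f;
            for (int i = -1000000; i < 1000000; i++)
                sum += shade((float)i);
            printf("%f\n", sum);
            return 0;
        }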

    Lastly, I really like the balanced comment about "managed vs. native". Because most compilers handle high-level code well, managed code has only a minimal performance gap (I would argue it still exists and always will), which makes it suitable for most applications. A lot of tools in Fedora (or Ubuntu) use Python or Perl to update packages, for example, and on a slow machine a full update might take 1 hour and 3 minutes instead of just 1 hour if every script were rewritten in the fastest alternative, and Python and Perl are mostly interpreted! I'm also using IntelliJ IDEA, and I see just a "startup gap" compared with native code; after a minute of usage it just works, like thunder.

  4. #54

    Unhappy Apologies :-(

    Quote Originally Posted by oliver View Post
    While not surprising, this little bit is extremely disappointing:


    The disappointing bit, of course, is that it is in Excel, or at least saved to an .xls file. I would have expected the "better format" to be an .ods (Open Document Spreadsheet). But then again, while Linaro is a Linux group, for Linux on ARM, I'm pretty sure most just use Windows and Office and only play with Linux via a VM.
    Actually, no. Most of us are real Free Software developers, used to working on distros, compilers and other packages. I'm a Debian developer (and Debian Project Leader emeritus), for example.

    The .xls thing was purely a daft mistake on my part. I've been working locally in Gnumeric (my spreadsheet of choice) using its native format, but uploaded in a different format to make it easier for others. I should have picked .ods, yes. Apologies.

  5. #55
    Join Date
    May 2012
    Posts
    425

    Default

    Quote Originally Posted by ciplogic View Post
    Assembly code cannot keep getting more optimized (and stay maintainable), while compiled code can and will. If you know you're writing a ray tracer (or something that really requires a fast executable), you can use PGO to profile your application, and the compiler will inline and optimize your code selectively based on your actual usage. Using LTO makes inlining decisions easier. It is humanly impossible to trace a huge chunk of code by hand, but a compiler can do it. It may require many MB to track all your variables, but I digress: why should you care whether that happens or not?
    Or you can use perf, which will even color-code the disassembled output to show exactly which instruction is the bottleneck,
    and it's way easier to change/rearrange instructions in assembly than to make a compiler understand what it's doing wrong.

    So no, compiled code (in general) will never beat someone who knows what he's doing.

    BTW, I actually plan to make a ray tracer.
    So far, from what I understand of the mathematics, it's almost all matrix operations.
    I learned how to use SSE for matrix operations (matrix multiplication), but this kind of matrix math will require a lot more thinking and creativity,
    all for that couple of percent over what a compiler can do. It probably won't be much, but hey.
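
    For anyone curious what SSE matrix code looks like, here is a minimal sketch (my own, and a far simpler case than what a ray tracer ends up needing): a column-major 4x4 matrix times a vector, with intrinsics.
    Code:
        /* result = M * v for a column-major 4x4 matrix, using SSE intrinsics */
        #include <xmmintrin.h>

        void mat4_mul_vec4(const float m[16], const float v[4], float out[4])
        {
            __m128 r = _mm_mul_ps(_mm_loadu_ps(m + 0), _mm_set1_ps(v[0]));
            r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(m + 4),  _mm_set1_ps(v[1])));
            r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(m + 8),  _mm_set1_ps(v[2])));
            r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(m + 12), _mm_set1_ps(v[3])));
            _mm_storeu_ps(out, r);
        }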

  6. #56
    Join Date
    Nov 2009
    Location
    Madrid, Spain
    Posts
    398

    Default

    Quote Originally Posted by gens View Post
    Or you can use perf, which will even color-code the disassembled output to show exactly which instruction is the bottleneck,
    and it's way easier to change/rearrange instructions in assembly than to make a compiler understand what it's doing wrong.

    So no, compiled code (in general) will never beat someone who knows what he's doing.

    BTW, I actually plan to make a ray tracer.
    So far, from what I understand of the mathematics, it's almost all matrix operations.
    I learned how to use SSE for matrix operations (matrix multiplication), but this kind of matrix math will require a lot more thinking and creativity,
    all for that couple of percent over what a compiler can do. It probably won't be much, but hey.
    Could you use GCC -S (to export the assembly) and share an instance where you get much better assembly code by hand?
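
    Something like this would do (my own toy example, not your code):
    Code:
        /* scale.c -- a candidate function. Emit the compiler's assembly with
               gcc -O3 -S scale.c
           which writes scale.s, then compare it with the hand-written version. */
        void scale(float *a, float s, int n)
        {
            for (int i = 0; i < n; i++)
                a[i] *= s;
        }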

  7. #57
    Join Date
    May 2012
    Posts
    425

    Default

    Quote Originally Posted by ciplogic View Post
    Could you use GCC -S (to export the assembly) and share an instance where you get much better assembly code by hand?
    You can also do it with "objdump -d" and "perf record" (which uses objdump and libelf),
    but since GCC doesn't care whether registers are used in numerical order, and does multiple passes in stages, the output can become hard for a human to read,
    even with the original source code overlaid on the disassembly.

    Also, a whole program can be over a megabyte, and even parts of it (a single .o) can go over 300k.
    That's a lot of code.
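
    For those who haven't tried it, the workflow looks roughly like this (a toy example of mine; the perf/objdump commands are the real ones):
    Code:
        /* hot.c -- a deliberately slow loop to profile:
               gcc -O2 -g hot.c -o hot
               perf record ./hot        (sample where the time goes)
               perf report              (find the hot function)
               perf annotate            (per-instruction, profile-overlaid disassembly)
               objdump -d hot           (plain disassembly, no profile overlay)   */
        #include <stdio.h>

        int main(void)
        {
            double sum = 0.0;
            for (long i = 1; i < 100000000; i++)
                sum += 1.0 / (double)i;      /* the division keeps the loop busy */
            printf("%f\n", sum);
            return 0;
        }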


    Mostly you'd use assembly, be it inline or linked in later, to write short functions,
    so the classical approach is best:
    "use it where it is best to use it".

    I'm only bothered by people saying things like "you can't" and "the compiler is better".
    The only thing I'm missing with FASM is a simple debugger; I just need to know where it crashed and the values of the registers/stack.
    When I get less lazy I'll make one; I thought that was hard too, but it's really simple.

    Edit: about the matrix work; it has to be SIMD because it's a lot of parallel calculations.
    You can compare with and learn from the compiler, but as everybody says, compilers sometimes give... slower code (algorithmically).
    Last edited by gens; 04-04-2013 at 05:46 PM.

  8. #58
    Join Date
    Apr 2013
    Posts
    1

    Default

    Quote Originally Posted by Obscene_CNN View Post
    Your end user may differ in opinion, especially when they are the ones who have to wait for it and pay for the hardware to store and run it. Take a look at Microsoft's Surface, where half of the flash was eaten up by the base software.

    Also, a commonly overlooked thing that is more and more important today is power consumption. Memory reads/writes and instruction cycles take energy, and the more you have to do to accomplish a task, the shorter your battery lasts. Power management is a race to get to sleep.
    I agree with your original post that you can realize a significant performance improvement with assembly, or "fast, tight code" in general. I also think that code should not be pre-optimized. I believe that code should be clean and readable first, and in most cases this code will be fast enough. If it's not, then it should be profiled and the slow parts optimized.

    I'm not sure what you were saying about profiling and small functions. It seems like you were saying that if code is made up of a lot of small functions, profiling wouldn't pinpoint the slowness because the CPU would only spend a small amount of time in each function. Well, there should be a higher-level function that calls these small functions, and profiling would show the CPU spending more time in that higher-level function. Then, if necessary, the clean/readable rules could be broken to optimize that area.

    Getting back to video drivers: yes, they should be optimized. Written to be clean, readable and working first, then, again, the slow parts optimized. This gets back to not pre-optimizing.

    As far as MS products go, I have no idea what they do to generate products with such a large footprint and such performance issues. My guess is that they are more concerned with getting to market than with getting to market with a good product.

  9. #59
    Join Date
    Nov 2009
    Location
    Madrid, Spain
    Posts
    398

    Default

    Quote Originally Posted by gens View Post
    (...)
    so the classical approach is best:
    "use it where it is best to use it".

    I'm only bothered by people saying things like "you can't" and "the compiler is better".
    The only thing I'm missing with FASM is a simple debugger; I just need to know where it crashed and the values of the registers/stack.
    When I get less lazy I'll make one; I thought that was hard too, but it's really simple.

    Edit: about the matrix work; it has to be SIMD because it's a lot of parallel calculations.
    You can compare with and learn from the compiler, but as everybody says, compilers sometimes give... slower code (algorithmically).
    I say: you can't write it better than today's compilers do, and when you do, it's only in very small cases where there is a compiler bug or where you crafted the code specifically against your compiler. In short: "you can't" and "the compiler is better".

    Try writing your SIMD in assembly, take a matrix multiply as the test function, and see which gives better assembly code.

    In the past (around 10 years ago) I did write a memcpy that ran about 2-3x faster than the runtime's (I was using Delphi as a starting point), but later versions of Delphi and Windows' memcpy were faster than my code, and even a loop that copied an integer at a time instead of a byte at a time was fast enough for all purposes (as was about 90% of my assembly-optimized code). Today's compilers will do this memcpy maybe even faster.
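
    The loop I mean was roughly this (a fresh sketch in C, not my original Delphi code):
    Code:
        #include <stddef.h>
        #include <stdint.h>

        /* Copy 32-bit words instead of single bytes; assumes n_bytes is a
           multiple of 4 and both buffers are 4-byte aligned. */
        void copy_words(uint32_t *dst, const uint32_t *src, size_t n_bytes)
        {
            for (size_t i = 0; i < n_bytes / 4; i++)
                dst[i] = src[i];
        }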

    About SIMD: yes, most compilers in release mode generate SIMD code if it is safe to do so. Even Mono does it (using Mono.SIMD)! And Mono is not a high-end compiler. As the code size grows, it is very unlikely that a normal user, even with 4-5 years of experience in C++, can write better code in assembly. They will either break the CPU's out-of-order pipeline, or forget to move a variable out of a loop, or not think to arrange the data to fit in the L1 cache.

    One last item: your assembly will always remain the performance baseline of your code. By that I mean that if you write optimized low-level code instead of using a library, your code will be exactly as fast or as slow as you wrote it, but you will not benefit when a new Java JIT optimization targets your Java code, or a new LLVM register allocator arrives, or a new GCC LTO pass inlines your function aggressively, drops a parameter, clones the function and folds in the constant you called it with. When you drop to assembly, you give up the chance that your code will run better tomorrow (if it doesn't already).

    Jake2 was running better than the original C code (and I think it still does). Java 5 was already faster, but today Java 7 (and soon 8) has better GC throughput and escape analysis, so even the Jake2 benchmarks that showed Java beating the C code could show an even wider difference. And Quake2 was optimized in places with assembly and built with the C compilers of its time.

  10. #60
    Join Date
    May 2012
    Posts
    425

    Default

    Quote Originally Posted by ciplogic View Post
    I say: you can't write it better than today's compilers do, and when you do, it's only in very small cases where there is a compiler bug or where you crafted the code specifically against your compiler. In short: "you can't" and "the compiler is better".

    Try writing your SIMD in assembly, take a matrix multiply as the test function, and see which gives better assembly code.

    In the past (around 10 years ago) I did write a memcpy that ran about 2-3x faster than the runtime's (I was using Delphi as a starting point), but later versions of Delphi and Windows' memcpy were faster than my code, and even a loop that copied an integer at a time instead of a byte at a time was fast enough for all purposes (as was about 90% of my assembly-optimized code). Today's compilers will do this memcpy maybe even faster.

    About SIMD: yes, most compilers in release mode generate SIMD code if it is safe to do so. Even Mono does it (using Mono.SIMD)! And Mono is not a high-end compiler. As the code size grows, it is very unlikely that a normal user, even with 4-5 years of experience in C++, can write better code in assembly. They will either break the CPU's out-of-order pipeline, or forget to move a variable out of a loop, or not think to arrange the data to fit in the L1 cache.

    One last item: your assembly will always remain the performance baseline of your code. By that I mean that if you write optimized low-level code instead of using a library, your code will be exactly as fast or as slow as you wrote it, but you will not benefit when a new Java JIT optimization targets your Java code, or a new LLVM register allocator arrives, or a new GCC LTO pass inlines your function aggressively, drops a parameter, clones the function and folds in the constant you called it with. When you drop to assembly, you give up the chance that your code will run better tomorrow (if it doesn't already).

    Jake2 was running better than the original C code (and I think it still does). Java 5 was already faster, but today Java 7 (and soon 8) has better GC throughput and escape analysis, so even the Jake2 benchmarks that showed Java beating the C code could show an even wider difference. And Quake2 was optimized in places with assembly and built with the C compilers of its time.
    I started this hobby back in the days of... I guess GCC 4.6.something.
    GCC has changed since then, and I haven't looked at disassemblies in a while.

    Well, anyway:
    memcpy in GCC is a builtin, meaning it will just drop in a template function.
    If you don't tell it to use SSE/AVX, that function will be something like
    "rep movsb", or, to copy bigger chunks, "rep movsd" (or "rep movsq" in 64-bit).
    The amd64 C calling convention helps here, as it puts the two pointers in the right registers (the count still has to be moved from rdx into rcx).
    (So a label, "mov rcx, rdx", "rep movsb" and "ret" should be a working memcpy, for those who don't know.)
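
    In GNU C inline assembly (not FASM syntax) the same idea looks roughly like this; a sketch, not production code:
    Code:
        #include <stddef.h>

        /* "rep movsb" wants dst in rdi, src in rsi and the count in rcx; the
           constraints below ask the compiler to place the operands there. */
        static void *repmovsb_memcpy(void *dst, const void *src, size_t n)
        {
            void *ret = dst;
            __asm__ volatile ("rep movsb"
                              : "+D" (dst), "+S" (src), "+c" (n)
                              :
                              : "memory");
            return ret;
        }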

    If you tell it to use SSE, I don't know if it will, and if it does it's probably from a template.
    If you use "-fno-builtin", as documented here, it could do worse,
    meaning that for functions that are not popular the compiler has to think for itself, and it will probably do worse if they are not simple.

    One example is here, where the author asked for help and ended up with code twice as fast as the Fortran version.

    About matrix multiply:
    I did write a 3x3 matrix by 1x3 vector multiply, albeit in intrinsics.
    The problem was that SSE processes 4 floats at a time and there are lots of 3x3 matrices.
    My solution was to load an extra number from the next matrix (with shuffles) and do 4 matrices per loop iteration (4x3 = 12, which is divisible by 4, giving 3 steps).
    I don't know how a compiler could come up with this solution, especially since it doesn't know that the loop will process thousands of matrices.
    The funny thing is I had a lot more problems with the pointers in C++ (I'm bad at C++), and then the hard drive failed.

    Also, about cache:
    it's true that a compiler respects cache lines and the L2 cache size, but it also spits out a long, unrolled stretch of machine code.
    And there is no real specification of cache sizes (-mcpu doesn't help, since it's a flag for an architecture, and one architecture can ship with different cache sizes).

    I think I can beat the compiler, and in at least one case I did.
    It took me longer than it would have in a higher-level language, but that loop was taking 5% of total CPU time and I had nothing better to do.
