
Thread: Is Assembly Still Relevant To Most Linux Software?

  1. #31
    Join Date
    May 2009
    Location
    Richland, WA
    Posts
    134

    Default

    You can do several things in assembler faster than you can in C for one main reason: you are not constrained by C's rules. For example, in x86 assembler you can determine which of 4 values you have with only one compare. On just about all processor architectures you can return more than one value without using a pointer. In assembly you can also return a result without using a single register or memory location, by using the condition flags. Try doing any of these in C.
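    For readers who want to try this in C: the nearest equivalent of the one-compare trick is a three-way compare, which an optimizing compiler will typically turn into a single cmp followed by flag-based branches. This is an illustrative sketch, not code from any driver:

    ```c
    /* Three-way compare: returns -1, 0, or 1.
     * On x86, an optimizing compiler typically emits ONE cmp instruction
     * and then branches (or uses setcc) on the resulting flags, so a single
     * compare distinguishes all outcomes -- much like hand-written asm. */
    int threeway(int a, int b)
    {
        if (a < b)
            return -1;
        if (a == b)
            return 0;
        return 1;
    }
    ```

    Whether the compiler actually reuses the flags across all branches depends on the target and optimization level; inspecting the output with `gcc -O2 -S` is the way to check.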

    You don't see much assembly in programs today. CS courses now place an emphasis on a single operation per function. With everything split up into tiny functions you can't truly leverage assembly's power, and what gains you do get are overshadowed by function-call overhead. One reason people don't bother to optimize their code today is that the performance bottleneck is this very philosophy of one operation per function: if a program spends less than 1% of its time in any one function, the most you can gain by optimizing that function is less than 1%.
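    The 1% argument is just Amdahl's law. A quick sketch of the arithmetic (the numbers are hypothetical):

    ```c
    /* Amdahl's law: if a function accounts for fraction f of total runtime
     * and is sped up by factor s, the overall speedup is
     *     1 / ((1 - f) + f / s).
     * With f = 0.01 (1% of runtime), even a near-infinite speedup of that
     * one function caps the overall gain at about 1.01x. */
    double amdahl_speedup(double f, double s)
    {
        return 1.0 / ((1.0 - f) + f / s);
    }
    ```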

    This problem is so bad that people don't even try to write fast tight code even in video drivers.

    Example of video driver code (changed to protect the identity of the guilty):

    static int somechip_interp_flat(struct somechip_shader_ctx *ctx, int input)
    {
        int i, r;
        struct some_gpu_bytecode_alu alu;

        for (i = 0; i < 4; i++) {
            memset(&alu, 0, sizeof(struct some_gpu_bytecode_alu));

            alu.inst = SOME_ALU_INSTRUCTION_INTERP_LOAD_P0;

            alu.dst.sel = ctx->shader->input[input].gpr;
            alu.dst.write = 1;
            alu.dst.chan = i;

            alu.src[0].sel = SOME_ALU_SRC_PARAM_BASE + ctx->shader->input[input].lds_pos;
            alu.src[0].chan = i;

            if (i == 3)
                alu.last = 1;
            r = some_alu_bytecode_add_alu(ctx->bc, &alu);
            if (r)
                return r;
        }
        return 0;
    }

    How it should be written for performance:

    static int somechip_interp_flat(struct somechip_shader_ctx *ctx, int input)
    {
        int i, r;
        struct some_gpu_bytecode_alu alu;

        memset(&alu, 0, sizeof(struct some_gpu_bytecode_alu));

        alu.inst = SOME_ALU_INSTRUCTION_INTERP_LOAD_P0;

        alu.dst.sel = ctx->shader->input[input].gpr;
        alu.dst.write = 1;

        alu.src[0].sel = SOME_ALU_SRC_PARAM_BASE + ctx->shader->input[input].lds_pos;

        for (i = 0; i < 4; i++) {
            alu.dst.chan = i;
            alu.src[0].chan = i;

            if (i == 3)
                alu.last = 1;
            r = some_alu_bytecode_add_alu(ctx->bc, &alu);
            if (unlikely(r))
                break;
        }
        return r;
    }

    One might argue that it's the generated shader code that is the important part. However, slow CPU code and unneeded memory writes do delay issuing the shader code to the GPU.

  2. #32
    Join Date
    Feb 2012
    Posts
    67

    Default

    Quote Originally Posted by Obscene_CNN View Post
    You can do several things in assembler faster than you can in C for one main reason, you are not constrained by C's rules. For example in x86 assembler you can determine which of 4 values you have with only one compare. On just about all processor architectures you can return more than one value without using a pointer. In assembly you can also return a result without using a single register or memory by using condition flags. Try doing any of these with C.
    You're comparing apples and oranges. Sure, _you_ can't do that in C, because you don't have to. The question is: what can an optimizing compiler do with C? Do you know for a fact that a modern compiler doesn't use any of those tricks when compiling C to ASM/machine code?
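    As one concrete data point: on x86-64 (System V ABI), small struct returns are already done entirely in registers, no pointer involved. A sketch (mine, not from the thread):

    ```c
    /* Returning two values without a pointer. Under the x86-64 System V
     * ABI, a 16-byte struct of two integers like this one is returned in
     * the RAX:RDX register pair, so the compiler already performs the
     * "multiple return values in registers" trick -- within the ABI's
     * rules rather than outside them. */
    struct divmod {
        long quot;
        long rem;
    };

    struct divmod divmod_long(long a, long b)
    {
        struct divmod r = { a / b, a % b };
        return r;
    }
    ```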

  3. #33
    Join Date
    May 2009
    Location
    Richland, WA
    Posts
    134

    Default

    Quote Originally Posted by Wildfire View Post
    You're comparing apples and oranges. Sure _you_ can't do that with C, because you don't have to. The question is, what can an optimizing compiler do with C? Do you know for a fact that a modern compiler doesn't use any of those tricks when compiling C to ASM/machine code?
    I have never encountered one. Can you show me one that can? (GCC can't.)

    I know they can't do the tricks for returning more than one value, because that violates the C calling convention and will confuse debuggers.

  4. #34
    Join Date
    Sep 2008
    Location
    Vilnius, Lithuania
    Posts
    2,636

    Default

    The philosophy of writing small functions is for readability's and reusability's sake. Sure, it's not easy to optimise that, but the vast majority of programs don't need much optimisation anyway, and the gains from writing code faster, and having it easily readable and portable, far outweigh any benefits the additional optimisations would give.

    Of course, in case of performance-critical tasks (kernel, graphics drivers, etc.) it could make sense to sacrifice that for additional speed, yes. But those are not common cases in the least.

  5. #35
    Join Date
    Jan 2012
    Posts
    18

    Default

    Quote Originally Posted by ldesnogu View Post
    Sorry but given that intrinsics are CPU/ISA-specific, you'd still have to port your code, which is the point of the article: what x86 code exists that needs ARM64 porting?
    That was more a reply to the discussion on the general usefulness of asm today, but even so, I sure would prefer porting intrinsics code over pure asm. If I'm the one choosing the approach, I prefer writing an abstraction layer for intrinsics code with force-inlines and macros when I have a lot of SIMD code and a new platform.
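    The shape of such an abstraction layer might look like this: one force-inlined API, a per-platform implementation behind an #ifdef, and a scalar fallback for everything else. Names and structure here are hypothetical, just to show the approach:

    ```c
    #include <stdint.h>

    #if defined(__GNUC__)
    #define FORCE_INLINE static inline __attribute__((always_inline))
    #else
    #define FORCE_INLINE static inline
    #endif

    #if defined(__SSE2__)
    #include <emmintrin.h>
    /* SSE2 path: add four 32-bit ints in one instruction. */
    FORCE_INLINE void add4(const int32_t *a, const int32_t *b, int32_t *out)
    {
        __m128i va = _mm_loadu_si128((const __m128i *)a);
        __m128i vb = _mm_loadu_si128((const __m128i *)b);
        _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
    }
    #else
    /* Portable scalar fallback with identical semantics. */
    FORCE_INLINE void add4(const int32_t *a, const int32_t *b, int32_t *out)
    {
        for (int i = 0; i < 4; i++)
            out[i] = a[i] + b[i];
    }
    #endif

    /* Small self-check: both paths must give the same answer. */
    int add4_check(void)
    {
        int32_t a[4] = { 1, 2, 3, 4 };
        int32_t b[4] = { 10, 20, 30, 40 };
        int32_t out[4];

        add4(a, b, out);
        return out[0] == 11 && out[1] == 22 && out[2] == 33 && out[3] == 44;
    }
    ```

    Porting to a new platform then means adding one more #ifdef branch (e.g. NEON) rather than rewriting every call site.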

  6. #36
    Join Date
    May 2012
    Posts
    659

    Default

    Quote Originally Posted by GreatEmerald View Post
    Of course, in case of performance-critical tasks (kernel, graphics drivers, etc.) it could make sense to sacrifice that for additional speed, yes. But those are not common cases in the least.
    Databases, parts of 3D engines (probably gonna be moved to OpenGL, but still), simulators, and everything needing specialized loops that account for a big percentage of performance.
    Firefox and WebKit/Chrome have plenty of loops in assembly.
    Oh, and HPC benefits from knowing the execution time of a loop, as it can remove the need for semaphores (or whatever they call them).

    http://en.wikipedia.org/wiki/Kazushige_Goto

    BTW, different algorithms are best for different CPUs/coprocessors, so writing generic code can mean some platforms never achieve their best performance at all.
    Last edited by gens; 04-02-2013 at 05:20 PM.

  7. #37
    Join Date
    May 2009
    Location
    Richland, WA
    Posts
    134

    Default

    Quote Originally Posted by GreatEmerald View Post
    The philosophy of writing small functions is for readability and reusability sake. Sure, it's not easy to optimise that, but the vast majority of programs don't need to be optimised much anyway. While the gains from writing code faster, having it easily readable and portable far outweigh any benefits the additional optimisations would give.
    Your end user may differ in opinion, especially when they are the ones who have to wait for it and pay for the hardware to store and run it. Take a look at Microsoft's Surface, where half of the flash was eaten up by the base software.

    Also, a commonly overlooked thing that is more and more important today is power consumption. Memory reads/writes and instruction cycles take energy, and the more you have to do to accomplish a task, the shorter your battery lasts. Power management is a race to get to sleep.

  8. #38
    Join Date
    Apr 2008
    Posts
    34

    Default

    I know that they can't do the tricks on returning more than one variable because it violates the C spec and will clobber debuggers.
    You don't know that for sure -- unless you use a low -O option, and possibly -g, gcc doesn't worry about debuggers. And LTO (link-time optimization) is not commonly used yet, but with it the compiler is not restricted to following the C or C++ calling conventions for passing (or returning) information from one function to the next within your program -- of course, when calling a shared lib you do have to follow them. I've messed with LTO on Gentoo and it builds most software correctly* (with nice speedups at times). Within a year or two I bet most distros will use LTO -- essentially, packages won't follow the C or C++ calling conventions at all internally, just when making syscalls and library calls.

    Quote Originally Posted by Obscene_CNN View Post
    Your end user may differ in opinion. Especially when they are the ones that have to wait for it and pay for the hardware to store and run it. Take a look at Microsoft's Surface where half of the flash was eaten up by the base software.

    Also a common thing over looked which is more and more important today is power consumption. Memory reads/writes and instruction cycles take energy and the more you have to do to accomplish a task the shorter your battery lasts. Power management is a race to get to sleep.
    Hear, hear! The "it's so fast, optimization doesn't matter anyway" attitude is a real shame. I'm glad that for most software on Linux they DO worry about performance and don't actually take this attitude. Of course, the biggest gain comes from avoiding inefficient algorithms, i.e. O(1) is better than O(n), which is better than O(n^2). Where programs really fall apart in terms of bloat is when some piece of code uses asymptotically more time than it should, as opposed to the few percent they haven't shaved off.
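    To put a number on the algorithm point, here is a toy sketch counting the comparisons a linear scan needs versus a binary search over the same sorted array (illustrative code, not from any of the projects mentioned):

    ```c
    /* Linear scan is O(n); binary search is O(log n). Both find `key`
     * in a sorted array and bump *steps once per comparison, so the
     * asymptotic difference shows up as a concrete count. */
    int linear_find(const int *a, int n, int key, int *steps)
    {
        for (int i = 0; i < n; i++) {
            (*steps)++;
            if (a[i] == key)
                return i;
        }
        return -1;
    }

    int binary_find(const int *a, int n, int key, int *steps)
    {
        int lo = 0, hi = n - 1;

        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;

            (*steps)++;
            if (a[mid] == key)
                return mid;
            if (a[mid] < key)
                lo = mid + 1;
            else
                hi = mid - 1;
        }
        return -1;
    }

    /* Comparisons needed to find 999 in the sorted array 0..999. */
    int steps_for(int (*find)(const int *, int, int, int *))
    {
        static int a[1000];
        int steps = 0;

        for (int i = 0; i < 1000; i++)
            a[i] = i;
        find(a, 1000, 999, &steps);
        return steps;
    }
    ```

    For the worst case above, the linear scan does 1000 comparisons while the binary search needs about 10; no amount of instruction-level tuning of the linear scan closes that gap.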

    ----------------------
    Back on topic...
    I'm unsurprised by this result. I've been using Linux since about 1994, and in general there has been a little assembly in jpeg/png/etc. decoding (not that going without it would kill an ARM -- plus I think there *is* ARM assembly for this), video encoding and decoding, and so on -- not in a text editor or whatever. I'm guessing those programs (chess, bzip2, gzip, etc.) that vary a lot between compiler versions could have an ideal assembly-language version that's nice and fast, whereas a lot of programs where the compiler makes little difference are within a percent or two of the ideal assembly version (there's nothing tricky to speed them up). ARM is, I think, an even easier case; the instruction set isn't as complex as x86's -- if DSP or NEON (SSE-like instructions) won't help, then that's that.

    *I don't think LTO builds anything *incorrectly* any more, it just fails to build a few packages.
    Last edited by hwertz; 04-02-2013 at 07:47 PM.

  9. #39
    Join Date
    Feb 2012
    Posts
    260

    Default

    I've got a small project going on (http://realboyemulator.wordpress.com); it is an emulator for the Nintendo Game Boy. I had the opportunity to implement "the core" of the emulator in x86-64. Using assembly was more a matter of challenge; it would've been way easier to do it in any high-level language. Anyway, here are some advantages I found in using assembly:

    - I was able to keep frequently-accessed values in registers. For example, to simulate the instruction pointer, register %r13 was used permanently. This value is accessed ALL the time.
    - I had control over the exact layout of the data structures. For example, there is an array of structures, each describing a machine instruction of the Game Boy's CPU. Each entry is 32 bytes, so indexing can be done with a fast shift instead of a multiply. I think this simple trick can be considerable overall, because it's part of the fetch-and-execute cycle, and this loop is executed for every instruction emulated.
    - Many more little tricks, some of which take advantage of knowing particularities of the program.

    I know that compilers can do all these simple tricks, and quite possibly a compiler would generate better code than what I have done with RealBoy, because there are lots of architectural things I just ignore. However, I get the impression that, in sum, one is capable of doing a better job of optimizing for a particular program: one can take into account various particularities and combine lots of tricks that I think the compiler would have to be human to pull off.
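    The 32-byte-entry idea can be sketched in C as follows. The fields are hypothetical; the real RealBoy tables surely differ:

    ```c
    #include <stdint.h>
    #include <stddef.h>

    /* Opcode table entry padded to exactly 32 bytes (a power of two), so
     * the address of entry i is base + (i << 5): one shift instead of a
     * multiply. Field names here are made up for illustration. */
    struct op_entry {
        void    (*handler)(void);  /* 8 bytes on x86-64 */
        uint8_t  length;           /* instruction length in bytes */
        uint8_t  cycles;           /* base cycle count */
        uint8_t  pad[32 - sizeof(void (*)(void)) - 2];
    };

    _Static_assert(sizeof(struct op_entry) == 32, "entry must stay 32 bytes");

    /* Because sizeof(struct op_entry) is 32, the compiler lowers this
     * scaled index to a shift (or a scaled addressing mode). */
    size_t entry_offset(size_t i)
    {
        return i * sizeof(struct op_entry);  /* i << 5 */
    }
    ```

    The _Static_assert is the important part in practice: it keeps a later field addition from silently turning the shift back into a general multiply.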

  10. #40
    Join Date
    Jul 2010
    Posts
    449

    Default

    Quote Originally Posted by Obscene_CNN View Post
    Your end user may differ in opinion. Especially when they are the ones that have to wait for it and pay for the hardware to store and run it. Take a look at Microsoft's Surface where half of the flash was eaten up by the base software.
    You seem to be making the assumption that optimised code equates to smaller code.

    End users can and will always complain that their use-case wasn't treated preferentially. An end-user will almost always say "this should be faster", but if presented with the choice between "accept it as it is" and "go without new features X, Y and Z for two months whilst we figure out what's going on, find a way of improving it and then subject it to QA", how many of those end-users will take the second option?

    I've taken calls from clients who demanded that I make X faster, yet when told "okay, I'll spend my time making X faster, but I'll have to stop working on Y to do so", decide that it isn't so important.

    Bottom line: end users want the impossible: perfect software. I want to provide that, but I'm only human.
