
Thread: Is Assembly Still Relevant To Most Linux Software?

  1. #151
    Join Date
    May 2012
    Posts
    342

    Default

    in this case there isn't much difference between SSE and SSE2:
    SSE2 mostly adds integer operations (the MMX-style ops extended to the 128-bit XMM registers) plus a couple of instructions for non-temporal stores and cache hints

    gcc (4.8) gives me scalar code every time
    Code:
     67e:   f3 0f 10 4f 80          movss  xmm1,DWORD PTR [rdi-0x80]
     683:   f3 0f 59 46 d0          mulss  xmm0,DWORD PTR [rsi-0x30]
     688:   f3 0f 59 4e d4          mulss  xmm1,DWORD PTR [rsi-0x2c]
     68d:   f3 0f 58 c1             addss  xmm0,xmm1
    i also tried some more specific compiler options, like
    gcc -o matrixm.o matrixm.c -shared -O3 -ftree-slp-vectorize -ffast-math -msse2

    even with hints in the C code
    Code:
    	vertex = __builtin_assume_aligned (vertex, 32);
    	matrix = __builtin_assume_aligned (matrix, 32);
    	result = __builtin_assume_aligned (result, 32);
    AVX, FMA and XOP, on the other hand, have instructions that are great for this kind of operation, like VFMADDPS and HADDPS
    the compiler does use them when possible, but in everything i tried the output was still scalar

    threading assembly code is as easy as threading C code
    in fact i think my loop would do better threaded than the C one, since a cache line is, from what i can tell, usually 64 bytes and each scalar load touches just the first 4 of them
    i could be wrong if, for example, the cpu notices that and loads the whole cache line into registers anyway

    i tried -flto now and it does give better performance
    maybe because it aligned the loop (my loop isn't aligned either; i tried aligning it and it didn't help much)

    this is an example of SSE's usefulness
    if someone wants to use the loop in a BLAS library i'll finish it to work in all cases
    it can also be used as a software fallback when OpenGL 3.x is not available, as on laptops
    the fun thing is i will be using this to shave a percent or two off a game; it will only need a couple of lines changed
    writing a loop here and there isn't that demanding

    debugging is done by following the flow of the program
    load -> shuffle -> load2 -> shuffle_together
    what is (and should be) in each affected register is written down
    you can rename them as you wish
    i admit it's not as clear as reading the corresponding C code
    then again SIMD and MIMD can require different algorithms, so it's at least good to know about that
    i personally insert stores and calls that print to stdout to see what's going on
    (gdb reports the address of the problem; then you can look at the disassembly to find where the problem is)

    i feel good documentation is the way to better-quality software, not limiting all programmers to one way of doing things
    then again i do this as a hobby, so what do i know

    for software production it probably doesn't matter at all
    most people have many-core cpus, so optimizing away a couple of percent doesn't matter
    then again sometimes it's useful
    like in scientific computing, encryption, databases, physics on the cpu
    also in code that frequently changes flow based on the results of calculations, which is better done on the cpu

    unfortunately i can't test your loop since it returns wrong results, and it does only a third of the necessary calculations
    still, 177472 * 3 is 532416
    hmm, i guess that's because of lower cache usage
    nice try though; a similar loop is in gcc's vectorization examples

    PS: i also tried icc and llvm on this site; the llvm there is only 3.0 though
    Last edited by gens; 05-03-2013 at 11:59 AM.

  2. #152
    Join Date
    Nov 2009
    Location
    Madrid, Spain
    Posts
    394

    Default

    Quote Originally Posted by gens View Post
    unfortunately i can't test your loop since it returns wrong results, and it does only a third of the necessary calculations
    still, 177472 * 3 is 532416
    hmm, i guess that's because of lower cache usage
    nice try though; a similar loop is in gcc's vectorization examples

    PS: i also tried icc and llvm on this site; the llvm there is only 3.0 though
    You're right, I mistyped and you're completely right: it is off by a factor of 3. So with -flto, and multiplying the computation by 3, it gets within a very few percent of the assembly.
    As you pointed out in a previous post of yours, this optimization was significant; let's say 30% of your application's time was spent in this loop. Then your 15% speedup makes the application roughly 4% faster overall. Even if -flto optimizes other functions, not only the tiny loop you've mentioned, you cannot get more than about 5% overall, right?

    And one solution is hard-sweated and hard to debug, while my "solution" was not only found broken, it also did only one third of the necessary calculations.

    That said, I am still curious why you can't use glLoadMatrix (which requires 0% CPU), or snapshotting (if you have an animation or something like that), where you could cut your CPU usage many times over. Or why not use 4x4 matrices, so the compiler can perhaps use SSE(2) to speed up the computation? Yes, I know, more memory, but it would be memory spent on a real speedup, when your application seems so locked into huge speed requirements.

    OpenGL is supported on any OS, and glLoadMatrix is available for any graphics card on the market; where it isn't accelerated, it is well optimized (with SSE and such) by the OS vendors for the target machine, so there is no reason for you to optimize it. It has been supported in hardware since the first S3 Savage 2000, NVIDIA GeForce 256 and ATI Radeon. Really ancient history, some 10 years ago.

    Isn't your design faulty, with the assembly "backend solution" only trying to hide your problem instead of solving it?
    Last edited by ciplogic; 05-05-2013 at 09:55 AM.

  3. #153
    Join Date
    Nov 2009
    Location
    Madrid, Spain
    Posts
    394

    Default

    Quote Originally Posted by gens View Post
    (...)
    i tried -flto now and it does give better performance
    maybe because it aligned the loop (my loop isn't aligned either; i tried aligning it and it didn't help much)

    this is an example of SSE's usefulness
    if someone wants to use the loop in a BLAS library i'll finish it to work in all cases
    it can also be used as a software fallback when OpenGL 3.x is not available, as on laptops
    the fun thing is i will be using this to shave a percent or two off a game; it will only need a couple of lines changed
    writing a loop here and there isn't that demanding


    PS: i also tried icc and llvm on this site; the llvm there is only 3.0 though
    Try this code:

    Code:
    #include <stdio.h>
    #include <sys/time.h>
    
    unsigned long long int rdtsc(void)
    {
       unsigned a, d;
    
       __asm__ volatile("rdtsc" : "=a" (a), "=d" (d));
    
       return ((unsigned long long)a) | (((unsigned long long)d) << 32);
    }
    
    float matrices[10000][9];
    float vertices[10000][3];
    float result[10000][3];
    
    void compute(int count ) {
    	int i,j,k;
    	float partial;
    	for( i=0;i<count;i++) {		
    		for(j = 0; j<3; j++)
    		{
    			partial = 0.0f;			
    			for(k = 0; k<3; k++)
    				partial += vertices[i][k] * matrices[i][j*3+k];
    			result[i][j] = partial;		
    		}
    	}
    }
    
    
    int main() {
    	int i;
    	int count = 10000;
    	
    	float tmp=0;
    	for( i=0; i<count*3; i++) {
    		vertices[i/3][i%3]=tmp;
    		tmp=tmp+1;
    	}
    	tmp = 0.0f;	
    	for( i=0; i<count*9; i++) {
    		matrices[i/9][i%9]=tmp;
    		tmp=tmp+1;
    	}
    	unsigned long long ts = rdtsc();
    	
    	compute( count );
    	
    	printf("elapsed ticks: %llu\n", rdtsc() - ts);
    	
    	for( i=0; i<24; i++) {
    		printf("%f ", result[i/3][i%3]);
    	}
    	printf("\n");
    	return 0;
    }
    I am not fully sure, but it looks like it is auto-vectorized; and even if it isn't, the performance went up (I changed the logic a bit, so adjust accordingly if there are bugs or it doesn't print the same numbers at the end) and I think it would be good enough even non-vectorized:
    Code:
     ./a.out 
    elapsed ticks: 471560
    The original timings were:
    Quote Originally Posted by ciplogic View Post
    So, I reran the tests as you suggested:
    Code:
    $ g++ -O3 matrix_test.c matrixm.c 
    $ ./a.out 
    elapsed ticks: 745912
    And here is the kicker:
    Code:
    $ g++ -O3 -flto matrix_test.c matrixm.c 
    $ ./a.out 
    elapsed ticks: 647984
    (...)
    Can you confirm the numbers on your machine?
    So this rewrite gives a 58% speedup on my machine (745912 / 471560 ticks). I think that if you rewrite the loop similarly, the code will run faster on your machine than your assembly. I know machines differ, and maybe it will run slower; or maybe you will take the assembly of the loop that is 58% faster and find an optimization that happens with the Intel compiler and not with GCC. But the case remains: GCC, once a bug is reported, can fix your loop and all other code written similarly, while your assembly will remain unoptimized, slower than the C version.

  4. #154
    Join Date
    May 2012
    Posts
    342

    Default

    a program that gives wrong results is infinitely slower than any program that gives right ones

    in OpenGL you need at least 3.something for shaders
    this kind of thing in games is for fallback, or for laptops that have weak gpus
    and i chose it as a generic example; there are plenty of other math workloads that can benefit
    encoding/decoding, encryption, anywhere you have a loop that uses lots of cpu or memory bandwidth (compressing textures for the gpu?)
    i don't know; all i know is there are cases where hand-writing a loop can speed it up 28% (or more)

    scalar code cannot easily be faster than vectorized code
    SIMD causes less cache churn, fewer instructions to decode overall, etc.

    unrolled loops also help... to a point

    i have a new AMD now, but i think intel still suffers from misaligned reads/writes (16-byte aligned is usually best)
    that would give a huge advantage to the custom loop
    maybe i'll test on the laptop some day

    PS: i didn't even profile the code i wrote, so it isn't reordered; that could speed it up a shade
    i also did simple packing, meaning a couple of shuffles could be removed too

    PPS: data flow is FAST, until it hits the cache limit
    in this case (i didn't calculate it) i think it hits the limit near the end
    that explains the speedup from processing just one third of the data
    Last edited by gens; 05-05-2013 at 08:53 PM.

  5. #155
    Join Date
    Nov 2009
    Location
    Madrid, Spain
    Posts
    394

    Default

    Quote Originally Posted by gens View Post
    a program that gives wrong results is infinitely slower than any program that gives right ones

    in OpenGL you need at least 3.something for shaders
    this kind of thing in games is for fallback, or for laptops that have weak gpus
    This is where you don't understand how OpenGL works. Let me rephrase: OpenGL is historically used to draw primitives on screen. Any OpenGL 1.1 compliant graphics card will accelerate, in the GPU, up to 8 lights and the geometric transformations of whatever you draw. This is commercially known as Transform & Lighting (introduced with the GeForce 256/Savage 2000) and is also part of DirectX 7 video cards. So if your problem is to apply many transformations to millions of vertices that you then draw, you shouldn't multiply them on the CPU; simply call glLoadMatrix (or glMultMatrix) and the driver/video card combination will do it for you. Millions of points per second, with zero CPU usage.

    The vertex/pixel shaders are small programs that can modify the standard flow of vertices/pixels with your custom processing: for vertices you can compute the particles of your particle generator with little or no CPU usage, or the waves of a sea using a trigonometric function; for pixels, say, a processing that blurs the picture.

    As for your problem: if you can feed the graphics card, and you only need the final points to be displayed, you don't need to multiply them on the CPU. Even better, if your video card is OpenGL 2.0 compliant (a huge number of cards are; roughly any DirectX 8 card such as a GeForce 5200+, a Radeon 9000+, or a newer integrated Intel video card), you can load the vertices into video memory as vertex buffers, and you don't need to copy them from CPU to GPU every time you draw them. This is one reason why games with complex graphics run on phones, which are memory-bandwidth limited. Again, this is for your specific problem, with the twist that your final points are eventually drawn on screen (using OpenGL).


    Quote Originally Posted by gens View Post
    and i chose it as a generic example; there are plenty of other math workloads that can benefit
    encoding/decoding, encryption, anywhere you have a loop that uses lots of cpu or memory bandwidth (compressing textures for the gpu?)
    i don't know; all i know is there are cases where hand-writing a loop can speed it up 28% (or more)

    scalar code cannot easily be faster than vectorized code
    SIMD causes less cache churn, fewer instructions to decode overall, etc.

    unrolled loops also help... to a point

    i have a new AMD now, but i think intel still suffers from misaligned reads/writes (16-byte aligned is usually best)
    that would give a huge advantage to the custom loop
    maybe i'll test on the laptop some day
    If you got a new AMD, most likely you have an X2, X3 or even a 6-core Phenom (not to mention the 8-core AMDs), so isn't threading a better deal than your 30% speedup? Or maybe you have an APU? So again, why not target those? People will buy more CPUs that support GPGPU computation or multi-core code. FWIW, even phones today have 2 cores, and tablets will soon move to 4.

    Quote Originally Posted by gens View Post
    PS: i didn't even profile the code i wrote, so it isn't reordered; that could speed it up a shade
    i also did simple packing, meaning a couple of shuffles could be removed too

    PPS: data flow is FAST, until it hits the cache limit
    in this case (i didn't calculate it) i think it hits the limit near the end
    that explains the speedup from processing just one third of the data
    Did you try the loop where the matrices are defined as two-dimensional arrays? Was it auto-vectorized on your machine? Was it faster than your assembly implementation (it does at least the same amount of computation)? Can you give some numbers?

    As I work in C#, my C/assembly skills are not that great; for what it's worth, I did not even know how to add your assembly file to my Linux C++ IDE, but that is for a reason. I don't write micro-benchmarks on a daily basis, and when I do run them, I run them to target a use case. And as usual, use cases involve many components. Many times I've noticed components running slow, and in many cases caching is much more practical than assembly: an internet connection fetching a list of users is always far slower than any other component of your system, even one written in Python or Ruby. If that is optimized, say by caching the users for an hour, the user simply gets a working UI instantly after a "Loading" screen.

    This, I think, is where we differ: you think performance for performance's sake, while I think about performance only where it hurts the user. Of course, with your approach applied extensively the user might have instantly-running applications; in my case users won't get annoying pauses, and where pauses do appear I try to minimize them with some sane defaults. I think your example shows this: you want your one operation to be faster, while I scan the whole problem space for optimization opportunities. Lastly: you don't seem concerned with assembly as such, but with assembly as "it cannot get faster than this", and you are in denial when people get code fast enough that it no longer matters whether your assembly is faster. The alignment (or mainly non-aliasing) of pointers and the cache behavior can be handled by the compiler with little work on your part, and if users can get within 30% of assembly speed (or even beat your original assembly, as with the latest implementation), I think that proves the point that assembly is irrelevant, at least in your toy example.

    "data flow is FAST, till it hits the cache limit"
    What you're talking about? Data Flow Analysis of the compilers? That enable optimizations? Or processing of the data is faster if just it happen let's say in L1 cache? If is the second, then you are certainly wrong about your program to write in assembly, as 30% speedup can be all lost if you touch L2 or L3, and if you code your C++ code to fit well in L1, you can get bigger performance than your ugly assembly.

  6. #156
    Join Date
    Nov 2009
    Location
    Madrid, Spain
    Posts
    394

    Default

    Quote Originally Posted by gens View Post
    a program that gives wrong results is infinitely slower than any program that gives right ones
    (...)
    unrolled loops also help... to a point

    i have a new AMD now, but i think intel still suffers from misaligned reads/writes (16-byte aligned is usually best)

    So this is my "final" C code that does the same number of compiler improvement with no alignment. Using basically: "-O3 -ffast-math -flto" will run faster than the most aligned code I could write (look on the next block of code)
    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <memory.h>
    #include <sys/time.h>
    
    unsigned long long int rdtsc(void)
    {
       unsigned a, d;
    
       __asm__ volatile("rdtsc" : "=a" (a), "=d" (d));
    
       return ((unsigned long long)a) | (((unsigned long long)d) << 32);
    }
    
    float matrices[10000][9];
    float vertices[10000][3];
    float result[10000][3];
    
    void compute(int count ) {
    	int i,j,k;
    	float partial;
    	float res[3];
    	for( i=0;i<count;i++) {
    			
    		for(j = 0; j<3; j++)
    		{
    			partial = 0.0f;			
    			for(k = 0; k<3; k++)
    				partial += vertices[i][k] * matrices[i][j*3+k];
    			res[j] = partial;
    		}
    
    		memcpy(&result[i], &res, sizeof(float)*3);
    	}
    }
    
    
    int main() {
    	int i;
    	int count = 10000;
    	
    	float tmp=0.0f;
    	for( i=0; i<count*3; i++) {
    		vertices[i/3][i%3]=tmp;
    		tmp=tmp+1;
    	}
    	tmp = 0.0f;	
    	for( i=0; i<count*9; i++) {
    		matrices[i/9][i%9]=tmp;
    		tmp=tmp+1;
    	}
    	unsigned long long ts = rdtsc();
    	
    	compute( count );
    	
    	printf("elapsed ticks: %llu\n", rdtsc() - ts);
    	
    	for( i=0; i<24; i++) {
    		printf("%f ", result[i/3][i%3]);
    	}
    	printf("\n");
    	return 0;
    }
    2nd version:
    Code:
    void compute(
    	float * __restrict matrix, 
    	float * __restrict vertex, 
    	float * __restrict result, 
    	int count ) {
    	int i,j,k;
    
    	float *m = __builtin_assume_aligned(matrix, 16);
    	float *v = __builtin_assume_aligned(vertex, 16);
    	float *r = __builtin_assume_aligned(result, 16);
    
    	for( i=0;i<count;i++) {
    		
    		for(j=0;j<3;j++)
    		{
    			float accumulator = 0.0f;
    			for (k=0;k<3;k++)
    				accumulator += m[j*3+k]*v[k];
    			r[j] = accumulator;
    		}
    		
    		m += 9;
    		r += 3;
    		v += 3;
    	}
    }
    
    #include <stdio.h>
    #include <sys/time.h>
    
    unsigned long long int rdtsc(void)
    {
       unsigned a, d;
    
       __asm__ volatile("rdtsc" : "=a" (a), "=d" (d));
    
       return ((unsigned long long)a) | (((unsigned long long)d) << 32);
    }
    
    int main() {
    	float matrices[100000];
    	float vertices[100000];
    	float result[100000];
    	int i;
    	int count = 10000;
    	float *ptrmat, *ptrvert, *ptrres;
    	
    	float tmp=0;
    	for( i=0; i<count*3; i++) {
    		vertices[i]=tmp;
    		tmp=tmp+1;
    	}
    	tmp = 0.0f;
    	for( i=0; i<count*9; i++) {
    		matrices[i]=tmp;
    		tmp=tmp+1;
    	}
    	
    	ptrmat = &matrices[0];
    	ptrvert = &vertices[0];
    	ptrres = &result[0];
    	
    	unsigned long long ts = rdtsc();
    	
    	compute( ptrmat, ptrvert, ptrres, count );
    	
    	printf("elapsed ticks: %llu\n", rdtsc() - ts);
    	
    	for( i=0; i<24; i++) {
    		printf("%f ", result[i]);
    	}
    	printf("\n");
    	return 0;
    }
    Changed the data structures to two-dimensional arrays:
    Code:
    $ ./a.out 
    elapsed ticks: 263760
    Same data structures (2nd implementation), with alignment hints and the loops merged:
    Code:
    $ ./a.out 
    elapsed ticks: 331592
    I think that neither implementation was in fact vectorized, but they run smoking fast, and faster (I think; I don't know how to build the assembly part, using as as the tool!?) than your +30% speedup. Both run more than 100% faster than the original code you gave me (which was running in the 700000-tick zone). So maybe it is time for you to improve your C skills and get up to speed with fast C code!?

  7. #157
    Join Date
    Nov 2009
    Location
    Madrid, Spain
    Posts
    394

    Default

    Quote Originally Posted by Steph1ani1e View Post
    Virtual machines have come a long way in speed, and with multi-cores now the norm, managed code can be as fast as, or sometimes even faster than, unmanaged code
    Fully agree with you, but you know, you feel somewhat out of control if the VM doesn't optimize the code you run. I think that is where the frustration of the "assembly folks" here lies. Or maybe it is that the code occupies more memory, or starts a bit slower. These issues can be addressed, and have been for some time, but the impression remains that staying far away from the hardware means missing the micro-management of every instruction, and that makes you feel bad.

    Imagine you are the boss of a department and you cannot tell anyone how to use their precious time, only give directives for the whole company. That would be the C/C++ languages. Now imagine you are the boss of a division or a subdivision: that is like writing for a VM like Java or C#. Sure, your VM may be inefficient sometimes, but many times it will do a really good job. The C++ compilers do an excellent job too.

    And some people may ask: why would you want to use C# if it's slower than C++? Or C++, when you can use assembly directly?

    And I think the main reason is that the higher your level in the organization, the bigger the probability that you can change things in the world. If you are just sweeping and cleaning every small table, you cannot change the entire company, maybe only the department you work in; but if you manage a division, your division may give slightly worse service at the micro-management level, yet it can do things that really matter to the customer: notify you by SMS, arrange everything to meet you when you need it, and so on. These changes can be made only if you are in control. If you are merely optimizing so that an email arrives 10 seconds faster, in 30 seconds instead of 40, that means nothing to the user; the power to decide to send the email in the first place is what's crucial.

  8. #158
    Join Date
    May 2012
    Posts
    342

    Default

    i compiled your code; it's still scalar
    it only seems a few segments are interleaved a bit
    i don't want to examine the resulting code further

    Code:
    		result[0] = matrix[0]*vertex[0] + matrix[1]*vertex[1] +matrix[2]*vertex[2];
    		result[1] = matrix[3]*vertex[0] + matrix[4]*vertex[1] +matrix[5]*vertex[2];
    		result[2] = matrix[6]*vertex[0] + matrix[7]*vertex[1] +matrix[8]*vertex[2];
    is the simplest way i know of to multiply a 3x3 matrix by a 3x1 vector
    note the vertex elements used: it can't be made into a shorter loop (at least i can't think of a way right now)
    you can split it into common sub-operations, but i don't think that would help


    glMultMatrix works on 4x4 matrices (transforming 4x1 vectors)
    SSE code for that would be just a few lines long, as there is no need to shuffle that much
    as far as i know, for 3x3 matrices you need shaders
    but i don't know that much about OpenGL programming


    hmmm
    by data flow i meant reading from and writing to memory
    when the cpu reads from memory, the data goes through the L2 cache (and L3 too, if you have one) and then through the L1 cache
    when writing, the same thing happens in reverse
    data goes from the registers to L-whatever, and when the cpu finds time it writes it back to RAM

    the cache logic and hardware prefetcher try to plan ahead, deciding what to keep in the cache and what to evict
    that's no easy task when the cpu just churns through data, but they try their best
    one way we can help is by writing directly to RAM, bypassing the cache (non-temporal stores)
    the problem is that this is a lot slower than writing to cache, but it is still faster than writing to a full cache
    (almost the same goes for reading)

    i'm used to linux having lots of programs to measure this kind of thing
    like perf, which can count cache misses
    or cachebench, to see the limits of... well, the cache

    btw, why didn't you say so?
    here's a windows (64-bit) version of the loop
    i hope it works; i didn't test it at all


    on a more personal note:
    i don't like OO languages, but i can see why they are useful
    i really don't like the "the cpu is fast enough for slow code" mentality, but i like firefox (btw, you need yasm to compile firefox)
    and i won't tell other people "you have to program in C"
    and it bothers me how people talk as if C++/C#/whatever is better
    it's all good, but it's all different in basic mentality
    btw i like C more than other languages because it's a "portable assembler" and thus the closest to writing the machine code yourself
    (most other languages are far more abstracted from machine code)

    to end, a quote:

    Software efficiency halves every 18 months, compensating Moore's Law.
    — May's Law

    taking into account that Moore's law is no longer as valid as it was, the future is slow
    Last edited by gens; 05-07-2013 at 09:51 AM.

  9. #159
    Join Date
    Nov 2009
    Location
    Madrid, Spain
    Posts
    394

    Default

    Quote Originally Posted by gens View Post
    i compiled your code; it's still scalar
    it only seems a few segments are interleaved a bit
    i don't want to examine the resulting code further
    So for you, SSE (or any parallel code) is parallelism for its own sake? The scalar code is faster than the SSE code you wrote, so why use SSE in the first place!? Did I miss something?

    Quote Originally Posted by gens View Post
    Code:
    		result[0] = matrix[0]*vertex[0] + matrix[1]*vertex[1] +matrix[2]*vertex[2];
    		result[1] = matrix[3]*vertex[0] + matrix[4]*vertex[1] +matrix[5]*vertex[2];
    		result[2] = matrix[6]*vertex[0] + matrix[7]*vertex[1] +matrix[8]*vertex[2];
    is the simplest way i know of to multiply a 3x3 matrix by a 3x1 vector
    note the vertex elements used: it can't be made into a shorter loop (at least i can't think of a way right now)
    you can split it into common sub-operations, but i don't think that would help


    glMultMatrix works on 4x4 matrices (transforming 4x1 vectors)
    SSE code for that would be just a few lines long, as there is no need to shuffle that much
    as far as i know, for 3x3 matrices you need shaders
    but i don't know that much about OpenGL programming
    So let me clarify: you keep pointing out that glLoadMatrix/glMultMatrix doesn't match your matrix size... but it seems you have never used them. glLoadMatrix is called BEFORE you display something. So let's say you have 10k points and ONE matrix to multiply them by. To display them you would write something like the following; only if you don't display them do you have to write silly C loops or assembly:
    Code:
    glLoadMatrix(a4x4MatrixOfYour3x3Matrix); 
    drawGl(yourPoints);
    If drawGl happens to keep the vertices in video memory as a VBO (vertex buffer object), these two operations take basically 0% CPU (you have to build a 4x4 version of your 3x3 matrix once).

    You can't use glLoadMatrix to do your computation though, and you can't use shaders for this either; you would need CUDA or OpenCL, which is another talk altogether. Anyway, as far as I understand it, a real program working with this many points will most likely display them eventually, so there is no point in weird computations to show that assembly is faster than C; better to see how you can use what people already have on their video cards, including the glLoadMatrix call.

    Quote Originally Posted by gens View Post
    on a more personal note:
    i don't like OO languages, but i can see why they are useful
    i really don't like the "the cpu is fast enough for slow code" mentality, but i like firefox (btw, you need yasm to compile firefox)
    and i won't tell other people "you have to program in C"
    and it bothers me how people talk as if C++/C#/whatever is better
    it's all good, but it's all different in basic mentality
    btw i like C more than other languages because it's a "portable assembler" and thus the closest to writing the machine code yourself
    (most other languages are far more abstracted from machine code)

    to end, a quote:

    Software efficiency halves every 18 months, compensating Moore's Law.
    — May's Law

    taking into account that Moore's law is no longer as valid as it was, the future is slow
    You seem not to like "C++/C#/whatever" because C is a portable assembler and fast, but what you say makes little sense. Yes, there is a mentality of inefficiency inside VMs and higher-level languages, but that is no excuse to target performance alone. What about buffer overflows? Or a NullPointerException (or an "Invalid Read At Address 0x0000000c" kind of thing)? Don't you want the runtime to be able to recover from these errors?

    Lastly: what stops you from writing fast code in C++? All else being equal, C++ is faster than C (as was discussed earlier in this topic, and by Google engineers), since you have access to assembly and everything C has, plus templates that can precompute many things at compile time and do aggressive inlining.

    What stops you from writing fast code in C#? There are game engines written in C# that run well on my several-years-old phone.

    I know both C++ and C#, and I cannot say language speed has halved every 18 months (or every 2 years for that matter). Half of the reason is that even though software bloat has increased many times over, the main slow items are still the rotating disk drives, the internet, the CD and so on, which have huge latencies (if you have an SSD, performance should be many times better than with a rotating disk). The Pentium 3/XP experience of 2001 was in many ways much worse than what you get today with Windows 7 on an i7-3K CPU. Maybe they boot in the same time, but Visual Studio (or Eclipse, or pick your tool) is much more responsive, and the IDE and tools give you much more relevant information: anti-aliased fonts, searching for symbols across the whole solution, many times more screen information/resolution.

    If you analyze specific language features, say C# LINQ: yes, I can agree it is maybe 15% slower than iterating with the most optimized form of your loop, but the construct makes it hard (or impossible) to make common mistakes. If you point out that the "dynamic" keyword in C# is 10x slower than a virtual call, I also agree; but (there is always a but) the code written before "dynamic" to get the same functionality was much more error-prone (like assembly compared to default C# style), and reflection code was really ugly: to maintain, to understand, and many times slower than dynamic.

    Lastly, why do you care so much about ticks? When you run an ls command, do you care how fast ls parses the directory, or how fast it gives you the answer? What I mean is: ls can be optimized to be light on CPU, or to be light on disk accesses, and the second can be faster than the first even if it uses more memory/CPU, since it can keep a local cache of inodes and their associated data.

    You say you like Firefox, but also look at the following facts about FF:
    - it is written in C++ with COM-like coding (they are trying to remove XPCOM completely, but not C++), so it is very OOP
    - its source control is Hg (Mercurial), which is written in Python
    - it uses PGO (profile-guided optimization) and they tune their C++ code in C++ terms, so there is virtually no assembly in the whole Firefox codebase
    - it pushes JavaScript hard, and asm.js is a well-behaved subset of fast JS; they bet on the power of compilers like LLVM for big speedups, not on hand-tuned assembly
    - it uses SQLite everywhere (history, bookmarks, etc.), and the way they fix slowdowns is to make the code multithreaded
    - most of the speedups in the upcoming Aurora builds are based on GPU profiling
    - IonMonkey compiles bytecode to native code off-thread

    Do you see the common themes between what I said before and real software? The GPU and multi-threading are the main routes Firefox has taken to optimize speed in recent years. Why not post on the FF forums that they should use assembly everywhere? Though I don't think you will have many fans.

  10. #160
    Join Date
    Sep 2012
    Posts
    568

    Default

    Quote Originally Posted by gens View Post
    and it bothers me how people talk as if C++/C#/whatever is better
    it's all good, but it's all different in basic mentality
    btw i like C more than other languages because it's a "portable assembler" and thus the closest to writing the machine code yourself
    (most other languages are far more abstracted from machine code)
    C++ is as fast as C on the C subset, and faster than C on the C++ subset.
    The only things one can dislike about C++ compared to C are:
    - the language can be too complicated for limited hardware (embedded, GPU) and for very low-level work (OS kernels)
    - the language does not enforce any coding style, which can be a pain for a project without strong leadership

    Quote Originally Posted by gens View Post
    to end, a quote:

    Software efficiency halves every 18 months, compensating Moore's Law.
    — May's Law

    taking into account that Moore's law is no longer as valid as it was, the future is slow
    There's a reason software efficiency compensates for computation speed: there is one thing that doesn't change, the users.
    There is no need to double the fps of games every year, or to halve the time a pop-up takes to appear.
    On the other hand, you can have twice as many games, or games twice as good (because you can produce game assets more easily if you don't have to optimize them so hard, and that is what takes the time in a game), or twice as cheap. Your pick! Or applications that are more beautiful, or that can do more, etc. And all that, still at the same acceptable speed: the user's speed.
