In fact, the initial trivial example was serving the guy really badly, not only because the compiler will write this code, but also because do_stuff() (method or code) doesn't make any guarantees, for example that RCX is preserved. So if do_stuff() changes RCX and sets it to 10, his trivial optimization would break the code and the loop would never finish.
Originally Posted by frign
Also, as it would be assembly, his optimization would rule out the possibility that the compiler inlines do_stuff() (if it is a method).
If it is inlined, or if the stuff is plain code, the compiler will do other things: it will see the things that don't depend on the variable i and move them outside of the loop (the loop-invariant code motion optimization). After that it may find that the loop reduces to a formula that is a constant expression, compute it at compile time, and the loop will take 0 ms (no CPU time). And sometimes, as the number of iterations is bigger than 66, the compiler can decide to split the loop into sequences of 4, like this:
for(auto i=0; i<63; )
and maybe the 4 stuff calls can be auto-vectorized, and even if they are not, it removes 3 out of every 4 branches (which are costly even on out-of-order CPUs).
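To make the split-by-4 idea concrete, here is a rough sketch of what unrolling does, written by hand in C++. The do_stuff() here is a hypothetical stand-in that only bumps a counter, and the function names run_simple and run_unrolled are mine, not from the thread. A real compiler decides the unroll factor on its own and may pick something different.

```cpp
#include <cstddef>

// Hypothetical stand-in for do_stuff(): it only counts how often it is called.
std::size_t calls = 0;
void do_stuff() { ++calls; }

// Plain loop: one compare-and-branch per call.
void run_simple(std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        do_stuff();
    }
}

// Unrolled by 4: one compare-and-branch per four calls,
// followed by a remainder loop for trip counts that are not a multiple of 4.
void run_unrolled(std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        do_stuff();
        do_stuff();
        do_stuff();
        do_stuff();
    }
    for (; i < n; ++i) {
        do_stuff();
    }
}
```

Both functions call do_stuff() exactly n times; the unrolled one just evaluates the loop condition roughly a quarter as often.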
What I'm talking about is what any compiler does today at the -O3 level (Visual Studio, Clang or GCC), so it is a stupid decision to write assembly (as it is shown).
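The loop-invariant code motion and compile-time evaluation mentioned above can be sketched by hand like this. The functions sum_naive, sum_hoisted and sum_const are made up for illustration (they are not from the thread), and the constexpr version assumes C++14 or later. Whether a given compiler actually folds the runtime versions to a constant depends on the compiler and flags.

```cpp
#include <cstdint>

// As written by the programmer: a * b does not depend on i,
// yet it sits inside the loop body.
std::uint64_t sum_naive(std::uint64_t a, std::uint64_t b, std::uint64_t n) {
    std::uint64_t total = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        total += a * b + i;
    }
    return total;
}

// What loop-invariant code motion amounts to: the product is computed once,
// outside the loop.
std::uint64_t sum_hoisted(std::uint64_t a, std::uint64_t b, std::uint64_t n) {
    const std::uint64_t k = a * b;  // hoisted invariant
    std::uint64_t total = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        total += k + i;
    }
    return total;
}

// With constant inputs the whole loop is a constant expression,
// so it can be evaluated at compile time and costs nothing at run time.
constexpr std::uint64_t sum_const(std::uint64_t a, std::uint64_t b, std::uint64_t n) {
    std::uint64_t total = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        total += a * b + i;
    }
    return total;
}

// 64 * 6 + (0 + 1 + ... + 63) = 384 + 2016 = 2400, checked by the compiler.
static_assert(sum_const(2, 3, 64) == 2400, "loop folded at compile time");
```

None of this is possible if the loop body is an opaque call from hand-written assembly, which is the point being made above.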