The compiler usually seems able to optimize to the point where the algorithm is fetch-limited, i.e. where further ALU optimization would make no difference. Tweaking for a specific architecture (whether ours or someone else's) usually seems to focus on optimizing memory accesses more than ALU operations.
There are probably exceptions where tweaking the code to match the ALU architecture yields a speedup, but in general optimizing I/O seems to be what makes the biggest difference on all architectures these days.
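As a rough illustration of what "fetch-limited" means, here is a minimal sketch using the simple roofline model, which is my own choice of illustration rather than anything measured above. The device figures (1000 GFLOP/s peak ALU rate, 100 GB/s memory bandwidth) and the SAXPY-style kernel are made-up assumptions, and the function names are hypothetical.

```python
def attainable_gflops(flops_per_byte, peak_gflops, peak_gbytes_per_s):
    """Roofline bound: the lower of the ALU ceiling and the memory ceiling."""
    memory_ceiling = flops_per_byte * peak_gbytes_per_s
    return min(peak_gflops, memory_ceiling)


def is_fetch_limited(flops_per_byte, peak_gflops, peak_gbytes_per_s):
    """True when memory bandwidth, not ALU throughput, caps performance."""
    return flops_per_byte * peak_gbytes_per_s < peak_gflops


# Assumed kernel: y[i] = a * x[i] + y[i] on 32-bit floats.
# Per element: 2 flops (multiply, add) and 12 bytes (read x, read y, write y).
SAXPY_INTENSITY = 2 / 12  # flops per byte

# Assumed (hypothetical) device figures, not taken from any real part.
PEAK_GFLOPS = 1000.0
PEAK_GBYTES_PER_S = 100.0

if __name__ == "__main__":
    base = attainable_gflops(SAXPY_INTENSITY, PEAK_GFLOPS, PEAK_GBYTES_PER_S)
    more_alu = attainable_gflops(SAXPY_INTENSITY, 2 * PEAK_GFLOPS, PEAK_GBYTES_PER_S)
    more_bw = attainable_gflops(SAXPY_INTENSITY, PEAK_GFLOPS, 2 * PEAK_GBYTES_PER_S)
    print("fetch-limited:", is_fetch_limited(SAXPY_INTENSITY, PEAK_GFLOPS, PEAK_GBYTES_PER_S))
    print("baseline GFLOP/s:", base)
    print("with 2x ALU peak:", more_alu)
    print("with 2x bandwidth:", more_bw)
```

Under these assumptions, doubling the ALU peak leaves the bound unchanged while doubling the bandwidth doubles it, which is the sense in which further ALU optimization would not help. Real hardware has caches and latency effects that this model ignores, so treat it as a first-order estimate only.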