Quote: Originally Posted by WillyThePimp
Let's go with the antithesis. Let's say we should expect no (significant) gains on either platform from compiling with non-vectorized SIMD instructions. Neither platform is using a 64-bit userspace, nor SIMD, anyway. This is for compatibility's sake, of course. If Canonical wants its OS on ARM devices, they have to support the most basic feature: an FP unit, because as I said, even a SoC as leading as Tegra 2 hasn't got NEON. That's why I'm sure no SIMD instructions were used on the ARM machine.

Also, VFP and NEON are two very elegant SIMD instruction sets. While we cannot claim NEON implementation superiority over SSE(x), claiming it the other way around is equally wrong; it's a lie. Anyway, SSE2, the default in x86_64, is the first SSE that introduced, as far as I know, double precision formats for integer and fp operations, but VFP, on the other hand, is baseline for every modern ARM core and supports it. Shall we compare SSE vs NEON on a 32-bit kernel? I'm pretty sure Atom is going to keep losing. And this is with much more mature support for its architecture at compiler level, overall better system specs and higher power consumption.
While SSE2 is a double precision, IEEE 754 compliant SIMD unit, VFP isn't. Actually, VFP isn't even a SIMD unit. VFP does vector ops by sequencing scalar ones.
On the other hand, the NEON instruction set doesn't have double precision instructions, and its single precision is not fully IEEE 754 compliant. Other disadvantages of NEON that I can think of: it shares its register file with VFP, while SSE has its own registers (XMM), and moving a value from a NEON/VFP register to an ARM register is very slow, causing a 20 cycle pipeline stall (on the Cortex-A8, at least).
So VFP is nowhere near as fast as SSE2 and NEON has much more limited use compared to SSE2.
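
To make the scalar vs. SIMD point concrete, here is a rough sketch in C, not a benchmark. The function names (add_scalar, add_simd, paths_agree) are made up for illustration. The first loop does one double precision add per step, which is roughly what a VFP "vector" op amounts to once the hardware sequences it. The second uses the SSE2 intrinsics from emmintrin.h to add two doubles per instruction, and falls back to the scalar loop when the compiler doesn't define __SSE2__ (an ARM build, for example).

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics (x86 only) */
#endif

/* Scalar path: one double precision add per iteration. */
void add_scalar(const double *a, const double *b, double *out, size_t n)
{
    assert(a && b && out);
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* SIMD path: with SSE2, each _mm_add_pd adds two doubles at once.
   Without SSE2 the vector loop is compiled out and only the scalar
   tail loop runs. */
void add_simd(const double *a, const double *b, double *out, size_t n)
{
    assert(a && b && out);
    size_t i = 0;
#if defined(__SSE2__)
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);   /* unaligned load of 2 doubles */
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(out + i, _mm_add_pd(va, vb));
    }
#endif
    for (; i < n; i++)                      /* leftover elements */
        out[i] = a[i] + b[i];
}

/* Hypothetical self-check: returns 1 if both paths give identical results. */
int paths_agree(void)
{
    const double a[5] = {1.0, 2.0, 3.0, 4.0, 5.0};
    const double b[5] = {0.5, 0.25, 0.125, 1.0, 2.0};
    double s[5], v[5];
    add_scalar(a, b, s, 5);
    add_simd(a, b, v, 5);
    for (size_t i = 0; i < 5; i++)
        if (s[i] != v[i])
            return 0;
    return 1;
}
```

ARMv7 NEON has no double precision counterpart to _mm_add_pd, which is the limitation I mean above: there, double precision work goes back through VFP.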