Holy crap!!
I am beginning to think that Nvidia GPUs are somewhat easier to command (program) than Radeons are. I have no other explanation for this performance boost without any Nvidia support or documentation. Maybe one of the Radeon developers can clear this up.

Actually they are!

AMD's architecture takes packets of 5 (or 4 on Cayman) instructions per streaming processor.
The Nvidia equivalent takes only one instruction at a time.

So the AMD driver needs to find 5 (or 4) independent instructions that can be packaged together.
The Nvidia driver just needs to fire them off one after another.
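
To make the difference concrete, here is a toy sketch of the packing problem. It is only an illustration, not how either driver really works: the function names (pack_bundles, utilization) and the little dependency list are made up for the example, and real hardware has many more constraints. The idea is that an instruction can only go into a packet if everything it depends on has already finished in an earlier packet.

```python
def pack_bundles(ops, width):
    """Greedily pack instructions into packets of at most `width` slots.

    `ops` is a list of (name, dependencies) in program order. An instruction
    may only join a packet if all of its dependencies completed in an
    EARLIER packet, so instructions inside one packet are independent.
    This is a hypothetical toy model, not an actual driver algorithm.
    """
    done = set()            # instructions finished in earlier packets
    remaining = list(ops)
    bundles = []
    while remaining:
        bundle = []
        for name, deps in remaining:
            if len(bundle) == width:
                break
            if all(d in done for d in deps):
                bundle.append(name)
        if not bundle:
            raise ValueError("unsatisfiable dependency in input")
        done.update(bundle)
        remaining = [(n, d) for n, d in remaining if n not in done]
        bundles.append(bundle)
    return bundles


def utilization(bundles, width):
    """Fraction of packet slots that actually hold an instruction."""
    used = sum(len(b) for b in bundles)
    return used / (len(bundles) * width)


# A dependent chain a -> b -> c -> d plus four independent instructions.
ops = [
    ("a", []), ("b", ["a"]), ("c", ["b"]), ("d", ["c"]),
    ("e", []), ("f", []), ("g", []), ("h", []),
]

wide = pack_bundles(ops, 5)    # packets of 5, as described for AMD
scalar = pack_bundles(ops, 1)  # one instruction at a time

print(len(wide), utilization(wide, 5))      # 4 packets, 0.4 of slots filled
print(len(scalar), utilization(scalar, 1))  # 8 packets, every slot filled
```

In this toy case the first packet of 5 fills up nicely, but the dependent chain then leaves the following packets mostly empty. With one instruction at a time there are never empty slots to worry about, which is the point made above: code that does not offer enough independent instructions wastes part of the wide packets.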