You already found the key point in that post:
"It is possible to get good performance, just not with a direct port from Cuda.(openCL)"
Every app is different, of course, but it's possible to write apps which are heavily optimized for a specific vendor's hardware, in which case a "direct port" will still have those optimizations and may not perform well on different hardware.
Nothing wrong with that if you only plan to run on one hardware type, of course, but developers are learning to follow "generic" best practices (rather than vendor-specific ones) which allow good performance on multiple vendors' hardware, including CPUs. What seems to make the most difference is memory access patterns - tweaking the ALU code can give maybe a 100% speedup, but you can easily get a 10:1 or better improvement (or worsening) by changing the way memory is used.
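To give a rough feel for why the access pattern matters so much, here is a toy model (not a benchmark, and not any vendor's actual rules). It counts how many memory segments a group of adjacent work-items touches in one load. The 128-byte segment size, the group of 32 work-items, and the 1024-element row length are all assumptions picked for illustration; real hardware differs by vendor and generation.

```python
# Toy model only: counts distinct memory segments touched by one group of
# adjacent work-items. All constants below are illustrative assumptions.

SEGMENT_BYTES = 128  # assumed size of one memory transaction / cache line
ELEM_BYTES = 4       # one 32-bit float per element
GROUP_SIZE = 32      # assumed number of work-items issuing loads together
ROW_LENGTH = 1024    # hypothetical row length of a row-major 2D array


def segments_touched(index_of, group_size=GROUP_SIZE):
    """Number of distinct segments hit when work-item i loads
    the element at index_of(i)."""
    return len({(index_of(i) * ELEM_BYTES) // SEGMENT_BYTES
                for i in range(group_size)})


# Adjacent work-items read adjacent elements (walking along a row).
contiguous = segments_touched(lambda i: i)

# Adjacent work-items read elements one full row apart (walking down a column).
strided = segments_touched(lambda i: i * ROW_LENGTH)

print(contiguous, strided)  # prints "1 32"
```

In this model the same 32 loads need 1 segment in the first layout and 32 in the second, with the arithmetic per element unchanged. That is the kind of gap that ALU tweaking cannot make up, and it is also why a layout tuned for one device may need rethinking on another.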
Here's an example that goes the other way - IIRC this app was written in OpenCL from the beginning.
http://forum.beyond3d.com/showpost.p...6&postcount=77
In general I think the performance results will lie somewhere in between. If you follow some of the GPGPU threads you can see the cross-platform issues gradually being knocked off, so that the final code runs fast on at least three different platforms (CPU as well as NVidia/ATI GPUs).


