Warp/work-group sizes can vary drastically between hardware, and so can the ideal code (vector programming vs other methods). At program startup you can compile several versions of your OpenCL kernels and run quick performance tests to pick the best one, but that assumes you are willing to write the auto-tuning code and also to write and maintain multiple code paths.
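To make the startup auto-tuning idea concrete, here is a minimal sketch in plain Python. It does not call OpenCL at all: the two functions `sum_scalar` and `sum_builtin` are hypothetical stand-ins for two kernel variants, and `pick_fastest` is a made-up helper name. In a real program each variant would enqueue a compiled kernel and wait for it to finish, but the selection logic would look roughly the same.

```python
import time

def pick_fastest(variants, make_input, repeats=3):
    """Time each variant and return (winner_name, timings).

    timings maps each variant name to its best (lowest) wall-clock
    time in seconds over `repeats` runs.
    """
    timings = {}
    for name, fn in variants.items():
        best = float("inf")
        for _ in range(repeats):
            data = make_input()              # fresh input for every run
            start = time.perf_counter()
            fn(data)
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    winner = min(timings, key=timings.get)   # name with the lowest time
    return winner, timings

# Two hypothetical code paths that compute the same result.
def sum_scalar(values):
    total = 0
    for v in values:
        total += v
    return total

def sum_builtin(values):
    return sum(values)

VARIANTS = {"scalar": sum_scalar, "builtin": sum_builtin}

winner, timings = pick_fastest(VARIANTS, lambda: list(range(20000)))
print("using variant:", winner)
```

Taking the best of a few runs rather than the average keeps one-off hiccups (first-run warm-up, a busy system) from picking the wrong variant. The cost is exactly what was mentioned above: every entry in `VARIANTS` is another code path you have to write and keep correct.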
But you are right. If you write code that works on one OpenCL device (e.g. Nvidia), it should work on another device (CPU, DSP, AMD card, etc.). Optional extensions can come into play, but as long as the device you are trying to execute on supports everything your code uses, it should at least execute and produce results.
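For the extension caveat, a rough sketch of the kind of check you could do before picking a device. An OpenCL device reports its extensions as one space-separated string (what you get back from the `CL_DEVICE_EXTENSIONS` query). The helper name `missing_extensions` and the sample string below are made up for illustration; a real program would get the string from the device.

```python
def missing_extensions(device_extensions, required):
    """Return the required extension names the device does not report.

    device_extensions: space-separated string as reported by the device.
    required: iterable of extension names the kernels depend on.
    """
    available = set(device_extensions.split())
    return [ext for ext in required if ext not in available]

# Example string only; not taken from any particular device.
reported = "cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_byte_addressable_store"

print(missing_extensions(reported, ["cl_khr_fp64"]))                 # []
print(missing_extensions(reported, ["cl_khr_fp64", "cl_khr_fp16"]))  # ['cl_khr_fp16']
```

If the list comes back empty, the device supports what the kernels need and the code should run there. Otherwise you either skip that device or fall back to a code path that avoids the missing extension.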
Performance tuning of OpenCL code is affected by the specific hardware you're running on, but the code should at least execute properly on other devices.


