It's supposed to hurt. That means you're starting to understand. Congratulations.
Ever since the introduction of programmable shaders, GPU drivers have included an on-the-fly compilation step (going from, say, GLSL to GPU shader instructions), and the GPU hardware has run many copies of those compiled shader programs in parallel to get acceptable performance.
GPU vendors did a good job of hiding that complexity from the application -- but with OpenCL you get to see all the scary stuff behind the scenes.
Back in 2002 the R300 (aka 9700) was running 8 copies of the pixel shader program in parallel, each working on a different pixel.
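To make the "same program, different pixel" idea concrete, here is a rough sketch in plain Python. It is not how any real driver or GPU works: `shade`, `render`, and `LANES` are made-up names for illustration, and the 8 "copies" are simulated one after another rather than actually running in parallel.

```python
LANES = 8  # assumption: mirrors the 8 parallel pixel shader copies mentioned above


def shade(x, y):
    """Hypothetical per-pixel shader: returns a simple gradient as (r, g, b)."""
    return (x % 256, y % 256, (x + y) % 256)


def render(width, height):
    """Run the same shader on every pixel, LANES pixels per batch."""
    coords = [(x, y) for y in range(height) for x in range(width)]
    framebuffer = {}
    for start in range(0, len(coords), LANES):
        batch = coords[start:start + LANES]
        # On hardware, the invocations in one batch would execute at the
        # same time; this sketch just loops over them sequentially.
        for (x, y) in batch:
            framebuffer[(x, y)] = shade(x, y)
    return framebuffer
```

Calling `render(16, 4)` shades 64 pixels in 8 batches of 8. Every pixel goes through identical code and only the input coordinates differ, which is what makes this kind of workload easy to spread across parallel hardware.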
Why not have one copy of the pixel shader operating on 8 different pixels, like SIMD? What's the rationale behind using so many processors if they are all running the same code?