It's supposed to hurt. That means you're starting to understand. Congratulations.
Ever since the introduction of programmable shaders, GPU drivers have included an on-the-fly compilation step (going from, say, GLSL to GPU shader instructions), and the GPU hardware has run many copies of those compiled shader programs in parallel to get acceptable performance.
GPU vendors did a good job of hiding that complexity from the application -- but with OpenCL you get to see all the scary stuff behind the scenes.
Back in 2002, the R300 (aka 9700) was running 8 copies of the pixel shader program in parallel, each working on a different pixel. The RV730 is comparable in terms of pixel throughput but can run 64 copies of a shader program in parallel, i.e. the ratio of shader power to pixel-pushing power is 4-8 times higher on the RV730. This is why modern chips can run so much *faster* on complex 3D applications even if they run *slower* on glxgears.
Unified shader GPUs use multiple shader blocks in order to handle the mix of vertex, geometry and pixel shader work that comes with a single drawing task. In principle, the blocks could be designed to work on totally different tasks, but that would require a lot more silicon (more $$) and the added complexity would probably *reduce* overall performance.
The most important concept to grasp is that with conventional programming you have a single task, executing a program which steps through an array and calculates the results for each element. With data-parallel programming you write a program that calculates the value of ONE element, then the OpenCL / Stream / CUDA runtime executes a copy of the program for each element in the array, using parallel hardware as much as possible.
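To make that contrast concrete, here is a rough sketch in plain Python rather than real OpenCL / Stream / CUDA code. The names scale_serial, scale_kernel and run_kernel are made up for illustration, and run_kernel is only a toy stand-in for what a real runtime does when it launches one copy of the program per element.

```python
from concurrent.futures import ThreadPoolExecutor


def scale_serial(src, factor):
    """Conventional style: one task steps through the whole array."""
    out = [0.0] * len(src)
    for i in range(len(src)):
        out[i] = src[i] * factor
    return out


def scale_kernel(i, src, out, factor):
    """Data-parallel style: compute ONE element, identified by index i."""
    out[i] = src[i] * factor


def run_kernel(kernel, n, *args, workers=1):
    """Hypothetical stand-in for the runtime: run one copy of kernel per index.

    The caller never writes the loop; this function decides how the n
    copies get scheduled (here: serially, or on a small thread pool).
    """
    if workers <= 1:
        for i in range(n):
            kernel(i, *args)
    else:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # list() drains the iterator so any kernel exception surfaces here
            list(pool.map(lambda i: kernel(i, *args), range(n)))


src = [1.0, 2.0, 3.0, 4.0]
out = [0.0] * len(src)
run_kernel(scale_kernel, len(src), src, out, 2.0, workers=4)
```

Note that scale_kernel has no loop and no idea how many workers exist; changing workers changes how the copies are scheduled but not the result. Python threads won't actually make this faster, so treat it purely as an illustration of who owns the loop.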
Having the runtime take care of parallelism (rather than the application) makes it possible for an application to run on anything from a single-core CPU to a stack of GPUs without recompilation.