The work over Gallium3D was with MPEG-2, using the XvMC API, and was mostly done by Younes Manton on Nouveau:
http://bitblitter.blogspot.com/
Cooper then got a good chunk of that code running on the r300g ATI driver before getting dragged off to other projects.
Somewhere in there a video API was defined and at least partially implemented; I'm not exactly sure who did what there.
I don't think we have any good power efficiency numbers yet re: whether the CPU or GPU shaders do the offloadable work more efficiently. The first priority was offloading enough work to the GPU that the remainder could be handled by a single CPU thread. The multithreaded (MT) version of the CPU codecs wasn't very mature at the time, and without the ability to use multiple CPU cores, anything near 100% of a single core meant frame dropping and other yukkies.
Since then, multithreaded decoders seem to have become more stable (at least more people seem to be using them), so the pull for GPU decoding has dropped somewhat. I don't know the status of the MT codecs right now, i.e. whether they are easily accessible to all users or whether they still need a skilled user to build and tweak 'em.