I expect a small performance improvement in CPU-bound apps, because copying buffers using the CP DMA instead of streamout (=OpenGL transform feedback, we have been using it up to now) takes less space in the command stream, doesn't change any 3D states, and doesn't need any other resources (streamout needs an auxiliary buffer where the FILLED_SIZE register is stored). Also streamout requires 4-byte alignment. The CP DMA doesn't have that limitation.
Well, the text mixes up several quite different things and isn't correct at all.
It might be a bit confusing, but we have two DMA engines on modern Radeon hardware: an ASYNC DMA and a SYNC DMA!
The CP DMA Marek is using is the SYNC DMA engine, which runs in the same ring as the rendering engine (or let's call it "the same thread", because that's a term software devs usually understand better). So when you just want to copy data from A to B in between two rendering operations, you use the CP DMA.
Jerome is working on patches for the ASYNC DMA engine, which should be used for things like uploading texture data from the application to VRAM, because that isn't something we usually do in between rendering operations.
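To make the split a bit more concrete, here is a rough sketch in C of the rule of thumb described above. None of these names come from the actual driver; the enums and `pick_dma_engine()` are made up purely for illustration, and a real driver would have to weigh far more than a single "reason" value.

```c
/* Hypothetical names for illustration only, not taken from the radeon driver. */
enum dma_engine {
    DMA_ENGINE_SYNC_CP, /* CP DMA: runs in the same ring as the rendering engine */
    DMA_ENGINE_ASYNC    /* ASYNC DMA: assumed here to run independently of that ring */
};

enum copy_reason {
    COPY_BETWEEN_DRAWS,  /* copy from A to B in between two rendering operations */
    COPY_TEXTURE_UPLOAD  /* application data being uploaded to VRAM */
};

/* Rule of thumb from the post: copies that sit between rendering operations
 * go through the CP DMA, uploads go through the ASYNC DMA engine. */
static enum dma_engine pick_dma_engine(enum copy_reason reason)
{
    switch (reason) {
    case COPY_BETWEEN_DRAWS:
        return DMA_ENGINE_SYNC_CP;
    case COPY_TEXTURE_UPLOAD:
        return DMA_ENGINE_ASYNC;
    }
    /* Unknown reason: fall back to the in-ring copy (an arbitrary choice for this sketch). */
    return DMA_ENGINE_SYNC_CP;
}
```

So a buffer copy in the middle of a frame would map to `DMA_ENGINE_SYNC_CP`, while a texture upload would map to `DMA_ENGINE_ASYNC`.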
I just had the feeling that I should clarify that.