Bridgman, how much has your acceleration block changed over generations? Would it be safe for your current DRM to tell how to use R200/R100 video decoding blocks?
Alex, there are no "secret agreements not to expose certain HW functionality". There are "non-secret" agreements that if we offer API support for certified media players we will ensure a certain level of robustness for the associated protection mechanisms. There is also the "non-secret" reality that if we don't offer API support for certified players then we can't sell our chips to major OEMs, which would be spectacularly bad for business.
If we can find ways to expose HW acceleration for open source driver development without putting the implementations on other OSes at risk then that is fine. Right now I am reasonably sure we will be able to do this for the IDCT/MC hardware but not so sure about UVD yet so am saying "no unless you hear otherwise".
Until we have 6xx/7xx 3d engine support up and running this is all academic since the first requirement is getting the back end (render) acceleration in place and working well.
Bridgman, how much has your acceleration block changed over generations? Would it be safe for your current DRM to tell how to use R200/R100 video decoding blocks?
Not much really; synchronization between the IDCT and MC functions changed in R300, rest of the changes were pretty minor. I expect the info we release will enable right back to RV100 aka Radeon 7000.
The biggest demand for IDCT/MC is still coming from 7000 owners and embedded HW designers who used 7000, which makes sense I guess.
It's really just the IDCT block that still needs docs; MC just uses special modes in the 3d engine and that info is already out for 5xx. AFAIK the XvMC API supports MC-only acceleration so someone could start on that now if they had time. MC is still the most computationally expensive stage in the pipe, or at least it is for MPEG2.
Last edited by bridgman; 07-16-2008 at 11:46 AM.
So if I'm doing some H.264 encoding with mencoder, would this CAL thing help me speed up the process? I'm using a Turion laptop with a Radeon X1200. Is there any way to try it right now, or is it a future thing (and will it work with mencoder)?
Mencoder (and x264) has a threads mechanism ... For example I'm using threads=2 to utilize both cores of my Turion. Would it be possible to speed up the encoding even further by using both the GPU and CPU (with threads=4, and treating the GPU as another core)? I have some doubts that the Radeon X1200 would be a lot faster than the CPU since it's not a high-end GPU.
Last edited by val-gaav; 07-16-2008 at 01:53 PM.
As an end user I'd rather have a generic GPU API than HW-specific implementations.
1. Current PureVideo/AVIVO implementations are quite picky about format, bitrate, resolution, and which codecs are supported.
I did quite a bit of testing in Windows land, and 20% of my encodes did not play well.
AFAIK they are implemented very specifically for HD DVD/Blu-ray playback, and are not the generic video codec API that we need.
2. DRM
We need to focus on using open and generic functionality to ensure we have full control. This enables us to add filters and other APIs to the post-processing (upscale, sharpen, deblock, etc.).
Even if we lose some neat HW-based quality improvements, we should try to reimplement them from a generic GPU angle.
So I'm not so hungry for the core UVD; I'd rather see new ideas and implementations that are GPU and codec independent.
Just my thoughts as an HTPC junkie.
I found this when I researched the subject myself.
The multithreading support for h264 will only work if the h264 stream
was encoded with slices enabled. The multithreading code works by
sending each slice off to a different thread to be decoded rather than
a threaded pipelined approach. Recent builds of the x264 encoder
don't use multiple slices by default any longer so it's quite possible
that your file only has one slice and will only be decoded by one
thread.
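To make the slice point concrete, here is a rough Python sketch of slice-level threading. It is not how any real decoder is written: decode_slice and decode_frame are made-up names, and the "decoding" is a placeholder byte operation. It only shows that with one task per slice, a single-slice frame keeps one thread busy no matter how many threads you ask for.

```python
from concurrent.futures import ThreadPoolExecutor

def decode_slice(slice_bytes):
    # Hypothetical stand-in for real slice decoding: invert each byte so the
    # example stays runnable without a real H.264 bitstream.
    return [b ^ 0xFF for b in slice_bytes]

def decode_frame(slices, threads):
    # One task per slice. A frame with a single slice yields a single task,
    # so any extra threads have nothing to do.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        decoded = list(pool.map(decode_slice, slices))
    busy_threads = min(threads, len(slices))
    return decoded, busy_threads

four_slices = [bytes([i] * 4) for i in range(4)]
one_slice = [bytes(range(16))]

_, busy_many = decode_frame(four_slices, threads=4)
_, busy_one = decode_frame(one_slice, threads=4)
print(busy_many, busy_one)  # prints "4 1"
```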
No, the GPU won't accelerate the way multithreading does. The GPU doesn't split slices with the CPU, but instead accelerates part of the pipeline. It usually works like this:
stream decode -> IDCT -> MC -> post process
With basic hardware acceleration, the CPU does stream decode and IDCT, then sends the resulting stream to the GPU, which does MC (usually the most computationally intensive step) and post-processing (like deinterlacing). That way, not only does the CPU only have to do the lighter work, but the video stream passed to the GPU is still partially compressed, meaning it uses less of the (valuable) bus bandwidth. More advanced hardware acceleration, like what UVD offers, basically accelerates the whole pipeline, meaning the CPU doesn't really have to do anything at all.
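For anyone who wants to see what the IDCT and MC stages actually compute, here is a toy Python sketch for one 8x8 block. Assumptions: it uses the slow textbook IDCT formula, whole-pixel motion vectors only, and a made-up reference frame; the function names are mine and do not come from any driver or XvMC. Real decoders use fast fixed-point IDCTs and sub-pixel interpolation.

```python
import math

N = 8  # block size

def idct_8x8(coeffs):
    # Textbook (slow) 2-D inverse DCT over one 8x8 coefficient block.
    def c(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0
    out = [[0.0] * N for _ in range(N)]
    for y in range(N):
        for x in range(N):
            s = 0.0
            for v in range(N):
                for u in range(N):
                    s += (c(u) * c(v) * coeffs[v][u]
                          * math.cos((2 * x + 1) * u * math.pi / 16)
                          * math.cos((2 * y + 1) * v * math.pi / 16))
            out[y][x] = s / 4
    return out

def motion_compensate(reference, bx, by, mv, residual):
    # Fetch the predicted block from the reference frame at the motion-vector
    # offset, add the residual, and clamp to the 8-bit pixel range.
    dx, dy = mv
    block = []
    for y in range(N):
        row = []
        for x in range(N):
            pred = reference[by + dy + y][bx + dx + x]
            row.append(max(0, min(255, round(pred + residual[y][x]))))
        block.append(row)
    return block

# CPU side in the "basic" split: stream decode (skipped here) and IDCT.
coeffs = [[0] * N for _ in range(N)]
coeffs[0][0] = 80  # DC-only block, so the residual is a flat +10
residual = idct_8x8(coeffs)

# GPU side in the "basic" split: MC against a (made-up) reference frame.
reference = [[(x + y) % 256 for x in range(16)] for y in range(16)]
block = motion_compensate(reference, bx=0, by=0, mv=(2, 1), residual=residual)
print(block[0][:4])  # prints [13, 14, 15, 16]
```

Note that what crosses the bus in this split is the residual plus motion vectors, not finished frames, which is where the bandwidth saving comes from.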
If you want to accelerate encoding with the GPU, that's trickier, but possible, given the right hardware, software, and setup. To do this effectively, you'd probably need hardware (like UVD) that accelerates the whole pipeline, or else the video bus would be clogged by large amounts of data travelling back and forth from CPU to GPU.
Right. The GPU instruction set is completely different from the CPU instruction set, so you can't just gently slide work from one to the other. GPUs are massively parallel (an HD48xx can do 800 multiply-add ALU operations per clock in the main shader core while a quad-core CPU can do maybe 4 ALU ops per clock normally or 8-16 per clock using SSE instructions) but the instructions are different, clocks are lower, and the effective IPC rate is a bit lower.
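To put rough numbers on that, here is a back-of-the-envelope calculation in Python. The per-clock figures are the ones quoted above; both clock speeds are assumptions picked only to make the arithmetic concrete, and the result is peak throughput, so it ignores the lower effective IPC just mentioned.

```python
def ops_per_second(ops_per_clock, clock_hz):
    # Peak ALU throughput: operations issued per clock times clocks per second.
    return ops_per_clock * clock_hz

# Assumed clock speeds (not from the post).
GPU_CLOCK_HZ = 750_000_000     # assumed GPU shader clock
CPU_CLOCK_HZ = 3_000_000_000   # assumed quad-core CPU clock

gpu = ops_per_second(800, GPU_CLOCK_HZ)        # 800 multiply-adds per clock
cpu_scalar = ops_per_second(4, CPU_CLOCK_HZ)   # about 4 ALU ops per clock
cpu_sse = ops_per_second(16, CPU_CLOCK_HZ)     # upper end of the SSE range

print(gpu / cpu_scalar, gpu / cpu_sse)  # prints "50.0 12.5"
```

So even with the lower clock, the peak gap under these assumptions is somewhere between roughly 12x and 50x, before accounting for how well a given workload maps onto the GPU.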
On the other hand, GPUs include hardware to spread work across multiple processors and collect the results, which makes programming easier for a specific class of problems (the "stream programming" paradigm).
The interesting thing about doing the entire encode/decode task on the shader core is that it is relatively portable across most modern GPUs, although there are a number of APIs at the same level to consider -- CAL, CUDA, Gallium, OpenCL and DX11 Compute Shaders come to mind immediately.