Thread: Intel Is Still Working On G45 VA-API Video Acceleration

  1. #21
    Join Date
    Feb 2008
    Location
    Linuxland
    Posts
    4,987

    Quote Originally Posted by bridgman View Post
    You normally want a lot more threads than shader cores to cover latency (memory accesses for texture fetches etc..).
    Oh, I thought the number of shader cores was the max amount of gpu threads one could run.

  2. #22
    Join Date
    Oct 2007
    Location
    Toronto-ish
    Posts
    7,385

    If it wasn't for latency-hiding you would be absolutely correct, and the number of cores does determine the number of threads that can execute in any single clock cycle (2 SIMDs, each executing 8 threads / clock, with up to 5 simultaneous instructions per thread per clock for a total of 80).

    Running more threads than the minimum allows you to make much better use of available memory bandwidth, however, since any thread blocked on a memory access can simply idle while another unblocked thread runs. IIRC this was introduced in the r5xx series, described by marketing as the "Ultra-Threaded Dispatch Processor".
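The arithmetic in this post, and a rough feel for why extra threads help, can be sketched in Python. The 2 x 8 x 5 figures come from the post itself; the 400-cycle memory latency and the 20 cycles of ALU work between fetches are made-up illustrative values, and the model is a crude rule of thumb, not a description of how the dispatch hardware actually schedules work.

```python
import math

def peak_ops_per_clock(simds: int, threads_per_simd: int, instrs_per_thread: int) -> int:
    """Upper bound on instructions issued in one clock cycle."""
    return simds * threads_per_simd * instrs_per_thread

def thread_groups_to_hide_latency(mem_latency_cycles: int, alu_cycles_between_fetches: int) -> int:
    """Toy estimate: while one group waits on memory, the other groups
    must supply enough ALU work to fill the wait, plus the waiting group itself."""
    return 1 + math.ceil(mem_latency_cycles / alu_cycles_between_fetches)

# Figures from the post: 2 SIMDs, 8 threads per clock each, up to 5 instructions per thread.
print(peak_ops_per_clock(2, 8, 5))             # 80

# Hypothetical latency numbers, chosen only for illustration.
print(thread_groups_to_hide_latency(400, 20))  # 21
```

Under these assumed numbers you would want roughly 21 times as many threads resident as can execute in a single clock, which is the point of running far more threads than cores.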

  3. #23
    Join Date
    Nov 2008
    Location
    Madison, WI, USA
    Posts
    862

    Quote Originally Posted by bridgman View Post
    If it wasn't for latency-hiding you would be absolutely correct, and the number of cores does determine the number of threads that can execute in any single clock cycle (2 SIMDs, each executing 8 threads / clock, with up to 5 simultaneous instructions per thread per clock for a total of 80).

    Running more threads than the minimum allows you to make much better use of available memory bandwidth, however, since any thread blocked on a memory access can simply idle while another unblocked thread runs. IIRC this was introduced in the r5xx series, described by marketing as the "Ultra-Threaded Dispatch Processor".
    What bridgman said.

    The decoder I've got will launch anywhere from 16 to 336 threads at a given time, which would be fine, except for two things:
    1. Memory Access Stalls
    2. Kernel Launch Latency


    When you perform a memory access, you've got several hundred GPU clock cycles you might be waiting around (when reading from the graphics card memory). During this time, the GPU usually tries to swap to another set of threads, similar to hyper-threading on an Intel CPU. The difference being that instead of one thread stalling, an entire group of threads will stall. On Nvidia, threads move together in groups of 32 threads, so a single read will stall 32 threads.

    I'm not sure about AMD as none of my ATI graphics cards (3200/4200/4770) support the byte_addressable_store extension the decoder needs, but given bridgman's number of 8 threads per SIMD, I'm guessing that 8 threads get stalled by a single read, which means up to 40 vector calculations would get stalled.

    The other factor is how long it takes to start running an OpenCL kernel. Normally on a CPU, function calls take a few cycles to fire up. You set the arguments, jump to a new IP (instruction pointer) value, and then back up any registers you're about to trash.

    In OpenCL you get the additional fun of having to send the function call request into the CL library, which sends it to the graphics driver, which queues it for execution, then sends it over the PCI-express bus to the GPU, and then the GPU finally starts executing it. In most cases, this latency can be fairly low (several tens of clock cycles), but in some cases, such as my laptop (GF9400, Ubuntu 10.10, Nv blob) I've seen latencies which will occasionally spike up to several thousand clock cycles.

    In order to minimize this start-up cost, it's usually good to do either a lot of work in a given thread, or launch a TON of threads which would take the CPU a long time to loop over. The danger of the first option is that it's hard to write long kernels that don't branch a lot (another performance killer). The second is generally preferred, but as I said, the most I've managed is 336 threads in a single launch, and that's not exactly the common case.

    Usually with Nvidia hardware, you want thousands of threads in flight. Most Nvidia GPUs can handle something like 16k-32k simultaneous threads.
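A minimal Python sketch of the two effects described in this post: threads being grouped in 32s, and launch latency dominating small launches. The warp size of 32 and the thread counts (16, 336, 16k) come from the post; the 2000-cycle launch cost, the 100 cycles per group, and the assumption that groups run one after another are hypothetical simplifications chosen only to show the trend.

```python
import math

WARP_SIZE = 32  # Nvidia group size quoted in the post

def warps_needed(threads: int) -> int:
    """Number of 32-thread groups a launch occupies (the last may be partly empty)."""
    return math.ceil(threads / WARP_SIZE)

def launch_overhead_fraction(threads: int, launch_cycles: int = 2000,
                             cycles_per_warp: int = 100) -> float:
    """Share of total time spent on launch latency in a toy model.

    Assumes (hypothetically) that groups execute back to back and that each
    one takes cycles_per_warp cycles; real hardware overlaps them.
    """
    work_cycles = warps_needed(threads) * cycles_per_warp
    return launch_cycles / (launch_cycles + work_cycles)

for n in (16, 336, 16384):
    print(n, warps_needed(n), round(launch_overhead_fraction(n), 3))
```

Even with these invented costs, the shape matches the post: at 336 threads most of the time goes to getting the kernel started, while at thousands of threads the launch cost is amortized away.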

  4. #24
    Join Date
    Feb 2008
    Location
    Linuxland
    Posts
    4,987

    Out of curiosity, what kind of speeds are you getting now?

  5. #25
    Join Date
    Dec 2007
    Posts
    2,322

    Certain parts do parallelize well (MC and to a certain extent iDCT). Prior to UVD, we used the 3D engine on those asics for these sorts of tasks.

  6. #26
    Join Date
    Nov 2008
    Location
    Madison, WI, USA
    Posts
    862

    Quote Originally Posted by curaga View Post
    Out of curiosity, what kind of speeds are you getting now?
    Very low. I haven't had time to parallelize most of the algorithms, just get the CL framework in place to do the work and a direct port of the C code into CL kernels to prove correctness of the output of the ported code. For sub-pixel prediction of inter-coded Macroblocks, it's something like 5-10% of the C speed, 2-5% of the assembly optimized paths. This is doing 16x16, 8x8, 8x4, and 4x4 inter-prediction on the GPU, but one block at a time (far from optimal).

    If I only did 16x16 and 8x8 prediction in CL, the numbers would probably be closer (I'm not currently sure how much closer, as CPU/GPU transfers might then be needed and could become a bottleneck), as the 16x16 and 8x8 kernels do a lot more work and launch a lot more threads than the 8x4 and 4x4 kernels. This is also only predicting one block within a Macroblock at a time, not batching all of the inter-coded Macroblocks together to launch a multi-dimensional kernel which predicts all blocks of the same size/type simultaneously. That would probably provide an actual improvement over the C code.
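The batching idea can be sketched on the host side in Python. This is not the decoder's code: the `(macroblock_index, (width, height))` tuples and the `batch_by_size` helper are hypothetical, and the sketch only counts how many kernel launches each approach would need, without touching OpenCL.

```python
from collections import defaultdict

def batch_by_size(blocks):
    """Group (macroblock_index, (width, height)) entries by block size,
    so each size could be handed to a single kernel launch."""
    batches = defaultdict(list)
    for mb_index, size in blocks:
        batches[size].append(mb_index)
    return dict(batches)

# Hypothetical frame: a handful of inter-coded blocks of mixed sizes.
blocks = [(0, (16, 16)), (1, (8, 8)), (1, (8, 8)),
          (2, (4, 4)), (3, (16, 16)), (4, (8, 4))]

batches = batch_by_size(blocks)
print(len(blocks), "launches when predicting one block at a time")
print(len(batches), "launches when batching by block size")
```

The number of batched launches is bounded by the number of block sizes (four here) no matter how many blocks the frame contains, so each launch carries far more threads.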

    There's definitely still a lot of work to be done before this project gets close to speed parity with the C code.

    As for what I've implemented and am getting correct output from:
    Six-Tap and Bilinear Subpixel Prediction
    IDCT/De-quantization
    Loop Filtering (normal and simple filters)

    The IDCT/Dequant has not gotten much attention, so it's practically single-threaded and therefore much slower than the subpixel prediction. The loop filter also uses very few threads, 8-16 at most currently, although that could be upped by a factor of at least 3 without too much work.

  7. #27
    Join Date
    Nov 2008
    Location
    Madison, WI, USA
    Posts
    862

    Not Edit:
    Whether the loop filter can be upped by more than a factor of 3-5x is tough to say, as I haven't really looked at the loop filter algorithm in enough detail to know if it can be truly threaded the way it needs to be to get good performance and still get correct output.

  8. #28
    Join Date
    Feb 2011
    Posts
    14

    Quote Originally Posted by gbeauche View Post
    MPEG-2 VLD is already implemented on GMA 4500MHD. H.264 support is being worked on. I think it was also mentioned that VC-1 won't be supported on those older chips.
    ok, but here: http://intellinuxgraphics.org/user.html they state:

    Quote Originally Posted by intellinuxgraphics.org
    Laptop Lenovo T410 Intel HD Graphics Debian Squeeze works out of the box, with Intel 2010Q2 graphics package, Mpeg4 offloading to GPU works fine
    and Lenovo T410 has two versions: one with "Intel GMA X4500 HD" and one with "Nvidia NVS 3100M". so, i don't understand why the information is so contradictory, or maybe by "Mpeg4" they don't mean H.264. i have access to X4500 hardware, but i don't have time to test it. in any case it's really confusing that no one seems to know the real state.

  9. #29
    Join Date
    Sep 2010
    Posts
    76

    If you look at the Chipset item for the T410 on their page, it says "Intel HD Graphics", which is not "GMA 4500"... As for "support is being worked on", that probably means you will never see it in your lifetime, but who knows (I think you can ask on their IRC; I did some months ago and it was really funny).

  10. #30
    Join Date
    Feb 2011
    Posts
    14

    Quote Originally Posted by rafirafi View Post
    If you look to the Chipset item for the T410 at their page it says: "Intel HD Graphics" which is not "GMA 4500"...
    well, it's really confusing and also misleading - i googled more and it seems there are Lenovo T410 models with "GMA 4500" and some new ones with "GMA 5700", and the latter is often also called "Intel HD Graphics" or "GMA HD" in different articles (BTW, it's not built into the chipset, but rather into the new Intel "Core i" processors). so, it seems the information is probably for the T410 with "GMA 5700".
