Thread: openCL direct port from cuda kills HD4xxx

  1. #1
    Join Date
    Nov 2008
    Location
    Germany
    Posts
    5,411

    Default openCL direct port from cuda kills HD4xxx

    http://forums.amd.com/devforum/messa...hreadid=123857

    what da fu.k...

    "It is possible to get good performance, just not with a direct port from Cuda.(openCL)"

    "with an NVIDIA GTX 260, and another with an ATI 4870. [...]I'm sorry to say we are getting approximately 5x the performance from the NVIDIA card,"

  2. #2
    Join Date
    Oct 2007
    Location
    Toronto-ish
    Posts
    7,385

    Default

    You already found the key point in that post :

    "It is possible to get good performance, just not with a direct port from Cuda.(openCL)"

    Every app is different, of course, but it's possible to write apps which are heavily optimized for a specific vendor's hardware, in which case a "direct port" will still have those optimizations and may not perform well on different hardware.

    Nothing wrong with that if you only plan to run on one hardware type, of course, but developers are learning to follow "generic" best practices (rather than vendor-specific ones) which allow good performance on multiple vendors' hardware, including CPUs. What seems to make the most difference is memory access patterns: tweaking the ALU code can give maybe a 100% speedup, but you can easily get a 10:1 or better improvement (or worsening) by changing the way memory is used.
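
    To make the memory access point concrete, here is a minimal sketch in Python of a toy model, not a benchmark. It counts how many memory segments a group of work-items touches in one load. The 64-byte segment size, the 16-wide group and the function name segments_touched are assumptions picked for illustration; real hardware differs per vendor and generation.

```python
# Toy model, not a measurement: count the memory segments one group of
# work-items touches in a single load. The 64-byte segment size and the
# 16-wide group are assumed values chosen for illustration only.

def segments_touched(num_work_items, stride_elems, elem_bytes=4, segment_bytes=64):
    """Return how many distinct segments are hit when work-item i reads element i * stride_elems."""
    segments = set()
    for i in range(num_work_items):
        byte_addr = i * stride_elems * elem_bytes
        segments.add(byte_addr // segment_bytes)
    return len(segments)

coalesced = segments_touched(16, 1)    # neighbouring work-items read neighbouring elements
strided = segments_touched(16, 16)     # each work-item lands in its own segment

print("coalesced:", coalesced, "segments")
print("strided:  ", strided, "segments")
```

    Under these assumptions the same 16 reads cost one segment fetch in the first layout and sixteen in the second, which is the kind of gap that ALU tweaks cannot make up.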

    Here's an example that goes the other way - IIRC this app was written in OpenCL from the beginning.

    http://forum.beyond3d.com/showpost.p...6&postcount=77

    In general I think the performance results will lie somewhere in between. If you follow some of the GPGPU threads you can see the cross-platform issues gradually being knocked off so that the final code runs fast on at least three different platforms (CPU as well as NVidia/ATI GPU).
    Last edited by bridgman; 01-05-2010 at 11:43 AM.

  3. #3
    Join Date
    Nov 2008
    Location
    Germany
    Posts
    5,411

    Default

    Quote Originally Posted by bridgman View Post
    You already found the key point in that post :

    "It is possible to get good performance, just not with a direct port from Cuda.(openCL)"

    Every app is different, of course, but it's possible to write apps which are heavily optimized for a specific vendor's hardware, in which case a "direct port" will still have those optimizations and may not perform well on different hardware.

    Nothing wrong with that if you only plan to run on one hardware type, of course, but developers are learning to follow "generic" best practices (rather than vendor-specific ones) which allow good performance on multiple vendors' hardware, including CPUs. What seems to make the most difference is memory access patterns: tweaking the ALU code can give maybe a 100% speedup, but you can easily get a 10:1 or better improvement (or worsening) by changing the way memory is used.

    Here's an example that goes the other way - IIRC this app was written in OpenCL from the beginning.

    http://forum.beyond3d.com/showpost.p...6&postcount=77

    In general I think the performance results will lie somewhere in between. If you follow some of the GPGPU threads you can see the cross-platform issues gradually being knocked off so that the final code runs fast on at least three different platforms (CPU as well as NVidia/ATI GPU).
    "GPU Ray-tracing for OpenCL"

    Nice :-) "Sample/sec -- 17,298.6K": is that enough to render a game with full ray tracing?

    Why are NVIDIA cards so slow in this benchmark?

    What's up with OpenCL for HD3xxx hardware?

    What's up with OpenCL for the open source driver?

  4. #4
    Join Date
    Oct 2007
    Location
    Toronto-ish
    Posts
    7,385

    Default

    Quote Originally Posted by Qaridarium View Post
    "Sample/sec -- 17,298.6K": is that enough to render a game with full ray tracing?
    I haven't really looked at the code, so not sure how it maps onto real world game workloads. The model being used here is fairly simple - a handful of geometric shapes.

    Quote Originally Posted by Qaridarium View Post
    Why are NVIDIA cards so slow in this benchmark?
    It probably just happens to be coded in a way that maps better onto ATI strengths (e.g. math-intensive work) than NVidia strengths. Revisit the thread in a month and I expect the gap between the two GPU vendors will be smaller, and the code will be running faster on all hardware.

    Quote Originally Posted by Qaridarium View Post
    What's up with OpenCL for HD3xxx hardware?
    Each new generation has additional inter-thread hardware support and there's a certain level required for a full, fast OpenCL implementation. IIRC the global data share (GDS) was added to rv670 first, and local data share (LDS) was added to rv770 first, and both of them are required for OpenCL.

    It's probably possible to implement an OpenCL subset that runs fine on older hardware, but that is more likely to happen with the open drivers than with the proprietary stack.

    Quote Originally Posted by Qaridarium View Post
    What's up with OpenCL for the open source driver?
    Nothing has changed AFAIK - it'll probably run over Gallium3D drivers, and there may need to be some changes to TGSI before that happens. VMware's short-term priority was getting their SVGA driver ready for production along with the graphics state trackers, so the devs have mostly been working on that instead of OpenCL. Zack's blog is still the best reference AFAIK.

    http://zrusin.blogspot.com/

    The last OpenCL-specific post was in Feb 09. Since then Zack pushed some initial OpenCL state tracker code :

    http://cgit.freedesktop.org/mesa/clover

    Looks like there has been some activity on the tree in the last few days, which is nice to see.

    There were also some recent (a couple of months ago) discussions about TGSI in relation to OpenCL on one of the mailing lists, but I wasn't able to find a reference quickly.
    Last edited by bridgman; 01-05-2010 at 01:02 PM.

  5. #5
    Join Date
    Nov 2008
    Location
    Germany
    Posts
    5,411

    Default

    Quote Originally Posted by bridgman View Post
    Each new generation has additional inter-thread hardware support and there's a certain level required for a full, fast OpenCL implementation. IIRC the global data share (GDS) was added to rv670 first, and local data share (LDS) was added to rv770 first, and both of them are required for OpenCL.
    Yes, LDS first appeared in RV770, but for OpenCL the RV770's LDS is apparently wrong in some way: too small, or missing the right features.

    This German news site talks about the problematic LDS in RV770:

    http://ht4u.net/news/21452_opencl-pe...eiber-updates/

    "Specifically, this concerns the Local Data Store, which AMD introduced with the RV770 (Radeon HD 4870). This small memory area, known at NVIDIA under the name Shared Memory, is characterized by particularly low access times. <***However, the Local Data Store of the HD 4000 series is subject to some restrictions that make a driver implementation for OpenCL difficult in this respect.***> For this reason, current Catalyst drivers simply map "Local Memory" on HD 4000 cards entirely onto the considerably slower global memory. The Local Data Store thus lies completely idle."

    So fglrx does not use the LDS on RV770 because of these "restrictions", i.e. a hardware bug?

    In the end you need an HD5xxx for OpenCL...

    If the HD 3870 can handle GDS, OpenCL should run fine on it just like on the HD 4000 series, because LDS calls can be served by the "slower" GDS:

    "For this reason, current Catalyst drivers simply map "Local Memory" on HD 4000 cards entirely onto the considerably slower global memory."

    The HD4xxx under Catalyst does it the same way! No LDS is used on the 4000 series...

  6. #6
    Join Date
    Aug 2007
    Posts
    6,607

    Default

    It seems the guy overclocks the graphics cards, which is not really useful for this purpose. There should be test apps to verify correctness. That's why NVIDIA is working on ECC memory for next-gen Quadro cards: to detect (and correct) memory errors. For games, usually only visual checks are done to confirm that an overclock works, but that can lead to completely useless results in GPU computing.

  7. #7
    Join Date
    Oct 2007
    Location
    Toronto-ish
    Posts
    7,385

    Default

    That doesn't sound right. It's possible that LDS is not used to implement OpenCL global memory (the fit there isn't great) but IIRC it is used for something, probably synchronization. If LDS isn't being used then the main alternatives would be direct shader access to memory (which was expanded a lot in 7xx) or "global" GPRs, which were only added in 7xx, and either way I don't see any reason to think that implementation on 6xx would be easy.

    The graphics-related programming model didn't change much between 6xx and 7xx, but the compute-related parts changed a lot more. Evergreen has non-trivial changes in both areas -- you can see a summary in the front of the ISA guide.

  8. #8
    Join Date
    Jan 2010
    Posts
    4

    Default

    Quote Originally Posted by bridgman View Post
    That doesn't sound right. It's possible that LDS is not used to implement OpenCL global memory (the fit there isn't great) but IIRC it is used for something, probably synchronization. If LDS isn't being used then the main alternatives would be direct shader access to memory (which was expanded a lot in 7xx) or "global" GPRs, which were only added in 7xx, and either way I don't see any reason to think that implementation on 6xx would be easy.
    To be (hopefully) precise, the LDS in RV770 isn't used in OCL, and will probably never be used there, as its access model is too restrictive to fit the specification. My understanding of your implementation is that you emulate shared memory via global memory in RV770, so implementations relying on "heavy" shared memory usage will be a bit underperforming there. On the other hand, one has a pretty fat register file there, which can offset some of the pain (such aspects do make direct ports from CUDA less than great ideas, as more often than not you'd be using shared mem on G80+). There's no GDS in R6xx parts (again, IIRC), there is a memory R/W cache that's similar in certain aspects, but it's small-ish and not wholly equivalent.
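
    A rough way to see why heavy shared-memory kernels suffer under that emulation is to count global memory accesses per work-group. The sketch below is Python and purely a counting model: the tile size, the reuse factor and the function name global_accesses are assumptions for illustration, and it ignores caches and the register file entirely.

```python
# Counting model only: how many global memory accesses one work-group makes
# when it stages a tile in "local" memory and then re-reads it several times.
# Tile size and reuse factor below are hypothetical example values.

def global_accesses(tile_elems, reads_per_elem, local_is_emulated):
    """Return the number of global memory accesses for one staged tile."""
    loads = tile_elems                       # fetch the tile from global memory
    if not local_is_emulated:
        return loads                         # staging and reuse stay on-chip
    staging_stores = tile_elems              # "local" writes go back out to global memory
    reuse_reads = tile_elems * reads_per_elem
    return loads + staging_stores + reuse_reads

on_chip = global_accesses(256, 16, local_is_emulated=False)
emulated = global_accesses(256, 16, local_is_emulated=True)

print("on-chip local memory:", on_chip, "global accesses")
print("emulated via global: ", emulated, "global accesses")
```

    With these made-up numbers the emulated case issues 18 times as many global accesses, which is why a CUDA kernel built around shared memory is a poor candidate for a direct port, while a kernel that keeps its working set in registers is hurt much less.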

    As for an R6xx implementation, lack of a Compute Shader mode (this was added with RV770) could be a limitation, since it'd mean somewhat higher overhead for launching kernels (you'd run them as Pixel Shaders). However, given the current state of the software stack and aspects of the architecture, one often ends up pretty fast implementing compute via Pixel Shaders, so the main hold-up for older parts, IMHO, is lack of relevancy coupled with lack of resources. ATI could probably coerce R6xx parts into OCL compliance, but I fail to see why that would provide any benefit whatsoever, whilst it'd cost their already over-stretched compute-centric guys a fair chunk of time.

  9. #9
    Join Date
    Oct 2007
    Location
    Toronto-ish
    Posts
    7,385

    Default

    Quote Originally Posted by Alex_V View Post
    There's no GDS in R6xx parts (again, IIRC), there is a memory R/W cache that's similar in certain aspects, but it's small-ish and not wholly equivalent.
    Yeah, that matches what I'm seeing in the documentation, although "conventional wisdom on the internet" seems to be that rv670 at least did have the GDS. Maybe confusion between GDS and the scatter/gather memory access functionality.

    I'm pretty sure there was also a hardware issue related to synchronization on older parts; I'll see if I can remember what it was.

    It's probably obvious that the open source team has been focusing on the graphics functionality so far, and not the compute bits.
    Last edited by bridgman; 01-06-2010 at 02:45 AM.

  10. #10
    Join Date
    Nov 2008
    Location
    Germany
    Posts
    5,411

    Default

    Quote Originally Posted by bridgman View Post
    Yeah, that matches what I'm seeing in the documentation, although "conventional wisdom on the internet" seems to be that rv670 at least did have the GDS. Maybe confusion between GDS and the scatter/gather memory access functionality.

    I'm pretty sure there was also a hardware issue related to synchronization on older parts; I'll see if I can remember what it was.

    It's probably obvious that the open source team has been focusing on the graphics functionality so far, and not the compute bits.
    "To be (hopefully) precise, the LDS in RV770 isn't used in OCL, and will probably never be used there, as its access model is too restrictive to fit the specification."

    So in the end the LDS in the RV770 is not used at all... :-(

    nvidia kicks my ass pure evil apple/nvidia satanic circle.
