
Thread: Rootbeer: A High-Performance GPU Compiler For Java

  1. #11
    Join Date
    Jun 2009
    Posts
    1,191


    Quote Originally Posted by alexThunder View Post
    Well, actually they are, unless you've got a problem which cannot be parallelized.
    They actually have most of that, e.g. pipelining.
    Most algorithms are very hard to parallelize, and even the parallel-friendly algorithms need optimizations that depend on the GPU and the dataset you use [<-- this is a very hard task -- if you want a widespread reference, google CABAC GPU]

    They have them, but very rudimentary and optimized for GPU tasks, so they differ quite a lot from their CPU counterparts. Don't believe me? Try a matrix multiply [1000x1000, for example], one with branching and one without [pick the CL language you like], and check the time both take to complete [the non-branched one wins by a factor of X], so you see what I mean.

    This is what I meant when I said that neither is faster than the other: they are different tools, designed to attack efficiently very different scales of problems.
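    To make the branching point concrete, here is a minimal CPU-side sketch (plain C, not an actual OpenCL kernel; function names and sizes are made up for illustration). On a GPU, threads in a warp that take different sides of an `if` get serialized, which is where the slowdown comes from; the branchless form maps onto OpenCL built-ins like select()/fmax() and keeps every thread on the same instruction stream.

```c
#include <assert.h>

/* Branched version: a per-element 'if', which on a GPU causes
 * warp divergence when neighboring threads disagree. */
void clamp_branched(const float *in, float *out, int n, float lo) {
    for (int i = 0; i < n; i++) {
        if (in[i] < lo)
            out[i] = lo;
        else
            out[i] = in[i];
    }
}

/* Branchless version: the conditional is folded into a select,
 * similar to OpenCL's select()/fmax() built-ins, so all threads
 * execute the same instructions. */
void clamp_branchless(const float *in, float *out, int n, float lo) {
    for (int i = 0; i < n; i++) {
        float x = in[i];
        out[i] = x > lo ? x : lo;  /* compilers typically emit a select, no jump */
    }
}
```

    Both produce identical results; only the control flow differs, which is exactly what matters to a SIMT machine.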

  2. #12
    Join Date
    Sep 2011
    Posts
    191


    Quote Originally Posted by jrch2k8 View Post
    Most algorithms are very hard to parallelize, and even the parallel-friendly algorithms need optimizations that depend on the GPU and the dataset you use [<-- this is a very hard task -- if you want a widespread reference, google CABAC GPU]

    They have them, but very rudimentary and optimized for GPU tasks, so they differ quite a lot from their CPU counterparts. Don't believe me? Try a matrix multiply [1000x1000, for example], one with branching and one without [pick the CL language you like], and check the time both take to complete [the non-branched one wins by a factor of X], so you see what I mean.

    This is what I meant when I said that neither is faster than the other: they are different tools, designed to attack efficiently very different scales of problems.
    Fortunately I don't have to test this again. I pulled some pages from the slides of a lecture I attended on this and combined them for you:

    http://www.uploadarea.de/files/2g1ae...gv83mxiyyj.pdf

    It's in German, but that's not that important. It contains (some part of) the actual host program and the OpenCL kernel. On the last two pages you'll find some graphs which show how the (very) simple kernel performs against a sequential CPU program, one with PThreads (4-core machine with HT), OpenCL on the CPU and OpenCL on the GPU. The last page shows the performance of the GPU kernel after some optimizations (better usage of the memory).

    (Only the simplest kernel is on these pages, not the optimized one.)

    The y-axis shows the time in seconds. The program was tested with a 1680x1680 matrix on an i7 920 and a GeForce 9800 GX2.

    FYI: The lecture I attended isn't publicly online anymore, but the most recent one is: http://pvs.uni-muenster.de/pvs/lehre...vorlesung.html (german)
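    For anyone who wants to reproduce the sequential-CPU baseline from those graphs without the slides, a minimal version looks like this (plain C; the 1680x1680 size from the slides is shrunk for the example, and the function name is my own, not from the lecture):

```c
#include <assert.h>

/* Naive O(n^3) matrix multiply: C = A * B, row-major n x n matrices.
 * This is the kind of sequential baseline the GPU kernels get
 * measured against. */
void matmul_seq(const float *A, const float *B, float *C, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}
```

    A naive OpenCL kernel is essentially the two inner loops of this, with i and j replaced by get_global_id(0)/get_global_id(1).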

  3. #13
    Join Date
    Nov 2008
    Posts
    783


    Quote Originally Posted by alexThunder View Post
    Well, actually they are, unless you've got a problem which cannot be parallelized.
    They are if and only if you have a GPU-suitable workload. If you don't, they're slower. Not sure why you insist otherwise.

    Is it because they're said to have more FLOPS? Sure they do, in theory. Now compare BOPS, Branching Operations per Second, and see your GPU weep.

    There are quite a few requirements for a workload to be GPU-suitable:
    * it has to be massively parallelizable
    * all parallel threads must be homogeneous. CPUs can easily run a physics thread on one core and a gameplay thread on a second, which isn't easy to do on GPUs.
    * no communication between the threads.
    * it should contain as few branches as possible and simple data structures. GPUs are a lot less forgiving if your cache locality sucks.
    * There must not be any latency requirements or actual streaming of data. You send all the input, you wait, you get all the output. No partial data anywhere.
    * The whole workload must take long enough to overcome the overhead of setting up the GPU.

    Of course a 40-second task with the textbook-algorithm for parallelizability is going to end up faster on the GPU. That doesn't prove anything for the general case.

  4. #14
    Join Date
    Sep 2011
    Posts
    191


    Quote Originally Posted by rohcQaH View Post
    They are if and only if you have a GPU-suitable workload. If you don't, they're slower. Not sure why you insist otherwise.
    I don't. In general, my point is that they're just faster, although they might not be usable for everything.

    For instance: motorbikes are usually faster than cars. The fact that they can hardly carry anything compared to a car doesn't make them slower, does it?

    Quote Originally Posted by rohcQaH View Post
    * it has to be massively parallelizable
    Leave out the "massively". Even if a problem is not that well parallelizable, the GPU might still outperform the CPU by a significant degree.

    Quote Originally Posted by rohcQaH View Post
    * all parallel threads must be homogeneous. CPUs can easily run a physics thread on one core and a gameplay thread on a second, which isn't easy to do on GPUs.
    Ever heard of PhysX? It's used in some games. Guess where it's executed.

    Quote Originally Posted by rohcQaH View Post
    * no communication between the threads.
    And how would you do synchronization then?

    Quote Originally Posted by rohcQaH View Post
    * it should contain as few branches as possible and simple data structures. GPUs are a lot less forgiving if your cache locality sucks.
    Yes, although the number of branches and the use of local memory are not necessarily related (in terms of performance).
    Still, if you look at the PDF I uploaded, the graphs on the last page show the naive OpenCL implementation, the use of warp sizes, and the use of local memory/cache (in that order).
    Even without locality, it's still fast.
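    The "better usage of the memory" step in that last graph is essentially tiling: each workgroup stages a block of the matrices in local memory so global-memory reads get reused. The same idea can be sketched on the CPU as cache blocking (plain C; the tile size is picked arbitrarily here, and this is an analogy, not the kernel from the slides):

```c
#include <assert.h>
#include <string.h>

#define TILE 4  /* on a GPU this would match the workgroup/tile size */

/* Blocked matrix multiply: C = A * B computed in TILE x TILE chunks,
 * so each chunk of A and B stays hot in cache -- the CPU analogue of
 * staging tiles in OpenCL __local memory. n must be a multiple of
 * TILE in this sketch. */
void matmul_tiled(const float *A, const float *B, float *C, int n) {
    memset(C, 0, sizeof(float) * n * n);
    for (int ii = 0; ii < n; ii += TILE)
        for (int kk = 0; kk < n; kk += TILE)
            for (int jj = 0; jj < n; jj += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int k = kk; k < kk + TILE; k++) {
                        float a = A[i * n + k];
                        for (int j = jj; j < jj + TILE; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

    The arithmetic is unchanged; only the traversal order (and therefore memory reuse) differs, which is why it shows up as a pure memory optimization in the graphs.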

    Quote Originally Posted by rohcQaH View Post
    * There must not be any latency requirements or actual streaming of data. You send all the input, you wait, you get all the output. No partial data anywhere.
    * The whole workload must take long enough to overcome the overhead of setting up the GPU.
    Yes.

    Quote Originally Posted by rohcQaH View Post
    Of course a 40-second task with the textbook-algorithm for parallelizability is going to end up faster on the GPU. That doesn't prove anything for the general case.
    Right, therefore you should do a bit more than that ;P

  5. #15
    Join Date
    Nov 2008
    Posts
    783


    Quote Originally Posted by alexThunder View Post
    For instance: motorbikes are usually faster than cars. The fact that they can hardly carry anything compared to a car doesn't make them slower, does it?
    With cars you have a pretty strict metric of speed = distance over time. Which metric are you using to claim that GPUs are faster than CPUs? "Computation over time" is just too hard to define, and you'll find definitions that favor either side. Thus, neither can be declared the winner.

    Quote Originally Posted by alexThunder View Post
    Ever heard of PhysX? It's used in some games. Guess where it's executed.
    It can run on either. If it does run on the GPU, it does not run concurrently with other GPU threads. They're run one after the other, with those expensive context switches and CPU involvement in between.

    On the CPU, both could run concurrently on their own cores with virtually no overhead.

    (Aside: there is evidence that PhysX could be faster on the CPU, but nVidia has purposefully crippled the CPU implementation to make their GPU compute look better.)

    Quote Originally Posted by alexThunder View Post
    And how would you do synchronization then?
    On a GPU? You don't. The only synchronization primitive is "The CPU task is informed that the current batch of data has been processed and the results are ready."
    Which rules out quite a few parallel algorithms.

  6. #16
    Join Date
    Sep 2011
    Posts
    191


    Quote Originally Posted by rohcQaH View Post
    With cars you have a pretty strict metric of speed = distance over time. Which metric are you using to claim that GPUs are faster than CPUs? "Computation over time" is just too hard to define, and you'll find definitions that favor either side.
    Would you understand this comparison without that information? Or do I have to explain how stylistic devices work?

    Btw, the metric would be time to response.

    Quote Originally Posted by rohcQaH View Post
    Thus, neither can be declared the winner.
    Which is why we still have SIMD and MIMD devices.

    Quote Originally Posted by rohcQaH View Post
    It can run on either. If it does run on the GPU, it does not run concurrently with other GPU threads. They're run one after the other, with those expensive context switches and CPU involvement in between.

    On the CPU, both could run concurrently on their own cores with virtually no overhead.
    Right, and which one still runs faster?

    Quote Originally Posted by rohcQaH View Post
    On a GPU? You don't. The only synchronization primitive is "The CPU task is informed that the current batch of data has been processed and the results are ready."
    Which disables quite a bit of parallel algorithms.
    Then tell me what, e.g., local/global memory fences in OpenCL do, or what they're good for.

  7. #17
    Join Date
    Nov 2008
    Location
    Madison, WI, USA
    Posts
    884


    Quote Originally Posted by alexThunder View Post
    Then tell me what, e.g., local/global memory fences in OpenCL do, or what they're good for.
    Memory fences and barrier(CLK_GLOBAL_MEM_FENCE) can only synchronize things within a given WORKGROUP. The GLOBAL refers to the global memory space, not to synchronizing all GPU threads.

    They are useful for making sure that threads within a workgroup don't get out of sync, but they CANNOT be used to synchronize all global work items in an OpenCL kernel invocation. Trust me, I've tried to create global synchronization mechanisms in OpenCL (and found fun ways to lock up my GPU in the process).

  8. #18


    Nice discussion!

  9. #19
    Join Date
    May 2007
    Location
    Third Rock from the Sun
    Posts
    6,587


    Quote Originally Posted by alexThunder View Post
    I don't. In general my point is, that they're just faster, although they might not be usable for everything.

    For instance: Motorbikes are usually faster than cars. The fact that they hardly can carry anything compared to a car, doesn't make them slower, does it?
    If you were to continue the vehicle analogy, a GPU would be more like a truck and the CPU more like a motorbike. It seems to me that both of you are correct about which is faster, but you're going by somewhat different definitions of what "faster" means. A GPU is "faster" in the sense that it can run a crapload of parallel threads at the same time; on the other hand, the CPU is "faster" on a per-thread basis.

  10. #20


    Nice discussion here! I am learning Java!
