
Thread: Intel's Knights Corner Turns Into The Xeon Phi

  1. #1
    Join Date
    Jan 2007
    Posts
    15,391

    Default Intel's Knights Corner Turns Into The Xeon Phi

    Phoronix: Intel's Knights Corner Turns Into The Xeon Phi

    For those who haven't heard yet, Intel is getting ready to ship the Larrabee-derived "Knights Corner" co-processors, and they will be marketed under the name Xeon Phi...

    http://www.phoronix.com/vr.php?view=MTEyMjY

  2. #2
    Join Date
    Oct 2009
    Posts
    2,137

    Default

    This would be Intel's answer to GPGPU?

    Edit: Roughly equivalent to a Radeon HD 3870 X2, or a middle-of-the-road NI or SI.
    Hell, my laptop has an otherwise useless discrete GPU not much less than this thing...

    Edit 2: 576 gigaflops. That's a "cheap" laptop with dual GPU.

    Edit 3: Holy hell that's a lot more than my desktop has... 56 GFlops (RHD 4290).
    Last edited by droidhacker; 06-19-2012 at 11:13 AM.

  3. #3
    Join Date
    Apr 2010
    Posts
    1,946

    Default

    This is exactly what I was thinking several years ago.
    A central CPU will manage tasks, while pluggable CPU modules provide the actual performance.
    Multiply these modules, drive them with OpenCL, and you have real-time ray-traced graphics.

  4. #4
    Join Date
    Nov 2008
    Location
    Madison, WI, USA
    Posts
    881

    Default

    Quote Originally Posted by droidhacker View Post
    This would be intel's answer to gpgpu?

    Edit: Roughly equivalent to a Radeon HD 3870 X2, or a middle-of-the-road NI or SI.
    Hell, my laptop has an otherwise useless discrete GPU not much less than this thing...

    Edit 2: 576 gigaflops. That's a "cheap" laptop with dual GPU.

    Edit 3: Holy hell that's a lot more than my desktop has... 56 GFlops (RHD 4290).
    From what I've heard, the Larrabee/KC/Phi design should scale better with branchy code, which is something that GPUs really suck at.

    For certain problem sets GPUs are really good and you can get something approximating its maximum stated performance, but the moment you start adding branches that start sending threads in different directions your performance takes a nosedive.

  5. #5
    Join Date
    Oct 2009
    Posts
    2,137

    Default

    Quote Originally Posted by Veerappan View Post
    From what I've heard, the Larabee/KC/Phi design should scale better with branchy code, which is something that GPUs really suck at.

    For certain problem sets GPUs are really good and you can get something approximating its maximum stated performance, but the moment you start adding branches that start sending threads in different directions your performance takes a nosedive.
    So the trick is... to write good code. Plus, high-end GPUs (which certainly cost a lot less than these Intel boards...) are already up in the 8 TFlop range.
    Last edited by droidhacker; 06-19-2012 at 11:26 AM.

  6. #6
    Join Date
    Nov 2008
    Location
    Madison, WI, USA
    Posts
    881

    Default

    Quote Originally Posted by droidhacker View Post
    So the trick is.... to write good code. Plus, high end GPU's (which certainly cost a lot less than these intel boards...) are already up in the 8 TFlop range.
    But what if you're stuck with a problem set which is inherently branchy? Not every algorithm is perfectly suited to execution on GPUs.

    And yes, I agree that 1TFlop/s is a bit low for an absolute performance number.

    My big questions now are:
    1) What's the power consumption for that 1 TFLOP? Does it require a multi-slot cooler, or is it a single-slot passive cooler?
    2) How's the latency?
    3) How quickly can they scale that performance up?

    Actually, #1 is partially answered in the linked blog post. The image of the card shows a dual-slot cooler with a blower, similar to most mid/high-end graphics cards today.

    The other big advantage of this co-processor that is mentioned in the blog post is compatibility. This card will execute x86 (or maybe x86-64) instructions natively, which means that any multi-threaded program that runs on an Intel CPU is a candidate for running on this card. No porting to OpenCL/CUDA/etc. is required.

    I'm curious how long it will be until someone gets llvmpipe working on this.

  7. #7
    Join Date
    Oct 2009
    Posts
    2,137

    Default

    Quote Originally Posted by Veerappan View Post
    But what if you're stuck with a problem set which is inherently branchy? Not every algorithm is perfectly suited to execution on GPUs.

    And yes, I agree that 1TFlop/s is a bit low for an absolute performance number.

    My big questions now are:
    1) What's the power consumption for that 1TFlop. Does it require a multi-slot cooler, or is it a single-slot passive cooler?
    2) How's the latency
    3) How quickly can they scale that performance up?

    Actually, #1 is partially answered in the linked blog post. The image of the card shows a dual-slot cooler with a blower, similar to most mid/high-end graphics cards today.

    The other big advantage of this co-processor that is mentioned in the blog post is compatibility. This card will execute x86 (or maybe x86-64) instructions natively, which means that any multi-threaded program that runs on an Intel CPU is a candidate for running on this card. No porting to OpenCL/CUDA/etc required.

    I'm curious how long it will be until someone gets llvmpipe working on this
    I doubt that it will be that simple. If it's just a multi-core x86, it's going to be a tank and won't differ significantly from a multi-core CPU. If it's not a multi-core x86, then there must be some kind of VM to run x86 code, which again makes it a tank for running x86 code.

    Now the million-dollar question: if the thing is actually able to run x86 natively (without a VM) and is "massively parallel", why are they building it into a co-processor board instead of adding this directly into CPUs?

    Edit: the blog post doesn't actually say that it's x86, just that it *can* run x86 code. Well, an ARM chip CAN run x86 code (in a VM), just not well. I'm guessing this is just marketing crap. They also use the buzzword "Intel architecture". That is not the same as saying Intel x86 architecture. Intel is responsible for various different architectures, some experimental, some downright failures (IA-64). In the end, I will state that you can't take advantage of a massively parallel processor without coding FOR that massively parallel processor. This is similar to trying to take advantage of a multi-core CPU with a single-threaded application. It will run; it just won't benefit.

    I'm skeptical about what this thing will do and how.
    Last edited by droidhacker; 06-19-2012 at 11:58 AM.

  8. #8
    Join Date
    Dec 2011
    Posts
    74

    Default To the FUD spreaders on this website...

    To all of the people spreading FUD that this card is less powerful than a 5-year-old desktop ATI part: GET THE FACTS. People here are comparing theoretical, hand-wavy peak performance numbers for *single-precision operations* that are never reached in real life against actual certified *real* performance numbers for actual benchmarks achieved on the MIC systems. There is a *WORLD* of difference, and if you don't believe me, go read the TOP500 list and see how many 5-year-old AMD GPUs are in there if those parts are supposedly so amazing... I can save you some time, since the answer is 0.

    Let me put it to you this way: the exact same benchmarks where Intel has already shown MIC working at over 1 teraflop are the benchmarks where Nvidia is *projecting* its *full GK110 part* will be when it is finally released. Basically, Nvidia's top-of-the-line, next-generation, 7+ billion transistor monster will be in the same league as MIC, but will require you to use the CUDA programming model to get the performance. MIC totally destroys any existing compute accelerator on the market, and as a huge bonus, the programming model for MIC is light-years ahead of having to use CUDA or whatever passes for OpenCL these days in AMD land. MIC is a fully documented architecture that supplies SIMD instructions expanded from the existing AVX instructions already used in Intel and AMD CPUs.

    MIC is a *vastly* more open architecture than anything from Nvidia or AMD. And don't even get me started on the likes of Quaridiot, who act like the Messiah has returned when AMD releases incomplete and inaccurate docs for some of its cards, long after they have been released, for a couple of unpaid volunteers to decipher. MIC is a 100% openly documented architecture, and Intel has already released open source software for it in advance of its launch. This architecture will hopefully force Nvidia and AMD to *really* take Linux seriously instead of treating it like a second-class citizen while trying to make huge $$$ on Linux-based HPC systems.

  9. #9
    Join Date
    Nov 2008
    Location
    Madison, WI, USA
    Posts
    881

    Default

    Quote Originally Posted by droidhacker View Post
    I doubt that it will be that simple. If its just a multi-core x86, its going to be a tank and isn't going to differ significantly from a multi-core CPU. If its not a multi-core x86, then there must be some kind of VM to run x86 code, which again makes it a tank for running x86 code.

    Now million dollar question: If the thing is actually able to run x86 natively (without VM) and "massively parallel', why are they building it into a co-processor board, and not adding this directly into CPU's?

    Edit: blog post doesn't actually say that its x86. Just that it *can* run x86 code. Well, an ARM chip CAN run x86 code (VM), just not well. I'm guessing that this is just marketing crap.
    Wasn't the original idea of Larrabee to basically put a large number of simple x86 cores with vector instruction support on a single chip? Basically something like 80+ Atom/Pentium-class cores. No out-of-order execution and probably minimal caches; very simple x86 cores, but lots of them.

  10. #10
    Join Date
    Jun 2012
    Posts
    5

    Default

    Quote Originally Posted by droidhacker View Post
    This would be intel's answer to gpgpu?

    Edit: Roughly equivalent to a Radeon HD 3870 X2, or a middle-of-the-road NI or SI.
    Hell, my laptop has an otherwise useless discrete GPU not much less than this thing...

    Edit 2: 576 gigaflops. That's a "cheap" laptop with dual GPU.

    Edit 3: Holy hell that's a lot more than my desktop has... 56 GFlops (RHD 4290).
    The article gives the _double_-precision peak performance, not the single-precision one. You are comparing single-precision peak performance with a double-precision figure.

    For example, the AMD 7970 has 947 GFLOPS in double precision. See here
    Last edited by bouliiii; 06-19-2012 at 01:19 PM.
