But what if you're stuck with a problem set which is inherently branchy? Not every algorithm is perfectly suited to execution on GPUs.
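To put rough numbers on the "branchy" worry, here is a toy cost model, not a benchmark. It assumes a GPU-style machine where lanes in a group run in lockstep and the group pays for every branch path that any of its lanes takes, versus independent cores that each pay only for their own path. The function names, cycle counts, and group width are all made up for illustration, and the model ignores memory, scheduling, and any vector units on the CPU-style cores.

```python
def lockstep_cost(takes_then, then_cost, else_cost, width):
    """Cycles for a lockstep (SIMD-style) machine.

    Items are processed in groups of `width` lanes. A group pays for the
    'then' path if any lane takes it and for the 'else' path if any lane
    takes that one, so a mixed group pays for both.
    """
    total = 0
    for start in range(0, len(takes_then), width):
        group = takes_then[start:start + width]
        if any(group):
            total += then_cost
        if not all(group):
            total += else_cost
    return total


def independent_cost(takes_then, then_cost, else_cost, cores):
    """Cycles for independent cores.

    Each item pays only for the path it actually takes, and the total
    work is assumed (optimistically) to split evenly across the cores.
    """
    work = sum(then_cost if t else else_cost for t in takes_then)
    return -(-work // cores)  # ceiling division


# Hypothetical workloads: 64 items, 10 cycles per branch path.
uniform = [True] * 64                      # every item takes the same path
branchy = [i % 2 == 0 for i in range(64)]  # neighbours disagree

print("lockstep, uniform:   ", lockstep_cost(uniform, 10, 10, width=32))
print("lockstep, branchy:   ", lockstep_cost(branchy, 10, 10, width=32))
print("independent, branchy:", independent_cost(branchy, 10, 10, cores=32))
```

In this model the lockstep machine takes twice as long on the branchy input as on the uniform one, while the independent cores don't care. Real hardware is messier, but that's the basic reason branch-heavy code tends to be a poor fit for GPUs.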
And yes, I agree that 1 TFlop/s is a bit low for an absolute performance number.
My big questions now are:
1) What's the power consumption for that 1 TFlop/s? Does it require a multi-slot cooler, or is it a single-slot passive cooler?
2) How's the latency?
3) How quickly can they scale that performance up?
Actually, #1 is partially answered in the linked blog post. The image of the card shows a dual-slot cooler with a blower, similar to most mid/high-end graphics cards today.
The other big advantage of this co-processor mentioned in the blog post is compatibility. This card will execute x86 (or maybe x86-64) instructions natively, which means that any multi-threaded program that runs on an Intel CPU is a candidate for running on this card. No porting to OpenCL, CUDA, etc. required.
I'm curious how long it will be until someone gets llvmpipe working on this.
