I said we need a "server" because Mesa needs to be split into two processes for Xeon Phi.
Originally Posted by mateli
One is the library for traditional OpenGL applications that have no idea about Xeon Phi.
The other is a Xeon Phi application running in a different process on a different host, because the Xeon Phi is actually a standalone machine connected over the PCIe bus.
A brief TODO list for Xeon Phi rendering:
1. Write a Mesa driver on the host (your core i?, athlon, ppc or any cpu you like)
The driver needs to translate OpenGL commands into messages in some intermediate form and pass them to the server running on the Xeon Phi.
In other words, the state tracker (gallium) is left on the host.
(Sending OpenGL commands directly is possible, but I'd rather run the state tracker on a superscalar processor)
2. Write an OpenGL server on Xeon Phi
The server needs to parse messages and complete the rendering work.
This server can use llvmpipe, but rewriting from scratch is also possible, esp. for some commercial OpenGL vendors.
It's another story to optimize the OpenGL server, but I believe Xeon Phi is the future for OSS high-performance 3D.
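As a rough sketch of the split described above, the host-side driver could serialize GL-level commands into a tagged message stream that the Xeon Phi server replays. The opcode set and wire format here are made up purely for illustration; a real protocol would be defined by the driver:

```python
import struct
from enum import IntEnum

# Hypothetical opcode set -- illustrative only, not a real Mesa protocol.
class Op(IntEnum):
    CLEAR = 1
    DRAW_ARRAYS = 2
    SWAP_BUFFERS = 3

def encode(op, *args):
    """Host side: pack one command as u32 opcode, u32 arg count, then u32 args."""
    return struct.pack(f"<II{len(args)}I", op, len(args), *args)

def decode(buf, offset=0):
    """Server side: unpack one command, return (op, args, next_offset)."""
    op, n = struct.unpack_from("<II", buf, offset)
    args = struct.unpack_from(f"<{n}I", buf, offset + 8)
    return Op(op), list(args), offset + 8 + 4 * n

# The host driver batches commands into one buffer sent over PCIe...
stream = encode(Op.CLEAR, 0x00000000) + encode(Op.DRAW_ARRAYS, 4, 0, 3)

# ...and the server on the Xeon Phi parses and replays them.
off = 0
while off < len(stream):
    op, args, off = decode(stream, off)
    print(op.name, args)
```

Batching many commands per message matters here, since each round trip crosses the PCIe bus.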
Sorry, just noticed this now. Even without HSA, most of the programmability is already included in currently shipping GPUs. The main differences are:
Originally Posted by Figueiredo
1. GPUs keep texture filtering in fixed-function hardware rather than moving it to general purpose processors. Texture processing is generally required for small rectangular areas of texture rather than individual pixels, and there are some significant performance & power efficiency benefits to be had from using fixed-function hardware because of the ability to share results from intermediate calculations more efficiently.
2. GPUs handle the task of spreading work across parallel threads and cores using fixed function hardware rather than software, which helps a lot with scaling issues.
Pretty much everything else has already moved from fixed function hardware into the general purpose processors. The ISA on the general purpose processors is a bit more focused on graphics and HPC tasks -- that's what the ISA guide for each new HW generation covers.
In KC-speak, the HD 79xx has 32 independent cores, each with a scalar ALU and a 2048-bit SIMD floating point ALU (organized as 4 x 512-bit, ie 4 x 16-way SIMD), running up to 40 threads on each core. The fixed-function hardware that spreads work across threads and cores allows each of the cores to have relatively more floating point power (which is what graphics and HPC both require) and relatively less scalar power.
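To put numbers on that layout, the peak single-precision throughput falls out of the core/SIMD counts above; the 925 MHz clock is my assumption (the HD 7970 reference clock), not something stated in this thread:

```python
cores = 32          # independent cores (compute units) on the HD 79xx
simd_per_core = 4   # 4 x 512-bit SIMD per core
lanes_per_simd = 16 # each 512-bit SIMD is 16-way single-precision
flops_per_lane = 2  # a fused multiply-add counts as two FLOPs
clock_ghz = 0.925   # assumed HD 7970 reference clock

lanes = cores * simd_per_core * lanes_per_simd    # the familiar "2048 stream processors"
peak_gflops = lanes * flops_per_lane * clock_ghz  # ~3.79 TFLOPS single precision
print(lanes, peak_gflops)
```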
What HSA brings is tighter integration between the main (superscalar) CPU cores and the GPU cores to reduce the overhead and programming complexity of offloading work to a separate device -- shared pageable virtual memory, cache coherency between CPU and GPU cores, simpler/faster dispatch of work between GPU and CPU etc...
re: #2, couple more comments for completeness...
For compute, the fixed function hardware takes N-dimensional array-level compute commands and spreads the work across cores & threads.
For graphics, the fixed function hardware takes "draw using these lists of triangles" commands and implements the non-programmable parts of the GL/DX graphics pipelines:
- pick out individual vertices and spread the vertex shader processing across cores and threads
- reassemble processed vertices into triangles, scan convert each triangle to identify pixels
- spread the pixel/fragment shader work across cores & threads
(a modern graphics pipeline has a lot more stages than just vertex & fragment processing but you get the idea, same applies to the other stages as well)
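A crude model of that work-spreading step: after scan conversion, the covered pixels get grouped into wavefront-sized batches for the shader cores. Names and the flat pixel list are purely illustrative; real hardware rasterizes in 2D tiles:

```python
def spread_fragments(pixels, wave_size=64):
    """Group covered pixels into wavefront-sized batches for the shader cores."""
    return [pixels[i:i + wave_size] for i in range(0, len(pixels), wave_size)]

# A triangle covering 200 pixels fills 3 full wavefronts plus one partial one.
waves = spread_fragments(list(range(200)))
print(len(waves), len(waves[-1]))
```

The partially-filled last wavefront is one reason tiny triangles waste SIMD lanes.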
Maybe I've learned something wrong, but I can't get the idea about "40 threads".
Originally Posted by bridgman
I think the cores in SI run wavefronts, not independent threads. Isn't SMT an unnecessary complexity for GPUs?
What I'm calling a thread in KC-speak is a wavefront in GPU-speak. Basically the same thing these days... a single thread using the SIMD hardware to process 64 elements in parallel (each 16-way SIMD actually performs a vector operation on 64 elements in 4 clocks).
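In other words, one 64-element wavefront maps onto the 16-wide SIMD as four back-to-back batches, one per clock. A toy model of that (the "vector op" is just a stand-in):

```python
WAVEFRONT = 64   # elements per wavefront
SIMD_WIDTH = 16  # lanes in one 512-bit single-precision SIMD

def issue_wavefront(elements):
    """Issue one vector op over a 64-element wavefront, 16 lanes per clock."""
    assert len(elements) == WAVEFRONT
    clocks = []
    for c in range(WAVEFRONT // SIMD_WIDTH):           # 4 clocks to retire it
        batch = elements[c * SIMD_WIDTH:(c + 1) * SIMD_WIDTH]
        clocks.append([x * 2.0 for x in batch])        # stand-in vector operation
    return clocks

out = issue_wavefront(list(range(64)))
print(len(out))  # 4 clocks per wavefront
```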
Originally Posted by zxy_thf
SMT these days usually refers to dynamically sharing execution units in a superscalar processor, which is complex as you say. GPUs generally rely on thread-level parallelism rather than instruction-level parallelism (although VLIW shader cores use both, with the compiler implementing ILP), so running multiple threads is a lot less complex.
Think about the old "barrel processor" model from the 60s and 70s, where the processor has multiple register sets and switches between threads on a per-clock basis rather than using the parallel execution units required for superscalar operation to run instructions from more than one thread in a single clock.
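A toy model of that barrel scheduling: exactly one thread issues per clock, round-robin across the register sets, with no execution-unit sharing within a clock (which is what distinguishes it from SMT):

```python
def barrel_schedule(num_threads, num_clocks):
    """Round-robin issue: one thread per clock, cycling through register sets."""
    return [clock % num_threads for clock in range(num_clocks)]

# Four hardware threads over eight clocks.
order = barrel_schedule(4, 8)
print(order)  # [0, 1, 2, 3, 0, 1, 2, 3]
```

While a thread waits its turn (or waits on memory), the other register sets keep the pipeline busy, which is exactly the latency-hiding trick the posts below mention.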
IIRC Larrabee uses the same approach -- multiple threads per core but only one thread at a time. I'm not sure which model KC uses but I suspect it also runs one thread at a time per core.
Looks like "the old barrel processor model" is now called "fine grained temporal multithreading" and is trendy again :D
Thanks to memory's high latency ;)
Originally Posted by bridgman