
Originally Posted by
bridgman
If it wasn't for latency-hiding you would be absolutely correct, and the number of cores does determine the number of threads that can execute in any single clock cycle (2 SIMDs, each executing 8 threads / clock, with up to 5 simultaneous instructions per thread per clock for a total of 80).
Running more threads than the minimum allows you to make much better use of available memory bandwidth, however, since any thread blocked on a memory access can simply idle while another unblocked thread runs. IIRC this was introduced in the r5xx series, described by marketing as the "Ultra-Threaded Dispatch Processor".