Re: GTX 680 Perf Estimations
by gat3way on 24/03/2012, 00:44:23 UTC
Board: CPU/GPU Bitcoin mining hardware
Quote
A hypothetical GTX 580 SM core with 96 shaders@772MHz should still have the same performance as a real SM with 48 shaders@1544MHz, because it could only process 4 instructions per clock cycle (2 warps). A GTX 680 SMX core, however, is capable of doing 8 instructions per clock cycle (4 warps). It's the number of warp schedulers plus the shaders to fill that ultimately determines max performance.


Nope. I think you don't quite understand how a GPU functions. The best approximation I can give you is hyperthreading with 4 register sets rather than 2. So yes, if there is a memory fetch operation, there would be 3 warps in flight rather than 1. And yes, Kepler would be good for memory-intensive tasks (which might explain the AES case, if they blew up the lookup tables and they no longer fit in __local memory).
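To make the __local memory point concrete, here is a minimal OpenCL sketch (a hypothetical kernel, all names are mine, not anybody's actual miner code) of the usual trick: staging a 256-entry AES T-table into __local memory so the lookups hit fast on-chip storage. If the tables are "blown up" past the __local budget (16-48 KB per workgroup, depending on the architecture), they have to stay in __global memory, which is exactly where Kepler's extra in-flight warps would help hide the latency.

Code:
// Hypothetical sketch: stage a 1 KB AES T-table into __local memory.
__kernel void aes_ttable_demo(__global const uint *Te0_global,
                              __global const uint *in,
                              __global uint *out)
{
    __local uint Te0[256];
    size_t lid = get_local_id(0);
    size_t lsz = get_local_size(0);

    // Cooperative copy: each work-item loads a slice of the table.
    for (size_t i = lid; i < 256; i += lsz)
        Te0[i] = Te0_global[i];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Simplified stand-in for one round's worth of table lookups.
    uint x = in[get_global_id(0)];
    out[get_global_id(0)] = Te0[x & 0xffu] ^ Te0[(x >> 8) & 0xffu];
}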

But no, there is no 4x instruction throughput. In that aspect, it's the same as Fermi. BTW, a warp does not "execute" for just one clock cycle; it executes for much longer, more than 20 clock cycles, and there is of course a pipeline. With the sm_1x architecture, the pipeline was fucked up and new instructions were fetched/retired once per 4 clocks. Fermi improved that to once per 2 clocks. From what I read in the pdf, Kepler does exactly the same as Fermi.

Now the question is, sm_21 architectures introduced something like out-of-order execution where similar instructions on different, independent data could be "batched". This in turn led to vectorizing OpenCL code for sm_21, and to things like the GTX460 being ~60% faster when uint2 vectors are used. I am really wondering how far they got with that in GK104 :)
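On the sm_21 vectorization point, here is a rough OpenCL sketch (a toy rotate/xor loop I made up, not real miner code) of why uint2 helps: the two lanes of the vector have no data dependence on each other, so the compiler can emit two independent instruction streams and the superscalar scheduler has something to dual-issue.

Code:
// Scalar version: a single dependent chain per work-item.
__kernel void mix_scalar(__global uint *data)
{
    uint v = data[get_global_id(0)];
    for (int i = 0; i < 64; i++)
        v = rotate(v, 5u) ^ 0x9e3779b9u;
    data[get_global_id(0)] = v;
}

// uint2 version: two independent chains per work-item, which is what
// lets an sm_21 part "batch" similar instructions on independent data.
__kernel void mix_uint2(__global uint2 *data)
{
    uint2 v = data[get_global_id(0)];
    for (int i = 0; i < 64; i++)
        v = rotate(v, (uint2)(5u)) ^ (uint2)(0x9e3779b9u);
    data[get_global_id(0)] = v;
}

The ~60% GTX460 gain mentioned above came from exactly this kind of transformation; on hardware that can't dual-issue, the two versions do roughly the same work per element.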