Wouldn't the exact same restrictions on GPUs apply to KC then? In fact, they would pretty much apply to all large parallel architectures.
A modern-day GPU like the 7970 has 2048 SPs, and since you need at least 4 wavefronts/CU to keep the hardware utilized, you should schedule the kernel with an NDRange of at least 8192. At 16MB per workitem, that makes 8192*16 = 131072MB, or about 131GB of video memory, which is far beyond what the 7970 actually has. KC has 50 cores. Assuming you use SSE2 registers (because AVX does not have integer arithmetic on the ymm registers, that only comes with AVX2), you end up with the equivalent of 200 GPU workitems, or 200*16 = 3200MB of memory, quite within practical limits.
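For anyone who wants to check the arithmetic, here is a rough sketch of it. The 16MB-per-workitem figure is an assumption read off the multiplications above, and the function names are made up for illustration. It also assumes a wavefront is as wide as the SP count of one CU, which is what makes "SPs times wavefronts/CU" give the 8192 figure.

```python
# Assumption: each workitem needs 16 MB, as implied by the 8192*16 and 200*16 products.
MB_PER_WORKITEM = 16


def gpu_min_workitems(stream_processors, wavefronts_per_cu):
    """Smallest NDRange that gives every CU the requested number of wavefronts.

    Assumes one wavefront is as wide as the SP count of one CU, so the total
    is simply SPs * wavefronts per CU.
    """
    return stream_processors * wavefronts_per_cu


def cpu_equivalent_workitems(cores, register_bits, lane_bits=32):
    """Number of 32-bit lanes across all cores, one 'workitem' per lane."""
    return cores * (register_bits // lane_bits)


def footprint_mb(workitems, mb_per_workitem=MB_PER_WORKITEM):
    """Total memory needed if every workitem holds its own buffer."""
    return workitems * mb_per_workitem


gpu_items = gpu_min_workitems(2048, 4)        # 7970: 2048 SPs, 4 wavefronts/CU
kc_items = cpu_equivalent_workitems(50, 128)  # KC: 50 cores, 128-bit SSE2 registers

print(gpu_items, footprint_mb(gpu_items))  # 8192 131072
print(kc_items, footprint_mb(kc_items))    # 200 3200
```

Change MB_PER_WORKITEM if your per-workitem buffer is a different size; the gap between the two architectures stays the same in relative terms.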
Just what do you think bitcoin mining is?
Bitcoin mining by definition has nothing to do with parallel computing. You could do it on a slow single-core MIPS processor, for example. Of course, it would not be cost-effective.
You clearly do not understand how computer architecture or dependency resolution works. Waiting on the results of a previous instruction? Just start work on a non-dependent instruction.
Provided that there is a non-dependent instruction to fetch/decode/execute.
The current bitcoin mining hardware is restricted by the number of ALUs. True, there might be a point where you need to increase the registers, decoders and scheduling to feed all those ALUs, but that's not until you increase the ALU resources drastically from where they are now.
Of course it is limited by the number of ALUs, but that's not the only limitation. It is also limited by occupancy, and occupancy is limited by GPR usage. Even a well-optimized kernel uses enough GPRs to limit the number of wavefronts/CU. That's why, for example, 2-component vectors worked best on VLIW hardware. uint2 does not provide enough independent instructions to fill the VLIW4/VLIW5 bundles, so ALUPacking was far from 100%. On the other hand, going to, say, uint4 improves ALUPacking but ironically worsens performance, because it requires more GPRs, so fewer wavefronts can be scheduled and we get lower occupancy, underutilizing the hardware. AMD could make their hardware much better suited to bitcoin mining (and not only to that) if they increased their GPUs' register file, but they decided the current size would be enough. Generally yes, putting in more ALUs would make the hardware faster, but having more GPRs per CU would definitely make it faster too. There isn't much use in more ALUs if you can't keep them busy. Right now, bitcoin kernels are a compromise and the hardware is never completely utilized.
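To make the GPR/occupancy trade-off concrete, here is a minimal sketch. The numbers are illustrative assumptions, not measurements: a budget of 256 GPRs per lane, a hardware cap of 10 wavefronts per SIMD, and made-up GPR counts for a narrow and a wide kernel. Real limits differ between architectures, so check your own hardware's documentation.

```python
def wavefronts_per_simd(gprs_per_workitem, gprs_per_lane=256, hw_max_waves=10):
    """Wavefronts that fit on one SIMD given per-workitem register usage.

    gprs_per_lane and hw_max_waves are assumed, illustrative limits.
    """
    if gprs_per_workitem <= 0:
        raise ValueError("gprs_per_workitem must be positive")
    # Every resident wavefront needs its own slice of the register file,
    # so the register budget divided by the per-workitem usage is the bound.
    return min(hw_max_waves, gprs_per_lane // gprs_per_workitem)


# Hypothetical kernels: the wider vector type packs the ALUs better
# but needs twice the registers.
narrow = wavefronts_per_simd(40)  # e.g. a uint2-style kernel
wide = wavefronts_per_simd(80)    # e.g. a uint4-style kernel

print(narrow, wide)  # 6 3
```

With these assumed numbers, doubling the register usage halves the number of wavefronts available to hide latency, which is the effect described above.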
An even better example is NVidia's Kepler. The GTX 680 has 1536 ALUs, three times as many as a fast Fermi GPU like the GTX 580. Yet practical results show the 580 is faster than the 680 at bitcoin mining (and at any other ALU-intensive GPGPU work, for that matter). The reason? They went from grouping 32 cores in a CU to 192 cores, but instead of making the register file 6 times larger, they made it just 3 times larger, so you end up with half the registers per core that you used to have with Fermi. The result is that you can't get proper occupancy and, alas, the 3x increase in ALUs is practically money for nothing. Kepler is not a GPGPU arch and GK110 unfortunately is not diverging from that.
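The ratio argument can be written out as a quick sketch. It takes the growth factors from the post at face value (6x cores per CU, 3x register file) and does not check them against vendor specs. The function name is hypothetical.

```python
def registers_per_core_ratio(core_growth, regfile_growth):
    """Registers per core on the new part relative to the old one."""
    if core_growth <= 0:
        raise ValueError("core_growth must be positive")
    return regfile_growth / core_growth


# Figures as stated in the post: 32 -> 192 cores per CU, register file 3x larger.
core_growth = 192 / 32
ratio = registers_per_core_ratio(core_growth, 3)

print(core_growth, ratio)  # 6.0 0.5
```

Any ratio below 1.0 means each core has fewer registers to work with than before, so a kernel with the same register usage gets lower occupancy.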