It doesn't need to be linear, because the cost per FLOP on a GPU is so much lower than on a CPU system.
It appears to me that what happens in a GPU (and it's also why Intel's hyperthreading beats the same four cores without it) is that when there are many logical threads, stalls on main-memory latency stop being a factor: whenever one thread blocks on a load, some other thread whose data has already arrived can run instead. So the GPU can sustain its 200+ GB/s of main-memory throughput, because the latency is hidden by having far more threads ready to run than there are cores.
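To make that latency-hiding point concrete, here's a rough CUDA sketch I put together (nothing from any actual miner code; the kernel name, sizes, and launch geometry are all just illustrative). It streams through memory with vastly more threads than the card has cores, which is exactly what lets it approach peak bandwidth:

    // Minimal sketch: a bandwidth-bound streaming kernel launched with
    // far more threads than the GPU has cores. Whenever one warp stalls
    // on a DRAM load, the scheduler swaps in another warp whose load
    // has already landed, so the memory bus stays busy.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void stream_copy(const float *in, float *out, size_t n) {
        // Grid-stride loop: each thread touches many elements, and the
        // huge thread count keeps memory requests in flight continuously.
        for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
             i < n;
             i += (size_t)gridDim.x * blockDim.x)
            out[i] = in[i];
    }

    int main() {
        const size_t n = 1 << 26;   // 64M floats: ~256 MB read + ~256 MB written
        float *in, *out;
        cudaMalloc(&in,  n * sizeof(float));
        cudaMalloc(&out, n * sizeof(float));

        cudaEvent_t t0, t1;
        cudaEventCreate(&t0); cudaEventCreate(&t1);
        cudaEventRecord(t0);
        stream_copy<<<1024, 256>>>(in, out, n);   // ~262k logical threads
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);

        float ms;
        cudaEventElapsedTime(&ms, t0, t1);
        // 2x: every element is read once and written once.
        printf("effective bandwidth: %.1f GB/s\n",
               2.0 * n * sizeof(float) / (ms * 1e6));
        return 0;
    }

On a card rated around 200 GB/s, something like this gets close to the rated figure, while the same loop run with only one thread per core would spend most of its time idle waiting on DRAM.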
So it's not just that the gains from adding cores aren't linear... they appear to be converging on an upper bound, which suggests they're hitting a limit other than cycles per second. I'm assuming that limit is memory: either cache size or main-memory bandwidth and latency. I'm also assuming a GPU will hit the same wall, and that whatever it gains from a wider or faster memory bus and more cache is offset by its slower cores.
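To put a rough number on that ceiling (back-of-envelope, all figures assumed purely for illustration): if the hash is memory-hard and each attempt has to move, say, 128 KiB through main memory, then

    max hashrate = memory bandwidth / bytes moved per hash
                 = 200 GB/s / 128 KiB
                 ≈ 1.5 MH/s

and no number of extra cores, CPU or GPU, gets past that; only more bandwidth or a smaller per-hash working set does.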
You make other good points, and I don't have the time to give them the attention they deserve right now. I'm sure the hashing algorithm could be improved to make it more GPU-resistant, but I'm more concerned with having one that's 'good enough' than a perfect one that might fork the coin or stall its momentum.
I am going to avoid talking about your holistic design, because we have some disagreements there.
Yes, there's a lot we disagree on. Have you considered implementing your ideas in a new coin? That's what I did when I decided all the other coins had it wrong!