I bet you find the 2MB Scrypt is the dominant time factor.
I bet you'll find (on a CPU) that the 64MB Scrypt takes twelve times as long as the 2MB Scrypt, which in turn takes ten times as long as the 128K Scrypt. So when the 128K Scrypt is run 120 times, interleaved with the 2MB Scrypt run 12 times and the 64MB Scrypt run once, each tier takes about 1/3 of the time required for the overall hash.
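A minimal way to sanity-check those ratios on a CPU, using Python's hashlib.scrypt; the N values are my assumed mapping of 128K / 2MB / 64MB onto r=8 (memory is roughly 128 * r * N bytes), and the run counts are the 120 / 12 / 1 interleaving above:

```python
import hashlib
import time

# Assumed parameter sets: with r=8, memory use is roughly 128 * r * N bytes,
# so N = 128, 2048, 65536 correspond to ~128 KiB, ~2 MiB and ~64 MiB.
PARAMS = [("128K", 128, 120), ("2MB", 2048, 12), ("64MB", 65536, 1)]

password, salt = b"benchmark", b"salt"

for label, n, runs in PARAMS:
    start = time.perf_counter()
    for _ in range(runs):
        hashlib.scrypt(password, salt=salt, n=n, r=8, p=1,
                       dklen=32, maxmem=128 * 1024 * 1024)
    elapsed = time.perf_counter() - start
    print(f"{label} x {runs}: {elapsed:.3f}s total")
```

If each of the three lines prints roughly the same total time, the 1/3 split I'm betting on holds on that CPU.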
I meant the largest-memory Scrypt you are using. So if that is 64MB, then agreed, it would take the most time if you were running each of them the same number of times. Now I see you are running the smaller Scrypts more times to compensate. Still, 6GB on the GPU means it can run 96 copies of the 64MB Scrypt concurrently to hide memory latency (and take advantage of the roughly 10 times higher main-memory bandwidth of the GPU), whereas the CPU can't. So on 2/3 of your Scrypts the GPU can in theory trounce the CPU by 100 to 1000 times.
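The 96 figure is just the card's memory divided by the per-instance scratch buffer; a quick sketch assuming a 6 GiB card and 64 MiB per instance:

```python
# Assumed figures (not measured): 6 GiB of GPU memory and ~64 MiB of
# scratch per 64MB-Scrypt instance.
gpu_mem_mib = 6 * 1024
per_instance_mib = 64
print(gpu_mem_mib // per_instance_mib)  # -> 96 instances resident at once
```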
I am expecting roughly 0.33 x 10 + 0.67 x 100 (or 1000), so the GPU roughly 70 to 700 times faster.
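Spelled out, that back-of-envelope (my assumed split: 1/3 of the hash time sped up ~10x on the GPU, 2/3 sped up 100x to 1000x) gives:

```python
# Weighted estimate from the assumed split above; these are guesses, not benchmarks.
for big_speedup in (100, 1000):
    estimate = 0.33 * 10 + 0.67 * big_speedup
    print(f"with {big_speedup}x on the larger Scrypts: ~{estimate:.0f}x overall")
```

That prints roughly 70x and 670x, which is where the 70 to 700 range comes from.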
I will get back to you when I have something concrete to grab your attention, even if it proves I was wrong, so that the issue is resolved.
Thanks, looking forward to it. Consider also that the clock is ticking for any GPU implementations: 90% should be minted within a year, and the remaining 10% the year after. It would be regrettable, but not catastrophic, if GPUs or ASICs were to capture that 10% and/or the 2%-per-year distribution.
I am thinking 99% of the 2% per annum could go to GPUs.
Tangentially, I don't understand how you expect to maintain interest in your coin for enough years for it to gain traction, when 90% will already have been awarded in one year. You need new adopters continuously, not just the few who got in early.