How on earth did you manage that? We haven't been able to get over 13 MH/s.
Just by benchmarking various launch configs until I found one that worked well, in addition to the other changes I listed in my original post. I modified the hefty_cpu_hash function in cuda_hefty1.cu. Changes made are expressed in this diff:
https://gist.github.com/danryan/6a631e0ece773e5f6788
This change is potentially dangerous: the total number of threads run on the GPU is no longer aligned with the "throughput" variable used by the heavycoin scanhash function (passed in as the variable "threads" to the function you modified). This could lead to overlapping shares being found (the same nonce submitted more than once, leading to rejects), part of the nonce space being skipped (not actually a problem), or buffers being overrun (potentially serious).
You need to add some code to compute the throughput variable (=total number of GPU threads) based on device properties, e.g. in an early function call to the cuda_hefty1.cu module.
Christian
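As a rough illustration of the suggestion above, here is a minimal host-side sketch in plain C++ that derives the throughput value from the launch configuration instead of setting the two independently. The names LaunchConfig, pick_launch_config, throughput_for and the blocks_per_sm tuning factor are hypothetical and do not come from the miner's source; in a real .cu module the multiprocessor count would be read with cudaGetDeviceProperties (the multiProcessorCount field).

```cpp
#include <cstdint>

// Hypothetical launch configuration; the names are illustrative only.
struct LaunchConfig {
    std::uint32_t blocks;            // grid width (number of thread blocks)
    std::uint32_t threads_per_block; // block width
};

// Build a launch configuration from device properties.
// Assumption: `multiprocessors` would come from cudaDeviceProp::multiProcessorCount,
// while `blocks_per_sm` and `threads_per_block` are benchmarked tuning values.
inline LaunchConfig pick_launch_config(std::uint32_t multiprocessors,
                                       std::uint32_t blocks_per_sm,
                                       std::uint32_t threads_per_block) {
    return LaunchConfig{multiprocessors * blocks_per_sm, threads_per_block};
}

// Total number of GPU threads per launch. This is the single value that
// scanhash should use as "throughput", so nonce ranges and buffer sizes
// always match what the kernel launch actually covers.
inline std::uint32_t throughput_for(const LaunchConfig& cfg) {
    return cfg.blocks * cfg.threads_per_block;
}
```

For example, with 8 multiprocessors, 4 blocks per multiprocessor and 256 threads per block, throughput_for returns 8192, and that same number would be handed to scanhash.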
Oh geez, yep you're absolutely right. I had altered the loop pattern of the hefty_gpu_hash to consider grids of various widths:
It appears I neglected to commit this code when running my benchmarks. With the loop changed, I no longer get validation errors from alignment issues, but I also no longer get the performance increases I saw previously.
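The modified loop itself is not shown in the post. One common way to make a kernel independent of the grid width is a grid-stride loop. The sketch below emulates that pattern on the host in plain C++ to show that every nonce offset is visited exactly once however the grid is sized. This is an assumption about the general technique, not the code used in the benchmarks, and grid_stride_visits and covers_exactly_once are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side emulation of a grid-stride loop: each simulated GPU thread starts
// at its global index and advances by the total thread count until the work
// range is exhausted. Returns how many times each nonce offset was visited.
inline std::vector<std::uint32_t> grid_stride_visits(std::size_t work_items,
                                                     std::uint32_t blocks,
                                                     std::uint32_t threads_per_block) {
    std::vector<std::uint32_t> visits(work_items, 0);
    // On the device this would be gridDim.x * blockDim.x.
    const std::size_t stride =
        static_cast<std::size_t>(blocks) * threads_per_block;
    for (std::uint32_t block = 0; block < blocks; ++block) {
        for (std::uint32_t thread = 0; thread < threads_per_block; ++thread) {
            // On the device this would be blockIdx.x * blockDim.x + threadIdx.x.
            const std::size_t first =
                static_cast<std::size_t>(block) * threads_per_block + thread;
            for (std::size_t i = first; i < work_items; i += stride) {
                ++visits[i];
            }
        }
    }
    return visits;
}

// True when every offset was visited exactly once (no overlap, no gaps).
inline bool covers_exactly_once(const std::vector<std::uint32_t>& visits) {
    for (std::uint32_t count : visits) {
        if (count != 1) return false;
    }
    return true;
}
```

With this pattern the work range can be larger or smaller than the number of launched threads: surplus threads simply do nothing, and a short grid makes each thread process several offsets.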