Could even push it to 60 MH/s per 1080 Ti with some slight modifications.
Still funny that the 32-bit build is around 10% faster for me than the 64-bit one. I'm still trying to figure out why and debug it.
I ran the test here on a 1080 and noticed that the 32-bit version is indeed a bit faster than the 64-bit version.
It is around 1.5 to 2 MH/s faster: 36.x MH/s in x64 versus 38.x MH/s in x32, with -i 23.
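
To put that next to the ~10% figure from the 1080 Ti, the gap on the 1080 works out to roughly 4 to 6%. A quick sketch of the arithmetic in Python; the 36.0 MH/s baseline is an assumption (the exact reading was only given as 36.x), and speedup_percent is just a hypothetical helper name:

```python
def speedup_percent(base_mhs: float, gain_mhs: float) -> float:
    """Relative gain in percent, given a baseline hashrate and an absolute gain (both in MH/s)."""
    if base_mhs <= 0:
        raise ValueError("baseline hashrate must be positive")
    return gain_mhs / base_mhs * 100.0


# Assumed baseline: 36.0 MH/s for the x64 build (the post only says 36.x).
BASE_X64_MHS = 36.0

# Reported gain of the 32-bit build: about 1.5 to 2 MH/s.
for gain in (1.5, 2.0):
    pct = speedup_percent(BASE_X64_MHS, gain)
    print(f"+{gain} MH/s over {BASE_X64_MHS} MH/s = {pct:.1f}% faster")
```

So the effect is there on the 1080 too, just about half the size of the 10% reported above.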