Epsylon,
Did you try CUDA 9 RC yet? It looks like __shfl was deprecated in favor of something else. There's no programming guide for v9 yet, so I don't know for sure.
I'm giving it a shot to see if they improved compilation for sm_61.
Almost all algos run slower when compiled for sm_61 than for sm_52, which is a bit odd.
Yes... sm_61 is an emulation of the sm_60 (P100) fp16 stuff, and when it gets used (it looks like it sometimes does), it's slower than the normal ops generated for sm_5x. So nothing has changed for the GTX (and mining cards) in that respect since CUDA 8.
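On the __shfl question: as far as I know, CUDA 9 replaces the old warp shuffles with *_sync variants (__shfl_sync, __shfl_up_sync, __shfl_down_sync, __shfl_xor_sync) that take an explicit lane mask as the first argument. Here is a rough sketch of how a kernel could stay buildable on both CUDA 8 and CUDA 9. The SHFL_XOR macro and the warp_sum kernel are made-up names for illustration, not something taken from the miner sources, and the full 0xffffffff mask assumes all 32 lanes of the warp are active.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical compatibility wrapper (illustrative name, not from any miner source).
// CUDA 9+ : use the *_sync shuffle with a full-warp mask.
// CUDA 8- : fall back to the legacy shuffle.
#if defined(CUDART_VERSION) && CUDART_VERSION >= 9000
#define SHFL_XOR(var, lane_mask) __shfl_xor_sync(0xffffffffu, (var), (lane_mask))
#else
#define SHFL_XOR(var, lane_mask) __shfl_xor((var), (lane_mask))
#endif

// Sums one int per lane across a single 32-thread warp.
__global__ void warp_sum(const int *in, int *out)
{
    int v = in[threadIdx.x];

    // Butterfly reduction: after the 5 steps every lane holds the warp total.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += SHFL_XOR(v, offset);

    if (threadIdx.x == 0)
        *out = v;
}

int main()
{
    int h_in[32], h_out = 0;
    for (int i = 0; i < 32; ++i)
        h_in[i] = i;

    int *d_in = NULL, *d_out = NULL;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    // Exactly one full warp, so the full mask in SHFL_XOR is valid.
    warp_sum<<<1, 32>>>(d_in, d_out);

    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("warp sum = %d\n", h_out);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Built with nvcc for sm_52 or sm_61, this should print "warp sum = 496" (the sum of 0..31). If a kernel shuffles inside divergent code, the mask has to name only the lanes that actually take part, so a blanket 0xffffffff would not be safe there.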
ghostwalker: -d 0 or -d 1