Epsylon,
Did you try CUDA 9 RC yet? Looks like __shfl was deprecated in favor of something else. There's no programming guide for v9 yet, so I don't know for sure.
I'm giving it a shot to see whether they improved code generation for sm_61.
Almost all algos run slower when compiled for sm_61 than when compiled for sm_52, which is a bit odd.
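
For reference on the __shfl question: in CUDA 9 the replacements are the _sync shuffle variants (__shfl_sync, __shfl_up_sync, __shfl_down_sync, __shfl_xor_sync), which take a lane mask as an extra first argument. Below is a minimal sketch of a warp sum that builds on either toolkit. The names warp_sum and kernel are made up for illustration, and the full-warp mask 0xffffffff assumes all 32 lanes are active at the call.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sum one int per lane across a full 32-lane warp; lane 0 ends up with the total.
// warp_sum is a hypothetical helper, not taken from any miner's code.
__device__ int warp_sum(int v) {
    for (int offset = 16; offset > 0; offset >>= 1) {
#if CUDART_VERSION >= 9000
        // CUDA 9+: _sync variant, first argument is the mask of participating lanes.
        v += __shfl_down_sync(0xffffffffu, v, offset);
#else
        // Older toolkits: the legacy intrinsic without a mask.
        v += __shfl_down(v, offset);
#endif
    }
    return v;
}

__global__ void kernel(int *out) {
    int total = warp_sum(threadIdx.x);   // lanes contribute 0..31
    if (threadIdx.x == 0) *out = total;  // only lane 0 holds the full sum
}

int main() {
    int *d_out = nullptr;
    int h_out = 0;
    cudaMalloc(&d_out, sizeof(int));
    kernel<<<1, 32>>>(d_out);            // exactly one warp
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d\n", h_out);               // 0 + 1 + ... + 31 = 496
    cudaFree(d_out);
    return 0;
}
```

If only part of the warp reaches the shuffle, the mask has to name exactly the lanes that do, so existing kernels with divergent branches around __shfl calls may need more than a search-and-replace.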