Yes... sm_61 emulates the sm_60 (P100) fp16 stuff, and when it's used (looks like it is sometimes) it's slower than the normal ops generated for sm_5x... So nothing has changed about that for the GTX (and mining) cards since CUDA 8.
ghostwalker: -d 0 or -d 1
Oh, I didn't know that. The poor performance makes sense then, since fp16 on the P100 is a lot different from fp16 on the GTX family. I hope they change the emulation somehow in the future.
I'm trying to learn the new tricks using the CUDA 9 RC. I think the cooperative groups approach might be interesting for a few algos.
I'll send you a PM if I discover anything useful for ccminer, but don't expect too much, since I don't have even 20% of your coding expertise.

Congrats and keep up the good work.
Rod