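Yes, and there is also a CUDA intrinsic, __ffsll, that returns the position of the first (least significant) set bit, which could be used to speed up the public key check. Note that it scans from the low end; if what you want is the number of leading zeros, the intrinsic is __clzll.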
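To make the difference between the two bit scans concrete, here is a small host-side sketch. It uses the GCC/Clang builtins as stand-ins for the CUDA intrinsics (that substitution is my assumption, on the device you would call __ffsll / __clzll directly), and matching_prefix_bits is a hypothetical helper, not something from the project.

```cpp
#include <cstdint>

// 1-based index of the lowest set bit, 0 when x == 0
// (same convention as CUDA's __ffsll).
static int first_set_bit(uint64_t x) {
    return __builtin_ffsll(static_cast<long long>(x));
}

// Number of leading zero bits. The GCC/Clang builtin is undefined for 0,
// so return 64 explicitly, which is what CUDA's __clzll gives for 0.
static int leading_zeros(uint64_t x) {
    return x ? __builtin_clzll(x) : 64;
}

// Hypothetical helper: how many of the top bits of two 64-bit words agree.
// XOR leaves a 1 at the first differing bit, so counting leading zeros
// of the XOR gives the length of the common prefix.
static int matching_prefix_bits(uint64_t a, uint64_t b) {
    return leading_zeros(a ^ b);
}
```

Something like matching_prefix_bits(hashWord, targetWord) would tell you in one instruction how far a candidate matches, instead of comparing bit by bit.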
Thanks for the information.
CPU profiles of the last release:
...
A note:
ModInv <-> ModMulK1
6% <-> 13% (uncompressed)
10.3% <-> 18.9% (compressed)
It seems to me that ModInv should be much lower than 50% of ModMulK1. Are you sure you wouldn't gain a significant advantage from using more than 256 elements per batch? Why not try 1024 or 4096?
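For reference, this is the reasoning behind the question, as a minimal sketch of batch inversion (Montgomery's trick). It is not the project's code: it works on a toy 61-bit prime instead of the 256-bit secp256k1 field, relies on the GCC/Clang unsigned __int128 extension, and the names mulmod, invmod and batch_invert are my own. The point is only the operation count: one inversion plus 3*(n-1) multiplications for n elements, so the inversion's share per element shrinks as the batch grows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy prime 2^61 - 1, standing in for the secp256k1 field prime (assumption).
static const uint64_t P = 2305843009213693951ULL;

// Modular multiplication through a 128-bit intermediate (GCC/Clang extension).
static uint64_t mulmod(uint64_t a, uint64_t b) {
    return static_cast<uint64_t>((unsigned __int128)a * b % P);
}

// Single modular inverse via Fermat's little theorem: a^(P-2) mod P.
// This plays the role of the expensive ModInv call.
static uint64_t invmod(uint64_t a) {
    uint64_t result = 1, e = P - 2;
    while (e) {
        if (e & 1) result = mulmod(result, a);
        a = mulmod(a, a);
        e >>= 1;
    }
    return result;
}

// Inverts every element of v in place using ONE invmod call
// and 3*(n-1) mulmod calls. All elements must be nonzero mod P.
static void batch_invert(std::vector<uint64_t>& v) {
    const std::size_t n = v.size();
    if (n == 0) return;

    // prefix[i] = v[0] * v[1] * ... * v[i]   (n-1 multiplications)
    std::vector<uint64_t> prefix(n);
    prefix[0] = v[0];
    for (std::size_t i = 1; i < n; ++i) prefix[i] = mulmod(prefix[i - 1], v[i]);
    assert(prefix[n - 1] != 0);

    // The only inversion: inverse of the product of all elements.
    uint64_t inv = invmod(prefix[n - 1]);

    // Walk backwards, peeling off one element per step (2*(n-1) multiplications).
    for (std::size_t i = n - 1; i > 0; --i) {
        const uint64_t vi = v[i];
        v[i] = mulmod(inv, prefix[i - 1]);  // inverse of v[i]
        inv = mulmod(inv, vi);              // now the inverse of prefix[i-1]
    }
    v[0] = inv;
}
```

With this scheme the cost per element is roughly 3 multiplications plus 1/n of an inversion, so going from 256 to 1024 or 4096 elements should push the ModInv share down further, at the price of more memory for the prefix products.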