You probably configured the grid incorrectly. Nsight shows that 4 blocks are running simultaneously, which is why I use a grid of 88*128. And yes, the speed is roughly correct, since it matches the number of DPs (distinguished points) accumulated over a given period of time.
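The DP-based sanity check mentioned above can be sketched as follows. This is only a rough illustration, assuming a distinguished point is one whose lowest `dp_bits` bits are zero, so that one DP turns up every 2^dp_bits jumps on average. The function name and the sample numbers are hypothetical and not taken from any particular program.

```python
def estimate_speed(dp_count, dp_bits, elapsed_seconds):
    """Estimate jumps (keys) per second from the number of DPs collected.

    Assumption: each jump lands on a distinguished point with
    probability 1 / 2**dp_bits, so dp_count DPs correspond to roughly
    dp_count * 2**dp_bits jumps.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    expected_jumps = dp_count * (1 << dp_bits)
    return expected_jumps / elapsed_seconds


# Hypothetical run: 120,000 DPs with a 20-bit DP mask in 60 seconds.
speed = estimate_speed(dp_count=120_000, dp_bits=20, elapsed_seconds=60.0)
print(f"{speed / 1e9:.2f} Gk/s")
```

If the figure reported by the program is far from this estimate over a long enough interval, the reported speed or the DP settings are probably wrong.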
Nah. After a closer look at the specs of the 1660S, I see it has a higher SM clock frequency and a wider memory bus than the card I was comparing against, so that explains some things.
@kTimesG you have written more than once that your program is several times faster than JLP's and other clones. Can it really reach 2 Gk/s on a 1660 Super?
I have no idea, I can't test on that card. I can currently squeeze out 6.2 Gk/s on an RTX 4090, but some users here claim they can reach 8 Gk/s or more. I think RetiredCoder might have an even faster version. Technically it is plausible; it really depends on how well the kernel is implemented. And as I said some time ago: if someone manages to fully parallelize the inversion, we could see a doubling in speed.

That b***h is really time-consuming.
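For readers wondering why the inversion is so hard to parallelize, here is a minimal sketch. It assumes the inversion in question is the modular inversion needed for each point addition, and that it is amortized with Montgomery's batch-inversion trick. That is an assumption about the kernels discussed here, not something stated in this thread. The name `batch_inverse` is hypothetical; `P` is the secp256k1 field prime.

```python
P = 2**256 - 2**32 - 977  # secp256k1 field prime


def batch_inverse(values, p=P):
    """Invert every element of `values` mod p using ONE modular inversion.

    Montgomery's trick: build running products, invert the final product
    once, then walk backwards to peel off the individual inverses.
    """
    n = len(values)
    if n == 0:
        return []

    # Forward pass: prefix[i] = values[0] * ... * values[i] mod p
    prefix = [0] * n
    acc = 1
    for i, v in enumerate(values):
        if v % p == 0:
            raise ZeroDivisionError("cannot invert zero")
        acc = acc * v % p
        prefix[i] = acc

    # The single expensive step; inherently serial (needs Python 3.8+).
    inv_acc = pow(acc, -1, p)

    # Backward pass: each step depends on the previous one.
    out = [0] * n
    for i in range(n - 1, 0, -1):
        out[i] = inv_acc * prefix[i - 1] % p
        inv_acc = inv_acc * values[i] % p
    out[0] = inv_acc
    return out


vals = [2, 3, 5, 7, P - 1]
invs = batch_inverse(vals)
print(all(v * i % P == 1 for v, i in zip(vals, invs)))
```

In this sketch the cost per element drops to about three multiplications, but the one remaining inversion and the two chained passes are sequential, which is the part that resists being spread across threads.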