RTX 3060 (non-Ti) = 2,400 MKey/s (2.4 GKey/s)
RTX 3060 Ti = 2,800 MKey/s (2.8 GKey/s)
RTX 3070 = 3,100 MKey/s (3.1 GKey/s)
RTX 3090 = 5,250 MKey/s (5.25 GKey/s)
RTX 4090 = 7,750 MKey/s (7.75 GKey/s) (yes, you read that right lol)
Great work

Was your speed test run without power limits?
I made a few fixes and now see a small speedup.
Now RTX 3070 (PL 170 W) = 3,217 MKey/s
I'm only calculating dp32 for #125.
https://gcdnb.pbrd.co/images/8QF3YGOUdTJS.png?o=1
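
To put a number on "small speedup", here is a rough Python sketch comparing the two RTX 3070 figures from this thread. The helper names are made up for illustration, and the comparison assumes both runs are otherwise comparable, which is not confirmed (the quoted 3,100 MKey/s figure does not state its power limit).

```python
def to_gkeys(mkeys: float) -> float:
    """Convert a rate in MKey/s to GKey/s."""
    return mkeys / 1000.0


def speedup_percent(old_mkeys: float, new_mkeys: float) -> float:
    """Relative change from the old rate to the new rate, in percent."""
    return (new_mkeys - old_mkeys) / old_mkeys * 100.0


# Figures taken from the posts above (both in MKey/s).
quoted_3070 = 3100.0  # quoted list, power limit not stated
tuned_3070 = 3217.0   # after the fixes, at a 170 W power limit

print(f"{to_gkeys(tuned_3070):.3f} GKey/s")
print(f"{speedup_percent(quoted_3070, tuned_3070):.1f}% faster")
```

That works out to roughly 3.8% over the quoted number, while running at a 170 W power limit.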