How did you come up with the total number of ops at 2^67.783?? That is an odd number that I have not seen before. Interesting.
1.72 * sqrt(2^134) ≈ 2^67.78
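The arithmetic behind that figure is easiest to check in log2, so the 134-bit interval never has to be materialized as a huge float. A minimal sketch (the constant 1.72 is simply the one quoted above; the classic two-kangaroo analysis would use 2 instead):

```python
from math import log2

# Expected group operations for a kangaroo-style search over a b-bit
# interval: C * sqrt(2**b). Taking log2: log2(C) + b/2.
b = 134
C = 1.72  # constant quoted in the post above

log2_ops = log2(C) + b / 2
print(f"2^{log2_ops:.2f}")  # ~2^67.78
```

The tiny gap between 2^67.782 and the quoted 2^67.783 is just rounding of the constant.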
I did not include the DP overhead, because it depends on a lot of factors.
For all the haters and frustrated multiple-account-owners pretending I'm Digaran and that I have multiple accounts and so on (even though any mod can confirm the contrary): get a life.
Also, as I've stated many times, there are so many more ways to improve efficiency beyond what is already out there publicly that only a limited mind could think there's nothing left to drastically improve on.
Here's a basic kicker: JLP doesn't use the three-kangaroo method, so there's roughly a 15% efficiency gain basically for free just by implementing that one detail. But guys still attack it, claiming check.cpp shows some fake speed and hence the entire algo is broken. Never mind the absurdities around the "birthday paradox". I'll grab my popcorn and stay silent next time I read such a wonderful exposition like the earlier one.
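As a rough sanity check of that "~15% for free" figure, compare the constants: the classic two-kangaroo analysis costs about 2·sqrt(N) expected ops, versus the 1.72·sqrt(N) used above for the improved variant. (These exact constants are an assumption here; the precise values depend on the analysis.)

```python
# Constant in front of sqrt(N) expected group operations:
C2 = 2.00  # classic two-kangaroo analysis (assumed baseline)
C3 = 1.72  # improved multi-kangaroo constant, as used above

fewer_ops = 1 - C3 / C2   # fraction of ops saved
speedup = C2 / C3 - 1     # equivalent throughput gain
print(f"{fewer_ops:.0%} fewer ops, {speedup:.0%} faster")
```

That works out to about 14% fewer expected ops, i.e. about a 16% speedup, consistent with the ~15% claim.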
Here's another kicker: Kangaroo doesn't rely on the birthday paradox; that would be Gaudry-Schost. Seriously, the paper itself states that this is the difference between the two: they have totally different analysis methods. But what can I expect from someone who doesn't understand the principles and copy-pastes whatever bogus shit ChatGPT spills out? Do they think I can't tell the style of GPT output, while they accuse me of using it? Ugly as hell.
In other news, congrats if you really do get 8+ Gk/s on a single 4090. You must have found some real method to speed up the CUDA code; I have a hunch you used Montgomery or something to achieve this. I feel there's still a lot of juice left to squeeze out of those teraflops, but it's not exactly like spinning a magic circle, when we also have a job or a life to attend to.
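For what it's worth, one plausible reading of "Montgomery" in that context is Montgomery's simultaneous-inversion trick: N modular inversions for the price of one inversion plus ~3N multiplications, which is what makes batched affine point additions cheap on a GPU. A minimal sketch over secp256k1's field prime (the batching itself, not any particular implementation; names are illustrative):

```python
# secp256k1 field prime
P = 2**256 - 2**32 - 977

def batch_inverse(xs, p=P):
    """Invert every element of xs mod p using a single real inversion
    (Montgomery's simultaneous-inversion trick)."""
    n = len(xs)
    prefix = [1] * (n + 1)
    for i, x in enumerate(xs):        # prefix[i+1] = x0 * x1 * ... * xi mod p
        prefix[i + 1] = prefix[i] * x % p
    inv = pow(prefix[n], -1, p)       # the one expensive inversion
    out = [0] * n
    for i in range(n - 1, -1, -1):    # peel off one factor at a time
        out[i] = prefix[i] * inv % p  # = (x0*...*x_{i-1}) / (x0*...*x_i)
        inv = inv * xs[i] % p
    return out

xs = [3, 7, 12345]
assert all(x * ix % P == 1 for x, ix in zip(xs, batch_inverse(xs)))
```

Amortizing the inversion like this across a whole batch of kangaroo steps is the standard way those multi-Gk/s numbers become plausible, whatever the actual code does.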