While we're waiting for the RTX 5090, here's a really fast jumper for 64-bit CPUs.
I’m working on a Pollard’s Kangaroo implementation for secp256k1 and I’d love to achieve high performance for point arithmetic on CPU (in particular, large-scale multiplications of G and other points). Could you please share or publish your HPC‐optimized code and techniques? I’m especially interested in any optimized field/group operations, batched inversions, or other CPU‐level optimizations you’ve used to speed up these computations.
Why would you need large-scale multiplications of G? They're only needed to create the initial kangaroos. Anyway, you can use libsecp256k1 for that, or extract the relevant code from it, like I did. You can also further optimize the code I posted, for example by always keeping Y2 in negated form and caching the jump index. Beyond that, I don't know if there's much more to do on a CPU: the batched inversion I presented is already the "parallel" tree-based version (hence the tradeoff of double-size tree storage, to avoid race conditions and read/write overlaps), not the "serial" version. Also, an RTX 4090 is 1000 times faster than a single core of a high-end CPU, so my optimizations start there, not with CPU code.
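For anyone unfamiliar with the tree-based variant, here is a minimal Python sketch of the idea. It is not the code I posted. The function name `batch_inverse_tree` and the heap-style array layout are assumptions made for illustration. It builds a product tree in an array of size 2n (the double-size storage mentioned above), inverts only the root, then walks back down so that each node's inverse is its parent's inverse times its sibling's product.

```python
# secp256k1 field prime
P = 2**256 - 2**32 - 977

def batch_inverse_tree(values, p=P):
    """Invert every element of `values` mod p using one modular inversion.

    Illustrative sketch: a product tree stored heap-style in a list of
    size 2n. Leaves live at indices n..2n-1, node i has children 2i and
    2i+1, and index 0 is unused. Requires Python 3.8+ for pow(x, -1, p).
    """
    n = len(values)
    if n == 0:
        return []
    tree = [1] * (2 * n)
    for i, v in enumerate(values):
        tree[n + i] = v % p

    # Upward pass: each internal node holds the product of its children.
    for i in range(n - 1, 0, -1):
        tree[i] = tree[2 * i] * tree[2 * i + 1] % p

    # The single expensive inversion, applied to the product of everything.
    # Raises ValueError if any input is 0 mod p.
    tree[1] = pow(tree[1], -1, p)

    # Downward pass: tree[i] already holds the inverse of its old product,
    # so inverse(child) = inverse(parent) * product(sibling).
    for i in range(1, n):
        left, right = tree[2 * i], tree[2 * i + 1]
        tree[2 * i] = tree[i] * right % p
        tree[2 * i + 1] = tree[i] * left % p

    return tree[n:]


if __name__ == "__main__":
    xs = [3, 5, 7, 11, 13]
    inv = batch_inverse_tree(xs)
    print(all(x * y % P == 1 for x, y in zip(xs, inv)))
```

This sketch runs sequentially, but in each pass a node only reads or writes itself and its two children, which is what lets a parallel version process independent nodes concurrently without overlapping reads and writes. The multiplication count is about 3(n-1) plus one inversion, the same as the serial prefix-product version, so the extra storage buys parallelism rather than fewer operations.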