Post
Topic
Board Bitcoin Discussion
Re: Bitcoin puzzle transaction ~32 BTC prize to who solves it
by
Bram24732
on 06/04/2025, 06:48:04 UTC
Wait a minute. If he just said that the 3060 achieves around 2300 Mkeys/s, how much does the 4090 achieve? Does it reach 8000 Mkeys/s (8 Gkeys/s) on the RTX 4090? In Cyclone GPU?

Why won't anyone share the fastest GPU code here? Are you hiding the best code for yourselves? It's all just empty talk and blah blah blah... :P

Because no one is obliged to put their work in the public domain for everyone to see. They spend their time and energy on it.
Because I am doing it right now :) The main target is to be twice as fast as KeyhuntCUDA, but that is possible only with PTX ASM. Also, if somebody knows a modular inverse algorithm faster than DRS62, let me know; this is the main goal for me. The other option is to naively write all of the code in PTX, which is impossible for me.

If you want it to be faster, you have to rethink the problem at a high level and tune for the specific device's characteristics. VanitySearch is more a port from CPU to CUDA than a problem thought through in parallel from the start, so all of its clones follow more or less the same philosophy.

Micro-optimizations with PTX may end up producing the same final SASS as plain C. You might lose a year optimizing some inverse only to find it runs slower than before on some random new GPU, and faster on another. I have 3 versions (SafeGCD, BinGCD and the one by RC), and each of them runs better or worse depending on whether I change a single line of code in a totally different part of the kernel source. So getting a perfect, faster kernel is more a game of luck: it depends on whether the compiler decides to spill an extra register just because you swapped two lines that are logically independent :)
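
For anyone following along who has not seen a binary GCD inverse, here is a minimal sketch of the textbook shift-and-subtract version in Python. It is not the BinGCD kernel mentioned above, and it is not SafeGCD or the RC variant. The function name `bingcd_inverse` and the constant `SECP256K1_P` are made up for this illustration. It assumes an odd modulus and an input that is coprime to it.

```python
# Illustration only: textbook binary extended GCD, not anyone's GPU kernel.
# SECP256K1_P is the secp256k1 field prime, 2^256 - 2^32 - 977.
SECP256K1_P = 2**256 - 2**32 - 977


def bingcd_inverse(a, p=SECP256K1_P):
    """Return a^-1 mod p for odd p, using only shifts, adds and subtracts."""
    assert p > 1 and p & 1, "modulus must be odd"
    a %= p
    if a == 0:
        raise ValueError("0 is not invertible")

    u, v = a, p
    x1, x2 = 1, 0  # invariants: x1*a == u (mod p), x2*a == v (mod p)

    while u != 1 and v != 1:
        # Strip factors of two from u, halving x1 modulo p alongside it.
        while u & 1 == 0:
            u >>= 1
            x1 = x1 >> 1 if x1 & 1 == 0 else (x1 + p) >> 1
        # Same for v and x2.
        while v & 1 == 0:
            v >>= 1
            x2 = x2 >> 1 if x2 & 1 == 0 else (x2 + p) >> 1
        if u == v:
            # Equal odd values greater than 1 mean gcd(a, p) > 1.
            raise ValueError("a is not invertible modulo p")
        # Both are odd here, so the difference is even on the next pass.
        if u > v:
            u -= v
            x1 -= x2
        else:
            v -= u
            x2 -= x1

    return x1 % p if u == 1 else x2 % p
```

Calling `bingcd_inverse(3, 7)` returns 5, since 3 * 5 = 15 ≡ 1 (mod 7). The data-dependent branches in the loop are the kind of thing that behaves very differently on a GPU than on a CPU, which is part of why the variants above trade places from one device to the next.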

Full PTX operations (including inverse…) mitigate the compiler cat-and-mouse game a lot. In my experience, 96 registers is the sweet spot for occupancy on this kernel; 64 is too aggressive. It might be luck, but my kernel works quite well on all architectures with minimal tweaks.
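
To put rough numbers on the 96 vs 64 register trade-off, here is a back-of-the-envelope sketch in Python. It only models the register file limit. The defaults (65536 32-bit registers per SM, 48 resident warps per SM) are assumptions that match recent consumer NVIDIA parts as far as I know; other architectures differ. It ignores allocation granularity, shared memory and block size limits, so treat the output as an upper bound rather than what the hardware will actually schedule. The function names are hypothetical.

```python
# Rough model: how many warps fit on one SM if registers are the only limit.
# All defaults are assumptions for illustration, not measured values.

def register_limited_warps(regs_per_thread,
                           regs_per_sm=65536,      # assumed register file size
                           warp_size=32,
                           max_warps_per_sm=48):   # assumed hardware cap
    """Resident warps per SM when only the register budget is considered."""
    if regs_per_thread <= 0:
        raise ValueError("regs_per_thread must be positive")
    warps = regs_per_sm // (regs_per_thread * warp_size)
    return min(warps, max_warps_per_sm)


def register_limited_occupancy(regs_per_thread, max_warps_per_sm=48, **kwargs):
    """Fraction of the SM's warp slots that the register budget allows."""
    warps = register_limited_warps(
        regs_per_thread, max_warps_per_sm=max_warps_per_sm, **kwargs)
    return warps / max_warps_per_sm


if __name__ == "__main__":
    for regs in (64, 96, 128):
        print(regs, register_limited_warps(regs),
              round(register_limited_occupancy(regs), 3))
```

With these assumptions, 64 registers per thread allows 32 warps and 96 allows 21. The model says nothing about spills: forcing a kernel down to 64 registers buys the extra warps only if the compiler does not push live values out to local memory, which is the "too aggressive" case described above. For real numbers, the CUDA runtime's `cudaOccupancyMaxActiveBlocksPerMultiprocessor` reports what a compiled kernel actually gets.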