That's great and all, but I'm not sure your estimates fit the context of what you've quoted.
I'm well aware that there are a lot of optimizations for scalar multiplication, and it's great that you have CUDA code that does it as fast as you state (which sounds miraculous in itself, to be honest). But there's one thing you may have missed: the private keys are not random, they're fed in as input. So maybe redo the math after factoring in reading the private keys from GPU memory.

Not sure why you had to bring something from page 520 of that topic here though. It'd be much more interesting if you updated us on 135 progress instead.

Yeah, maybe I missed some context; I didn't check carefully. But anyway, reading private keys at 0.5 GK/s is only 16 GB/s (32 bytes per key), which is almost nothing next to a 4090's memory bandwidth.
I don't read the "blah-blah" thread (it's disgusting), but sometimes when I'm bored I read your messages because you have some skills.

You need to write 16 GB/s as well, since it's not a one-shot job.
So to compute even that minimum of 250 MK/s, the first wall to get past is reading a file from disk at 8 GB/s or more, before even talking about GPUs and their memory bandwidth.
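
For anyone who wants to check the figures thrown around above, here is a rough sketch of the arithmetic. It assumes each private key is stored as 32 raw bytes (256 bits) and uses decimal units (1 GB = 1e9 bytes). The function name is made up for illustration, and the 1008 GB/s value is the nominal RTX 4090 memory bandwidth, used only as a ballpark reference, not a measured number.

```python
KEY_BYTES = 32  # assumption: one 256-bit private key stored as 32 raw bytes


def key_stream_gb_per_s(keys_per_second: float, key_bytes: int = KEY_BYTES) -> float:
    """Bandwidth in GB/s (1 GB = 1e9 bytes) needed to move keys at the given rate."""
    return keys_per_second * key_bytes / 1e9


# Rates quoted in the thread.
read_fast = key_stream_gb_per_s(0.5e9)  # 0.5 GK/s
read_min = key_stream_gb_per_s(250e6)   # 250 MK/s

# If every key is read and a same-sized result is written back,
# the GPU memory traffic doubles.
read_plus_write = 2 * read_fast

# Assumption: nominal RTX 4090 memory bandwidth, for a rough comparison only.
GPU_BANDWIDTH_GB_PER_S = 1008.0
gpu_share = read_plus_write / GPU_BANDWIDTH_GB_PER_S

print(f"read at 0.5 GK/s:   {read_fast:.1f} GB/s")
print(f"read at 250 MK/s:   {read_min:.1f} GB/s")
print(f"read + write:       {read_plus_write:.1f} GB/s")
print(f"share of GPU bandwidth: {gpu_share:.1%}")
```

Under these assumptions both sides of the argument hold at once: read plus write traffic is a small slice of GPU memory bandwidth, while 8 GB/s of sustained reads is a real hurdle if the keys have to come from a file on disk.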