Say for something like XopMC/CudaBrainSecp on GPU, where we have to do a point multiplication for every key, do you know what the current best implementation is? Or do you have any ideas to make it faster? Here we can't use point additions, as the private keys are unrelated to each other.
I'm not a big fan of point multiplications. But the one in that repo is doing it through projective coordinates. Big boys do it via Jacobian coordinates, since that uses fewer field operations (faster). For high-volume multiplications, much larger window sizes can be used, for even fewer operations. Beyond that, further optimizations depend on how much data can be reused or precomputed.
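To make the Jacobian and window ideas concrete, here is a rough Python sketch (not GPU code, and not how CudaBrainSecp or libsecp256k1 actually lay things out). It precomputes a table of multiples of G for every 4-bit window, then computes k*G with additions only, accumulating in Jacobian coordinates so there is a single modular inversion at the very end. The names `W`, `TABLE`, `mul_g` and the helper functions are all made up for illustration, and the window width of 4 is just an assumption to keep the table small.

```python
# Sketch only: fixed-base windowed multiplication on secp256k1 (y^2 = x^3 + 7).
P = 2**256 - 2**32 - 977
N = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141
G = (0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
     0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8)
W = 4                  # window width in bits (hypothetical choice)
WINDOWS = 256 // W
INF = (1, 1, 0)        # Jacobian point at infinity (Z == 0)

def affine_add(p1, p2):
    """Textbook affine addition (one inversion per call); None is infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        if (y1 + y2) % P == 0:
            return None
        lam = 3 * x1 * x1 * pow(2 * y1 % P, -1, P) % P
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P, -1, P) % P
    x3 = (lam * lam - x1 - x2) % P
    return x3, (lam * (x1 - x3) - y1) % P

def build_table():
    """table[i][j-1] = j * 2^(W*i) * G, stored in affine coordinates."""
    table, base = [], G
    for _ in range(WINDOWS):
        row, acc = [], None
        for _ in range(2**W - 1):
            acc = affine_add(acc, base)
            row.append(acc)
        table.append(row)
        base = affine_add(acc, base)   # (2^W - 1)*base + base = 2^W * base
    return table

def jac_double(pt):
    """Jacobian doubling for a curve with a = 0; no inversion needed."""
    X, Y, Z = pt
    if Z == 0 or Y == 0:
        return INF
    S = 4 * X * Y * Y % P
    M = 3 * X * X % P
    X3 = (M * M - 2 * S) % P
    Y3 = (M * (S - X3) - 8 * pow(Y, 4, P)) % P
    return X3, Y3, 2 * Y * Z % P

def jac_add_affine(pt, q):
    """Mixed addition: Jacobian accumulator + affine table entry, no inversion."""
    X1, Y1, Z1 = pt
    x2, y2 = q
    if Z1 == 0:
        return x2, y2, 1
    Z1Z1 = Z1 * Z1 % P
    H = (x2 * Z1Z1 - X1) % P
    R = (y2 * Z1 * Z1Z1 - Y1) % P
    if H == 0:                         # same x: either a doubling or infinity
        return jac_double(pt) if R == 0 else INF
    H2 = H * H % P
    H3 = H * H2 % P
    V = X1 * H2 % P
    X3 = (R * R - H3 - 2 * V) % P
    Y3 = (R * (V - X3) - Y1 * H3) % P
    return X3, Y3, Z1 * H % P

def to_affine(pt):
    """Convert back with the only modular inversion of the whole multiplication."""
    X, Y, Z = pt
    if Z == 0:
        return None
    zi = pow(Z, -1, P)
    return X * zi * zi % P, Y * zi**3 % P

TABLE = build_table()

def mul_g(k):
    """k*G using one table lookup and at most one mixed addition per window."""
    k %= N
    acc = INF
    for i in range(WINDOWS):
        digit = (k >> (W * i)) & (2**W - 1)
        if digit:
            acc = jac_add_affine(acc, TABLE[i][digit - 1])
    return to_affine(acc)
```

With `W = 4` that is at most 64 mixed additions and no doublings per key; a wider window means fewer additions but a bigger table, so how far you can push it presumably depends on how much memory you can afford to spend on the table.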
A few of these concepts are already implemented in stuff like libsecp256k1. However, for GPUs, working directly with 8 x 32-bit numbers is the fastest possible arithmetic (while for CPUs, the 5x52 representation is much faster instead, due to the hardware latency of carry-adding the big integer limbs).
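A small Python sketch of the difference between the two limb layouts, just to show where the carries go. It only models addition, it is not taken from libsecp256k1 or any GPU code, and all function names here are hypothetical. With 8 x 32-bit limbs every limb is full, so each limb addition has to wait for the carry from the previous one (on a GPU this would map to a hardware add-with-carry chain). With 5 x 52-bit limbs stored in 64-bit words there are spare bits in each word, so limb additions can be done independently and the carries propagated later.

```python
MASK32 = 2**32 - 1
MASK52 = 2**52 - 1

def to_limbs(x, bits, count):
    """Split an integer into `count` little-endian limbs of `bits` bits each."""
    return [(x >> (bits * i)) & ((1 << bits) - 1) for i in range(count)]

def from_limbs(limbs, bits):
    """Recombine limbs; also works when limbs temporarily exceed `bits` bits."""
    return sum(l << (bits * i) for i, l in enumerate(limbs))

def add_8x32(a, b):
    """Saturated limbs: each step consumes and produces a carry (serial chain)."""
    out, carry = [], 0
    for x, y in zip(a, b):
        t = x + y + carry
        out.append(t & MASK32)
        carry = t >> 32
    return out, carry

def add_5x52(a, b):
    """Unsaturated limbs: no carry handling, the spare bits absorb the overflow."""
    return [x + y for x, y in zip(a, b)]

def normalize_5x52(limbs):
    """Propagate the deferred carries so every limb fits in 52 bits again."""
    out, carry = [], 0
    for l in limbs:
        t = l + carry
        out.append(t & MASK52)
        carry = t >> 52
    return out, carry
```

For example, `add_5x52(to_limbs(x, 52, 5), to_limbs(y, 52, 5))` gives limbs that may be wider than 52 bits but still fit in a 64-bit word, and `normalize_5x52` is only needed once several such additions have piled up.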