The way to parallelize these kinds of searches is to work from several different private-key starting points in parallel, rather than trying to do several things at once at the low level. A fast FPGA implementation would have several of these curve adders side by side, each set up to work on different inputs.
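To make the idea concrete, here is a rough software sketch of the multi-lane approach. It is only an illustration under assumptions: the thread does not say which curve is in use, so this uses a tiny textbook curve (y^2 = x^3 + 2x + 2 over F_17, generator (5, 1), order 19), and the names `ec_add`, `ec_mul` and `search` are made up for the example. Each "lane" stands in for one curve adder; the inner loop runs them one after another, whereas on an FPGA they would all step in the same clock cycle.

```python
# Toy curve (an assumption for illustration): y^2 = x^3 + 2x + 2 over F_17.
# G = (5, 1) generates a group of prime order 19.
P_MOD, A = 17, 2
G = (5, 1)

def ec_add(p1, p2):
    """Affine point addition; None represents the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None  # p2 is the negative of p1
    if p1 == p2:
        # Tangent slope for doubling.
        s = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:
        # Chord slope for adding two distinct points.
        s = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (s * s - x1 - x2) % P_MOD
    y3 = (s * (x1 - x3) - y1) % P_MOD
    return (x3, y3)

def ec_mul(k, point):
    """Double-and-add scalar multiplication, used once per lane for setup."""
    result, addend = None, point
    while k:
        if k & 1:
            result = ec_add(result, addend)
        addend = ec_add(addend, addend)
        k >>= 1
    return result

def search(target, lanes, keys_per_lane):
    """Search keys 1..lanes*keys_per_lane for the key whose public point is target.

    Lane i starts at private key 1 + i*keys_per_lane and advances by one
    point addition (+G) per step, independently of the other lanes.
    """
    state = []
    for i in range(lanes):
        start = 1 + i * keys_per_lane
        state.append((start, ec_mul(start, G)))
    for _ in range(keys_per_lane):
        next_state = []
        for key, point in state:  # in hardware these lanes run side by side
            if point == target:
                return key
            next_state.append((key + 1, ec_add(point, G)))
        state = next_state
    return None

if __name__ == "__main__":
    target = ec_mul(15, G)
    print(search(target, lanes=3, keys_per_lane=6))
```

Three lanes of six keys each cover keys 1 to 18 here; the lane count and range split are arbitrary choices for the toy curve. The point is that each lane only ever does one addition per step on its own data, so there is nothing to coordinate between lanes apart from the final "found it" signal.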
Hello,
Thanks for all the tips! Yes, I also think this is the way. I'm currently running 9 in parallel per chip and need a huge heatsink, even at a slow 50 MHz clock.