Next steps:
- Optimize CPU/GPU exchange
- Add missing ECC optimizations (some symmetries and endomorphism)
- Add support for the GPU funnel shift, which should speed up SHA (but I need to find a board with compute capability 3.5 or higher; mine is 3.0).
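
For context on the funnel-shift item: the SHA-256 round functions are built from 32-bit rotations, and a rotation is a funnel shift with the same word supplied as both halves. Below is a minimal sketch in plain C that emulates the operation on the CPU. The helper names (`funnel_shift_r`, `rotr32`, `rotr32_classic`, `sha256_big_sigma0`) are made up for illustration and are not taken from the project's code; on the GPU the emulated helper would presumably be replaced by the CUDA intrinsic `__funnelshift_r(lo, hi, shift)`.

```c
#include <stdint.h>

/* Emulated 32-bit funnel shift right (hypothetical helper name):
 * concatenate hi:lo into a 64-bit value, shift it right by (shift & 31),
 * and keep the low 32 bits. */
static uint32_t funnel_shift_r(uint32_t lo, uint32_t hi, uint32_t shift) {
    uint64_t wide = ((uint64_t)hi << 32) | (uint64_t)lo;
    return (uint32_t)(wide >> (shift & 31u));
}

/* Rotate right expressed as a single funnel shift: both halves are x. */
static uint32_t rotr32(uint32_t x, uint32_t n) {
    return funnel_shift_r(x, x, n);
}

/* Classic rotate built from two shifts and an OR, valid for n in 1..31.
 * This is the form a funnel shift is meant to replace. */
static uint32_t rotr32_classic(uint32_t x, uint32_t n) {
    return (x >> n) | (x << (32u - n));
}

/* SHA-256 "big sigma 0" function: three rotations XORed together. */
static uint32_t sha256_big_sigma0(uint32_t x) {
    return rotr32(x, 2) ^ rotr32(x, 13) ^ rotr32(x, 22);
}
```

The classic form costs two shifts plus an OR per rotation, while the funnel-shift form is a single operation, which is where the expected SHA speedup would come from. How much is gained in practice depends on the hardware and on what the compiler already generates.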
Have you already implemented all of steps 1, 2, and 3, or is there still room for further improvement?