Hello,
I would like to thank arulbero, who gave me by PM a great tip to improve speed using some symmetries.

I missed this, shame on me.
It will save a few modular multiplications. However, only ~40% of the CPU time is spent on modular multiplication; the other 60% mainly goes to SHA, RIPE, Base58, ModInv and byte swapping, so I don't know if I can reach 2.0 MKey/s (x1.66).
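To illustrate what this kind of symmetry can look like, here is a minimal Python sketch. The exact tip is not spelled out above, so the two tricks shown here are assumptions about what is meant, not the actual code: (1) on secp256k1 the point for key n - k is the point for key k with y replaced by p - y, and (2) P + Q and P - Q share the same x difference, so one modular inversion serves both. The helper names (`point_add`, `scalar_mul`, `negate`, `add_and_sub`) are made up for this example.

```python
# secp256k1 parameters (curve y^2 = x^3 + 7 over the prime field of size P_FIELD)
P_FIELD = 2**256 - 2**32 - 977
N_ORDER = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141
G = (0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
     0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8)


def point_add(a, b):
    """Plain affine addition; None stands for the point at infinity."""
    if a is None:
        return b
    if b is None:
        return a
    (x1, y1), (x2, y2) = a, b
    if x1 == x2 and (y1 + y2) % P_FIELD == 0:
        return None  # a + (-a)
    if a == b:
        s = 3 * x1 * x1 * pow(2 * y1 % P_FIELD, -1, P_FIELD) % P_FIELD
    else:
        s = (y2 - y1) * pow((x2 - x1) % P_FIELD, -1, P_FIELD) % P_FIELD
    x3 = (s * s - x1 - x2) % P_FIELD
    y3 = (s * (x1 - x3) - y1) % P_FIELD
    return (x3, y3)


def scalar_mul(k, pt=G):
    """Reference double-and-add, only used here to check the shortcuts."""
    result = None
    while k:
        if k & 1:
            result = point_add(result, pt)
        pt = point_add(pt, pt)
        k >>= 1
    return result


def negate(pt):
    """Symmetry 1: the point for key (n - k) costs one subtraction, no multiplication."""
    x, y = pt
    return (x, (P_FIELD - y) % P_FIELD)


def add_and_sub(a, b):
    """Symmetry 2: return (a + b, a - b) using a single modular inversion.

    Assumes a != b and a != -b, so the x difference is non-zero.
    """
    (x1, y1), (x2, y2) = a, b
    inv = pow((x2 - x1) % P_FIELD, -1, P_FIELD)  # shared by both results
    results = []
    for yb in (y2, (P_FIELD - y2) % P_FIELD):   # b, then -b
        s = (yb - y1) * inv % P_FIELD
        x3 = (s * s - x1 - x2) % P_FIELD
        y3 = (s * (x1 - x3) - y1) % P_FIELD
        results.append((x3, y3))
    return results[0], results[1]


if __name__ == "__main__":
    k, i = 123456789, 5
    base, step = scalar_mul(k), scalar_mul(i)
    plus, minus = add_and_sub(base, step)
    print(plus == scalar_mul(k + i), minus == scalar_mul(k - i))
    print(negate(base) == scalar_mul(N_ORDER - k))
```

This needs Python 3.8 or later for `pow(x, -1, m)`. In a real search loop the inversion would presumably be batched across many points anyway, so the gain from sharing it is smaller than it looks here, which fits the remark that modular multiplication is only part of the total cost.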
For Linux (CPU side), I have to work on code generation optimization, but assembly using AT&T syntax drives me crazy.
As a reference for SHA and RIPE, you could look here:
https://github.com/klynastor/supervanitygen
I don't use Base58 in my code, because I only need the address in hex format, not Base58.
When will there be an OpenCL implementation?

EDIT: on CPU, 40% is used for ECC arithmetic; what about on GPU? I'm curious.