I wonder if there is a way to optimize it further, though. Do you know whether it makes use of SSE? Even more important than that, maybe there's a sequence of assembly instructions that would make repeated calls faster.
But since I use the secp256k1 curve only for testing and research, I do not care much about possible vulnerabilities and attacks.
The safest (not necessarily the fastest) secp256k1 implementation is the one used in Bitcoin Core. But I don't use it because I keep getting wrong answers when I do arithmetic with it. Maybe the privkey bytes are not being filled correctly or something.
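If the privkey bytes really are the problem, one thing worth ruling out is padding and byte order. Here is a minimal Python sketch, assuming the library wants the key as exactly 32 big-endian bytes (that is my understanding of Bitcoin Core's libsecp256k1, but check its headers). The helper name `privkey_to_bytes` is made up for illustration; only the curve order `n` is a published constant.

```python
# Order n of the secp256k1 group (published curve constant).
SECP256K1_N = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141


def privkey_to_bytes(k: int) -> bytes:
    """Serialize a private key as 32 big-endian bytes (hypothetical helper).

    Assumption: the consuming library expects a fixed 32-byte,
    big-endian buffer, so small keys must be left-padded with zeros.
    """
    # Valid private keys lie in the range 1 .. n-1.
    if not 1 <= k < SECP256K1_N:
        raise ValueError("private key must satisfy 1 <= k < n")
    # to_bytes pads on the left with zero bytes up to the requested length.
    return k.to_bytes(32, "big")


if __name__ == "__main__":
    # Print the serialized form of a small key to inspect the padding.
    print(privkey_to_bytes(1).hex())
```

A buffer that is shorter than 32 bytes, or filled in little-endian order, would give a different scalar and therefore different results from the arithmetic, which might explain the wrong answers.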