Would appreciate any advice or pointers to resources!
If your code does something like if (point_is_negative) { do_this } else { do_that }, the CPU’s branch predictor leaves breadcrumbs kinda like a trail of clues. And if your code goes if (y < 0) { do_something_faster() }, that’s a dead giveaway, no cap.
Modular inversion (for affine coordinates) is slow as hell (AF), and it’s often variable-time, which is a pain in the neck.
For beginners, the best possible version to check out is JLP SECP256K1. The secp256k1 API doesn’t hand out low-level methods, but this dude basically precomputes G points from 0..255 (both + and -) in batches of 512. I’m not 100% sure about the math behind it, but it’s gotta be faster. Cyclone (which uses his code) cranks out 42M keys/s on 12 threads and 7M keys/s on a single thread.
If you wanna see all the possible versions, peep AlexanderCurl’s GitHub. Dude’s got a solid collection.