I benchmarked every step of the brute-force loop, and most of the compute time goes into the secp256k1_ec_pubkey_create() function of the ECDSA curve library.
Its authors and maintainers have stated in numerous places that the Bitcoin libsecp256k1 library is designed to provide a secure framework for Bitcoin software, not to be the fastest way to do EC math. One of its main features is protection against side-channel attacks: operations on secret keys, such as public key creation, run in constant time and take few shortcuts, if any at all, so that nothing leaks through timing, branch prediction, and similar channels.
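To make the cost of "constant time" concrete, here is a minimal sketch in Python. It is not how libsecp256k1 is implemented; it uses plain integer addition as a stand-in for point addition (so the result of "multiplying" by k is just k) and only counts operations. The function names and the `bits` parameter are made up for illustration. Python integers are not constant time themselves, so this shows the idea of a fixed operation schedule, nothing more.

```python
def double_and_add(k, bits=256):
    """Plain double-and-add: the addition is skipped for every zero bit.

    Integer addition stands in for EC point addition (an assumption made
    purely for illustration). Returns (result, number_of_additions).
    """
    acc, addend, adds = 0, 1, 0
    for i in range(bits):
        if (k >> i) & 1:          # branch depends on the secret key bit
            acc += addend
            adds += 1
        addend += addend          # "doubling" step
    return acc, adds


def always_add(k, bits=256):
    """Same result, but exactly one addition per bit, whatever the bit is."""
    acc, addend, adds = 0, 1, 0
    for i in range(bits):
        bit = (k >> i) & 1
        acc += addend * bit       # branch-free select: adds 0 when bit is 0
        adds += 1
        addend += addend
    return acc, adds


if __name__ == "__main__":
    for key in (1, 0xFFFF, 2**256 - 1):
        print(hex(key)[:12], double_and_add(key)[1], always_add(key)[1])
```

The first version does less work for keys with few set bits, which is faster on average but makes the running time depend on the key. The second does the same amount of work for every key, which is the kind of trade-off a side-channel-resistant library accepts.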
If you want a faster (or the fastest) way to do EC math on a CPU, you'll most likely have to roll your own. Slow but easy? Python. Fastish? libsecp256k1. Faster? C and custom arbitrary-precision modular arithmetic code. Faster still? Assembler / SIMD CPU instructions. Faster than anything so far? GPU parallel executor threads. Fastest? Your own programmed ASIC.
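As an example of the kind of shortcut a roll-your-own implementation can take, here is a small sketch at the "slow but easy" Python end of that list. It assumes the brute-force loop walks consecutive private keys, which may not match your loop. In that case each public key can be derived from the previous one with a single point addition (P(k+1) = P(k) + G) instead of a full scalar multiplication. The curve constants are the standard secp256k1 parameters; the function names are mine, and the code is neither constant time nor safe for real keys. It needs Python 3.8+ for the modular inverse via pow().

```python
# Standard secp256k1 domain parameters.
P = 2**256 - 2**32 - 977
N = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141
G = (0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
     0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8)


def point_add(a, b):
    """Add two affine points; None represents the point at infinity."""
    if a is None:
        return b
    if b is None:
        return a
    x1, y1 = a
    x2, y2 = b
    if x1 == x2 and (y1 + y2) % P == 0:
        return None                       # a + (-a) = infinity
    if a == b:
        lam = 3 * x1 * x1 * pow(2 * y1 % P, -1, P) % P        # tangent slope
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P, -1, P) % P       # chord slope
    x3 = (lam * lam - x1 - x2) % P
    y3 = (lam * (x1 - x3) - y1) % P
    return (x3, y3)


def scalar_mult(k, point=G):
    """Variable-time double-and-add: roughly 256 doublings plus additions."""
    result, addend = None, point
    while k:
        if k & 1:
            result = point_add(result, addend)
        addend = point_add(addend, addend)
        k >>= 1
    return result


def pubkeys_from_scratch(start, count):
    """One full scalar multiplication per key."""
    return [scalar_mult(k) for k in range(start, start + count)]


def pubkeys_incremental(start, count):
    """One scalar multiplication up front, then one point addition per key."""
    out, current = [], scalar_mult(start)
    for _ in range(count):
        out.append(current)
        current = point_add(current, G)
    return out


if __name__ == "__main__":
    assert pubkeys_from_scratch(1000, 20) == pubkeys_incremental(1000, 20)
    print("incremental and from-scratch public keys match")
```

Both functions return the same points, but the incremental one replaces hundreds of point operations per key with one. A faster implementation would also avoid the modular inversion in every addition, for example by using projective coordinates or batching the inversions, which is not shown here.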