I have almost the same script with Jacobian coordinates in C++ but using #include <boost/multiprecision/cpp_int.hpp>
It's not a million per second, but it's very close.

Are you sure about that? If my script runs at 225k op/s in shitty Python interpreted code, a native version won't run 4 times faster. It would run 50 times faster. On a single core.
I'm too tired to prove this. Using
libsecp256k1 instead of GMP or whatever, and doing the same batched addition as in my script (which is the core reason of the performance multiplier) reaches 10 M+ op/s, no questions asked. So on an i9 with 6 performance cores it does 60M op/s, and also the rest of 14 efficiency cores each contribute an additional 8M op/s. So a total of around 170 M op/s. But this is futile, it's just a
complete waste of electricity.