Single core python script...294k keys/s
Riiiight, so either the script is running on dozens of single cores, or is leveraging CuPy/PyCUDA/etc., or is orchestrating the CUDA backend instantiation, or..?
As in, "An Aes Sedai never lies, but the truth she speaks may not be the truth you think you hear."

Or perhaps you forgot to multiply your speed by 6 for the endomorphism and by 1000 for the milliseconds.

I jest, but that speed seems hard to swallow when I'm running at several times that and have several orders of magnitude fewer results.