I know it's a little slower than your tests, but I am wondering, without changing anything, whether this is the best speed I will get on a Windows machine (for my specific CPU, not all of them)...
Have you tried compiling to a Windows exe? If so, did your speed dip versus the Linux build?
It depends on the CPU. I think 211 kB is larger than the L1 cache, so if you lower the loop count and pin the OMP affinity to a single core (e.g., via the environment variables), it may run faster.
You can also enable -march=native and play with various compiler flags, like enabling/disabling AVX512. I got different results (better or worse) with these options; for example, some flags make a single-threaded run much faster but a multi-threaded run slower, due to AVX bottlenecks.
For specific purposes the speed can go even higher. As it is, both X and Y are computed and converted from 5x52 to 4x64. I would conjecture that a BSGS implementation could be blazing fast with these tweaks:
- Y is not needed (except for the last computed point, which updates the center pivot)
- use directly the 5x52 native representation
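To make the two representations concrete, here is a small sketch of the 5x52 → 4x64 conversion mentioned above. The helper names are mine, purely for illustration, and the 5x52 limbs are assumed fully normalized (each limb below 2^52):

```python
# Illustrative sketch: converting a 256-bit field element between the
# 5x52 limb layout and the 4x64 limb layout (little-endian limbs).
MASK52 = (1 << 52) - 1
MASK64 = (1 << 64) - 1

def int_to_5x52(v):
    """Split a 256-bit integer into five 52-bit limbs."""
    return [(v >> (52 * i)) & MASK52 for i in range(5)]

def limbs_5x52_to_4x64(limbs):
    """Recombine normalized 5x52 limbs and re-split into four 64-bit limbs."""
    v = sum(l << (52 * i) for i, l in enumerate(limbs))
    return [(v >> (64 * i)) & MASK64 for i in range(4)]

x = 0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798  # secp256k1 Gx
limbs52 = int_to_5x52(x)
limbs64 = limbs_5x52_to_4x64(limbs52)
assert sum(l << (64 * i) for i, l in enumerate(limbs64)) == x
# The lower 48 bits of X fall entirely inside the lowest limb in either layout:
assert (limbs52[0] & ((1 << 48) - 1)) == (limbs64[0] & ((1 << 48) - 1)) == x & ((1 << 48) - 1)
```

Since the lower 48 bits of X sit inside the lowest limb of both layouts, a 48-bit hash can be extracted straight from the native 5x52 form without converting to 4x64 first.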
Baby steps (the "i" index plus a key hash) would then be stored in the DB (for example, the hash can be just the lower 48 bits of X, which sit in the lowest 64-bit limb of the native FE). This part will be slow because insertions can't be parallelized.
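A minimal sketch of that baby-step table, using modular multiplication as a cheap stand-in for EC point addition (the prime, base, and all names here are illustrative, not from the actual codebase). The key is the lower 48 bits of the group element; the value is a list of "i" indices, since a 48-bit hash can collide:

```python
# Toy baby-step table. Group: integers mod p under multiplication,
# standing in for EC point addition. All parameters are illustrative.
p = 0xFFFFFFFFFFFFFFC5   # 2**64 - 59, a prime
g = 3                    # illustrative base
MASK48 = (1 << 48) - 1

def build_baby_table(m):
    table = {}
    acc = 1                                  # g^0
    for i in range(m):
        h = acc & MASK48                     # "hash" = lower 48 bits of X
        table.setdefault(h, []).append(i)    # sequential insert (not parallelizable)
        acc = (acc * g) % p                  # one group operation per baby step
    return table

table = build_baby_table(1 << 12)
assert table[1 & MASK48][0] == 0             # g^0 = 1 lands at index 0
```

The insertion loop is inherently serial here; a real implementation might batch-sort and bulk-load instead, but that is beyond this sketch.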
Giant steps (the "j" index) would then simply run in parallel, doing the lookups and collision handling in parallel as well (the DB is read-only). But this requires updating the code to allow a stride size (giant-step multiples), which was, I think, the main reason shinji366 requested it.
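Here is a self-contained sketch of the giant-step phase with a stride parameter, again with modular multiplication standing in for point addition (all names and parameters are mine, for illustration only). The baby table is built once, then the "j" range is split across threads; since the table is read-only at that point, the lookups are safe to parallelize:

```python
# Toy BSGS with a stride: finds k in [0, b) such that g^(k*stride) == target,
# using modular multiplication as a stand-in group. Illustrative only.
from concurrent.futures import ThreadPoolExecutor
from math import isqrt

p = 0xFFFFFFFFFFFFFFC5   # 2**64 - 59, a prime
g = 3                    # illustrative base
MASK48 = (1 << 48) - 1

def bsgs(target, b, stride=1, workers=4):
    m = isqrt(b) + 1
    step = pow(g, stride, p)
    # Baby steps (sequential build): step^i for i in [0, m)
    table, acc = {}, 1
    for i in range(m):
        table.setdefault(acc & MASK48, []).append((i, acc))
        acc = (acc * step) % p
    # Giant steps: target * step^(-m*j); the table is read-only now,
    # so the j-range can be split across threads safely.
    giant = pow(step, -m, p)                     # modular inverse power (Python 3.8+)
    def scan(js):
        for j in js:
            y = (target * pow(giant, j, p)) % p
            for i, x in table.get(y & MASK48, ()):
                if x == y:                       # resolve 48-bit hash collisions
                    return i + j * m
        return None
    chunks = [range(w, m + 1, workers) for w in range(workers)]
    with ThreadPoolExecutor(workers) as ex:
        for res in ex.map(scan, chunks):
            if res is not None:
                return res
    return None

k = 123_456
assert bsgs(pow(g, k * 5, p), 1 << 20, stride=5) == k
```

A real implementation would advance each thread's giant step by repeated group additions rather than recomputing `pow(giant, j, p)` per step, but the structure (serial baby phase, parallel read-only giant phase) is the point here.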
AFAIK this will solve the ECDLP over an interval of size b in about 1.41 * sqrt(b) additions on average, where half of them are the baby steps.
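The constant comes from choosing m = sqrt(b/2) baby steps: the expected number of giant steps before the collision is then b/(2m) = sqrt(b/2) as well, so the average total is 2*sqrt(b/2) = sqrt(2)*sqrt(b) ≈ 1.41*sqrt(b), with the baby steps making up half of it. A quick numeric check:

```python
# Sanity check of the ~1.41*sqrt(b) average cost with m = sqrt(b/2) baby steps.
from math import isqrt, sqrt

b = 1 << 40
m = isqrt(b // 2)             # baby steps: sqrt(b/2)
expected_giant = b / (2 * m)  # average giant steps until the collision
total = m + expected_giant
assert abs(total / sqrt(b) - sqrt(2)) < 1e-3   # ratio is ~1.414, i.e. sqrt(2)
```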
Maybe I will add the stride option in a few weeks and try some BSGS on top. Not a huge priority for me since it doesn't really scale well for high bits.