A group size of 512 does not bring significant improvement (less than 1%). The DRS62 ModInv is fast and almost negligible with a group size of 256.
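For readers following along, here is a minimal sketch of why the inversion cost fades as the group grows. It assumes "group size" means the number of field elements that share one modular inversion (Montgomery's batch inversion trick). The name batch_inverse is hypothetical, and pow(acc, p - 2, p) is only a stand-in for a real inversion routine such as DRS62; it is not how VanitySearch does it.

```python
# secp256k1 field prime
P = 2**256 - 2**32 - 977

def batch_inverse(values, p=P):
    """Invert every element of `values` mod p using a single inversion.

    Assumes all values are non-zero mod p. Hypothetical helper for
    illustration only.
    """
    n = len(values)
    prefix = [1] * n          # prefix[i] = values[0] * ... * values[i-1]
    acc = 1
    for i, v in enumerate(values):
        prefix[i] = acc
        acc = acc * v % p
    # The one and only inversion (Fermat here, stand-in for a fast ModInv).
    inv = pow(acc, p - 2, p)
    out = [0] * n
    for i in range(n - 1, -1, -1):
        out[i] = inv * prefix[i] % p   # 1 / values[i]
        inv = inv * values[i] % p      # now 1 / (values[0] * ... * values[i-1])
    return out
```

With roughly 3 multiplications per element plus one inversion per group, the per-point share of the inversion shrinks as 1/group size, which fits with 512 bringing little over 256.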
If you have a modular multiplication faster than the digit-serial Montgomery multiplication on a 256-bit field, I'm obviously open to it. Folding does not improve things at 256 bits when working with 64-bit digits. I'm not sure whether Barrett could be faster; I must say I didn't try it, and for "medium-size" fields there can be traps.
On my pc:
VanitySearch -stop -u -t 1 1tryme --> 1,2 MKeys/s
my ecc library --> 2,0 MKeys/s (17 M Public keys/s)
EDIT:
I use:
a) group of 4096 points
b) a * b = c mod p: a*b --> 8 * 64-bit limbs (512 bits), then high 4 limbs * (2**256 - p) + low 4 limbs.
c) exploit some properties of secp256k1 curve
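
The reduction in b) can be sketched as follows. This is a minimal illustration using Python big integers rather than 64-bit limbs, and the names mulmod_fold, C and MASK are hypothetical. It relies only on the fact that 2**256 is congruent to 2**256 - p mod p, so the high half of the product can be folded back into the low half.

```python
# secp256k1 field prime and its small complement
P = 2**256 - 2**32 - 977
C = 2**256 - P            # = 2**32 + 977, fits in 33 bits
MASK = 2**256 - 1

def mulmod_fold(a, b):
    """Return a * b mod P using the high * (2**256 - P) + low fold.

    Assumes 0 <= a, b < P. Hypothetical helper for illustration only.
    """
    t = a * b                             # up to 512 bits (8 limbs of 64 bits)
    # First fold: high 4 limbs * C + low 4 limbs, result is about 289 bits.
    t = (t >> 256) * C + (t & MASK)
    # Second fold: the high part is now tiny, result is below 2**256 + 2**66.
    t = (t >> 256) * C + (t & MASK)
    # Final correction: the value is below 2 * P, so this runs at most once.
    while t >= P:
        t -= P
    return t
```

A quick check against the built-in operator: mulmod_fold(a, b) == a * b % P for any a, b below P. In a limb-based implementation the multiplication by C is cheap because C fits in a single 64-bit word.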