Hello,
Affine coordinates for search (faster):
Each group perform p = startP + i*G, i in [1..group_size] where i*G is a pre-computed table containing G,2G,3G,.... in affine coordinates. The inversion of deltax (dx1-dx2) is done once per group (1 ModInv and 256*3 mult). group_size is 256 key long.
Protective coordinates for EC multiplication (computation of starting keys). Normalization of the key is done after the multiplication for starting key.
Edit:
You also may have noticed that I have an innovative implementation of modular inversion (DRS62) which is almost 2 times faster than the Montgomery one. Some benchmark and comments are available in IntMop.cpp.