b) a * b = c mod p: compute a*b as a full 8 x 64-bit limb product, then high 4 limbs * (2**256 - p) + low 4 limbs.
I tried this. About the same performance, since for secp256k1 the multiplication by P in mmult can be reduced to a single 64-bit mult. So I'm interested in c).
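
For anyone following along, here is a rough Python sketch of the fold from b). It is not the code I benchmarked, only an illustration with big ints instead of 64-bit limbs; the names `C`, `MASK256` and `mod_mul` are made up for this post. It relies on 2**256 - p = 0x1000003D1 for secp256k1, which fits in one 64-bit word, and it assumes both inputs are already below p.

```python
# secp256k1 field prime and the small constant 2**256 - p
P = 2**256 - 2**32 - 977
C = 2**256 - P              # 0x1000003D1, fits in a single 64-bit word
MASK256 = 2**256 - 1

def mod_mul(a, b):
    """Return a*b mod P for 0 <= a, b < P, using 2**256 = C (mod P)."""
    t = a * b                                   # up to 512 bits (8 limbs)
    # First fold: high 4 limbs * C + low 4 limbs, result is below 2**290
    t = (t >> 256) * C + (t & MASK256)
    # Second fold: the high part is now below 2**34
    t = (t >> 256) * C + (t & MASK256)
    # t is now below 2**256 + 2**67, so one conditional subtraction is enough
    if t >= P:
        t -= P
    return t
```

With real limbs, each fold is a 4-limb by 1-word multiply plus a carry chain, which is why the cost ends up close to what I already have.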
OK, on Linux the performance is still bad, I'm sorry. Some problem with the intrinsics...