If you manage to push it a little over 30MKeys/s (like by using Neon instruction set) then you can match GPU performance.
In fact this speed is achieved with a modified version of the secp256k1 library of bitcoin core where every verification instructions have been removed for field and group function.
The library field_5*52.h (instead of 10*26 for x86 cpu) (limbs registers) is automatically include for 64bits architecture and you've right it's maybe 10x faster.
I will study the NEON instructions to improve but it will be difficult because this benchmark doesn't include the DP detection code.
It's only a pure random walk benchmark on 1 Billion points.