I've tried my first version using OpenSSL and got about 25,000 keys/s (single-threaded). You seem to be using something other than OpenSSL.
Unlike hash operations, elliptic curve operations have an unpredictable machine cycle count. Thus, the speed-up on SIMD processors (e.g. GPUs) won't be great.
I get the same performance as you, but EC_POINT_add isn't the slowest step; the call to EC_POINT_point2oct is.
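A likely reason, assuming OpenSSL's default prime-field code keeps points in Jacobian projective coordinates: adding two points needs only multiplications, but writing a point out as bytes needs its affine (x, y), and getting there costs one modular inversion per point. Here is a rough pure-Python sketch of that split on secp256k1. It does not call OpenSSL at all, and the helper names (inv, jacobian_add_affine, to_affine, walk_serialized) are made up for illustration.

```python
# secp256k1: y^2 = x^3 + 7 over the prime field F_P.
P = 2**256 - 2**32 - 977
G = (0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
     0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8)

def inv(a):
    """Modular inverse via Fermat's little theorem. This is the expensive step."""
    return pow(a % P, P - 2, P)

def affine_add(p1, p2):
    """Textbook affine addition: one inversion per call.
    Ignores the point at infinity and the case p1 == -p2."""
    (x1, y1), (x2, y2) = p1, p2
    if p1 == p2:
        lam = 3 * x1 * x1 * inv(2 * y1) % P   # tangent slope (curve has a = 0)
    else:
        lam = (y2 - y1) * inv(x2 - x1) % P    # chord slope
    x3 = (lam * lam - x1 - x2) % P
    y3 = (lam * (x1 - x3) - y1) % P
    return x3, y3

def jacobian_add_affine(pj, q):
    """Mixed addition (Jacobian + affine) with no inversion.
    Assumes pj is neither q nor -q."""
    X1, Y1, Z1 = pj
    x2, y2 = q
    z1z1 = Z1 * Z1 % P
    h = (x2 * z1z1 - X1) % P
    r = (y2 * Z1 * z1z1 - Y1) % P
    h2 = h * h % P
    h3 = h2 * h % P
    X3 = (r * r - h3 - 2 * X1 * h2) % P
    Y3 = (r * (X1 * h2 - X3) - Y1 * h3) % P
    Z3 = Z1 * h % P
    return X3, Y3, Z3

def to_affine(pj):
    """Jacobian -> affine: the single inversion that serialization has to pay for."""
    X, Y, Z = pj
    zi = inv(Z)
    zi2 = zi * zi % P
    return X * zi2 % P, Y * zi2 * zi % P

def serialize_compressed(pt):
    """33-byte compressed encoding: parity prefix (0x02 or 0x03) followed by x."""
    x, y = pt
    return bytes([2 + (y & 1)]) + x.to_bytes(32, "big")

def walk_serialized(n):
    """Return the compressed encodings of 3G, 4G, ..., (n+2)G."""
    x, y = affine_add(G, G)        # start from 2G
    pj = (x, y, 1)
    out = []
    for _ in range(n):
        pj = jacobian_add_affine(pj, G)                   # cheap: multiplications only
        out.append(serialize_compressed(to_affine(pj)))   # costly: one inversion
    return out
```

Calling walk_serialized(1000) and profiling it should show most of the time going to inv(), i.e. to the conversion before serialization rather than to the addition itself. Whether the ratio looks the same inside OpenSSL depends on its bignum code, so treat this as an illustration of the cost structure, not a benchmark.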