The actual EC throughput with AOCC (clang) is between 10 and 12 Mkeys/s per core, using the raw Bitcoin Core secp256k1 library without any hashing. You can't go faster than that on a CPU.

What's the clock frequency of a single core?
Are you doing batched addition in affine (not Jacobian) coordinates, using the secp256k1_fe_* primitives? And the P ± Q trick, where P + Q and P - Q share one inverse?
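
For anyone unsure what I mean by those two tricks, here is a rough sketch in plain Python rather than with the secp256k1_fe_* primitives, so it only shows the arithmetic and says nothing about speed. The function names (batch_inverse, affine_add, scalar_mul, center_plus_minus) are made up for this post and are not part of the library. It assumes Python 3.8+ for pow(x, -1, p), and that no offset point shares its x coordinate with the center point.

```python
# secp256k1 field prime and generator (standard curve constants).
P_FIELD = 2**256 - 2**32 - 977
GX = 0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798
GY = 0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8
G = (GX, GY)


def batch_inverse(values):
    """Montgomery's trick: invert n nonzero field elements with one inversion."""
    prefix = []
    acc = 1
    for v in values:
        prefix.append(acc)           # product of all elements before v
        acc = acc * v % P_FIELD
    inv = pow(acc, -1, P_FIELD)      # the single modular inversion
    out = [0] * len(values)
    for i in range(len(values) - 1, -1, -1):
        out[i] = inv * prefix[i] % P_FIELD
        inv = inv * values[i] % P_FIELD
    return out


def affine_add(p, q):
    """Plain affine add/double (one inversion per call). None is infinity."""
    if p is None:
        return q
    if q is None:
        return p
    (x1, y1), (x2, y2) = p, q
    if x1 == x2:
        if (y1 + y2) % P_FIELD == 0:
            return None
        lam = 3 * x1 * x1 * pow(2 * y1, -1, P_FIELD) % P_FIELD
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, P_FIELD) % P_FIELD
    x3 = (lam * lam - x1 - x2) % P_FIELD
    y3 = (lam * (x1 - x3) - y1) % P_FIELD
    return (x3, y3)


def scalar_mul(k, pt):
    """Slow double-and-add, used here only to build and check test points."""
    result = None
    while k:
        if k & 1:
            result = affine_add(result, pt)
        pt = affine_add(pt, pt)
        k >>= 1
    return result


def center_plus_minus(center, offsets):
    """Return [(center + Q, center - Q)] for each Q, one inversion in total.

    -Q = (x_Q, -y_Q), so both sums use the same 1 / (x_Q - x_center).
    Assumes x_Q != x_center for every Q.
    """
    cx, cy = center
    invs = batch_inverse([(qx - cx) % P_FIELD for qx, _ in offsets])
    results = []
    for (qx, qy), inv in zip(offsets, invs):
        lam = (qy - cy) * inv % P_FIELD          # slope for center + Q
        xp = (lam * lam - cx - qx) % P_FIELD
        yp = (lam * (cx - xp) - cy) % P_FIELD
        lam = (-qy - cy) * inv % P_FIELD         # slope for center - Q
        xm = (lam * lam - cx - qx) % P_FIELD
        ym = (lam * (cx - xm) - cy) % P_FIELD
        results.append(((xp, yp), (xm, ym)))
    return results
```

With a center of 100*G and offsets 1*G, 2*G, 3*G, center_plus_minus yields the points for keys 101/99, 102/98 and 103/97 at the cost of a single field inversion, which is the whole point of staying in affine coordinates for a batch.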