You want this code, which will be astronomically faster than your current code: https://github.com/bitcoin-core/secp256k1/pull/507
I believe that when I previously implemented the techniques in this code, my result was faster than vanitygen on a GPU.
It could also be made faster still with some improvements. For example, it doesn't actually need to compute the y coordinate of the points, so several field multiplications could be avoided in the gej_to_ge batch conversion. It could also avoid computing the scalar for any given point unless you find a match, e.g. by splitting the scalar construction into a separate function that you only call when there is a match.
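To make the x-only idea concrete, here is a rough Python sketch of a batched Jacobian-to-affine conversion that never touches Y. This is not the code from the PR: the function names are made up for illustration, and it assumes the usual Jacobian convention x = X / Z^2 with all Z nonzero. All the inverses come from a single modular inversion (Montgomery's trick).

```python
# Sketch only: hypothetical helper names, not the secp256k1 library API.
P = 2**256 - 2**32 - 977  # secp256k1 field prime

def batch_inverse(values, p=P):
    """Invert every (nonzero) element mod p using a single modular inversion."""
    n = len(values)
    prefix = [1] * (n + 1)
    for i, v in enumerate(values):
        prefix[i + 1] = prefix[i] * v % p      # running products
    inv = pow(prefix[n], p - 2, p)             # the one real inversion (Fermat)
    out = [0] * n
    for i in range(n - 1, -1, -1):
        out[i] = inv * prefix[i] % p           # 1 / values[i]
        inv = inv * values[i] % p              # peel off values[i]
    return out

def jacobian_x_only(points, p=P):
    """points: list of (X, Y, Z). Returns affine x = X / Z^2 for each point.

    Y is never read, so the multiplications a full conversion would spend
    on y = Y / Z^3 are skipped.
    """
    zinvs = batch_inverse([Z for (_, _, Z) in points], p)
    return [X * zi * zi % p for (X, _, _), zi in zip(points, zinvs)]
```

Per point this costs one squaring and one multiplication after the shared inversion, versus the extra multiplications needed to also recover y.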
Another advantage of this code is that it is set up to allow an arbitrary base point. This means you could use untrusted computers to search for you, since they only need a public point to start from and never see your private key.
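Here is a toy Python sketch of how searching from an arbitrary base point could work with an untrusted machine. Everything in it is illustrative rather than taken from the PR: `worker_search` and `owner_finish` are hypothetical names, the arithmetic is slow affine math, and the match test is a stand-in predicate on the x coordinate rather than a real address hash. The worker only ever sees the public point and returns an offset; the owner builds the actual scalar once, after a match.

```python
# Sketch only: slow textbook arithmetic, hypothetical function names.
P = 2**256 - 2**32 - 977
N = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141
G = (0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
     0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8)

def add(a, b):
    """Affine point addition on secp256k1; None is the point at infinity."""
    if a is None:
        return b
    if b is None:
        return a
    if a[0] == b[0] and (a[1] + b[1]) % P == 0:
        return None
    if a == b:
        lam = 3 * a[0] * a[0] * pow(2 * a[1] % P, P - 2, P) % P
    else:
        lam = (b[1] - a[1]) * pow((b[0] - a[0]) % P, P - 2, P) % P
    x = (lam * lam - a[0] - b[0]) % P
    return (x, (lam * (a[0] - x) - a[1]) % P)

def mul(k, pt):
    """Double-and-add scalar multiplication."""
    acc = None
    while k:
        if k & 1:
            acc = add(acc, pt)
        pt = add(pt, pt)
        k >>= 1
    return acc

def worker_search(base, predicate, max_tries=100000):
    """Untrusted side: knows only the public point `base`.

    Walks base + 1*G, base + 2*G, ... and returns the first offset whose
    point satisfies `predicate`, or None. No private scalar is ever formed.
    """
    pt = base
    for i in range(1, max_tries + 1):
        pt = add(pt, G)
        if pt is not None and predicate(pt):
            return i
    return None

def owner_finish(secret, offset):
    """Trusted side: construct the private key only once a match is reported."""
    return (secret + offset) % N
```

The owner would publish `mul(secret, G)` as the base point, hand it to the worker, and call `owner_finish` on whatever offset comes back. The worker learns the matching public key but not the private key.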
Sipa also has AVX2 8-way sha2 and ripemd160 implementations that he might post somewhere if you asked. An 8-way bech32 checksum generator should be really easy to do, though if your expression doesn't constrain the final 6 characters (the checksum), you should avoid even running the checksum.
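A scalar (not 8-way) Python sketch of the checksum-skipping point, assuming the pattern is a plain prefix on the data part of the address. The polymod and checksum follow the BIP173 reference algorithm; `match_prefix` is a hypothetical helper, and `data` is assumed to already be the 5-bit values (witness version plus converted program).

```python
# Sketch only: match_prefix is a hypothetical helper, not from any library.
CHARSET = "qpzry9x8gf2tvdw0s3jn54khce6mua7l"
GEN = [0x3b6a57b2, 0x26508e6d, 0x1ea119fa, 0x3d4233dd, 0x2a1462b3]

def polymod(values):
    """BIP173 checksum polynomial over 5-bit values."""
    chk = 1
    for v in values:
        top = chk >> 25
        chk = ((chk & 0x1ffffff) << 5) ^ v
        for i in range(5):
            if (top >> i) & 1:
                chk ^= GEN[i]
    return chk

def hrp_expand(hrp):
    return [ord(c) >> 5 for c in hrp] + [0] + [ord(c) & 31 for c in hrp]

def checksum(hrp, data):
    pm = polymod(hrp_expand(hrp) + data + [0] * 6) ^ 1
    return [(pm >> 5 * (5 - i)) & 31 for i in range(6)]

def match_prefix(hrp, data, prefix):
    """Return the full address if the data part starts with `prefix`, else None.

    The prefix never reaches the last 6 characters, so candidates are
    rejected without running the checksum; it is only computed on a match.
    """
    body = "".join(CHARSET[d] for d in data)
    if not body.startswith(prefix):
        return None
    return hrp + "1" + body + "".join(CHARSET[c] for c in checksum(hrp, data))
```

With a prefix-only pattern, nearly every candidate exits at the `startswith` test, so the checksum cost is paid only for the rare match.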