Hello,
I would like to present a new Bitcoin address prefix finder called VanitySearch. It is very similar to Vanitygen.
The main differences from Vanitygen are that VanitySearch does not use the heavy OpenSSL library for CPU calculations, and that the GPU kernel is written in CUDA in order to take full advantage of inline PTX assembly.
On my Intel Core i7-4770, VanitySearch runs about 4 times faster than vanitygen64 (1.32 Mkey/s -> 5.27 Mkey/s).
On my GeForce GTX 645, VanitySearch runs about 1.5 times faster than oclvanitygen (9.26 Mkey/s -> 14.548 Mkey/s).
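For anyone who wants to check the ratios or plug in their own numbers, here is a small Python sketch. It only uses the throughput figures quoted above; the dictionary labels and the speedup() helper are names made up for this example.

```python
# Throughput figures quoted in this post, in Mkey/s: (before, after).
# The labels and the speedup() helper are illustrative names only.
benchmarks = {
    "CPU, Core i7-4770 (vanitygen64 -> VanitySearch)": (1.32, 5.27),
    "GPU, GeForce GTX 645 (oclvanitygen -> VanitySearch)": (9.26, 14.548),
}


def speedup(before, after):
    """Return how many times faster 'after' is than 'before'."""
    return after / before


for name, (before, after) in benchmarks.items():
    print(f"{name}: {speedup(before, after):.2f}x")
```

This prints 3.99x for the CPU and 1.57x for the GPU.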
If you want to compare VanitySearch and Vanitygen results, use the -u option to search for uncompressed addresses.
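To see why the -u option matters when comparing results, here is a minimal Python sketch of the two standard public key encodings. It assumes (my reading of the line above, not something verified here) that VanitySearch searches compressed addresses unless -u is given, while Vanitygen produces uncompressed ones. The serialize_pubkey() helper is a hypothetical name for this example and is not taken from either program.

```python
# secp256k1 generator point G, which is also the public key for private key 1.
GX = 0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798
GY = 0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8


def serialize_pubkey(x, y, compressed=True):
    """Encode a public key point (hypothetical helper for illustration)."""
    x_bytes = x.to_bytes(32, "big")
    if compressed:
        # 33 bytes: prefix 0x02 if y is even, 0x03 if y is odd, then x.
        prefix = b"\x02" if y % 2 == 0 else b"\x03"
        return prefix + x_bytes
    # 65 bytes: prefix 0x04, then x, then y.
    return b"\x04" + x_bytes + y.to_bytes(32, "big")


compressed = serialize_pubkey(GX, GY, compressed=True)
uncompressed = serialize_pubkey(GX, GY, compressed=False)
print(len(compressed), compressed.hex())
print(len(uncompressed), uncompressed.hex())
```

A P2PKH address is the Base58Check encoding of a hash of these bytes (RIPEMD-160 of SHA-256), so the same private key gives two different addresses depending on the encoding, and a prefix found in one mode will generally not match in the other.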
There is still a lot of room for improvement.
Feel free to test it and to submit issues.
Are you using affine or Jacobian coordinates for the points?