Hello everyone. I have published my optimized versions of VanitySearch (CUDA) with a speed boost, in case anyone is interested.

The "bitcrack" version is specific to the puzzle and allows searching for addresses and prefixes (compressed) within a given range. The speed is about 6900 MKey/s on a 4090 and 8800 MKey/s on a 5090.
The second version, on the other hand, performs a standard search for vanity addresses (not just P2PKH compressed) with the same math and CUDA optimizations. It runs random searches with endomorphisms.
https://github.com/FixedPaul/VanitySearch-Bitcrack
https://github.com/FixedPaul/VanitySearch

Thank you for your work, it's truly impressive! The first program is faster than any other solution I've seen. Even with a 33% power limit on an RTX 4090, it reaches around 2.3 Gkey/s. The second program is faster still, at about 4 Gkey/s under the same power limit. Unfortunately, the second program's numbers are merely theoretical and not useful for solving puzzles, as it does not support working with ranges.
I wonder if it is possible to implement Bitcoin address prefix searching not only by the starting characters but also by any other positions within the address. For example, searching for characters at the end, in the middle, or even a combined search where part of the characters are at the beginning, part in the middle, and part at the end, and so on.
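To make the request concrete, here is a minimal host-side sketch in Python of what "characters at any position" could mean. It is not how VanitySearch matches on the GPU; the helper name `matches_pattern` and the `*` / `?` wildcard syntax are my own assumptions, borrowed from shell-style globbing in the standard `fnmatch` module.

```python
from fnmatch import fnmatchcase


def matches_pattern(address: str, pattern: str) -> bool:
    """Case-sensitive glob match (hypothetical helper, not part of VanitySearch).

    '*' matches any run of characters and '?' matches exactly one,
    so fixed fragments can sit at the start, middle, or end.
    """
    return fnmatchcase(address, pattern)


# Well-known example address, used here only as test data.
address = "1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa"

print(matches_pattern(address, "1A1z*"))       # fragment at the start
print(matches_pattern(address, "*fNa"))        # fragment at the end
print(matches_pattern(address, "*QGef*"))      # fragment in the middle
print(matches_pattern(address, "1A*QGef*Na"))  # start + middle + end combined
print(matches_pattern(address, "1a1z*"))       # wrong case, so no match
```

Matching is case-sensitive on purpose, since Base58 addresses distinguish upper and lower case.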
Thanks! But why only 2.3 Gkey/s? A 4090 @300W should run at around 5.5 Gkey/s.
As for the second program, as soon as I find some time, I'll also implement search within a specific range there (without endomorphisms, of course), so that it can search within a range rather than randomly—both for prefixes and wildcards, which is what you're asking for, if I understood correctly.

2.3 Gkey/s is only about a third of my 4090’s full potential. Under full load, the speed indeed reaches around 7 Gkey/s, just as you mentioned. And the second program, which doesn’t support range searches, can actually achieve over 10 Gkey/s at full power!
However, I try not to push the GPU to its limits, as excessive heat increases wear on both the GPU and memory. My card doesn’t have liquid cooling, so I don’t see much sense in running it at maximum capacity for prolonged periods, especially for tasks like this. But when the temperature stays within safe limits, I’m comfortable letting it run key searches for many hours.
Regarding your planned program improvements—that would be absolutely fantastic! For now, I handle wildcard prefix searches using an additional filtering script in Python. This script processes the output of the main program, which searches for the first few characters but generates an overwhelming number of unnecessary results. The script intercepts this output and filters only the prefixes I’m interested in, even if they appear in different parts of the address.
Without this filtering, the main program’s result file can grow to gigabytes within minutes. However, this approach—using both the main program (such as yours) and the filtering script—introduces significant limitations on search speed, possibly due to the high-intensity data output.
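For anyone curious, the filtering idea looks roughly like the sketch below. It is a simplified illustration, not my actual script: the function name `filter_addresses`, the regular expression, and the `PubAddress:` / `Priv (HEX):` line labels in the sample data are assumptions, so adjust them to whatever the main program really prints.

```python
import re
from typing import Iterable, Iterator

# Assumption: each result contains a Base58 P2PKH address ('1' + 25..33 chars).
ADDRESS_RE = re.compile(r"\b1[1-9A-HJ-NP-Za-km-z]{25,33}\b")


def filter_addresses(lines: Iterable[str], wanted: Iterable[str]) -> Iterator[str]:
    """Yield addresses that contain any wanted fragment at any position."""
    fragments = list(wanted)
    for line in lines:
        match = ADDRESS_RE.search(line)
        if match is None:
            continue  # not an address line (e.g. a private key line)
        addr = match.group(0)
        if any(fragment in addr for fragment in fragments):
            yield addr


# Sample lines in an assumed output format, for demonstration only.
sample = [
    "PubAddress: 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa",
    "Priv (HEX): 0x1",
    "PubAddress: 1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2",
]

for hit in filter_addresses(sample, ["QGef"]):
    print(hit)
```

In real use the same function can take `sys.stdin` instead of `sample`, with the main program's output piped into the script. That still leaves the main program printing every candidate, which is where I suspect the slowdown comes from.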