Status update (I can give more details if anyone cares):
- Puzzle #1 (8 bits): solved with Python, single thread, around 4000 sig/s (create every TX and privKey from scratch)
- Puzzle #2 (16 bits): solved with Python, same speed as #1
- Puzzle #3 (24 bits): solved with Python, around 200 Ksig/s (optimized TX buffer construction, multithreaded)
- Puzzle #4 (32 bits): switched to C, allowing SHA state reuse for the first 2 message blocks (the first 128 bytes are always the same) + keeping the same output scriptHash; around 32 Msig/s multithreaded
- Puzzle #5 (40 bits): solved on a laptop RTX 3050 (820 Msig/s, done in 15 minutes) - full SHA caching of the first 192 bytes of the TX (part of it was 3 bytes of nLockTime), and reducing the message digest to an amortized single SHA256 round (instead of 4). So, for example, for 2**32 (*2) sig checks, only 1 + 2**24 + 2**32 SHA rounds are needed (instead of 2**34).
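For anyone unfamiliar with the SHA state reuse mentioned for #4 and #5, here is a minimal Python sketch of the idea, not the actual C/GPU code. It assumes the usual Bitcoin double-SHA256 digest and a fixed prefix that is a whole number of 64-byte blocks; the zero-filled prefix, the 4-byte varying suffix and the name `digests_with_cached_prefix` are placeholders made up for illustration.

```python
import hashlib

BLOCK = 64  # SHA-256 processes the message in 64-byte blocks


def digests_with_cached_prefix(prefix: bytes, suffixes):
    """Double-SHA256 of prefix + suffix for each suffix, hashing the prefix once."""
    # Caching only pays off cleanly when the prefix ends on a block boundary.
    assert len(prefix) % BLOCK == 0
    base = hashlib.sha256()
    base.update(prefix)              # the constant blocks are compressed once here
    out = []
    for suffix in suffixes:
        h = base.copy()              # clone the saved internal state (the "midstate")
        h.update(suffix)             # only the varying tail is hashed per candidate
        out.append(hashlib.sha256(h.digest()).digest())  # second SHA-256 pass
    return out


# Placeholder data: 128 constant bytes (2 blocks) and a 4-byte varying tail.
prefix = bytes(128)
suffixes = [i.to_bytes(4, "little") for i in range(4)]

cached = digests_with_cached_prefix(prefix, suffixes)
naive = [hashlib.sha256(hashlib.sha256(prefix + s).digest()).digest() for s in suffixes]
assert cached == naive  # same digests, but the prefix blocks were hashed only once
```

In C or on a GPU the same trick means saving the eight 32-bit state words after the constant blocks and restarting the compression function from them for every candidate.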
I'm now grinding #6 (48 bits) on an RTX 4090. The speed is insane (much higher than that of an H160 vanity search). It will be solved by tonight.
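A quick sanity check of the SHA counts quoted for #5, as plain arithmetic. Reading the three terms 1, 2**24 and 2**32 as nested caching levels is only one interpretation, and the 2**40 candidate count for a full sweep is an assumption taken from the "40 bits" label; the variable names are made up.

```python
# Numbers taken from the #5 bullet above.
iterations = 2**32
naive_sha = 4 * iterations              # 4 SHA rounds per message digest
cached_sha = 1 + 2**24 + 2**32          # count quoted with full caching

assert naive_sha == 2**34
sha_saving = naive_sha / cached_sha     # just under 4x fewer SHA rounds

# Assumption: a 40-bit puzzle means at most 2**40 candidates to try.
rate = 820e6                            # sig/s on the laptop RTX 3050
full_sweep_minutes = 2**40 / rate / 60  # worst case, roughly 22 minutes

print(f"SHA saving: {sha_saving:.2f}x, full sweep: {full_sweep_minutes:.1f} min")
```

A worst-case sweep of about 22 minutes is consistent with the reported 15 minutes, since the solution can turn up before the whole range is covered.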