Re: Checking brainwallet

you will need to use a hashing algorithm that can handle a sufficiently large number of items

I didn't expect SHA256 to be so slow. I just tested a simple bash loop, and get less than 1000 hashes per second. That won't scale well to 40 billion inputs.

As @vjudeu said, you used un-optimized tool. You should use optimized tool such as hashcat. Even on single core VPS, i managed to get ~5000 kH/s.

Currently I work on piece of software for someone who has lost access to his brainwallet. He knows the way how to produce the phrase, but as it contains some (unknown) number, we must check many possible phrases. The list of potential addresses is very small (just a few), so it is not a factor which slows us down - it is rather priv->pub generation. Initial sha256 and later Hash160 are very fast. The main problem would be to compare generated hash160 against huge database of addresses (& of course generation for both compressed and uncompressed keys). At this stage of project we have speed 47Mkey on RTX 3060 and a little more than 100Mkeys on 3090 - but we are progressing every day Wink

I have developped a tool like this with about the same performance than PawGo one (80 MKeys on a RTX 3070) .
The solution to check on a big list of hash160 is to create a bloomfilter.
For my experiment i build a bloomfilter of about 80Mega hash160 (typically the size of the utxos db).
the binary size is about 1.2 GB with an error rate of 1e-16. (that means that u will have a false positiv every 10e16 lookup in the bloomfilter).
You just have to load this bloomfilter in the GPU RAM to speed up the process.
The lookup in the filter is very fast compared to the SHA256("....)->EC multiplication->hash160 of the publickey process.
So with my tool u can check about 100M hash160 at the speed of 80M sha256(password candidate) on a RTX3070.

If tou well define the pattern of the brainwallet you want to find. I can help u

That's a pretty impressive speed - do you have any special tricks to make the EC multiplication fast?

UFFF ok it was a lot of work but im aware to share:

this is the best cuda grid/block config i found for the RTX 3070 (48 SM cores)

#define GRID_SIZE 1024*4*5
#define BLOCK_SIZE 256

I precompute a table of kG value of 16*65535 values with x and y indexed by

0001->FFFF
0001 0000->FFFF 0000
0001 0000 0000->FFFF 0000 0000
...
I load this 2 table in shared memory of the GPU at the start of the program

so for a 256 scalar multiplication you just have to decompose your 256bits (k in k.G) in 16 parts of 16 bits and to do 16 sequential points addition
you can do the addition in jacobian coordinate value given by the (x,y,1) + (x',y',z') to avoid slow inversion (i think this is the most speed up trick)

The final inversion is done with the Fast Modular Inversion (Delayed Right Shift 62 bits)

the final inversion is made with