That is a good question. Is it possible to speed that up when comparing two lists with millions of lines?
If you sort the file and use binary search to look for each item in your list, then the runtime becomes O(log2(n)) per entry, so your worst case is O(NumberOfAddresses*log2(n)) overall. That's really not slow: it's only about 30 comparisons to find an address in a list with 1,000,000,000 lines in it.
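As a rough illustration of the idea (not anyone's actual tool; the file name and helper names here are made up), a sorted in-memory list can be probed with Python's bisect module:

```python
import bisect

def load_sorted_addresses(path):
    """Read one hex address per line and sort them once up front."""
    with open(path) as f:
        addrs = [line.strip() for line in f if line.strip()]
    addrs.sort()
    return addrs

def contains(sorted_addrs, target):
    """Binary search: O(log2 n) comparisons per lookup."""
    i = bisect.bisect_left(sorted_addrs, target)
    return i < len(sorted_addrs) and sorted_addrs[i] == target

# sorted_addrs = load_sorted_addresses("addresses.txt")   # hypothetical file
# print(contains(sorted_addrs, "f6aa..."))                # ~30 steps even for 1e9 entries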
Actually fitting all that into memory is going to be a problem though. There are some external (on-disk) sorting algorithms I read about in a Knuth book, but they're very old and I think they may have to be adapted for hard disks instead of tapes.
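A minimal sketch of that kind of external merge sort, assuming the input is plain text with one address per line (chunk size and file names are placeholders, not anything from Knuth verbatim):

```python
import heapq
import itertools
import tempfile

def external_sort(in_path, out_path, chunk_lines=5_000_000):
    """Sort a file too big for RAM: sort fixed-size chunks, spill each
    to a temp file, then stream a k-way merge over the sorted chunks."""
    chunks = []
    with open(in_path) as src:
        while True:
            lines = list(itertools.islice(src, chunk_lines))
            if not lines:
                break
            lines = [l if l.endswith("\n") else l + "\n" for l in lines]
            lines.sort()
            tmp = tempfile.TemporaryFile(mode="w+")
            tmp.writelines(lines)
            tmp.seek(0)
            chunks.append(tmp)
    with open(out_path, "w") as dst:
        dst.writelines(heapq.merge(*chunks))  # holds only one line per chunk at a time
    for tmp in chunks:
        tmp.close()

# external_sort("addresses.txt", "addresses.sorted.txt")  # hypothetical file names
```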
You can't realistically search for a h160 address in a 300GB text file; you have to convert the file to binary first. Then the file is around 10GB, and it only takes about a second to get a yes/no on whether that h160 is in the list of 300M addresses.
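A minimal sketch of that kind of lookup, assuming the .bin file is just sorted raw 20-byte h160 records laid end to end (the file name and layout are my assumptions, not a specific tool's format):

```python
import mmap

REC = 20  # a raw h160 is 20 bytes

def h160_in_file(path, h160: bytes) -> bool:
    """Binary search over a sorted file of fixed-width 20-byte records."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        lo, hi = 0, len(mm) // REC
        while lo < hi:
            mid = (lo + hi) // 2
            rec = mm[mid * REC:(mid + 1) * REC]
            if rec < h160:
                lo = mid + 1
            elif rec > h160:
                hi = mid
            else:
                return True
    return False

# h160_in_file("addresses.bin", bytes.fromhex("89abcdef" * 5))  # hypothetical inputs
```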
The early brainflayer github had a tool called 'binchk': using xxd (linux) you convert the 300GB file to unique hex, produce the .bin file, then run binchk against it.
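A rough sketch of that hex-text to sorted-binary conversion in Python (function and file names are mine; for a real 300GB input you would stream it and combine with the external sort above rather than hold a set in memory):

```python
def hex_file_to_sorted_bin(txt_path, bin_path):
    """Turn one-hex-h160-per-line text into a sorted, deduplicated
    file of raw 20-byte records (the .bin used for binary search)."""
    records = set()
    with open(txt_path) as src:
        for line in src:
            line = line.strip()
            if line:
                records.add(bytes.fromhex(line))  # 40 hex chars -> 20 raw bytes
    with open(bin_path, "wb") as dst:
        for rec in sorted(records):
            dst.write(rec)

# hex_file_to_sorted_bin("addresses.txt", "addresses.bin")  # hypothetical file names
```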
Brainflayer needed this because the 512MB bloom-filter it uses only allowed about 10M addresses before the false-positive rate went astronomical; with binchk you can take the false positives and verify whether any of them are real positives.
The bloom-filter is super fast and can work on a GPU, but the false-positive rate is high.
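For anyone who hasn't used one, here's a bare-bones bloom filter sketch (the default size and hash choice are just for illustration, nothing like brainflayer's actual 512MB layout):

```python
import hashlib

class Bloom:
    """Tiny bloom filter: k hash probes into a fixed bit array.
    Answers are 'definitely not present' or 'probably present'."""
    def __init__(self, size_bits=8 * 1024 * 1024, k=4):  # a real one would be far bigger
        self.size = size_bits
        self.k = k
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: bytes):
        for i in range(self.k):
            h = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def maybe_contains(self, item: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

# bf = Bloom()
# bf.add(bytes.fromhex("00" * 20))
# bf.maybe_contains(bytes.fromhex("00" * 20))  # True; false positives possible, false negatives not
```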
There's no point in using binary search on the raw text; just use the model described above.
Today when I scrape the blockchain I get about 300M addresses; after you run "sort -u" it will be slightly fewer, but you also need to run a client on the mempool to constantly add the new addresses to the bloom-filter.
If you're comparing and/or searching lists with 300M lines of hex, it's much better to use bloom-filters and binary search combined; that drops a 2+ hour search to seconds.
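Putting the two together, a sketch of the combined lookup (reusing the Bloom and h160_in_file sketches above; the filter screens out almost all misses, and the sorted .bin confirms the rare "probably yes"):

```python
def check_address(h160: bytes, bloom, bin_path="addresses.bin") -> bool:
    """Fast path: the bloom filter answers 'definitely not' for nearly everything.
    Slow path: only bloom hits fall through to the exact binary search,
    which weeds out the false positives."""
    if not bloom.maybe_contains(h160):
        return False
    return h160_in_file(bin_path, h160)
```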
So you are storing and searching 300 million addresses? How many of those have a balance? Are you just looking for a H160 collision? I'm curious to know why you are doing what you are doing; it doesn't seem like the typical trying-to-find-balances.
I'm seeing a 10,000MB/sec rate of addr-priv compares on a rack of 4 RTX 3070s; checking all 300M at once means I'm running 30*10^20 addr-priv key compares per second, and the birthday problem says 50% probability in a field of 2^128.
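Just to make the multiplication behind that figure explicit (the keys-per-second number below is my own assumption, picked so the product lands on the 30*10^20 compares/sec quoted in the post):

```python
# Back-of-envelope: compares/sec = (candidate keys tried per second) x (target addresses)
keys_per_sec = 1e13           # assumed throughput of the GPU rack (hypothetical figure)
targets = 3e8                 # ~300M scraped addresses
compares_per_sec = keys_per_sec * targets
print(f"{compares_per_sec:.0e} addr-priv compares/sec")  # 3e+21, i.e. 30*10^20
```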
You hit lots of addresses with value; the problem is most addresses have 0.05 or less, the number of addresses over 1 is in the thousands, and the number with say 0.5 or more is probably 10,000. The odds of finding 1 in 10,000 inside a space of 1 in 10^38 are effectively nil.
You can find dust, but I don't do this for money; I don't even have an exchange account. I'm just doing this for computational fun.
I think if you 'scale', that is, if somebody with a GPU farm re-purposes dozens of GPU racks to this, they could probably find the majority of value quite quickly.
The problem is there are only a few thousand super-high-value addresses left. Five years ago there were thousands of satoshi-era virgin 50+ BTC public keys; today there are fewer than 900 and the number drops monthly, so somebody is trading out of these addresses, and it ain't satoshi.
...
To answer your question: my god, nobody is storing 300M addresses. Learn what a bloom-filter is and does; it's a way of compressing 300M addresses (a 300GB hex-text file) into a 16GB binary file that can answer yes/no in nanoseconds rather than hours.
It's really frustrating on this forum because most people don't bother to learn, and the moderators censor anything that riles the minions. The majority of legacy bitcoin-core is dedicated to the status quo, so sure, they'll all just get old and die, thank god for young people. "They" don't want anybody to know the real story; the entire BTC paradigm is lies on lies, and anybody who says the emperor has no clothes is silenced. Too many vested interests; how is this any different from the people who run the Federal Reserve Bank???