That is a good question. Is it possible to speed that up when comparing two lists with millions of lines?
If you sort the file and use binary search to look for each item in your list, then the runtime becomes O(log2(n)) per entry, so your worst case is O(NumberOfAddresses * log2(n)). That's really not slow: it's about 30 comparisons to find an address in a list with 1,000,000,000 lines in it.
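As a minimal Python sketch of that idea (using the standard bisect module on an in-memory sorted list; the names are just illustrative):

```python
import bisect

def build_index(address_file):
    """Load the addresses once and sort them (O(n log n))."""
    with open(address_file) as f:
        return sorted(line.strip() for line in f)

def contains(sorted_addrs, addr):
    """Binary search: roughly log2(n) comparisons per lookup."""
    i = bisect.bisect_left(sorted_addrs, addr)
    return i < len(sorted_addrs) and sorted_addrs[i] == addr
```

For a billion entries each lookup is only ~30 comparisons; the catch is holding the sorted list at all, which is the memory problem raised next.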
Actually fitting all of that into memory is going to be a problem, though. There are some on-disk (external) sorting algorithms I read about in a Knuth book, but they're very old and I think they may have to be adapted for hard disks instead of tapes.
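For what it's worth, GNU sort already performs an external merge sort with temporary files when the input doesn't fit in memory, which is why the plain "sort -u" mentioned below works on huge lists. A rough Python sketch of the same chunk-then-merge idea (the chunk size is an arbitrary assumption):

```python
import heapq, itertools, os, tempfile

def external_sort(in_path, out_path, chunk_lines=5_000_000):
    """Sort a file too big for RAM: sort fixed-size chunks, then k-way merge them."""
    tmp_paths = []
    with open(in_path) as f:
        while True:
            chunk = list(itertools.islice(f, chunk_lines))
            if not chunk:
                break
            chunk.sort()
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as tmp:
                tmp.writelines(chunk)
            tmp_paths.append(path)
    # k-way merge of the sorted runs (the classic tape-merge idea, done on disk)
    runs = [open(p) for p in tmp_paths]
    with open(out_path, "w") as out:
        out.writelines(heapq.merge(*runs))
    for run, path in zip(runs, tmp_paths):
        run.close()
        os.remove(path)
```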
You don't want to search for an h160 address in a 300 GB text file; you should map the file to binary. Then the file is about 10 GB, and it takes only a second to get a yes/no on whether that h160 is in the list of 300M addresses.
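As an illustration of that model, here is a hedged Python sketch. It assumes one lowercase 40-character hex hash160 per line, already deduplicated and sorted with LC_ALL=C sort -u (so string order matches byte order), packs each line into a fixed 20-byte record, and then binary-searches the packed file through mmap:

```python
import mmap

REC = 20  # a hash160 is 20 raw bytes

def hex_to_bin(sorted_hex_path, bin_path):
    """Pack sorted 40-char hex lines into fixed 20-byte binary records."""
    with open(sorted_hex_path) as f, open(bin_path, "wb") as out:
        for line in f:
            h = line.strip()
            if h:
                out.write(bytes.fromhex(h))

def contains_h160(bin_path, h160: bytes) -> bool:
    """Binary search over the sorted fixed-width records via mmap."""
    with open(bin_path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        lo, hi = 0, len(mm) // REC
        while lo < hi:
            mid = (lo + hi) // 2
            if mm[mid * REC:(mid + 1) * REC] < h160:
                lo = mid + 1
            else:
                hi = mid
        return lo < len(mm) // REC and mm[lo * REC:(lo + 1) * REC] == h160
```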
The early brainflayer GitHub repo had a tool called 'binchk'. Using xed (Linux) you convert the 300 GB file to unique hex, produce the .bin file, and then check candidates with binchk.
Brainflayer needed this because the 512 MB bloom filter it uses only allows about 10M addresses before the false-positive rate goes astronomical; with binchk you can take the false positives and verify whether any of them are real positives.
The bloom filter is super fast and can run on a GPU, but its false-positive rate is high.
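To see why verification is needed, here is a minimal, illustrative bloom filter in Python. This is not brainflayer's actual filter; the sizing and double-hashing scheme here are just assumptions for the sketch:

```python
import hashlib

class Bloom:
    def __init__(self, size_bits, num_hashes):
        self.m = size_bits                       # e.g. a 512 MB filter is 512*8*2**20 bits
        self.k = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Double hashing: derive k bit positions from two halves of SHA-256
        d = hashlib.sha256(item).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item: bytes):
        # Can return True for items never added (false positive), never a false negative
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))
```

A miss is definitive, but a hit only means "probably present", which is why the hits get re-checked against the exact .bin list (the binchk step above).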
There's no point in using binary search on text; just use the model described above.
Today when I scrape the blockchain I get about 300M addresses; after a "sort -u" it's slightly fewer. You also need to run a client watching the memory pool so you can constantly add the new addresses to the bloom filter.
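A rough sketch of that update loop, reusing the Bloom class above. The new_addresses() source is hypothetical; it stands in for whatever client you run against the mempool. New entries also go into a small in-memory set so they can be verified exactly without rebuilding the big sorted .bin file every time:

```python
def track_mempool(bloom, exact_new, new_addresses):
    """new_addresses: hypothetical iterator of h160 bytes seen in the mempool."""
    for h160 in new_addresses:
        bloom.add(h160)        # keeps the fast filter current
        exact_new.add(h160)    # exact side list, folded into the .bin file later
```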
If you're comparing or searching lists with 300M lines of hex, it's much better to use bloom filters and binary search combined; it drops a 2+ hour search to seconds.
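Putting it together, a hedged sketch of the combined lookup, reusing the Bloom class and contains_h160() from the sketches above:

```python
def is_funded(h160: bytes, bloom, bin_path) -> bool:
    """Fast path: bloom filter check. Slow path (only on a bloom hit): exact lookup."""
    if h160 not in bloom:
        return False                      # no false negatives, so a miss is definitive
    return contains_h160(bin_path, h160)  # confirm or reject the possible false positive
```

The filter screens out almost every candidate in memory, so the on-disk binary search only runs for the rare hits, which is what turns a multi-hour scan into seconds.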