Yes. In fact, just 2 days ago (on another forum) I was pointed at the existence of "sort -mu":

-m, --merge
       merge already sorted files; do not sort

This does exactly what you described. I haven't tested it yet, but I assume it's much faster than "regular" sort.

Update: I'm testing this now.
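For the record, here's the shape of what I'm testing (filenames made up; both inputs have to be pre-sorted with the same collation, or the merge won't be correct):

# -m merges the already-sorted inputs in one linear pass (no sort phase),
# -u keeps only the first of each run of equal lines
# LC_ALL=C pins the collation so both sorts and the merge agree
LC_ALL=C sort -mu all_sorted.txt daily_sorted.txt > merged_sorted.txt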
Really curious how that test works out. I do hope it does a little bit more than just merge the files without sorting them.
I do see that for the other list it might be a bit more difficult...
It can be done with awk '!a[$0]++', but I don't have that kind of RAM. I'm not sure how efficient this is for large datasets; it might also run into the problem of having to read 30 quadrillion bytes. Either way, I can't test it due to lack of RAM.
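Spelled out, the command would be just (filename made up):

# prints a line only the first time it's seen, so the chronological order
# of first appearances is preserved; the array a ends up holding every
# distinct address, which is where all the RAM goes
awk '!a[$0]++' all_addresses_chrono.txt > unique_in_order.txt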
I think you wrote that you'd need about 256GB of RAM for that operation, right? Sorry... can't help you out there. However, a Bloom filter might be nice to implement if you have a 'bit' of RAM (a lot less than 256GB).
Some quick math:
1GB: 1 in 13 false positives
2GB: 1 in ~170
3GB: 1 in ~2,200
4GB: 1 in ~28,000
5GB: 1 in ~365,000
6GB: 1 in ~4,700,000
7GB: 1 in ~61,000,000
8GB: 1 in ~800,000,000

Of course this would require some hashing overhead, but that cost is greatly outweighed by not having to loop over your 1.5 billion addresses for every lookup. Unfortunately you'd still have to double-check any positives, because they might be false.
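For anyone who wants to check those numbers: they follow from the usual Bloom filter estimate, assuming n = 1.5 billion entries, 1GB = 10^9 bytes, and the optimal number of hash functions:

p ≈ (1 − e^(−kn/m))^k ≈ 2^(−k),  where k = (m/n)·ln 2

E.g. for the 1GB row: m = 8×10^9 bits, so m/n ≈ 5.33, k ≈ 3.7, and p ≈ 2^(−3.7) ≈ 1 in 13.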
I can think of another option that might work: if I use the sorted list to get the new addresses, I can get those out of the daily update while keeping the chronological order. This way I only have to deal with two 20 MB files which is easy. After this, all I have to do is add them to the total file.
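Something along these lines should do it (filenames made up; comm requires both inputs to be sorted):

# 1) new addresses = lines in the sorted daily file that are missing
#    from the sorted master list (-13 hides the other two columns)
comm -13 master_sorted.txt daily_sorted.txt > new_sorted.txt

# 2) walk the daily file in its original (chronological) order and keep
#    only those new addresses, first occurrence only; at ~20MB each,
#    both files fit in RAM easily
awk 'NR==FNR { new[$0]; next } ($0 in new) && !seen[$0]++' new_sorted.txt daily_chrono.txt >> all_in_order.txt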
This would definitely work and was the solution I originally proposed:
The 'All addresses ever used, without duplicates, in order of first appearance' list could be created in pretty much the same way.
This would be faster than the Bloom filter if there's more than one new address that's already in the list.
By the way, I just checked out (but didn't download) the daily file on blockchair. It's close to 1GB (compressed), but you've mentioned 20MB for new addresses on numerous occasions. I guess there's a lot of cleaning to do there. Could I maybe get one of your (old) daily files? I should be able to throw together some code that makes this work fairly quickly.