Board Project Development
Re: List of all Bitcoin addresses ever used - currently available on temp location
by
JustHereReading
on 16/01/2021, 09:20:19 UTC
Really curious how that test works out. I do hope it does a little more than just merge the files without sorting them.
It merges all lines from both sorted files in sorted order. After several tests (on my old desktop with HDD), these are the relevant results:
Code:
Old process:
time cat <(gunzip -c addresses_sorted.txt.gz) daily_updates/*.txt | sort -uS80% | gzip > test1.txt.gz
real    90m2.883s

Faster new process:
time sort -mu <(gunzip -c addresses_sorted.txt.gz) <(sort -u daily_updates/*.txt) | gzip > test2.txt.gz
real    51m26.730s
The output is the same.
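For anyone who wants to check the "same output" claim on a small scale, here is a minimal sketch. It assumes toy files with made-up names and contents (old_sorted.txt, daily.txt and the addrN lines are hypothetical, not the real data set), and it sets LC_ALL=C so both sort calls use the same byte-order collation.

```shell
#!/bin/sh
# Toy stand-ins for the real files; all names and contents are hypothetical.
export LC_ALL=C   # byte-order collation so every sort call agrees
tmp=$(mktemp -d)

# "Old" list: already sorted and de-duplicated.
printf '%s\n' addr1 addr3 addr5 > "$tmp/old_sorted.txt"
# "Daily update": unsorted, with a duplicate and an overlap with the old list.
printf '%s\n' addr4 addr3 addr2 addr4 > "$tmp/daily.txt"

# Old process: concatenate everything and sort it all again.
resorted=$(cat "$tmp/old_sorted.txt" "$tmp/daily.txt" | sort -u)

# New process: sort only the small update, then merge the two sorted inputs.
sort -u "$tmp/daily.txt" > "$tmp/daily_sorted.txt"
merged=$(sort -m -u "$tmp/old_sorted.txt" "$tmp/daily_sorted.txt")

[ "$merged" = "$resorted" ] && echo "outputs match"
rm -rf "$tmp"
```

The merge only works because both inputs are already sorted in the same order; feeding sort -m an unsorted file would not give a sorted result.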
Interestingly, when I tell sort -m to use up to 40% of my RAM, it actually uses that (even though it doesn't need it), which slows it down by 7 minutes.
Most CPU time is spent compressing the new gzip file.
That's a significant improvement. You could give pigz a try, see: https://unix.stackexchange.com/a/88739/314660. I'm not sure what the drawbacks would be, as I've never tried pigz myself.
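A rough sketch of how that swap could look, untested on the real data set: it uses pigz when it is installed and falls back to gzip otherwise. The sample file name and contents are made up for illustration. pigz writes the standard gzip format, so the existing gunzip step should keep working.

```shell
#!/bin/sh
# Pick pigz when it is installed, otherwise fall back to plain gzip.
if command -v pigz >/dev/null 2>&1; then
    compressor=pigz
else
    compressor=gzip
fi

tmp=$(mktemp -d)
printf '%s\n' addr1 addr2 addr3 > "$tmp/sample.txt"   # hypothetical sample data

# Compress from stdin to stdout, the same role gzip plays at the end of the pipeline.
"$compressor" -c < "$tmp/sample.txt" > "$tmp/sample.txt.gz"

# Read it back with gunzip to confirm the output is ordinary gzip data.
roundtrip=$(gunzip -c "$tmp/sample.txt.gz")
original=$(cat "$tmp/sample.txt")
[ "$roundtrip" = "$original" ] && echo "round trip OK with $compressor"
rm -rf "$tmp"
```

Since pigz spreads compression over multiple cores, any gain would depend on how many cores the machine has and on whether the disk then becomes the bottleneck.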

Quote
I think you wrote that you'd need about 256GB of RAM for that operation, right? Sorry... can't help you out there. However, a Bloom filter might be nice to implement if you have a 'bit' of RAM (a lot less than 256GB).
That's going over my head, and probably far too complicated for something this simple.
Honestly, the Bloom filter was a silly suggestion. It will probably not be a big improvement (if any) compared to your current code.

I use Blockchair's daily outputs to update this, not the daily list of addresses.
See: http://blockdata.loyce.club/alladdresses/daily_updates/ for old daily files.
Thanks! Hoping to do some experimenting soon (if I have the time...)