Board Project Development
Merits 4 from 1 user
Re: List of all Bitcoin addresses ever used - currently available on temp location
by
JustHereReading
on 21/01/2021, 12:41:01 UTC
⭐ Merited by LoyceV (4)
I can think of another option that might work: if I use the sorted list to find the new addresses, I can pull those out of the daily update while keeping the chronological order. This way I only have to deal with two 20 MB files, which is easy. After this, all I have to do is add them to the total file.
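To make the idea above concrete, here is a minimal sketch of how the new addresses could be picked out of a daily update without losing their order. The function name `new_in_order` and its arguments are made up for illustration; it assumes the daily update fits in memory (it is only about 20 MB) while the big sorted list is streamed once.

```python
def new_in_order(sorted_known, daily):
    """Return the addresses in `daily` that do not occur in `sorted_known`.

    `sorted_known` can be any iterable (for example a file object streaming
    the big sorted list); it is read exactly once. `daily` is a list in
    chronological order. Hypothetical helper, not the actual script.
    """
    # Start by assuming every daily address is new ...
    pending = set(daily)
    # ... then drop each one that shows up in the known list.
    for addr in sorted_known:
        pending.discard(addr)

    # Walk the daily list again so the result keeps chronological order,
    # keeping only the first occurrence of each new address.
    seen = set()
    result = []
    for addr in daily:
        if addr in pending and addr not in seen:
            seen.add(addr)
            result.append(addr)
    return result


if __name__ == "__main__":
    known = ["a", "b", "d"]
    today = ["d", "c", "a", "e", "c"]
    print(new_in_order(known, today))
```

The set lookup does not actually depend on the known list being sorted, so this part would work on either version of the big file. The result can then be appended to the chronological total.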

I found a bit of time to write this. Testing it now...

The first results of yesterday's testing look promising. I should go back and double-check whether the outputs are correct, but they seem to be.

I created a VM with 2 cores/2 threads of my Ryzen 3600 (so no hyperthreading, or whatever AMD's equivalent is called) and 512 MB of RAM (only because Ubuntu Server, for which I had an ISO handy, wouldn't boot with 256 MB). To make the numbers mean something, I first benchmarked your current setup:

Code:
time sort -mu <(gunzip -c addresses_sorted.txt.gz) <(sort -u daily_updates/*.txt) | gzip > test2.txt.gz
LoyceV:          real    51m26.730s
JustHereReading: real    40m13.684s

My cores were pushed to ~50%, so unsurprisingly pigz yielded an improvement in my setup. However, I was a little surprised by how large the improvement was.
Code:
time sort -mu <(pigz -dc addresses_sorted.txt.gz) <(sort -u daily-file-long.txt) | pigz > output.txt.gz
real    14m29.865s

And now... for the main event:
Code:
time gunzip -c addresses_sorted.txt.gz | python3 add_new_adresses_sorted.py | gzip > output.txt.gz

real    39m42.574s
The script ran slightly faster than your current setup. In that time it sorted (and compressed) the first list, and also created a text file that can be appended to the second list. Unfortunately I overwrote the results of your current setup, so I haven't verified the output (yet).
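
For anyone wondering what the merging half of such a script could look like, below is a rough sketch. It is not the actual add_new_adresses_sorted.py, just my guess at the general shape: `merge_unique` is a made-up name, and it assumes both inputs are already sorted in plain byte order (what `sort` gives you under LC_ALL=C), because that is how Python compares strings.

```python
import heapq
from itertools import groupby


def merge_unique(sorted_a, sorted_b):
    """Yield the union of two sorted iterables without duplicates.

    Roughly what `sort -mu` does for two inputs. Both inputs are consumed
    lazily, so the big list never has to be held in memory.
    """
    # heapq.merge interleaves the two sorted streams into one sorted stream;
    # groupby then collapses runs of equal values into a single item.
    for value, _group in groupby(heapq.merge(sorted_a, sorted_b)):
        yield value


if __name__ == "__main__":
    old = ["a", "c", "d"]
    new = ["b", "c", "e"]
    for line in merge_unique(old, new):
        print(line)
```

In a pipeline like the one above, `sorted_a` would be `sys.stdin` and `sorted_b` the sorted daily addresses, with each yielded line written to `sys.stdout`. Since both streams are read line by line, memory use stays close to the size of the daily file.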