Post
Topic
Board Project Development
Re: List of all Bitcoin addresses ever used - currently available on temp location
by
LoyceV
on 24/01/2021, 18:32:18 UTC
Since we're sorting as strings it would actually be:
n = 1 10 11 12 13 14 15 16 19 20 5
k = 18 3 6

The whole list would then become:
all = 1 10 11 12 13 14 15 16 18 19 20 3 5 6
Correct:
Code:
echo '1 10 11 12 13 14 15 16 19 20 5 18 3 6' | tr ' ' '\n' | sort -u | tr '\n' ' '
1 10 11 12 13 14 15 16 18 19 20 3 5 6

I should go back and double check if the outputs are correct, but they seem to be.
I haven't had the time yet to find my bug(s).
For comparison, here's the md5sum for the result from my old code (gunzipped):
Code:
md5sum newchronological.txt
4070c03f974da0ee05ea51084d0f04ac  newchronological.txt

Quote
However, I was a little surprised by the amount of improvement.
Using pigz instead of gzip is interesting indeed. It doesn't use multiple cores to decompress, but it's significantly faster anyway:
Code:
time gunzip -c addresses_sorted.txt.gz | md5sum
real    7m27.541s
time pigz -dc addresses_sorted.txt.gz | md5sum
real    4m35.826s

Sorting takes most of the processing time though:
Code:
time sort -u daily_updates/*.txt > daily-file-long.txt
real    0m36.776s

From 40m13 to 14m29 can't be explained by just using 2 instead of 1 core. Maybe it's more efficient anyway, in that case it will also be worth it on a VPS with only one core. I didn't know it's in the default repository, so I've installed it now. The performance difference is less spectacular on my system:
Code:
time sort -mu <(pigz -dc addresses_sorted.txt.gz) <(sort -u daily-file-long.txt) | pigz > pigz_output.txt.gz
real    31m54.478s

As for file size:
Code:
gzip: 17970427375 bytes
pigz: 17990501927 bytes
The 0.1% size difference is negligible.

Quote
And now... for the main event:
Code:
time gunzip -c addresses_sorted.txt.gz | python3 add_new_adresses_sorted.py | gzip > output.txt.gz
real    39m42.574s
Can you post your add_new_adresses_sorted.py?