Since we're sorting as strings, it would actually be:
n = 1 10 11 12 13 14 15 16 19 20 5
k = 18 3 6
The whole list would then become:
all = 1 10 11 12 13 14 15 16 18 19 20 3 5 6
Correct:
echo '1 10 11 12 13 14 15 16 19 20 5 18 3 6' | tr ' ' '\n' | sort -u | tr '\n' ' '
1 10 11 12 13 14 15 16 18 19 20 3 5 6
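For contrast, the same list sorted numerically shows why the lexicographic result above looks "wrong" (this assumes `sort -n` is available, which it is in any POSIX sort):

```shell
# Same list, but sorted numerically instead of lexicographically
echo '1 10 11 12 13 14 15 16 19 20 5 18 3 6' | tr ' ' '\n' | sort -n | tr '\n' ' '
# prints: 1 3 5 6 10 11 12 13 14 15 16 18 19 20
```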
I should go back and double-check whether the outputs are correct, but they seem to be.
I haven't had the time yet to track down my bug(s).
For comparison, here's the md5sum for the result from
my old code (gunzipped):
md5sum newchronological.txt
4070c03f974da0ee05ea51084d0f04ac newchronological.txt
However, I was a little surprised by the amount of improvement.
Using pigz instead of gzip is interesting indeed. It doesn't use multiple cores to decompress, but it's significantly faster anyway:
time gunzip -c addresses_sorted.txt.gz | md5sum
real 7m27.541s
time pigz -dc addresses_sorted.txt.gz | md5sum
real 4m35.826s
Sorting takes most of the processing time though:
time sort -u daily_updates/*.txt > daily-file-long.txt
real 0m36.776s
A drop from 40m13s to 14m29s can't be explained by using 2 cores instead of 1 alone. Maybe pigz is more efficient in general; in that case it would also be worth using on a VPS with only one core. I didn't know it was in the default repositories, so I've installed it now. The performance difference is less spectacular on my system:
time sort -mu <(pigz -dc addresses_sorted.txt.gz) <(sort -u daily-file-long.txt) | pigz > pigz_output.txt.gz
real 31m54.478s
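As a small sanity check of what `-m` does here: it merges inputs that are each already sorted, without a full re-sort, and `-u` drops duplicates across them (file names below are just illustrative):

```shell
# Two already-sorted lists; sort -mu merges them and removes duplicates
printf '1\n3\n5\n' > old_sorted.txt
printf '3\n4\n6\n' > new_sorted.txt
sort -mu old_sorted.txt new_sorted.txt
# prints 1 3 4 5 6, one per line
```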
As for file size:
gzip: 17970427375 bytes
pigz: 17990501927 bytes
The 0.1% size difference is negligible.
And now... for the main event:
time gunzip -c addresses_sorted.txt.gz | python3 add_new_adresses_sorted.py | gzip > output.txt.gz
real 39m42.574s
Can you post your add_new_adresses_sorted.py?