Really curious how that test works out. I do hope it does a bit more than just merge the files without sorting them.
It merges all lines from both sorted files in sorted order. After several tests (on my old desktop with an HDD), these are the relevant results:
Old process:
time cat <(gunzip -c addresses_sorted.txt.gz) daily_updates/*.txt | sort -uS80% | gzip > test1.txt.gz
real 90m2.883s
Faster new process:
time sort -mu <(gunzip -c addresses_sorted.txt.gz) <(sort -u daily_updates/*.txt) | gzip > test2.txt.gz
real 51m26.730s
The output is the same.
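If anyone wants to double-check that for themselves: zcmp (it ships with gzip) decompresses both archives and compares the contents, so something along these lines should confirm it (sketch, not the exact command I ran):
zcmp test1.txt.gz test2.txt.gz && echo identical   # exit status 0 means the decompressed contents match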
Interestingly, when I tell sort -m to use up to 40% of my RAM, it actually uses that much (even though it doesn't need it), which slows it down by 7 minutes.
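(For anyone curious, that memory limit is GNU sort's -S/--buffer-size option; the slower run would look roughly like the command above with the limit added, e.g.:)
time sort -muS40% <(gunzip -c addresses_sorted.txt.gz) <(sort -u daily_updates/*.txt) | gzip > test2.txt.gz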
Most CPU time is spent compressing the new gzip file.
That's a significant improvement. You could give pigz a try; see https://unix.stackexchange.com/a/88739/314660. I'm not sure what the drawbacks would be; I've never tried pigz myself.
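As far as I can tell, pigz reads stdin and writes stdout just like gzip, so only the last stage of your pipeline would change; something like this (untested on my side, and the -p value is only an example):
time sort -mu <(gunzip -c addresses_sorted.txt.gz) <(sort -u daily_updates/*.txt) | pigz -p 4 > test2.txt.gz   # -p sets the number of compression threads
By default pigz uses all available cores, so -p is only needed if you want to limit it.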
I think you wrote that you'd need about 256GB of RAM for that operation, right? Sorry... can't help you out there. However, a Bloom filter might be nice to implement if you have a 'bit' of RAM (a lot less than 256GB).
That's going over my head, and probably far too complicated for something this simple.
Honestly, the Bloom filter was a silly suggestion. It probably wouldn't be much of an improvement (if any) over your current code.
Thanks! Hoping to do some experimenting soon (if I have the time...)