Quoting myself:
Old code:
time cat <(gunzip -c addresses_in_order_of_first_appearance.txt.gz) daily_updates/*.txt | nl | sort -uk2 -S40% | sort -nk1 -S40% | cut -f2 | gzip > newchronological.txt.gz
New code:
time cat <(gunzip -c addresses_in_order_of_first_appearance.txt.gz) <(cat <(cat <(cat daily_updates/*.txt | nl | sort -uk2 -S40% | sort -nk1 -S40% | cut -f2) <(comm -13 <(gunzip -c addresses_sorted.txt.gz) <(sort -u daily_updates/*.txt)) | nl -nln | sort -k2 -S80%) | uniq -df1 | sort -nk1 -S80% | cut -f2) > new.alladdresses_chronological.txt
But it's wrongI tried again:
time cat <(gunzip -c addresses_in_order_of_first_appearance.txt.gz) daily_updates/*.txt | nl | sort -uk2 -S40% | sort -nk1 -S40% | cut -f2 > oldcode.txt
time cat <(gunzip -c addresses_in_order_of_first_appearance.txt.gz) <(cat <(cat <(cat daily_updates/*.txt | nl | sort -uk2 -S40% | sort -nk1 -S40% | cut -f2) <(comm -13 <(gunzip -c addresses_sorted.txt.gz) <(sort -u daily_updates/*.txt)) | nl -nln | sort -k2 -S80%) | uniq -df1 | sort -nk1 -S80% | cut -f2) > newcode.txt
The files were too large to
diff, so I split them in parts (10 million lines each). There are a few differences:
< 3BS6oQKHDwrzz4RC69iAbSV13xpbGZvXLj
---
> 3BS6oQKHDwrzz4SC69iAbSV13xpbGZvXLj
< 17Q7LN9nCmS6HdjkDj3C4MdhduFobGp4hv
---
> 17Q7LN9nCmS6HdjkDk3C4MdhduFobGp4hv
< 1rVH156qu1djPVFGoKaZ29Kw8zEpmh283
9863597a9863597
> 1rVH156qu1djPVFGoKaZ29Kw8zEpmh283
< 3Kw9pkLTLExTd9LZW2qbbNUdZRpUW3JTac
---
> 3Kw9pkLTLExTd9LZW2qbbNUdZSpUW3JTac
< 1Q7NSpgjxDHTTPUkGskTNDioCYw6MQazBG
---
> 1Q7NSpgjyDHTTPUkGskTNDioCYw6MQazBG
< 331xujHAg6AGvKzPwUKZ9AJxukaemCXeRw
8496039c8496038
< 3PLoa4ccMdxyGY6mAEStSu45xwqdRftd1b
> 3PLoa4ccMdxyGY6mAEStSu55xwqdRftd1b
10000000a10000000
> 3B92y4bFFPZvjviNhtLWeBKoYXmHVwr3CD
< 3B92y4bFFPZvjviNhtLWeBKoYXmHVwr3CD
7735278a7735278
> 331xujHAg6AGvKzPwUKZ9AJxukaemCXeRw
Let's highlight this one:
< 1Q7NSpgjxDHTTPUkGskTNDioCYw6MQazBG
---
> 1Q7NSpgjyDHTTPUkGskTNDioCYw6MQazBG
The first one (with
x)
is correct.
I have no idea what causes this. I'm now checking if I can reproduce the exact same data change, or that it's caused by hardware failure.