It can be done with
awk '!a[$0]++', but I don't have that kind of RAM. I'm not sure how efficient this is for large datasets; it might also run into the problem of having to read
30 quadrillion bytes. Either way, I can't test it due to lack of RAM... it's not that expensive to use it for only a couple of hours per month.
You are on AWS, right? Why not have your sponsor upgrade your instance to a higher class for a few hours? That's the beauty of on-demand processing.
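For reference, here is a minimal sketch of what the awk '!a[$0]++' one-liner mentioned above does, run on a tiny made-up input. The wrapper name dedupe_lines and the array name seen are only for illustration and are not from the thread. The array ends up with one entry per distinct line, which is why memory use grows with the number of unique lines rather than with the file size.

```shell
# dedupe_lines is a hypothetical wrapper name used only for this sketch.
dedupe_lines() {
    # seen[$0]++ evaluates to 0 (false) the first time a line appears, so
    # !seen[$0]++ is true only on first sight and awk prints the line
    # (printing is the default action). Later copies are skipped.
    awk '!seen[$0]++' "$@"
}

# Tiny made-up input: duplicates are dropped, first-seen order is kept.
printf 'b\na\nb\nc\na\n' | dedupe_lines
# prints:
# b
# a
# c
```

Each distinct line is stored once as an array key, so a bigger instance only helps if the set of unique lines fits in its RAM.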
