It's madness to check everything against everything. If I were chasing this (which I'm not), I'd compute a checksum for each post's text, sort all the checksums, scan for duplicates, and use those to trace back to the original posts. That way you don't have to compare every post against every other post; you catch them all at once in a single fast pass.
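Something like this minimal sketch, say in Python (post IDs and the light whitespace/case normalization are my assumptions; grouping in a dict gets you the same result as sort-and-scan in one pass):

```python
import hashlib
from collections import defaultdict

def find_duplicate_posts(posts):
    """Group posts by a checksum of their text; any checksum shared
    by two or more posts is a candidate duplicate cluster."""
    buckets = defaultdict(list)
    for post_id, text in posts:
        # Normalize lightly so trivial whitespace/case edits still match.
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        buckets[digest].append(post_id)
    # Keep only checksums that more than one post hashes to.
    return {h: ids for h, ids in buckets.items() if len(ids) > 1}

if __name__ == "__main__":
    posts = [
        (1, "This vulnerability allows remote code execution."),
        (2, "this vulnerability   allows remote code execution."),
        (3, "A completely different report."),
    ]
    print(find_duplicate_posts(posts))  # posts 1 and 2 collide
```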
Of course there'll be many false positives, and the slightest change in a word would slip through, but this case would have been caught.
You might be underestimating false positives: they're a killer when you're dealing with 50 million posts. A 1% false-positive rate means manually reviewing 500,000 posts, which is infeasible. So you probably want to tune the parameters until you have very few false positives, even if that means letting some (or many) plagiarists slip through, so you can still catch plenty of others without wading through near-identical bounty reports and other crap.
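One way to get those knobs, as a rough sketch: instead of a single whole-post checksum, hash fixed-length word windows and only flag a pair of posts that shares several of them. The window length and the overlap threshold below are made-up parameters you'd tune against your own data for precision over recall:

```python
import hashlib
from collections import defaultdict

SHINGLE_WORDS = 8   # words per window; longer windows = stricter matches
MIN_SHARED = 3      # windows two posts must share before we flag the pair

def shingles(text, k=SHINGLE_WORDS):
    """Hashes of every k-word window in the text."""
    words = text.lower().split()
    return {
        hashlib.sha256(" ".join(words[i:i + k]).encode()).hexdigest()
        for i in range(max(len(words) - k + 1, 1))
    }

def flag_pairs(posts, min_shared=MIN_SHARED):
    """Map each window hash to the posts containing it, then count
    how many windows each pair of posts has in common."""
    owners = defaultdict(set)
    for post_id, text in posts:
        for h in shingles(text):
            owners[h].add(post_id)
    pair_counts = defaultdict(int)
    for ids in owners.values():
        ids = sorted(ids)
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pair_counts[(ids[i], ids[j])] += 1
    # Raising min_shared trades recall for precision: fewer flagged
    # pairs to review by hand, at the cost of missing light copying.
    return [pair for pair, n in pair_counts.items() if n >= min_shared]
```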