Post
Topic
Board Meta
Re: HOLY FUCK !!!! THE SCAMMER IS ALSO A PLAGIARIST OF GIANT MAGNITUDE - SIG BAN NO
by
PrimeNumber7
on 22/05/2020, 15:14:17 UTC
To demonstrate the cost of checking for plagiarism:
If you check every group of 5 consecutive words in each post in a user's post history against every group of 5 consecutive words in every other post that exists:
Assume the userbase has made 50 million posts, 40 million of which (80%) have at least 5 words, and that the average post length is 15 words.
Each post you check for plagiarism would cause you to make 11 queries (a 15-word post contains 15 - 5 + 1 = 11 groups of 5 consecutive words), and each query would check against 440 million rows in your database (40 million posts x 11 groups each). So each post you check would require 11 x 440 million = 4.84 billion row comparisons.
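The arithmetic above can be reproduced in a few lines of Python. This is only a back-of-the-envelope sketch; the variable names are made up for illustration, and the figures are the assumptions stated above, not measured forum data.

```python
# Assumed figures from the estimate above (not measured data).
ELIGIBLE_POSTS = 40_000_000   # posts with at least 5 words
AVG_WORDS = 15                # assumed average post length
WINDOW = 5                    # consecutive words per group

# A post of n words contains n - WINDOW + 1 groups of WINDOW consecutive words.
groups_per_post = AVG_WORDS - WINDOW + 1

# One database row per group, for every eligible post.
rows_in_db = ELIGIBLE_POSTS * groups_per_post

# Checking one post means one query per group, each scanning every row.
comparisons_per_post = groups_per_post * rows_in_db

print(groups_per_post, rows_in_db, comparisons_per_post)
```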
It's madness to check everything against everything. If I were chasing this (which I'm not), I'd create a checksum for each string of text, sort all checksums, search for duplicates, and use those to trace back to the original posts. This way you don't have to iterate through all posts, but catch them all at once in a very fast process.
Of course there'll be many false positives and the slightest change in a word would be missed, but this case would have been caught.
What you are describing would only detect entire posts that are copied. If someone added a single word, the post would not be flagged as a duplicate. I have also noticed that some plagiarists like to copy parts of posts/content, sometimes from different sources. Also, sorting as you describe is just another way of comparing posts against each other: those comparisons are how everything gets put in the correct order.
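To illustrate the point about whole-post checksums, here is a minimal Python sketch. The helper names (`post_checksum`, `shingles`) and the two sample posts are invented for the example, and SHA-256 simply stands in for whatever checksum would actually be used.

```python
import hashlib

def post_checksum(text: str) -> str:
    """Checksum of the entire post after normalising case and whitespace."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

def shingles(text: str, size: int = 5) -> set:
    """Every group of `size` consecutive words in the post."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

# Hypothetical posts: the copy only adds one word at the front.
original = "bitcoin is a decentralised digital currency without a central bank"
copied = "honestly bitcoin is a decentralised digital currency without a central bank"

# The whole-post checksums differ, so a checksum comparison misses the copy.
same_checksum = post_checksum(original) == post_checksum(copied)

# The 5-word groups still overlap, so a group-level comparison catches it.
shared = shingles(original) & shingles(copied)

print(same_checksum, len(shared))
```

The single added word changes the whole-post checksum, while every 5-word group from the original still appears in the copy.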

You could index parts of the text (for example the first word or the first letter of the first word) to reduce runtime, but checking each post against every other post is still expensive.
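A rough Python sketch of the first-word idea, under the assumption that every 5-word group is stored in a bucket keyed by its first word. The function names and sample posts are hypothetical; the sketch only shows that a lookup scans one bucket instead of every row, not how a real database would do it.

```python
from collections import defaultdict

def shingles(text, size=5):
    """List every group of `size` consecutive words in the text."""
    words = text.lower().split()
    return [tuple(words[i:i + size]) for i in range(len(words) - size + 1)]

def build_first_word_index(posts, size=5):
    """Bucket every word group from every post by its first word."""
    index = defaultdict(list)
    for post_id, text in posts.items():
        for group in shingles(text, size):
            index[group[0]].append((post_id, group))
    return index

def find_matches(text, index, size=5):
    """Compare each group only against the bucket sharing its first word."""
    matches = set()
    comparisons = 0
    for group in shingles(text, size):
        for post_id, other in index.get(group[0], []):
            comparisons += 1
            if group == other:
                matches.add(post_id)
    return matches, comparisons

# Hypothetical stored posts and a suspect post that copies part of post 1.
posts = {
    1: "the quick brown fox jumps over the lazy dog",
    2: "a completely unrelated post about mining hardware and fees",
}
index = build_first_word_index(posts)
matches, comparisons = find_matches("look the quick brown fox jumps right here", index)

print(matches, comparisons)
```

With real text, buckets for common first words such as "the" or "I" would be very large, so the total work stays high even though each lookup skips most rows.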