This could be an interesting exercise if done on, say, the 100 most prolific and discernibly original posters on the forum. I think I will make the cut in at least the top 200, if not the top 100. Anyway, that's an idea right there for the OP to check "Where to draw the line".
Go for it if you want (I don't know how good your researching skills are), or maybe someone like LoyceV or one of the statistics gurus will do it. If someone does do it though, I do hope it doesn't result in good members getting banned--unless they obviously deserve to be.
If I had anywhere near the skills needed to do this, I would probably be a software dev myself and not installing propulsion equipment in train engines, LOL. It's more of an idea for someone with the dev skills to do. I can then give myself one of those pompous managerial designations like "Research design consultant" or something.
A general algorithm would probably parse through all of the post history and compare it with everyone else's in snippets of 6 words each, finding a match percentage. Lots of enumeration, which I was never good at. Then you'd have to root out the edge cases like quotes and references. It can be done and could even be an interesting open source project.
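For anyone who wants to try the snippet idea, here is a minimal sketch of the matching step. The names `snippets` and `match_percentage` are made up for illustration, and the word splitting (lowercase, letters, digits and apostrophes only) is an assumption, not something anyone in the thread specified.

```python
import re

SNIPPET_WORDS = 6  # snippet length suggested in the post above


def snippets(text, size=SNIPPET_WORDS):
    """Return the set of every run of `size` consecutive words in `text`."""
    # Assumption: case and punctuation are ignored when comparing words.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def match_percentage(post, other):
    """Percentage of `post`'s snippets that also occur in `other`."""
    own = snippets(post)
    if not own:  # fewer than `size` words, so nothing to compare
        return 0.0
    return 100.0 * len(own & snippets(other)) / len(own)
```

Note that the percentage is measured against the first post only, so a short post copied into a long one still scores high. Where to set the cutoff is exactly the "where to draw the line" question.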
EDIT: Maybe if we had a sort of hackathon bounty for building this and some other tools on a platform like Gitcoin. Would be great to see a Bitcointalk Tribe in Gitcoin. See, there I go giving away "ideas" again. What say ya? @Theymos
It is not terribly difficult to remove things such as quotes from posts. Markup (things such as bold and links) can also be removed trivially.
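As a rough sketch of that cleanup step, assuming posts are stored as BBCode with `[quote]...[/quote]` blocks and tags like `[b]` and `[url=...]` (the function name and both regexes are illustrative, not anything the forum software provides):

```python
import re

# Innermost [quote]...[/quote] block: no further opening quote tag inside it.
QUOTE_RE = re.compile(
    r"\[quote[^\]]*\](?:(?!\[quote).)*?\[/quote\]",
    re.IGNORECASE | re.DOTALL,
)
# Any remaining tag such as [b], [/b], [url=...] or [/url].
TAG_RE = re.compile(r"\[/?[a-zA-Z][^\]]*\]")


def strip_quotes_and_markup(raw):
    """Drop quoted text entirely, then remove tags but keep the text inside them."""
    previous = None
    while previous != raw:  # repeat so nested quotes are peeled from the inside out
        previous = raw
        raw = QUOTE_RE.sub(" ", raw)
    raw = TAG_RE.sub("", raw)
    return " ".join(raw.split())  # collapse leftover whitespace
```

One caveat: the tag pattern also removes ordinary bracketed words such as `[sic]`, which is probably harmless for plagiarism checking but worth knowing about.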
Splitting up the text of posts into sets of 6 words will be expensive, but is doable. A text with n words will have n - 5 sets of 6 consecutive words (one starting at each word except the last five), provided n is at least 6.
The problem is that it is really not possible to check every new post for plagiarism because the cost of checking an additional post will grow for every additional post written. For example, if there are 100 posts that exist on the forum, the cost of checking a new post against all existing posts is 100 units. Once there are 1000 posts on the forum, the cost of checking a single new post against all existing posts is 1000 units. For each additional post made, it costs one additional unit to check a single additional post. This is obviously not sustainable.
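To put numbers on that, here is a small sketch using the same model as above, where comparing a new post against one existing post costs one unit. The function names are just for illustration.

```python
def cost_of_next_check(existing_posts):
    """Units of work to compare one new post against every existing post."""
    return existing_posts


def cumulative_cost(total_posts):
    """Total units spent if every post had been checked when it was written."""
    # The k-th post is compared against the k - 1 posts that came before it.
    return sum(cost_of_next_check(k - 1) for k in range(1, total_posts + 1))
```

Under this model `cumulative_cost(100)` is 4950 units and `cumulative_cost(1000)` is 499500 units, so ten times as many posts costs roughly a hundred times as much in total.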