I am mostly interested in the tool - the bot - used to actually report the plagiarists.
Who runs it is mostly irrelevant, unless that person talks about the tool he/she/it uses.
I'm not sure if my "prime suspect" has been mentioned yet, but feel free to review an older thread on the topic:
https://bitcointalk.org/index.php?topic=5032322
I am not sure who your "prime suspect" is, but I found a "confession" in that thread:
I'm experimenting with some NLP techniques for plagiarism detection and the results are promising although scalability is a bit of an issue. Currently working just on comparing Bitcointalk posts (not to outside sources).
I experimented with n-grams a little bit and couldn't find a good value for n. Low n yields too many false positives, high n doesn't detect spinners, etc. So I'm using a mixture of algorithms and base the decision on the pattern of the results of those algorithms. For example: if the similarity of two texts using algorithm A is 70%, then union/intersect/otherwise manipulate the texts and run algorithm B; if that scores 90%, run algorithm C to eliminate false positives. Made-up numbers, but you get the idea. Works ok-ish, but as I mentioned it doesn't scale well and I need to do more testing on larger samples.
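To make the n-gram trade-off and the staged decision a bit more concrete, here is a rough Python sketch. It is not the quoted poster's tool: the post never names algorithms A, B and C, so word n-gram Jaccard similarity and the standard library's difflib.SequenceMatcher are stand-ins I picked, and every threshold and function name below is made up for illustration.

```python
from difflib import SequenceMatcher


def word_ngrams(text, n):
    """Return the set of word n-grams (as tuples) found in text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def ngram_similarity(text_a, text_b, n):
    """Jaccard similarity of the word n-gram sets of two texts."""
    return jaccard(word_ngrams(text_a, n), word_ngrams(text_b, n))


def looks_copied(text_a, text_b):
    """Hypothetical three-stage cascade; all thresholds are assumptions."""
    # Stage A (stand-in): cheap single-word overlap as a coarse filter.
    if ngram_similarity(text_a, text_b, 1) < 0.5:
        return False
    # Stage B (stand-in): stricter trigram overlap.
    if ngram_similarity(text_a, text_b, 3) < 0.6:
        return False
    # Stage C (stand-in): character-level ratio to weed out false positives.
    ratio = SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio()
    return ratio >= 0.8


original = "the quick brown fox jumps over the lazy dog"
spun = "the fast brown fox leaps over the lazy dog"

print(ngram_similarity(original, spun, 1))  # single words still overlap a lot
print(ngram_similarity(original, spun, 3))  # trigrams mostly break on swapped words
print(looks_copied(original, original))
```

With two words swapped, the single-word score stays fairly high while the trigram score drops sharply, which is the "high n doesn't detect spinners" problem in miniature. Comparing every post against every other post this way is quadratic in the number of posts, which fits the scalability complaint.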