Once you get it running to some meaningful extent, I'd suggest posting the scope you're working on (set of users, threads) in iasenko's thread here, so that we don't duplicate the effort:
https://bitcointalk.org/index.php?topic=4720640.0
I'm experimenting with some NLP techniques for plagiarism detection and the results are promising although scalability is a bit of an issue. Currently working just on comparing Bitcointalk posts (not to outside sources).
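To give a rough idea of what post-to-post comparison can look like, here is a minimal sketch using word shingles and Jaccard similarity. This is a generic textbook approach and not necessarily what my scripts do; the function names, the shingle size `k=3`, and the `0.5` threshold are all assumptions picked for illustration.

```python
import re
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text (punctuation dropped)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        # Short posts become a single shingle (or nothing if there are no words).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_pairs(posts, threshold=0.5, k=3):
    """Compare every pair of posts and return (id_a, id_b, score) at or above threshold.

    `posts` is a hypothetical dict mapping a post id to its text.
    """
    sets = {pid: shingles(body, k) for pid, body in posts.items()}
    pairs = []
    for p, q in combinations(sorted(sets), 2):
        score = jaccard(sets[p], sets[q])
        if score >= threshold:
            pairs.append((p, q, score))
    return pairs
```

Comparing every pair like this grows quadratically with the number of posts, which is one way a scalability problem shows up. A common way around it is MinHash with locality-sensitive hashing, so that only likely matches get compared.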
Perhaps it's better not to publicize too many specific details on how the scripts work, as that might inadvertently help bot-farmers. I wish there were a section of the forum designated for spam-busting efforts; I believe hilarious has suggested this.
Good point, I'll post in that thread once it's completed. I'm also hoping to hook it into Bitcointalk and have it automatically update threads, but we'll see. Very much in the planning stage TBH.
That's the thing, I've been debating closed source vs open source and the perks of each. What I might end up doing is creating a repo for these scripts but keeping it private (I have a GitHub account I can do this with), and then just inviting users who wish to contribute. I might leave this as an "if there's interest" thing, but thanks for the flag on the potential for abuse if I open source it. I didn't clue into that until now.
+1 for the forum section for spam busting, it'd be easier to keep lists of what's been reported there.
If you're working on plagiarism detection already, I'll probably work on multiple account detection first. That said, having several detection bots from different developers, each with its own set of algorithms, probably isn't a bad idea (it will make detection harder for the spam bots to evade).