Once you get it running to some meaningful extent I would suggest to post the scope you're working on (set of users, threads) in iasenko's thread here:
https://bitcointalk.org/index.php?topic=4720640.0So that we don't duplicate the effort.
I'm experimenting with some NLP techniques for plagiarism detection and the results are promising although scalability is a bit of an issue. Currently working just on comparing Bitcointalk posts (not to outside sources).
Perhaps it's better not to publicize too many specific details on how the scripts work - might inadvertently help bot-farmers. I wish there was a section of the forum designated for spam-busting efforts, I believe hilarious has suggested this.