P.S: If any mods/admins aren't ok with me scraping the site, by all means let me know. I'd obviously write the bot/script in such a way that it doesn't slam the server & only sends a limited number of requests per second/minute (more or less like a Google bot). I know other users have written similar bots/scraping tools, so I thought it'd be ok. But if not, just let me know.
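For what it's worth, the "don't slam the server" part can be quite small. Below is a rough sketch only, assuming a single-threaded scraper that calls `wait()` before every request; the `Throttle` class and its parameters are made-up names for illustration, not taken from any existing bot.

```python
import time


class Throttle:
    """Allow at most one request every `min_interval` seconds (hypothetical helper)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous request; None before the first one

    def wait(self):
        """Block until enough time has passed since the previous request."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` right before each page fetch keeps the request rate at or below one per `min_interval` seconds, whatever the rest of the script does.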

I've recently started scraping recent posts. My script saves the first unedited version of each post in raw HTML, excluding quotes. Your post, for example, looks like this:
Initscri
186520
45883661
Other /
Meta /
"Multiple Accounts" / Copy-pasta detection scripts/bots
Hey all,
I've been planning to write a few scripts relating to BitcoinTalk. It's been on my "developer bucket list" to write something to detect users who have multiple accounts. In order to accomplish this, and have a reliable list, I'd have to determine some logic in order to base this.
I have a few things in mind:
Index/scrape posts &:
For multiple account detection:
- Look for same address usage between posts (BTC, ETH, etc)
- Look for same account usage between posts (telegram, skype, etc)
- [other ideas here]
For copy-pasta detection:
- write a script to determine copy-pasta from accounts by matching the text of posts to similar text of other sites in order to return a probability percentage of the user copy/pasting (including src for manual analysis)
- [other ideas here]
Results would be posted here for mods to look at (if need be), or just to keep a record of such a connection. I'd also probably link to results in this topic.
I wanted to post this thread in advance to see if anyone else had any other logic / ideas in mind for these scripts/bots? This will solely be when I have the time to create this (which won't be for a couple of weeks), so I thought I'd post this well in advance.
Thanks!
The first line is your username, followed by your userID, the post number, some raw headers, and finally the post itself on the last line.
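If you want to read these records from a script, something along these lines should do. It's a minimal sketch, assuming one record per string with fields separated by newlines and the whole post body on the final line; `ScrapedPost`, `parse_record` and the field names are hypothetical, not part of my scraper.

```python
from dataclasses import dataclass


@dataclass
class ScrapedPost:
    username: str
    user_id: int
    post_id: int
    headers: list      # raw header lines (board path, topic title)
    body_html: str     # the post itself, raw HTML on a single line


def parse_record(text):
    """Split one scraped record into its fields (layout is an assumption)."""
    lines = text.strip("\n").split("\n")
    if len(lines) < 4:
        raise ValueError("record needs username, userID, post number and a body")
    return ScrapedPost(
        username=lines[0].strip(),
        user_id=int(lines[1]),
        post_id=int(lines[2]),
        headers=[h.strip() for h in lines[3:-1]],  # everything between the IDs and the body
        body_html=lines[-1],
    )
```

Everything between the post number and the last line is kept as a list of headers, so the parser doesn't care how many header lines a record has.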
In compressed format, it takes about 10 MB per day. Instead of scraping the same data again, I could easily send it to you, and a few days' worth of data should be enough for you to start testing. If interested, let me know.
You'll be in for a surprise if you start looking for plagiarism! I sometimes sort a day's worth of posts and search for exact duplicates. This typically gives a few dozen posts that are posted a few dozen times. Most of them are spam, many of them are just spammers posting the same useless "proof of authentication" and more crap like that.
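The exact-duplicate check is about as simple as it gets. Here's a rough sketch of the idea, assuming the post bodies for one day are already loaded into a list of strings; `find_exact_duplicates` and `min_count` are names I made up for this example, and it counts matches rather than literally sorting the posts.

```python
from collections import Counter


def find_exact_duplicates(bodies, min_count=2):
    """Return (body, count) pairs for bodies seen at least `min_count` times."""
    counts = Counter(bodies)
    dupes = [(body, n) for body, n in counts.items() if n >= min_count]
    # Most-repeated first; ties broken alphabetically so the order is stable.
    dupes.sort(key=lambda item: (-item[1], item[0]))
    return dupes
```

This only catches byte-for-byte identical posts, so a single changed character is enough to slip past it.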
Detecting the text spinners will be a whole different level!