February 22, 2020: All updates are now live!
Ever wanted to see who's lying when a post has been edited or deleted? I may be able to help!
I archive most posts within seconds after they are created (before any edits). I started this data collection around the time I started this topic. All data I have since then is available online.
I also have older posts: I've saved (most) unedited posts (6.2 million posts) from September 12, 2018, until the start of this topic. This data has not been added to this topic, and I can't really add it because I tried to remove quotes and that has some bugs. You can ask me to dig up unedited data when needed.
Viewing unedited/deleted posts
How to use it
- Find the msgID, userID or topicID you need. Let's use msgID 51902990.
- Remove the last 4 digits from the msgID to get the directory name (if there are fewer than 4 digits, use 0): 5190.
- Put everything together behind the archive URL (http://loyce.club/archive/posts/ for msgIDs) and add ".html": http://loyce.club/archive/posts/5190/51902990.html.
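The steps above can be sketched in a few lines of Python. This is only an illustration: the function name `archive_url` is made up, and the base URL is taken from the example link, so it is only known to hold for msgIDs. Removing the last 4 digits is done with integer division, which also gives 0 when nothing remains.

```python
# Base URL as it appears in the example link; assumed to apply to msgIDs only.
ARCHIVE_BASE = "http://loyce.club/archive/posts/"


def archive_url(msg_id: int) -> str:
    """Build the archive URL for a post from its msgID (hypothetical helper)."""
    if msg_id < 0:
        raise ValueError("msgID must not be negative")
    # Dropping the last 4 digits is the same as integer division by 10,000.
    # IDs with too few digits end up in directory 0.
    directory = msg_id // 10_000
    return f"{ARCHIVE_BASE}{directory}/{msg_id}.html"


print(archive_url(51902990))
# http://loyce.club/archive/posts/5190/51902990.html
```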
Details
- Files are stored with their msgID, userID or topicID as file name. I remove the last 4 digits to create the directory name. Each directory contains up to 10,000 HTML files. Use CTRL-F to find what you're looking for.
- I don't scrape hidden boards (such as Investigations).
- I don't keep post titles.
- I save raw HTML, including quotes.
- If I run out of disk space, I might create compressed archives per 10,000 posts.
- Although I plan to preserve all data, I make no guarantees. Feel free to archive posts.
- My current (sponsored) webhost has enough storage space for years to come.
- All scrape-times use Amsterdam time (CET).
- Usually, I capture at least 99.95% of all posts. Server or internet connection problems can severely reduce this.
Examples
Older posts
Sneak preview: http://loyce.club/archive/oldposts/
How to use:
- Find the msgID you need. Let's use 28228
- Remove the last 5 digits from the msgID to get the directory name (if there are 5 digits or fewer, so nothing remains, use 0): 0
- Replace the last 2 digits of the msgID with xx, and add .html (if there are fewer than 5 digits, use 0xx): 282xx.html
- Add "#msg" and the msgID: #msg28228
- Put everything together and go to http://loyce.club/archive/oldposts/0/282xx.html#msg28228
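The same recipe as a short Python sketch. The function name `old_post_url` is made up for illustration. One assumption to be aware of: the file name is built as "msgID without its last 2 digits, followed by xx", and the "0xx" case is read as applying only when no digits remain in front of the xx. The step above words that condition differently ("fewer than 5 digits"), so check small msgIDs against the actual directory listing.

```python
OLDPOSTS_BASE = "http://loyce.club/archive/oldposts/"


def old_post_url(msg_id: int) -> str:
    """Build the old-posts archive URL for a msgID (hypothetical helper)."""
    if msg_id < 0:
        raise ValueError("msgID must not be negative")
    # Directory: drop the last 5 digits (0 when nothing remains).
    directory = msg_id // 100_000
    # File: replace the last 2 digits with "xx".
    # Assumption: "0xx" is used only when no digits remain in front.
    page = f"{msg_id // 100}xx.html"
    # The anchor jumps to the individual post inside the page.
    return f"{OLDPOSTS_BASE}{directory}/{page}#msg{msg_id}"


print(old_post_url(28228))
# http://loyce.club/archive/oldposts/0/282xx.html#msg28228
```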
Limitations
- Currently, the first 6.1 million posts are available.
- I'll scrape the first 5.21 million topics and all posts in there.
- That means I'll archive 53.36 million posts. This partially overlaps with my scraper for new posts.
- This is a one-time thing: I won't update it with newer posts (I scrape unedited versions for those).
- The time "scraped on" is Amsterdam time.
If no username is mentioned, it's either "Anonymous" or "random". I forgot those existed when I started scraping, and it's not important enough to start over.
If anything goes wrong, let me know here.
See [overview] LoyceV's useful data on Bitcointalk for more of my forum-related topics