Re: 35M posts! View unedited/deleted posts (search per post, per user or per topic)
by LoyceV on 05/06/2020, 19:24:27 UTC
Millions more posts added:
I have now archived the first 35.5 million posts, all available online. This includes posts made in topics created until April 24, 2018, and currently takes up 43 GB.
Example: my first post!

See this quote on how to use it:
Sneak preview: http://loyce.club/archive/oldposts/
How to use:
  • Find the msgID you need. Let's use 28228
  • Remove the last 5 digits from the msgID to get the directory name (if the msgID has 5 digits or fewer, use 0): 0
  • Replace the last 2 digits of the msgID with xx, and add .html (if there are fewer than 5 digits, use 0xx): 282xx.html
  • Add "#msg" and the msgID: #msg28228
  • Put everything together and go to http://loyce.club/archive/oldposts/0/282xx.html#msg28228
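
The steps above can be sketched in a few lines of Python. This is only an illustration, not the script behind the archive: the function name archive_url is made up, and the handling of msgIDs shorter than 5 digits is a guess (the rule above can be read in more than one way), so check such URLs against the actual directory listing.

```python
BASE_URL = "http://loyce.club/archive/oldposts/"


def archive_url(msg_id: int) -> str:
    """Build the archive URL for a msgID, following the steps listed above.

    Hypothetical helper: short msgIDs (fewer than 5 digits) are handled
    by assumption, since the original rule for them is ambiguous.
    """
    if msg_id < 1:
        raise ValueError("msgID must be a positive integer")
    digits = str(msg_id)
    # Directory: drop the last 5 digits; if nothing is left, use "0".
    directory = digits[:-5] or "0"
    # File name: replace the last 2 digits by "xx"; if nothing is left
    # in front of them, assume the "0xx" form.
    prefix = digits[:-2] or "0"
    return f"{BASE_URL}{directory}/{prefix}xx.html#msg{digits}"


if __name__ == "__main__":
    print(archive_url(28228))
```

For the msgID used above, this gives the same URL as the last step: http://loyce.club/archive/oldposts/0/282xx.html#msg28228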

Limitations
  • Currently, the first 2.1 million posts are available.
  • I'll scrape the first 5.21 million topics and all posts in there.
  • That means I'll archive 53.36 million posts; this partially overlaps with my scraper for new posts.
  • This is a one-time thing: I won't update it with newer posts (I scrape unedited versions for those).
  • The time "scraped on" is Amsterdam time.

If no username is mentioned, it's either "Anonymous" or "random". I forgot those existed when I started scraping, and it's not important enough to start over.

This bug is not fixed yet:
I found a bug (which I'm posting here as a reminder to myself): Posts on the עברי (Hebrew) board don't show up. Example: this post is missing from the archive, even though it exists.
I'll see if I can add them later. I think it has something to do with the right-to-left writing; even selecting text on that board doesn't work as expected.
Update: عربية (Arabic) has the same problem.
I'll re-scrape these boards after finishing scraping all posts.



Todo:
When I have the time, I'll create something to classify all posts in a requested topic as "unedited", "deleted and archived", "edited within 10 minutes" or "edited after 10 minutes". But that will only work for one topic at a time; you can't easily check all posts.
Another Todo: I should also create this per user; that could prove very useful. Deleting a post would make that post stand out more!
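
To make the four categories from the first Todo concrete, here is a rough sketch of how a single post could be classified. Nothing like this exists yet. The function name, its inputs (post time, whether the post still exists on the forum, and the time of its last edit) and the choice to count exactly 10 minutes as "within" are all assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed cut-off taken from the category names above.
EDIT_WINDOW = timedelta(minutes=10)


def classify_post(posted_at: datetime,
                  exists_now: bool,
                  last_edit_at: Optional[datetime]) -> str:
    """Return one of the four labels for a post (hypothetical helper).

    posted_at:    when the post was made
    exists_now:   False if the post is gone from the forum but is archived
    last_edit_at: time of the last edit, or None if never edited
    """
    if not exists_now:
        return "deleted and archived"
    if last_edit_at is None:
        return "unedited"
    # Assumption: an edit exactly 10 minutes after posting counts as "within".
    if last_edit_at - posted_at <= EDIT_WINDOW:
        return "edited within 10 minutes"
    return "edited after 10 minutes"
```

Running this over every post in one topic, or over every post by one user, would give the per-topic and per-user overviews described above.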