If it is not a secret, how much disk space is needed for all those millions of posts?
I'm currently using 54 GB for loyce.club, storing 4.2 million files.
And is there a way to use some compression?
I mainly store HTML files, so it would indeed be great if a web browser could just open index.html.gz to largely reduce disk space consumption, but I just tested it and my browser doesn't handle it.
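For what it's worth, the browser side isn't really the problem: browsers happily decompress gzip as long as the server announces it with a Content-Encoding: gzip header. The missing piece is a server that maps a request for index.html to a stored index.html.gz. A minimal sketch of that idea (purely an illustration, not what loyce.club or its webhost runs):
Code:
import http.server
import os

class GzipHandler(http.server.SimpleHTTPRequestHandler):
    def send_head(self):
        # If the requested .html file only exists as a pre-compressed
        # .html.gz, serve that and tell the browser it's gzip-encoded.
        path = self.translate_path(self.path)
        if path.endswith(".html") and not os.path.exists(path) and os.path.exists(path + ".gz"):
            f = open(path + ".gz", "rb")
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Encoding", "gzip")
            self.send_header("Content-Length", str(os.fstat(f.fileno()).st_size))
            self.end_headers()
            return f
        return super().send_head()

if __name__ == "__main__":
    http.server.HTTPServer(("", 8000), GzipHandler).serve_forever()
Most web servers can do this out of the box (nginx's gzip_static module, for example), so it would mainly be a matter of compressing the files once and adjusting the server config.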
It took longer than I wanted due to a lack of time, but I've now added live updates for posts per user and per topic.
Viewing unedited/deleted posts

How to use it
- Find the msgID, userID or topicID you need. Let's use msgID 51902990.
- Remove the last 4 digits from the msgID to get the directory name (if nothing remains, use 0): 5190.
- Put both together behind the base URL (above) and add ".html": http://loyce.club/archive/posts/5190/51902990.html.
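In code, those steps come down to this (just an illustrative sketch; the function name is my own):
Code:
def post_url(msg_id: int) -> str:
    # Drop the last 4 digits to get the directory name; if nothing
    # remains (IDs of 4 digits or fewer), the directory is "0".
    directory = str(msg_id)[:-4] or "0"
    return f"http://loyce.club/archive/posts/{directory}/{msg_id}.html"

print(post_url(51902990))  # http://loyce.club/archive/posts/5190/51902990.html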
Details
- Files are stored with their msgID, userID or topicID as the file name. I remove the last 4 digits to create the directory name, so each directory contains up to 10,000 HTML files. Use CTRL-F to find what you're looking for.
- I don't scrape hidden boards (such as Investigations).
- I don't keep post titles.
- I save raw HTML, including quotes.
- If I run out of disk space, I might create compressed archives per 10,000 posts (see the sketch after this list).
- Although I plan to preserve all data, I make no guarantees. Feel free to archive posts.
- My current (sponsored) webhost has enough storage space for years to come.
- All scrape times use Amsterdam time (CET).
- Usually, I capture at least 99.95% of all posts. Server or internet connection problems can severely reduce this.
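If it ever comes to that, "compressed archives per 10,000 posts" would probably just mean one archive per directory, since each directory already holds up to 10,000 posts. A rough sketch (nothing is decided, and the paths are made up):
Code:
import tarfile

def archive_directory(directory: str) -> None:
    # Pack e.g. posts/5190/ (up to 10,000 HTML files) into posts/5190.tar.gz
    with tarfile.open(directory.rstrip("/") + ".tar.gz", "w:gz") as tar:
        tar.add(directory)

archive_directory("posts/5190")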
Examples