Messing around with all this has also made me realize that the amount of robot traffic happening on websites like this one must be through the roof. I haven't looked at any stats, but it's gotta be going bananas.
It's not all stored either, some of it is being done in real time, and I can only imagine that the amount of electricity and bandwidth being taken up just by robots crawling websites nowadays is crazy. At some point, it's going to be more than us.
I can confirm this. On one site I manage (a public Gitea instance from the old days) I see 99.999% of the requests coming from the usual AI suspects (Amazon, Alibaba, Microsoft, Google, etc.).
I noticed it because the disk was filling up with several dozen GB of webserver logs (for something that gets maybe 1-2 human visits a day). The motherfuckers don't even throttle their crawlers: they fire off requests by the dozens per second, every single one of them behaves like it's gone berserk, and of course none of them respect robots.txt at all.
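(For anyone who wants to check their own logs: a rough sketch, assuming a standard combined-format access log at /var/log/nginx/access.log; the path and the cutoff of 20 are just placeholders, adjust for your server.)

    # Top user agents by request count. In combined log format the
    # User-Agent string is field 6 when splitting each line on double quotes.
    awk -F'"' '{print $6}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20

    # Same idea per client IP, to spot which cloud ranges are hammering you.
    awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20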
Given how often you used to see normal search engines in your logs compared to how often you see AI crawlers now, I can only guess it's a death race over who downloads the whole internet first to train their shitty eliza pro+ LLMs with.
And since their hunger for new data is so immense, they have no time to be kind to webservers. Fuck webpages anyway; in their vision, people won't visit webpages anymore, they'll just ask some shitshippiddi for everything.
It took some time to collect all the evil cloud networks they come from, but once I had the list I traffic-shaped them down to 10kb/s, which translates to about one request per minute in total. Even if this only slightly increases their cost of aggro-crawling the web, it's a start.
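If anyone wants to do the same, here's a rough sketch of one way to shape like that on Linux with nftables + tc. The names are made up: it assumes the collected CIDR ranges sit in crawler-nets.txt (IPv4 only for brevity), responses leave via eth0, and the "crawlers" set and mark 0x2 are arbitrary; adjust everything to your own setup.

    # Put the crawler ranges into an nftables set and mark packets headed to them.
    nft add table inet shaping
    nft add set inet shaping crawlers '{ type ipv4_addr; flags interval; }'
    while read -r net; do nft add element inet shaping crawlers "{ $net }"; done < crawler-nets.txt
    nft add chain inet shaping post '{ type filter hook postrouting priority mangle; }'
    nft add rule inet shaping post ip daddr @crawlers meta mark set 0x2

    # tc: normal traffic goes to an unrestricted default class, marked traffic
    # is squeezed into one shared ~10kbit class for all crawler networks combined.
    tc qdisc add dev eth0 root handle 1: htb default 10
    tc class add dev eth0 parent 1: classid 1:10 htb rate 1gbit
    tc class add dev eth0 parent 1: classid 1:2 htb rate 10kbit ceil 10kbit
    tc filter add dev eth0 parent 1: protocol ip handle 2 fw flowid 1:2

This only shapes the outgoing responses, but that's usually enough: if the page bodies trickle out that slowly, the crawlers can't fire off their next requests any faster, which is presumably where the "about one request per minute" comes from.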