About bots hindering the genuine user experience online: The case of Wikimedia


A statement from the Wikimedia Foundation (the organization behind Wikipedia) warns about a hidden cost of generative artificial intelligence. Some of the most popular services based on large language models (LLMs) need a constant supply of huge amounts of data. This data comes not only from public and private datasets but also from the open internet, where it is collected by bots called crawlers. Crawlers (also called spider bots) are programs traditionally used by search engines to scan and index web content. But when these crawlers automatically sweep through sites such as Wikimedia Commons, which hosts 144 million images, videos, and other files under Creative Commons licenses, they consume a great deal of computing resources, and that costs Wikimedia money. Wikimedia projects rely on two things: free, openly accessible content and the volunteer work of their community. That combination makes them very attractive to this new wave of crawlers. On top of regular readers and search engines, these bots draw on an infrastructure that is free for users but costly to maintain.
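In rough terms, a crawler is just a loop: download a page, extract the links it contains, and queue those links to be downloaded in turn. The sketch below (plain Python, standard library only) is a toy illustration of that loop; the starting URL, page limit, and user-agent string are assumptions made for the example, and real crawlers add rate limiting, large-scale deduplication, and many other refinements.

```python
# Toy illustration of what a crawler ("spider bot") does: fetch a page,
# extract its links, and fetch those pages in turn. The starting URL and
# the page limit below are arbitrary assumptions made for this example.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import Request, urlopen


class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, max_pages=5):
    """Breadth-first crawl: visit pages, follow links, stop after max_pages."""
    queue = deque([start_url])
    seen = {start_url}
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            req = Request(url, headers={"User-Agent": "toy-crawler-example"})
            html = urlopen(req, timeout=10).read().decode("utf-8", errors="replace")
        except Exception as exc:
            print(f"skipped {url}: {exc}")
            continue
        fetched += 1
        parser = LinkExtractor()
        parser.feed(html)
        print(f"fetched {url} ({len(parser.links)} links found)")
        for link in parser.links:
            absolute = urljoin(url, link)
            # Stay on the same host and avoid revisiting pages.
            if urlparse(absolute).netloc == urlparse(start_url).netloc and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)


if __name__ == "__main__":
    # Hypothetical starting point; any public site would behave the same way.
    crawl("https://example.org/", max_pages=5)
```

Run against a site with millions of pages, and with many such loops in parallel, this simple pattern is what generates the load described below.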

As LLMs and AI chatbots have become more common, the volume of requests reaching Wikimedia has grown sharply. The Foundation reports a 50% increase in download traffic since January 2024. This increase does not come from people: it comes mostly from software feeding on Wikimedia's content to train AI models. Today, 65% of the most expensive traffic comes from bots, because a human might read a few pages while a piece of software can request thousands at once.
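Why do bots account for so much of the expensive traffic? One common reason on large sites is caching: the popular pages humans read are usually served from cheap caches, while the bulk, long-tail requests typical of crawlers are more likely to miss the cache and hit the core infrastructure. The toy calculation below illustrates that mechanism; the cache-hit rates and the relative cost of a cache miss are made-up assumptions chosen for the example, not Wikimedia's figures.

```python
# Toy illustration of how 35% of pageviews can account for roughly two-thirds
# of the most expensive traffic. All numbers are illustrative assumptions.

total_views = 1_000_000          # hypothetical pageviews in some period
bot_share = 0.35                 # bots: 35% of pageviews (as reported)
human_share = 1 - bot_share

# Assumption: humans mostly read popular, cached pages, while bots bulk-read
# long-tail pages that are less likely to be in the cache.
human_cache_hit = 0.95
bot_cache_hit = 0.55

# Assumption: serving a cache miss from the core infrastructure costs ~10x
# more than serving a cached copy.
cost_hit, cost_miss = 1, 10

def cost(views, hit_rate):
    """Total serving cost: cheap cached hits plus expensive misses."""
    return views * (hit_rate * cost_hit + (1 - hit_rate) * cost_miss)

human_cost = cost(total_views * human_share, human_cache_hit)
bot_cost = cost(total_views * bot_share, bot_cache_hit)

bot_cost_share = bot_cost / (human_cost + bot_cost)
print(f"Bots: {bot_share:.0%} of pageviews, {bot_cost_share:.0%} of serving cost")
```

With these assumed numbers, bots at 35% of pageviews end up responsible for about 65% of the serving cost, which matches the shape of the figures reported by the Foundation.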

To put it in human terms, a person might look up "crawler," then "scraping," and so on, in a slow and steady process, whereas a bot downloads massive amounts of data in a short time. Although only 35% of page views come from bots, they generate two-thirds of the heaviest traffic. This becomes a problem at peak times (such as when news breaks), because bot-generated traffic slows the experience down for human readers. Big tech companies see access to data as extremely valuable for training LLMs. These models need large volumes of high-quality content created and reviewed by people. Synthetic content generated by AI often contains mistakes, known as "hallucinations," so accurate human-created data is essential to keep the quality of the results high. For Wikimedia, it is not just a matter of cost; it is also a matter of people. The more AI companies draw on free content made by volunteers, the fewer people visit the site, which could erode the volunteer community over time. This tension is similar to the one between news sites and social media, or between content creators and AI companies. The problem here is not so much copyright as the fact that people often no longer visit the original site.

With almost half of all internet traffic now coming from bots, the question of the "dead internet" arises again: a version of the internet where bots interact more than real people do. In that future, AI assistants find and return information through apps or chat interfaces, and people no longer visit websites at all. That is what autonomous agents are built for: whether it is booking a trip or looking up a word, they do it for the user, and instead of the user, without anyone needing to open a search engine or even Wikipedia.

Prof. Miguel Antonio Barbero Álvarez, PhD, Universidad Politécnica de Madrid
