About bots hindering the genuine user experience online: The case of Wikimedia


A statement from the Wikimedia Foundation (the organization behind Wikipedia) warns about a hidden cost of generative artificial intelligence. Some of the most popular services based on large language models (LLMs) need a constant supply of huge amounts of data. This data comes not only from public and private datasets but also from the open internet, where it is collected by bots called crawlers. Crawlers (also called spider bots) are programs traditionally used by search engines to scan and index web content. But when these crawlers automatically sweep through sites such as Wikimedia Commons, which hosts 144 million images, videos, and other files under Creative Commons licenses, they consume a great deal of computing resources, and that costs Wikimedia money. Wikimedia projects rely on two things: free, openly accessible content and the volunteer work of their community. That combination makes them very attractive to this new wave of crawlers. On top of regular readers and search engines, these bots draw on an infrastructure that is free for users but costly to maintain.
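In rough terms, a crawler is just a loop: download a page, extract the links it contains, and queue those links to be downloaded in turn. The sketch below (plain Python, standard library only) is a toy illustration of that loop; the starting URL, page limit, and user-agent string are assumptions made for the example, and real crawlers add rate limiting, large-scale deduplication, and many other refinements.

```python
# Toy illustration of what a crawler ("spider bot") does: fetch a page,
# extract its links, and fetch those pages in turn. The starting URL and
# the page limit below are arbitrary assumptions made for this example.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import Request, urlopen


class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, max_pages=5):
    """Breadth-first crawl: visit pages, follow links, stop after max_pages."""
    queue = deque([start_url])
    seen = {start_url}
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            req = Request(url, headers={"User-Agent": "toy-crawler-example"})
            html = urlopen(req, timeout=10).read().decode("utf-8", errors="replace")
        except Exception as exc:
            print(f"skipped {url}: {exc}")
            continue
        fetched += 1
        parser = LinkExtractor()
        parser.feed(html)
        print(f"fetched {url} ({len(parser.links)} links found)")
        for link in parser.links:
            absolute = urljoin(url, link)
            # Stay on the same host and avoid revisiting pages.
            if urlparse(absolute).netloc == urlparse(start_url).netloc and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)


if __name__ == "__main__":
    # Hypothetical starting point; any public site would behave the same way.
    crawl("https://example.org/", max_pages=5)
```

Run against a site with millions of pages, and with many such loops in parallel, this simple pattern is what generates the load described below.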

As LLMs and AI chatbots have become more common, the volume of requests reaching Wikimedia has grown sharply. The Foundation reports a 50% increase in download traffic since January 2024. This increase does not come from people: it comes mostly from software feeding on Wikimedia's content to train AI models. Today, 65% of the most expensive traffic comes from bots, because a human might read a few pages while a piece of software can request thousands at once.
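Why do bots account for so much of the expensive traffic? One common reason on large sites is caching: the popular pages humans read are usually served from cheap caches, while the bulk, long-tail requests typical of crawlers are more likely to miss the cache and hit the core infrastructure. The toy calculation below illustrates that mechanism; the cache-hit rates and the relative cost of a cache miss are made-up assumptions chosen for the example, not Wikimedia's figures.

```python
# Toy illustration of how 35% of pageviews can account for roughly two-thirds
# of the most expensive traffic. All numbers are illustrative assumptions.

total_views = 1_000_000          # hypothetical pageviews in some period
bot_share = 0.35                 # bots: 35% of pageviews (as reported)
human_share = 1 - bot_share

# Assumption: humans mostly read popular, cached pages, while bots bulk-read
# long-tail pages that are less likely to be in the cache.
human_cache_hit = 0.95
bot_cache_hit = 0.55

# Assumption: serving a cache miss from the core infrastructure costs ~10x
# more than serving a cached copy.
cost_hit, cost_miss = 1, 10

def cost(views, hit_rate):
    """Total serving cost: cheap cached hits plus expensive misses."""
    return views * (hit_rate * cost_hit + (1 - hit_rate) * cost_miss)

human_cost = cost(total_views * human_share, human_cache_hit)
bot_cost = cost(total_views * bot_share, bot_cache_hit)

bot_cost_share = bot_cost / (human_cost + bot_cost)
print(f"Bots: {bot_share:.0%} of pageviews, {bot_cost_share:.0%} of serving cost")
```

With these assumed numbers, bots at 35% of pageviews end up responsible for about 65% of the serving cost, which matches the shape of the figures reported by the Foundation.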

To put it in human terms, a person might look up "crawler," then "scraping," and so on, in a slow and steady process, whereas a bot downloads massive amounts of data in a short time. Although only 35% of page views come from bots, they generate two-thirds of the heaviest traffic. This becomes a problem at peak times (such as when news breaks), because bot-generated traffic slows the experience down for human readers. Big tech companies see access to data as extremely valuable for training LLMs. These models need large volumes of high-quality content created and reviewed by people. Synthetic content generated by AI often contains mistakes, known as "hallucinations," so accurate human-created data is essential to keep the quality of the results high. For Wikimedia, it is not just a matter of cost; it is also a matter of people. The more AI companies draw on free content made by volunteers, the fewer people visit the site, which could erode the volunteer community over time. This tension is similar to the one between news sites and social media, or between content creators and AI companies. The problem here is not so much copyright as the fact that people often no longer visit the original site.

With almost half of all internet traffic now coming from bots, the question of the "dead internet" arises again: a version of the internet where bots interact more than real people do. In that future, AI assistants find and return information through apps or chat interfaces, and people no longer visit websites at all. That is what autonomous agents are built for: whether it is booking a trip or looking up a word, they do it for the user, and instead of the user, without anyone needing to open a search engine or even Wikipedia.

Prof. Miguel Antonio Barbero Álvarez, PhD, Universidad Politécnica de Madrid
