Web scraping has long been used to obtain information from websites or repositories, but Generative AI is now making this activity so frequent and intensive that it’s being compared to persistent DDoS attacks.
These bots have an insatiable appetite for data, which is used to build training sets or to answer prompts in real time. It's this constant quest for knowledge that sees them trawl far and wide and return to the same sites again and again, threatening the viability of the digital ecosystem.
The Wikimedia Foundation, which is responsible for Wikipedia and other sites, recently revealed it’s seen a 50 percent increase in requests, largely driven by these AI bots, and that this demand for data is resulting in higher infrastructure and financial costs.
The Foundation reported that LLM bots are consuming 65 percent of its most expensive data (content served from its main databases rather than from caching servers, and therefore more costly to retrieve) and described the threat posed by AI crawlers as unprecedented and a growing risk and cost.
Herding AI
There are ways to control web crawlers' access, such as by only allowing certain AI crawlers to traverse a site's resources. User agent strings in robots.txt (Robots Exclusion Protocol) files can be used to identify crawlers and to specify which areas of the site they are allowed to visit; the user agents currently most commonly named in robots.txt files are GPTBot, CCBot, Google-Extended, and anthropic-ai.
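As an illustration, a minimal robots.txt that shuts those four crawlers out entirely while leaving the site open to everyone else might look like the following (the rules shown are illustrative; each site sets its own):

    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: anthropic-ai
    Disallow: /

    User-agent: *
    Allow: /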
Companies need to be diligent in this regard. The repair website iFixit, for example, recently found itself subjected to over a million hits from Anthropic's ClaudeBot in a 24-hour period before it was able to rectify the issue by adding a crawl delay directive extension to its robots.txt. It was fortunate to be dealing with an AI that respects robots.txt, as not all LLM crawlers observe these directives.
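A crawl delay entry is a single extra line under the relevant user agent. A sketch along the lines of iFixit's fix (the ten-second figure here is illustrative, not iFixit's actual setting) would be:

    User-agent: ClaudeBot
    Crawl-delay: 10

The value is the minimum number of seconds a compliant crawler should wait between requests; as a non-standard extension, it only works against bots that choose to honour it.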
Some crawlers don't send identifying user agent headers, or deliberately use an identifier that mimics one of the more reputable AI platforms. Others systematically change their user agents or cycle through residential IP addresses. The result is a constant battle that monopolises human resources, with the Wikimedia Foundation stating that its engineers are having to enforce rate limiting or ban crawlers on a case-by-case basis.
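Rate limiting of this kind is conceptually simple. The sketch below is a minimal token-bucket limiter in Python, keyed by client IP; it is purely illustrative (not the Wikimedia Foundation's implementation, and in practice such limits are usually enforced at the load balancer or CDN rather than in application code):

    import time
    from collections import defaultdict

    RATE = 1.0    # tokens replenished per second (sustained requests/sec allowed)
    BURST = 10.0  # bucket capacity (size of burst a client may send)

    # One bucket per client IP, created full on first sight.
    buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

    def allow_request(client_ip: str) -> bool:
        """Return True if the request may proceed, False if it should be refused."""
        bucket = buckets[client_ip]
        now = time.monotonic()
        # Refill in proportion to the time elapsed since this client's last request.
        bucket["tokens"] = min(BURST, bucket["tokens"] + (now - bucket["last"]) * RATE)
        bucket["last"] = now
        if bucket["tokens"] >= 1.0:
            bucket["tokens"] -= 1.0
            return True
        return False

    # A crawler hammering the endpoint is cut off once its burst is spent.
    for i in range(15):
        print(i, allow_request("203.0.113.7"))

The difficulty the Foundation describes is not the mechanism itself but deciding who to limit, which is exactly what user agent spoofing and residential IP rotation are designed to frustrate.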
Such aggressive trawling can have devastating consequences, with companies unable to control what is taken or how it is used. It also has a real impact on infrastructure and operations, and one of the earliest casualties has been the open source community.
Open Sesame
Far more cost sensitive than commercial businesses, and more reliant on the contributions of their members, open source projects have seen their servers become overwhelmed, with some reporting that up to 97 percent of their traffic comes from crawlers.
In an attempt to combat this, they've had to block access, but doing so affects their visibility in search listings and can cut off the very lifeblood of a project: its user base. In fact, a sysadmin for the Fedora Project recently had to resort to banning the entire country of Brazil, albeit for a short period.
It's also a problem that looks set to worsen under agentic AI. The next step in the evolution of the technology, agentic AI will see systems become fully autonomous, able to carry out tasks, make decisions, and even complete purchases without the need for a human. It will see AI communicate with other systems on our behalf, producing lightning-fast AI-to-AI traffic and web scraping on an industrial scale.
In response to concerns over AI crawlers, a range of tools has sprung up to curb access. These range from defensive solutions (Kudurru), to offensive approaches that aim to wear the AI down by sending it on a wild goose chase (Nepenthes, AI Labyrinth), to disguising data (Glaze, Nightshade) and setting computational puzzles (Anubis).
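The computational puzzle approach is worth unpacking, since it inverts the economics of scraping: each visitor must spend a little CPU time before being served, which is negligible for one human page view but expensive at crawler scale. The Python sketch below illustrates the general hash-based proof-of-work idea; it is not Anubis's actual protocol, which runs the challenge in the visitor's browser:

    import hashlib
    import secrets

    DIFFICULTY = 16  # leading zero bits required; higher means costlier to solve

    def issue_challenge() -> str:
        """Server side: hand out a random, single-use challenge string."""
        return secrets.token_hex(16)

    def meets_target(challenge: str, counter: int) -> bool:
        """Check whether the hash of challenge:counter has enough leading zero bits."""
        digest = hashlib.sha256(f"{challenge}:{counter}".encode()).digest()
        return int.from_bytes(digest, "big") >> (256 - DIFFICULTY) == 0

    def solve(challenge: str) -> int:
        """Client side: brute-force a counter until the hash meets the target.
        Cheap for a single human page view, expensive across millions of requests."""
        counter = 0
        while not meets_target(challenge, counter):
            counter += 1
        return counter

    challenge = issue_challenge()
    answer = solve(challenge)               # costs the requester CPU time
    print(meets_target(challenge, answer))  # verification costs the server one hash

The asymmetry is the point: solving takes thousands of hash attempts on average, while verifying takes exactly one.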
This is very much a new area: some of these tools are still in beta, and reports have emerged of solutions causing friction for legitimate users.
It's not yet clear how this issue will be resolved. Will the AI platform providers help devise a solution given their dependency on this data for their training models and responses? Will we see class action lawsuits for copyright infringement forcing their hand? Will bot management move into this space to counter the threat? Or will the regulatory authorities seek to address the issue to protect the digital economy?
Only time will tell, but in the meantime the likes of the Wikimedia Foundation have vowed to drive down crawler resource consumption. The Foundation has proposed reducing scraper traffic across its networks by 20 percent in terms of requests and by 30 percent in terms of bandwidth over the next fiscal year, a sign that this is very much seen as a serious threat that must be dealt with.
Written by
Mohammad Ismail
VP of EMEA
Cequence Security