AI Chatbot Attacks May End the Internet as We Know It

Yves here. We had a serious problem a few weeks back when the site was the target of an AI bot site-scraping attack and, as you can see below, we are far from alone.

It was clear this invasion was highly sophisticated, had been costly to devise and was not at all like a DDoS attack. It slowed the site enough that several key writers and an admin were unable to post (one could, intermittently, so we were able to limp along for the first 24 hours with only a minor interruption to our regular schedule). Our tech maven Dave thought it was highly probable that this was AI site scraping. I will not tip our hand by describing the bot behavior, but if I were to describe it, I am pretty sure that most of you would agree.

Dave was able to take some brute force measures quickly, which worked but were a tad inconvenient to the authors and admins. With further jiggering and some input from Cloudflare, he was able to block the incursion without hampering normal operations. It sounds as if openDemocracy has had more difficulty in righting its ship than we did in the end. But these attacks are super painful when underway, particularly since it is not clear how and when they’ll be beaten back.

So again, if you wonder why yours truly is hostile to AI: I am not exaggerating when I say its backers are out to destroy this site. Dave added:

This reminds me of Aaron Swartz who got in way too much trouble for copying JSTOR data and publishing it. Somehow that was criminal behavior, but scraping sites so hard it hurts them is fine.

By Matthew Linares, Technical and Publishing Manager and Editor of digitaLiberties at openDemocracy. He leads on openDemocracy’s CopySwap media network project. Originally published at openDemocracy

In recent weeks, openDemocracy’s website has been repeatedly knocked offline, as if under enemy attack. The partial cause is a gargantuan barrage of automated bots from AI companies.

As people increasingly turn to chatbots and other large language models (LLMs) rather than searching or visiting sites directly, AI companies are using large-scale scrapers to gather the data to power their services.

Many of these scrapers are so sophisticated that it is hard, or impossible, to detect them in action. They often ignore the websites’ programmatic pleas not to be scraped, and are known to hit the more fragile parts of a website repeatedly.
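To make the "programmatic pleas" concrete: the usual mechanism is a robots.txt file, which a well-behaved crawler reads before fetching anything. The following is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the bot name `ExampleAIBot` are hypothetical, invented for illustration. Nothing in the protocol enforces the result, which is why a scraper can simply skip this check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the bot name and paths are illustrative only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /admin/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleAIBot", "SomeBrowser"):
        verdict = is_allowed(ROBOTS_TXT, agent, "https://example.org/article")
        print(agent, "may fetch /article:", verdict)
```

A polite crawler would call something like `is_allowed` before every request and back off when the answer is no.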

For media sites, AI is a pincer movement: chatbots decimate our traffic, while these mammoth crawling operations undermine our ability to serve the traffic we do get.

Grant Slater is a core developer at OpenStreetMap.org, an online map tool co-created by mappers all over the world, who contribute and maintain data about roads, paths, cafés, stations, and more. He told openDemocracy he has seen a “dramatic shift” in the rise of scrapers over the past two years – to the point where OpenStreetMap is now “at constant war” with them.

“We assume they are scraping data to fuel AI LLMs or start-ups set up to supply the AI companies with training data. There’s an unspoken arms race happening across the web: public-interest projects are being hit by industrial-scale AI scraping,” he said.

Web traffic dashboard by Cloudflare showing January 2026 bot activity.

Slater explained that “the traffic often arrives through anonymous residential IPs”, referring to residential proxy networks that route internet traffic through intermediary servers using IP addresses assigned by internet service providers to real homeowners. This, he said, makes it “hard to distinguish ‘normal users’ from automated collection”.

“We’re being forced into permanent defence mode. Residential proxy networks let AI scrapers hide in plain sight, rotate identities, and extract data at scale. That shifts real costs onto projects that exist to serve people, not feed training pipelines.

“Scrapers can rotate addresses endlessly to harvest as much data as possible for LLM training, while we have finite bandwidth, compute, and volunteer time. The result is a war of attrition we didn’t choose – where keeping the site usable becomes a constant battle.”

The resources required to keep these marauding invaders at bay are another concern in the growing list of reasons to worry about AI. Websites may soon be forced to require hard proof of humanity from all visitors, such as insisting on users being signed in.

“The actors extracting the most value from large-scale crawling are largely insulated from the costs it creates, which are absorbed by publishers. They must respond by restricting access in order to survive,” writes researcher Audrey Hingle in her paper, Getting bots to respect boundaries.

“Over time,” Hingle continues, “this risks accelerating enclosure: more gated content, more limited access, and a web that is harder to participate in.”

Mind Your Information Diet

The advance of AI is having an impact on how people stay informed, with people prompting chatbots for news stories and other general inquiries.

AI agents, in their various forms, are here to stay and will make some elements of news provision more personalised and insightful. But the big firms leading the charge are animated by commercial interests, with ChatGPT now launching ads in its products.

The toxicity of social media, where financial imperatives drive design, is a worrying template for what’s to come with AI. The hallucinations, deception, propaganda and sycophancy that chatbots are known for will also muddy the waters.

Products like the Cloudflare AI Labyrinth use “AI-generated content to slow down, confuse, and waste the resources of AI crawlers and other bots that don’t respect ‘no crawl’ directives.” https://blog.cloudflare.com/ai-labyrinth

As trust online becomes an increasingly fragile quantity, web users must be savvy to ensure they are getting accurate information from honest sources, scrutinising where news comes from and consolidating direct links with reputable organisations, such as with email newsletters – like openDemocracy’s, which are available here.

Just as important as minding how chatbots curate your info diet is what you allow to be taken from you. The AI age continues the era of surveillance capitalism: your data is still the product. Now they know more about you than ever, given the depth of your interactions with AI bots.

With this in mind, Tony Curzon Price, erstwhile editor-in-chief of openDemocracy, launched the First International Data Union to let people use LLMs without operators like OpenAI hoarding valuable data from those chats.

“Our data, used both for our personal benefit and for collective benefit, is the one chokepoint that ordinary citizens have over platform and BigAI power. I set up FIDU to be the trusted steward of members’ interests in cyberspace,” Curzon Price explained. “The rapid AI-fication of the web makes it all the more urgent that we create meaningful counterweights to planetary-scale profit-driven data extraction we’re seeing by the AI labs.”

Let’s hope that innovations like this give the public more control of AI in their lives.

Oversight and Pushback Are Essential

We can recognise the benefits that AI systems may bring and still feel that negative effects may outweigh the positive.

The competition to win the AI race is already producing multiple harmful outcomes. From the expansion of environmentally destructive data centres, to AI’s use for paedophilia or other sexual abuse, to twisted ‘therapy bots’ and the potential replacement of humans in society at large, the effects are startling and unravelling quickly.

Even industry leaders who do care about the outcomes are caught in this dynamic to produce superintelligence, which is also a bellicose national security contest as much as it is a commercial one. There is not even a clear endpoint to the struggle, though some have imagined where we may end up.

In the meantime, active citizens must drive political pressure to ensure the industry doesn’t overpollute, or create the next financial collapse, or poison the well of public information.

For now, stay independent and maintain connections with media you trust, away from corporate platforms of dubious integrity.

35 comments

  1. flora

    Thanks for this post. It’s a very good description of what is going on in the latest online attack-force play.

    1. flora

      adding: This link has good information for home Windows PC users. I’ve used it for years. Very old-school graphics and appearance. “ShieldsUP”.

      Making your home pc invisible, as much as possible, to scrapers and chatbots can make your pc run faster. Who knew? / ;)

      https://www.grc.com/shieldsup

  2. t

    Presumably, some real-life American people are dumb enough to think the SAVE Act is about requiring an ID to be eligible to vote. Not enough to create the tsunami of pro-SAVE activity on Twitter.

    There were some photos of Pete the Secretary of Wet-Brain He-Man Jesus-Lovers in a “war room” just agape over Twitter when he should know that 90% of the posts he was looking at were in the service of this administration and not in any way a peek at the hearts and minds of those he swore an oath to serve. Ouroboros of stupidity and evil.

  3. fjallstrom

    Had a similar experience recently with AI bots DDoSing a login page by repeatedly scraping it. Just the login page, and let me tell you the login page doesn’t change much between scrapings. So in that case I don’t think it was very well targeted, or even targeted at all. By looking at the identification markers the bots left we managed to block them without shutting out any users.

    Before we were done I came across ways to send scraping bots down an infinite well by feeding them generated letters; the idea was to both keep them busy and give them nonsense data as a bit of payback. Didn’t look into it more than that, so how easy it is to set up, how many resources it takes and how well it works are still unknown to me, but I figured I’d mention it in case it helps someone else.

  4. ChrisFromGA

    Thanks as well.

    There is another unrelated type of attack that is more along the lines of LLM poisoning.

    Described here: I hacked ChatGPT and Google AI – and it only took 20 minutes

    It’s roughly analogous to the “Google bombing” of the early 2000’s but more pernicious and harder to defend due to the way these scrapers work.

    Here is how it works:

    1. I sign up for a blogging service like substack, or create my own blog with my own infrastructure.
    2. I create a fake post claiming that I’m the best boot maker in the world.
    3. AI scrapers grab the bait, and now I’ve got a new career as a fraudster selling fakes of expensive boots.

    It would be a lot harder to poison LLMs to get them to say that the Patriots won the last Super Bowl, as that is a much more specific news event covered by thousands of sites. But the general concept is that these LLMs simply cannot be trusted.

    All the fake news sites that report CIA propaganda come to mind – Business Insider, Forbes, Newsweek. Maybe that’s been the point all along. Create your own reality, get it published on the web, and voila – whatever narrative you want to sell is out there.

    I can’t think of a solution to this vulnerability offhand. Whitelisting the stuff you scrape to only trusted news sites? But what is “trusted” anymore? WaPost … not. Bezos just canned half the staff and they’ll probably be replaced with AI.

    This LLM thing is a giant debacle.

  5. vao

    A variety of tools to send AI bots on errands when they attempt to scrape a site have been published (they are called “digital tarpits”; look for the term in your favourite search engine). They basically work by generating an endless number of interlinked pages filled with junk, keeping the AI bot busy downloading useless content. Nepenthes and iocaine are well-known ones, but there are others (e.g. Quixotic, Poison the WeLLMs).

    Disadvantage: this does not decrease bandwidth hogging by the scrapers, at least in the short term.
    Advantage: after a while, the AI firms owning the bot realize they are getting useless trash and completely stop scraping the site.
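As a rough illustration of the "endless interlinked junk pages" idea, here is a minimal Python sketch. It is not how Nepenthes, iocaine, or any other named tool is implemented. The function name `tarpit_page`, the word list, and the URL scheme are all assumptions made up for this example. The page content is derived from a hash of the requested path, so the maze needs no storage and the same URL always returns the same page.

```python
import hashlib
import random

# Filler vocabulary; a real tarpit would presumably use something less obvious.
WORDS = ["lorem", "ipsum", "dolor", "sit", "amet",
         "consectetur", "adipiscing", "elit"]


def tarpit_page(path: str, links: int = 5, words: int = 40) -> str:
    """Build a junk HTML page for `path` that links to further junk pages."""
    # Seed the generator from the path so each URL is stable across visits.
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))

    text = " ".join(rng.choice(WORDS) for _ in range(words))
    base = path.rstrip("/")
    # Every page links to `links` child pages, so the maze never ends.
    hrefs = [f"{base}/{rng.getrandbits(32):08x}" for _ in range(links)]
    anchors = "\n".join(f'<a href="{h}">{h}</a>' for h in hrefs)
    return f"<html><body><p>{text}</p>\n{anchors}\n</body></html>"


if __name__ == "__main__":
    print(tarpit_page("/maze"))
```

A web server would call `tarpit_page` with the request path for anything under the trap prefix. As noted above, serving these pages still costs bandwidth, so real deployments typically respond slowly on purpose.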

    Cloudflare AI Labyrinth certainly belongs to that class of tools (judging from the name…)

    Blocking AI scrapers is difficult, as they camouflage themselves as “normal” Internet users and keep changing their identity (as explained in the article). Cloudflare may well use behavioural analysis (like virus scanners do on PCs) to detect them.

    Ultimately, websites will probably make their content only available as PDF (formatted as bitmaps, so that files can be processed only after undergoing OCR), or go back to the old-fashioned mailing lists with personal registration requirement. The WWW is dying before our eyes.

  6. Thasiet

    This reminds me of Aaron Swartz who got in way too much trouble for copying JSTOR data and publishing it. Somehow that was criminal behavior, but scraping sites so hard it hurts them is fine.

    GamersNexus yesterday, on the new lawsuit against NVIDIA for mass AI scraping, “Piracy Is Only Illegal for You”:
    https://youtu.be/Sdry-clMeRs?si=SZkUau2LV30GfUqL&t=2190

    I was introduced to Swartz’s RSS long ago by a high school friend who interned for the Electronic Frontier Foundation in college and, I believe, knew Swartz personally. I thought it seemed gluttonous at the time: why would you ever want to consume a firehose of information like that?

    Today I get my Naked Capitalism served up via RSS on a defunct google reader app that somehow still works with a feedly login. It’s long since been delisted from the google play store, so I have to sideload my backup whenever I get a new phone.

    The other day I heard it argued that the JFK assassination precisely marks the moment when American society went off the rails. Well if any one person’s death can matter that much, then the same must hold true for the death of Aaron Swartz and the evolution of the internet.

    1. Amr

      Always remember that Meta, by torrenting the Anna’s Archive dataset (IIRC about 100TB of e-books) to train its chatbots, did something theoretically much worse than whatever “crime” Aaron Swartz ever committed in the eyes of those who charged him. I doubt we’ll ever see any serious consequences for Meta from this, though.

      It’s pretty disheartening to see just how closed and commercialized the Internet has become now. The most positive impacts on people’s lives due to the internet I’ve witnessed have been because of people that shared Aaron’s vision of what the internet could be instead of whatever it’s becoming now.

  7. French75

    I strongly suggest widespread use of a poison fountain for these crawlers: https://rnsaffn.com/poison3/

    The above degrades the code; but one could simply use an open LLM to generate reams of garbage content and link to it from the comment section alone. If successful we could find nakedcapitalism black-listed from such crawlers…

    > away from corporate platforms of dubious integrity.

    Are there any corporate platforms **not** of dubious integrity?

    1. Alena Shahadat

      This is a very clever idea. Does anybody know how to integrate the code? I suppose it goes in the HTML at the top of the document?

      1. vao

        From the page referred to by French75, it seems that the technique used is as follows:

        1) Embed URLs to the “poison well” within your normal HTML pages. Give them some unobtrusive name — or conversely an attractive one for bots.

        2) Those URLs must be hidden from normal end-users, so that they do not click on them and follow them to the poison well. This can be achieved via some CSS (you’ll have to ask your web designer or look for CSS code snippets on the Internet). The scraper bots, on the other hand, will in principle follow all links mentioned in the HTML page regardless of whether they are made visible on the screen to end-users or not.

        3) The poison well returns some apparently useful content but is actually junk — and will poison the training data of AI servers with incorrect / improper / useless information. Try the one indicated in the page given as reference: every access to it returns a different module of what appears to be legitimate code, but is quite odd when looking in detail.

        The real issue is how to implement a poison well that can automatically produce a wide variety of pernicious content for AI bots.
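Steps 1 and 2 above can be sketched in a few lines of Python. This is only one way to do it, under stated assumptions: the helper name `add_hidden_link`, the default link label, and the example URL are hypothetical, and the inline `display:none` style is just the simplest hiding technique (a stylesheet class would work as well). The `aria-hidden` and `tabindex="-1"` attributes are there so screen readers and keyboard users do not stumble onto the link.

```python
from html import escape


def add_hidden_link(page: str, poison_url: str, label: str = "archive") -> str:
    """Insert a link to `poison_url` that browsers will not display."""
    link = (
        f'<a href="{escape(poison_url, quote=True)}" style="display:none" '
        f'aria-hidden="true" tabindex="-1">{escape(label)}</a>'
    )
    marker = "</body>"
    if marker in page:
        # Place the link just before the last closing body tag.
        head, tail = page.rsplit(marker, 1)
        return f"{head}{link}{marker}{tail}"
    # No body tag found: append the link at the end instead.
    return page + link


if __name__ == "__main__":
    html_page = "<html><body><p>Hello</p></body></html>"
    print(add_hidden_link(html_page, "https://example.org/poison-well"))
```

Whether a given scraper follows links hidden this way is not guaranteed; a crawler that renders pages and honours CSS visibility would skip them.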

        1. Alena Shahadat

          Thank you so much !
          It seems this is the kind of action that works if it is implemented by as many people as possible.

  8. Spastica Rex

    Ouroboros

    The Interweb snake is consuming itself. Good riddance, when it comes. I’m saying this as a former 20 year “educational technology” “professional.” Guess why I left?

    What comes after may be worse, though.

  9. thoughtfulperson

    If it turns out we have to log in I’d be ok with that. I’m sure NC would do as much as possible to keep the logins private.

    Thanks as always to Yves and the team for keeping this effort to keep some reliable info out there!

    1. The Rev Kev

      Same here with many thanks to Yves and her team, especially Dave. If the situation gets worse, perhaps a simple log-in page may be the only solution. It riles me that Silicon Valley is wrecking the internet and poisoning the well, so to speak, so that they can personally make more profit for themselves.

      1. Alena Shahadat

        I still have a wallpaper from a couple years back. It’s a picture of people protesting and one person holds a big sign with a grumpy cat saying : “The NSA broke my Internet, I will have to make a GNU one”.

        People are working on an alternative. But they’re chronically underfunded.

        Also, years back we lost all the “heritage” websites full of useful information. I was reading up on aquarium plant gardening on small club websites from the Netherlands… They did not switch to https… I imagine we lost troves of small sites’ content back then.

    2. Ignacio

      Yep for me. If it is necessary to protect the site.
      But I guess it might limit visits from those who don’t like to bother logging in.

      1. Alena Shahadat

        ” it might limit visits from those who don’t like to bother logging in.”

        Yes, I would definitely bother!

  10. Tim

    Our AWS compute spend is going up significantly starting December because of scrapers on public facing websites. At scale this requires more datacenters.
    And where does this traffic originate? Wouldn’t that be a datacenter? Likely even AWS.
    And the data goes toward a model in a datacenter, maybe AWS too, to let an LLM, hopefully without hallucinating, reproduce what was already on our website.
    LLMs have resulted in a paperclip-like machine, only with data centers instead.

    1. Amr

      And where does this traffic originate? Wouldn’t that be a datacenter? Likely even AWS.

      I don’t think it is possible to tell the origin unless you’re responsible for the infrastructure at one of these companies. AI scrapers go to great lengths to obscure the fact that they’re bots by e.g. simulating web browsers and routing traffic through the residential home connections of unsuspecting users.

      There are companies like Bright Data that specialize in the proxying through residential internet connections part – they distribute apps of some marginal use (e.g. unblocking region-locked Netflix or YouTube) with the fine print in the user agreement being that the company can commandeer the user’s internet connection at any time for any purpose.

      Bright Data is also Israeli (why are the most evil tech companies always Israeli?) and also provides services to scrape social media. This would almost certainly require making thousands of legitimate-looking social media accounts since scraping is against the ToS of most services. You have to wonder what else those accounts are being used for in light of Israel’s declining reputation due to the horrors they have inflicted on Gaza.

  11. Carolinian

    As a map junkie I love Open Street Map and didn’t realize they are having these kinds of problems.

  12. NotThePilot

    Other people have mentioned things from the defensive “tar pit” perspective, but from the scraper’s standpoint, there’s really no excuse for this. It just shows arrogance and gluttony.

    I’ve done web-scraping projects, granted not trying to feed the entire internet to an AI, but standard tools I’ve used before (curl for example) give you an easy-to-use “politeness” option that paces out your requests. There’s even a dynamic variant that slightly randomizes the pace (partly to get around simple bot blocks), but if I’m not mistaken, you can also set it to adaptively stay below a bandwidth ceiling. Throw in the (seemingly obvious to me) parallelism of polling sessions across different providers within a time slice and there’s no reason to DoS-attack anyone’s site.

    Presumably these organizations have the resources and people to do The Right Thing (TM). The fact that they don’t just speaks to their entitlement and/or disdain of actual technical excellence.
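The pacing idea described above fits in a few lines. Below is a minimal Python sketch, not any particular tool's implementation: `polite_delay` and `crawl_politely` are hypothetical names, and the `fetch` and `sleep` parameters are injected so the loop can be exercised without touching the network. The jitter range of 0.5x to 1.5x of the base wait mirrors what wget's `--wait` combined with `--random-wait` does; a bandwidth ceiling (wget's `--limit-rate`) is left out here.

```python
import random
import time
from typing import Callable, Iterable, List, Optional


def polite_delay(base_seconds: float, rng: random.Random) -> float:
    """Return a randomized wait between 0.5x and 1.5x of base_seconds."""
    return base_seconds * rng.uniform(0.5, 1.5)


def crawl_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    base_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Fetch each URL in turn, pausing a jittered interval between requests."""
    rng = rng or random.Random()
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            sleep(polite_delay(base_seconds, rng))
        results.append(fetch(url))
    return results


if __name__ == "__main__":
    # Stand-in fetch function so the demo does not hit any real site.
    pages = crawl_politely(["/a", "/b", "/c"], fetch=lambda u: f"<page {u}>",
                           base_seconds=0.1)
    print(pages)
```

With a base wait of even a few seconds per host, a crawler spread across many sites still covers a great deal of ground without saturating any single one.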

    1. hazelbee

      I read it the same way too.

      Badly behaved web scraping written by people that don’t care or don’t know.

      This will get sorted eventually.

      You can even make the contrarian view that this is evidence of healthy competition.

      At the moment web search is dominated by, if we’re generous, a handful of companies; realistically, just Google.

      Evidence of other companies building up indexes to be able to compete for our attention? That’s evidence of competition again in the attention economy.

  13. SZ

    And I was wondering why Cloudflare kept blocking me from accessing NC. No “prove you’re a human”, just a straight block.

    1. Yves Smith Post author

      I am sorry. I have been getting that notice with Anti-Spiegel. So had Micael T. That block message tells you how to send your IP address to Cloudflare and get unblocked. Micael T found that worked. I have not had the time to try yet.

    2. flore

      adding to my above comment:
      I also wonder if CloudFlare is trying to drive people to use its new dns resolver by putting up delays and roadblocks to people not using its new dns resolver. I’m afraid such behavior by a tech company would not surprise me. Not saying this is what CloudFlare is doing.