State of AI crawler blocking 2026: we checked 123 sites' robots.txt

AI crawler blocking rates by category, 2026 — LinkGuard study

Every site owner now has to answer a question that didn't exist three years ago: do you let AI crawlers read your content? Block GPTBot and you vanish from some of ChatGPT's answers. Allow it and your work trains a model you don't own. There's no neutral choice. Just a decision in your robots.txt.

Most owners think they know what their robots.txt says. Then they test it. So we checked what the web decided: we pulled the robots.txt of 123 well-known sites across seven categories and ran each one through LinkGuard's own robots.txt matcher for the five AI crawlers people argue about most. What jumped out wasn't an average — it was how far apart the categories sit.

The one chart that matters: it's a category divide

Whether a site blocks AI crawlers depends almost entirely on what kind of site it is. News publishers, whose product is the writing, block aggressively. Developer sites, which mostly want to be cited by AI, don't block at all; SaaS barely does.

Category                  Sites that block ≥1 AI crawler
News & media              90% (27 of 30)
Reference & community     60% (9 of 15)
Blogs & publishers        23% (3 of 13)
SaaS & marketing          15% (3 of 20)
E-commerce & retail       13% (2 of 15)
Gov / edu / nonprofit     7% (1 of 14)
Tech & dev                0% (0 of 16)

Across the whole sample, 37% of sites block at least one AI crawler, 9% block all five, and 63% block none. The average hides everything — a 90% news figure and a 0% developer figure average out to a number that describes nobody.
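
The pooled figure can be recomputed from the category table. This is a minimal sketch that uses only the counts shown above; the names `CATEGORY_COUNTS`, `pooled_rate`, and `category_rates` are illustrative and are not taken from the study's script.

```python
# (blocking sites, total sites) per category, copied from the table above.
CATEGORY_COUNTS = {
    "News & media": (27, 30),
    "Reference & community": (9, 15),
    "Blogs & publishers": (3, 13),
    "SaaS & marketing": (3, 20),
    "E-commerce & retail": (2, 15),
    "Gov / edu / nonprofit": (1, 14),
    "Tech & dev": (0, 16),
}


def pooled_rate(counts):
    """Percentage of all sites that block at least one AI crawler."""
    blocking = sum(b for b, _ in counts.values())
    total = sum(n for _, n in counts.values())
    return 100 * blocking / total


def category_rates(counts):
    """Blocking percentage for each category on its own."""
    return {name: 100 * b / n for name, (b, n) in counts.items()}


if __name__ == "__main__":
    print(f"pooled: {pooled_rate(CATEGORY_COUNTS):.0f}%")
    for name, rate in category_rates(CATEGORY_COUNTS).items():
        print(f"{name}: {rate:.0f}%")
```

The pooled rate works out to 45 of 123 sites, about 37%, a value that falls in the gap between the 23% and 60% categories and matches no single category.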

Which bot gets blocked most? Not the one you'd guess

Most people assume GPTBot (OpenAI) is the most-blocked crawler. In our sample it isn't.

Crawler            Owner           Blocked by
CCBot              Common Crawl    34%
ClaudeBot          Anthropic       28%
GPTBot             OpenAI          24%
PerplexityBot      Perplexity      24%
Google-Extended    Google (AI)     22%

CCBot is blocked most — likely because Common Crawl feeds a large share of the datasets models train on, so blocking it is the bluntest "don't train on me" lever. And ClaudeBot is blocked slightly more than GPTBot (28% vs 24%), which surprised us — the OpenAI bot gets the headlines, but Anthropic's crawler gets the Disallow a little more often in this set.

Why the split happens

It tracks incentives. If you sell access to your content (news, large reference sites), an AI answer that quotes you without a click is lost revenue — so you block, and some are negotiating paid licensing instead. If you sell software or services (SaaS, dev tools), being the answer an AI assistant gives a developer is free distribution — so you stay open. E-commerce sits in the middle and mostly doesn't bother. The 0% among developer sites isn't an oversight; it's a strategy.

What this means for your site

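  • It's a per-bot decision, not all-or-nothing. Only 9% of sites block everything. You can allow the crawlers tied to AI search and answers (PerplexityBot, for example) while blocking the training-focused ones (CCBot), or the reverse. One caveat: Google-Extended is a control token rather than a separate crawler. Per Google's documentation it governs whether your content is used for Gemini, and it does not affect inclusion in Google Search. Decide per crawler, per goal.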
  • Blocking isn't free. If you block GPTBot and Google-Extended, you risk dropping out of some ChatGPT and Gemini answers, a growing slice of discovery. For most SaaS and small brands trying to build authority, that's the wrong trade.
  • Check what your file actually says. A typo in a User-agent block, or a Disallow: / you forgot, can silently wall off the bots you meant to allow — and nothing tells you. A forgotten Disallow: / can leave you invisible to ChatGPT for months. Confirm the live behavior; don't assume.
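
A quick way to see the per-bot idea and the "forgotten Disallow: /" trap side by side. This sketch uses Python's standard-library `urllib.robotparser`, which is not LinkGuard's matcher and applies rules in file order rather than by longest match; for root-level rules like these the verdict should be the same. Both robots.txt files and the `verdicts` helper are hypothetical examples, not files from the study.

```python
from urllib.robotparser import RobotFileParser

AI_BOTS = ["GPTBot", "ClaudeBot", "CCBot", "PerplexityBot", "Google-Extended"]

# Hypothetical policy: block Common Crawl's bot, leave everything else open.
INTENDED = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

# Same file with a leftover blanket rule in the wildcard group.
FORGOTTEN = """\
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /
"""


def verdicts(robots_txt, bots=AI_BOTS, path="/"):
    """Return {bot: 'allowed' | 'blocked'} for the given path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        bot: "allowed" if parser.can_fetch(bot, path) else "blocked"
        for bot in bots
    }


if __name__ == "__main__":
    for name, text in (("intended", INTENDED), ("forgotten", FORGOTTEN)):
        print(name, verdicts(text))
```

In the first file only CCBot is blocked. In the second, every bot without its own group falls through to the wildcard group and is blocked too, and nothing in the file flags that as a mistake.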

Methodology

We fetched the live robots.txt of 123 well-known sites across seven categories (news, e-commerce, SaaS, tech/dev, reference/community, gov/edu, blogs) on 31 May 2026, and evaluated each for the five crawlers above using LinkGuard's own robots.txt matcher (the same engine behind our free robots.txt tester), which follows Google's longest-match rule. "Blocked" means the crawler's user-agent is disallowed from the site root (/). The sample is reproducible — the script lives in our repo.
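
To make the longest-match rule concrete, here is a minimal sketch of it as described in Google's robots.txt documentation and RFC 9309: among the rules that match a path, the one with the longest pattern wins, and a tie goes to Allow. This is not LinkGuard's matcher. The function names and the `(kind, pattern)` rule tuples are assumptions made for illustration, and the sketch covers a single user-agent group only.

```python
import re


def _pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules, url_path):
    """Apply longest-match to one group's rules.

    rules: list of ("allow" | "disallow", pattern) tuples.
    """
    best_len, best_allow = -1, True  # no matching rule means allowed
    for kind, pattern in rules:
        if not pattern:  # an empty pattern matches nothing
            continue
        if _pattern_to_regex(pattern).match(url_path):
            allow = kind == "allow"
            # Longer pattern wins; on equal length, prefer allow.
            if len(pattern) > best_len or (len(pattern) == best_len and allow):
                best_len, best_allow = len(pattern), allow
    return best_allow


if __name__ == "__main__":
    group = [("disallow", "/"), ("allow", "/blog/")]
    for path in ("/", "/pricing", "/blog/post"):
        print(path, "allowed" if is_allowed(group, path) else "blocked")
```

With the example group, the root is blocked while `/blog/post` is allowed, because `/blog/` is the longer matching pattern. By the root-only definition used in this study, that site would count as "blocked" even though part of it is open.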

Limitations (so you can weigh the numbers)

  • It's a curated sample of recognizable sites, not a random or traffic-ranked top-N — read it as "what well-known sites do", not "the whole web".
  • We check the site root; a site could allow / but block specific paths (or vice versa).
  • It's a snapshot. These files change often as licensing deals and policies shift — a re-run in a few months would move the numbers.

Check your own robots.txt

Before you trust what you think your file does, see what it tells each AI crawler. Our free robots.txt tester shows the verdict per bot in one click (no signup). If you want the full setup, we wrote a robots.txt + AI crawler checklist and a deeper guide on whether to block or allow AI bots.

Frequently asked questions

What percentage of websites block AI crawlers?

In our May 2026 study of 123 well-known sites, 37% blocked at least one major AI crawler in robots.txt — but it varies enormously by category: 90% of news/media sites (27 of 30) versus 0% of developer/tech sites (0 of 16).

Is GPTBot the most-blocked AI crawler?

No. In our sample CCBot (Common Crawl) was blocked most (34%), then ClaudeBot (28%), with GPTBot at 24%. Common Crawl is likely blocked most because it feeds many AI training datasets, so blocking it is the bluntest "don't train on me" lever.

Should I block AI crawlers in robots.txt?

It depends on your goal. Publishers protecting paid content often block; SaaS, dev tools, and brands building authority usually stay open, because blocking GPTBot or Google-Extended can mean dropping out of some ChatGPT and Gemini answers. It's a per-crawler decision, not all-or-nothing.

How do I check which AI crawlers my robots.txt blocks?

Use a robots.txt tester that evaluates each AI user-agent against your file. LinkGuard's free robots.txt tester shows the allow/block verdict for GPTBot, ClaudeBot, Google-Extended and PerplexityBot in one click — no signup.

About the Author

Andrei

SEO and digital marketing professional with 13+ years of experience. Started as a website administrator in 2011, transitioned to SEO, and achieved top-3 rankings for competitive keywords. Co-founded a consulting firm specializing in marketing audits for companies in Ukraine and internationally. Built LinkGuard to solve the problem he experienced firsthand: most SEO teams purchase links but never monitor their survival. Based in Kyiv, Ukraine.
