AI crawler blocking 2026: who blocks GPTBot

AI crawler blocking rates by category, 2026 — LinkGuard study

Every site owner now has to answer a question that didn't exist three years ago: do you let AI crawlers read your content? Block GPTBot and you vanish from some of ChatGPT's answers. Allow it and your work trains a model you don't own. There's no neutral choice. Just a decision in your robots.txt.

Most owners think they know what their robots.txt says. Then they test it. So we checked what the web decided: we pulled the robots.txt of 123 well-known sites across seven categories and ran each one through LinkGuard's own robots.txt matcher for the five AI crawlers people argue about most. What jumped out wasn't an average — it was how far apart the categories sit.

The one chart that matters: it's a category divide

Whether a site blocks AI crawlers depends almost entirely on what kind of site it is. News publishers, whose product is the writing, block aggressively. Developer sites, which mostly want to be cited by AI, don't block at all; SaaS barely does.

Category	Sites that block ≥1 AI crawler
News & media	90% (27 of 30)
Reference & community	60% (9 of 15)
Blogs & publishers	23% (3 of 13)
SaaS & marketing	15% (3 of 20)
E-commerce & retail	13% (2 of 15)
Gov / edu / nonprofit	7% (1 of 14)
Tech & dev	0% (0 of 16)

Across the whole sample, 37% of sites block at least one AI crawler, 9% block all five, and 63% block none. The average hides everything — a 90% news figure and a 0% developer figure average out to a number that describes nobody.

Which bot gets blocked most? Not the one you'd guess

Most people assume GPTBot (OpenAI) is the most-blocked crawler. In our sample it isn't.

Crawler	Owner	Blocked by
CCBot	Common Crawl	34%
ClaudeBot	Anthropic	28%
GPTBot	OpenAI	24%
PerplexityBot	Perplexity	24%
Google-Extended	Google (AI)	22%

CCBot is blocked most — likely because Common Crawl feeds a large share of the datasets models train on, so blocking it is the bluntest "don't train on me" lever. And ClaudeBot is blocked slightly more than GPTBot (28% vs 24%), which surprised us — the OpenAI bot gets the headlines, but Anthropic's crawler gets the Disallow a little more often in this set.

Why the split happens

It tracks incentives. If you sell access to your content (news, large reference sites), an AI answer that quotes you without a click is lost revenue — so you block, and some are negotiating paid licensing instead. If you sell software or services (SaaS, dev tools), being the answer an AI assistant gives a developer is free distribution — so you stay open. E-commerce sits in the middle and mostly doesn't bother. The 0% among developer sites isn't an oversight; it's a strategy.

What this means for your site

It's a per-bot decision, not all-or-nothing. Only 9% of sites block everything. You can allow the AI-search bots that send referral traffic (Google-Extended, PerplexityBot) while blocking the training-focused ones (CCBot) — or the reverse. Decide per crawler, per goal.
Blocking isn't free. If you block GPTBot and Google-Extended, you opt out of being cited in ChatGPT and Google's AI answers — a growing slice of discovery. For most SaaS and small brands trying to build authority, that's the wrong trade.
Check what your file actually says. A typo in a User-agent block, or a Disallow: / you forgot, can silently wall off the bots you meant to allow — and nothing tells you. A forgotten Disallow: / can leave you invisible to ChatGPT for months. Confirm the live behavior; don't assume.

Methodology

We fetched the live robots.txt of 123 well-known sites across seven categories (news, e-commerce, SaaS, tech/dev, reference/community, gov/edu, blogs) on 31 May 2026, and evaluated each for the five crawlers above using LinkGuard's own robots.txt matcher (the same engine behind our free robots.txt tester), which follows Google's longest-match rule. "Blocked" means the crawler's user-agent is disallowed from the site root (/). The sample is reproducible — the script lives in our repo.

Limitations (so you can weigh the numbers)

It's a curated sample of recognizable sites, not a random or traffic-ranked top-N — read it as "what well-known sites do", not "the whole web".
We check the site root; a site could allow / but block specific paths (or vice versa).
It's a snapshot. These files change often as licensing deals and policies shift — a re-run in a few months would move the numbers.

Check your own robots.txt

Before you trust what you think your file does, see what it tells each AI crawler. Our free robots.txt tester shows the verdict per bot in one click (no signup). If you want the full setup, we wrote a robots.txt + AI crawler checklist and a deeper guide on whether to block or allow AI bots.

Frequently asked questions

What percentage of websites block AI crawlers?

In our May 2026 study of 123 well-known sites, 37% blocked at least one major AI crawler in robots.txt — but it varies enormously by category: 90% of news/media sites (27 of 30) versus 0% of developer/tech sites (0 of 16).

Is GPTBot the most-blocked AI crawler?

No. In our sample CCBot (Common Crawl) was blocked most (34%), then ClaudeBot (28%), with GPTBot at 24%. Common Crawl is blocked most because it feeds many AI training datasets, so blocking it is the bluntest "don't train on me" lever.

Should I block AI crawlers in robots.txt?

It depends on your goal. Publishers protecting paid content often block; SaaS, dev tools, and brands building authority usually stay open, because blocking GPTBot or Google-Extended means opting out of being cited in ChatGPT and Google's AI answers. It's a per-crawler decision, not all-or-nothing.

How do I check which AI crawlers my robots.txt blocks?

Use a robots.txt tester that evaluates each AI user-agent against your file. LinkGuard's free robots.txt tester shows the allow/block verdict for GPTBot, ClaudeBot, Google-Extended and PerplexityBot in one click — no signup.

About the Author

Andrei

SEO and digital marketing professional with 13+ years of experience. Started as a website administrator in 2011, transitioned to SEO, and achieved top-3 rankings for competitive keywords. Co-founded a consulting firm specializing in marketing audits for companies in Ukraine and internationally. Built LinkGuard to solve the problem he experienced firsthand: most SEO teams purchase links but never monitor their survival. Based in Kyiv, Ukraine.

HTML anchor tags compared: a plain followed link next to ones carrying rel=nofollow, rel=ugc, and rel=sponsored

SEO Strategies

Dofollow vs nofollow links: what they are and when each matters

A nofollow link won't pass ranking credit like a regular link does, but calling it worthless is wrong. What dofollow, nofollow, sponsored, and ugc do, and how to check any link in seconds.

Andrei

May 29, 2026 • 11 min

SEO Strategies

GEO vs SEO: why your content strategy needs both in 2026

Your rankings stayed the same but traffic dropped 34%. The culprit? AI Overviews answering queries before users click. Learn what GEO (Generative Engine Optimization) is, how it differs from SEO, and which tactics actually work — backed by Princeton research data.

Andrei

Dec 10, 2025 • 8 min

SEO Strategies

What SEO tools are people actually using in 2026? We read 200+ Reddit threads to find out

We analyzed 200+ discussions from Reddit, LinkedIn, and Quora to find out what SEO tools practitioners actually use daily. From Google Search Console as the undisputed foundation to heated Ahrefs vs Semrush debates — here's what the community recommends (and what they don't).

Andrei

Dec 06, 2025 • 9 min

The one chart that matters: it's a category divide

Which bot gets blocked most? Not the one you'd guess

Why the split happens

What this means for your site

Methodology

Limitations (so you can weigh the numbers)

Check your own robots.txt

Frequently asked questions

What percentage of websites block AI crawlers?

Is GPTBot the most-blocked AI crawler?

Should I block AI crawlers in robots.txt?

How do I check which AI crawlers my robots.txt blocks?

About the Author

Andrei

Related Articles

Dofollow vs nofollow links: what they are and when each matters

GEO vs SEO: why your content strategy needs both in 2026

What SEO tools are people actually using in 2026? We read 200+ Reddit threads to find out