Should you block GPTBot, ClaudeBot and Google-Extended?

A robots.txt gate allowing AI search and retrieval bots through while blocking the model-training crawlers, illustrating the training-versus-answers tradeoff

There's a robots.txt snippet making the rounds, the one that promises to "block all AI from stealing your content." It gets pasted into production by people who are, fairly, annoyed that their work is training somebody's model for free. And a fair number of them are quietly doing the opposite of what they intended: they're not just opting out of training, they're deleting themselves from the AI answers that send real traffic and build the brand mentions that get you cited.

The problem is that "AI crawler" isn't one thing. A bot that scrapes you to train a model is a different program from one that fetches you to answer a user's question right now: different name, different reason. You can allow one and block the other. We run our own page fetchers at LinkGuard, so I've spent more time than I'd like staring at user-agent strings and robots.txt edge cases. Here's what each major bot actually does, the honest case for blocking or allowing, and copy-paste directives that are correct as of June 2026 — with the loud warning that this list rots fast, so verify against the vendors' own docs before you ship.

What are AI crawlers, and which ones actually matter?

AI crawlers are automated bots that fetch your pages on behalf of an AI company. The thing that matters — the distinction almost every "block AI" post blurs — is why a given bot is visiting. There are three jobs, and most providers run a separate, separately-named bot for each:

  • Training — scraping content to train or improve a model. This is the one people most want to opt out of. Examples: GPTBot, ClaudeBot, CCBot.
  • Live retrieval — fetching a page in real time because a user asked the assistant something and it's going to read your page to answer. Examples: ChatGPT-User, Claude-User, Perplexity-User. Blocking these can remove you from answers users are actively asking for.
  • Search indexing — building the AI product's own search index, which is what it cites from. Examples: OAI-SearchBot, Claude-SearchBot, PerplexityBot. This is your path into AI citations.

Hold onto that split, because it's the whole game. "Should I block AI?" is the wrong question. "Do I want out of training while staying in answers and search?" is the right one — and the answer to that is usually yes.
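If you want to see that split in your own access logs, here's a minimal sketch in Python. The mapping only restates the examples from the list above. The classify helper, the substring matching, and the sample user-agent strings are my own illustrative assumptions (the strings are made up, not copied from any vendor), so check real header formats against the vendors' docs before relying on it.

```python
from collections import Counter

# Bot token -> job, taken from the three-way split described above.
BOT_JOBS = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "CCBot": "training",
    "ChatGPT-User": "retrieval",
    "Claude-User": "retrieval",
    "Perplexity-User": "retrieval",
    "OAI-SearchBot": "search",
    "Claude-SearchBot": "search",
    "PerplexityBot": "search",
}


def classify(user_agent):
    """Return 'training', 'retrieval', 'search', or None for a User-Agent string."""
    ua = user_agent.lower()
    for token, job in BOT_JOBS.items():
        # Case-insensitive substring match on the bot token (an assumption).
        if token.lower() in ua:
            return job
    return None


# Hypothetical log entries, for illustration only.
sample_log = [
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)",
    "Mozilla/5.0 (compatible; Claude-User/1.0)",
    "Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0",
]

counts = Counter(classify(ua) or "other" for ua in sample_log)
for job, n in counts.items():
    print(f"{job}: {n}")
```

A count like this tells you which of the three jobs is actually hitting your site before you decide what to block.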

What does each bot do? (verified June 2026)

OpenAI runs three (their bot docs). GPTBot is the training crawler and it obeys robots.txt. OAI-SearchBot builds the index behind ChatGPT search and also obeys robots.txt. The sharp edge is ChatGPT-User, the live fetch fired when someone asks ChatGPT to look at something. OpenAI says outright that because it's user-initiated, robots.txt "may not apply." So you can cleanly block training (GPTBot) while staying searchable (OAI-SearchBot), but you can't reliably stop the live user-triggered fetch with robots.txt alone.

Anthropic — the people behind Claude — also runs three: ClaudeBot for training, Claude-User for live user fetches, and Claude-SearchBot for indexing (their crawler docs). Disclosure: my own tool's alert summaries run on Anthropic's model, so I'm not a neutral party there. Here's a difference worth knowing: Anthropic says all three honor robots.txt — including the user-triggered one, where OpenAI and Perplexity carve out an exception. If you see the older tokens anthropic-ai or Claude-Web in a snippet you copied, they no longer appear in Anthropic's docs and are widely reported as deprecated. Leaving them in does no harm; just don't rely on them.
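To spot leftover tokens like those in a snippet you pasted months ago, a rough Python sketch could look like this. The DEPRECATED set holds only the two names mentioned above, the helper names are hypothetical, and the parsing is deliberately simple (it reads User-agent lines and ignores everything else), so treat it as a starting point rather than a full robots.txt parser.

```python
# Tokens the paragraph above describes as no longer documented by Anthropic.
DEPRECATED = {"anthropic-ai", "claude-web"}


def user_agent_tokens(robots_txt):
    """Collect the values of every User-agent line, in file order."""
    tokens = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "user-agent":
            tokens.append(value.strip())
    return tokens


def stale_tokens(robots_txt):
    """Return the User-agent tokens that appear in the DEPRECATED set."""
    return [t for t in user_agent_tokens(robots_txt) if t.lower() in DEPRECATED]


# Example snippet mixing old and current tokens (illustrative only).
SNIPPET = """\
# copied from an old blog post
User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""

for token in stale_tokens(SNIPPET):
    print(f"stale token: {token}")
```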

Google-Extended is the one people most often get wrong, because it isn't a crawler at all. It's a robots.txt token with no separate bot behind it — a control switch (Google's docs). Disallowing it opts your content out of training and grounding for Gemini. And, in Google's own words, it "does not impact a site's inclusion in Google Search nor is it used as a ranking signal." So this is the safest opt-out on the list: you can refuse Gemini training and lose nothing in regular Google Search. If you're nervous about blocking and want a free win, this is it.

Perplexity is where it gets messier. It runs PerplexityBot (its search index, respects robots.txt) and Perplexity-User (live user fetches), and its own docs say Perplexity-User "generally ignores robots.txt" because it's user-initiated. On top of that, Cloudflare alleged in August 2025 that Perplexity used undeclared, disguised crawlers to reach content that had explicitly disallowed crawling. Cloudflare delisted Perplexity from its verified-bots list over it. Perplexity disputes the framing, calling it user-initiated agentic fetching. As of mid-2026 that disagreement isn't resolved — so I'll report it as Cloudflare's allegation and Perplexity's denial, not as settled fact. The practical takeaway: robots.txt is a weaker lever against Perplexity than against OpenAI or Anthropic.

A few others you'll see in logs: CCBot is Common Crawl, the open dataset many models train on, and it respects robots.txt; blocking it cuts off a lot of downstream training with a single rule. Applebot-Extended is, like Google-Extended, a control token rather than a crawler: it opts you out of Apple Intelligence training without removing you from Siri, Spotlight, or Safari. Meta-ExternalAgent is Meta's AI crawler. And Bytespider (ByteDance) is the bad citizen of the bunch, widely reported by third parties to ignore robots.txt and crawl aggressively, though ByteDance hasn't said so itself. Which is the right note to end the roll call on: robots.txt only works on bots that choose to honor it.

Should you block them? The honest tradeoff

There's no universally right answer, and anyone who gives you one is selling something. It comes down to what you're actually optimizing for, and the two goals genuinely conflict.

If your content is the product (you're a publisher, a course seller, a docs-as-moat company, anyone whose words are the thing people pay for), blocking the training crawlers is a defensible call. You're not getting paid when a model ingests your archive and resells the gist of it. Block GPTBot, ClaudeBot, CCBot, and Google-Extended, and you've opted out of most large-scale training with four short rules and lost very little.

But if you want to show up in AI answers — if you read how to get cited by ChatGPT, Perplexity and AI Overviews and decided that's where your audience is heading — then blocking the retrieval and search bots is self-sabotage. You can't be cited by an engine you've locked out. This is the part the "block all AI" crowd misses: there's real referral traffic and brand visibility in being the source an assistant names, and the unlinked mention that comes with it is starting to count for something. Wall it all off and you've protected content nobody will now find through the channel that's growing fastest.

For most businesses that aren't pure content plays, the sane middle is: opt out of training, stay in retrieval and search. You stop feeding the models for free while remaining reachable in the answers your customers are asking for. That's the configuration I'd reach for by default, and it's the first block below.

How do you block AI crawlers in robots.txt?

Put these in the robots.txt at your site root. To opt out of model training while staying citable in AI search and answers:

# Opt out of AI model training, stay in AI search/answers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

Notice what's not in there: OAI-SearchBot, Claude-SearchBot, and PerplexityBot are left allowed, so you stay indexable and citable. That's the point.
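One way to confirm the block above does what it says is to run it through a parser. This is a minimal sketch using Python's standard-library urllib.robotparser. The example.com URL and the allowed_bots helper are placeholders of mine, and the result only shows how this one parser reads the rules, not how each vendor's bot will behave. The sketch passes bare bot tokens rather than full User-Agent headers, since that is the form this parser matches most reliably.

```python
from urllib.robotparser import RobotFileParser

# The "opt out of training, stay in search" block from above.
ROBOTS_TXT = """\
# Opt out of AI model training, stay in AI search/answers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /
"""

TRAINING = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended", "Applebot-Extended"]
SEARCH = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"]


def allowed_bots(robots_txt, bots, url="https://example.com/some-page"):
    """Map each bot token to True (allowed) or False (blocked) for the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse from text, no network fetch
    return {bot: parser.can_fetch(bot, url) for bot in bots}


result = allowed_bots(ROBOTS_TXT, TRAINING + SEARCH)
for bot, ok in result.items():
    print(f"{bot:20} {'allowed' if ok else 'blocked'}")
```

If a search bot shows up as blocked here, something in your file is broader than you meant it to be.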

If you want to block as much AI access as robots.txt can express — training, search, and the retrieval bots that do honor it — add the rest:

# Block as much AI as robots.txt can reach
User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: Claude-User
User-agent: Claude-SearchBot
User-agent: PerplexityBot
User-agent: CCBot
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: Meta-ExternalAgent
Disallow: /

Two honest caveats on that second block. It won't reliably stop the user-initiated fetchers — ChatGPT-User by OpenAI's own statement, and Perplexity-User by Perplexity's — and it does nothing about bots that ignore robots.txt entirely (the reported Bytespider behavior). robots.txt is a polite request, not a firewall. If you need real enforcement against bad actors, that's a job for your CDN or WAF, not this file. (Stacking several User-agent lines before one Disallow is valid, but if you'd rather, give each bot its own block with its own Disallow: / — it's equivalent and a touch friendlier to older parsers.) After you edit it, it's worth confirming the syntax actually parses the way you think — our robots.txt tester checks a given URL and user-agent against your rules so you don't find out from your traffic that a typo blocked the wrong thing.
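To sanity-check that the stacked form and the one-block-per-bot form really are equivalent, here's a sketch that builds both and compares the verdicts, again with Python's standard-library urllib.robotparser. The helper names and the example.com URL are hypothetical, and agreement in this parser is evidence, not proof, that every crawler's own parser treats the two forms the same way.

```python
from urllib.robotparser import RobotFileParser

# The bots named in the "block as much as robots.txt can reach" snippet.
BOTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-User", "Claude-SearchBot",
    "PerplexityBot", "CCBot", "Google-Extended",
    "Applebot-Extended", "Meta-ExternalAgent",
]


def grouped_block(bots):
    """Several User-agent lines stacked before a single Disallow."""
    return "\n".join([f"User-agent: {b}" for b in bots] + ["Disallow: /"]) + "\n"


def per_bot_blocks(bots):
    """One User-agent / Disallow pair per bot, separated by blank lines."""
    return "\n".join(f"User-agent: {b}\nDisallow: /\n" for b in bots)


def verdicts(robots_txt, agents, url="https://example.com/some-page"):
    """Map each agent token to True (allowed) or False (blocked)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# Include an agent that is not in the list, to confirm it stays allowed.
agents = BOTS + ["Googlebot"]
grouped = verdicts(grouped_block(BOTS), agents)
separate = verdicts(per_bot_blocks(BOTS), agents)

print("forms agree:", grouped == separate)
print("Googlebot allowed:", grouped["Googlebot"])
```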

Does blocking GPTBot hurt your SEO?

Blocking GPTBot has no effect on your Google Search rankings — GPTBot is OpenAI's training crawler and has nothing to do with Googlebot or how Google ranks you. Same goes for Google-Extended: Google states explicitly that disallowing it doesn't change your inclusion in Search or act as a ranking signal. So there's no classic-SEO penalty for opting out of AI training.

The cost, if there is one, is on the AI-visibility side, and only if you over-block. Wall off the search and retrieval bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot, and the user fetchers) and you remove yourself from AI answers and citations. So the rule of thumb: block training freely, it's nearly costless; block search and retrieval only if you genuinely don't want to be in AI answers. Most people want the first, not the second.

Questions people ask

Does blocking GPTBot remove me from ChatGPT entirely?

No. GPTBot is OpenAI's training crawler. Blocking it stops your content being used to train future models, but ChatGPT can still find and cite you through OAI-SearchBot (its search index) and fetch you live through ChatGPT-User when a user asks it to — and that user-initiated fetch may not honor robots.txt anyway. If your goal is to stay citable in ChatGPT while opting out of training, block GPTBot and leave OAI-SearchBot allowed.

What is Google-Extended, and will blocking it hurt my Google ranking?

Google-Extended is a robots.txt control token — not a crawler — that lets you opt your content out of training and grounding for Google's Gemini models. Google states it does not affect your inclusion in Google Search and is not a ranking signal. So blocking it is the safest AI opt-out available: you refuse Gemini training and lose nothing in regular Search.

Is it even worth blocking AI crawlers if some ignore robots.txt?

For the well-behaved majority, yes. GPTBot, ClaudeBot, CCBot, OAI-SearchBot, Claude-SearchBot, PerplexityBot and the -Extended tokens all honor robots.txt, so a few lines genuinely opt you out of most large-scale training. The honest limit is that user-initiated fetchers (ChatGPT-User, Perplexity-User) may ignore it, and outright bad actors like the reported Bytespider don't respect it at all. For those, you need CDN- or WAF-level blocking, not robots.txt. The file handles the polite bots, which is most of the traffic.

What's the difference between ClaudeBot and Claude-User?

ClaudeBot is Anthropic's training crawler; Claude-User is its live fetch when a person asks Claude to read a specific page. Per Anthropic's docs, both honor robots.txt — which is notable, because OpenAI and Perplexity say their equivalent user-triggered bots may not. So with Anthropic you can actually block the live user fetch via robots.txt if you want to, whereas with the others you can't rely on it.

If I block training crawlers, can I still get cited in AI search?

Yes — that's exactly why the training, search, and retrieval bots have separate names. Block the training crawlers (GPTBot, ClaudeBot, CCBot, Google-Extended) and leave the search bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot) allowed, and you stay indexable and citable in AI answers while opting out of feeding the models. There's no single "AI" switch; the granularity is the feature.

The honest takeaway

"Block all AI" is a slogan, not a strategy, and pasted blindly it tends to cost people the AI visibility they didn't realize they wanted. The useful frame is the training-versus-answers split: opting out of training is nearly free and often sensible, while blocking search and retrieval is a real decision with a real cost in citations and referral traffic. Decide which one you're actually choosing, use the directive that matches, and remember robots.txt only governs the bots that agree to be governed.

Before you ship any of this, re-check the user-agent strings against the vendors' own docs — they get renamed and added often, and a stale token is a rule that silently does nothing. Then test your robots.txt against a real URL so a typo doesn't lock out the wrong crawler. Different job, same instinct: if you're the type who keeps a robots.txt tidy, you've probably got backlinks worth watching too, which is what LinkGuard does — free to start with 1,000 tokens, no card. And if you're working through the broader AI-search picture, the companion reads are whether an llms.txt is worth adding and what actually earns you a citation once you've decided to let the right bots in.

About the Author

Andrei

SEO and digital marketing professional with 13+ years of experience. Started as a website administrator in 2011, transitioned to SEO, and achieved top-3 rankings for competitive keywords. Co-founded a consulting firm specializing in marketing audits for companies in Ukraine and internationally. Built LinkGuard to solve the problem he experienced firsthand: most SEO teams purchase links but never monitor their survival. Based in Kyiv, Ukraine.
