Robots.txt for AI crawlers: the complete config checklist (2026)

You open robots.txt and realise it hasn't been touched since 2022. Back then AI crawlers weren't a category. Now there are at least ten user-agent names you need an opinion on, and getting one of them wrong means either six weeks of being invisible to ChatGPT or six weeks of being scraped while you think you blocked the door.

The worst version of this story isn't the typo. It's the 3am Sunday when you discover the typo and realise nobody's been reading the file for two months — long enough that the missed citations are someone else's now.

This is the configuration checklist version of our long-form AI bots in robots.txt article. Same opinions, different format — a 22-item tiered list you can work through, save your progress in your browser, and re-open the next time you audit a site. Items are tagged critical (skipping costs you real traffic), important (skipping is a footgun), or nice to have (skipping is fine).

How to use this checklist

Tick items as you go; progress lives in your browser, no account needed. Click "How to do this" inside an item for the exact steps. Filter by tier if you only want to ship the critical fixes first. When you finish, hit "Share progress" to copy a one-line summary you can paste into your team's Slack ("22/22 done, robots.txt audit live").

If you only have 15 minutes, do every critical item in order. The rest can wait for the quarterly review.

Who this is for

Anyone touching robots.txt on a production site. Specifically:

SEO leads doing a quarterly audit of a client site.
SaaS founders configuring a new marketing site or migration.
Publishers deciding which AI crawlers to allow vs block.
Agencies standardising a robots.txt template across many clients.

If you have under a dozen pages and no plan to be cited by ChatGPT, you can probably skip most of this and ship the default User-agent: * group with a sensible Sitemap: line. For everyone else, the twenty-two items below.

One vocabulary note before we start

A user-agent group is one User-agent: line plus every Disallow: and Allow: line beneath it until the next User-agent:. Each crawler reads top-to-bottom, picks the single most-specific group that names it, and follows only that group. User-agent: * is the fallback for crawlers that don't have their own group — never a base layer that other groups inherit from. Carry that mental model through the rest of the checklist.
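
A minimal sketch of that mental model, using Python's standard-library urllib.robotparser. The sample rules and the bot name ExampleBot are made up for illustration, and the stdlib parser only approximates how production crawlers read the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: one wildcard group and one named group.
SAMPLE = """\
User-agent: *
Disallow: /admin/

User-agent: GPTBot
Disallow: /drafts/
"""


def allowed(robots_txt: str, agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# GPTBot follows only its own group, so the wildcard /admin/ rule
# does not apply to it.
print(allowed(SAMPLE, "GPTBot", "/admin/"))
print(allowed(SAMPLE, "GPTBot", "/drafts/"))
# ExampleBot has no group of its own and falls back to the wildcard.
print(allowed(SAMPLE, "ExampleBot", "/admin/"))
```

If the first call surprises you, that is the point of the vocabulary note: the named group replaces the wildcard group for that crawler, it does not extend it.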

What success looks like

Done right, robots.txt is a file you touch four times a year, sleep easy about the rest of the time, and forget exists. A month from now, when somebody in your buyer's Slack asks ChatGPT which tools to use, your name is in the answer — not part of the cohort that got the file wrong. The 22 items below are the cost of buying that quiet.


Pre-flight understanding

Three concept checks before you open the file. Skip these and you'll make confident decisions for the wrong reasons.

Important: Pick your site bucket (SaaS / publisher / agency / affiliate) and write it down

Without a written bucket, the next engineer touching robots.txt will reverse your call by accident. The four buckets imply different defaults: an ad-supported publisher should block training crawlers, while a B2B SaaS should leave them open.

How to do this

Score the site on two questions: (1) Is organic Google traffic more than ~50% of total acquisition? (2) Would you trade being scraped by AI for a chance at being cited in ChatGPT/Perplexity answers?
Bucket: ad-supported publisher / paywalled journalism → block training crawlers. B2B SaaS, docs, open-source → allow everything. Agency / consultancy → allow but wall off proprietary case-study URLs. Affiliate listicle → robots.txt won't save you; invest in original content instead.
Default if unsure: B2B SaaS pattern.

Critical: Confirm robots.txt is at the site root and returns 200

Crawlers only check /robots.txt at the root. Files in subfolders are ignored. We've seen a perfectly written robots.txt in /wp-content/ do nothing for a year, and we've seen a CDN-cached 5xx on the root file make Googlebot pessimistically slow its crawl. Both are silent and easy to miss.

How to do this

From a terminal: curl -I https://yoursite.com/robots.txt. Confirm 200 OK and Content-Type: text/plain. If you get 404, the file isn't served from the root — move it. If you get 301/302, follow the chain and make sure the final response is plain-text 200. If you get 5xx, your CDN or origin is broken; fix that before anything else.
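
If you would rather script that check than eyeball curl output, here is a rough sketch using only the Python standard library. The function names classify and check are our own, and the wording of each verdict is illustrative.

```python
import urllib.error
import urllib.request


def classify(status: int, content_type: str) -> str:
    """Map a robots.txt response to the action described above."""
    if 300 <= status < 400:
        return "redirect: follow the chain and check the final response"
    if status == 404:
        return "missing: the file is not served from the root, move it"
    if status >= 500:
        return "server error: fix the CDN or origin first"
    if status == 200 and content_type.lower().startswith("text/plain"):
        return "ok"
    return "unexpected: inspect the response by hand"


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Surface 301/302 as errors instead of silently following them."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def check(site: str) -> str:
    """Send a HEAD request for /robots.txt at `site`, e.g. "https://yoursite.com"."""
    url = site.rstrip("/") + "/robots.txt"
    request = urllib.request.Request(url, method="HEAD")
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        with opener.open(request, timeout=10) as response:
            return classify(response.status, response.headers.get("Content-Type", ""))
    except urllib.error.HTTPError as err:
        # Redirects, 404 and 5xx all arrive here as HTTPError.
        return classify(err.code, err.headers.get("Content-Type", ""))
```

Call check("https://yoursite.com") against your own domain. Anything other than "ok" maps back to one of the branches in the step above.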

Important: Write a one-line policy comment at the top of the file

Without a written rationale, in six months you (or the next agency) won't remember why you allowed GPTBot or blocked CCBot, and someone will reverse the call by accident during the next audit. The comment is the single artifact that survives team churn.

How to do this

Decide the stance with the content owner (founder, editor, CMO) before writing the file. Then add a single top-of-file comment, dated:

    # Policy: allow all AI crawlers (SaaS visibility strategy, 2026-05-26)

The date matters: it tells the next reviewer when the policy was last revisited.

Training crawlers: the four to know

These bots fetch your pages to fold into the next model. Blocking them costs nothing today; allowing them buys a non-zero chance of being named in future AI answers.

Critical: Choose allow or block for GPTBot (OpenAI's training crawler)

GPTBot was announced in August 2023. It fetches only for training and does not power ChatGPT browsing; that's ChatGPT-User, a separate group. Block to opt out of OpenAI's next training pass; allow to keep a shot at being in future model weights.

How to do this

To block:

    User-agent: GPTBot
    Disallow: /

To allow (default), no group is required, or be explicit:

    User-agent: GPTBot
    Allow: /

Critical: Choose allow or block for ClaudeBot (Anthropic's training crawler)

ClaudeBot is Anthropic's training crawler. Anthropic also runs Claude-SearchBot for search indexing and Claude-User for user-triggered fetches: three separate groups, three separate decisions. Don't conflate them.

How to do this

Block training only:

    User-agent: ClaudeBot
    Disallow: /

Common typo to avoid: Claude-Bot with a hyphen does nothing. The Anthropic user-agent is ClaudeBot, one word.

Critical: Choose allow or block for PerplexityBot (Perplexity's index crawler)

PerplexityBot builds Perplexity's search index. Its citations are clickable, so being in the pool is valuable top-of-funnel for SaaS and docs. Blocking this is a content-product decision, not a default.

How to do this

Block:

    User-agent: PerplexityBot
    Disallow: /

Note: Perplexity also runs Perplexity-User for user-triggered fetches. That is a separate group; see the User-triggered fetchers category below.

Critical: Choose allow or block for Google-Extended (Gemini and Vertex AI training only)

TL;DR: Google-Extended controls Gemini and Vertex AI training. It does NOT affect Google Search rankings and does NOT affect AI Overviews; those are powered by the regular Google index, which is controlled by Googlebot. Decide based on Gemini-citation strategy only. Most teams don't realise they can opt out of Gemini training and keep AI Overview eligibility. You can.

How to do this

To opt out of Gemini training only:

    User-agent: Google-Extended
    Disallow: /

Common mistake: GoogleExtended (no hyphen) does nothing. The correct user-agent is Google-Extended, with the hyphen. The Google Search Central crawlers page confirms the scope on the Google-Extended row.

Important: Choose allow or block for CCBot (Common Crawl, feeds many models)

Common Crawl has scraped the open web since 2008 and its archive feeds dozens of model providers. Blocking CCBot in 2026 doesn't claw back anything already in the archive; it only stops future fetches. Most SaaS sites leave it open; most publishers block.

How to do this

Block:

    User-agent: CCBot
    Disallow: /

Reference: the Common Crawl CCBot page documents the crawler's fetch behaviour and IP ranges.

Heads-up: our robots.txt tester currently covers six crawlers (Googlebot, Bingbot, GPTBot, ClaudeBot, PerplexityBot, Google-Extended) and does not yet include CCBot, so verify your CCBot rule by hand until we extend coverage.

User-triggered fetchers: different mechanic, different decision

These bots only fire when a specific user pastes your URL into ChatGPT / Claude / Perplexity. Blocking them breaks a feature your readers might actually use: the user already chose to read your site, and you're just refusing the assistant they're using to read it.

Important: Allow ChatGPT-User unless you want fetch failures when readers paste your URL

When a user asks ChatGPT to read a specific URL (including ChatGPT Search clicking a citation), the fetch comes from ChatGPT-User, not GPTBot. Block it and the user gets a polite "I can't access that site" reply. For most sites this is friction you don't want.

How to do this

Default (allow): omit the group entirely.

To explicitly block:

    User-agent: ChatGPT-User
    Disallow: /

Note the hyphen and capital U: ChatGPT-User.

Important: Allow Claude-User and Claude-Web by default; blocking breaks an explicit user action

Claude-User / Claude-Web fire when a Claude user asks the assistant to browse a URL. Same logic as ChatGPT-User: the user explicitly chose to read your site. Decide differently from ClaudeBot if needed; these aren't the same crawler.

How to do this

To block both:

    User-agent: Claude-User
    Disallow: /

    User-agent: Claude-Web
    Disallow: /

Anthropic's documentation uses both names across different references, so include both groups to be safe.

Important: Allow Perplexity-User even if you block PerplexityBot (different mechanic)

Perplexity separates PerplexityBot (the index crawler) from Perplexity-User (the user-triggered fetcher). Blocking only the index crawler leaves user-triggered behaviour intact, which is usually what publishers want.

How to do this

Block user-triggered (rarely done):

    User-agent: Perplexity-User
    Disallow: /

Mechanics: write the file without breaking SEO

The syntax mistakes that show up most often in robots.txt code review. Two of these can take a site out of Google.

Critical: Use Disallow: / to actually block (an empty Disallow allows everything)

Disallow: with no value means "allow everything for this bot". Authors regularly write User-agent: GPTBot followed by Disallow: (empty) thinking they've blocked it. They've done the opposite.

How to do this

Block correctly:

    User-agent: GPTBot
    Disallow: /

The slash is the URL path to disallow: / means everything from root.
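
A quick way to convince yourself, sketched with Python's standard-library urllib.robotparser. The /pricing path is just an example URL; any path behaves the same way.

```python
from urllib.robotparser import RobotFileParser


def gptbot_allowed(robots_txt: str, path: str = "/pricing") -> bool:
    """Return True if GPTBot may fetch `path` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("GPTBot", path)


EMPTY = "User-agent: GPTBot\nDisallow:\n"   # the common mistake
SLASH = "User-agent: GPTBot\nDisallow: /\n"  # the actual block

print(gptbot_allowed(EMPTY))  # empty value: nothing is blocked
print(gptbot_allowed(SLASH))  # "/": everything is blocked
```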

Critical: Never Disallow your CSS or JS folders from Googlebot

Googlebot renders your pages with their CSS and JavaScript to understand content and layout. (Core Web Vitals, Google's page-speed and layout-stability metrics, are measured from real-user Chrome data rather than from Googlebot's render.) If Disallow blocks /assets/ or /static/ from Googlebot, it sees a broken-looking page and ranks you accordingly. In our reviews, this is the single most common SEO-breaking robots.txt mistake.

How to do this

To find the right paths: open the site in Chrome DevTools → Network tab → reload. Note which folder paths the CSS and JS files load from. Those are the folders you must not Disallow for Googlebot.

If you already have a wildcard rule blocking them, carve out an explicit Googlebot allow:

    User-agent: Googlebot
    Allow: /assets/

    User-agent: *
    Disallow: /assets/
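
The carve-out can be sanity-checked locally. This is a sketch with Python's standard-library parser; /assets/app.css and ExampleBot are placeholder names, so swap in a real asset path from your Network tab.

```python
from urllib.robotparser import RobotFileParser

CARVE_OUT = """\
User-agent: Googlebot
Allow: /assets/

User-agent: *
Disallow: /assets/
"""


def allowed(agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` under CARVE_OUT."""
    parser = RobotFileParser()
    parser.parse(CARVE_OUT.splitlines())
    return parser.can_fetch(agent, path)


print(allowed("Googlebot", "/assets/app.css"))   # Googlebot can load the stylesheet
print(allowed("ExampleBot", "/assets/app.css"))  # everyone else is still blocked
```

One side effect to keep in mind: once Googlebot has its own group, it stops reading the wildcard group altogether, so any other wildcard rules you still want applied to Googlebot have to be repeated in its group.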

Important: Duplicate your wildcard rules into every named-bot group (they don't cascade)

User-agent: * is the fallback for crawlers without their own group, NOT a base layer other groups inherit from. If you have a User-agent: GPTBot group AND a User-agent: * group, GPTBot reads ITS group only; it never sees the wildcard rules. People assume rules compound; they don't.

How to do this

Step 1: open the file, list every named-bot group.
Step 2: copy every Disallow: and Allow: line from the wildcard User-agent: * group into each named group.
Step 3: only THEN add the bot-specific rules at the bottom of each named group. Save.
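
For files with many named groups, the three steps can be roughed out in code. This is a simplified sketch, not a full robots.txt parser: it only understands User-agent, Allow and Disallow lines, ignores comments and Sitemap lines, and the function names are our own.

```python
def parse_groups(robots_txt: str) -> dict[str, list[str]]:
    """Return {user-agent: [rule lines]}; consecutive User-agent lines share rules."""
    groups: dict[str, list[str]] = {}
    current: list[str] = []  # agents named by the group being read
    in_rules = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a rule line closed the previous group
                current, in_rules = [], False
            current.append(value)
            groups.setdefault(value, [])
        elif field in ("allow", "disallow") and current:
            in_rules = True
            for agent in current:
                groups[agent].append(f"{field.capitalize()}: {value}")
    return groups


def with_wildcard_rules(robots_txt: str) -> dict[str, list[str]]:
    """Put the wildcard rules first in every named group, then its own rules."""
    groups = parse_groups(robots_txt)
    base = groups.get("*", [])
    merged = {}
    for agent, rules in groups.items():
        if agent == "*":
            merged[agent] = list(rules)
        else:
            merged[agent] = base + [rule for rule in rules if rule not in base]
    return merged


SAMPLE = """\
User-agent: *
Disallow: /admin/
Disallow: /cart/

User-agent: GPTBot
Disallow: /drafts/
"""

for agent, rules in with_wildcard_rules(SAMPLE).items():
    print(f"User-agent: {agent}")
    print("\n".join(rules), end="\n\n")
```

Treat the printed groups as a draft to paste back by hand, since the sketch drops your comments and Sitemap lines.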

Important: Add a Sitemap: directive at the end of the file

Sitemap: sits outside every user-agent group, so any crawler that honours it (Google, Bing, Perplexity, GPTBot when allowed) can read it, and it shortens discovery time for new URLs at zero cost.

How to do this

Last line of the file:

    Sitemap: https://yoursite.com/sitemap.xml

If you split the sitemap into several files, add one Sitemap: line per file. The same precaution applies to canonical tags: a single bad rel="canonical" can take a profitable page out of Google overnight, and our canonical tag checker validates the same set of pages AI crawlers are about to fetch.

Important: Comment every group with the reason it exists

At 20-plus items across 5 categories the no-comments version of this file becomes unauditable within two quarters. Comments aren't decoration; they're the reason the next person doesn't undo your work.

How to do this

    # Block AI training crawlers (content licensing policy, 2026-05-26)
    User-agent: GPTBot
    Disallow: /

    # Allow Perplexity index (we want SaaS citations)
    User-agent: PerplexityBot
    Allow: /

Verify before pushing live

Six checks between "saved the file" and "pushed to production". Skipping these is how a typo silently breaks a site for six weeks: Google takes 4-8 weeks of recrawl cycles to fully reverse a bad robots.txt.

Critical: Verify the production robots.txt isn't the staging one

Migration-day classic. Staging usually has User-agent: * / Disallow: / to keep out crawlers. Deploy pushes staging's robots.txt to prod, and the site disappears from Google in 48 hours. This is the single most common cause of post-migration traffic drops.

How to do this

After every deploy, curl https://yoursite.com/robots.txt and visually compare it to the file in your repo. If they differ, the deploy didn't ship the right file — fix the build pipeline before the next Googlebot fetch.
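
Visual comparison is easy to skip on a busy deploy day, so here is a sketch of the same check in code. It assumes you already have both files as strings (the repo copy read from disk, the live copy fetched however you prefer); the function names are our own.

```python
import difflib


def robots_diff(repo_text: str, live_text: str) -> list[str]:
    """Unified diff between the repo copy and the live copy; empty when they match."""
    repo_lines = [line.rstrip() for line in repo_text.strip().splitlines()]
    live_lines = [line.rstrip() for line in live_text.strip().splitlines()]
    return list(difflib.unified_diff(repo_lines, live_lines, "repo", "live", lineterm=""))


def looks_like_staging(live_text: str) -> bool:
    """True when the wildcard group contains a blanket 'Disallow: /'."""
    agents: list[str] = []
    in_rules = False
    for raw in live_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a rule line closed the previous group
                agents, in_rules = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/" and "*" in agents:
                return True
    return False


REPO = "User-agent: *\nDisallow: /admin/\n"
STAGING = "User-agent: *\nDisallow: /\n"

print(robots_diff(REPO, STAGING))   # non-empty: the deploy shipped the wrong file
print(looks_like_staging(STAGING))  # the blanket block is in the wildcard group
```

Wired into a post-deploy step, a non-empty diff or a True from looks_like_staging is a reasonable signal to fail the pipeline before the next Googlebot fetch.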

Critical: Test the file against every named crawler before deploying

A typo in ClaudeBot or Google-Extended will silently fail. Most free standalone robots.txt testers only validate Googlebot and Bingbot and won't flag it. Run every change through a tester that knows the AI user-agent names.

How to do this

Open our free robots.txt tester. Paste the file content into the textarea, type the path you want to test (e.g. /blog/) in the path input, and the verdict appears per crawler. The tester checks six crawlers — Googlebot, Bingbot, GPTBot, ClaudeBot, PerplexityBot, Google-Extended — and flags AI-crawler blocks so you know what you're trading away.
For Sitebulb or Screaming Frog users: their config-level robots.txt tester accepts custom user-agents and does the same job from inside the crawler config.

Critical: Spell-check every user-agent name against the reference list

Typos in user-agent names silently fail open. Claude-Bot (with hyphen) does nothing; the real name is ClaudeBot. GoogleExtended (no hyphen) does nothing; the real name is Google-Extended. A typo costs you several weeks of being unprotected while you think the site is sealed.
How to do this
Reference list, copy verbatim. Each name on its own line:
GPTBot — one word, capital G + B
ClaudeBot — one word, no hyphen
Claude-User — with hyphen
Claude-Web — with hyphen
Claude-SearchBot — if you also want to opt out of search indexing
PerplexityBot — one word
Perplexity-User — with hyphen
Google-Extended — with hyphen
CCBot — capital C C
ChatGPT-User — with hyphen, capital U
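
The reference list lends itself to an automated spell-check. The sketch below flags any User-agent value that is not on the list and suggests the closest known name. The IGNORED set and the 0.8 similarity cutoff are our own assumptions, so tune both to taste.

```python
import difflib

KNOWN = [
    "GPTBot", "ClaudeBot", "Claude-User", "Claude-Web", "Claude-SearchBot",
    "PerplexityBot", "Perplexity-User", "Google-Extended", "CCBot", "ChatGPT-User",
]
# Not part of the reference list; added so ordinary groups are not flagged.
IGNORED = {"*", "googlebot", "bingbot"}


def suspicious_agents(robots_txt: str) -> dict[str, str]:
    """Map each unrecognised User-agent value to its closest known name ('' if none)."""
    lookup = {name.lower(): name for name in KNOWN}
    findings: dict[str, str] = {}
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if not sep or field.strip().lower() != "user-agent":
            continue
        name = value.strip()
        if name.lower() in lookup or name.lower() in IGNORED:
            continue
        close = difflib.get_close_matches(name.lower(), list(lookup), n=1, cutoff=0.8)
        findings[name] = lookup[close[0]] if close else ""
    return findings


SAMPLE = """\
User-agent: Claude-Bot
Disallow: /

User-agent: GoogleExtended
Disallow: /

User-agent: GPTBot
Disallow: /
"""

for typo, suggestion in suspicious_agents(SAMPLE).items():
    print(f"{typo!r} is not on the list; did you mean {suggestion!r}?")
```

An empty suggestion means the name is simply unknown to the list, which may be a legitimate crawler you added on purpose.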

Critical: Test /, /pricing, and /blog against Googlebot after every change

Adding AI-bot rules has a way of accidentally tightening the wildcard group, which controls Googlebot unless Googlebot has its own group. After every robots.txt change, smoke-test the rules that apply to Googlebot against the homepage, pricing, /blog, and your top 10 organic-traffic pages from Search Console (Performance → Pages, sort by clicks).

How to do this

In our tester, test each important path — the verdict for Googlebot must be ALLOWED. If any return DISALLOWED, fix the wildcard group before deploying.
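
The same smoke test can run locally as a pre-deploy guard. A sketch, assuming Python's standard-library parser is close enough for simple prefix rules: it applies rules in file order and does not understand * or $ wildcards inside paths, so treat it as a first pass rather than a substitute for the tester. The BROKEN and FIXED files are invented examples.

```python
from urllib.robotparser import RobotFileParser


def blocked_for_googlebot(robots_txt: str, paths: list[str]) -> list[str]:
    """Return the subset of `paths` that Googlebot may not fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [path for path in paths if not parser.can_fetch("Googlebot", path)]


# Extend with your top organic pages from Search Console.
PATHS = ["/", "/pricing", "/blog/"]

# Hypothetical slip: a block meant for GPTBot landed in the wildcard group.
BROKEN = "User-agent: *\nDisallow: /\n"
FIXED = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nDisallow: /admin/\n"

print(blocked_for_googlebot(BROKEN, PATHS))  # every path is listed
print(blocked_for_googlebot(FIXED, PATHS))   # empty list: safe to deploy
```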

Critical: Check Search Console > Settings > robots.txt report after deploy

Local testers tell you what your rules SAY. Google Search Console tells you what Google actually fetched and parsed, including CDN-staleness or rewrite issues a local test can't catch. If Google's view of the file disagrees with your local one, fix the delivery pipeline.

How to do this

GSC → Settings → robots.txt — view the latest fetch. Confirm: (1) the fetch is HTTP 200, (2) the file content matches your repo, (3) there are no parse errors flagged in the report. If anything looks wrong, click "Request a recrawl" after fixing.

Important: Schedule the next robots.txt audit in 90 days

New AI products launch new user-agents constantly. Apple Intelligence added Applebot-Extended in 2024. ByteDance runs Bytespider; Amazon runs Amazonbot; Anthropic split out Claude-SearchBot. Without a calendar trigger this file is stale within twelve months.

How to do this

Add a recurring 30-minute calendar event every 90 days: "Re-audit robots.txt". On that date, walk the Critical and Important items above, plus check the Google crawlers overview for any new bots announced since last audit.

About the Author

Andrei

SEO and digital marketing professional with 13+ years of experience. Started as a website administrator in 2011, transitioned to SEO, and achieved top-3 rankings for competitive keywords. Co-founded a consulting firm specializing in marketing audits for companies in Ukraine and internationally. Built LinkGuard to solve the problem he experienced firsthand: most SEO teams purchase links but never monitor their survival. Based in Kyiv, Ukraine.
