
How to Configure robots.txt for AI Crawlers (GPTBot, ClaudeBot, PerplexityBot & More)

AI Crawlability Editorial, updated June 4, 2026

To be eligible for citation in AI answers, your robots.txt must allow the retrieval crawlers (GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, PerplexityBot and Google-Extended), named with the exact user-agent tokens each vendor documents. AI crawlers are not blocked by default; they crawl unless you disallow them. You can allow the retrieval bots while disallowing training-only crawlers like CCBot or Bytespider, but keep two honest limits in mind: robots.txt is voluntary, so non-compliant scrapers ignore it, and allowing a bot is necessary but not sufficient for citation, because the link between access and being cited is correlational, not proven.

Which AI crawlers should your robots.txt name?

Name the retrieval crawlers explicitly: as of 2026 that means OpenAI's GPTBot, OAI-SearchBot and ChatGPT-User, Anthropic's ClaudeBot, PerplexityBot, and Google-Extended. Think in two groups. Retrieval crawlers fetch content to answer a live user question and can cite you in the response — these are the ones you must allow to be eligible for AI answers. Training crawlers ingest content to train models. Per Google Search Central, robots.txt is the standard way to tell these crawlers which paths they may fetch. The stakes are real: ChatGPT alone reportedly handles on the order of 2.5 billion prompts per day — a figure cited by Genrank without a named primary source, so treat it as indicative of scale rather than a precise measure.

OpenAI is the common trap: it operates three distinct crawlers — GPTBot (training), ChatGPT-User (browsing and retrieval), and OAI-SearchBot (search retrieval) — so a robots.txt that only names GPTBot silently misses the other two. Anthropic uses ClaudeBot and an older anthropic-ai token; the exact relationship between those tokens is not clearly documented, so treat them as separate entries rather than assuming they are equivalent.

Google-Extended behaves differently from the rest: it is a control token, not a separate HTTP user-agent. Actual crawling is performed under existing Google user-agent strings, so you will not see Google-Extended traffic isolated in your server logs. The robots.txt token still governs whether content Google has crawled can be used for Gemini model training and grounding; per Google's crawler documentation, it does not affect inclusion in Google Search, which is where AI Overviews appear. Training-only crawlers to be aware of include CCBot (Common Crawl), Bytespider (ByteDance), and Applebot-Extended.

  • Retrieval (can cite you): GPTBot, OAI-SearchBot, ChatGPT-User (OpenAI); ClaudeBot and anthropic-ai (Anthropic); PerplexityBot; Google-Extended.
  • Training-only (optional to block): CCBot (Common Crawl), Bytespider (ByteDance), Applebot-Extended (Apple).
  • The OpenAI trap: naming only GPTBot silently misses ChatGPT-User and OAI-SearchBot.
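
To see which of these crawlers actually reach your server, one option is to tally user-agent tokens in your access logs. The sketch below is illustrative only: it assumes a log format where the user-agent is the last double-quoted field on each line, the helper name tally_ai_hits is hypothetical, and the sample lines are made up rather than copied from any vendor's real user-agent string. Google-Extended is left out of the list because, as noted above, it never appears as a user-agent.

```python
import re
from collections import Counter

# A subset of the tokens named in this article. Google-Extended is omitted
# on purpose: it is a robots.txt control token, not an HTTP user-agent.
AI_TOKENS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
             "anthropic-ai", "PerplexityBot", "CCBot", "Bytespider"]

# Assumption: the user-agent is the last double-quoted field on the line.
LAST_QUOTED = re.compile(r'"([^"]*)"\s*$')


def tally_ai_hits(log_lines):
    """Count log lines whose user-agent contains a known AI crawler token."""
    counts = Counter()
    for line in log_lines:
        match = LAST_QUOTED.search(line)
        if not match:
            continue
        user_agent = match.group(1).lower()
        for token in AI_TOKENS:
            if token.lower() in user_agent:
                counts[token] += 1
    return counts


# Made-up sample lines for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [04/Jun/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.6 - - [04/Jun/2026:10:00:05 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '203.0.113.7 - - [04/Jun/2026:10:00:09 +0000] "GET /b HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/126.0"',
]

print(tally_ai_hits(SAMPLE_LOG))
```

Because user-agent strings can be spoofed, a tally like this shows which bots claim to visit, not which ones verifiably did.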

How do AI crawlers behave by default?

By default, AI crawlers are not blocked — they will crawl your site freely unless you explicitly disallow them in robots.txt. So doing nothing means you are allowing every compliant AI bot, which is usually what you want for visibility.

robots.txt is also not a privacy or invisibility tool. A page you disallow can still be indexed by Google if it is linked from other sites, so disallowing a URL does not guarantee it stays out of results. And opting out of some AI uses is not always a robots.txt job: Bing, for example, uses its crawlers for both search and AI training, and opting out of the AI-training use requires a page-level meta tag rather than a robots.txt directive.

Finally, remember the hard limitation: compliance is voluntary. Well-behaved crawlers honor your rules, but non-compliant scrapers routinely ignore robots.txt, spoof user-agent strings, or use residential IPs to look like ordinary browsers — so robots.txt is the right tool for managing compliant AI engines, not for stopping determined scraping.

What does a good AI-friendly robots.txt look like?

A permissive, citation-friendly configuration names each of the 6 retrieval bots explicitly and allows it. In practice that is a block of User-agent: GPTBot followed by Allow: /, then the same pattern repeated for OAI-SearchBot, ChatGPT-User, ClaudeBot, PerplexityBot and Google-Extended, finished with a single Sitemap: https://yourdomain.com/sitemap.xml line so crawlers can discover your pages.

If you want to remain citable while opting out of model training, keep the Allow blocks above for the retrieval bots and add separate blocks that Disallow: / for training-only crawlers such as CCBot, Bytespider, and Applebot-Extended. Be deliberate about this split: the retrieval bots are the ones that can send you AI citations, so blocking them removes you from that engine's answers entirely.

Keep the rules simple and explicit. A short, well-labeled robots.txt that names each AI user agent is easier to audit and less error-prone than clever wildcard logic.

  • User-agent: GPTBot then Allow: / — repeat the pair for OAI-SearchBot, ChatGPT-User, ClaudeBot, PerplexityBot, and Google-Extended.
  • Sitemap: https://yourdomain.com/sitemap.xml on its own line so crawlers can discover your pages.
  • To opt out of training only, add User-agent: CCBot then Disallow: / (and the same for Bytespider and Applebot-Extended).
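
The configuration described above can be written out and sanity-checked in a few lines. This is a minimal sketch, not a definitive template: the file content is held in a Python string so it can be tested with the standard-library urllib.robotparser, yourdomain.com is a placeholder, and Python's parser is only one interpretation of robots.txt, so individual crawlers may resolve edge cases differently.

```python
from urllib.robotparser import RobotFileParser

RETRIEVAL_BOTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User",
                  "ClaudeBot", "PerplexityBot", "Google-Extended"]
TRAINING_BOTS = ["CCBot", "Bytespider", "Applebot-Extended"]

# The text you would save as /robots.txt (placeholder domain).
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Applebot-Extended
Disallow: /

Sitemap: https://yourdomain.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Print the allow/deny decision this parser reaches for each bot.
for bot in RETRIEVAL_BOTS + TRAINING_BOTS:
    print(bot, parser.can_fetch(bot, "https://yourdomain.com/any-page"))

# site_maps() requires Python 3.8 or later.
print(parser.site_maps())
```

If you only want the permissive version, drop the three Disallow blocks; bots with no matching group are allowed by default.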

Should you block training crawlers?

You can allow retrieval bots, so you remain eligible for citation, while disallowing training-only crawlers like CCBot, Bytespider, and Applebot-Extended. This is a values and bandwidth decision, not a ranking optimization: there is no evidence that blocking training crawlers improves or harms your citation outcomes. Decide based on your stance on having your content used for model training.

One caveat on enforcement: the compliance behavior of some training crawlers, including Bytespider and CCBot, is not reliably confirmed, so a Disallow directive is a request that compliant bots honor rather than a guarantee. Community projects such as ai-robots-txt/ai.robots.txt maintain curated lists of AI user-agent strings if you want a maintained reference for which tokens to target.

If your priority is simply being visible in AI answers, the safe default is to allow the retrieval bots and leave training-crawler decisions for later.

What are the most common robots.txt mistakes for AI?

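The most misunderstood point is case sensitivity. Under the Robots Exclusion Protocol (RFC 9309), crawlers match the user-agent line case-insensitively, so a standards-compliant parser treats gptbot and GPTBot as the same token; it is the path values in Allow and Disallow rules that are case-sensitive. Copying the exact casing from each vendor's documentation is still the safest habit, because not every crawler's parser is guaranteed to follow the standard.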

The next most common errors are naming only GPTBot (and missing ChatGPT-User and OAI-SearchBot), using an over-broad wildcard Disallow that accidentally catches AI bots, relying on robots.txt to hide content, and forgetting the Sitemap directive. Each of these quietly reduces your eligibility for AI answers.

Watch your CDN and WAF too. Cloudflare's managed robots.txt and AI block rule (part of Super Bot Fight Mode) can override a custom Allow rule for a bot like GPTBot unless you explicitly disable the managed block first — so a robots.txt that looks permissive can still be overridden at the edge. Cloudflare also offers a managed Content Signals Policy that lets you set preferences such as ai-train=no without hand-editing robots.txt.

  • Casing: compliant parsers match user-agent tokens case-insensitively, but paths are case-sensitive; use each vendor's exact token casing to be safe.
  • Naming only GPTBot, leaving ChatGPT-User and OAI-SearchBot uncovered.
  • An over-broad wildcard Disallow that accidentally catches AI bots.
  • Relying on robots.txt to hide content — a disallowed page can still be indexed if linked elsewhere.
  • Forgetting the Sitemap directive.
  • Edge overrides: a CDN or WAF managed AI block silently overriding your Allow rules.
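
Two of these mistakes are easy to reproduce locally. The sketch below uses Python's standard-library urllib.robotparser as a stand-in for a compliant crawler; the helper name allowed and the sample URL are hypothetical, and real crawlers may differ from this parser in edge cases.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, bot, url="https://yourdomain.com/article"):
    """Return True if this parser lets `bot` fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(bot, url)


# Mistake: only GPTBot is named, so the rule never reaches the other
# two OpenAI crawlers.
ONLY_GPTBOT = """\
User-agent: GPTBot
Disallow: /
"""

# Mistake: a catch-all Disallow with no per-bot groups blocks every bot.
BROAD_WILDCARD = """\
User-agent: *
Disallow: /
"""

# One way out: a group that names a bot takes precedence over the
# wildcard group for that bot.
WILDCARD_WITH_ALLOW = """\
User-agent: *
Disallow: /

User-agent: GPTBot
Allow: /
"""

for name, text in [("only GPTBot", ONLY_GPTBOT),
                   ("broad wildcard", BROAD_WILDCARD),
                   ("wildcard + named allow", WILDCARD_WITH_ALLOW)]:
    for bot in ["GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot"]:
        print(f"{name:24} {bot:14} {allowed(text, bot)}")
```

In the last variant only GPTBot is let through; every other bot still falls back to the wildcard Disallow, which is why each retrieval bot needs its own group.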

How do you verify your robots.txt is allowing AI bots?

Run these 4 quick checks. First, open your /robots.txt and confirm each retrieval bot — GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, PerplexityBot, Google-Extended — is allowed (or at least not disallowed) with exact casing. Second, check your CDN or WAF for managed AI-blocking rules that could override the file. Third, inspect your response headers for an X-Robots-Tag that might restrict access. Fourth, confirm your Sitemap line is present and the sitemap is reachable.

Doing this by hand is fiddly, especially the per-bot and edge-rule checks. Our free AI crawlability checker runs these checks for you across the major AI crawlers and returns a pass or fail per bot with specific fixes.

For the wider context — rendering, schema, and how engines actually choose sources — read the complete AI crawlability guide.

  1. Open /robots.txt and confirm each retrieval bot is allowed with exact casing.
  2. Check your CDN or WAF for managed AI-blocking rules that could override the file.
  3. Inspect response headers for an X-Robots-Tag that restricts access.
  4. Confirm the Sitemap directive is present and the sitemap resolves.
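
Checks 1, 3 and 4 can be roughed out in code; check 2 cannot, because CDN and WAF rules live in your provider's dashboard rather than in anything a script can read from the site. The sketch below is one possible shape, under stated assumptions: the function name audit and the report keys are hypothetical, the inputs are passed in as plain text and a header dictionary so the logic can run offline, and treating noindex or none as "restrictive" is a simplification of what X-Robots-Tag can express.

```python
from urllib.robotparser import RobotFileParser

RETRIEVAL_BOTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User",
                  "ClaudeBot", "PerplexityBot", "Google-Extended"]


def audit(robots_txt, response_headers, page_url):
    """Summarize robots.txt, X-Robots-Tag and sitemap status for one page."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    # Check 1: per-bot decision according to Python's parser.
    bots = {bot: parser.can_fetch(bot, page_url) for bot in RETRIEVAL_BOTS}

    # Check 3: header names are case-insensitive, so compare in lowercase.
    x_robots = ""
    for name, value in response_headers.items():
        if name.lower() == "x-robots-tag":
            x_robots = value.lower()
    restrictive = any(word in x_robots for word in ("noindex", "none"))

    # Check 4: site_maps() returns None when no Sitemap line exists
    # (Python 3.8 or later). Reachability still needs a separate request.
    sitemaps = parser.site_maps() or []

    return {"bots": bots,
            "x_robots_tag_restricts": restrictive,
            "sitemaps": sitemaps}


# Made-up inputs for illustration.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Disallow: /

Sitemap: https://yourdomain.com/sitemap.xml
"""

report = audit(SAMPLE_ROBOTS, {"X-Robots-Tag": "noindex"},
               "https://yourdomain.com/")
print(report)
```

For a live site, the robots.txt text and the response headers could come from urllib.request.urlopen, with the results passed into audit unchanged.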

What are the key takeaways?

Configuring robots.txt for the major AI crawlers comes down to 5 points: allow the 6 retrieval bots, mind OpenAI's 3 separate crawlers, use each vendor's exact user-agent tokens, accept that robots.txt is voluntary, and watch for edge rules that override it.

  • Allow the retrieval bots — GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, PerplexityBot, Google-Extended — to be eligible for AI citations.
  • OpenAI runs three crawlers, so naming only GPTBot covers just one of them.
  • Use each vendor's exact user-agent token: RFC 9309 makes token matching case-insensitive, but exact casing is the safe habit and paths remain case-sensitive.
  • robots.txt is voluntary and the access-to-citation link is correlational — necessary, never sufficient.
  • Check edge rules (for example Cloudflare) that can override a permissive robots.txt.

Frequently asked questions

Does allowing AI bots in robots.txt guarantee I will be cited?

No. Allowing the retrieval bots is necessary to be eligible, but it is not sufficient. No source proves a causal link between access and citation; the relationship is correlational. Clarity, authority, and structure still decide whether you are actually cited.

Will blocking GPTBot remove me from ChatGPT?

Blocking a retrieval crawler makes you ineligible for that engine's answers. Because OpenAI runs GPTBot, ChatGPT-User, and OAI-SearchBot, be deliberate about which you block — and remember that naming only GPTBot leaves the other two untouched.

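Are robots.txt user-agent rules case-sensitive?

Not under the standard. RFC 9309 requires crawlers to match user-agent tokens case-insensitively, so a compliant crawler treats gptbot and GPTBot alike; the paths in Allow and Disallow rules are what is case-sensitive. Using the exact casing from each vendor's documentation remains the safest choice, since not every parser is guaranteed to follow the standard.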

Does robots.txt stop AI scrapers from taking my content?

Not reliably. robots.txt is voluntary: compliant crawlers honor it, but non-compliant scrapers ignore it, spoof user agents, or use residential IPs. It manages well-behaved AI engines, not determined scraping.

Why don't I see Google-Extended in my server logs?

Google-Extended is a control token rather than a separate HTTP user-agent. Crawling happens under existing Google user-agent strings, so the token governs AI use of your content without appearing as distinct traffic in your logs.
