robots.txt for AI crawlers: configure GPTBot, ClaudeBot, and more

The robots.txt file you set up in 2024 is probably wrong now

Most of the robots.txt guides written two years ago told you to block AI crawlers. Block GPTBot, block ClaudeBot, protect your content. That advice made sense at the time, when AI crawlers meant one thing: scraping your content to train a model. It does not make sense now. The bots have multiplied, they have different jobs, and the robots.txt decisions you make today directly affect whether your site shows up in ChatGPT search results, Perplexity answers, and Claude's responses to real user questions.

This is a practical configuration guide for 2026. It covers the correct user-agent strings for each major AI platform, what happens when you block the wrong bot, and the specific syntax that separates training-data protection from AI search visibility. The stakes are real: Digiday data showed ChatGPT referrals growing 52% year-over-year between September and November 2025, with Gemini referral traffic up 388% in the same period. Getting your robots.txt rules for AI crawlers wrong is no longer a theoretical SEO concern.

Why one bot per company is no longer accurate

Every major AI platform now runs multiple crawlers with distinct functions, and each one reads your robots.txt rules independently. Blocking one does not block the others.

OpenAI operates three separate user agents: GPTBot (GPTBot/1.1) collects content for model training, OAI-SearchBot (OAI-SearchBot/1.0) builds the index that powers ChatGPT search, and ChatGPT-User (ChatGPT-User/1.0) fetches pages in real time when a user's query requires it. All three have published IP verification files at openai.com/gptbot.json, openai.com/searchbot.json, and openai.com/chatgpt-user.json. OpenAI's official documentation states clearly that sites blocking OAI-SearchBot will not appear in ChatGPT search answers.

Anthropic's architecture is similarly split. ClaudeBot handles training data collection, Claude-User fetches pages when a Claude user's question requires live retrieval, and Claude-SearchBot indexes content for Claude's search results. Anthropic's three-bot structure was the subject of significant coverage in early 2026 precisely because so many sites had written robots.txt rules that only addressed ClaudeBot, leaving the other two completely unrestricted.

Blocking ClaudeBot stops training data collection. It does nothing to Claude-SearchBot or Claude-User.

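Google-Extended works differently: it is a robots.txt control token, not a separate crawler. Google's existing crawlers still do the fetching, and a Google-Extended rule tells Google whether that content may be used for its AI models. It has no effect on inclusion in Google Search. Perplexity uses PerplexityBot for indexing and Perplexity-User for retrieval. The pattern is consistent across providers: one control for training, one or more for search and retrieval.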
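The independence of these rules is easy to check locally. The sketch below uses Python's standard urllib.robotparser to evaluate a file that names only ClaudeBot. The example.com URL is a placeholder, and Python's matching is only an approximation of how each vendor's crawler reads the file, so treat the output as a sanity check and not as proof of crawler behavior.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that names only the training crawler.
ROBOTS_TXT = """\
User-agent: ClaudeBot
Disallow: /
"""


def allowed_agents(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Return, for each user-agent token, whether the rules permit fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    ROBOTS_TXT,
    ["ClaudeBot", "Claude-SearchBot", "Claude-User"],
    "https://example.com/articles/post",  # placeholder URL
)
for agent, ok in result.items():
    print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

Only the ClaudeBot group matches ClaudeBot. The other two tokens find no group of their own and no User-agent: * fallback, so nothing restricts them.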

The user-agent strings that actually work in 2026

Two deprecated Anthropic strings still appear in a lot of published robots.txt templates: Claude-Web and anthropic-ai. Neither is active. Sites still using only those strings to block Anthropic crawlers are not blocking anything. ALM Corp's detailed analysis of the ClaudeBot transition documents exactly this problem and how to fix it.

Current verified strings, by platform:

OpenAI: GPTBot, OAI-SearchBot, ChatGPT-User
Anthropic: ClaudeBot, Claude-User, Claude-SearchBot
Perplexity: PerplexityBot, Perplexity-User
Google AI: Google-Extended
Meta: Meta-ExternalAgent, Meta-ExternalFetcher
Apple: Applebot-Extended
ByteDance/TikTok: Bytespider, TikTokSpider
Amazon: Amazonbot
Common crawlers used in AI training: CCBot, diffbot, Omgili, Omgilibot

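If your robots.txt does not name these strings individually, each unlisted bot falls back to your User-agent: * group, or to no restrictions at all if you have none. That is rarely the policy you intended, and it is where the gaps come from.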
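One way to find those gaps is to compare the User-agent lines in your file against the list above. The following is a minimal sketch: the token lists are copied from this article, the function names are made up for illustration, and "not named" only means the bot has no group of its own, so it may still be covered by a User-agent: * group.

```python
import re

# Tokens copied from the list in this article.
CURRENT_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-User", "Claude-SearchBot",
    "PerplexityBot", "Perplexity-User",
    "Google-Extended",
    "Meta-ExternalAgent", "Meta-ExternalFetcher",
    "Applebot-Extended",
    "Bytespider", "TikTokSpider",
    "Amazonbot",
    "CCBot", "diffbot", "Omgili", "Omgilibot",
]
# The two retired Anthropic strings mentioned above.
DEPRECATED_TOKENS = ["Claude-Web", "anthropic-ai"]

USER_AGENT_LINE = re.compile(
    r"^[ \t]*user-agent[ \t]*:[ \t]*([^#\s]+)", re.IGNORECASE | re.MULTILINE
)


def named_agents(robots_txt: str) -> set[str]:
    """Collect every User-agent value in the file, lowercased."""
    return {match.group(1).lower() for match in USER_AGENT_LINE.finditer(robots_txt)}


def audit(robots_txt: str) -> tuple[list[str], list[str]]:
    """Return (current tokens with no group of their own, deprecated tokens still named)."""
    named = named_agents(robots_txt)
    missing = [token for token in CURRENT_TOKENS if token.lower() not in named]
    stale = [token for token in DEPRECATED_TOKENS if token.lower() in named]
    return missing, stale


sample = """\
User-agent: anthropic-ai
Disallow: /

User-agent: GPTBot
Disallow: /
"""
missing, stale = audit(sample)
print("Not named:", ", ".join(missing))
print("Deprecated:", ", ".join(stale))
```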

The configuration that fits most publishers

For the majority of content sites, the right strategy is to allow AI search crawlers while blocking training crawlers. This keeps your content out of model training datasets while preserving visibility in AI-generated search answers. The configuration looks like this:

# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Bytespider
Disallow: /

# Allow AI search crawlers
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

If you have gated content, subscriber-only sections, or checkout flows, you can add path-level rules under any allow block. For example, under User-agent: OAI-SearchBot, adding Disallow: /members/ keeps premium content protected while your public articles stay indexable.

The site-wide allow-all approach is only appropriate if you have no paywalled content and no concern about training data use.

For sites that want to allow training data use with exceptions, the same path-level syntax applies. Under User-agent: GPTBot, list Disallow: /members/ and Disallow: /checkout/, followed by Allow: / for everything else. The result is clean and easy to verify.
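A path-level group like that can be verified before it goes live. The sketch below again uses Python's urllib.robotparser with placeholder example.com URLs and hypothetical page paths. One detail worth knowing: Python's parser applies the first rule that matches, while RFC 9309 tells crawlers to apply the most specific (longest) match, so the Disallow lines are placed first to make both readings agree.

```python
from urllib.robotparser import RobotFileParser

# Disallow lines come before Allow: / so that first-match (Python's parser)
# and longest-match (RFC 9309) evaluation give the same answer.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /members/
Disallow: /checkout/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(agent: str, path: str) -> bool:
    """Check one path for one user-agent token against the rules above."""
    return parser.can_fetch(agent, "https://example.com" + path)


# Hypothetical paths, chosen only to exercise each rule.
for path in ["/blog/launch-notes", "/members/report.pdf", "/checkout/cart"]:
    verdict = "allowed" if is_allowed("GPTBot", path) else "blocked"
    print(f"{path}: {verdict}")
```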

What the blocking data actually shows

Among training crawlers, 69% of sites currently block ClaudeBot and 62% block GPTBot. Those numbers are relatively unsurprising given years of advice to block AI training bots. The more consequential numbers: 49% of sites block OAI-SearchBot and 40% block ChatGPT-User. Nearly half of sites are blocking the crawler that determines whether they appear in ChatGPT search results, in many cases by accident.

From SuggestedByGPT's GEO benchmark data across 100 tracked queries in the last 14 days, our own domain appeared in 10% of AI-generated answers. Semrush appeared in 19%, Profound in 12%, Moz in 12%. The spread across those numbers is not purely a content quality question. Technical accessibility to AI crawlers is part of what separates sites that get cited from sites that don't. You can read more about how AI citation patterns form in our overview of GEO fundamentals.

Sites that block search crawlers are not being conservative about AI. They are opting out of a traffic channel that grew 52% year-over-year.

CDN overrides and the 27% problem

Your robots.txt file might be correct. Your server might still be blocking AI crawlers anyway.

Research cited by ziptie.dev found that approximately 27% of B2B SaaS and ecommerce sites are accidentally blocking major AI crawlers at the CDN layer, not through robots.txt at all. Cloudflare's "Manage your robots.txt" feature, when enabled, changes the robots.txt that crawlers actually receive, so the served file can differ from the one on your origin server. If you wrote careful bot-specific rules in your origin robots.txt and Cloudflare is serving a different version, your rules may not take effect as written.

The fix is straightforward: inside Cloudflare's dashboard, verify that managed robots.txt is disabled and your origin file is the one being served. Then validate the result by fetching yourdomain.com/robots.txt directly and confirming your rules appear. Tools like xSeek's robots checker exist specifically for this validation step. Changes typically take around 24 hours to propagate through OpenAI's systems after you correct the file.
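The fetch-and-compare step can be scripted. This is a rough sketch using only the standard library: fetch_served_robots downloads the file through whatever sits in front of your site, and diff_served compares it with the file you keep at the origin, ignoring comments and blank lines. The function names and the robots-check/0.1 user-agent label are invented for this example.

```python
import difflib
import urllib.request


def fetch_served_robots(domain: str, timeout: float = 10.0) -> str:
    """Download robots.txt as it is served publicly (through any CDN)."""
    request = urllib.request.Request(
        f"https://{domain}/robots.txt",
        headers={"User-Agent": "robots-check/0.1"},  # hypothetical label
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def normalize(robots_txt: str) -> list[str]:
    """Strip comments and blank lines so only directives are compared."""
    lines = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        if line:
            lines.append(line)
    return lines


def diff_served(origin_txt: str, served_txt: str) -> list[str]:
    """Return a unified diff of directives; an empty list means they match."""
    return list(
        difflib.unified_diff(
            normalize(origin_txt),
            normalize(served_txt),
            fromfile="origin",
            tofile="served",
            lineterm="",
        )
    )


# Typical use (performs a network request, so it is left commented out):
# with open("robots.txt", encoding="utf-8") as handle:
#     origin = handle.read()
# for line in diff_served(origin, fetch_served_robots("yourdomain.com")):
#     print(line)
```

Any line in the diff that starts with + is a directive crawlers receive that is not in your origin file, which is the signature of something upstream rewriting it.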

Check your CDN configuration before assuming your robots.txt rules are working.

Perplexity, compliance, and the limits of robots.txt

One honest caveat: robots.txt is a convention, not enforcement. Crawlers comply voluntarily.

Cloudflare published a detailed investigation finding that Perplexity had used undeclared crawlers with generic user-agent strings to access sites that had explicitly blocked PerplexityBot. Perplexity disputed parts of the characterization, and the issue remains contested, but the core finding still matters: a crawler that sends a generic user-agent string is not covered by rules written for a named bot.

For sites where blocking Perplexity is a hard requirement rather than a preference, IP-range blocking at the server or CDN level is more reliable than robots.txt alone. For sites that simply want to opt out of training data collection, the robots.txt approach is adequate for all compliant crawlers, and the major platforms (OpenAI, Anthropic, Google) have documented compliance track records.
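At its core, IP-range blocking is a membership test against a list of CIDR ranges. The sketch below shows that test with Python's ipaddress module. The ranges are reserved documentation addresses used as placeholders, not real crawler ranges; in practice the list would come from whatever ranges a platform publishes or your own logs identify, and the actual blocking would happen in your server or CDN rules.

```python
import ipaddress

# Placeholder ranges from the reserved documentation blocks.
# These are NOT real crawler addresses.
BLOCKED_RANGES = ["192.0.2.0/24", "2001:db8::/32"]


def in_ranges(client_ip: str, cidr_ranges: list[str]) -> bool:
    """Return True if client_ip falls inside any of the given CIDR ranges."""
    ip = ipaddress.ip_address(client_ip)
    for cidr in cidr_ranges:
        network = ipaddress.ip_network(cidr, strict=False)
        # Compare only like with like (IPv4 vs IPv4, IPv6 vs IPv6).
        if ip.version == network.version and ip in network:
            return True
    return False


for candidate in ["192.0.2.17", "198.51.100.5", "2001:db8::1"]:
    action = "block" if in_ranges(candidate, BLOCKED_RANGES) else "pass"
    print(f"{candidate}: {action}")
```

The same check works in the other direction: a request that claims to be a named bot but arrives from outside that platform's published ranges is probably not the bot it claims to be.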

Robots.txt is the right starting point. It is not a complete enforcement layer.

Cloudflare has also introduced Content Signals, a set of machine-readable directives that sort crawler permissions into three categories: search (building a search index), ai-input (feeding content into AI models for real-time answers), and ai-train (training or fine-tuning AI models). This approach goes further than standard robots.txt syntax and reflects where the technical standard is heading.
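To make the three categories concrete, here is a small sketch that reads one signal line into a dictionary. It assumes the signals are written as a single Content-Signal: line of comma-separated name=yes or name=no pairs. That line shape is an assumption made for this example, so check Cloudflare's current documentation before relying on it.

```python
def parse_content_signal(line: str) -> dict[str, bool]:
    """Parse an assumed 'Content-Signal: a=yes, b=no' line into {name: bool}."""
    name, _, value = line.partition(":")
    if name.strip().lower() != "content-signal":
        raise ValueError(f"not a Content-Signal line: {line!r}")
    signals: dict[str, bool] = {}
    for pair in value.split(","):
        key, _, setting = pair.partition("=")
        key, setting = key.strip().lower(), setting.strip().lower()
        # Skip anything that is not a clear yes/no preference.
        if key and setting in ("yes", "no"):
            signals[key] = setting == "yes"
    return signals


# Example policy: visible to search and live answers, excluded from training.
signals = parse_content_signal("Content-Signal: search=yes, ai-input=yes, ai-train=no")
for category, permitted in signals.items():
    print(f"{category}: {'yes' if permitted else 'no'}")
```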

Putting the configuration into practice

Start with your current robots.txt file and check it against the verified user-agent list above. If you see Claude-Web or anthropic-ai as your only Anthropic blocks, add ClaudeBot, Claude-User, and Claude-SearchBot. If you see a blanket Disallow: / under every AI-named bot, separate your training-crawler rules from your search-crawler rules.

Validate at the CDN layer. Fetch your robots.txt directly and confirm what's actually being served. Run a quick check on any Cloudflare or other CDN settings that might be overriding your file.

Decide on a clear policy before writing rules. The decision about whether to allow AI search crawlers, block training crawlers, or both is a business decision, not just a technical one. The configuration follows the policy, not the other way around. Treat the two-tier architecture (search vs. training) as the fundamental unit of your thinking, because that is how every major AI platform has structured its crawlers.
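One way to keep the configuration downstream of the policy is to generate the file from two yes-or-no decisions. The sketch below is illustrative only: build_robots and its parameters are invented for this example, and the bot lists mirror the configuration shown earlier in this guide, so extend them to match the full user-agent list if you need broader coverage.

```python
# Bot lists mirror the example configuration earlier in this guide.
TRAINING_BOTS = [
    "GPTBot", "ClaudeBot", "Google-Extended",
    "CCBot", "Meta-ExternalAgent", "Bytespider",
]
SEARCH_BOTS = [
    "OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot",
    "Claude-User", "PerplexityBot", "Perplexity-User",
]


def group(agent: str, allowed: bool, protected_paths: tuple[str, ...]) -> str:
    """Render one User-agent group for a bot that is allowed or blocked."""
    lines = [f"User-agent: {agent}"]
    if allowed:
        # Protected paths first, then the general allow.
        lines += [f"Disallow: {path}" for path in protected_paths]
        lines.append("Allow: /")
    else:
        lines.append("Disallow: /")
    return "\n".join(lines)


def build_robots(
    allow_training: bool,
    allow_search: bool,
    protected_paths: tuple[str, ...] = (),
) -> str:
    """Turn the two policy decisions into robots.txt text."""
    groups = [group(agent, allow_training, protected_paths) for agent in TRAINING_BOTS]
    groups += [group(agent, allow_search, protected_paths) for agent in SEARCH_BOTS]
    return "\n\n".join(groups) + "\n"


robots_txt = build_robots(
    allow_training=False,
    allow_search=True,
    protected_paths=("/members/",),
)
print(robots_txt)
```

Changing the policy then means changing one argument and regenerating the file, not hand-editing a dozen groups.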

The sites gaining AI search visibility in 2026 are the ones that made deliberate choices about which bots to allow, wrote their robots.txt rules for AI crawlers to reflect those choices accurately, and verified that their infrastructure was actually serving the right file. That is the entire playbook.

If you want to see how your site currently appears across AI search platforms and where your robots.txt configuration may be costing you citations, SuggestedByGPT tracks AI mention data across the major platforms and can show you exactly where you stand. Start with a free account at https://suggestedbygpt.com/start and get your baseline before making further changes.