robots.txt for AI: which bots to allow and block
A complete configuration guide for managing AI crawlers. Learn the difference between training bots and search bots, and choose the right access strategy.
Training bots vs search bots: the distinction that matters
AI companies now use separate crawlers for different purposes, and understanding this distinction changes your robots.txt strategy.
Training bots collect web content to train AI models. Your content becomes part of the model's knowledge for future versions. GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended (Google), and CCBot (Common Crawl) are training bots. Blocking them prevents your content from entering training datasets.
Search bots crawl web content for real-time retrieval when users ask questions. ChatGPT-User, OAI-SearchBot, Claude-SearchBot, and PerplexityBot fetch your pages in real time to answer current queries. Blocking them prevents your content from appearing in AI-generated answers.
The strategic middle ground: many sites block training bots to protect intellectual property while allowing search bots to maintain citation visibility. This approach lets your content appear in AI answers today without contributing to future training datasets. Cloudflare's Q1 2026 analysis found that 69% of sites block ClaudeBot (training) but fewer block the search-specific crawlers.
Every AI crawler you need to know about
Here is the complete roster of AI crawlers as of April 2026, organized by company.
OpenAI operates GPTBot (training), ChatGPT-User (real-time browsing), and OAI-SearchBot (search features). Allow at minimum ChatGPT-User and OAI-SearchBot for citation visibility.
Anthropic operates ClaudeBot (training) and Claude-SearchBot (real-time search). Blocking ClaudeBot does not affect Claude-SearchBot. Allow Claude-SearchBot for real-time citations.
Google operates Google-Extended (AI features including AI Overviews and Gemini). Blocking it prevents AI feature appearance while keeping traditional Google ranking intact.
Perplexity operates PerplexityBot. It is primarily a real-time retrieval bot since Perplexity answers every query with live web searches. Blocking means zero Perplexity visibility.
Apple operates Applebot-Extended for Apple Intelligence features across iPhone, iPad, and Mac. Separate from the standard Applebot used for Siri web results.
Meta operates FacebookBot for Meta AI features across Facebook, Instagram, and WhatsApp.
Common Crawl operates CCBot. Its open web dataset feeds into dozens of open-source models including DeepSeek, Llama, and Mistral.
ByteDance operates Bytespider for its AI products.
Amazon operates Amazonbot for Alexa and Amazon AI features.
Strategy 1: Maximum visibility (recommended for most brands)
Allow all AI crawlers to maximize your citation surface across every platform. This is the recommended approach for brands whose content supports a product or service (SaaS companies, e-commerce, professional services, agencies). Your content is a marketing asset. Wider distribution drives more visibility and traffic.
# Maximum AI visibility - allow all bots
User-agent: GPTBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: anthropic-ai
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Applebot-Extended
Allow: /
User-agent: FacebookBot
Allow: /
User-agent: CCBot
Allow: /
User-agent: Bytespider
Allow: /
User-agent: Amazonbot
Allow: /
# Optional crawl rate limiting (applies to the Amazonbot group above; repeat inside each group as needed)
Crawl-delay: 2

Strategy 2: Citation visibility without training data contribution
Allow real-time search bots while blocking training bots. This approach works for publishers, media companies, and content creators who want their work cited in AI answers but do not want it used to train future model versions.
# Allow real-time AI search bots
User-agent: ChatGPT-User
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Applebot-Extended
Allow: /
# Block training bots
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Bytespider
Disallow: /

Strategy 3: Selective platform access
Allow only the AI platforms where your audience is active. This approach suits brands with a clear picture of where their customers use AI search.
# Example: B2B brand focused on major Western AI platforms
User-agent: GPTBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: PerplexityBot
Allow: /
# Block platforms outside your target market
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Disallow: /

Common mistakes that break AI visibility
Mistake 1: Using a blanket "User-agent: * / Disallow: /" rule without exceptions. This blocks every AI crawler along with everything else. If you use a wildcard block, add explicit Allow rules for each AI crawler you want.
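If you do run a blanket block, the exceptions look like this (a sketch; substitute the crawlers you actually want to admit). Under the robots exclusion protocol, a crawler follows the most specific user-agent group that matches it, so a named group overrides the wildcard group.

# Block everything by default
User-agent: *
Disallow: /

# Explicitly allow specific AI search bots
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /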
Mistake 2: Blocking bots at the infrastructure level while allowing them in robots.txt. Cloudflare Bot Fight Mode, Sucuri WAF, and similar security tools block bots regardless of robots.txt. Check your CDN and WAF settings. A robots.txt that says "Allow" means nothing if your firewall rejects the request.
Mistake 3: Forgetting to test after changes. After updating robots.txt, verify that each AI crawler can actually access your pages. Use Google's robots.txt tester for Google-Extended. Check server logs for crawler visits within a week. BrandCited's audit checks crawler access across all platforms.
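One way to self-test outside Google's tool is Python's built-in urllib.robotparser. A minimal sketch, assuming an illustrative rules string and URL (not a real site's file):

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: block the training bot, allow the search bot
rules = """
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)  # parse rules directly instead of fetching a live URL

for agent in ("GPTBot", "OAI-SearchBot", "ChatGPT-User"):
    allowed = parser.can_fetch(agent, "https://example.com/guide")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Crawlers with no matching group (ChatGPT-User here, since there is no `User-agent: *` group) default to allowed, which is exactly the kind of subtlety a quick script surfaces before your server logs do.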
Mistake 4: Not setting crawl-delay. Many AI crawlers honor the crawl-delay directive, though support varies and some crawlers ignore it entirely. Without one, aggressive crawling during peak hours can degrade your server performance. A crawl-delay of 2-5 seconds curbs server load without meaningfully delaying indexing. Note that crawl-delay applies per user-agent group, so place it inside each group it should govern.
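Because crawl-delay belongs to a user-agent group rather than the file as a whole, a correctly scoped rule looks like this (a sketch; the delay value is an example):

User-agent: PerplexityBot
Allow: /
Crawl-delay: 3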
Mistake 5: Blocking the wrong bot. Blocking ClaudeBot (training) when you meant to allow Claude-SearchBot (real-time search), or vice versa. Double-check user-agent strings. One character difference can block the wrong crawler.
Verification and monitoring
After configuring your robots.txt, verify access through three methods.
First, test the file directly. Visit yourdomain.com/robots.txt and confirm the rules match your intent. Check for syntax errors. A missing newline between rules or a typo in a user-agent string can invalidate the directive.
Second, check server logs. Look for HTTP 200 responses to requests from AI crawler user-agents. If you see no activity within two weeks of allowing a crawler, something beyond robots.txt is blocking it.
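A quick way to spot crawler visits is to scan your access log for known AI user-agent strings and their response codes. A minimal sketch over combined-log-format lines; the sample log lines and the bot list here are illustrative, not exhaustive:

```python
import re

# Hypothetical access-log lines in combined log format
log_lines = [
    '1.2.3.4 - - [10/Apr/2026:12:00:01 +0000] "GET /guide HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0; compatible; GPTBot/1.2; +https://openai.com/gptbot"',
    '5.6.7.8 - - [10/Apr/2026:12:00:05 +0000] "GET /guide HTTP/1.1" 403 512 "-" '
    '"Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com)"',
    '9.9.9.9 - - [10/Apr/2026:12:00:09 +0000] "GET / HTTP/1.1" 200 2048 "-" '
    '"Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"',
]

AI_BOTS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot",
           "Claude-SearchBot", "PerplexityBot", "Bytespider", "CCBot"]

status_re = re.compile(r'" (\d{3}) ')  # status code follows the quoted request

hits = {}
for line in log_lines:
    for bot in AI_BOTS:
        if bot.lower() in line.lower():
            status = status_re.search(line).group(1)
            hits.setdefault(bot, []).append(status)

for bot, statuses in hits.items():
    # Non-200 statuses suggest blocking beyond robots.txt (WAF, CDN rules)
    print(bot, statuses)
```

Repeated 403s for a crawler you allow in robots.txt point straight at infrastructure-level blocking.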
Third, run a BrandCited audit. The scan tests crawler access for all seven tracked AI platforms and flags any blocked bots. It also detects infrastructure-level blocking that robots.txt cannot control.
Review your robots.txt quarterly. New AI crawlers emerge regularly, and existing ones update their user-agent strings. The AI crawler landscape in Q4 2026 will include bots that do not exist in Q1 2026. Stay current or lose visibility on emerging platforms.
Frequently asked questions
Should I block AI training bots?
It depends on your content model. If content is your product (news, research, premium analysis), blocking training bots while allowing search bots is a reasonable middle ground. If content supports a product or service, maximum visibility through allowing all bots typically wins.
Does blocking GPTBot affect my ChatGPT visibility?
Partially. Blocking GPTBot prevents your content from entering future training data. But allowing ChatGPT-User and OAI-SearchBot still lets ChatGPT cite your content through real-time web browsing.
Can AI crawlers ignore robots.txt?
Reputable AI companies (OpenAI, Anthropic, Google, Perplexity) respect robots.txt. There is no industry-wide legal requirement to obey robots.txt, but major AI companies have committed to following it as part of responsible AI practices.
How often do new AI crawlers appear?
New crawlers emerge every few months as new AI companies launch and existing companies create purpose-specific bots. Review the current crawler list quarterly and update your robots.txt accordingly.