AI crawlers: the complete reference
Every known AI bot, its user-agent string, crawling behavior, and whether you should allow or block it. Plus robots.txt templates.
The AI crawlers you need to know#
AI companies send crawlers to index your website content. These crawlers feed data into the AI models that generate responses for users. If you block a crawler, that AI engine can't cite your content.
This reference covers every known AI crawler as of 2026. Each entry includes the user-agent string, the company behind it, what it's used for, and whether you should allow it.
The rule of thumb: allow crawlers for AI platforms where you want citations. Block crawlers for platforms you don't want to index your content (rare, but some publishers choose this for copyright reasons).
ChatGPT crawlers (OpenAI)#
GPTBot User-agent: GPTBot Company: OpenAI Purpose: Crawls web content used to train OpenAI's models Recommendation: Allow. ChatGPT has the largest user base of any AI assistant.
ChatGPT-User User-agent: ChatGPT-User Company: OpenAI Purpose: Real-time web browsing when ChatGPT users ask for current information Recommendation: Allow. This crawler fetches your pages when users query ChatGPT about your brand or industry.
OAI-SearchBot User-agent: OAI-SearchBot Company: OpenAI Purpose: Powers OpenAI's search features and SearchGPT Recommendation: Allow. Growing search product that drives direct traffic.
Claude crawlers (Anthropic)#
ClaudeBot User-agent: ClaudeBot Company: Anthropic Purpose: Web crawling for Claude's knowledge base and training data Recommendation: Allow. Claude is the second-largest AI assistant and growing fast.
anthropic-ai User-agent: anthropic-ai Company: Anthropic Purpose: General web indexing for Anthropic's AI products Recommendation: Allow. Covers Claude and future Anthropic products.
Google AI crawlers#
Google-Extended User-agent: Google-Extended Company: Google Purpose: Controls whether content Google crawls can be used to train and ground Gemini models Recommendation: Allow if you want your content available to Gemini.
Note: Google-Extended is not a separate crawler. It's a robots.txt token that Googlebot honors, so blocking it doesn't affect your traditional Google search ranking. AI Overviews are a Search feature governed by your normal Googlebot rules, not by Google-Extended. Most sites should allow both.
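To illustrate the separation, a publisher who wants normal search indexing but opts out of Gemini training could use rules like these (shown only as an example; most sites should allow both):

```
# Keep normal Google Search indexing
User-agent: Googlebot
Allow: /

# Opt out of Gemini model training (not recommended for most sites)
User-agent: Google-Extended
Disallow: /
```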
Perplexity crawlers#
PerplexityBot User-agent: PerplexityBot Company: Perplexity AI Purpose: Real-time web crawling for Perplexity's answer engine Recommendation: Allow. Perplexity is the most citation-friendly AI platform. It attributes sources with links.
Because Perplexity always cites its sources with clickable links, getting cited there drives direct referral traffic to your site. Blocking PerplexityBot leaves that traffic on the table.
Other AI crawlers#
Bytespider User-agent: Bytespider Company: ByteDance Purpose: Web crawling for ByteDance's AI products Recommendation: Allow if you serve global audiences. Consider blocking if ByteDance products aren't relevant to your market.
CCBot User-agent: CCBot Company: Common Crawl Purpose: Open web crawling used by many AI training pipelines Recommendation: Allow. Common Crawl data feeds into open-source models (Llama, DeepSeek, Mistral). Blocking it reduces your presence in open-source AI.
Diffbot User-agent: Diffbot Company: Diffbot Purpose: Web scraping and knowledge graph construction used by multiple AI products Recommendation: Allow. Diffbot's knowledge graph feeds into several AI platforms.
Amazonbot User-agent: Amazonbot Company: Amazon Purpose: Web crawling for Alexa and Amazon AI products Recommendation: Allow for consumer-facing brands. Amazon's AI assistant uses this data.
FacebookBot User-agent: FacebookBot Company: Meta Purpose: Web crawling for Meta AI features Recommendation: Allow. Meta AI is integrated into Facebook, Instagram, and WhatsApp.
Applebot-Extended User-agent: Applebot-Extended Company: Apple Purpose: Controls whether content crawled by Applebot can be used to train Apple's AI models Recommendation: Allow. Apple Intelligence ships on recent iPhones, iPads, and Macs.
cohere-ai User-agent: cohere-ai Company: Cohere Purpose: Web crawling for Cohere's enterprise AI models Recommendation: Allow if you serve enterprise B2B audiences.
robots.txt template: allow all AI crawlers#
Use this template if you want maximum AI visibility across all platforms. This is the recommended setup for most sites.
# AI Crawlers - Allow All
User-agent: GPTBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: anthropic-ai
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Bytespider
Allow: /
User-agent: CCBot
Allow: /
User-agent: Applebot-Extended
Allow: /
User-agent: FacebookBot
Allow: /
User-agent: cohere-ai
Allow: /
# Reference llms.txt
# LLMs-txt: https://yourdomain.com/llms.txt
robots.txt template: selective access#
Use this template if you want to allow major AI platforms but block others. This gives you citations on the platforms that matter most while limiting data access to smaller players.
# Allow major AI crawlers
User-agent: GPTBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Applebot-Extended
Allow: /
# Block others
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Disallow: /
How to verify crawler access#
After configuring your robots.txt, verify that AI crawlers can access your content.
Step 1: Check your robots.txt is live at yourdomain.com/robots.txt. Confirm the rules match your intent.
Step 2: Use Google's robots.txt tester (in Google Search Console) to simulate different user agents. Test GPTBot, ClaudeBot, and PerplexityBot specifically.
Step 3: Run a BrandCited scan. The site audit checks crawler access for all 9 AI platforms and flags any blocked bots.
Step 4: Monitor your server logs for AI crawler activity. Look for user-agent strings matching the bots listed above. If you don't see crawler hits within a week of allowing access, check for other blocking mechanisms (WAF rules, CDN settings, rate limiting).
Common blockers to check: Cloudflare bot protection, CDN-level bot blocking rules, server-side WAF rules, and IP-based rate limiting. These can block AI crawlers even when robots.txt allows them.
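As a quick local check, Python's standard-library urllib.robotparser answers the same allow-or-block question a crawler asks before fetching a page. The sketch below parses inline rules (mirroring the selective-access template) so it runs offline; the domain and paths are placeholders. To test your live file, call set_url() with your robots.txt URL and read() instead of parse().

```python
from urllib.robotparser import RobotFileParser

# Inline rules mirroring the "selective access" template, so this
# check runs without a network fetch. To test a live site, use
# parser.set_url("https://yourdomain.com/robots.txt") and
# parser.read() instead of parse().
RULES = """\
User-agent: GPTBot
Allow: /

User-agent: Bytespider
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(RULES)

# Ask the same question each crawler asks before fetching a page.
# PerplexityBot matches no group here and there is no "User-agent: *",
# so it falls through to the default: allowed.
for agent in ("GPTBot", "Bytespider", "PerplexityBot"):
    verdict = "allowed" if parser.can_fetch(agent, "https://yourdomain.com/guide") else "blocked"
    print(f"{agent}: {verdict}")
```

Note that this only validates robots.txt logic; it can't detect the WAF, CDN, or rate-limiting blocks described above, which is why checking server logs still matters.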
Frequently asked questions
Will allowing AI crawlers hurt my site performance?
Most AI crawlers fetch pages at low rates, and some honor the nonstandard crawl-delay directive, though support varies by bot. If crawl volume causes problems, try a crawl-delay directive or server-side rate limiting before blocking the bot entirely.
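As a sketch, a crawl-delay rule looks like this (the directive is nonstandard and not honored by every crawler, so treat the value, in seconds, as a hint rather than a guarantee):

```
User-agent: GPTBot
Crawl-delay: 10
Allow: /
```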
Can I allow crawling but block training?
Some AI companies offer separate opt-out for training vs. real-time crawling. Check each company's documentation. The robots.txt rules above control crawl access. Training opt-outs are typically separate processes.
What happens if I block all AI crawlers?
Your brand won't appear in AI-generated responses. Users asking ChatGPT, Claude, Perplexity, or Gemini about your industry will see competitors cited instead. For most brands, this is a significant missed opportunity.
How often should I update my robots.txt?
Review it quarterly. New AI crawlers emerge regularly. Check this reference guide for updates and add new bots as they appear.