AI crawler user agents list for SEO teams in 2026
AI crawler user agents are the named bot tokens and full user-agent strings that let SEO teams separate search visibility crawlers from model-training, user-triggered fetch, and abusive bot traffic. The most important 2026 insight is that OpenAI, Anthropic, Perplexity, Google, and Apple now expose separate controls, so one blanket block can remove citations while still failing to solve server-load risk.
This guide covers how to decide which bots to allow, block, monitor, and verify in robots.txt, WAF rules, and server logs.

AI crawler user agents should be audited before you change robots.txt rules for AI bots, Cloudflare settings, or server-level rate limits. The modern AI crawler list is no longer a single blocklist: OpenAI, Anthropic, Perplexity, Google, Apple, Meta, Common Crawl, and other operators use different identifiers for training, AI-search indexing, user-triggered retrieval, link previews, and crawler-like product checks. If you treat them all as one category, you can accidentally block the exact search surface you wanted to appear in while leaving noisy or spoofed traffic untouched.
Search Roost already has platform-specific guides for Claude-SearchBot robots.txt policy, PerplexityBot access decisions, llms.txt governance, and robots.txt SEO fundamentals. This page connects those pieces into one operating reference: which AI bot user agents matter first, what each one controls, when to allow or block it, and how to prove the request was legitimate before you make it part of an SEO policy.
What are AI crawler user agents and why do they matter for SEO?
AI crawler user agents are identifiers sent in the HTTP `User-Agent` header when an automated system requests a page. In robots.txt, the shorter user-agent token is the control surface: `User-agent: GPTBot`, `User-agent: OAI-SearchBot`, `User-agent: Claude-SearchBot`, and so on. In server logs, WAF rules, and bot analytics, you often see the longer full string, which may include browser-like text plus a bot token and documentation URL.
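To make the token-versus-full-string distinction concrete, here is a minimal Python sketch that looks for a known token inside a logged user-agent string. The token tuple is an assumption drawn from the agents named in this article, the function name `find_bot_token` is hypothetical, and the sample strings are illustrative rather than copied from vendor documentation.

```python
from typing import Optional

# Tokens named in this article that can appear in request headers.
# Assumption: robots.txt control tokens such as Google-Extended are left out
# because they act as policy switches; confirm this against vendor docs.
KNOWN_TOKENS = (
    "OAI-SearchBot", "GPTBot", "ChatGPT-User",
    "Claude-SearchBot", "ClaudeBot", "Claude-User",
    "PerplexityBot", "Perplexity-User",
    "Applebot", "CCBot", "meta-externalagent", "Bytespider",
)


def find_bot_token(user_agent: str) -> Optional[str]:
    """Return the known token contained in a full user-agent string, if any."""
    haystack = user_agent.lower()
    # Check longer tokens first so a longer token is never reported as a
    # shorter one it happens to contain (this matters once the list grows).
    for token in sorted(KNOWN_TOKENS, key=len, reverse=True):
        if token.lower() in haystack:
            return token
    return None


# Illustrative strings only.
print(find_bot_token("Mozilla/5.0 (compatible; GPTBot/1.0; +https://example.com/bot)"))
print(find_bot_token("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Safari/537.36"))
```

A match here only says what the request claims to be. It does not prove who sent it.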
For SEO, the distinction matters because user-agent policy now changes where content can be discovered, quoted, or excluded. OpenAI documents OAI-SearchBot as the crawler used to surface websites in ChatGPT search features, while GPTBot is tied to training use. Anthropic documents ClaudeBot, Claude-User, and Claude-SearchBot as separate choices. Perplexity documents PerplexityBot and Perplexity-User separately. Google and Apple both expose training-related product tokens that do not behave the same way as their normal search crawlers.
| Category | Common SEO Goal | Policy Mistake |
|---|---|---|
| Search and citation crawlers | Keep pages eligible for AI answers and cited results | Blocking them while trying to improve AI-search visibility |
| Model-training crawlers or controls | Express training-data preferences without killing search | Assuming every training opt-out also preserves retrieval |
| User-triggered fetchers | Let assistants retrieve pages when a user asks | Treating user fetches like recurring background crawls |
| Unknown or spoofed agents | Protect capacity, private paths, and expensive endpoints | Trusting a user-agent string without IP or DNS verification |
The right policy starts by naming the use case. A crawler that can send qualified citation traffic should be reviewed differently from a crawler that only collects training data, and both should be reviewed differently from a random user-agent string hammering faceted URLs. The name alone is not enough; the role is the decision point.
Which AI crawler user agents should most sites audit first?
Start with the agents that can materially affect search visibility or server load. For most public marketing, ecommerce, SaaS, and documentation sites, that means OpenAI, Anthropic, Perplexity, Google, Apple, Common Crawl, Meta, and high-volume commercial crawlers such as Bytespider. You do not need to memorize every obscure crawler on day one. You do need a maintained table for the agents that show up in logs and the agents your leadership expects to influence AI visibility.
| User-agent token | Operator | Main Role | SEO Default |
|---|---|---|---|
| `OAI-SearchBot` | OpenAI | ChatGPT search surfacing | Usually allow for visibility |
| `GPTBot` | OpenAI | Foundation-model training crawl | Policy decision, not a ranking lever |
| `ChatGPT-User` | OpenAI | User-triggered ChatGPT actions and page visits | Verify separately from search |
| `Claude-SearchBot` | Anthropic | Claude search optimization and result quality | Usually allow for Claude visibility |
| `ClaudeBot` | Anthropic | Training-related web collection | Separate training opt-in or opt-out |
| `Claude-User` | Anthropic | User-directed retrieval inside Claude | Review if Claude users cite your URLs |
| `PerplexityBot` | Perplexity | Perplexity search results and links | Usually allow for Perplexity visibility |
| `Perplexity-User` | Perplexity | User-requested answer support | Verify with IP ranges and logs |
| `Google-Extended` | Google | Product token for Gemini training and grounding choices | Does not control Google Search inclusion |
| `Applebot` | Apple | Apple search, Spotlight, Siri, and Safari surfaces | Usually allow if Apple discovery matters |
| `Applebot-Extended` | Apple | Training-use control for Apple foundation models | Separate from Applebot discovery |
| `CCBot` | Common Crawl | Open web crawl dataset used by many downstream systems | Policy decision based on data-use tolerance |
| `meta-externalagent` | Meta | Meta crawler traffic seen in AI and external fetch contexts | Monitor closely and avoid breaking link previews |
| `Bytespider` | ByteDance | High-volume crawl traffic often discussed as AI data collection | Audit load and apply server controls if abusive |
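One way to keep the table above usable in scripts is to encode it as data. The sketch below is an assumption about how a team might store it: the role labels (`search`, `training`, `user_fetch`, `dataset`, `monitor`) are shorthand invented for this example, not vendor terminology, and `tokens_by_role` is a hypothetical helper.

```python
from collections import defaultdict

# token -> (operator, role). Roles paraphrase the "Main Role" column above.
AGENTS = {
    "OAI-SearchBot": ("OpenAI", "search"),
    "GPTBot": ("OpenAI", "training"),
    "ChatGPT-User": ("OpenAI", "user_fetch"),
    "Claude-SearchBot": ("Anthropic", "search"),
    "ClaudeBot": ("Anthropic", "training"),
    "Claude-User": ("Anthropic", "user_fetch"),
    "PerplexityBot": ("Perplexity", "search"),
    "Perplexity-User": ("Perplexity", "user_fetch"),
    "Google-Extended": ("Google", "training"),
    "Applebot": ("Apple", "search"),
    "Applebot-Extended": ("Apple", "training"),
    "CCBot": ("Common Crawl", "dataset"),
    "meta-externalagent": ("Meta", "monitor"),
    "Bytespider": ("ByteDance", "monitor"),
}


def tokens_by_role(agents):
    """Group tokens by role so each role can be reviewed as one decision."""
    grouped = defaultdict(list)
    for token, (_operator, role) in agents.items():
        grouped[role].append(token)
    return dict(grouped)


for role, tokens in tokens_by_role(AGENTS).items():
    print(role, tokens)
```

Grouping by role rather than by vendor keeps the review focused on the decision that matters: what each crawler is used for.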
Separate search visibility from training consent
The highest-value split is search versus training. OpenAI says a site can allow OAI-SearchBot for search results while disallowing GPTBot for training preferences. Anthropic makes a similar separation between Claude-SearchBot and ClaudeBot. Apple documents Applebot-Extended as an additional control over generative training use while Applebot can still support search-style discovery. Treat these as different switches, not aliases.
Do not confuse user fetchers with background crawlers
ChatGPT-User, Claude-User, and Perplexity-User are closer to user-triggered retrieval than recurring background crawling. Their traffic can still matter for server behavior and AI answer quality, but policy language should be precise. If a human asks an assistant to fetch your URL, blocking that fetch can reduce visibility in the moment even though it does not answer the larger training-consent question.
The practical policy is not "allow AI" or "block AI." The practical policy is: allow the crawlers that support discovery you want, block or limit uses you do not want, and verify every high-impact rule in production logs.

Should you allow or block AI crawlers in robots.txt?
The safest answer is selective access. Public sites that want visibility in ChatGPT, Claude, Perplexity, Apple search, and Google AI features should usually keep search and retrieval crawlers open unless there is a legal, licensing, privacy, or server-load reason to restrict them. Training-related crawlers are a separate governance decision that should involve legal, editorial, product, and SEO owners.
A blanket rule that pairs `User-agent: *` with `Disallow: /` is simple, but it is rarely aligned with a public SEO strategy. It can suppress normal search crawlers, AI search fetchers, image discovery, link-preview systems, and tools you actually rely on. On the other hand, a blanket allow can expose expensive parameter paths, staging artifacts, duplicate URL spaces, and low-value pages that increase crawl waste. The right robots.txt policy is narrow enough to be intentional and boring enough to be maintainable.
| Site Goal | Better Default | Watch For |
|---|---|---|
| Earn AI-search citations | Allow OAI-SearchBot, Claude-SearchBot, PerplexityBot, Googlebot, and Applebot where relevant | WAF rules that override robots.txt |
| Opt out of model training | Disallow training-specific tokens where vendors provide them | Accidentally blocking search crawlers at the same time |
| Reduce crawl load | Limit low-value paths and use rate limits for abusive request patterns | Relying on voluntary robots rules for hostile traffic |
| Protect private or paid content | Use authentication, noindex where appropriate, and server-level controls | Assuming robots.txt is an access-control system |
A balanced robots.txt pattern for public SEO pages
A visibility-first site can allow search and retrieval agents while disallowing some training tokens. This example is not a universal template, but it shows the policy shape many content teams need:
```
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /
```
Before shipping any version of that rule, compare it with your existing canonical, noindex, sitemap, and CDN behavior. If your site has international pages, faceted ecommerce URLs, gated documentation, or generated pages, pair this with the checks in our AI-ready technical SEO checklist and canonical tags guide.
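A quick pre-release sanity check is to parse the draft file and assert the outcome you expect for each token. The sketch below uses Python's standard `urllib.robotparser`. The `check_policy` helper, the `EXPECTED` mapping, and the example URL are hypothetical. Python's matching logic is not guaranteed to be identical to any vendor's parser, so treat a pass as a smoke test, not proof.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /
"""

# Intended outcome per token: True means "may fetch".
EXPECTED = {
    "OAI-SearchBot": True, "Claude-SearchBot": True, "PerplexityBot": True,
    "GPTBot": False, "ClaudeBot": False,
    "Google-Extended": False, "Applebot-Extended": False,
}


def check_policy(robots_txt, expected, url="https://www.example.com/blog/post"):
    """Return the tokens whose parsed outcome differs from the intended one."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [token for token, allowed in expected.items()
            if parser.can_fetch(token, url) != allowed]


print(check_policy(ROBOTS_TXT, EXPECTED))  # an empty list means no mismatches
```

Running this in CI against the generated robots.txt catches the case where a framework or CMS update silently changes the file.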
How do robots.txt rules differ from WAF and IP allowlists?
robots.txt communicates preferences to cooperative crawlers. A WAF decides whether a request is allowed, challenged, rate-limited, or blocked at the edge. IP allowlists help prove that a claimed crawler is coming from infrastructure the vendor publishes or can verify. These systems overlap in bot policy, but they do not do the same job.
OpenAI and Perplexity publish IP range endpoints for several crawler and user-fetcher agents. Perplexity also recommends combining user-agent and IP range checks in WAF rules. Apple documents reverse DNS and CIDR-based verification for Applebot. Anthropic notes that IP blocking is not the preferred opt-out path because crawlers must be able to read robots.txt, but it still provides IP verification information for identifying its crawlers. The common pattern is clear: use robots.txt to state policy, then use network-layer verification to avoid trusting a copied string.
| Control Layer | Best Use | Limitation |
|---|---|---|
| robots.txt | Vendor-specific crawl preferences by user-agent token | Voluntary and not a privacy or security barrier |
| WAF rules | Blocking abuse, allowing verified crawlers, protecting expensive paths | Can accidentally override your SEO policy |
| IP range checks | Confirming a crawler is likely operated by the claimed vendor | Requires updates when vendors rotate infrastructure |
| Reverse DNS | Validating search crawlers that document host verification | Not every AI crawler offers a clean DNS verification path |
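For crawlers that document host-based verification, the usual pattern is a forward-confirmed reverse DNS lookup: resolve the IP to a hostname, check the hostname suffix, then resolve the hostname back and confirm it returns the same IP. The sketch below is a minimal version under stated assumptions. The function name and the `.bots.example.com` suffix are hypothetical, so take the real suffix from the vendor's documentation. The default forward lookup is IPv4-only.

```python
import socket


def forward_confirmed_rdns(ip, allowed_suffixes,
                           reverse=socket.gethostbyaddr,
                           forward=socket.gethostbyname_ex):
    """Return True if ip reverse-resolves to an allowed host that maps back to ip.

    The resolver functions are parameters so the check can be tested offline.
    """
    try:
        hostname = reverse(ip)[0].rstrip(".").lower()
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    # Suffixes should start with "." so "evil-bots.example.com" cannot match.
    if not hostname.endswith(tuple(allowed_suffixes)):
        return False
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False
    return ip in addresses


# Offline demonstration with stand-in resolvers and documentation-range IPs.
def fake_reverse(ip):
    return ("crawl-1.bots.example.com", [], [ip])


def fake_forward(hostname):
    return (hostname, [], ["203.0.113.7"])


print(forward_confirmed_rdns("203.0.113.7", [".bots.example.com"],
                             fake_reverse, fake_forward))
```

Skipping the forward step defeats the check, because anyone who controls reverse DNS for their own IP space can point it at a vendor-looking hostname.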
This is where many teams break their own AI search visibility. A Next.js route may generate a permissive robots.txt file, but a CDN or WAF setting can still block GPTBot, OAI-SearchBot, PerplexityBot, or Claude-User before the request ever reaches your origin. That failure will not show up in GA4, because crawlers generally do not execute the client-side JavaScript that analytics tags depend on. You need server logs, edge logs, or crawler-specific monitoring to see it.

How do you verify AI crawler user agents in server logs?
Verification starts with raw log access. GA4, client-side analytics, and most tag-based tools will miss crawler traffic because bots often do not execute JavaScript. Review the access logs or edge logs that include timestamp, IP address, HTTP method, path, status code, user-agent string, bytes sent, and WAF action. Then group requests by token and purpose.
Step 1: match the token, then verify the source
Search for the documented token inside the full user-agent string: `OAI-SearchBot`, `GPTBot`, `Claude-SearchBot`, `PerplexityBot`, and the rest. A token match is only the first pass. For high-value access decisions, compare the request IP with the vendor's current published ranges or DNS verification method. If there is no documented verification path, treat the traffic as lower-trust and manage it by behavior as well as identity.
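The source check can be sketched with Python's standard `ipaddress` module. The ranges below are placeholders from the reserved documentation blocks (RFC 5737 and RFC 3849), not real vendor ranges, and `is_verified` and `PUBLISHED_RANGES` are hypothetical names. In practice the mapping would be refreshed from each vendor's currently published list.

```python
import ipaddress

# Placeholder CIDR blocks reserved for documentation. Replace them with the
# ranges each vendor currently publishes, and refresh them on a schedule.
PUBLISHED_RANGES = {
    "OAI-SearchBot": ["192.0.2.0/24"],
    "PerplexityBot": ["198.51.100.0/24", "2001:db8::/32"],
}


def is_verified(token, client_ip, published_ranges):
    """Return True when client_ip falls inside a range listed for the token."""
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # malformed IP value in the log line
    for cidr in published_ranges.get(token, []):
        network = ipaddress.ip_network(cidr, strict=False)
        if address.version == network.version and address in network:
            return True
    return False  # no listed range matched, or no ranges are known


print(is_verified("OAI-SearchBot", "192.0.2.44", PUBLISHED_RANGES))
print(is_verified("OAI-SearchBot", "203.0.113.5", PUBLISHED_RANGES))
```

A token with no entry in the mapping returns `False`, which matches the lower-trust treatment described above for agents without a documented verification path.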
Step 2: inspect paths and response codes
The paths matter more than aggregate hits. A verified search crawler requesting canonical articles, product pages, and documentation can be useful. The same crawler hammering parameter URLs, old redirects, or internal search pages may need path restrictions. Look for 403, 404, 429, 500, and redirect chains before concluding that a crawler is helpful or harmful.
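A minimal sketch of that path and status review, assuming the log lines have already been parsed into dictionaries with `user_agent`, `path`, and `status` keys. The `summarize` helper and the sample records are hypothetical and only illustrate the grouping.

```python
from collections import Counter, defaultdict

# Hypothetical parsed log records; real ones come from access or edge logs.
SAMPLE = [
    {"user_agent": "Mozilla/5.0 (compatible; GPTBot/1.0)", "path": "/blog/post", "status": 200},
    {"user_agent": "Mozilla/5.0 (compatible; GPTBot/1.0)", "path": "/search?q=a", "status": 403},
    {"user_agent": "Mozilla/5.0 (compatible; PerplexityBot/1.0)", "path": "/docs", "status": 200},
    {"user_agent": "Mozilla/5.0 (Windows NT 10.0) Safari/537.36", "path": "/", "status": 200},
]


def summarize(records, tokens):
    """Count status codes and parameterized-path hits per bot token."""
    statuses = defaultdict(Counter)
    param_hits = Counter()
    for record in records:
        ua = record["user_agent"].lower()
        token = next((t for t in tokens if t.lower() in ua), None)
        if token is None:
            continue  # not one of the tokens under review
        statuses[token][record["status"]] += 1
        if "?" in record["path"]:
            param_hits[token] += 1  # query-string URLs often signal crawl waste
    return statuses, param_hits


statuses, param_hits = summarize(SAMPLE, ["GPTBot", "PerplexityBot"])
for token, counts in statuses.items():
    print(token, dict(counts), "parameter URLs:", param_hits[token])
```

A high share of 403 or 429 responses for a crawler you intended to allow usually points to an edge rule overriding the robots.txt policy.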
```
# Useful log review fields
timestamp
client_ip
host
method
path
status
bytes_sent
user_agent
waf_action
cache_status
referrer
```
Step 3: keep an AI crawler change log
Every policy edit should have an owner, date, reason, affected user agents, affected paths, and expected outcome. Without a change log, the team will rediscover the same problem whenever a CDN, CMS, firewall, or framework update changes generated robots.txt behavior. The workflow should resemble the release logging in our log file analysis for crawl budget guide and the measurement discipline in our SEO dashboards and KPI model.
What is the safest AI crawler policy for common site types?
The same AI crawler user agents can imply different policies by site type. A B2B SaaS documentation hub may want maximum retrieval because product answers in ChatGPT and Claude can support demand creation. A paid publisher may care more about licensing control. An ecommerce site may want product discovery but stricter protection around pricing APIs, faceted navigation, and cart endpoints. A regulated portal may need authentication and noindex controls before any bot policy matters.
| Site Type | Better Policy Shape | Internal Link to Review |
|---|---|---|
| B2B SaaS and docs | Allow search and user retrieval; decide training controls separately | writing for AI answers |
| Ecommerce | Allow product discovery crawlers; restrict carts, filters, and internal search | product schema for ecommerce |
| Editorial publishers | Balance citation upside with training, licensing, and paywall policy | adding citations to content |
| Regulated or gated content | Use authentication, noindex, and WAF controls before relying on robots.txt | YMYL AI content standards |
| Large programmatic sites | Allow canonical inventory; block crawl traps, duplicates, and infinite URL spaces | pagination and infinite scroll SEO |
If leadership asks for a binary recommendation, reframe the question. The real decision is not whether AI crawlers are good or bad. The real decision is which crawlers support the business outcomes you want, which crawlers create unacceptable data-use or capacity risk, and which requests cannot be trusted from identity alone.
How often should you update an AI crawler list?
Review the list monthly, and review it immediately after major vendor documentation changes, CDN bot-management changes, product launches, or unexplained log spikes. The May 2026 crawler landscape is changing too quickly for a set-and-forget robots.txt file. Search crawlers, training controls, user fetchers, and agent products are being documented more explicitly, but that also means old policies become stale faster.
Keep a source-of-truth table
Your table should include user-agent token, full user-agent string if published, operator, purpose, official documentation URL, IP verification method, current policy, owner, and last review date. That table belongs in the same governance system as your sitemap, schema templates, redirects, and robots.txt route handler.
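If the table lives in code or config, a small record type keeps the fields consistent and makes stale entries easy to find. This is a sketch under assumptions: the `CrawlerRecord` class, the `overdue` helper, the example values, and the 31-day threshold (standing in for the monthly review cadence) are all illustrative.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class CrawlerRecord:
    token: str
    operator: str
    purpose: str
    docs_url: str            # official documentation URL
    verification: str        # for example "ip_ranges", "reverse_dns", or "none"
    policy: str              # for example "allow", "disallow", or "monitor"
    owner: str
    last_review: date
    full_user_agent: str = ""  # only if the vendor publishes one


def overdue(records, today, max_age=timedelta(days=31)):
    """Return tokens whose last review is older than the allowed age."""
    return [r.token for r in records if today - r.last_review > max_age]


# Example values are placeholders, including the documentation URLs.
TABLE = [
    CrawlerRecord("GPTBot", "OpenAI", "training", "https://example.com/docs",
                  "ip_ranges", "disallow", "seo-team", date(2026, 4, 1)),
    CrawlerRecord("PerplexityBot", "Perplexity", "search", "https://example.com/docs",
                  "ip_ranges", "allow", "seo-team", date(2026, 5, 20)),
]

print(overdue(TABLE, today=date(2026, 5, 26)))
```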
Monitor for new agents before writing new blocks
When a new token appears in logs, avoid rushing to a sitewide block. First inspect request volume, paths, response codes, and source network. Then search for official documentation. If no reputable documentation exists and the behavior is expensive or suspicious, treat it as an infrastructure problem rather than an SEO opportunity.
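The first step of that triage, spotting bot-like agents that are not yet in your table, can be sketched as follows. The `BOT_HINT` pattern is a rough heuristic chosen for this example and will miss agents that do not self-identify. `unknown_agents` and the sample records are hypothetical.

```python
import re
from collections import Counter

# Rough heuristic for self-identified automation. It is not exhaustive.
BOT_HINT = re.compile(r"bot|crawler|spider|fetch", re.IGNORECASE)


def unknown_agents(records, known_tokens, min_hits=1):
    """Return (user_agent, hits) for bot-like agents not matching a known token."""
    counts = Counter()
    for record in records:
        ua = record["user_agent"]
        lowered = ua.lower()
        if any(token.lower() in lowered for token in known_tokens):
            continue  # already covered by the source-of-truth table
        if BOT_HINT.search(ua):
            counts[ua] += 1
    return [(ua, hits) for ua, hits in counts.most_common() if hits >= min_hits]


# Hypothetical records; "NewThingBot" is an invented name.
RECORDS = [
    {"user_agent": "NewThingBot/0.1"},
    {"user_agent": "NewThingBot/0.1"},
    {"user_agent": "Mozilla/5.0 (compatible; GPTBot/1.0)"},
    {"user_agent": "Mozilla/5.0 (Windows NT 10.0) Safari/537.36"},
]

print(unknown_agents(RECORDS, ["GPTBot"]))
```

The output is a review queue, not a blocklist: each entry still needs the volume, path, response-code, and documentation checks described above.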
Use content strategy to decide crawler value
Crawler access only matters when the content is worth retrieving. If pages are thin, duplicative, poorly cited, or technically unstable, letting every search crawler in will not create durable AI visibility. Pair this crawler reference with the answer engine optimization checklist and structured data playbook so access, evidence, and page clarity improve together.
Sources
- OpenAI Platform Docs: Overview of OpenAI crawlers
- Anthropic Help Center: Does Anthropic crawl data from the web, and how can site owners block the crawler?
- Perplexity Docs: Perplexity Crawlers
- Google Crawling Infrastructure: Google-Extended
- Apple Support: About Applebot
- Common Crawl: CCBot
Updated May 26, 2026.