
AI crawler user agents list for SEO teams in 2026

AI crawler user agents are the named bot tokens and full user-agent strings that let SEO teams separate search-visibility crawlers from model-training crawlers, user-triggered fetchers, and abusive bot traffic. The most important 2026 insight is that OpenAI, Anthropic, Perplexity, Google, and Apple now expose separate controls, so a single blanket block can remove your citations while still failing to reduce server-load risk.

This guide to AI crawler user agents covers which bots to allow, block, monitor, and verify in robots.txt, WAF rules, and server logs.

Treat AI crawler policy like infrastructure policy: identify the requester, decide the allowed use case, then verify behavior in logs. Image: Michael_Hiraeth via Wikimedia Commons, CC0.

AI crawler user agents should be audited before you change robots.txt rules for AI bots, Cloudflare settings, or server-level rate limits. The modern AI crawler list is no longer a single blocklist: OpenAI, Anthropic, Perplexity, Google, Apple, Meta, Common Crawl, and other operators use different identifiers for training, AI-search indexing, user-triggered retrieval, link previews, and crawler-like product checks. If you treat them all as one category, you can accidentally block the exact search surface you wanted to appear in while leaving noisy or spoofed traffic untouched.

Search Roost already has platform-specific guides for Claude-SearchBot robots.txt policy, PerplexityBot access decisions, llms.txt governance, and robots.txt SEO fundamentals. This page connects those pieces into one operating reference: which AI bot user agents matter first, what each one controls, when to allow or block it, and how to prove the request was legitimate before you make it part of an SEO policy.

What are AI crawler user agents and why do they matter for SEO?

AI crawler user agents are identifiers sent in the HTTP `User-Agent` header when an automated system requests a page. In robots.txt, the shorter user-agent token is the control surface: `User-agent: GPTBot`, `User-agent: OAI-SearchBot`, `User-agent: Claude-SearchBot`, and so on. In server logs, WAF rules, and bot analytics, you often see the longer full string, which may include browser-like text plus a bot token and documentation URL.

For SEO, the distinction matters because user-agent policy now changes where content can be discovered, quoted, or excluded. OpenAI documents OAI-SearchBot as the crawler used to surface websites in ChatGPT search features, while GPTBot is tied to training use. Anthropic documents ClaudeBot, Claude-User, and Claude-SearchBot as separate choices. Perplexity documents PerplexityBot and Perplexity-User separately. Google and Apple both expose training-related product tokens that do not behave the same way as their normal search crawlers.

| Category | Common SEO Goal | Policy Mistake |
| --- | --- | --- |
| Search and citation crawlers | Keep pages eligible for AI answers and cited results | Blocking them while trying to improve AI-search visibility |
| Model-training crawlers or controls | Express training-data preferences without killing search | Assuming every training opt-out also preserves retrieval |
| User-triggered fetchers | Let assistants retrieve pages when a user asks | Treating user fetches like recurring background crawls |
| Unknown or spoofed agents | Protect capacity, private paths, and expensive endpoints | Trusting a user-agent string without IP or DNS verification |

The right policy starts by naming the use case. A crawler that can send qualified citation traffic should be reviewed differently from a crawler that only collects training data, and both should be reviewed differently from a random user-agent string hammering faceted URLs. The name alone is not enough; the role is the decision point.

Which AI crawler user agents should most sites audit first?

Start with the agents that can materially affect search visibility or server load. For most public marketing, ecommerce, SaaS, and documentation sites, that means OpenAI, Anthropic, Perplexity, Google, Apple, Common Crawl, Meta, and high-volume commercial crawlers such as Bytespider. You do not need to memorize every obscure crawler on day one. You do need a maintained table for the agents that show up in logs and the agents your leadership expects to influence AI visibility.

| User-agent token | Operator | Main Role | SEO Default |
| --- | --- | --- | --- |
| `OAI-SearchBot` | OpenAI | ChatGPT search surfacing | Usually allow for visibility |
| `GPTBot` | OpenAI | Foundation-model training crawl | Policy decision, not a ranking lever |
| `ChatGPT-User` | OpenAI | User-triggered ChatGPT actions and page visits | Verify separately from search |
| `Claude-SearchBot` | Anthropic | Claude search optimization and result quality | Usually allow for Claude visibility |
| `ClaudeBot` | Anthropic | Training-related web collection | Separate training opt-in or opt-out |
| `Claude-User` | Anthropic | User-directed retrieval inside Claude | Review if Claude users cite your URLs |
| `PerplexityBot` | Perplexity | Perplexity search results and links | Usually allow for Perplexity visibility |
| `Perplexity-User` | Perplexity | User-requested answer support | Verify with IP ranges and logs |
| `Google-Extended` | Google | Product token for Gemini training and grounding choices | Does not control Google Search inclusion |
| `Applebot` | Apple | Apple search, Spotlight, Siri, and Safari surfaces | Usually allow if Apple discovery matters |
| `Applebot-Extended` | Apple | Training-use control for Apple foundation models | Separate from Applebot discovery |
| `CCBot` | Common Crawl | Open web crawl dataset used by many downstream systems | Policy decision based on data-use tolerance |
| `meta-externalagent` | Meta | Meta crawler traffic seen in AI and external fetch contexts | Monitor closely and avoid breaking link previews |
| `Bytespider` | ByteDance | High-volume crawl traffic often discussed as AI data collection | Audit load and apply server controls if abusive |
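The table above can be encoded as a small lookup so that log tooling labels each request by role instead of by name alone. This is a minimal sketch in Python: the role labels (`search`, `training`, `user_fetch`, `dataset`, `monitor`) and the function name `classify_user_agent` are hypothetical choices for illustration, and the sample strings are made up rather than copied from vendor documentation. `Google-Extended` and `Applebot-Extended` are left out on the assumption that, as control tokens, they are matched in robots.txt rather than sent in the `User-Agent` header.

```python
from typing import Optional, Tuple

# Token -> (operator, role), transcribed from the table above.
# Role labels are hypothetical names chosen for this sketch.
AGENT_ROLES = {
    "OAI-SearchBot": ("OpenAI", "search"),
    "GPTBot": ("OpenAI", "training"),
    "ChatGPT-User": ("OpenAI", "user_fetch"),
    "Claude-SearchBot": ("Anthropic", "search"),
    "ClaudeBot": ("Anthropic", "training"),
    "Claude-User": ("Anthropic", "user_fetch"),
    "PerplexityBot": ("Perplexity", "search"),
    "Perplexity-User": ("Perplexity", "user_fetch"),
    "Applebot": ("Apple", "search"),
    "CCBot": ("Common Crawl", "dataset"),
    "meta-externalagent": ("Meta", "monitor"),
    "Bytespider": ("ByteDance", "monitor"),
}


def classify_user_agent(user_agent: str) -> Optional[Tuple[str, str, str]]:
    """Return (token, operator, role) for the first known token found, else None."""
    ua = user_agent.lower()
    # Check longer tokens first so a short token can never shadow a longer one.
    for token in sorted(AGENT_ROLES, key=len, reverse=True):
        if token.lower() in ua:
            operator, role = AGENT_ROLES[token]
            return token, operator, role
    return None


# Illustrative strings only; take real full strings from vendor documentation.
print(classify_user_agent("Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"))
print(classify_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))
```

A match here says only what the request claims to be. It does not prove who sent it, so the result should still be paired with source verification.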

Separate search visibility from training consent

The highest-value split is search versus training. OpenAI says a site can allow OAI-SearchBot for search results while disallowing GPTBot for training preferences. Anthropic makes a similar separation between Claude-SearchBot and ClaudeBot. Apple documents Applebot-Extended as an additional control over generative training use while Applebot can still support search-style discovery. Treat these as different switches, not aliases.

Do not confuse user fetchers with background crawlers

ChatGPT-User, Claude-User, and Perplexity-User are closer to user-triggered retrieval than recurring background crawling. Their traffic can still matter for server behavior and AI answer quality, but policy language should be precise. If a human asks an assistant to fetch your URL, blocking that fetch can reduce visibility in the moment even though it does not answer the larger training-consent question.

The practical policy is not "allow AI" or "block AI." The practical policy is: allow the crawlers that support discovery you want, block or limit uses you do not want, and verify every high-impact rule in production logs.

User-agent strings are useful, but production policy should pair them with IP verification, response-code monitoring, and path-level log review. Photo: Markus Spiske via Wikimedia Commons.

Should you allow or block AI crawlers in robots.txt?

The safest answer is selective access. Public sites that want visibility in ChatGPT, Claude, Perplexity, Apple search, and Google AI features should usually keep search and retrieval crawlers open unless there is a legal, licensing, privacy, or server-load reason to restrict them. Training-related crawlers are a separate governance decision that should involve legal, editorial, product, and SEO owners.

A blanket rule that pairs `User-agent: *` with `Disallow: /` is simple, but it is rarely aligned with a public SEO strategy. It can suppress normal search crawlers, AI search fetchers, image discovery, link-preview systems, and tools you actually rely on. On the other hand, a blanket allow can expose expensive parameter paths, staging artifacts, duplicate URL spaces, and low-value pages that increase crawl waste. The right robots.txt policy is narrow enough to be intentional and boring enough to be maintainable.

| Site Goal | Better Default | Watch For |
| --- | --- | --- |
| Earn AI-search citations | Allow OAI-SearchBot, Claude-SearchBot, PerplexityBot, Googlebot, and Applebot where relevant | WAF rules that override robots.txt |
| Opt out of model training | Disallow training-specific tokens where vendors provide them | Accidentally blocking search crawlers at the same time |
| Reduce crawl load | Limit low-value paths and use rate limits for abusive request patterns | Relying on voluntary robots rules for hostile traffic |
| Protect private or paid content | Use authentication, noindex where appropriate, and server-level controls | Assuming robots.txt is an access-control system |

A balanced robots.txt pattern for public SEO pages

A visibility-first site can allow search and retrieval agents while disallowing some training tokens. This example is not a universal template, but it shows the policy shape many content teams need:

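# Search and citation crawlers: keep open for AI-search visibility
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Training-related tokens: disallow to express a training opt-out
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /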

Before shipping any version of that rule, compare it with your existing canonical, noindex, sitemap, and CDN behavior. If your site has international pages, faceted ecommerce URLs, gated documentation, or generated pages, pair this with the checks in our AI-ready technical SEO checklist and canonical tags guide.
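One way to sanity-check a draft like the robots.txt pattern above before deployment is to parse it locally and ask which tokens may fetch a given URL. The sketch below uses Python's standard `urllib.robotparser`. The helper name `check_policy` and the `example.com` URL are assumptions for illustration, and Python's matching logic may not be identical to each vendor's parser, so treat the result as a lint check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# The visibility-first pattern from this section, as a string.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /
"""


def check_policy(robots_txt: str, tokens: list, url: str) -> dict:
    """Map each robots.txt token to True (may fetch url) or False (blocked)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Pass the short token, not the full user-agent string: the parser
    # only looks at the part of the name before the first "/".
    return {token: parser.can_fetch(token, url) for token in tokens}


result = check_policy(
    ROBOTS_TXT,
    ["OAI-SearchBot", "GPTBot", "Googlebot"],
    "https://example.com/guides/ai-crawlers",  # hypothetical URL
)
for token, allowed in result.items():
    print(f"{token}: {'allowed' if allowed else 'blocked'}")
```

Tokens with no matching group, such as `Googlebot` here, fall through to the default of allowed because the draft contains no `User-agent: *` group.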

How do robots.txt rules differ from WAF and IP allowlists?

robots.txt communicates preferences to cooperative crawlers. A WAF decides whether a request is allowed, challenged, rate-limited, or blocked at the edge. IP allowlists help prove that a claimed crawler is coming from infrastructure the vendor publishes or can verify. These systems overlap in bot policy, but they do not do the same job.

OpenAI and Perplexity publish IP range endpoints for several crawler and user-fetcher agents. Perplexity also recommends combining user-agent and IP range checks in WAF rules. Apple documents reverse DNS and CIDR-based verification for Applebot. Anthropic notes that IP blocking is not the preferred opt-out path because crawlers must be able to read robots.txt, but it still provides IP verification information for identifying its crawlers. The common pattern is clear: use robots.txt to state policy, then use network-layer verification to avoid trusting a copied string.
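For vendors that document host-based verification, the usual shape is a reverse DNS lookup followed by a forward lookup that must return the original IP. The sketch below shows that shape in Python with the standard `socket` module. The function name `verify_by_reverse_dns` and the suffix `.bots.example.com` are placeholders, not real vendor values, so take the actual hostname suffix from the vendor's documentation. The resolvers are injectable so the logic can be exercised without network access.

```python
import socket
from typing import Callable, List, Sequence


def _reverse_lookup(ip: str) -> str:
    """PTR lookup: IP address -> primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward_lookup(hostname: str) -> List[str]:
    """Forward lookup: hostname -> IPv4 addresses (IPv6 would need getaddrinfo)."""
    return socket.gethostbyname_ex(hostname)[2]


def verify_by_reverse_dns(
    ip: str,
    allowed_suffixes: Sequence[str],
    reverse: Callable[[str], str] = _reverse_lookup,
    forward: Callable[[str], List[str]] = _forward_lookup,
) -> bool:
    """True only if the PTR hostname has an allowed suffix AND resolves back to ip."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    if not any(hostname.endswith(suffix) for suffix in allowed_suffixes):
        return False
    try:
        return ip in forward(hostname)
    except OSError:
        return False


# Offline demo with stub resolvers; hostnames and IPs are placeholders.
def stub_reverse(ip: str) -> str:
    return "crawler-1.bots.example.com"


def stub_forward(hostname: str) -> List[str]:
    return ["192.0.2.10"]


print(verify_by_reverse_dns("192.0.2.10", [".bots.example.com"], stub_reverse, stub_forward))
print(verify_by_reverse_dns("192.0.2.99", [".bots.example.com"], stub_reverse, stub_forward))
```

The forward confirmation matters because whoever controls an IP block also controls its PTR records, so a reverse lookup alone can be forged. Keeping the leading dot in each suffix avoids matching look-alike hostnames.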

| Control Layer | Best Use | Limitation |
| --- | --- | --- |
| robots.txt | Vendor-specific crawl preferences by user-agent token | Voluntary and not a privacy or security barrier |
| WAF rules | Blocking abuse, allowing verified crawlers, protecting expensive paths | Can accidentally override your SEO policy |
| IP range checks | Confirming a crawler is likely operated by the claimed vendor | Requires updates when vendors rotate infrastructure |
| Reverse DNS | Validating search crawlers that document host verification | Not every AI crawler offers a clean DNS verification path |

This is where many teams break their own AI search visibility. A Next.js route may generate a permissive robots.txt file, but a CDN or WAF setting can still block GPTBot, OAI-SearchBot, PerplexityBot, or Claude-User before the crawler ever reaches your page. That failure will not show up in GA4, because crawlers generally do not execute the JavaScript that analytics tags depend on. You need server logs, edge logs, or crawler-specific monitoring.

AI crawler verification is an infrastructure workflow as much as an SEO workflow. Image: Sandia National Laboratories via Wikimedia Commons, public domain.

How do you verify AI crawler user agents in server logs?

Verification starts with raw log access. GA4, client-side analytics, and most tag-based tools will miss crawler traffic because bots often do not execute JavaScript. Review the access logs or edge logs that include timestamp, IP address, HTTP method, path, status code, user-agent string, bytes sent, and WAF action. Then group requests by token and purpose.

Step 1: match the token, then verify the source

Search for the documented token inside the full user-agent string: `OAI-SearchBot`, `GPTBot`, `Claude-SearchBot`, `PerplexityBot`, and the rest. A token match is only the first pass. For high-value access decisions, compare the request IP with the vendor's current published ranges or DNS verification method. If there is no documented verification path, treat the traffic as lower-trust and manage it by behavior as well as identity.
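The two-pass check described in this step (token match first, then source verification) might look like the following Python sketch. The CIDR blocks are reserved documentation ranges used as placeholders, not real vendor ranges, and `PUBLISHED_RANGES` and `is_verified` are hypothetical names. In practice the ranges would be refreshed from each vendor's published endpoint.

```python
import ipaddress

# Placeholder CIDR blocks from the reserved documentation ranges.
# Replace with the ranges each vendor currently publishes.
PUBLISHED_RANGES = {
    "OAI-SearchBot": ["192.0.2.0/24"],
    "PerplexityBot": ["198.51.100.0/25"],
}


def is_verified(user_agent: str, client_ip: str, ranges: dict = PUBLISHED_RANGES) -> bool:
    """True only if the UA contains a known token AND the IP is in that token's ranges."""
    try:
        ip = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # malformed IP in the log line
    ua = user_agent.lower()
    for token, cidrs in ranges.items():
        if token.lower() in ua:
            # Second pass: the claimed identity must match a published network.
            return any(ip in ipaddress.ip_network(cidr) for cidr in cidrs)
    return False  # no documented token, so treat as lower-trust


claimed = "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"  # illustrative string
print(is_verified(claimed, "192.0.2.44"))   # inside the placeholder range
print(is_verified(claimed, "203.0.113.9"))  # same string, unlisted source
```

A request that fails this check is not necessarily hostile, since the published ranges may have rotated. It still should not inherit the allowances granted to the verified crawler.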

Step 2: inspect paths and response codes

The paths matter more than aggregate hits. A verified search crawler requesting canonical articles, product pages, and documentation can be useful. The same crawler hammering parameter URLs, old redirects, or internal search pages may need path restrictions. Look for 403, 404, 429, 500, and redirect chains before concluding that a crawler is helpful or harmful.

# Useful log review fields
timestamp
client_ip
host
method
path
status
bytes_sent
user_agent
waf_action
cache_status
referrer
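Once logs are parsed into records with the fields listed above, a short aggregation shows which tokens hit which status codes and which paths produce errors. The Python sketch below assumes each record is a dictionary keyed by those field names. The sample records, the `TOKENS` list, and the `summarize` helper are illustrative assumptions, not output from a real log.

```python
from collections import Counter, defaultdict

TOKENS = ["OAI-SearchBot", "GPTBot", "Claude-SearchBot", "PerplexityBot"]
WATCH_STATUSES = {403, 404, 429, 500}  # the error codes called out in Step 2


def match_token(user_agent: str):
    """Return the first documented token found in the UA string, else None."""
    ua = user_agent.lower()
    for token in TOKENS:
        if token.lower() in ua:
            return token
    return None


def summarize(records):
    """Count (token, status) pairs and collect per-token paths with watched statuses."""
    status_counts = Counter()
    problem_paths = defaultdict(Counter)
    for rec in records:
        token = match_token(rec["user_agent"])
        if token is None:
            continue  # not one of the crawlers under review
        status = int(rec["status"])
        status_counts[(token, status)] += 1
        if status in WATCH_STATUSES:
            problem_paths[token][rec["path"]] += 1
    return status_counts, problem_paths


# Made-up records for illustration.
records = [
    {"user_agent": "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)", "path": "/guides/robots-txt", "status": "200"},
    {"user_agent": "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)", "path": "/search?q=shoes", "status": "429"},
    {"user_agent": "Mozilla/5.0 (compatible; GPTBot/1.0)", "path": "/old-page", "status": "404"},
    {"user_agent": "Mozilla/5.0 (Windows NT 10.0)", "path": "/", "status": "200"},
]

status_counts, problem_paths = summarize(records)
for (token, status), count in sorted(status_counts.items()):
    print(token, status, count)
```

Grouping by path rather than by total hits keeps the focus on where a crawler is failing. Redirect chains need a separate pass, because a single 301 or 302 record does not show the whole hop sequence.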

Step 3: keep an AI crawler change log

Every policy edit should have an owner, date, reason, affected user agents, affected paths, and expected outcome. Without a change log, the team will rediscover the same problem whenever a CDN, CMS, firewall, or framework update changes generated robots.txt behavior. The workflow should resemble the release logging in our log file analysis for crawl budget guide and the measurement discipline in our SEO dashboards and KPI model.

What is the safest AI crawler policy for common site types?

The same AI crawler user agents can imply different policies by site type. A B2B SaaS documentation hub may want maximum retrieval because product answers in ChatGPT and Claude can support demand creation. A paid publisher may care more about licensing control. An ecommerce site may want product discovery but stricter protection around pricing APIs, faceted navigation, and cart endpoints. A regulated portal may need authentication and noindex controls before any bot policy matters.

| Site Type | Better Policy Shape | Internal Link to Review |
| --- | --- | --- |
| B2B SaaS and docs | Allow search and user retrieval; decide training controls separately | writing for AI answers |
| Ecommerce | Allow product discovery crawlers; restrict carts, filters, and internal search | product schema for ecommerce |
| Editorial publishers | Balance citation upside with training, licensing, and paywall policy | adding citations to content |
| Regulated or gated content | Use authentication, noindex, and WAF controls before relying on robots.txt | YMYL AI content standards |
| Large programmatic sites | Allow canonical inventory; block crawl traps, duplicates, and infinite URL spaces | pagination and infinite scroll SEO |

If leadership asks for a binary recommendation, reframe the question. The real decision is not whether AI crawlers are good or bad. The real decision is which crawlers support the business outcomes you want, which crawlers create unacceptable data-use or capacity risk, and which requests cannot be trusted from identity alone.

How often should you update an AI crawler list?

Review the list monthly, and review it immediately after major vendor documentation changes, CDN bot-management changes, product launches, or unexplained log spikes. As of May 2026, the crawler landscape is changing too quickly for a set-and-forget robots.txt file. Search crawlers, training controls, user fetchers, and agent products are being documented more explicitly, but that also means old policies become stale faster.

Keep a source-of-truth table

Your table should include user-agent token, full user-agent string if published, operator, purpose, official documentation URL, IP verification method, current policy, owner, and last review date. That table belongs in the same governance system as your sitemap, schema templates, redirects, and robots.txt route handler.
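The columns described above map naturally onto a typed record, which also makes the monthly review easy to enforce. The Python sketch below is one possible shape: the class name `CrawlerPolicy`, the `overdue` helper, the 31-day threshold, and every sample value (including the `example.com` URLs and team names) are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class CrawlerPolicy:
    token: str                      # robots.txt user-agent token
    full_user_agent: Optional[str]  # full string, if the vendor publishes one
    operator: str
    purpose: str
    documentation_url: str
    ip_verification: str            # e.g. "published IP ranges", "reverse DNS", "none"
    current_policy: str             # e.g. "allow", "disallow", "rate-limit"
    owner: str
    last_review: date


def overdue(entries: List[CrawlerPolicy], today: date, max_age_days: int = 31) -> List[str]:
    """Return tokens whose last review is older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [e.token for e in entries if e.last_review < cutoff]


# Placeholder rows; URLs, owners, and dates are hypothetical.
table = [
    CrawlerPolicy("OAI-SearchBot", None, "OpenAI", "search surfacing",
                  "https://example.com/docs/oai-searchbot", "published IP ranges",
                  "allow", "seo-team", date(2026, 5, 1)),
    CrawlerPolicy("GPTBot", None, "OpenAI", "training crawl",
                  "https://example.com/docs/gptbot", "published IP ranges",
                  "disallow", "legal", date(2026, 3, 1)),
]

print(overdue(table, today=date(2026, 5, 15)))
```

Keeping the record in version control alongside the robots.txt route handler means a policy change and its documentation can be reviewed in the same pull request.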

Monitor for new agents before writing new blocks

When a new token appears in logs, avoid rushing to a sitewide block. First inspect request volume, paths, response codes, and source network. Then search for official documentation. If no reputable documentation exists and the behavior is expensive or suspicious, treat it as an infrastructure problem rather than an SEO opportunity.

Use content strategy to decide crawler value

Crawler access only matters when the content is worth retrieving. If pages are thin, duplicative, poorly cited, or technically unstable, letting every search crawler in will not create durable AI visibility. Pair this crawler reference with the answer engine optimization checklist and structured data playbook so access, evidence, and page clarity improve together.

FAQ: AI crawler user agents