Which AI crawlers should I allow for AI search visibility?

For visibility, start by reviewing OAI-SearchBot for ChatGPT search, Claude-SearchBot for Claude search, PerplexityBot for Perplexity results, Applebot for Apple search surfaces, and standard Googlebot for Google Search and AI Overviews. Training controls such as GPTBot, ClaudeBot, Google-Extended, and Applebot-Extended are separate decisions.

Is Google-Extended the same as Googlebot?

No. Google documents Google-Extended as a robots.txt product token rather than a separate HTTP user-agent string, and it does not affect inclusion or ranking in Google Search. Googlebot remains the normal crawler for Google Search discovery and indexing.

Can robots.txt stop all AI scraping?

No. robots.txt is an access preference that cooperative crawlers can honor, not a security control. Use robots.txt for documented crawler policy, then use WAF rules, IP verification, rate limits, and log monitoring to handle spoofed, abusive, or undocumented traffic.

How do I verify a real AI crawler?

Match the user-agent token, then verify it against the vendor's published IP ranges or reverse DNS process when available. Log the result by crawler family, URL path, response code, and WAF action so future policy changes are auditable.

May 26, 2026 · 21 min read · Technical Guide

AI crawler user agents list for SEO teams in 2026

Q: What are AI crawler user agents?

AI crawler user agents are the bot names and full user-agent strings that AI companies send when their crawlers request a page. SEO teams use them to identify training crawlers, AI search crawlers, user-triggered fetchers, and crawler-like traffic that needs separate policy treatment.

AI crawler user agents are the named bot tokens and full user-agent strings that let SEO teams separate search-visibility crawlers from model-training crawlers, user-triggered fetchers, and abusive bot traffic. The most important 2026 insight is that OpenAI, Anthropic, Perplexity, Google, and Apple now expose separate controls, so one blanket block can remove citations while still failing to solve server-load risk.

This guide covers how to decide which AI crawler user agents to allow, block, monitor, and verify in robots.txt, WAF rules, and server logs.

Figure (server room): Treat AI crawler policy like infrastructure policy: identify the requester, decide the allowed use case, then verify behavior in logs. Image: Michael_Hiraeth via Wikimedia Commons, CC0.

AI crawler user agents should be audited before you change AI bot rules in robots.txt, Cloudflare settings, or server-level rate limits. The modern AI crawler list is no longer a single blocklist: OpenAI, Anthropic, Perplexity, Google, Apple, Meta, Common Crawl, and other operators use different identifiers for training, AI-search indexing, user-triggered retrieval, link previews, and crawler-like product checks. If you treat them all as one category, you can accidentally block the exact search surface you wanted to appear in while leaving noisy or spoofed traffic untouched.

Search Roost already has platform-specific guides for Claude-SearchBot robots.txt policy, PerplexityBot access decisions, llms.txt governance, and robots.txt SEO fundamentals. This page connects those pieces into one operating reference: which AI bot user agents matter first, what each one controls, when to allow or block it, and how to prove the request was legitimate before you make it part of an SEO policy.

What are AI crawler user agents and why do they matter for SEO?

AI crawler user agents are identifiers sent in the HTTP `User-Agent` header when an automated system requests a page. In robots.txt, the shorter user-agent token is the control surface: `User-agent: GPTBot`, `User-agent: OAI-SearchBot`, `User-agent: Claude-SearchBot`, and so on. In server logs, WAF rules, and bot analytics, you often see the longer full string, which may include browser-like text plus a bot token and documentation URL.
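The difference between the short token and the full string can be sketched in a few lines of Python. This is a minimal illustration, assuming a simple case-insensitive substring match; the token list comes from this guide, and the sample string is hypothetical rather than a vendor-published value.

```python
from typing import Optional

# Tokens named in this guide; extend from your own maintained table.
KNOWN_TOKENS = [
    "OAI-SearchBot", "GPTBot", "ChatGPT-User",
    "Claude-SearchBot", "ClaudeBot", "Claude-User",
    "PerplexityBot", "Perplexity-User",
    "Applebot", "CCBot", "meta-externalagent", "Bytespider",
]

def match_crawler_token(user_agent: str) -> Optional[str]:
    """Return the first known token found in a full user-agent string."""
    lowered = user_agent.lower()
    # Check longer tokens first so a longer token is never shadowed by a shorter one.
    for token in sorted(KNOWN_TOKENS, key=len, reverse=True):
        if token.lower() in lowered:
            return token
    return None

# Illustrative string only (hypothetical, not copied from vendor documentation).
sample = "Mozilla/5.0 (compatible; ExampleVendor; GPTBot/1.0; +https://example.com/bot)"
print(match_crawler_token(sample))  # prints "GPTBot"
```

A match here only says the string claims to be that crawler. It says nothing about whether the request actually came from the vendor.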

For SEO, the distinction matters because user-agent policy now changes where content can be discovered, quoted, or excluded. OpenAI documents OAI-SearchBot as the crawler used to surface websites in ChatGPT search features, while GPTBot is tied to training use. Anthropic documents ClaudeBot, Claude-User, and Claude-SearchBot as separate choices. Perplexity documents PerplexityBot and Perplexity-User separately. Google and Apple both expose training-related product tokens that do not behave the same way as their normal search crawlers.

Category	Common SEO Goal	Policy Mistake
Search and citation crawlers	Keep pages eligible for AI answers and cited results	Blocking them while trying to improve AI-search visibility
Model-training crawlers or controls	Express training-data preferences without killing search	Assuming every training opt-out also preserves retrieval
User-triggered fetchers	Let assistants retrieve pages when a user asks	Treating user fetches like recurring background crawls
Unknown or spoofed agents	Protect capacity, private paths, and expensive endpoints	Trusting a user-agent string without IP or DNS verification

The right policy starts by naming the use case. A crawler that can send qualified citation traffic should be reviewed differently from a crawler that only collects training data, and both should be reviewed differently from a random user-agent string hammering faceted URLs. The name alone is not enough; the role is the decision point.

Which AI crawler user agents should most sites audit first?

Start with the agents that can materially affect search visibility or server load. For most public marketing, ecommerce, SaaS, and documentation sites, that means OpenAI, Anthropic, Perplexity, Google, Apple, Common Crawl, Meta, and high-volume commercial crawlers such as Bytespider. You do not need to memorize every obscure crawler on day one. You do need a maintained table for the agents that show up in logs and the agents your leadership expects to influence AI visibility.

User-agent token	Operator	Main Role	SEO Default
`OAI-SearchBot`	OpenAI	ChatGPT search surfacing	Usually allow for visibility
`GPTBot`	OpenAI	Foundation-model training crawl	Policy decision, not a ranking lever
`ChatGPT-User`	OpenAI	User-triggered ChatGPT actions and page visits	Verify separately from search
`Claude-SearchBot`	Anthropic	Claude search optimization and result quality	Usually allow for Claude visibility
`ClaudeBot`	Anthropic	Training-related web collection	Separate training opt-in or opt-out
`Claude-User`	Anthropic	User-directed retrieval inside Claude	Review if Claude users cite your URLs
`PerplexityBot`	Perplexity	Perplexity search results and links	Usually allow for Perplexity visibility
`Perplexity-User`	Perplexity	User-requested answer support	Verify with IP ranges and logs
`Google-Extended`	Google	Product token for Gemini training and grounding choices	Does not control Google Search inclusion
`Applebot`	Apple	Apple search, Spotlight, Siri, and Safari surfaces	Usually allow if Apple discovery matters
`Applebot-Extended`	Apple	Training-use control for Apple foundation models	Separate from Applebot discovery
`CCBot`	Common Crawl	Open web crawl dataset used by many downstream systems	Policy decision based on data-use tolerance
`meta-externalagent`	Meta	Meta crawler traffic seen in AI and external fetch contexts	Monitor closely and avoid breaking link previews
`Bytespider`	ByteDance	High-volume crawl traffic often discussed as AI data collection	Audit load and apply server controls if abusive
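Teams that keep this table in code rather than a spreadsheet might model it roughly as follows. This is a sketch only: the category labels (`search`, `training`, `user_fetch`, `other`) are this example's own shorthand for the roles in the table, not vendor terminology.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CrawlerRole:
    operator: str
    category: str  # "search", "training", "user_fetch", or "other" (labels assumed for this sketch)

# Roles transcribed from the table above.
CRAWLER_ROLES = {
    "OAI-SearchBot": CrawlerRole("OpenAI", "search"),
    "GPTBot": CrawlerRole("OpenAI", "training"),
    "ChatGPT-User": CrawlerRole("OpenAI", "user_fetch"),
    "Claude-SearchBot": CrawlerRole("Anthropic", "search"),
    "ClaudeBot": CrawlerRole("Anthropic", "training"),
    "Claude-User": CrawlerRole("Anthropic", "user_fetch"),
    "PerplexityBot": CrawlerRole("Perplexity", "search"),
    "Perplexity-User": CrawlerRole("Perplexity", "user_fetch"),
    "Google-Extended": CrawlerRole("Google", "training"),
    "Applebot": CrawlerRole("Apple", "search"),
    "Applebot-Extended": CrawlerRole("Apple", "training"),
    "CCBot": CrawlerRole("Common Crawl", "other"),
    "meta-externalagent": CrawlerRole("Meta", "other"),
    "Bytespider": CrawlerRole("ByteDance", "other"),
}

def tokens_in_category(category: str) -> list[str]:
    """List tokens whose role matches the given category, sorted for stable output."""
    return sorted(t for t, r in CRAWLER_ROLES.items() if r.category == category)

print(tokens_in_category("search"))
```

Keeping the role next to the token makes it harder to write a rule that treats a search crawler and a training control as the same thing.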

Separate search visibility from training consent

The highest-value split is search versus training. OpenAI says a site can allow OAI-SearchBot for search results while disallowing GPTBot for training preferences. Anthropic makes a similar separation between Claude-SearchBot and ClaudeBot. Apple documents Applebot-Extended as an additional control over generative training use while Applebot can still support search-style discovery. Treat these as different switches, not aliases.

Do not confuse user fetchers with background crawlers

ChatGPT-User, Claude-User, and Perplexity-User are closer to user-triggered retrieval than recurring background crawling. Their traffic can still matter for server behavior and AI answer quality, but policy language should be precise. If a human asks an assistant to fetch your URL, blocking that fetch can reduce visibility in the moment even though it does not answer the larger training-consent question.

The practical policy is not "allow AI" or "block AI." The practical policy is: allow the crawlers that support discovery you want, block or limit uses you do not want, and verify every high-impact rule in production logs.

Figure (code monitor): User-agent strings are useful, but production policy should pair them with IP verification, response-code monitoring, and path-level log review. Photo: Markus Spiske via Wikimedia Commons.

Should you allow or block AI crawlers in robots.txt?

The safest answer is selective access. Public sites that want visibility in ChatGPT, Claude, Perplexity, Apple search, and Google AI features should usually keep search and retrieval crawlers open unless there is a legal, licensing, privacy, or server-load reason to restrict them. Training-related crawlers are a separate governance decision that should involve legal, editorial, product, and SEO owners.

A blanket rule (`User-agent: *` followed by `Disallow: /`) is simple, but it is rarely aligned with a public SEO strategy. It can suppress normal search crawlers, AI search fetchers, image discovery, link-preview systems, and tools you actually rely on. On the other hand, a blanket allow can expose expensive parameter paths, staging artifacts, duplicate URL spaces, and low-value pages that increase crawl waste. The right robots.txt policy is narrow enough to be intentional and boring enough to be maintainable.

Site Goal	Better Default	Watch For
Earn AI-search citations	Allow OAI-SearchBot, Claude-SearchBot, PerplexityBot, Googlebot, and Applebot where relevant	WAF rules that override robots.txt
Opt out of model training	Disallow training-specific tokens where vendors provide them	Accidentally blocking search crawlers at the same time
Reduce crawl load	Limit low-value paths and use rate limits for abusive request patterns	Relying on voluntary robots rules for hostile traffic
Protect private or paid content	Use authentication, noindex where appropriate, and server-level controls	Assuming robots.txt is an access-control system

A balanced robots.txt pattern for public SEO pages

A visibility-first site can allow search and retrieval agents while disallowing some training tokens. This example is not a universal template, but it shows the policy shape many content teams need:

# Search and retrieval crawlers: allowed
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Training-related tokens: disallowed
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /
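One way to sanity-check a policy like this before deploying is Python's standard-library `urllib.robotparser`. Its matching logic is simpler than what each vendor's crawler implements, so treat the result as a rough pre-flight check rather than proof of crawler behavior. The `example.com` URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /
"""

def check_policy(robots_txt: str, url: str, tokens: list[str]) -> dict[str, bool]:
    """Report whether each token may fetch the URL under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {token: parser.can_fetch(token, url) for token in tokens}

result = check_policy(
    ROBOTS_TXT,
    "https://example.com/blog/post",  # placeholder URL
    ["OAI-SearchBot", "GPTBot", "Googlebot"],
)
for token, allowed in result.items():
    print(f"{token}: {'allowed' if allowed else 'disallowed'}")
```

Googlebot comes back allowed because no group names it and there is no `User-agent: *` group, which is the behavior you want when the goal is to opt out of training without touching normal search crawling.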

Before shipping any version of that rule, compare it with your existing canonical, noindex, sitemap, and CDN behavior. If your site has international pages, faceted ecommerce URLs, gated documentation, or generated pages, pair this with the checks in our AI-ready technical SEO checklist and canonical tags guide.

How do robots.txt rules differ from WAF and IP allowlists?

robots.txt communicates preferences to cooperative crawlers. A WAF decides whether a request is allowed, challenged, rate-limited, or blocked at the edge. IP allowlists help prove that a claimed crawler is coming from infrastructure the vendor publishes or can verify. These systems overlap in bot policy, but they do not do the same job.

OpenAI and Perplexity publish IP range endpoints for several crawler and user-fetcher agents. Perplexity also recommends combining user-agent and IP range checks in WAF rules. Apple documents reverse DNS and CIDR-based verification for Applebot. Anthropic notes that IP blocking is not the preferred opt-out path because crawlers must be able to read robots.txt, but it still provides IP verification information for identifying its crawlers. The common pattern is clear: use robots.txt to state policy, then use network-layer verification to avoid trusting a copied string.

Control Layer	Best Use	Limitation
robots.txt	Vendor-specific crawl preferences by user-agent token	Voluntary and not a privacy or security barrier
WAF rules	Blocking abuse, allowing verified crawlers, protecting expensive paths	Can accidentally override your SEO policy
IP range checks	Confirming a crawler is likely operated by the claimed vendor	Requires updates when vendors rotate infrastructure
Reverse DNS	Validating search crawlers that document host verification	Not every AI crawler offers a clean DNS verification path
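For vendors that document host-based verification, the usual technique is forward-confirmed reverse DNS: resolve the IP to a hostname, check the hostname suffix, then resolve the hostname back and confirm it returns the same IP. The sketch below assumes IPv4 and uses stub resolvers with hypothetical hostnames and documentation-range IPs so it runs offline. The real suffix for any crawler must come from that vendor's documentation.

```python
import socket
from typing import Callable, Sequence

def verify_reverse_dns(
    ip: str,
    allowed_suffixes: Sequence[str],
    reverse_lookup: Callable[[str], str] = lambda addr: socket.gethostbyaddr(addr)[0],
    forward_lookup: Callable[[str], Sequence[str]] = lambda host: socket.gethostbyname_ex(host)[2],
) -> bool:
    """Forward-confirmed reverse DNS: IP -> hostname -> IPs must include the original IP."""
    try:
        hostname = reverse_lookup(ip).rstrip(".").lower()
    except OSError:
        return False  # no PTR record or lookup failure
    # Suffixes should start with "." so "evil-crawler.example" cannot pass as ".crawler.example".
    if not any(hostname.endswith(suffix.lower()) for suffix in allowed_suffixes):
        return False
    try:
        return ip in forward_lookup(hostname)
    except OSError:
        return False

# Offline demonstration with stub resolvers (all values hypothetical).
stub_reverse = {
    "192.0.2.10": "crawl-10.crawler.example",
    "192.0.2.99": "crawl.crawler.example.attacker.test",
}
stub_forward = {"crawl-10.crawler.example": ["192.0.2.10"]}

print(verify_reverse_dns("192.0.2.10", [".crawler.example"],
                         stub_reverse.__getitem__, stub_forward.__getitem__))
print(verify_reverse_dns("192.0.2.99", [".crawler.example"],
                         stub_reverse.__getitem__, stub_forward.__getitem__))
```

In production the default resolvers perform live DNS lookups, so results should be cached and the check kept off the request path.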

This is where many teams break their own AI search visibility. A Next.js route may generate a permissive robots.txt file, but a CDN or WAF setting can still block GPTBot, OAI-SearchBot, PerplexityBot, or Claude-User before the crawler ever reaches your page. That failure will not show up in GA4 because crawlers typically do not execute the JavaScript that analytics tags depend on. You need server logs, edge logs, or crawler-specific monitoring.
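A rough way to surface that mismatch is to compare what robots.txt permits with what the edge actually did. The sketch below assumes edge-log records have already been reduced to a matched token, a path, and a `waf_action` value of `allow` or `block`. Those field values are assumptions for illustration, not any specific CDN's schema.

```python
from urllib.robotparser import RobotFileParser

def find_policy_conflicts(robots_txt: str, records: list[dict]) -> list[dict]:
    """Return records that robots.txt permits but the edge layer blocked."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    conflicts = []
    for record in records:
        allowed_by_robots = parser.can_fetch(record["token"], record["path"])
        if allowed_by_robots and record["waf_action"] == "block":
            conflicts.append(record)
    return conflicts

permissive_robots = "User-agent: *\nAllow: /\n"

# Hypothetical, pre-parsed edge-log records.
edge_records = [
    {"token": "OAI-SearchBot", "path": "/pricing", "waf_action": "block"},
    {"token": "OAI-SearchBot", "path": "/docs", "waf_action": "allow"},
]

for conflict in find_policy_conflicts(permissive_robots, edge_records):
    print(f"{conflict['token']} blocked at the edge on {conflict['path']} despite robots.txt allowing it")
```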

Figure (fiber optics rack): AI crawler verification is an infrastructure workflow as much as an SEO workflow. Image: Sandia National Laboratories via Wikimedia Commons, public domain.

How do you verify AI crawler user agents in server logs?

Verification starts with raw log access. GA4, client-side analytics, and most tag-based tools will miss crawler traffic because bots often do not execute JavaScript. Review the access logs or edge logs that include timestamp, IP address, HTTP method, path, status code, user-agent string, bytes sent, and WAF action. Then group requests by token and purpose.

Step 1: match the token, then verify the source

Search for the documented token inside the full user-agent string: `OAI-SearchBot`, `GPTBot`, `Claude-SearchBot`, `PerplexityBot`, and the rest. A token match is only the first pass. For high-value access decisions, compare the request IP with the vendor's current published ranges or DNS verification method. If there is no documented verification path, treat the traffic as lower-trust and manage it by behavior as well as identity.
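The IP comparison itself is straightforward with Python's `ipaddress` module. In this sketch the CIDR blocks are reserved documentation ranges standing in for a vendor's published list. They are placeholders, and in practice you would load the current ranges from the vendor's own endpoint.

```python
import ipaddress

# Placeholder ranges (reserved for documentation), NOT real vendor ranges.
PLACEHOLDER_RANGES = ["192.0.2.0/24", "2001:db8::/32"]

def ip_in_published_ranges(client_ip: str, cidr_ranges: list[str]) -> bool:
    """Return True if the client IP falls inside any of the given CIDR blocks."""
    address = ipaddress.ip_address(client_ip)
    for cidr in cidr_ranges:
        network = ipaddress.ip_network(cidr, strict=False)
        # An IPv4 address tested against an IPv6 network (or vice versa) is simply False.
        if address in network:
            return True
    return False

def trust_level(token_matched: bool, client_ip: str, cidr_ranges: list[str]) -> str:
    """Combine the token match with the network check into a simple label."""
    if not token_matched:
        return "unclassified"
    return "verified" if ip_in_published_ranges(client_ip, cidr_ranges) else "claimed-only"

print(trust_level(True, "192.0.2.44", PLACEHOLDER_RANGES))
print(trust_level(True, "203.0.113.9", PLACEHOLDER_RANGES))
```

The `claimed-only` label is the interesting one: the string matched a known crawler, but the source network did not, which is the typical signature of a copied user agent.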

Step 2: inspect paths and response codes

The paths matter more than aggregate hits. A verified search crawler requesting canonical articles, product pages, and documentation can be useful. The same crawler hammering parameter URLs, old redirects, or internal search pages may need path restrictions. Look for 403, 404, 429, 500, and redirect chains before concluding that a crawler is helpful or harmful.

# Useful log review fields
timestamp
client_ip
host
method
path
status
bytes_sent
user_agent
waf_action
cache_status
referrer
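With those fields available, a path-level review can be sketched as below. It assumes logs are exported as JSON lines using the field names listed above. Real exports vary by CDN and server, and the sample records are invented for illustration.

```python
import json
from collections import Counter, defaultdict

WATCHED_TOKENS = ["OAI-SearchBot", "GPTBot", "Claude-SearchBot", "PerplexityBot"]
PROBLEM_STATUSES = {403, 404, 429, 500}

def summarize_problem_responses(log_lines):
    """Count (status, path) pairs per crawler token, keeping only problem responses."""
    summary = defaultdict(Counter)
    for line in log_lines:
        record = json.loads(line)
        user_agent = record.get("user_agent", "").lower()
        token = next((t for t in WATCHED_TOKENS if t.lower() in user_agent), None)
        if token and record.get("status") in PROBLEM_STATUSES:
            summary[token][(record["status"], record["path"])] += 1
    return summary

# Invented sample records for illustration.
sample_lines = [
    '{"path": "/blog/post", "status": 200, "user_agent": "ExampleUA GPTBot/1.0"}',
    '{"path": "/search?q=a", "status": 403, "user_agent": "ExampleUA OAI-SearchBot/1.0"}',
    '{"path": "/search?q=a", "status": 403, "user_agent": "ExampleUA OAI-SearchBot/1.0"}',
    '{"path": "/old", "status": 404, "user_agent": "Mozilla/5.0"}',
]

for token, counts in summarize_problem_responses(sample_lines).items():
    for (status, path), hits in counts.most_common():
        print(f"{token} {status} {path} x{hits}")
```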

Step 3: keep an AI crawler change log

Every policy edit should have an owner, date, reason, affected user agents, affected paths, and expected outcome. Without a change log, the team will rediscover the same problem whenever a CDN, CMS, firewall, or framework update changes generated robots.txt behavior. The workflow should resemble the release logging in our log file analysis for crawl budget guide and the measurement discipline in our SEO dashboards and KPI model.
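If the change log lives alongside the code, each entry can be a small structured record. The following is one possible shape, with field names chosen for this example to mirror the list above. The sample values are invented.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date

@dataclass(frozen=True)
class CrawlerPolicyChange:
    owner: str
    changed_on: date
    reason: str
    affected_user_agents: tuple[str, ...]
    affected_paths: tuple[str, ...]
    expected_outcome: str

    def to_json(self) -> str:
        """Serialize the entry as one JSON line suitable for an append-only log."""
        payload = asdict(self)
        payload["changed_on"] = self.changed_on.isoformat()
        return json.dumps(payload, sort_keys=True)

# Invented example entry.
entry = CrawlerPolicyChange(
    owner="seo-platform",
    changed_on=date(2026, 5, 26),
    reason="Opt out of training while keeping search crawlers open",
    affected_user_agents=("GPTBot",),
    affected_paths=("/",),
    expected_outcome="GPTBot requests stop; OAI-SearchBot traffic unchanged",
)
print(entry.to_json())
```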

What is the safest AI crawler policy for common site types?

The same AI crawler user agents can imply different policies by site type. A B2B SaaS documentation hub may want maximum retrieval because product answers in ChatGPT and Claude can support demand creation. A paid publisher may care more about licensing control. An ecommerce site may want product discovery but stricter protection around pricing APIs, faceted navigation, and cart endpoints. A regulated portal may need authentication and noindex controls before any bot policy matters.

Site Type	Better Policy Shape	Internal Link to Review
B2B SaaS and docs	Allow search and user retrieval; decide training controls separately	writing for AI answers
Ecommerce	Allow product discovery crawlers; restrict carts, filters, and internal search	product schema for ecommerce
Editorial publishers	Balance citation upside with training, licensing, and paywall policy	adding citations to content
Regulated or gated content	Use authentication, noindex, and WAF controls before relying on robots.txt	YMYL AI content standards
Large programmatic sites	Allow canonical inventory; block crawl traps, duplicates, and infinite URL spaces	pagination and infinite scroll SEO

If leadership asks for a binary recommendation, reframe the question. The real decision is not whether AI crawlers are good or bad. The real decision is which crawlers support the business outcomes you want, which crawlers create unacceptable data-use or capacity risk, and which requests cannot be trusted from identity alone.

How often should you update an AI crawler list?

Review the list monthly, and review it immediately after major vendor documentation changes, CDN bot-management changes, product launches, or unexplained log spikes. The May 2026 crawler landscape is changing too quickly for a set-and-forget robots.txt file. Search crawlers, training controls, user fetchers, and agent products are being documented more explicitly, but that also means old policies become stale faster.

Keep a source-of-truth table

Your table should include user-agent token, full user-agent string if published, operator, purpose, official documentation URL, IP verification method, current policy, owner, and last review date. That table belongs in the same governance system as your sitemap, schema templates, redirects, and robots.txt route handler.
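A minimal sketch of that table as data, with a check for rows that have gone too long without review. The 31-day threshold is an assumption for illustration, and the URLs and owners are placeholders.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

@dataclass
class CrawlerRecord:
    token: str
    full_user_agent: Optional[str]  # None when the vendor does not publish one
    operator: str
    purpose: str
    documentation_url: str
    verification_method: str
    current_policy: str
    owner: str
    last_reviewed: date

def stale_records(records, today: date, max_age: timedelta = timedelta(days=31)):
    """Return tokens whose last review is older than max_age."""
    return [r.token for r in records if today - r.last_reviewed > max_age]

# Placeholder rows; URLs, owners, and dates are invented.
table = [
    CrawlerRecord("GPTBot", None, "OpenAI", "training", "https://example.com/docs/gptbot",
                  "published IP ranges", "disallow", "legal", date(2026, 4, 1)),
    CrawlerRecord("OAI-SearchBot", None, "OpenAI", "search", "https://example.com/docs/searchbot",
                  "published IP ranges", "allow", "seo", date(2026, 5, 10)),
]

print(stale_records(table, today=date(2026, 5, 26)))
```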

Monitor for new agents before writing new blocks

When a new token appears in logs, avoid rushing to a sitewide block. First inspect request volume, paths, response codes, and source network. Then search for official documentation. If no reputable documentation exists and the behavior is expensive or suspicious, treat it as an infrastructure problem rather than an SEO opportunity.
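Spotting a new token in the first place can be approximated by counting user-agent strings that look automated but match nothing in your known list. The keyword hints below are a crude heuristic chosen for this sketch, and the sample strings are invented.

```python
from collections import Counter

def unknown_bot_agents(user_agents, known_tokens, hints=("bot", "crawler", "spider")):
    """Count user-agent strings that look automated but match no known token."""
    counts = Counter()
    for user_agent in user_agents:
        lowered = user_agent.lower()
        if any(token.lower() in lowered for token in known_tokens):
            continue  # already covered by the maintained table
        if any(hint in lowered for hint in hints):
            counts[user_agent] += 1
    return counts.most_common()

# Invented sample strings.
seen = [
    "ExampleUA GPTBot/1.0",
    "NewThingBot/0.1",
    "NewThingBot/0.1",
    "Mozilla/5.0 Firefox/126.0",
]
print(unknown_bot_agents(seen, known_tokens=["GPTBot"]))
```

The output is a review queue, not a blocklist: each entry still needs the volume, path, response-code, and source-network checks described above.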

Use content strategy to decide crawler value

Crawler access only matters when the content is worth retrieving. If pages are thin, duplicative, poorly cited, or technically unstable, letting every search crawler in will not create durable AI visibility. Pair this crawler reference with the answer engine optimization checklist and structured data playbook so access, evidence, and page clarity improve together.

Updated May 26, 2026.