AI Crawler Visibility Checker
Find out whether ChatGPT search, GPTBot, Claude, Perplexity, and classic search bots can access your site — and generate a clean robots.txt + llms.txt based on your goals.
Check a website
Enter a homepage or any URL. We’ll inspect robots.txt + on-page blocking signals, then run safe HEAD checks with bot user agents.
How to control AI crawlers (without harming SEO)
Site owners are increasingly choosing a policy like: allow AI search (so you can be cited/linked), but block training crawlers (so your content isn’t collected for broad model training). The safest place to start is robots.txt — then add WAF/CDN enforcement if you need stronger control.
AI Search vs Training vs User-initiated retrieval
Not all “AI bots” are the same. Some are associated with search-style retrieval; others are training crawlers; others fetch pages only when a user asks. Your policy can allow one category and block another.
- AI Search: you may want this for discoverability and citations.
- Training crawlers: many sites choose to block these.
- User-initiated retrieval: a user action may trigger a fetch even when broad crawling is restricted.
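As a rough illustration of these three categories, the sketch below maps the bot tokens used in this page's templates and glossary to a category. The table is a snapshot taken from this page, not an official registry, and the names `BOT_CATEGORIES` and `category_for` are hypothetical.

```python
from typing import Optional

# Token-to-category table based on the templates and glossary on this page.
# Vendors can add or rename tokens, so treat this as a snapshot.
BOT_CATEGORIES = {
    "OAI-SearchBot": "ai_search",
    "Claude-SearchBot": "ai_search",
    "PerplexityBot": "ai_search",
    "GPTBot": "training",
    "ClaudeBot": "training",
    "ChatGPT-User": "user_initiated",
    "Claude-User": "user_initiated",
    "Perplexity-User": "user_initiated",
}


def category_for(user_agent: str) -> Optional[str]:
    """Return the category of the first known token found in a User-Agent string."""
    ua = user_agent.lower()
    for token, category in BOT_CATEGORIES.items():
        if token.lower() in ua:
            return category
    return None  # unknown or non-AI client


print(category_for("GPTBot"))         # training
print(category_for("OAI-SearchBot"))  # ai_search
```

A table like this makes it easier to apply one rule per category rather than one rule per bot.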
Common mistakes that accidentally tank rankings
- Blocking User-agent: * with Disallow: / (this can block Google/Bing too).
- Returning 403 / 429 for bot traffic due to WAF rules or rate limits.
- Sending X-Robots-Tag: noindex or meta robots noindex on templates.
- Using inconsistent canonicals (a canonical URL whose host or protocol differs from the URL your redirects resolve to).
This tool checks those signals so you can fix the real blocker, not guess.
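One of those signals, a stray noindex, can be detected from a response you have already fetched. The sketch below is a minimal example, assuming you pass in the response headers as a dict and the page HTML as a string; `find_noindex` is a hypothetical helper, not this tool's implementation, and it only looks at the generic `robots` meta name.

```python
from html.parser import HTMLParser
from typing import Dict, List


class _RobotsMetaParser(HTMLParser):
    """Collects the content attribute of every <meta name="robots"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() == "robots":
            self.directives.append(attr.get("content", "").lower())


def find_noindex(headers: Dict[str, str], html: str) -> List[str]:
    """Return where a noindex directive was found: "header", "meta", or both."""
    found = []
    for name, value in headers.items():
        if name.lower() == "x-robots-tag" and "noindex" in value.lower():
            found.append("header")
            break
    parser = _RobotsMetaParser()
    parser.feed(html)
    if any("noindex" in directive for directive in parser.directives):
        found.append("meta")
    return found


print(find_noindex({"X-Robots-Tag": "noindex, nofollow"}, "<html></html>"))
```

An empty list means neither signal was seen in that particular response.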
Fast verification checklist
- robots.txt is reachable and correct.
- Homepage returns 200/301 (not 403/429) for intended bots.
- No noindex in headers or meta tags.
- WAF allows your chosen bots (or at least doesn’t blanket block them).
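The checklist above can be expressed as a small function over results you have already collected. This is a sketch under assumptions: `SiteObservation` and `checklist_failures` are hypothetical names, the status codes come from your own checks, and the function does not judge whether the robots.txt rules themselves are correct.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SiteObservation:
    robots_status: int               # HTTP status returned for /robots.txt
    homepage_status: Dict[str, int]  # homepage status per bot you intend to allow
    noindex_found: bool              # True if a noindex header or meta tag was seen


def checklist_failures(obs: SiteObservation) -> List[str]:
    """Return one message per failed checklist item (empty list means all passed)."""
    failures = []
    if obs.robots_status != 200:
        failures.append(f"robots.txt returned {obs.robots_status}")
    for bot, status in sorted(obs.homepage_status.items()):
        if status in (403, 429):
            failures.append(f"{bot} blocked at the edge with {status}")
        elif status not in (200, 301):
            failures.append(f"{bot} got unexpected status {status}")
    if obs.noindex_found:
        failures.append("noindex present in headers or meta tags")
    return failures


obs = SiteObservation(200, {"OAI-SearchBot": 200, "Googlebot": 403}, False)
print(checklist_failures(obs))
```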
Robots.txt templates library (copy/paste)
Below are practical patterns for the most common policies. They are safe starting points, but always tailor them to your site and policy. Tip: avoid Disallow: / under User-agent: * unless you intentionally want to block most crawling.
Policy A: Allow AI search, block training (recommended for many publishers)
# Allow classic search
User-agent: *
Disallow:

# Allow AI search
User-agent: OAI-SearchBot
Disallow:

User-agent: Claude-SearchBot
Disallow:

# Block common training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

# Optional: restrict other large crawlers
User-agent: CCBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
Use when you want visibility/citations but don’t want broad training crawling.
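Before publishing a policy like this, it can help to check how a parser reads it. The sketch below feeds Policy A to Python's standard `urllib.robotparser` and reports which bot tokens may fetch the homepage. Real crawlers use their own parsers, so treat the result as a sanity check rather than a guarantee; `allowed_bots` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

POLICY_A = """\
User-agent: *
Disallow:

User-agent: OAI-SearchBot
Disallow:

User-agent: Claude-SearchBot
Disallow:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""


def allowed_bots(robots_txt, bots, url="https://example.com/"):
    """Map each bot token to True if the parsed rules let it fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}


result = allowed_bots(
    POLICY_A,
    ["Googlebot", "OAI-SearchBot", "Claude-SearchBot", "GPTBot", "ClaudeBot", "CCBot"],
)
for bot, ok in result.items():
    print(f"{bot:18} {'allowed' if ok else 'blocked'}")
```

With this policy, the search bots should come out allowed and the three training or bulk crawlers blocked.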
Policy B: Block all AI bots (while keeping SEO)
# Keep search engines allowed
User-agent: *
Disallow:

# Block common AI bots
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-SearchBot
Disallow: /

User-agent: Claude-User
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

Sitemap: https://example.com/sitemap.xml
Use when you want to opt out of AI crawling entirely.
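A long block list like Policy B is easier to maintain when it is generated from a list of tokens. The following is a minimal sketch, not this tool's generator: `build_robots_txt` is a hypothetical function that keeps `User-agent: *` open and writes one `Disallow: /` group per listed bot.

```python
from typing import Iterable, Optional


def build_robots_txt(blocked_bots: Iterable[str], sitemap_url: Optional[str] = None) -> str:
    """Render a robots.txt that leaves `*` open and disallows each listed bot."""
    lines = ["User-agent: *", "Disallow:", ""]
    for bot in blocked_bots:
        lines += [f"User-agent: {bot}", "Disallow: /", ""]
    if sitemap_url:
        lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines).rstrip("\n") + "\n"


# Bot tokens taken from Policy B on this page.
POLICY_B_BOTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "Perplexity-User",
]

robots_txt = build_robots_txt(POLICY_B_BOTS, "https://example.com/sitemap.xml")
print(robots_txt)
```

Passing a shorter list (for example only the training crawlers) yields a Policy A style file from the same function.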
Policy C: Allow AI + Search everywhere
User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml
Use when you’re fully open to crawling by all compliant bots.
Policy D: Block only sensitive areas
# Block sensitive/private areas for everyone
User-agent: *
Disallow: /account/
Disallow: /checkout/
Disallow: /admin/
Disallow: /api/

Sitemap: https://example.com/sitemap.xml
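Best practice: disallow areas that no crawler should fetch. robots.txt is publicly readable and only discourages crawling, so protect truly private areas with authentication as well.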
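To confirm which paths a rule set like Policy D actually blocks, you can test a few sample URLs against it. The sketch below uses Python's standard `urllib.robotparser`; the sample paths and the `blocked_paths` helper are illustrative assumptions. It lists only the path rules under `User-agent: *`, because some parsers stop at the first matching rule and may read a leading empty `Disallow:` as allow-all.

```python
from urllib.robotparser import RobotFileParser

POLICY_D = """\
User-agent: *
Disallow: /account/
Disallow: /checkout/
Disallow: /admin/
Disallow: /api/

Sitemap: https://example.com/sitemap.xml
"""


def blocked_paths(robots_txt, paths, bot="Googlebot"):
    """Return the subset of `paths` that the parsed rules disallow for `bot`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [path for path in paths if not parser.can_fetch(bot, path)]


sample = ["/", "/guides/seo", "/admin/users", "/api/v1/items"]
print(blocked_paths(POLICY_D, sample))
```

Only the paths under the disallowed prefixes should appear in the output; public pages stay crawlable.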
Cloudflare / WAF blocks (403, 429): what it usually means
If your “Homepage test” shows 403 Forbidden or 429 Too Many Requests for bot-style user agents, robots.txt may not be the issue — your WAF/rate limiting is.
403 Forbidden
Often caused by bot protection rules, Cloudflare’s Bot Fight Mode, country blocks, or strict firewall rules that treat bots as threats.
- Allowlist verified bots if that’s your policy.
- Reduce false positives: block bad paths rather than whole UAs.
- Ensure your own server isn’t returning 403 for HEAD requests.
429 Too Many Requests
Usually rate limiting. Even a few diagnostic requests can trigger it if your limits are low.
- Increase rate limits for good bots you want to allow.
- Cache your homepage and robots.txt aggressively.
- Consider bot-specific rules rather than global throttles.
What to do next
Use the templates above to set intent (robots.txt), then adjust WAF rules so your “allowed” bots actually receive 200/301 responses.
If you block bots at the WAF layer, robots.txt alone won’t help.
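To see what your edge actually returns, you can send a single HEAD request with a bot-style User-Agent and interpret the status. The sketch below uses only the Python standard library. It is a simplification: it sends the bare token (for example `GPTBot`) rather than the full User-Agent string a real crawler sends, so a WAF that verifies bots by IP or full string may respond differently. The function names are hypothetical.

```python
import urllib.error
import urllib.request


def classify_status(status: int) -> str:
    """Translate a status code into the likely cause described above."""
    if status in (200, 301):
        return "ok"
    if status == 403:
        return "blocked: check WAF or bot-protection rules"
    if status == 429:
        return "rate limited: check rate-limit thresholds"
    return "other: inspect manually"


def build_head_request(url: str, user_agent: str) -> urllib.request.Request:
    """Build (but do not send) a HEAD request carrying the given User-Agent."""
    return urllib.request.Request(url, method="HEAD", headers={"User-Agent": user_agent})


def head_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Send one HEAD request and return the status code of the final response."""
    request = build_head_request(url, user_agent)
    try:
        # urlopen follows redirects, so a redirecting homepage reports its target's status.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses are raised as exceptions; the code is what we want.
        return error.code
```

For example, `classify_status(head_status("https://example.com/", "GPTBot"))` returns a one-line verdict for that bot. Keep the number of requests small, since even a few probes can trip a low rate limit.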
What is llms.txt and why people add it
llms.txt is a simple, human-readable file that points AI systems at your most useful pages (and away from noise). It’s not a formal standard, and AI systems are not required to read it, but it’s popular because it’s easy to add and may improve how AI retrieval systems discover your key pages.
Good llms.txt contents
- Primary pages (home, docs, pricing, key guides)
- Sitemap link
- Preferred canonical patterns
- Notes about blocked areas
This tool generates a safe draft you can edit and publish at /llms.txt.
Example llms.txt
# llms.txt
> Purpose: Help AI systems find the best pages.

## Primary
- https://example.com/
- https://example.com/sitemap.xml

## Best resources
- https://example.com/tools/
- https://example.com/guides/

## Notes
- Prefer canonical URLs
- Avoid crawling /account/ and /admin/
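If you maintain the page list elsewhere, a file in this shape can be rendered from a simple mapping. The sketch below is one possible approach, not this tool's generator; `build_llms_txt` is a hypothetical helper and the layout simply mirrors the example above.

```python
from typing import Dict, List


def build_llms_txt(purpose: str, sections: Dict[str, List[str]]) -> str:
    """Render an llms.txt draft: a title, a purpose line, then one list per section."""
    lines = ["# llms.txt", f"> Purpose: {purpose}", ""]
    for heading, items in sections.items():
        lines.append(f"## {heading}")
        lines += [f"- {item}" for item in items]
        lines.append("")
    return "\n".join(lines).rstrip("\n") + "\n"


llms_txt = build_llms_txt(
    "Help AI systems find the best pages.",
    {
        "Primary": ["https://example.com/", "https://example.com/sitemap.xml"],
        "Best resources": ["https://example.com/tools/", "https://example.com/guides/"],
        "Notes": ["Prefer canonical URLs", "Avoid crawling /account/ and /admin/"],
    },
)
print(llms_txt)
```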
Bot glossary (quick meanings)
- OAI-SearchBot: AI search-style retrieval bot token.
- GPTBot: commonly associated with broad crawling for training datasets.
- ChatGPT-User: user-initiated retrieval token (fetching triggered by a user action).
- Claude-SearchBot / ClaudeBot: Anthropic equivalents for search vs training.
- PerplexityBot: Perplexity crawler token (treat like AI search; enforce via WAF if needed).
- Googlebot / bingbot: classic search engine indexing bots.
FAQ
What is the difference between GPTBot and OAI-SearchBot?
They’re different user agents. Sites can allow one and block the other using robots.txt. OAI-SearchBot is typically associated with search-style retrieval, while GPTBot is commonly associated with broader crawling for model improvement datasets. Your policy may differ depending on your goals.
If I block AI bots in robots.txt, can they still access my site?
Yes, in some cases. robots.txt is a voluntary directive that only compliant crawlers honor. Some systems may still fetch pages when a user explicitly requests them (using different user agents), and non-compliant bots may ignore robots.txt entirely. If you need stronger enforcement, you typically also use WAF/firewall rules.
Will allowing AI search bots help me appear in AI answers?
Potentially. If an AI search system can crawl and retrieve your pages, it may be more likely to cite and link to you. However, inclusion also depends on relevance, quality, and whether your content is accessible at crawl time.
What is llms.txt and do I need it?
llms.txt is a simple, human-readable file that can point LLMs/AI systems to the most useful sections of your site. It isn’t a formal web standard, but some sites add it because it’s easy and can improve how AI retrieval systems discover key pages.
Why do AI bots get 403 or 429 errors on Cloudflare?
403/429 responses are usually WAF or rate-limiting behavior rather than robots.txt. If you intend to allow a bot, you may need to adjust firewall rules, bot protections, or rate limits so allowed bots receive normal 200/301 responses.
How do I know if I’m blocking AI crawlers on my website?
Check your robots.txt for bot-specific Disallow rules, then verify what your server actually returns to bot-style requests (status codes like 403/429, and headers like X-Robots-Tag). This tool combines robots.txt evaluation with real response checks.
What robots.txt should I use to block AI training but allow AI search?
A common approach is to allow AI search user agents while disallowing training crawlers (for example, allow OAI-SearchBot but disallow GPTBot). Always avoid blocking User-agent: * unless you want to block most crawlers, including search engines.
Does robots.txt affect Google rankings or indexing?
Yes. If you block important sections (or block User-agent: * with Disallow: /), search engines may not crawl your pages, which can prevent indexing and harm visibility. That’s why it’s safer to use targeted rules instead of blanket blocks.
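A quick way to catch an accidental blanket block is to ask a parser whether a crawler with no group of its own may fetch the site root. This is a minimal sketch using Python's standard `urllib.robotparser`; `blocks_unlisted_crawlers` and the `UnlistedBot` token are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def blocks_unlisted_crawlers(robots_txt: str) -> bool:
    """True if a crawler that only matches the `*` group is disallowed from "/"."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # "UnlistedBot" is a made-up token, so only the `User-agent: *` group applies to it.
    return not parser.can_fetch("UnlistedBot", "/")


print(blocks_unlisted_crawlers("User-agent: *\nDisallow: /\n"))
print(blocks_unlisted_crawlers("User-agent: *\nDisallow:\n\nUser-agent: GPTBot\nDisallow: /\n"))
```

A True result means search engines without their own group are likely being kept out as well.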
Why is robots.txt allowed but the bot still can’t access my site?
Because robots.txt is only one layer. WAF/CDN rules, authentication, geo-blocking, bot protection, or rate limits can still return 403/429. Also, some servers treat HEAD requests differently than GET requests.
What’s the safest way to block bots without breaking SEO?
Start by keeping User-agent: * open (or only blocking truly private paths like /admin/ and /account/), then add bot-specific rules for AI crawlers you want to restrict. If you need enforcement, apply WAF rules that target specific user agents and paths rather than blanket blocks.