AI Crawler Visibility Checker

Find out whether ChatGPT search, GPTBot, Claude, Perplexity, and classic search bots can access your site — and generate a clean robots.txt + llms.txt based on your goals.

Check a website

Enter a homepage or any URL. We’ll inspect robots.txt + on-page blocking signals, then run safe HEAD checks with bot user agents.

Test homepage with bot UAs • Fetch /robots.txt • Check /llms.txt

Updated 2026-01-29 • This tool is diagnostic (no bypassing, no scraping).

What you’ll get

Bot access matrix • Allow/Block by bot

Robots fixes • Copy/paste templates

Blocking signals • noindex / headers / 403 / 429

llms.txt draft • quick AI-friendly map

Results

Bot access matrix (robots.txt evaluation)

Bot	Category	Robots access	Homepage test	Notes

“Robots access” is evaluated from robots.txt rules. “Homepage test” is a server-side HEAD request with that bot’s user agent.
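
The "Robots access" column can be approximated offline with Python's standard `urllib.robotparser`. This is a minimal sketch, not the tool's actual implementation: the function name `robots_access` and the sample rules are assumptions made for illustration, and the standard-library parser may differ in edge cases from how individual crawlers interpret robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules (assumption); in practice this would be the fetched robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow:

User-agent: GPTBot
Disallow: /
"""


def robots_access(robots_txt, bots, url="https://example.com/"):
    """Return {bot: "Allow" | "Block"} for one URL, judged from robots.txt rules only."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: "Allow" if parser.can_fetch(bot, url) else "Block" for bot in bots}


matrix = robots_access(SAMPLE_ROBOTS_TXT, ["Googlebot", "GPTBot", "OAI-SearchBot"])
```

This only answers what the rules say. It does not tell you what the server returns to the bot, which is what the "Homepage test" column covers.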

Recommended robots.txt (preset)

Choose a goal and copy/paste. These are templates; tailor them to your policy.

Suggested llms.txt (draft)

Simple AI-friendly map. Not a formal standard.

Raw robots.txt (fetched)

How to control AI crawlers (without harming SEO)

Site owners are increasingly choosing a policy like: allow AI search (so you can be cited/linked), but block training crawlers (so your content isn’t collected for broad model training). The safest place to start is robots.txt — then add WAF/CDN enforcement if you need stronger control.

AI Search vs Training vs User-initiated retrieval

Not all “AI bots” are the same. Some are associated with search-style retrieval; others are training crawlers; others fetch pages only when a user asks. Your policy can allow one category and block another.

AI Search: you may want this for discoverability and citations.
Training crawlers: many sites choose to block these.
User-initiated retrieval: a user action may trigger a fetch even when broad crawling is restricted.
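
The three categories above can be expressed as a small lookup table, which makes a per-category policy easy to apply. This is a sketch under assumptions: the category labels and the helper names `category_of` and `is_allowed` are hypothetical, and the placement of `Claude-User` and `Perplexity-User` under user-initiated retrieval is inferred from their names rather than stated in this page's glossary.

```python
# Token -> category. Labels are illustrative, not an official taxonomy.
BOT_CATEGORIES = {
    "OAI-SearchBot": "ai_search",
    "Claude-SearchBot": "ai_search",
    "PerplexityBot": "ai_search",
    "GPTBot": "training",
    "ClaudeBot": "training",
    "ChatGPT-User": "user_initiated",
    "Claude-User": "user_initiated",      # assumption based on the token name
    "Perplexity-User": "user_initiated",  # assumption based on the token name
    "Googlebot": "classic_search",
    "bingbot": "classic_search",
}

_LOOKUP = {name.lower(): category for name, category in BOT_CATEGORIES.items()}


def category_of(token):
    """Look up a crawler token case-insensitively; unknown tokens map to 'unknown'."""
    return _LOOKUP.get(token.lower(), "unknown")


def is_allowed(token, blocked_categories):
    """A bot is allowed unless its category is in the blocked set."""
    return category_of(token) not in blocked_categories
```

For example, `is_allowed("OAI-SearchBot", {"training"})` is true while `is_allowed("GPTBot", {"training"})` is false, which matches the "allow AI search, block training" policy.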

Common mistakes that accidentally tank rankings

Blocking User-agent: * with Disallow: / (this blocks every compliant crawler that has no group of its own, including Googlebot and bingbot).
Returning 403 / 429 for bot traffic due to WAF rules or rate limits.
Sending X-Robots-Tag: noindex or meta robots noindex on templates.
Using inconsistent canonicals (different host/protocol) during redirects.

This tool checks those signals so you can fix the real blocker, not guess.
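
As a rough illustration of how such signals can be detected from a single response, the sketch below inspects a status code, response headers, and (optionally) the HTML body. The function name `blocking_signals` and the signal labels are hypothetical, and the regex is a simplification that assumes a conventional `<meta name="robots" ...>` tag. It does not cover canonical mismatches.

```python
import re

# Matches <meta ... name="robots" ...> tags; a simplification, not a full HTML parser.
META_ROBOTS = re.compile(r'''<meta\b[^>]*\bname\s*=\s*["']robots["'][^>]*>''', re.IGNORECASE)


def blocking_signals(status, headers, html=""):
    """Return a list of labels for blocking signals found in one HTTP response."""
    signals = []
    if status in (403, 429):
        signals.append(f"status-{status}")

    # Header names are case-insensitive, so compare in lowercase.
    x_robots = next((v for k, v in headers.items() if k.lower() == "x-robots-tag"), "")
    if "noindex" in x_robots.lower():
        signals.append("x-robots-tag-noindex")

    # The meta check needs a response body, so it only applies to GET responses.
    if any("noindex" in tag.lower() for tag in META_ROBOTS.findall(html)):
        signals.append("meta-robots-noindex")
    return signals
```

An empty list means none of these particular signals were found. It does not prove the page is crawlable.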

Fast verification checklist

robots.txt is reachable and correct.
Homepage returns 200/301 (not 403/429) for intended bots.
No noindex in headers or meta tags.
WAF allows your chosen bots (or at least doesn’t blanket block them).
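
The second checklist item can be spot-checked with a HEAD request that carries a bot-style user agent. The sketch below uses only the standard library. The helper names `build_head_request` and `head_status` and the example user-agent string are assumptions for illustration; real crawlers send their own documented UA strings, and a WAF may also verify source IPs, so a matching UA alone does not reproduce a real bot visit.

```python
import urllib.error
import urllib.request

# Illustrative UA containing a bot token (assumption, not an official string).
EXAMPLE_BOT_UA = "Mozilla/5.0 (compatible; GPTBot/1.0)"


def build_head_request(url, user_agent):
    """Build (but do not send) a HEAD request with the given User-Agent."""
    return urllib.request.Request(url, method="HEAD", headers={"User-Agent": user_agent})


def head_status(url, user_agent, timeout=10.0):
    """Send the HEAD request and return the final HTTP status code.

    urllib follows redirects by default, so a redirect chain is reported
    by its final status rather than as 301.
    """
    request = build_head_request(url, user_agent)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 403, 429, and other error statuses arrive as exceptions.
        return error.code
```

Calling `head_status("https://example.com/", EXAMPLE_BOT_UA)` against your own site returns the status code to compare with the checklist.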

Robots.txt templates library (copy/paste)

Below are practical patterns people search for. These are safe starting points, but always tailor them to your site and policy. Tip: avoid blocking User-agent: * unless you’re intentionally blocking most crawling.

Policy A: Allow AI search, block training (recommended for many publishers)

# Default: allow all crawlers (including classic search)
User-agent: *
Disallow:

# Allow AI search
User-agent: OAI-SearchBot
Disallow:

User-agent: Claude-SearchBot
Disallow:

# Block common training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

# Optional: restrict other large crawlers
User-agent: CCBot
Disallow: /

Sitemap: https://example.com/sitemap.xml

Use when you want visibility/citations but don’t want broad training crawling.
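
A template like this can also be generated from two lists of bot tokens. The following is a minimal sketch, assuming a hypothetical helper `build_robots_txt`; it is not how this tool produces its presets.

```python
def build_robots_txt(allowed_bots, blocked_bots, sitemap_url=""):
    """Build a robots.txt with an open default group plus per-bot groups."""
    # (user agent, disallow path): an empty path means "nothing is disallowed".
    groups = [("*", "")]
    groups += [(bot, "") for bot in allowed_bots]
    groups += [(bot, "/") for bot in blocked_bots]

    lines = []
    for agent, path in groups:
        lines += [f"User-agent: {agent}", f"Disallow: {path}".rstrip(), ""]
    if sitemap_url:
        lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines).rstrip("\n") + "\n"


policy_a = build_robots_txt(
    allowed_bots=["OAI-SearchBot", "Claude-SearchBot"],
    blocked_bots=["GPTBot", "ClaudeBot", "CCBot"],
    sitemap_url="https://example.com/sitemap.xml",
)
```

The explicit groups for allowed bots are redundant while `User-agent: *` is fully open, but they document intent and keep those bots allowed if path restrictions are later added to the default group.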

Policy B: Block all AI bots (while keeping SEO)

# Keep search engines allowed
User-agent: *
Disallow:

# Block common AI bots
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-SearchBot
Disallow: /

User-agent: Claude-User
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

Sitemap: https://example.com/sitemap.xml

Use when you want to opt out of AI crawling entirely.

Policy C: Allow AI + Search everywhere

User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml

Use when you’re fully open to crawling by all compliant bots.

Policy D: Block only sensitive areas

# Allow everything except sensitive/private areas
# (applies to every crawler that has no group of its own)
User-agent: *
Disallow: /account/
Disallow: /checkout/
Disallow: /admin/
Disallow: /api/

Sitemap: https://example.com/sitemap.xml

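Use when crawlers should stay out of specific paths. Disallow only stops compliant crawlers from fetching those URLs; it is not access control, so keep authentication on anything truly private.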

Cloudflare / WAF blocks (403, 429): what it usually means

If your “Homepage test” shows 403 Forbidden or 429 Too Many Requests for bot-style user agents, the likely cause is your WAF or rate limiting rather than robots.txt.

403 Forbidden

Often caused by bot protection rules, Cloudflare’s Bot Fight Mode, country blocks, or strict firewall rules that treat bots as threats.

Allowlist verified bots if that’s your policy.
Reduce false positives: block bad paths rather than whole UAs.
Ensure your own server isn’t returning 403 for HEAD requests.

429 Too Many Requests

Usually rate limiting. Even a few diagnostic requests can trigger it if your limits are low.

Increase rate limits for good bots you want to allow.
Cache your homepage and robots.txt aggressively.
Consider bot-specific rules rather than global throttles.

What to do next

Use the templates above to set intent (robots.txt), then adjust WAF rules so your “allowed” bots actually receive 200/301 responses.

If you block bots at the WAF layer, robots.txt alone won’t help.
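
The two layers (robots.txt intent and server enforcement) can be combined into one finding per bot. This is a sketch of that reasoning only; the function name `diagnose` and the wording of the findings are hypothetical.

```python
def diagnose(robots_allows, status):
    """Combine the robots.txt verdict with the homepage status into one finding."""
    server_ok = status in (200, 301)
    if robots_allows and server_ok:
        return "reachable"
    if robots_allows and status == 403:
        return "robots.txt allows, but WAF/firewall blocks (403)"
    if robots_allows and status == 429:
        return "robots.txt allows, but rate limiting blocks (429)"
    if not robots_allows and server_ok:
        return "blocked by robots.txt only (not enforced for non-compliant bots)"
    if not robots_allows and status in (403, 429):
        return "blocked by robots.txt and enforced at the server"
    return "unexpected status, check manually"
```

The second and third findings are the cases where editing robots.txt will not help and the WAF or rate-limit rules need adjusting.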

What is llms.txt and why people add it

llms.txt is a simple, human-readable file that points AI systems at your most useful pages (and away from noise). It’s not a formal standard, but it’s popular because it’s easy to add and may help AI retrieval systems find your key pages first.

Good llms.txt contents

Primary pages (home, docs, pricing, key guides)
Sitemap link
Preferred canonical patterns
Notes about blocked areas

This tool generates a safe draft you can edit and publish at /llms.txt.

Example llms.txt

# llms.txt

> Purpose: Help AI systems find the best pages.

## Primary
- https://example.com/
- https://example.com/sitemap.xml

## Best resources
- https://example.com/tools/
- https://example.com/guides/

## Notes
- Prefer canonical URLs
- Avoid crawling /account/ and /admin/
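
A draft in this shape can be generated from a purpose line and a mapping of section headings to items. This is a minimal sketch: `build_llms_txt` is a hypothetical helper, and the layout simply mirrors the example on this page rather than any formal specification.

```python
def build_llms_txt(purpose, sections):
    """Render a llms.txt draft: title, purpose line, then one list per section."""
    lines = ["# llms.txt", "", f"> Purpose: {purpose}", ""]
    for heading, items in sections.items():  # dicts keep insertion order
        lines.append(f"## {heading}")
        lines += [f"- {item}" for item in items]
        lines.append("")
    return "\n".join(lines).rstrip("\n") + "\n"


draft = build_llms_txt(
    "Help AI systems find the best pages.",
    {
        "Primary": ["https://example.com/", "https://example.com/sitemap.xml"],
        "Notes": ["Prefer canonical URLs", "Avoid crawling /account/ and /admin/"],
    },
)
```

Review the generated draft by hand before publishing it at /llms.txt.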

Bot glossary (quick meanings)

OAI-SearchBot: AI search-style retrieval bot token.
GPTBot: commonly associated with broad crawling for training datasets.
ChatGPT-User: user-initiated retrieval token (fetching triggered by a user action).
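Claude-SearchBot / ClaudeBot / Claude-User: Anthropic’s tokens for search, training, and user-initiated retrieval, respectively.
PerplexityBot: Perplexity crawler token (treat like AI search; enforce via WAF if needed).
Perplexity-User: Perplexity’s user-initiated retrieval token.
CCBot: Common Crawl’s crawler token; its public dataset is widely used for model training.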
Googlebot / bingbot: classic search engine indexing bots.

FAQ

What is the difference between GPTBot and OAI-SearchBot?

They’re different user agents. Sites can allow one and block the other using robots.txt. OAI-SearchBot is typically associated with search-style retrieval, while GPTBot is commonly associated with broader crawling for model improvement datasets. Your policy may differ depending on your goals.

If I block AI bots in robots.txt, can they still access my site?

Possibly. robots.txt is a voluntary convention: compliant crawlers follow it, but it does not enforce anything. Some systems may still fetch pages when a user explicitly requests them (using different user agents), and non-compliant bots may ignore robots.txt. If you need stronger enforcement, you typically also use WAF/firewall rules.

Will allowing AI search bots help me appear in AI answers?

Potentially. If an AI search system can crawl and retrieve your pages, it may be more likely to cite and link to you. However, inclusion also depends on relevance, quality, and whether your content is accessible at crawl time.

What is llms.txt and do I need it?

llms.txt is a simple, human-readable file that can point LLMs/AI systems to the most useful sections of your site. It isn’t a formal web standard, but some sites add it because it’s easy and can improve how AI retrieval systems discover key pages.

Why do AI bots get 403 or 429 errors on Cloudflare?

403/429 responses are usually WAF or rate-limiting behavior rather than robots.txt. If you intend to allow a bot, you may need to adjust firewall rules, bot protections, or rate limits so allowed bots receive normal 200/301 responses.

How do I know if I’m blocking AI crawlers on my website?

Check your robots.txt for bot-specific Disallow rules, then verify what your server actually returns to bot-style requests (status codes like 403/429, and headers like X-Robots-Tag). This tool combines robots.txt evaluation with real response checks.

What robots.txt should I use to block AI training but allow AI search?

A common approach is to allow AI search user agents while disallowing training crawlers (for example, allow OAI-SearchBot but disallow GPTBot). Always avoid blocking User-agent: * unless you want to block most crawlers, including search engines.

Does robots.txt affect Google rankings or indexing?

Yes. If you block important sections (or block User-agent: * with Disallow: /), search engines may not crawl your pages, which can prevent indexing and harm visibility. That’s why it’s safer to use targeted rules instead of blanket blocks.

Why is robots.txt allowed but the bot still can’t access my site?

Because robots.txt is only one layer. WAF/CDN rules, authentication, geo-blocking, bot protection, or rate limits can still return 403/429. Also, some servers treat HEAD requests differently than GET requests.

What’s the safest way to block bots without breaking SEO?

Start by keeping User-agent: * open (or only blocking truly private paths like /admin/ and /account/), then add bot-specific rules for AI crawlers you want to restrict. If you need enforcement, apply WAF rules that target specific user agents and paths rather than blanket blocks.