Question

How to verify AI crawler authenticity

User-agent strings are easy to spoof. Before you relax access policy for a crawler, confirm that the traffic behaves like a real crawler and not like a scanner.

Practical workflow (from real logs)

  1. Review a 24-72 hour window of logs. A single hit is not enough to establish trust.
  2. Look at path order. Legitimate crawlers often fetch /robots.txt first, then sitemap files, then content pages.
  3. Check the status distribution. Genuine crawlers mostly get 200 responses on allowed pages; traffic dominated by 403/404 points to probing.
  4. If possible, verify ownership of the source addresses in cloud firewall or WAF logs before you change policy.
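
Step 3 can be sketched as a small shell helper. This is a minimal example, not a definitive check: it assumes the common "combined" access log format, where the status code is the ninth whitespace-separated field, and status_summary is a hypothetical function name chosen for illustration.

```shell
# Sketch: count response codes for log lines that match a crawler name.
# Assumption: combined log format, so the status code is field 9.
# status_summary is a hypothetical helper, not part of any tool.
status_summary() {
  pattern=$1   # extended regex of crawler names
  logfile=$2   # path to the access log
  grep -Ei "$pattern" "$logfile" |
    awk '{ count[$9]++ } END { for (s in count) print s, count[s] }' |
    sort
}

# Usage:
#   status_summary 'OAI-SearchBot|Claude-SearchBot|PerplexityBot' access.log
```

Each output line is a status code followed by its count. A result made up mostly of 403 and 404 lines is a reason to look more closely before trusting the name in the user agent.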

What suspicious traffic looks like

If you see crawler names hitting /wp-admin, /.env, or other exploit-probe paths, treat that as spoofing noise. Do not whitelist anything based on those lines.
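
One way to surface that noise is to list the client IPs that send a crawler name but request probe paths. The sketch below assumes the combined log format (client IP in field 1, request path in field 7). The function name spoof_candidates is hypothetical, and the probe list covers only the two paths named above, so extend it for your own site.

```shell
# Sketch: client IPs that claim a crawler name but request probe paths.
# Assumption: combined log format (field 1 = client IP, field 7 = path).
# spoof_candidates is a hypothetical helper; the path list is illustrative.
spoof_candidates() {
  pattern=$1   # extended regex of crawler names
  logfile=$2   # path to the access log
  grep -Ei "$pattern" "$logfile" |
    awk '$7 ~ /^\/(wp-admin|\.env)/ { hits[$1]++ }
         END { for (ip in hits) print ip, hits[ip] }' |
    sort
}

# Usage:
#   spoof_candidates 'OAI-SearchBot|Claude-SearchBot|PerplexityBot' access.log
```

Each output line is an IP followed by the number of probe requests it made under a crawler name. Those addresses are candidates to exclude when you judge the real crawler's behavior.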

Fast command pattern

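# Last 200 requests that claim one of these crawler names
grep -Ei 'OAI-SearchBot|Claude-SearchBot|PerplexityBot' access.log | tail -n 200
# Claimed-crawler requests for discovery files (dots escaped so they match literally)
grep -Ei 'OAI-SearchBot|Claude-SearchBot|PerplexityBot' access.log | grep -E 'robots\.txt|sitemap|llms\.txt'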
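
The second command shows whether discovery files were fetched at all, but not in what order. To check path order per client, a rough sketch is to print the first few paths each IP requested, in log order. It assumes the combined log format (IP in field 1, path in field 7) and a log sorted by time. The function name first_paths and its default limit of 5 are illustrative choices.

```shell
# Sketch: first N requested paths per client IP, in log order.
# Assumption: combined log format (field 1 = client IP, field 7 = path)
# and a chronologically ordered log. first_paths is a hypothetical helper.
first_paths() {
  pattern=$1        # extended regex of crawler names
  logfile=$2        # path to the access log
  limit=${3:-5}     # how many paths to show per IP
  grep -Ei "$pattern" "$logfile" |
    awk -v limit="$limit" 'seen[$1]++ < limit { print $1, $7 }'
}

# Usage:
#   first_paths 'OAI-SearchBot|Claude-SearchBot|PerplexityBot' access.log 5
```

An IP whose first requests are /robots.txt and sitemap files fits the expected crawler pattern. One that goes straight to admin or config paths does not.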

Common false signal

Many teams mistake “UA contains bot name” for proof. It is only a clue. The behavior pattern is the stronger signal.
