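OpenAI crawler policy

What is the difference between OAI-SearchBot, GPTBot, and ChatGPT-User?

These three tokens serve different purposes. Treating them as one and the same crawler is the most common configuration mistake.

Role breakdown

Visibility-first robots.txt template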

User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

Sitemap: https://example.com/links-map.txt
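
A quick local sanity check of the template can help before it is deployed. The sketch below feeds the same rules to Python's standard-library `urllib.robotparser` and asks what each of the three tokens may fetch. The helper name `check_template` and the probe URL are illustrative assumptions, and the result only reflects how Python's parser reads the file (an agent with no matching group is allowed by default), not necessarily how each OpenAI agent behaves in practice.

```python
from urllib.robotparser import RobotFileParser

# The template from this section, copied verbatim.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

Sitemap: https://example.com/links-map.txt
"""

# The three tokens discussed in this section.
TOKENS = ("OAI-SearchBot", "GPTBot", "ChatGPT-User")


def check_template(robots_txt: str, url: str = "https://example.com/") -> dict[str, bool]:
    """Return, per token, whether Python's parser would allow fetching `url`.

    `check_template` is a hypothetical helper written for this sketch.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {token: parser.can_fetch(token, url) for token in TOKENS}


if __name__ == "__main__":
    for token, allowed in check_template(ROBOTS_TXT).items():
        print(f"{token}: {'allowed' if allowed else 'blocked'}")
```

Because the template has no group for ChatGPT-User and no `User-agent: *` group, the parser falls back to allowing that token. If that is not the intent, the template needs an explicit group for it.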

Decision matrix

How to verify the policy after it goes live

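# Nginx: the last 80 requests from the three OpenAI user agents
grep -Ei "OAI-SearchBot|GPTBot|ChatGPT-User" /var/log/nginx/access.log | tail -n 80
# Status code and path only for the first 80 matches
# (assumes the default combined log format, where $9 is the status and $7 is the path)
grep -Ei "OAI-SearchBot|GPTBot|ChatGPT-User" /var/log/nginx/access.log | awk '{print $9, $7}' | head -n 80

# Caddy JSON: status, URI, and User-Agent for the first 80 matching requests
jq -r 'select(((.request.headers."User-Agent"[0]) // "") | test("OAI-SearchBot|GPTBot|ChatGPT-User"; "i")) | "\(.status)\t\(.request.uri)\t\((.request.headers."User-Agent"[0]) // "")"' /var/lib/caddy/logs/llmsfile-access.log | head -n 80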
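
When the raw lines are too noisy to read, a per-bot tally of status codes can make a blocked crawler (for example, a run of 403s) easier to spot. The following is a minimal sketch, assuming Nginx's default combined log format. The function name `tally_bot_statuses`, the regular expression, and the sample lines are all illustrative. The sample lines are synthetic, with simplified User-Agent strings, and are not real traffic.

```python
import re
from collections import Counter

# The three tokens discussed in this section.
BOT_TOKENS = ("OAI-SearchBot", "GPTBot", "ChatGPT-User")

# Combined log format (assumed):
#   IP - USER [TIME] "METHOD PATH PROTO" STATUS BYTES "REFERER" "USER-AGENT"
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

# Synthetic sample lines for illustration only (documentation IP range,
# simplified User-Agent strings).
SAMPLE_LINES = [
    '203.0.113.5 - - [01/Jan/2025:00:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "compatible; OAI-SearchBot/1.0"',
    '203.0.113.6 - - [01/Jan/2025:00:00:01 +0000] "GET /docs HTTP/1.1" 403 0 "-" "compatible; GPTBot/1.0"',
    '203.0.113.7 - - [01/Jan/2025:00:00:02 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]


def tally_bot_statuses(lines):
    """Count (token, status) pairs for lines whose User-Agent contains a token."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # Skip lines that do not look like combined-format entries.
        user_agent = match.group("ua").lower()
        for token in BOT_TOKENS:
            if token.lower() in user_agent:
                counts[(token, int(match.group("status")))] += 1
                break
    return counts


if __name__ == "__main__":
    for (token, status), count in sorted(tally_bot_statuses(SAMPLE_LINES).items()):
        print(f"{token}\t{status}\t{count}")
```

To run it against a real log, pass an open file object instead of `SAMPLE_LINES`, for example `tally_bot_statuses(open("/var/log/nginx/access.log"))`. A custom `log_format` would need a different regular expression.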

Points that are easy to misjudge
