
An eight-week log of the HTTP `Accept-Language` header sent by eleven verified AI bot user-agents. Most send nothing, some send wildcards, a few leak chat-session locale.

The HTTP `Accept-Language` request header, defined in RFC 9110 §12.5.4 (the 2022 HTTP semantics spec, and before that RFC 7231 §5.3.5), is a content-negotiation hint a client sends to express which natural languages it would prefer the response to be in. For human browsers the header is filled in automatically from the user's operating system or browser language preferences and emitted on every request, almost always as a small comma-separated list with `q`-value quality weights, something like `en-GB,en;q=0.9,fr;q=0.8`. The strings themselves are BCP 47 language tags (`en`, `en-US`, `zh-Hant`), and the q-list syntax is inherited from the older HTTP grammar. The spec is clear, the browser behaviour is consistent, and nobody thinks about it for years at a stretch unless they are running a multi-locale site.

For server-side AI fetchers, by contrast, none of those preconditions apply. There is no operating system, there is no user-language preference attached to the fetcher process, and there is no obvious answer to the question "what should the header even contain." Every operator in the AI-fetcher ecosystem has made an independent decision, and the decisions diverge in ways that turn out to be diagnostic of how their fetcher fleets are wired.

I spent eight weeks logging the `Accept-Language` value on every verified AI bot request to a small but reasonably busy technical blog of mine, sorted the rows by user-agent, and reduced the answers to a handful of crosstabs. The headline finding is that there is no single AI-fetcher convention. Some bots send no `Accept-Language` at all. Some send a bare `*` wildcard. Some send a fixed `en-US,en;q=0.9` that looks suspiciously like a stock browser default.
A few send actual quality lists that vary across requests in ways that hint at chat-session locale leaking through from the user side, which is operationally interesting because it tells you something about where in the upstream stack the fetcher pulls its locale state from.

The rest of this post is the field report: what the spec actually says, the methodology I used, the per-bot distributions, the bots whose values change across replays, and the consequences for any operator running a multi-locale site whose default-locale fallback is now silently choosing which translation gets indexed for every one of these fetchers.

## What RFC 9110 Actually Says

`Accept-Language` is a content-negotiation request header. RFC 9110 §12.5.4 defines its grammar as a list of language ranges with optional quality weights (the `weight` rule itself comes from §12.4.2):

    Accept-Language = #( language-range [ weight ] )
    language-range  = <language-range, see [RFC4647], Section 2.1>
    weight          = OWS ";" OWS "q=" qvalue

The `language-range` syntax is delegated to RFC 4647 §2.1 (the basic language range), which permits both language ranges in BCP 47 form (`en-GB`, `zh-Hant-HK`, `pt-BR`; the canonical list of valid subtags lives in the IANA Language Subtag Registry) and the wildcard `*`, which matches any language. Quality values are decimal numbers in the closed range [0, 1] with at most three decimal digits, and the implicit default when no `q` is given is `1.0`. The conventional browser pattern is to emit the most-preferred locale unweighted, followed by progressively lower-weighted alternates: `en-US,en;q=0.9,fr;q=0.8`.

The semantics are that the receiver, if it has multiple language variants of the resource, should pick one consistent with the weighted preferences, but the spec is explicit that this is a hint, not a directive. Origins are free to ignore the header entirely. Many do.

Two further subtleties matter for the analysis below. First, the spec permits a completely absent `Accept-Language` header.
The semantics of absence are "no preference," which is not the same as the wildcard `Accept-Language: *`. The wildcard says "any language is acceptable," the absent header says "I have not expressed a preference at all," and a strict origin can in principle treat them differently (in practice almost none do). Second, the wildcard is allowed to appear inside a list, with weights: `Accept-Language: en, *;q=0.5` is a legal value meaning "prefer English, but anything else is acceptable at half-weight." That list-embedded wildcard is rare in the wild but it does appear in this dataset, from one bot, and it is one of the more interesting finds.

The header is a hint about the user's locale. For an AI fetcher there is no user in the same sense. The fetcher is constructing a request on behalf of a model that has no operating-system locale, no browser locale, no notion of language preference outside whatever its current conversational context happens to imply. So what a fetcher sends in `Accept-Language` is a choice the operator has to make explicitly, in code, with no obvious right answer.

The space of choices their engineers picked from is small. The bots have settled into roughly five buckets: send nothing, send `*`, send a fixed `en-US,en;q=0.9` lookalike, send a different fixed `en` value, or send something that varies across requests. The eight-week log is the data behind that taxonomy.

## Why It Matters for Multi-Locale Sites

If you run a single-locale site this is mostly trivia. Your origin returns the same HTML regardless of `Accept-Language`, the bots send whatever they send, the indexed content is unambiguous.

The interesting case is multi-locale sites, where the origin (or the routing layer in front of it) negotiates content based on `Accept-Language`. The standard pattern is "look at `Accept-Language`, pick the best matching variant, fall back to a configured default locale if no match."
That fallback is the whole story for AI fetchers, because in the bulk of cases the header value is either absent or `*` or a single language tag that may or may not match any of your variants. The fallback rule decides what gets served, the fallback content is what gets indexed, and the indexed content is what feeds whatever downstream answer the model composes.

It also matters operationally because the multi-locale sites I have audited tend to assume the fallback is hit only by curl users and broken clients. In practice, for these eleven bots, the fallback is hit on a substantial majority of inbound AI fetcher traffic (for some bots, on 100% of inbound traffic). Whatever is at your fallback locale is, for those bots, the entire surface. The other translations are invisible. If the fallback is an English variant and your French and German content is locale-routed away from English-speaking clients, the AI ecosystem mostly only ever sees the English version. That is a structural property of the routing layer that you almost certainly did not consciously decide. It is the consequence of a default rule meeting an unexpected client population.

The corollary is that altering the routing logic for these bot UAs (for instance, serving the bot a `hreflang`-list-aware fallback that exposes all variants in the response, or detecting the bot UA and disabling locale negotiation entirely) has a much bigger effect on indexed coverage than a similar change for human traffic would. I am not going to make recommendations about routing strategy here; the point is that the bots' `Accept-Language` choices interact with whatever rule you have, and "whatever rule you have" is usually a rule that was designed for browsers.

## Methodology

The setup is the same one I have been using for previous wire-level field logs.
The blog runs on a single VPS behind Cloudflare, with nginx doing TLS termination on the VPS itself because per-request observability is much easier there than going through a CDN logs subscription. The relevant bit of nginx config is the `log_format` directive, which I extended to capture the `Accept-Language` header verbatim using the `$http_accept_language` variable:

    log_format ai_log escape=json
        '{'
        '"ts":"$time_iso8601",'
        '"ip":"$remote_addr",'
        '"method":"$request_method",'
        '"path":"$request_uri",'
        '"status":$status,'
        '"bytes":$body_bytes_sent,'
        '"accept_language":"$http_accept_language",'
        '"referer":"$http_referer",'
        '"ua":"$http_user_agent"'
        '}';

    access_log /var/log/nginx/ai_log.json ai_log;

The capture window is eight weeks across late winter and early spring. Total inbound requests over the window were about 1.83 million. Filtering down to the AI-fetcher allowlist below, then PTR-verifying each candidate row, produced 51,884 rows. That is the dataset behind every number in this post.

The allowlist is the same eleven user-agent prefixes I have used before, each verified by reverse-DNS PTR lookup against the published verification domains (`openai.com` for `GPTBot`, `ChatGPT-User`, `OAI-SearchBot`; `anthropic.com` for `ClaudeBot`, `Claude-Web`; `apple.com` for `Applebot`; `googlebot.com` for `Googlebot/Google-Extended`; and so on). UA-matching rows that did not resolve to a PTR record under the expected domain were dropped. About 4.6% of UA-matching rows failed PTR verification, mostly impostors hammering the site on residential IP space.

The eleven bots:

- `GPTBot`: OpenAI's training crawler.
- `ChatGPT-User`: OpenAI's user-triggered live fetcher.
- `OAI-SearchBot`: OpenAI's search-index crawler.
- `PerplexityBot`: Perplexity's index crawler (which, as I have written before, is overloaded with a live-fetcher role).
- `ClaudeBot`: Anthropic's training crawler.
- `Claude-Web`: older Anthropic UA, still showing up in logs.
- `Googlebot/Google-Extended`: Googlebot rows over the period.
- `Applebot/Applebot-Extended`: Applebot rows over the period.
- `Bingbot`: Microsoft's index crawler.
- `Amazonbot`: Amazon's crawler.
- `Bytespider`: ByteDance's crawler.

Most of the analysis ran against a SQLite mirror of the JSON log. The core crosstab is one query. It buckets the header value and counts each distinct value per bot, ordered by frequency, so the top three values per bot can be read off the head of each group:

    WITH classified AS (
      SELECT
        CASE
          WHEN ua LIKE '%GPTBot%'        THEN 'GPTBot'
          WHEN ua LIKE '%ChatGPT-User%'  THEN 'ChatGPT-User'
          WHEN ua LIKE '%OAI-SearchBot%' THEN 'OAI-SearchBot'
          WHEN ua LIKE '%PerplexityBot%' THEN 'PerplexityBot'
          WHEN ua LIKE '%ClaudeBot%'     THEN 'ClaudeBot'
          WHEN ua LIKE '%Claude-Web%'    THEN 'Claude-Web'
          WHEN ua LIKE '%Googlebot%'     THEN 'Googlebot/Google-Extended'
          WHEN ua LIKE '%Applebot%'      THEN 'Applebot/Applebot-Extended'
          WHEN ua LIKE '%bingbot%'       THEN 'Bingbot'
          WHEN ua LIKE '%Amazonbot%'     THEN 'Amazonbot'
          WHEN ua LIKE '%Bytespider%'    THEN 'Bytespider'
        END AS bot,
        CASE
          WHEN accept_language = '' OR accept_language = '-' THEN '(absent)'
          WHEN accept_language = '*' THEN '*'
          ELSE accept_language
        END AS al
      FROM ai_log
      WHERE ptr_verified = 1
    )
    SELECT
      bot,
      al,
      COUNT(*) AS n,
      ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (PARTITION BY bot), 1) AS pct
    FROM classified
    GROUP BY bot, al
    ORDER BY bot, n DESC;

The number of distinct values per bot is small enough that the entire output fits in a few screens.

For the cases where a bot sends real quality lists (rather than just one fixed string), I wrote a small parser to split the list and extract the highest-weighted language tag. The parser is simple but worth showing because it has to handle the q-weight syntax and the BCP 47 tag form correctly:

    from __future__ import annotations  # lets the annotations below work on older Pythons


    def parse_accept_language(value: str) -> list[tuple[str, float]]:
        """Return list of (language_range, q) sorted by q descending.

        Implements the parsing side of RFC 9110 section 12.5.4.
        Default q is 1.0 when no weight is given. Tags that are not
        BCP 47 language tags or '*' are tolerated and returned as-is;
        rejection is the receiver's job, not the parser's.
        """
        # An empty value or the log placeholder "-" means the header was absent.
        if not value or value.strip() in ("", "-"):
            return []
        out: list[tuple[str, float]] = []
        for raw in value.split(","):
            part = raw.strip()
            if not part:
                continue
            if ";" in part:
                tag, *params = part.split(";")
                tag = tag.strip()
                q = 1.0
                for p in params:
                    p = p.strip()
                    if p.startswith("q="):
                        try:
                            q = float(p[2:])
                        except ValueError:
                            # A malformed weight is kept but ranked last.
                            q = 0.0
                out.append((tag, q))
            else:
                out.append((part, 1.0))
        # list.sort is stable, so equal-weight tags keep their header order.
        out.sort(key=lambda x: -x[1])
        return out


    def top_locale(value: str) -> str | None:
        parsed = parse_accept_language(value)
        return parsed[0][0] if parsed else None

That is enough to reproduce every per-bot top-locale tally below from a JSON log. The parsing strictness is deliberately lax because some of the bots emit values that are not strictly conforming and I would rather see them in the buckets than throw them out.

## Per-Bot Distributions

The first cut is just "do they send the header at all," and the answer is more uneven than I expected. Here are the four high-level buckets (absent, `*` wildcard, single-language fixed string, and multi-tag quality list) by bot:

| Bot | Total verified | Absent | `*` wildcard | Fixed single | Quality list |
|----|----|----|----|----|----|
| GPTBot | 19,247 | 19,247 | 0 | 0 | 0 |
| ChatGPT-User | 781 | 14 | 0 | 19 | 748 |
| OAI-SearchBot | 5,562 | 5,498 | 0 | 64 | 0 |
| PerplexityBot | 5,103 | 2,118 | 0 | 0 | 2,985 |
| ClaudeBot | 8,624 | 8,624 | 0 | 0 | 0 |
| Claude-Web | 244 | 0 | 0 | 244 | 0 |
| Googlebot/Google-Extended | 7,041 | 0 | 7,041 | 0 | 0 |
| Applebot/Applebot-Extended | 1,803 | 1,803 | 0 | 0 | 0 |
| Bingbot | 2,415 | 2,415 | 0 | 0 | 0 |
| Amazonbot | 711 | 711 | 0 | 0 | 0 |
| Bytespider | 353 | 191 | 109 | 47 | 6 |

The first thing to notice is the cleanest division in the table: most training crawlers send absolutely nothing. `GPTBot`, `ClaudeBot`, `Applebot`, `Bingbot`, `Amazonbot`: every one of those bots sent no `Accept-Language` header on any of the tens of thousands of requests they made.
That is consistent with the logic that a training crawler walking a queue of URLs has no user, no session, no locale, and the simplest correct implementation is to omit the header. Five out of eleven bots picked that path. The "absent" column for those five is literally 100% of their traffic.

The second thing to notice is the next-cleanest division: Googlebot sends `*`. Every single `Googlebot/Google-Extended` request in the sample carried `Accept-Language: *`. Not absent, not a list, not a specific language, just the bare wildcard. This is worth flagging against Google's own documentation on locale-adaptive pages, which states that Googlebot ordinarily sends HTTP requests without an `Accept-Language` header at all, and that for locale-adaptive content it relies on geo-distributed crawling from non-US IP addresses rather than on varying the header. The bare `*` I logged is not described in that document. Whether it represents a recent change, a Googlebot subsystem the docs do not cover, or a value Google emits specifically when crawling sites it has not classified as locale-adaptive, I cannot tell from header data alone. What I can say is that across 7,041 verified `Googlebot/Google-Extended` rows the value was `*` and only `*`, with no deviation. It is the only bot in the sample that uses the wildcard form as a fixed value.

Two implementations, two clean rules: omit the header, or send `*`. The semantics in the spec are slightly different (no preference vs any language), but the operational effect on a multi-locale site is identical: the origin's fallback locale rule decides which variant is served.

The third thing to notice is `ChatGPT-User`, where 95.8% of requests carry an actual quality list. This is the inverse pattern from the training crawlers, and it is the bot where the chat-session locale appears to leak through.
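To make the "absent and `*` both land on the fallback" point concrete, here is a minimal sketch of the kind of negotiation step a locale router might run. It is an illustration under stated assumptions, not any particular router's implementation: `AVAILABLE`, `DEFAULT_LOCALE`, `parse_ranges`, and `negotiate` are hypothetical names, the matching is a simplified exact-then-primary-subtag lookup rather than full RFC 4647 matching, and it assumes the router treats a wildcard as naming no specific variant.

```python
from __future__ import annotations

# Hypothetical site configuration: the published variants and the fallback.
AVAILABLE = ["en", "fr", "de"]
DEFAULT_LOCALE = "en"


def parse_ranges(header: str | None) -> list[tuple[str, float]]:
    """Split an Accept-Language value into (range, q) pairs, highest q first."""
    if not header or header.strip() in ("", "-"):
        return []
    pairs: list[tuple[str, float]] = []
    for part in header.split(","):
        tag, _, params = part.strip().partition(";")
        tag = tag.strip().lower()
        q = 1.0
        for p in params.split(";"):
            p = p.strip()
            if p.startswith("q="):
                try:
                    q = float(p[2:])
                except ValueError:
                    q = 0.0
        if tag:
            pairs.append((tag, q))
    # sorted() is stable, so equal-weight ranges keep their header order.
    return sorted(pairs, key=lambda pair: -pair[1])


def negotiate(header: str | None) -> tuple[str, bool]:
    """Return (locale_served, used_fallback) for one request."""
    for tag, q in parse_ranges(header):
        if q <= 0 or tag == "*":
            # Assumption: a wildcard names no language, so it cannot select
            # a variant and the request drops through to the default.
            continue
        primary = tag.split("-")[0]
        for candidate in (tag, primary):
            if candidate in AVAILABLE:
                return candidate, False
    return DEFAULT_LOCALE, True


if __name__ == "__main__":
    for value in (None, "*", "zh-CN, *;q=0.5", "de-DE,de;q=0.9,en;q=0.8"):
        print(repr(value), "->", negotiate(value))
```

Under these assumptions the absent header, the bare `*`, and `zh-CN, *;q=0.5` all come back with `used_fallback` set, while the German quality list selects the `de` variant. A router that resolves `*` differently (for example, to the first configured variant) would behave differently for the wildcard rows, which is worth checking against your own routing layer.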
`Claude-Web` is the simplest of the live-ish fetchers: every one of its 244 requests carried the same fixed string, `en-US,en;q=0.9`, with no variation across the entire eight-week window. That value is exactly what a stock Chromium browser sends when the OS locale is `en-US`. The bot is plainly emitting a hardcoded default rather than reading any session state. It appears to be a vestigial implementation choice (the UA itself is on the way out, replaced by the newer Anthropic UAs) and the pattern is consistent with "we set the value once in code and never thought about it again."

`Bytespider` is, predictably, the messiest. About 54% of its requests had no header, 31% had `*`, 13% had one of two fixed single-language strings (`zh-CN` and `en`), and a small handful (under 2%) carried a quality list with the wildcard inside it. The same URL, fetched twice within an hour, often carried different `Accept-Language` values across the two fetches. The pattern is consistent with what I have seen on every other wire-level field for Bytespider: a fleet rather than a single fetcher, with internal inconsistency that may reflect rolling deploys, A/B configurations, or different upstream callers wearing the same UA. I include the row to be honest about what I logged but I would not draw operational conclusions from it.

`OAI-SearchBot` is mostly absent (98.8%) with a small slice (1.2%) carrying a fixed `en-US` value. The 64 non-absent rows clustered on a smaller set of URLs and arrived in tight bursts, which is the same shape as the `Referer`-bearing slice of `OAI-SearchBot` traffic I described in a previous field log: the bot has two roles, mostly an index crawler that omits the header, occasionally a verifier or freshness-checker that does set it. The two roles are distinguishable by the same wire-level signals.

`PerplexityBot`, also predictably, is split: 41.5% absent, 58.5% with a real quality list.
That split is uncannily close to the with-`Referer` / without-`Referer` split I documented for the same bot in the earlier field log, and on closer inspection it is essentially the same population: the rows that send a quality-list `Accept-Language` are largely the same rows that send a non-empty `Referer`, and they arrive in the bursty per-URL cadence that looks like live user fetches rather than periodic indexing. Whatever inside Perplexity has overloaded `PerplexityBot` to play both roles, it is consistently leaking session state into both `Referer` and `Accept-Language` on the live-fetcher slice.

## ChatGPT-User and Quality-List Variation

`ChatGPT-User` is the bot where the most interesting variation lives. Of 781 verified requests, 748 carried a multi-tag quality list, 19 carried a fixed single-language string, and only 14 had the header absent. Across the 748 quality-list rows, the most-preferred locale (the unweighted leading tag) varied across at least 18 distinct BCP 47 tags.

Here are the top three observed `Accept-Language` values for each bot, with shares rounded to one decimal:

| Bot | Top value 1 | Top value 2 | Top value 3 |
|----|----|----|----|
| GPTBot | (absent) 100.0% | | |
| ChatGPT-User | `en-US,en;q=0.9` 47.2% | `en-GB,en;q=0.9` 11.8% | `de-DE,de;q=0.9,en;q=0.8` 7.3% |
| OAI-SearchBot | (absent) 98.8% | `en-US` 1.0% | `en-US,en;q=0.9` 0.2% |
| PerplexityBot | (absent) 41.5% | `en-US,en;q=0.9` 33.4% | `en-GB,en;q=0.9` 9.7% |
| ClaudeBot | (absent) 100.0% | | |
| Claude-Web | `en-US,en;q=0.9` 100.0% | | |
| Googlebot/Google-Extended | `*` 100.0% | | |
| Applebot/Applebot-Extended | (absent) 100.0% | | |
| Bingbot | (absent) 100.0% | | |
| Amazonbot | (absent) 100.0% | | |
| Bytespider | (absent) 54.1% | `*` 30.9% | `zh-CN` 9.6% |

The 47.2% / 11.8% / 7.3% spread on `ChatGPT-User` is the giveaway. `en-US,en;q=0.9` is the stock American-English browser default. `en-GB,en;q=0.9` is the UK variant.
`de-DE,de;q=0.9,en;q=0.8` is what you get from a Chromium browser whose OS locale is set to German with English as the secondary preference. The values appearing on a server-side fetcher are too consistent with browser defaults to be coincidence.

The fetcher is not generating these from nothing; it is reading them from somewhere upstream that does have a locale, namely the chat session. The ChatGPT product knows the user's interface locale (which the user picks in settings, or which the browser auto-selects on first load), and that locale propagates down through the tool-call invocation into whatever HTTP client `ChatGPT-User` is built on. The header value in the wire log is, effectively, a copy of the `Accept-Language` that the user's own browser would send to the chat product, transported via the live-fetcher tool call to the third-party origin.

This is a clean architectural diagnosis, and it has a corresponding operational consequence. The `Accept-Language` value on `ChatGPT-User` is not random and it is not arbitrary; it is a per-fetch correlate of the live human user who triggered the fetch. The distribution of values across an eight-week window is, approximately, the distribution of locales of the people who used ChatGPT to look at my content during that window. From my data:

- 47.2% `en-US`
- 11.8% `en-GB`
- 7.3% `de-DE`
- 4.5% `fr-FR`
- 3.1% `es-ES` and `es-MX` combined
- 2.4% `nl-NL`
- 2.1% `pt-BR`
- The remaining ~21% spread across about a dozen other tags (`it-IT`, `pl-PL`, `sv-SE`, `ja-JP`, `ko-KR`, etc.) plus the fixed-string and absent rows

That is the demographic of "people on ChatGPT who decided to read a small English-language technical blog over an eight-week window." For my site, with English-only content, the locale distribution is essentially "where are the readers"; for a multi-locale site, the same distribution would tell you where each translation's traffic was going to come from if the fetcher had been allowed to negotiate it.
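A tally of this kind can be rebuilt from the JSON access log in a few lines. The sketch below is one way to do it, not the exact script behind the numbers above: `leading_tag` and `locale_shares` are hypothetical helper names, the `ua` and `accept_language` keys follow the `log_format` shown earlier, and it assumes the leading tag of the list is the most-preferred one (true for browser-style lists, not guaranteed by the grammar). It also skips PTR verification, so it should only be fed rows that have already been verified.

```python
from __future__ import annotations

import json
from collections import Counter
from typing import Iterable


def leading_tag(header: str | None) -> str | None:
    """Return the first language range in the header, or None if absent."""
    header = (header or "").strip()
    if header in ("", "-"):
        return None
    first = header.split(",")[0]
    tag = first.split(";")[0].strip()
    return tag or None


def locale_shares(lines: Iterable[str], bot: str = "ChatGPT-User") -> dict[str, float]:
    """Percentage share of each leading tag among rows whose UA contains `bot`."""
    counts: Counter[str] = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        if bot not in row.get("ua", ""):
            continue
        tag = leading_tag(row.get("accept_language", ""))
        counts[tag or "(absent)"] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders the result from largest share to smallest.
    return {tag: round(100.0 * n / total, 1) for tag, n in counts.most_common()}


if __name__ == "__main__":
    # Assumed path, taken from the access_log directive shown earlier.
    with open("/var/log/nginx/ai_log.json", encoding="utf-8") as fh:
        for tag, pct in locale_shares(fh).items():
            print(f"{pct:5.1f}%  {tag}")
```

Grouping on the leading tag rather than the full header value is a deliberate choice here: it folds `es-ES` and `es-MX` lists with different tails into their own tags, which is closer to "user locale" than to "exact header string".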
The same single URL on my site, fetched by `ChatGPT-User` more than once during the window, came with different `Accept-Language` values across fetches in 67 of the 91 cases where there was more than one fetch. The values shifted in a pattern fully consistent with "different humans triggered the fetch on different days." The bot has no persistent locale of its own. Whatever locale arrives in the header is borrowed from the user who is currently asking. That is essentially a per-request leakage of a single piece of user state, and it is observable in plain text in the access log.

## PerplexityBot's Split, Revisited

The same logic applies, more weakly, to the live-fetcher slice of `PerplexityBot`. The 58.5% of its rows that carry a quality list show a similar distribution dominated by `en-US,en;q=0.9` and a smaller tail of European locales, but the long tail is shorter than `ChatGPT-User`'s, and the mode is more dominant.

Two readings are plausible. One: Perplexity's user base is more concentrated in `en-US` than ChatGPT's, which is consistent with Perplexity's market position as a US-launched product with strong North American adoption. Two: Perplexity's live fetcher does not pull `Accept-Language` directly from the user-side browser header but from a server-side locale that is set somewhere in the user's account preferences, and the account-preference locale defaults to `en-US` more aggressively than the browser-derived header would.

I cannot tell from header data alone which of those is the case. But the wire-level fact is clear: for `PerplexityBot`'s live-fetcher subset, the `Accept-Language` value is non-trivially influenced by the upstream user, and across the eight-week window the bot's `Accept-Language` distribution carries information about Perplexity's user base in the same way `ChatGPT-User`'s does about ChatGPT's. The rows that don't send the header are the index-crawler subset, and they are uniform in their absence.
The split is the same one the `Referer` analysis surfaced, and combining the two signals gives you a reliable flag for "this was a live user fetch" versus "this was an indexing fetch" within the `PerplexityBot` rows.

## The `*` Wildcard, Inside a List

The `Bytespider` rows include 109 instances of the bare `*` wildcard, which is the second-largest bucket for that bot. That is the first place I have seen `*` outside Googlebot. There were also 6 `Bytespider` rows that emitted `Accept-Language: zh-CN, *;q=0.5`, a small handful, but worth flagging because that is the only place in the entire 51,884-row dataset where a bot emitted the wildcard inside a quality list rather than as a standalone value.

RFC 9110 §12.5.4 permits this construction explicitly, and it is the most precise possible expression of "I prefer Chinese, but I will take anything." It is also the kind of construction nobody writes by hand. Some `Bytespider` component has a real implementation of `Accept-Language` building, while others just send `*`, and others send nothing. The bot is plainly a fleet, and the components have not converged on a single `Accept-Language` strategy.

## What This Means for Locale Routing

If you operate a multi-locale site, the routing layer in front of your origin almost certainly does something like the following on every inbound request: look at `Accept-Language`, pick the highest-weighted tag, find the closest matching variant of the requested URL, fall back to a configured default if no match.

The fallback rule is where almost all of these bots end up. From the table above:

- Five bots (`GPTBot`, `ClaudeBot`, `Applebot`, `Bingbot`, `Amazonbot`) hit your fallback on 100% of requests because they send no header.
- `Googlebot/Google-Extended` hits your fallback on 100% of requests because `*` matches everything and the routing logic typically resolves a wildcard to the default.
- `OAI-SearchBot` hits the fallback on 98.8% of requests for the same absent-header reason.
- `Claude-Web` will match an `en-US` variant if you have one, otherwise fall back to default.
- `PerplexityBot`'s index slice hits the fallback; its live-fetcher slice negotiates by the user's locale.
- `ChatGPT-User` negotiates by the user's locale on 95.8% of requests.
- `Bytespider` is mostly the fallback.

The aggregate effect is that the great majority of AI fetcher traffic to a multi-locale site is going to be served the fallback locale. Whatever is at your fallback is what gets indexed, summarised, embedded, and cited by the AI ecosystem at large. If your fallback is your English variant, the German and French and Japanese translations are mostly invisible to these bots. If your fallback is geographically determined (some routers fall back based on GeoIP rather than `Accept-Language`), the answer depends on what continent your AI fetchers are coming out of, which for most of these vendors is "North America", and the consequence is that the site behaves to AI bots as if it were a US-localised single-language site, regardless of how many translations you have published.

The only two bots that materially negotiate are `ChatGPT-User` and the live-fetcher slice of `PerplexityBot`. Those are the bots that fetch on behalf of a real user with a real locale, and for a multi-locale site they will, when they fetch your URL, pick the variant matching the user's locale. That has a much smaller volume than the training and index crawlers, but it is also the slice with direct human attention attached, and it is the slice where serving the right locale matters most.

The implication for content strategy is that "which translation gets indexed" is not the same question as "which translation gets cited to a user." The first is decided by your fallback rule and dominates the index-crawl traffic. The second is decided by the user's chat-session locale and dominates the live-fetcher traffic.
They can produce inconsistent outcomes: your French translation is invisible to `GPTBot` (so the corpus does not contain it), but `ChatGPT-User` will fetch it for French users (who will then get a French page for which the model has no training signal). I have not measured the downstream consequence of that inconsistency on actual answer quality, and I do not want to guess.

## A Brief Note on hreflang

The interaction with `hreflang` is worth a sentence and not more, because the topic deserves its own field test that I have done elsewhere. `hreflang` annotations do not change `Accept-Language` behaviour; they are independent signals declared in the HTML or HTTP headers that tell crawlers about alternate-language variants of the same content. A bot that sees `hreflang` may follow the alternates or not depending on its implementation, regardless of what it sends in `Accept-Language`. The two mechanisms compose: `Accept-Language` decides which variant to serve right now, `hreflang` advertises that alternates exist for the bot to consider separately. For the AI fetcher fleet here, `hreflang` discovery is uneven (some bots follow the alternates, some don't) and the question of which translation actually ends up indexed is the joint outcome of the two mechanisms, not either alone.

## Caveats

The dataset is one technical blog over eight weeks. The numbers will drift across vendors and across releases. Specific points worth flagging:

- The exact `Accept-Language` values are observational. A bot that sends a quality list today may send `*` tomorrow, or vice versa. The shape of the taxonomy (most absent, one wildcard-only, one fixed-string, two with real lists) is more durable than the specific percentages.
- `ChatGPT-User`'s locale distribution is a sample of my readers on ChatGPT. The values would be different on another site, and would shift over time as ChatGPT's user base shifts.
- The PTR-verification step excludes some legitimate bot traffic whose source IPs do not resolve to the expected vendor PTR records. The percentages are computed against the verified subset only and are slightly conservative as a result.
- Small-sample bots (`Claude-Web` 244 rows, `Bytespider` 353 rows, `Amazonbot` 711 rows) are more sensitive to noise. The all-absent and all-fixed patterns are robust because every single row agreed; the more interesting splits on smaller bots should be read as suggestive rather than precise.
- The mapping of locale tag to "user locale" assumes the chat product propagates the browser-side `Accept-Language` to the fetcher faithfully. If the chat product has its own normalisation step (collapsing minor regional variants, picking a single locale per user, etc.), the wire-level value is downstream of that normalisation, not a direct mirror of the user's browser.

## Closing Thoughts

The `Accept-Language` header on AI fetcher traffic ends up doing something the spec authors cannot have anticipated, which is to leak operator implementation choices across every layer of the fetcher stack. A bot that omits the header is signalling that its operator wrote the simplest possible client. A bot that sends `*` is signalling that the operator thought about the spec long enough to find the wildcard form. A bot that sends a fixed `en-US` is signalling that someone wrote the value once in code and never revisited it. A bot that sends a real quality list is signalling that the fetcher is plumbed into a session-aware system upstream and the user-side locale is being copied through.

Each of those signals is observable in the log without any inside information, vendor relationship, or special tooling. The methodology is two SQL queries, an nginx log line, and a small parser.

The thing that surprised me most was not the variance, which I expected, but the consistency within each operator's choice.
Five crawlers all picked "send nothing" independently and stuck with it for eight weeks across tens of thousands of requests with not a single deviation. Googlebot picked `*` and sent it on every single one of seven thousand requests. `Claude-Web` picked `en-US,en;q=0.9` and sent it on every single request. The deviations, where they exist, are the live fetchers (`ChatGPT-User`, the live-fetcher slice of `PerplexityBot`) where there is genuinely a per-request user state to leak into the header, and even there the value is structured, not chaotic, with the long tail concentrated on a small set of common locale defaults that are visibly browser-derived.

For multi-locale sites, the practical takeaway is uncomfortable. The default-locale fallback on a routing layer is doing more work than its designers intended; it is the rule that picks the variant for the bulk of inbound AI traffic, including all the training crawlers, all the index crawlers, and the lion's share of the search-index fetchers. Whichever translation lives at the fallback is, for most of the AI ecosystem, the only translation that exists. The other translations are mostly visible only to the live-fetcher slice, and only when a user's chat-session locale matches them. That is not an inherently bad outcome (most sites pick their primary locale as the fallback for good reasons) but it is a structural property of the system that operators should at least know about, because right now most of them do not.

Eight weeks of staring at one HTTP header surfaced a clearer picture of how the bots were thinking about locale than any vendor blog post or trade-press article has done in the last two years. I am going to keep watching it. The bots will change, the percentages will drift, the taxonomy may add or lose buckets, but the field is small, the data is in your own logs, and the diagnostic value of the header is going to remain disproportionate to its size for as long as the AI fetcher ecosystem stays as fragmented as it currently is.
View original source — Hacker Noon ↗


