Why Claude.ai Streams Its Answers Over POST (and How I Reused the Trick Without EventSource)

I spend most of my day inside Claude and Codex, and somewhere along the way the streaming UX rewired my expectations. I no longer want a spinner and then a wall of text. I want to watch the thing think: the tokens arriving one by one, the little collapsible blocks where it second-guesses itself, the "calling a tool" line, the "running code" line, the moments it pauses to re-read what it just wrote. You can pop any of those open and inspect them. That progressive disclosure has quietly become the default contract for talking to a machine, especially for programmers and operators who work with documents and long-running jobs.

What struck me is how narrow the pattern still is. It lives almost entirely inside AI chat. But there's nothing about "stream the work, not just the result" that is specific to language models. Any backend that does several slow things in sequence (fan out to APIs, score the results, filter, rank) could show its work the same way.

So I went looking for how the AI apps actually do it, and then rebuilt the mechanism for something that has nothing to do with chatbots: a travel assistant that answers "where can I escape to?"

The detail everyone skips: it's SSE, but over POST

Open your browser devtools on claude.ai, go to the Network tab, and send a message. You'll see the request that carries your prompt come back with Content-Type: text/event-stream, dribbling the answer out in chunks. So far, textbook Server-Sent Events.

But look again at the method. It's a POST. The same request that uploads your message and your files is the one that streams the response back.

If you've only ever used SSE through the browser's built-in EventSource, that should look wrong. And it is wrong, for EventSource. The HTML spec defines EventSource as issuing a plain GET; the constructor takes a URL plus an optional withCredentials flag, and nothing else. No request body, no custom verb.
The canonical SSE setup is therefore two-legged: a POST to hand over your payload, which returns some job ID, and then a separate GET to an event endpoint that you subscribe to with new EventSource('/events?id=...'). Behind that there's usually a Redis instance and a pub/sub fabric shuttling tokens from whatever worker is doing inference to whatever edge node is holding your GET open. If you've never wired any of this up by hand, HackerNoon's own Server-Sent Events 101 is a good warm-up on the wire format.

The single-POST approach Anthropic uses collapses that. One request goes out with the body; that exact request answers with event-stream headers and streams the partial results straight back. I can only guess at how their load balancing copes (a long-lived streaming response is harder to pool than a quick request/response), but the client story gets dramatically simpler. There's no ID to mint, no second connection to correlate, no window where the job exists but you're not listening yet.

I'm not Baba Vanga, but I'd be surprised if this shape didn't show up all over the place soon. It's too convenient to ignore.

The price is that you give up EventSource. To consume an event-stream that arrives as the body of a fetch you wrote yourself, you read the response as a stream and parse the frames by hand.

Reading a POST stream by hand

The whole demo is two files, server.js and index.html, on GitHub.
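For contrast, here is roughly what the two-legged EventSource flow described earlier would ask of the client. This is a sketch under assumptions, not code from the demo: the /jobs and /events endpoints, the jobId field, and the askTwoLegged name are all hypothetical, and fetch and EventSource are injectable only so the function can be exercised outside a browser.

```javascript
// Hypothetical two-legged flow: POST the payload, then subscribe with a GET.
// The /jobs and /events endpoints and the jobId field are assumptions made
// for illustration; they are not part of the demo or of any real API.
async function askTwoLegged(query, onMessage, deps = {}) {
  // Injectable so the sketch can run outside a browser (for example in a test).
  const fetchImpl = deps.fetch ?? globalThis.fetch;
  const EventSourceImpl = deps.EventSource ?? globalThis.EventSource;

  // Leg 1: hand over the payload and get back an ID for the job.
  const res = await fetchImpl("/jobs", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ q: query }),
  });
  if (!res.ok) throw new Error(`POST /jobs failed with ${res.status}`);
  const { jobId } = await res.json();

  // Leg 2: a separate GET, because EventSource cannot send a body.
  // Anything the job emitted before this line runs has to be buffered
  // or replayed by the server, otherwise the client never sees it.
  const source = new EventSourceImpl(`/events?id=${encodeURIComponent(jobId)}`);
  source.onmessage = (event) => onMessage(JSON.parse(event.data));
  source.onerror = () => source.close(); // opt out of the built-in auto-reconnect
  return source;
}
```

Two round trips, one ID to correlate, and a gap between them that the server has to cover. The single-POST reader removes all three.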
Here's the entire client-side trick, and it's smaller than you'd expect:

    const res = await fetch("/assist", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ q: query }),
    });

    const reader = res.body
      .pipeThrough(new TextDecoderStream()) // bytes -> UTF-8 text
      .getReader();

    let buffer = "";
    while (true) {
      const { value, done } = await reader.read();
      if (done) break;
      buffer += value;

      const frames = buffer.split("\n\n"); // SSE frames end in a blank line
      buffer = frames.pop();               // last chunk may be a partial frame

      for (const frame of frames) {
        const line = frame.split("\n").find((l) => l.startsWith("data:"));
        if (line) handle(JSON.parse(line.slice(5).trim()));
      }
    }

res.body is a ReadableStream of bytes. Piping it through TextDecoderStream turns those bytes into text without me ever touching a TextDecoder manually, and, importantly, it handles multi-byte characters that get split across two network chunks. That's a bug you will absolutely hit if you decode chunks yourself with a naive String.fromCharCode.

The one part people get wrong is frame boundaries. The event-stream format separates messages with a blank line, that is, a double newline. But TCP doesn't care about your frames, so a single read() can hand you one frame, three frames, or one-and-a-half frames. The buffer.split("\n\n") plus frames.pop() dance is what keeps a half-frame around until the rest of it shows up. Skip that and you'll spend an evening debugging JSON.parse errors that only happen under load.

That's the whole consumer. No library. Now the server.

The backend: one POST that returns a stream

I built the backend on Bun because Bun.serve makes returning a streaming Response painless and the cold start is basically zero, which is nice for a demo you'll screen-record. The shape is: route POST /assist, build a ReadableStream, run the pipeline inside its start(), and enqueue an SSE frame at every milestone.

    Bun.serve({
      port: 3000,
      idleTimeout: 120, // a streaming response must outlive the default timeout
      async fetch(req, server) {
        const url = new URL(req.url);

        if (req.method === "POST" && url.pathname === "/assist") {
          const { q } = await req.json();
          const enc = new TextEncoder();

          const stream = new ReadableStream({
            async start(controller) {
              // One SSE frame per call: "data: <json>" followed by a blank line.
              const send = (o) =>
                controller.enqueue(enc.encode(`data: ${JSON.stringify(o)}\n\n`));
              try {
                await runPipeline(q, clientIP(req, server), send);
              } catch (err) {
                console.error(err); // log it, but don't leave the client hanging
              } finally {
                controller.close(); // always end the response, even after a failure
              }
            },
          });

          return new Response(stream, {
            headers: {
              "Content-Type": "text/event-stream; charset=utf-8",
              "Cache-Control": "no-cache, no-transform",
              "X-Accel-Buffering": "no", // stop nginx & friends from buffering
            },
          });
        }

        // ...serve index.html for GET /
      },
    });

send() is the entire protocol: stringify an object, wrap it in data: … \n\n, push it. Every call lands in the browser the instant it's enqueued.

Putting the LLM in its place

This is the part I find most interesting, and it's the opposite of how most "AI app" demos are built. The language model here is not a tool-calling agent that decides what to do next. It's one function in a fixed chain: it does a single job (turn a sentence into structured intent) and then gets out of the way. Everything after it is boring, deterministic API plumbing.

The pipeline, each step of which is one send() to the browser:

Understand. Emit "Understanding your request…" immediately, so the UI has something the moment the connection opens.

Geo + intent, in parallel. I fire an IP-geolocation lookup against ip-api and, concurrently, ask an LLM on Groq's free tier (or Cerebras; both speak the OpenAI chat-completions dialect, so switching is a base-URL change) to extract three fields from the free text: days_from_now (0–5), weather (cold / normal / hot), and hours (max non-stop flight time). I force response_format: { type: "json_object" } so I get parseable JSON, not prose. The model's one-line summary becomes the next streamed step.


Departure airport. With the user's coordinates I query AirLabs' nearby endpoint and pick the busiest airport in range. AirLabs is a flight-data API I reached for because it exposes exactly the three things this problem needs: nearby airports, scheduled routes, and airport coordinates. You can see the full catalog at airlabs.co. Stream: "Departing from Valencia Airport (VLC)."

Where can I go non-stop? This is the heart of it. The Routes DB returns every scheduled route out of an airport, and crucially each route carries a days array (which weekdays it operates) and a duration. I compute the target weekday from today + days_from_now, drop routes that don't fly that day, group what's left by arrival airport, and stream "37 non-stop destinations on WED. Measuring distances…"

Distances. A batch call to AirLabs' airports endpoint resolves coordinates and names for every arrival code at once. I run a haversine from the departure airport to each candidate, then filter by the hours budget, using the real schedule duration when present and a distance-based estimate when it isn't.

Weather. For the survivors I hit OpenWeather's 5-day / 3-hour forecast, reduce each one to the target day's high and sky condition, and keep only the ones whose temperature lands in the band the user asked for. To stay friendly to free-tier limits I cap the fan-out and run the requests in small concurrent batches.

Result. Stream the final list: city, IATA, distance, estimated flight time, forecast, and a couple of real flight numbers with departure times.

Seven visible steps, one of which happens to be a language model. From the user's side it reads like the assistant is reasoning out loud; under the hood it's a chain of HTTP calls narrating itself.

Gotchas worth the paragraph

A few things bit me, and they're the reason I'm writing this down:

Proxy buffering will eat your stream.
Reverse proxies love to buffer responses for efficiency, which is death for SSE: the user sees nothing, then everything at once. X-Accel-Buffering: no plus Cache-Control: no-transform covers nginx and most CDNs.

Idle timeouts kill long chains. A pipeline that waits on five external APIs can easily out-sit a default socket timeout. Bun's idleTimeout defaults low; bump it.

Geolocating localhost returns nothing. During development your client IP is 127.0.0.1, which no geo-IP service can place. I detect private ranges and fall back to the server's public IP (and added a DEMO_LAT / DEMO_LON override so a screen-recording is reproducible).

Order vs. parallelism. Running geo and the LLM concurrently is a real latency win, but I still emit the steps in a fixed, readable order. Fast and legible aren't the same axis; decouple them.

Why I think this leaves the chatbox

The reason chat UIs stream isn't that tokens must appear gradually. It's that a long operation feels broken when it's silent, and feels trustworthy when it narrates itself. That's not an LLM property. It's a property of any multi-second task.

A deploy pipeline, a data import, a multi-leg search, a fraud check, an agentic checkout: all of them are sequences of slow steps that currently hide behind a spinner and could instead show their work, with the same one-POST-streams-back plumbing and the same fifteen-line client reader. The travel assistant is just my proof that the mechanism is domain-agnostic.

The next time you're staring at a loading spinner that's been spinning for four seconds, ask why it isn't telling you what it's doing. The tools to fix that have been in the browser the whole time. I just had to stop reaching for EventSource.
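As a postscript on that client reader: the frame-boundary handling is easiest to trust once it is pulled out into a pure function and fed awkward chunk boundaries directly. The following is a minimal sketch, not code from the demo repo. The createFrameParser name is hypothetical, and it assumes every event's data is JSON, as in the reader shown earlier. It goes slightly further than that reader by accepting CRLF line endings and joining multiple data: lines, both of which the event-stream format allows; bare CR line endings are left unhandled.

```javascript
// Hypothetical helper: createFrameParser is not part of the demo repo.
// It keeps a partial frame in a buffer until its closing blank line arrives.
function createFrameParser(onEvent) {
  let buffer = "";
  return function push(chunk) {
    // Normalize CRLF so "\r\n\r\n" also counts as a frame boundary.
    buffer = (buffer + chunk).replace(/\r\n/g, "\n");
    const frames = buffer.split("\n\n");
    buffer = frames.pop(); // possibly incomplete tail, kept for the next chunk
    for (const frame of frames) {
      // Collect every data: line; the SSE format joins them with "\n".
      const data = frame
        .split("\n")
        .filter((line) => line.startsWith("data:"))
        .map((line) => line.slice(5).replace(/^ /, ""))
        .join("\n");
      if (data) onEvent(JSON.parse(data)); // assumption: payloads are JSON
    }
  };
}

// Usage: one-and-a-half frames in the first chunk, the rest in the second.
const events = [];
const push = createFrameParser((e) => events.push(e));
push('data: {"step":"Understand"}\n\ndata: {"st');
push('ep":"Geo"}\n\n');
// events now holds both steps, in the order they were sent.
```

In the streaming loop, each decoded chunk from reader.read() would be passed to push(), so the parsing logic can be unit-tested without a network connection.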

