
The first sign of trouble was a support ticket. Not a spike in 5xx errors, not a latency alarm, not a failed health check. A user had asked the product's AI assistant to summarize a contract clause, and the assistant had done so confidently and completely incorrectly. The clause was about indemnification; the summary described something closer to warranty terms. The system had returned HTTP 200 in 340 milliseconds, well within every threshold we had defined. By every automated measure, nothing was wrong.

That incident forced a conversation the team had been quietly avoiding: what does "reliability" actually mean when your application's core behavior is a language model generating text? The SLO we had (99.5% availability, p95 latency under two seconds) was measuring the envelope around the intelligence, not the intelligence itself. We were monitoring the truck, not what was in the box.

The Problem with Importing SRE Directly

SLOs work beautifully for deterministic systems. You define what "working" means, instrument it, measure it over a rolling window, and budget how much failure you can tolerate before slowing down releases. The whole model rests on one assumption so fundamental it's rarely stated: that an error is observable. A request either fails or it succeeds, and the system can tell the difference.

LLM-powered applications break that assumption at the foundation. A response can be fluent, well-formatted, on-topic, and returned quickly, and still be wrong in a way that matters. Worse, the wrongness isn't always consistent; the same prompt sent twice can produce two different responses with different accuracy levels, because temperature, context caching, and silent upstream model updates all introduce variance. You can't define a clean failure mode because failure isn't binary.

The first approach most teams try, including ours, is to just add more metrics. Track token counts. Monitor refusal rates. Alert on response length anomalies.
This produces dashboards that look comprehensive and provide almost no signal. Token count doesn't correlate with accuracy. Refusal rates tell you when the model won't answer, not when it answers incorrectly. Length anomalies are at best a very weak proxy for quality.

Here's where things got genuinely difficult: the alternative, automated quality scoring, introduces its own reliability problem. You're using a second model to evaluate the first model's output, which means you now have two probabilistic systems in your observability stack. When the quality score degrades, you don't immediately know if the production model got worse or if the evaluator model changed behavior. It's a hall of mirrors if you're not careful.

Decomposing What "Good" Means for a Specific Application

The conceptual shift that actually helped was to stop trying to define a single SLO for the LLM feature and instead ask a more specific question: what are the distinct ways this feature can fail, and which of them are measurable?

For the contract summarization product, we ended up with four failure categories. The first two were traditional: the service was unavailable, or the service was slow. These we could measure exactly as before. The third was structural: the response was malformed, meaning truncated, missing required fields in a JSON schema we controlled, or containing a refusal when the query was clearly in scope. This was partially measurable with deterministic checks. The fourth was semantic: the response was coherent and complete but factually wrong relative to the input. This was the hard one.

Rather than pretending we could fully automate semantic quality measurement, we built a layered system. Deterministic checks ran on every response. Automated quality scoring via a separate evaluator ran on a sampled subset: roughly 5% of traffic for low-stakes queries and 20% for queries tagged as high-stakes by a lightweight classifier on the input side.
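The first two layers can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system described here: the field names in `REQUIRED_FIELDS`, the stakes labels, and both function names are hypothetical, and only the 5% and 20% sampling rates come from the text. Refusal detection is left out because it needs application-specific heuristics.

```python
import hashlib
import json

# Sampling rates from the text; the label names are hypothetical.
SAMPLE_RATES = {"low_stakes": 0.05, "high_stakes": 0.20}

# Hypothetical required fields for the summary schema.
REQUIRED_FIELDS = ("clause_type", "summary")


def structural_check(raw_response: str) -> bool:
    """Deterministic check for every response: a JSON object with all required fields."""
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError:
        # Truncated or otherwise malformed output fails here.
        return False
    return isinstance(payload, dict) and all(f in payload for f in REQUIRED_FIELDS)


def should_evaluate(request_id: str, stakes: str) -> bool:
    """Decide whether this response is sent to the evaluator model.

    Hashing the request ID maps each request to a stable bucket in [0, 1],
    so the decision is reproducible when an incident is replayed later.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < SAMPLE_RATES[stakes]
```

Hashing instead of calling a random number generator is a design choice for this sketch: any request sampled at the low-stakes rate would also be sampled at the high-stakes rate, which keeps the two samples comparable.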
Human review ran weekly on a small batch, stratified across query types. The human review was the calibration layer, not the primary measurement layer.

The SLO itself was then split into three components with different measurement methods and different burn-rate implications:

```yaml
# Simplified SLO structure
slo:
  operational:
    availability: 99.5%        # HTTP 200 within timeout
    latency_p95: 2000ms
    measurement: real-time
  structural_quality:
    schema_compliance: 99.0%   # Valid JSON, required fields present
    refusal_rate_max: 2.0%     # For in-scope queries
    measurement: real-time, all traffic
  semantic_quality:
    accuracy_floor: 92.0%      # Human-calibrated evaluator score
    measurement: sampled, calibrated weekly
```

The semantic quality SLO didn't have an error budget in the traditional sense. Instead, dropping below the accuracy floor triggered a review gate: no configuration changes, no prompt engineering deploys, and no retrieval pipeline modifications until the weekly human review had examined the sampled failures and signed off on a root cause. It's slower and less automated than a burn-rate alert, but it's also more honest about the inherent uncertainty in what you're measuring.

The Latency Problem Nobody Talks About

There's a second SLO design challenge specific to LLMs that gets less attention: latency measurement is genuinely different when responses are streamed token by token. A p95 of two seconds sounds acceptable until you realize that a two-second time-to-first-token followed by eight more seconds of streaming is a fundamentally different user experience from a ten-second wait followed by the full response appearing instantly. Both have the same total latency, and that single number describes neither experience.

We ended up tracking three latency metrics independently: time to first token (TTFT), inter-token latency (how smooth the stream felt), and total time to completion (TTTC).
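To make the three metrics concrete, here is a small Python sketch that derives them from per-token arrival timestamps. The `StreamLatency` type and `stream_latency` function are hypothetical names for illustration, and using the mean gap as the inter-token figure is an assumption; a production setup would more likely record each gap in a histogram.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class StreamLatency:
    ttft: float              # time to first token, seconds
    mean_inter_token: float  # average gap between consecutive tokens, seconds
    tttc: float              # total time to completion, seconds


def stream_latency(request_sent: float, token_times: list[float]) -> StreamLatency:
    """Compute the three latency metrics from token arrival timestamps.

    All timestamps are seconds on the same clock; token_times is in arrival order.
    """
    if not token_times:
        raise ValueError("no tokens received")
    # Gaps between consecutive tokens; empty if only one token arrived.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return StreamLatency(
        ttft=token_times[0] - request_sent,
        mean_inter_token=mean(gaps) if gaps else 0.0,
        tttc=token_times[-1] - request_sent,
    )


# The two experiences from the text: same total time, very different TTFT.
streamed = stream_latency(0.0, [2.0, 4.0, 6.0, 8.0, 10.0])
buffered = stream_latency(0.0, [10.0, 10.0, 10.0, 10.0, 10.0])
```

Here `streamed` and `buffered` share a TTTC of ten seconds, but their TTFT values are two seconds and ten seconds respectively, which is exactly the difference a total-latency metric hides.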
The SLO was defined primarily on TTFT, because if nothing appears for more than two seconds, users start refreshing. A secondary soft limit on TTTC covered batch-style queries where streaming wasn't user-facing. Inter-token latency was tracked but not part of the formal SLO; it fed into UX investigations rather than reliability decisions.

```
# Prometheus metrics for streaming LLM latency
llm_time_to_first_token_seconds{service, model, query_type}
llm_inter_token_latency_seconds{service, model}
llm_total_completion_seconds{service, model, query_type}

# SLO targets
# TTFT p95 < 1.5s (hard SLO, burns error budget)
# TTTC p95 < 12s  (soft limit, triggers review not budget burn)
```

In hindsight, defining the SLO on TTFT rather than total latency was one of the better decisions we made early on. It aligns the reliability metric with actual user abandonment behavior, which matters more for retention than the total time to generate a complete response.

Mistakes We Made and What We'd Do Differently

The biggest mistake was defining the semantic quality SLO too broadly, as a single accuracy score across all query types. In practice, the model performed very differently on short factual queries versus long multi-document synthesis tasks. Averaging them together meant the score looked acceptable when in reality we had a significant accuracy problem in the synthesis category that was being masked by high performance on the easier queries. Segment by query complexity and type from the beginning. Don't let your SLO hide the variance.

The second mistake was treating the evaluator model as ground truth rather than a noisy signal. When our evaluator's calibration drifted, which it did after a silent update to the underlying model we were using for evaluation, we spent two weeks investigating a quality degradation that didn't exist.
Now the weekly human review explicitly includes calibration checks: a fixed set of golden examples with known correct answers is run through the evaluator to verify its own accuracy hasn't shifted.

A subtler mistake: we didn't think hard enough about who owns semantic quality violations. Operational SLOs have a clear owner, the team running the service. But a semantic quality failure might be caused by the model itself, by the prompt engineering, by the retrieval system providing bad context, or by the input data being ambiguous. Without clear ownership, quality incidents became blame-diffusion exercises. Define ownership before you define the SLO.

Key Takeaways

Separate your SLO into operational, structural, and semantic layers. Each requires different measurement approaches and different response protocols when targets are missed. Don't try to collapse them into a single reliability number; the aggregation destroys the signal.

For latency, track time to first token as the primary SLO metric for streaming applications. Total latency matters but is a secondary concern for user-facing experience.

Automated quality scoring is a signal, not a source of truth. Calibrate it continuously against human review, and treat the evaluator itself as a system that can drift and fail.

Define SLO ownership explicitly before you go to production. Quality failures in LLM systems are often multi-causal, and ambiguous ownership means nobody fixes them.

Consider when NOT to use this approach: if your LLM feature is low-stakes or strictly informational with no business-critical accuracy requirement, a traditional operational SLO may be sufficient. The three-layer model adds real operational complexity. Don't introduce that overhead for a feature where semantic failures are recoverable user experiences rather than business risks.

Conclusion

The discomfort at the center of LLM reliability engineering is that we're trying to make probabilistic systems accountable to deterministic metrics.
Error budgets were designed for systems where "correct" is unambiguous. When you're operating a system where correctness is a distribution and not a boolean, the frame starts to creak. The approach described here doesn't fully resolve that tension; it just makes it explicit and manageable. A layered SLO with honest measurement methods and human calibration in the loop is less elegant than a single percentage target on a Grafana dashboard. But it's closer to what reliability actually means for these systems.

The deeper question is whether the SLO model is even the right abstraction for AI services long-term, or whether we're in a transitional period where we're using old vocabulary for a genuinely new kind of system. I suspect the teams that figure out what comes after the error budget, some framework that can account for accuracy as a first-class reliability property, will have a meaningful advantage. What that looks like, I don't think anyone has built yet.
View original source — Hacker Noon ↗

