Scaling AI Inference on Kubernetes: The Case for Token-Based Autoscaling

You Scaled the Wrong Thing

We hit the wall six weeks after shipping LLM inference to production. Not a crash, not an outage - just latency quietly climbing past SLO while every metric we were watching looked fine. CPU normal. Pod count healthy. Request rate within bounds.

It took an hour of digging to find the actual problem: our autoscaler was treating a 200-token summary request and an 8,000-token document analysis as identical units of work. They are not. One of them is roughly 40x more expensive on the GPU. We had instrumented everything except the thing that actually mattered, and our HPA config had been making confidently wrong scaling decisions since day one.

This is not a configuration problem. It is a conceptual one. HPA was built for stateless HTTP workloads where requests are roughly equivalent units of work. LLM inference requests are not. The only meaningful unit of work in inference is the token.

Why Request Count Fails for Inference

For a typical HTTP service, requests are reasonably homogeneous. There is variance in response time, but it clusters tightly around a mean. Scaling on request count or CPU utilization works because both are reliable proxies for actual load.

LLM inference has two properties that break this assumption completely.

Token count variance is enormous and unpredictable. A request with a 200-token prompt and a 50-token completion takes roughly 5-10x less GPU time than a request with a 4,000-token prompt and a 500-token completion. Both register as exactly one request. In a mixed-traffic production environment, this variance is not an edge case - it is the norm.

GPU compute is the bottleneck, and tokens are what consume it. Every token generated requires a forward pass through the model. Every token in the prompt occupies KV cache memory. Neither of these costs is captured by CPU metrics or request count. You can have ten concurrent requests all sitting at low CPU while the GPU is saturated, and your autoscaler will not react.
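To see the blind spot in numbers, here is a minimal Python sketch comparing two windows of traffic with the same request count. The `Request` type and the two traffic mixes are hypothetical, built from the prompt and completion sizes quoted above. Raw token count is only a rough stand-in for GPU time, since prefill and decode tokens do not cost the same, so the token ratio it prints is not the GPU-time ratio.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    prompt_tokens: int
    completion_tokens: int

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens


def summarize(requests: list) -> tuple:
    """Return (request_count, total_tokens) for a window of requests."""
    return len(requests), sum(r.total_tokens for r in requests)


# Two hypothetical windows with the same number of requests.
short_mix = [Request(200, 50)] * 10
long_mix = [Request(4000, 500)] * 10

short_count, short_tokens = summarize(short_mix)
long_count, long_tokens = summarize(long_mix)

print(short_count, long_count)      # what a request-count autoscaler sees
print(short_tokens, long_tokens)    # what the GPU actually has to process
print(long_tokens / short_tokens)   # token-load ratio between the two windows
```

Both windows report ten requests, yet the second carries 45,000 tokens against 2,500 in the first. A request-count signal cannot tell them apart.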
The net effect: request-count-based HPA has a systematic blind spot for the exact workload characteristic that determines whether your inference pods are overloaded or idle.

The Metric That Actually Matters

The core shift is this: stop thinking in requests per second and start thinking in tokens per second - specifically, the ratio of tokens being processed to tokens your current fleet can handle.

The two metrics you need are:

Token throughput - total tokens processed per second across prompt and completion tokens. This is your demand signal. It tells you how much actual inference work is happening right now.

Token capacity headroom - how close your current fleet is to its maximum sustainable token throughput before latency degrades. This is your supply signal. Every model on every hardware configuration has a throughput ceiling beyond which time-to-first-token and inter-token latency start climbing.

The autoscaling signal becomes:

```
scale_up when:   current_token_throughput / max_sustainable_throughput > 0.70
scale_down when: current_token_throughput / max_sustainable_throughput < 0.30
```

Translated into a Kubernetes HPA manifest using a custom metric from Prometheus Adapter. HPA takes a single target value rather than separate up and down thresholds, so the manifest targets 0.70 and relies on the `scaleDown` policy to keep replica removal conservative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
  namespace: inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: throughput_utilization_ratio  # custom metric via Prometheus Adapter
        target:
          type: AverageValue
          averageValue: "700m"  # 0.70: scale up above 70% utilization
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 90
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120
```

Thresholds depend on your latency SLOs and traffic variability.
The point is the ratio - you are now scaling against actual inference load, not a proxy that correlates poorly with it.

What to Instrument

Most inference servers (vLLM, TGI, Triton) expose the raw data you need via `/metrics`. The challenge is surfacing it to your autoscaler via Prometheus and the custom metrics API.

From your inference server

The names below are descriptive; exact metric names differ by server, and the rates are typically derived from counters, as in the recording rules that follow.

- prompt_tokens_per_second - tokens processed in the prefill phase across all active requests
- completion_tokens_per_second - tokens generated in the decode phase across all active requests
- total_tokens_per_second - sum of the above; your primary load signal
- queue_depth - requests waiting for a GPU slot; the leading indicator before throughput degrades
- gpu_kv_cache_utilization - KV cache memory consumption as a percentage; a hard ceiling that causes OOM if ignored
- time_to_first_token_p95 - your latency canary; rising TTFT before throughput peaks is your earliest warning

Prometheus recording rules

Compute `throughput_utilization_ratio` as a recording rule so Prometheus Adapter can serve it as a custom metric:

```yaml
# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: llm-inference-rules
  namespace: inference
spec:
  groups:
    - name: llm.inference.throughput
      interval: 15s
      rules:
        # Total tokens/sec per pod: prompt + completion.
        # Keep the namespace label so Prometheus Adapter can map each series to a pod.
        - record: llm:total_tokens_per_second:pod
          expr: |
            sum by (namespace, pod) (
              rate(vllm:prompt_tokens_total[1m])
              + rate(vllm:generation_tokens_total[1m])
            )
        # Utilization ratio: actual throughput vs benchmarked ceiling.
        # llm_max_tokens_per_second_per_pod must be a single series set to your benchmarked value.
        # scalar() divides every pod's series by it; vector-matching modifiers are not valid here.
        - record: llm:throughput_utilization_ratio:pod
          expr: |
            llm:total_tokens_per_second:pod
              / scalar(llm_max_tokens_per_second_per_pod)
        # Effective batch size (average tokens per completed request); tracks traffic mix shift over time
        - record: llm:effective_batch_size:pod
          expr: |
            sum by (namespace, pod) (
              rate(vllm:prompt_tokens_total[1m])
              + rate(vllm:generation_tokens_total[1m])
            )
            / sum by (namespace, pod) (rate(vllm:request_success_total[1m]) > 0)
```
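The arithmetic behind the utilization ratio, and what HPA does with it, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the autoscaler's implementation: the function names, the 1,000 tokens/second ceiling, and the counter deltas are all hypothetical. `token_rate` is a simplified stand-in for PromQL `rate()`, and `desired_replicas` applies only HPA's documented core formula, `ceil(currentReplicas * currentValue / targetValue)`, ignoring its tolerance band and stabilization windows.

```python
import math


def token_rate(prev_total: float, curr_total: float, interval_s: float) -> float:
    """Tokens per second from two samples of a monotonically increasing counter."""
    return max(curr_total - prev_total, 0.0) / interval_s


def utilization_ratio(tokens_per_s: float, ceiling_tokens_per_s: float) -> float:
    """Fraction of the benchmarked per-pod ceiling currently in use."""
    return tokens_per_s / ceiling_tokens_per_s


def desired_replicas(current_replicas: int, avg_ratio: float, target: float,
                     min_replicas: int = 2, max_replicas: int = 20) -> int:
    """HPA core formula, clamped to the replica bounds from the manifest."""
    wanted = math.ceil(current_replicas * avg_ratio / target)
    return max(min_replicas, min(max_replicas, wanted))


# Hypothetical benchmarked ceiling and 60-second token-counter deltas for three pods.
CEILING = 1000.0
counter_deltas = {"pod-a": 54_000, "pod-b": 48_000, "pod-c": 60_000}

ratios = {
    pod: utilization_ratio(token_rate(0, delta, 60), CEILING)
    for pod, delta in counter_deltas.items()
}
avg_ratio = sum(ratios.values()) / len(ratios)

busy = desired_replicas(current_replicas=3, avg_ratio=avg_ratio, target=0.70)
quiet = desired_replicas(current_replicas=5, avg_ratio=0.20, target=0.70)
print(busy)   # → 4
print(quiet)  # → 2
```

With three pods averaging about 90% of the ceiling against a 0.70 target, the formula asks for a fourth replica. With five pods at 20%, it asks for two, which is also the configured floor.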
Prometheus Adapter config

Wire the recording rule into the custom metrics API:

```yaml
# prometheus-adapter-config.yaml
rules:
  - seriesQuery: 'llm:throughput_utilization_ratio:pod{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: { resource: namespace }
        pod: { resource: pod }
    name:
      matches: "llm:throughput_utilization_ratio:pod"
      as: "throughput_utilization_ratio"
    metricsQuery: 'avg(llm:throughput_utilization_ratio:pod{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```

The `llm_max_tokens_per_second_per_pod` value requires benchmarking your specific model on your specific hardware. Run load tests at different prompt/completion ratios, find where TTFT p95 exceeds your SLO, and set that throughput as your ceiling. This is a one-time calibration that pays ongoing dividends.

Scaling Behavior: What Changes in Practice

The table below summarizes how behavior differs across the scenarios that matter most in production:

| Scenario | Request-Count HPA | Token-Throughput Scaling |
|----|----|----|
| Surge of short prompts | Over-scales; low GPU utilization | Scales proportionally; GPU stays efficient |
| Few long-context requests | Under-scales; GPU saturated | Detects load correctly; scales up |
| Mixed traffic shift | Scales on noise | Tracks actual load distribution |
| Idle after peak | Slow scale-down (request lag) | Fast scale-down (throughput drops immediately) |
| KV cache pressure | Not detected | Captured via cache utilization metric |

The last row matters more than it looks. KV cache exhaustion causes inference servers to either queue requests or OOM-kill pods. Neither is visible to request-count HPA. Token-throughput scaling with a KV cache utilization signal can trigger scale-up before OOM events occur.

Scaling Floors, Ceilings, and Cold Starts

A few practical realities to handle before shipping this to production:

Set a meaningful minimum replica count. Scaling to zero for inference workloads looks attractive on a cost spreadsheet and is brutal in practice.
LLM inference pods take 60–120 seconds to start depending on model size and whether you are pulling weights from remote storage. A minimum of two replicas buys you headroom to absorb traffic spikes while scale-up completes. This is already reflected in the HPA manifest above (`minReplicas: 2`).

Tune cooldown periods for GPU provisioning latency. HPA's default scale-down stabilization window is 300 seconds. For inference, the scale-up window matters more: if you scale up and immediately scale back down before new pods are ready, you create oscillation. The `scaleUp.stabilizationWindowSeconds: 90` in the manifest above is a safe starting point - verify it against your actual pod startup time.

Benchmark your throughput ceiling before you need it. The max sustainable throughput value is the foundation of this entire approach. Benchmark it at your actual traffic mix - not just uniform prompts - because the prompt/completion ratio shifts it significantly. A model that handles 2,000 tokens/second under short-prompt traffic may only sustain 800 tokens/second under long-context workloads.
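One way to turn load-test results into a ceiling is sketched below in Python. The `LoadTestStep` type, the 2-second TTFT SLO, and all the measurements are hypothetical, with the numbers chosen only to echo the 2,000 and 800 tokens/second example above. Taking the lowest ceiling across traffic mixes is one conservative choice, not the only reasonable one.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LoadTestStep:
    tokens_per_second: float  # sustained per-pod throughput during the step
    ttft_p95_seconds: float   # observed p95 time-to-first-token during the step


def sustainable_ceiling(steps: List[LoadTestStep], ttft_slo_seconds: float) -> Optional[float]:
    """Highest tested throughput reached before the first SLO breach, or None."""
    ceiling = None
    for step in sorted(steps, key=lambda s: s.tokens_per_second):
        if step.ttft_p95_seconds > ttft_slo_seconds:
            break  # first breach: every higher step is past the knee
        ceiling = step.tokens_per_second
    return ceiling


# Made-up results for two traffic mixes on the same hardware.
short_prompt_mix = [
    LoadTestStep(1000, 0.6), LoadTestStep(1500, 0.9),
    LoadTestStep(2000, 1.7), LoadTestStep(2500, 3.1),
]
long_context_mix = [
    LoadTestStep(400, 0.8), LoadTestStep(800, 1.8), LoadTestStep(1200, 4.0),
]

TTFT_SLO_SECONDS = 2.0
short_ceiling = sustainable_ceiling(short_prompt_mix, TTFT_SLO_SECONDS)
long_ceiling = sustainable_ceiling(long_context_mix, TTFT_SLO_SECONDS)

# Conservative choice: size against the worst mix you expect to serve.
fleet_ceiling = min(short_ceiling, long_ceiling)
print(short_ceiling, long_ceiling, fleet_ceiling)
```

Sizing against the long-context ceiling wastes some capacity when traffic is mostly short prompts. Sizing against the short-prompt ceiling means the utilization ratio under-reports load whenever long requests dominate.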
Pre-warm on predictable traffic patterns. A KEDA ScaledObject can combine cron triggers with the token-throughput metric. KEDA creates and manages its own HPA for the target Deployment, so use this ScaledObject in place of the hand-written HPA rather than alongside it:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-inference-prewarm
  namespace: inference
spec:
  scaleTargetRef:
    name: llm-inference
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
    # Pre-scale to 6 replicas before morning traffic ramp
    - type: cron
      metadata:
        timezone: America/Los_Angeles
        start: "50 7 * * 1-5"   # 7:50 AM weekdays
        end: "0 9 * * 1-5"      # back to HPA control at 9 AM
        desiredReplicas: "6"
    # Pre-scale for scheduled batch jobs
    - type: cron
      metadata:
        timezone: America/Los_Angeles
        start: "55 23 * * *"    # 11:55 PM nightly
        end: "30 0 * * *"       # 12:30 AM
        desiredReplicas: "4"
    # Token throughput metric for reactive scaling outside cron windows
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: throughput_utilization_ratio
        # sum, not avg: with KEDA's default AverageValue target the query
        # result is divided by the replica count before comparing to threshold
        query: |
          sum(llm:throughput_utilization_ratio:pod{namespace="inference"})
        threshold: "0.7"
```

Token-throughput scaling is reactive by design; pre-warming makes it proactive where traffic is predictable.

The SLO Rewrite

Once you are scaling on token throughput, your SLOs need to catch up. Request-based SLOs are still valid but incomplete. The SLOs that actually describe inference health are:

- Time-to-first-token p95 - the latency the user actually feels; should be your primary SLO
- Inter-token latency p95 - smoothness of streaming responses; degrades under GPU saturation before TTFT does
- Token throughput utilization - should stay below your scale-up threshold during normal operation; if it is consistently at 85%+ you have a capacity planning problem, not a scaling problem
- Queue depth sustained > N for M minutes - the leading indicator that your scaling is not keeping up with demand

These SLOs give your on-call engineers something actionable. "p95 TTFT is 4.2 seconds against a 2-second SLO" is diagnosable. "Request error rate is 0.1%" in an LLM context tells you almost nothing.
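The "queue depth sustained > N for M minutes" condition is easy to get subtly wrong, so here is a minimal Python sketch of one way to evaluate it. The function name, the threshold of 10, and the samples are hypothetical. In practice this logic would more likely live in an alerting rule than in application code.

```python
from typing import List, Tuple


def sustained_breach(samples: List[Tuple[float, float]],
                     threshold: float,
                     window_seconds: float) -> bool:
    """True if the value stayed above threshold continuously for at least window_seconds.

    samples is a time-ordered list of (timestamp_seconds, queue_depth) pairs.
    """
    breach_start = None
    for timestamp, depth in samples:
        if depth > threshold:
            if breach_start is None:
                breach_start = timestamp  # a new run above the threshold begins
            if timestamp - breach_start >= window_seconds:
                return True
        else:
            breach_start = None  # any dip resets the run
    return False


# Hypothetical queue-depth samples taken once a minute.
queue_depth = [(0, 2), (60, 12), (120, 15), (180, 11), (240, 14), (300, 13), (360, 3)]

print(sustained_breach(queue_depth, threshold=10, window_seconds=180))  # → True
print(sustained_breach(queue_depth, threshold=10, window_seconds=300))  # → False
```

The reset on any dip is what separates a sustained backlog from a series of short bursts. Averaging queue depth over the window would blur the two together.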
Key Takeaways

1. Request count is a fundamentally broken scaling signal for LLM inference. Requests vary by orders of magnitude in GPU cost depending on prompt and completion length - one metric cannot represent both.
2. Token throughput utilization ratio is the correct primary scaling signal - the proportion of your current fleet's maximum sustainable throughput that is actively being consumed. Build your autoscaling logic around this number.
3. Benchmark your throughput ceiling on real traffic mixes before going to production. The ceiling shifts significantly based on prompt-to-completion ratio and is the load-bearing number your entire scaling strategy depends on.
4. Rewrite your SLOs around time-to-first-token and inter-token latency, not request rate. These are the metrics that reflect actual user experience and that your new scaling strategy is designed to protect.

View original source — Hacker Noon ↗
