
\ Why I built Respawn: a stateful OpenAI Responses API gateway for Ollama and self-hosted backends Local LLM infrastructure has become very good at inference. You can run models locally with Ollama, llama.cpp, vLLM, LM Studio, or other backends. You can expose an OpenAI-compatible endpoint. You can point clients at localhost and get tokens back. That is already useful. But after working with local LLM stacks, I kept running into the same gap: local inference is not the same thing as a local OpenAI-like platform. A model server can generate text. A platform API does more than that. Modern clients increasingly expect things like: stored response objects previous_response_id response lifecycle endpoints streaming events with a predictable shape tool-call protocol behavior file and image inputs background jobs cancellation consistent error payloads request IDs metrics readiness checks SDK ergonomics A thin “OpenAI-compatible” endpoint is often enough for simple prompts. It is usually not enough when you start wiring local models into agents, coding tools, tests, internal services, or anything that expects the OpenAI Responses API to behave like an actual API contract. That is why I built Respawn . Respawn is an open-source local OpenAI-shaped API gateway for self-hosted LLM backends. It sits in front of Ollama or other local backends and adds the API/platform layer that many modern clients expect: /v1/responses , previous_response_id , stored responses, normalized streaming, tool-call protocol data, files, background jobs, and observability. Respawn is a local OpenAI-shaped API gateway for self-hosted LLM backends. It is not an inference runtime. It does not load models. It does not schedule GPU work. It does not batch tokens. It does not quantize weights. It does not manage KV cache. Those jobs belong to the model backend underneath it. Respawn sits in front of the backend and provides the missing API/platform layer. OpenAI SDK / Codex / local agent / internal service -> Respawn -> Ollama or another local model backend The first backend is Ollama. The idea is simple: let inference servers stay focused on inference, and put the OpenAI-shaped API behavior in a gateway layer that can be tested, observed, and replaced independently. “OpenAI-compatible” is not one thing When people say a local backend is “OpenAI-compatible,” they usually mean one of several different things. Sometimes it means the backend accepts a request that looks like /v1/chat/completions . Sometimes it means the official OpenAI SDK can be pointed at a custom base_url . Sometimes it means streaming mostly works. Sometimes it means tool calls mostly work. But the newer Responses API raises the bar. It is not just a request body and a text response. It has state, lifecycle, output items, input items, stored responses, streaming event semantics, and tool-call protocol behavior. That is a very different compatibility problem. For example, consider conversation continuity. Without server-side state, the client usually has to keep the whole conversation history and resend it on every request. That works, but it turns state management into a client concern. Different clients do it differently. Long-running workflows become harder to inspect, replay, test, and debug. With previous_response_id , the client can point to a prior response, and the server can reconstruct the chain. That is the behavior Respawn implements locally. A follow-up request can reference a previous stored response. Respawn loads the response chain, validates access, reconstructs the context, appends the new input, and forwards the resulting prompt to the backend. The model backend still just sees a generation request. The gateway owns the Responses state. What Respawn actually does Respawn exposes a familiar /v1 API surface for OpenAI SDKs. The core endpoint is: POST /v1/responses It supports blocking, streaming, and background Responses flows. It also supports lifecycle endpoints such as: GET /v1/responses/{response_id} DELETE /v1/responses/{response_id} GET /v1/responses/{response_id}/input_items POST /v1/responses/{response_id}/cancel That means local responses can be created, stored, retrieved, inspected, cancelled, and soft-deleted through an OpenAI-shaped interface. Respawn can store state in Postgres or SQLite. The default Docker stack uses: Respawn Ollama Postgres VictoriaMetrics Grafana The goal is not only to get text back from a model. The goal is to make local inference easier to integrate into software systems that expect stable API behavior. Here is what using it with the OpenAI Python SDK looks like: from openai import OpenAI client = OpenAI( base_url="http://localhost:8080/v1", api_key="local-dev-key", ) response = client.responses.create( model="gpt-oss:120b", input="Explain Kubernetes in one sentence.", ) print(response.output_text) And a follow-up can use the previous response: follow_up = client.responses.create( model="gpt-oss:120b", previous_response_id=response.id, input="Now explain it to someone who only knows Linux processes.", ) print(follow_up.output_text) The backend does not need to understand previous_response_id . Respawn handles that. Tool calls should be protocol data, not hidden side effects One area where I wanted a clean boundary is tool calling. Respawn supports the Responses function-tool protocol, but it does not execute arbitrary user functions locally. The flow is: The client sends function tool definitions. The model may emit a function_call output item. The client executes the function. The client submits a function_call_output input item in a follow-up request. Respawn validates, stores, replays, and streams these protocol items. It does not decide that a function should run on your machine. That boundary matters. A gateway should preserve the protocol shape. It should not secretly become a shell, filesystem, browser, git client, or code execution environment unless that is explicitly part of its scope. This is also where details like tool identity and namespace preservation matter. Some clients, especially Codex-style or MCP-style clients, may rely on structured tool identity rather than trying to reverse-engineer it from a flattened function name. For example, preserving something like this is cleaner: { "name": "search", "namespace": "mcp__github" } than forcing clients to infer structure from something like this: { "name": "mcp__github__search" } That may look like a small protocol detail, but small shape differences are often what break strict clients. Respawn’s position is: preserve Responses-shaped data, store it, stream it, replay it, and let clients own function execution. Local tools, not hosted tools OpenAI’s hosted tools are useful because they are integrated into the platform. Local LLM stacks need a different model. Respawn supports opt-in local web_search and image_generation paths, but they are intentionally not magic hosted tools. For search, Respawn can use configured local providers such as mock providers or SearXNG-style query providers. It emits Responses-style web_search_call data and can attach URL citations. For image generation, Respawn can use configured local image backends such as ComfyUI or Automatic1111. It emits Responses-style image_generation_call output with generated image data. The important part is the boundary: Ollama handles model inference. Search is handled by an explicitly configured local search provider. Image generation is handled by an explicitly configured local image backend. Respawn exposes these flows through a Responses-shaped API. That keeps the architecture understandable. There is no assumption that the model backend itself should suddenly become a browser, image generator, search engine, file system, and runtime. Streaming is part of the contract Streaming sounds simple until you need multiple clients to depend on it. Backend-native streams vary. Some stream text chunks. Some stream JSON lines. Some have provider-specific deltas. Some expose tool-call deltas differently. For humans watching text appear in a terminal, that may be fine. For applications, agents, and SDKs, the event shape matters. Respawn normalizes streaming into Responses-style lifecycle events. It emits stable Server-Sent Events with sequence numbers, text deltas, output item events, failure events, incomplete events, and function-call argument deltas. That gives clients something consistent to consume. This matters for coding agents and internal tools because they often do more than print tokens. They track items, update UI state, detect tool calls, collect usage, react to failures, and resume or inspect work. A local platform layer should make that predictable. Background jobs and lifecycle state Another difference between a generation endpoint and a platform API is lifecycle. Sometimes a request should return immediately and be polled later. Sometimes it should be cancellable. Sometimes a stored response should be retrieved after completion. Sometimes metrics should show whether background jobs are failing or timing out. Respawn supports background=true for local background Responses. A background response is stored, processed locally, and can be retrieved later. It can also be cancelled through the API. This is not distributed job orchestration. Jobs are local to the current Respawn process and configured backend. That limitation is intentional. Respawn is currently designed for one gateway instance connected to one configured model backend. It does not claim multi-replica consistency or distributed prompt cache behavior. The point is not to pretend a local stack is a global hosted platform. The point is to provide useful, explicit, testable platform semantics locally. Files, images, and context planning Respawn includes a local Files API subset. Files can be uploaded, listed, retrieved, downloaded, and deleted. File and image inputs can be normalized before they reach the backend. Vision capability checks can be applied based on the configured model. Respawn also includes local prompt templates, context planning, truncation, compaction, and prompt-cache accounting. The prompt-cache accounting is local accounting. It does not reuse backend KV tensors or skip prefill work. That distinction matters. Again, Respawn is not trying to take over the model runtime. It tries to provide API behavior, accounting, and compatibility around the runtime. Observability should not be an afterthought If you run local LLMs for experiments, logs are enough. If you run local LLMs as part of a development workflow, internal tool, or production-like environment, you need operational signals. Respawn emits: structured JSON request logs x-request-id headers HTTP metrics endpoint metrics latency metrics in-flight request metrics error metrics response metrics by model and mode backend metrics by backend, model, operation, and status readiness checks background job metrics streaming metrics file-storage metrics prompt-cache metrics The Docker stack provisions VictoriaMetrics and a Grafana dashboard. That may sound boring compared to model quality or inference speed, but it is the kind of boring that makes systems usable. When a local model stack fails, I want to know where: Did the request fail validation? Did the backend time out? Did streaming fail mid-response? Did a background job expire? Is the database unavailable? Is the backend reachable? Which model is producing errors? Are tool flows malformed? Are files being rejected? Did a compatibility path regress? Those are gateway questions, not model questions. Compatibility needs tests, not vibes One of the reasons I built Respawn as a separate gateway is that API compatibility can be tested independently from model quality. Respawn includes a real-backend benchmark suite that calls the gateway over HTTP and validates feature behavior, latency, SDK contract paths, metrics, and operational behavior. There is also a machine-readable compatibility manifest exposed through: GET /compatibility/responses This is important because “compatible” should not be a vague promise. A local gateway can say which features are supported, which are conditional, and which are explicitly unsupported. Respawn deliberately does not implement everything. Out of scope today: OpenAI Conversations API audio/realtime APIs browser actions general hosted tool execution arbitrary shell/filesystem/git/workspace execution distributed prompt caches dynamic backend routing multi-replica consistency That list is not an apology. It is part of the contract. A small local gateway should be clear about what it does and does not own. Testing with Codex locally One of the most useful smoke tests for this kind of gateway is a real client that expects Responses-style behavior. I tested Codex locally by pointing ~/.codex/config.toml at Respawn as the base URL and using the Responses wire API. The setup looked like this: Codex -> Respawn as the OpenAI-shaped base URL -> Ollama as the local model backend In my local smoke tests, Codex -> Respawn -> Ollama appears to work end-to-end, including Responses-style tool flows. I also tested Respawn’s opt-in local web_search and image_generation paths. Those are handled by configured local providers, not by Ollama itself. This is not a certification claim. Codex and Responses clients can change, and compatibility work is never finished. But it is a useful signal that a separate gateway layer can make local backends usable with stricter clients without forcing every inference server to implement the whole platform surface itself. Why not put all of this into the inference backend? That is a fair question. Some of this behavior could be implemented directly in Ollama, vLLM, llama.cpp, or other backends. But I think there is a strong argument for keeping the layers separate. Inference backends are already solving hard problems: model loading quantization GPU scheduling batching KV cache behavior context windows sampling tokenizer quirks throughput memory pressure hardware compatibility Those are deep runtime concerns. A Responses-compatible platform layer has different concerns: request validation response storage lifecycle endpoints previous-response reconstruction streaming event normalization tool protocol shape file input normalization OpenAI-shaped errors request IDs idempotency tenant scoping metrics compatibility tests Those are gateway concerns. Putting everything into every backend risks duplicating the same API-state machinery across multiple inference runtimes. A sidecar/gateway pattern keeps the contract in one layer and lets the backend focus on generation. Today Respawn targets Ollama first. The same model could apply to future adapters such as vLLM. What Respawn is not Respawn is not a replacement for Ollama. It is not a replacement for vLLM. It is not a model runtime. It is not a multi-provider enterprise router. It is not trying to make the model smarter. It is not trying to make inference faster. It is not trying to hide the fact that local backends have different capabilities. Respawn makes local inference easier to integrate, observe, test, and swap behind an API contract. That is the value proposition. If all you need is a simple local prompt, direct Ollama is probably enough. If you need OpenAI SDK ergonomics, stateful Responses behavior, stored responses, tool protocol shape, normalized streaming, background jobs, files, request IDs, metrics, and compatibility gates, a gateway layer starts to make sense. What I am looking for I am especially interested in feedback from people running local agent stacks, coding tools, or internal AI platforms. Some questions I am trying to answer: Should Responses state live in the inference backend or in a sidecar gateway? Which clients are strict enough that shape differences break them? How should local gateways handle tool identity and namespaces? What is the right boundary between local tool providers and model backends? Which Responses features matter most for real local workflows? Would a vLLM adapter be useful? What observability signals are missing from local LLM gateways? Local LLMs are getting better quickly. But if we want them to plug into modern tools, agents, and internal systems, we need more than fast token generation. We need local platform semantics too. Respawn is my attempt at that layer. Not a smarter model. Not a faster runtime. A local API contract around self-hosted inference. The project is open source and MIT licensed. Repository: https://github.com/robertomanfreda/respawn \ \
View original source — Hacker Noon ↗


