
Almost every "build an AI receptionist" tutorial ends the moment the AI picks up the phone. Production agents fail somewhere those tutorials never reach: handing the call to a human. The caller has spent ninety seconds explaining that they're calling about a crown that fell out, they have a flight tomorrow, and their insurance changed last month. Then the agent says "let me transfer you," the hold music plays, and a front-desk staffer answers with "Hi, how can I help you?" The caller has to start over.

That's a cold transfer, and it's the single fastest way to make an AI voice agent feel worse than no agent at all. A warm handoff is the opposite: the human takes over already knowing who's on the line, why they called, and what's been said.

This tutorial builds that handoff on the AssemblyAI Voice Agent API for the conversation, with Twilio for the telephony. We'll cover the two transfer mechanisms (SIP REFER versus a conferenced bridge), show how to keep a live transcript running across the bridge with Universal-3 Pro Streaming, and end with a working dental-office receptionist. The companion repository is linked at the end.

One thing to get straight up front, because it shapes the entire architecture: the Voice Agent API has no native "transfer to human" primitive. It is a conversation engine: STT, LLM, TTS, turn detection, and tool calling over one WebSocket. It does not own the PSTN connection, so it cannot route a call. The transfer happens at the telephony layer. The agent's only job is to decide a transfer should happen and package the context; your code and Twilio do the rest. Build it that way and the pattern is clean. Try to find a "transfer" field in the API and you'll waste an afternoon.
What a warm handoff is

A warm handoff (or warm transfer) is a call transfer in which the receiving human is given the caller's context (identity, intent, and conversation history) before or at the moment they take over, so the caller never has to repeat themselves. A cold transfer drops the caller onto a new person with no context.

For an AI voice agent, "context" has three parts, and a good handoff delivers all three:

- Identity and intent: who is calling and what they want, ideally as a one-line summary the human can read in under two seconds.
- Conversation history: what the agent already collected, such as the appointment date it offered, the insurance ID it captured, or the symptom the caller described.
- A live transcript: if the human joins a call still in progress, they should see words appearing in real time, not a frozen snapshot from the moment of transfer.

The first two are data you already have from the conversation. The third is the hard part, and it's where the choice of transfer mechanism decides everything.

The two transfer mechanisms

                    transfer_to_human tool fires
                                │
                ┌───────────────┴────────────────┐
                ▼                                ▼
            SIP REFER                    Conferenced bridge
    (carrier re-bridges legs)       (you hold the call in a room)
    ─────────────────────────       ─────────────────────────────
    • Your media path drops         • Your media path stays alive
    • AI + transcript session end   • Transcript keeps running
    • Caller ↔ human, you're out    • Caller ↔ human ↔ your fork
    • Cheapest (off the call)       • You pay for the held legs
    • Context must be pre-sent      • Context + live transcript
      (screen-pop before release)     delivered across the handoff
    • Best: human is on a PBX       • Best: warm intro, live assist,
      that owns the call after        compliance recording, QA

SIP REFER is the telephony standard for "you take this call, I'm stepping out." Your media server (or Twilio, via a cold-transfer call redirect) tells the carrier to re-bridge the caller to the human's number.
Once the REFER completes, your bridge and the Voice Agent API session are gone, because you're no longer in the media path. That's efficient and cheap, but it means the live transcript ends at the moment of transfer. If you use REFER, you must deliver context before you release: a screen-pop to the human's CRM, a summary pushed to their PBX, something out-of-band.

A conferenced bridge keeps you in the call. Instead of handing the caller off and leaving, you move the caller into a conference room and dial the human into the same room. Because your server is still a participant, you can keep forking the audio to a Universal-3 Pro Streaming session and keep producing a live transcript the whole time. It costs more, since you're paying to hold the legs, but it's the only way to give the human a transcript that's still updating as the caller speaks.

For a genuinely warm handoff where context continuity is the point, the conference is the right default. We'll build that path in full and show the REFER path as the cheaper alternative.

Architecture

    Caller ──PSTN──> Twilio number
                          │
                ┌─────────┴───────── AI phase ──────────────┐
                │  <Connect><Stream> (bidirectional)        │
                ▼                                           │
         bridge_server.py ──ws──> Voice Agent API           │
                │                  STT + LLM + TTS          │
                │                  + tool calling           │
                │                                           │
                │  tool.call: transfer_to_human(summary)    │
                └──────────────────┬────────────────────────┘
                                   ▼
                    redirect caller → <Conference>
                    + <Start><Stream> forks leg audio
                                   │
            ┌──────────────────────┼───────────────────────┐
            ▼                      ▼                       ▼
       Human dialed        transcript_server.py      Human's screen
       into conference     Universal-3 Pro           • context card
       (caller ↔ human)    Streaming (u3-rt-pro)     • LIVE transcript
                           speaker_labels: true        (updates in real time)

Three components do the work. The Voice Agent API runs the conversation and fires a tool call when it's time to transfer. The bridge server translates between Twilio Media Streams and the Voice Agent API, and orchestrates the telephony transfer when the tool fires.
The transcript server consumes a forked copy of the conference audio and runs Universal-3 Pro Streaming so the human sees a live, speaker-labeled transcript.

Before you start

You'll need:

- An AssemblyAI account with Voice Agent API access
- A Twilio account with a voice-capable number
- A human destination: a phone number, or a SIP endpoint if you're testing the REFER path
- Python 3.11 or 3.12. The transcript server uses the standard-library audioop module, which was removed in Python 3.13; on newer versions you need a backport such as the audioop-lts package.

Install:

    pip install fastapi uvicorn "websockets>=14" python-dotenv twilio numpy

Step 1: Define the transfer tool

The agent needs exactly one tool for handoff. Its job is to signal that a transfer should happen and to package the context. Notice it takes a summary argument: we make the LLM write the one-line context card as part of the call, so the human gets a clean summary instead of a raw transcript dump.

    # tools.py
    TRANSFER_TOOL = {
        "type": "function",
        "name": "transfer_to_human",
        "description": (
            "Transfer the caller to a human staff member. Call this when the "
            "caller explicitly asks for a person, when the request is clinical, "
            "a billing dispute, or an emergency, or when you cannot resolve the "
            "request after two attempts. Always write a clear one-line summary "
            "of who is calling and why before transferring."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                # Why the agent is handing off; drives routing and reporting.
                "reason": {
                    "type": "string",
                    "enum": ["asked_for_human", "clinical", "billing_dispute",
                             "emergency", "unresolved"],
                },
                # The context card the human reads before speaking.
                "summary": {
                    "type": "string",
                    "description": "One sentence: who is calling and what they need. "
                                   "Example: 'Maria Lopez, existing patient, crown fell "
                                   "out, flying tomorrow, wants an urgent appointment.'",
                },
                "callback_number": {"type": "string"},
            },
            "required": ["reason", "summary"],
        },
    }

The summary field is the entire warm-handoff payload in one string. Because the LLM produces it from the conversation it just had, it's already distilled, which is exactly what a busy human wants to read instead of a 90-second transcript.
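Since the summary and reason come from the LLM, it can be worth checking them before they reach a human's screen. The following is a minimal sketch, not part of the companion repository: the helper name clean_transfer_args and the 200-character cap are assumptions made for illustration, while the allowed reasons mirror the enum in TRANSFER_TOOL above.

```python
# Hypothetical helper (not from the article's repo): sanity-check the
# arguments the LLM passed to transfer_to_human before acting on them.
ALLOWED_REASONS = {
    "asked_for_human", "clinical", "billing_dispute", "emergency", "unresolved",
}
MAX_SUMMARY_CHARS = 200  # assumed cap so the card stays readable at a glance


def clean_transfer_args(args: dict) -> dict:
    """Return a normalized copy of the tool arguments, or raise ValueError."""
    reason = args.get("reason")
    if reason not in ALLOWED_REASONS:
        raise ValueError(f"unknown transfer reason: {reason!r}")

    # Collapse newlines and repeated spaces so the summary renders on one line.
    summary = " ".join(str(args.get("summary") or "").split())
    if not summary:
        raise ValueError("summary is required for a warm handoff")
    if len(summary) > MAX_SUMMARY_CHARS:
        summary = summary[: MAX_SUMMARY_CHARS - 3].rstrip() + "..."

    return {
        "reason": reason,
        "summary": summary,
        "callback_number": args.get("callback_number"),
    }


card = clean_transfer_args({
    "reason": "clinical",
    "summary": "Maria Lopez, existing patient,\n crown fell out, flying tomorrow.",
})
print(card["summary"])
```

If validation fails, one reasonable choice is to transfer anyway with a generic summary, since dropping a caller who asked for a human is worse than showing a thin context card.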
Step 2: The AI phase: bridge Twilio to the Voice Agent API

The AI handles the call first. This is the standard bridge: Twilio sends G.711 μ-law at 8 kHz, and the Voice Agent API accepts it natively when you set the encoding to audio/pcmu. A few details specific to this endpoint:

- The auth header is Authorization: Bearer YOUR_KEY. Note the Bearer prefix, which is unique to the Voice Agent API.
- The first message is a session.update event with all config nested under a session object. There is no session.start.
- Wait for session.ready before sending any input.audio frames.
- Telephony audio is audio/pcmu (8 kHz μ-law). Input and output encodings are configured independently; see the audio format reference.

    # bridge_server.py
    import asyncio
    import json
    import os

    import websockets
    from fastapi import FastAPI, Request, WebSocket
    from fastapi.responses import Response
    from twilio.rest import Client

    from prompts import SYSTEM_PROMPT
    from tools import TRANSFER_TOOL
    from transfer import start_warm_transfer  # defined in Step 3

    VOICE_AGENT_WS = "wss://agents.assemblyai.com/v1/ws"
    ASSEMBLYAI_KEY = os.environ["ASSEMBLYAI_API_KEY"]
    twilio = Client(os.environ["TWILIO_SID"], os.environ["TWILIO_TOKEN"])
    HUMAN_NUMBER = os.environ["HUMAN_AGENT_NUMBER"]

    app = FastAPI()


    @app.post("/twilio/voice")
    async def twilio_voice(request: Request):
        # Answer the inbound call by opening a bidirectional media stream.
        host = request.url.hostname
        twiml = f"""<?xml version="1.0" encoding="UTF-8"?>
    <Response>
      <Connect>
        <Stream url="wss://{host}/media-stream" />
      </Connect>
    </Response>"""
        return Response(content=twiml, media_type="application/xml")


    @app.websocket("/media-stream")
    async def media_stream(twilio_ws: WebSocket):
        await twilio_ws.accept()
        call_sid = {"value": None}
        stream_sid = {"value": None}

        session_config = {
            "type": "session.update",
            "session": {
                "system_prompt": SYSTEM_PROMPT,
                "tools": [TRANSFER_TOOL],
                "input": {"format": {"encoding": "audio/pcmu"}},
                "output": {"voice": "sophie", "format": {"encoding": "audio/pcmu"}},
            },
        }

        async with websockets.connect(
            VOICE_AGENT_WS,
            additional_headers={"Authorization": f"Bearer {ASSEMBLYAI_KEY}"},
        ) as va_ws:
            await va_ws.send(json.dumps(session_config))
            ready = asyncio.Event()
            transferring = asyncio.Event()
            pending_transfer = {"args": None}

            async def pump_twilio_to_va():
                async for raw in twilio_ws.iter_text():
                    event = json.loads(raw)
                    kind = event.get("event")
                    if kind == "start":
                        stream_sid["value"] = event["start"]["streamSid"]
                        call_sid["value"] = event["start"]["callSid"]
                    elif kind == "media" and ready.is_set():
                        await va_ws.send(json.dumps({
                            "type": "input.audio",
                            "audio": event["media"]["payload"],
                        }))
                    elif kind == "stop":
                        # Twilio ended the stream (hang-up or redirect). Close
                        # the agent socket so the other pump exits as well.
                        await va_ws.close()
                        return

            async def pump_va_to_twilio():
                async for raw in va_ws:
                    event = json.loads(raw)
                    t = event.get("type")
                    if t == "session.ready":
                        ready.set()
                    elif t == "reply.audio" and stream_sid["value"]:
                        await twilio_ws.send_text(json.dumps({
                            "event": "media",
                            "streamSid": stream_sid["value"],
                            "media": {"payload": event["data"]},
                        }))
                    elif t == "tool.call" and event["name"] == "transfer_to_human":
                        # Stash the transfer; fire it on the next reply.done so the
                        # agent's "one moment" line finishes playing before we move
                        # the call. Acting here would clip the handoff line.
                        pending_transfer["args"] = event.get("arguments", {})
                    elif t == "reply.done":
                        if pending_transfer["args"] and not transferring.is_set():
                            transferring.set()
                            asyncio.create_task(
                                start_warm_transfer(call_sid["value"],
                                                    pending_transfer["args"])
                            )

            await asyncio.gather(pump_twilio_to_va(), pump_va_to_twilio())

The timing here is deliberate. When transfer_to_human fires, we don't move the call immediately. We stash the arguments and wait for the next reply.done. That's because the agent typically speaks a handoff line ("I've got our front desk for you, one moment") in the same turn it calls the tool. Redirecting the Twilio leg the instant the tool.call arrives would tear down the <Connect><Stream> mid-sentence and clip that line. Acting on the following reply.done lets the audio finish first. The summary travels in pending_transfer["args"]["summary"].
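The stash-then-fire timing described above is easier to test when it is pulled out of the WebSocket loop. Here is a rough sketch of that logic as a small state holder. The class name TransferGate is hypothetical, and the event shapes (type, name, arguments) are simply the ones this article's bridge code reads, not a verified schema. Unlike the bridge code, this version also fires when the arguments dict is empty.

```python
from typing import Optional


class TransferGate:
    """Hold a requested transfer until the agent's current reply finishes."""

    def __init__(self) -> None:
        self.pending: Optional[dict] = None
        self.fired = False

    def on_event(self, event: dict) -> Optional[dict]:
        """Feed one event; return the transfer args when it is time to move the call."""
        kind = event.get("type")
        if kind == "tool.call" and event.get("name") == "transfer_to_human":
            # Stash only: the handoff line may still be playing.
            self.pending = event.get("arguments") or {}
        elif kind == "reply.done" and self.pending is not None and not self.fired:
            self.fired = True  # guard against a second reply.done
            return self.pending
        return None


gate = TransferGate()
events = [
    {"type": "reply.audio", "data": "..."},
    {"type": "tool.call", "name": "transfer_to_human",
     "arguments": {"reason": "clinical", "summary": "Crown fell out."}},
    {"type": "reply.audio", "data": "..."},
    {"type": "reply.done"},
    {"type": "reply.done"},  # must not trigger a second transfer
]
results = [gate.on_event(e) for e in events]
```

Only the first reply.done after the tool call yields the arguments; every other event returns None, so the caller of on_event can start the transfer exactly once.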
Notice we never send a tool.result back for transfer_to_human. That's intentional, and it's the one place this tool breaks the normal pattern. Redirecting the caller's Twilio leg ends the <Connect><Stream> and tears down the Voice Agent session, so there's no live session left to receive a result. In other words, transfer_to_human is a terminal tool. Non-terminal tools (booking a slot, looking up a patient) are different: for those you collect each tool.call, run the work, and return a tool.result keyed by call_id. The tool-calling docs cover that round-trip. A terminal transfer is the exception, not the rule.

Step 3: Execute the warm transfer (conferenced bridge)

This is the part the tutorials skip. We redirect the caller's live call into a conference, fork the conference audio to our transcript server, and dial the human in. The caller never hears a disconnect: the AI's voice fades into hold audio, then the human joins.

    # transfer.py (called from bridge_server)
    import asyncio
    import os

    from twilio.rest import Client

    # transfer.py builds its own client so it never has to import from
    # bridge_server, which would create a circular import.
    twilio = Client(os.environ["TWILIO_SID"], os.environ["TWILIO_TOKEN"])
    HUMAN_NUMBER = os.environ["HUMAN_AGENT_NUMBER"]
    PUBLIC_HOST = os.environ["PUBLIC_HOST"]  # your https/wss host
    CONTEXT_CARDS = {}  # conference_name -> summary, read by the human's screen


    async def start_warm_transfer(call_sid: str, args: dict):
        conference = f"handoff-{call_sid}"
        CONTEXT_CARDS[conference] = {
            "reason": args.get("reason"),
            "summary": args.get("summary"),
            "callback": args.get("callback_number"),
        }

        # 1. Redirect the caller into a conference, and fork the leg audio
        #    to our transcript server with <Start><Stream>.
        caller_twiml = f"""<?xml version="1.0" encoding="UTF-8"?>
    <Response>
      <Start>
        <Stream url="wss://{PUBLIC_HOST}/transcript-stream?room={conference}" />
      </Start>
      <Dial>
        <Conference startConferenceOnEnter="false"
                    waitUrl="https://twimlets.com/holdmusic">{conference}</Conference>
      </Dial>
    </Response>"""
        # The Twilio client is blocking, so run it off the event loop.
        await asyncio.to_thread(twilio.calls(call_sid).update, twiml=caller_twiml)

        # 2. Dial the human into the same conference. They enter and the
        #    conference starts; the caller comes off hold automatically.
        human_twiml = f"""<?xml version="1.0" encoding="UTF-8"?>
    <Response>
      <Dial>
        <Conference startConferenceOnEnter="true"
                    endConferenceOnExit="true">{conference}</Conference>
      </Dial>
    </Response>"""
        await asyncio.to_thread(
            twilio.calls.create,
            to=HUMAN_NUMBER,
            from_=os.environ["TWILIO_FROM"],
            twiml=human_twiml,
        )

Two things make this a warm transfer rather than a fancy cold one:

- <Start><Stream> forks a one-way copy of the caller's leg audio to /transcript-stream while the conference proceeds. Unlike <Connect><Stream> (which takes over the call), <Start> runs alongside the <Dial>, so we keep transcribing without interfering with the live audio. By default Twilio forks only the inbound track, meaning what the caller says. Its track attribute can request both_tracks, in which case each media message is tagged inbound or outbound and the two tracks have to be mixed or transcribed separately.
- The context card (args["summary"]) is stored under the conference name the instant the tool fires, before the human's phone even rings. By the time they pick up, their screen already shows "Maria Lopez, existing patient, crown fell out, flying tomorrow."

Step 4: Keep the transcript live with Universal-3 Pro Streaming

The forked audio lands on /transcript-stream. We open a standalone Universal-3 Pro Streaming session and relay the transcript to the human's browser. This is a different endpoint from the Voice Agent API: the standalone Streaming API at wss://streaming.assemblyai.com/v3/ws. Two parameters matter most here:

- speech_model is required; there is no default. Use u3-rt-pro for Universal-3 Pro Streaming.
- speaker_labels: true turns on Streaming Diarization, so each turn is tagged with a speaker (A, B), letting the human see who said what across the caller-and-colleague conversation.

One codec detail that trips people up: Twilio's forked audio is 8 kHz G.711 μ-law, but the standalone Streaming API examples stream 16-bit PCM. We decode μ-law to PCM16 with audioop.ulaw2lin before forwarding, and we tell the API the true sample rate, sample_rate=8000. (Don't upsample to 16 kHz hoping for better accuracy; you can't invent high-frequency detail that telephony already discarded. Just declare the real rate.)
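The μ-law decoding step above relies on audioop, which was deprecated in Python 3.11 and removed in 3.13. As a fallback, the standard G.711 μ-law expansion is short enough to write by hand. The sketch below is an illustrative stand-in (the function name ulaw_to_pcm16 is hypothetical), and on little-endian machines it should produce the same bytes as audioop.ulaw2lin(data, 2).

```python
import struct


def _decode_ulaw_byte(u: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit linear sample."""
    u = ~u & 0xFF                 # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


# Precompute all 256 values once; decoding then becomes a table lookup.
_ULAW_TABLE = [_decode_ulaw_byte(b) for b in range(256)]


def ulaw_to_pcm16(mulaw: bytes) -> bytes:
    """Convert mu-law bytes to little-endian 16-bit PCM bytes."""
    samples = [_ULAW_TABLE[b] for b in mulaw]
    return struct.pack(f"<{len(samples)}h", *samples)


# A 20 ms Twilio frame at 8 kHz is 160 mu-law bytes, so 320 PCM bytes.
frame = bytes([0xFF] * 160)       # 0xFF is mu-law silence
pcm = ulaw_to_pcm16(frame)
print(len(pcm))
```

In the transcript server, this function would replace the audioop.ulaw2lin(mulaw, 2) call; nothing else about the pipeline needs to change.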
    # transcript_server.py
    import asyncio
    import audioop
    import base64
    import json
    import os
    from urllib.parse import urlencode

    import websockets
    from fastapi import FastAPI, Query, WebSocket

    STREAMING_WS = "wss://streaming.assemblyai.com/v3/ws"
    ASSEMBLYAI_KEY = os.environ["ASSEMBLYAI_API_KEY"]

    # Serve this app on the host that PUBLIC_HOST points to, or mount these
    # routes on the bridge server's app instead.
    app = FastAPI()

    # room -> set of human dashboard websockets subscribed to this call
    SUBSCRIBERS = {}


    @app.websocket("/transcript-stream")
    async def transcript_stream(twilio_ws: WebSocket, room: str = Query(...)):
        await twilio_ws.accept()

        params = urlencode({
            "speech_model": "u3-rt-pro",   # Universal-3 Pro Streaming
            "sample_rate": 8000,           # match the telephony source; don't upsample
            "speaker_labels": "true",      # who said what across the handoff
            "max_speakers": 2,             # caller + human
            "format_turns": "true",
        })

        async with websockets.connect(
            f"{STREAMING_WS}?{params}",
            additional_headers={"Authorization": ASSEMBLYAI_KEY},  # raw key, no "Bearer"
        ) as aai_ws:

            async def forward_audio():
                # Twilio media frames arrive as base64 μ-law; forward them as PCM16.
                async for raw in twilio_ws.iter_text():
                    event = json.loads(raw)
                    if event.get("event") == "media":
                        mulaw = base64.b64decode(event["media"]["payload"])
                        pcm16 = audioop.ulaw2lin(mulaw, 2)  # μ-law -> 16-bit PCM
                        await aai_ws.send(pcm16)

            async def relay_transcript():
                # Fan each Turn message out to every dashboard watching this room.
                async for raw in aai_ws:
                    msg = json.loads(raw)
                    if msg.get("type") == "Turn":
                        line = {
                            "speaker": msg.get("speaker_label", "?"),
                            "text": msg.get("transcript", ""),
                            "final": msg.get("end_of_turn", False),
                        }
                        for ws in list(SUBSCRIBERS.get(room, [])):
                            await ws.send_text(json.dumps(line))

            await asyncio.gather(forward_audio(), relay_transcript())

Note the auth difference: the standalone Streaming API takes the raw API key in the Authorization header, with no Bearer prefix. That Bearer prefix is specific to the Voice Agent API. Mixing them up is the most common reason a copy-pasted snippet returns a 401.

The human's dashboard subscribes to /dashboard/{room}, renders the context card from CONTEXT_CARDS[room] the moment the page loads, and appends each Turn line as it arrives.
By the time they say "Hi Maria," they've read the summary and watched the last few exchanges scroll by.

Step 5: The dental-office receptionist

Here's the system prompt that ties it together. The agent handles the routine front-desk work (hours, scheduling, "are you taking new patients") and transfers the moment a request crosses into clinical, billing, or emergency territory.

    # prompts.py
    SYSTEM_PROMPT = """You are the virtual receptionist for Bright Smile Dental.
    You are warm, brief, and efficient. One or two sentences per turn.

    YOU CAN HANDLE:
    - Office hours (Mon-Thu 8-5, Fri 8-1), location, parking
    - Whether we're accepting new patients (we are)
    - Booking, rescheduling, and confirming routine cleanings and checkups
    - Telling callers what to bring (ID, insurance card)

    YOU MUST TRANSFER TO A HUMAN (call transfer_to_human) WHEN:
    - The caller asks to speak to a person
    - The request is clinical: pain, swelling, a lost crown/filling, bleeding,
      post-surgery questions, medication questions
    - It's a billing dispute or an insurance question you can't answer from
      the schedule
    - It's an emergency (use reason="emergency")
    - You've tried twice and can't resolve the request

    BEFORE YOU TRANSFER:
    - Tell the caller: "Let me get one of our team members for you, one moment."
    - Then call transfer_to_human with a one-line summary of who is calling
      and why. Write the summary so a colleague can read it in two seconds.

    NEVER diagnose, never quote a clinical opinion, never guess at insurance
    coverage. When in doubt, transfer.
    """

Run through it: a caller says "my crown just fell out and I'm flying tomorrow morning." The agent recognizes this is clinical, says "Let me get one of our team members for you, one moment," and calls transfer_to_human(reason="clinical", summary="Maria Lopez, existing patient, crown fell out, flying tomorrow AM, wants an urgent slot"). The caller hears a few seconds of hold music.
The front-desk staffer's screen lights up with the summary, then a live transcript. They pick up: "Hi Maria, I hear your crown came out and you're flying tomorrow. Let's get you in this afternoon." No repetition. That's the warm handoff.

When to use REFER instead

The conference path keeps the transcript alive, but you pay to hold the legs. Use SIP REFER (or a Twilio cold-transfer call redirect to <Dial><Number>) when:

- The human sits behind a PBX or contact-center platform that will own the call after transfer, and that system has its own screen-pop from your CRM.
- You don't need a live transcript after the handoff, because the summary delivered out-of-band is enough.
- Call volume and cost dominate, and staying in the media path for every transfer is too expensive.

The trade-off is absolute: once REFER completes, you're out of the media path, so the Universal-3 Pro Streaming session ends with it. Deliver context before you release. Push the summary to the human's CRM or PBX screen-pop, then issue the REFER. If you release first and try to send context after, you've built a cold transfer with extra steps.

Measuring success

Three numbers tell you whether your handoff is actually warm:

- Repeat rate: the share of transferred calls where the caller re-states something the agent already captured. The whole point is to drive this toward zero. Read transcripts to measure it.
- Time-to-context: how long after the human answers before they speak the caller's name or intent. With the context card pre-loaded, this should be near zero. A long pause means your screen-pop is arriving late.
- Transfer precision: of the calls the agent transferred, how many genuinely needed a human? Too many transfers means the prompt's transfer triggers are too broad; too few (callers having to ask twice) means they're too narrow.

The complete repository

Fork the runnable repo at github.com/kelsey-aai/the-warm-handoff-ai-voice-agent-human-transfer.
It includes the bridge server, the transfer orchestration, the transcript server with Universal-3 Pro Streaming, a minimal human dashboard that renders the context card and live transcript, and the dental-office prompt. Around 450 lines of Python total.

Frequently asked questions

How do I transfer an AI voice agent to a human without losing context?

Trigger the transfer with a tool call from the agent, then execute it at the telephony layer, because the Voice Agent API has no native transfer primitive. Give the agent a transfer_to_human tool whose arguments include a one-line summary, so the LLM packages the context as part of the conversation. For a warm handoff that preserves a live transcript, move the caller into a conference (Twilio <Conference>), fork the audio with <Start><Stream>, and run a standalone Universal-3 Pro Streaming session that relays a speaker-labeled transcript to the human's screen. The summary and live transcript reach the human before they speak, so the caller never repeats themselves.

Does the AssemblyAI Voice Agent API have a built-in "transfer to human" feature?

No. The Voice Agent API is a conversation engine: speech-to-text, LLM, text-to-speech, turn detection, and tool calling over a single WebSocket. It does not own the telephony connection, so it cannot route or transfer a call. The standard pattern is to define a transfer_to_human function tool, detect the tool call in your bridge, and perform the actual transfer with your telephony provider (Twilio SIP REFER or a conferenced bridge). The agent decides that a transfer should happen and packages the context; your code and Twilio do the routing.

What's the difference between a warm transfer and a cold transfer for AI agents?

A cold transfer drops the caller onto a human with no context, so the caller has to repeat everything. A warm transfer gives the human the caller's identity, intent, and conversation history before or as they take over.
For an AI voice agent, a warm handoff means delivering three things: a one-line summary the agent wrote, the structured data it captured, and (if the human joins a live call) a real-time transcript. The conferenced-bridge mechanism preserves a live transcript; SIP REFER does not, so with REFER you must deliver context out-of-band before releasing the call.

How do I keep a live transcript running after the AI hands off the call?

Keep your server in the media path. With a conferenced bridge, move the caller into a Twilio <Conference> and use <Start><Stream> to fork the leg audio to a transcription endpoint, then run a standalone Universal-3 Pro Streaming session (wss://streaming.assemblyai.com/v3/ws, speech_model=u3-rt-pro) with speaker_labels: true so the human sees who said what. Because telephony audio is 8 kHz G.711 μ-law, decode it to 16-bit PCM with audioop.ulaw2lin and set sample_rate=8000 to match the source. A SIP REFER transfer ends your media path, so the live transcript stops at the moment of transfer. Use the conference if a live transcript matters.

Why does my standalone Streaming API connection return a 401 when the Voice Agent API works fine?

The two APIs authenticate differently. The Voice Agent API expects Authorization: Bearer YOUR_KEY, with the Bearer prefix. The standalone Streaming API expects the raw API key in the Authorization header, with no prefix. Copy-pasting the Bearer version into a Streaming connection is the most common cause of a 401 when bridging the two, as you do in a warm-handoff transcript pipeline.
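One way to avoid that mix-up is to build both header styles in a single place. This is a small sketch based only on the two conventions described in this article; the function name auth_headers and the api labels are hypothetical.

```python
def auth_headers(api: str, api_key: str) -> dict:
    """Build the Authorization header for the given AssemblyAI endpoint.

    The two styles follow this article's description: "voice_agent" uses a
    Bearer prefix, "streaming" sends the raw key.
    """
    if api == "voice_agent":
        return {"Authorization": f"Bearer {api_key}"}
    if api == "streaming":
        return {"Authorization": api_key}
    raise ValueError(f"unknown api: {api!r}")


agent_headers = auth_headers("voice_agent", "YOUR_KEY")
stream_headers = auth_headers("streaming", "YOUR_KEY")
```

Both results can be passed straight to websockets.connect as additional_headers, which keeps the prefix decision out of the connection code.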
View original source — Hacker Noon ↗

