Why the next leap in AI video is teaching avatars to see and listen

TL;DR

AI video is shifting from a fidelity race to an interactivity race. A new class of interactive avatar models can be graded on three levels: Level 1 (talk), Level 2 (talk and listen), and Level 3 (talk, listen, and see). The jump from Level 1 to Level 2, where an avatar learns to listen and react in real time, is the breakthrough that turns a talking face into a convincing conversational counterpart.

For the past few years, progress in generative video and AI avatars has been measured almost entirely in fidelity, with each new model delivering sharper detail, better physics, and smoother motion in longer clips. That race is far from over, but it is starting to miss a more interesting direction. Video, as an online media format, is evolving from a static, broadcast-like experience to a more interactive one.

Software is increasingly mediated by agents rather than by buttons and menus, and for nearly any workflow you can name, someone is building an agent to handle it. In parallel, hybrid architectures that blend autoregressive and diffusion methods have become one of the liveliest areas of video research. And a growing set of teams is treating interactive video as a foundation for entirely new application classes, from open-world simulation to live dialogue. Put those together and the conclusion is fairly clear: interactivity, not resolution, is becoming the frontier.

As a result, a new category of video models is emerging whose job is to produce a talking agent that reacts to a human in real time, at latencies low enough to sustain natural conversation, usually under a second. Much as self-driving cars are classified by six levels of automation, these Interactive Avatar Models come in three levels of interactivity defined by their technical capabilities.

A Level 1 system can talk. It is driven entirely by its own audio and has no awareness of the person in front of it. Almost every talking avatar system available today achieves this level of performance. It is a one-way generation problem: given speech, produce a plausible talking face.

A Level 2 system can talk and listen. It takes in the user’s audio as well as its own, and it reacts while the other person is speaking. These reactions include the small visual signals that real listeners produce, such as a nod of agreement or a shift in expression, along with vocal cues like a brief “mhm” to show acknowledgement. This is a fundamentally harder problem than Level 1, because the model is no longer generating in isolation. It has to interpret an incoming signal and respond to it continuously, in time.

A Level 3 system can talk, listen, and see. On top of audio, it takes the user’s camera feed, so it can respond to posture, gesture, and facial expression the way people adjust to each other on a video call.
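
The three levels can be read as a function of which input streams a model consumes. The snippet below is a minimal sketch of that reading, not any vendor's API: the names InteractivityLevel, AvatarInputs, and classify are hypothetical, and treating a camera feed without user audio as Level 1 is an assumption drawn from the description of Level 3 as building "on top of audio".

```python
from dataclasses import dataclass
from enum import IntEnum


class InteractivityLevel(IntEnum):
    """The article's three levels (names are illustrative)."""
    TALK = 1             # driven only by the avatar's own audio
    TALK_LISTEN = 2      # also consumes the user's audio
    TALK_LISTEN_SEE = 3  # also consumes the user's camera feed


@dataclass(frozen=True)
class AvatarInputs:
    """Which input streams a hypothetical avatar model consumes."""
    own_audio: bool = True
    user_audio: bool = False
    user_video: bool = False


def classify(inputs: AvatarInputs) -> InteractivityLevel:
    """Map a set of input streams to an interactivity level."""
    if not inputs.own_audio:
        raise ValueError("every level assumes the avatar is driven by its own audio")
    # Assumption: seeing only counts when the model can also listen,
    # because Level 3 is described as adding video on top of audio.
    if inputs.user_audio and inputs.user_video:
        return InteractivityLevel.TALK_LISTEN_SEE
    if inputs.user_audio:
        return InteractivityLevel.TALK_LISTEN
    return InteractivityLevel.TALK


if __name__ == "__main__":
    print(classify(AvatarInputs(user_audio=True)).name)
```

Because the levels are ordered, a check such as classify(inputs) >= InteractivityLevel.TALK_LISTEN is enough to ask whether a system can listen at all.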

The reason to move beyond Level 1 models is that an avatar that talks without any awareness of the person it is talking to looks alive without being responsive. It moves while you are speaking, often in ways that have nothing to do with what you are saying, and the effect is surprising or unsettling. Set against audio-only conversational systems, which at least stay quiet and attentive while you talk, a non-listening avatar can sometimes feel worse than no avatar at all.

That is why the jump from Level 1 to Level 2 is the one that matters most. Making an avatar listen convincingly is what turns a talking face into something that feels like a counterpart. Achieving that is harder than it sounds, because listening is not purely visual. The vocal side (the timing of an interruption, the prosody of an acknowledgement, the half-second pause before a reaction) carries as much of the sense of engagement as the nodding does. The naive approach is to bolt a conversational voice system onto a video model in a stack. The more promising path is to model audio and motion jointly, learning how voice and movement shape each other in real time. The lesson from recent multimodal video models is that predicting both modalities together is often where realism crosses a threshold rather than inching forward.

Level 3 avatar models can use the video feed from a person’s camera to create a conversational experience that closely replicates a video call. For example, imagine you are talking to someone: if they stand up and leave, you naturally stop talking, because that’s a clear signal that the conversation is over. Level 3 interactive avatars therefore react not only to a person’s emotions or tone of voice, but also to what the user is doing. As a result, they can model human-to-human interactions far more fully.
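
The stand-up-and-leave example can be sketched as a toy decision rule. This is an illustration under stated assumptions, not how any real avatar model works: production systems would learn such behaviour rather than hard-code it, and the names UserSignals and next_action, along with the action labels, are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UserSignals:
    """What a hypothetical avatar can observe about the user."""
    speaking: bool
    # None means there is no camera feed, as at Level 1 or Level 2.
    visible: Optional[bool] = None


def next_action(level: int, avatar_speaking: bool, user: UserSignals) -> str:
    """Pick a coarse action from the signals the given level can perceive."""
    # Only a Level 3 system can notice that the user has left the frame.
    if level >= 3 and user.visible is False:
        return "stop_talking"
    # From Level 2 upward the avatar reacts while the user speaks,
    # for example with a nod or a brief "mhm".
    if level >= 2 and user.speaking:
        return "listen_and_backchannel"
    # A Level 1 avatar ignores the user entirely.
    return "keep_talking" if avatar_speaking else "idle"


if __name__ == "__main__":
    left = UserSignals(speaking=False, visible=False)
    for level in (1, 2, 3):
        print(level, next_action(level, avatar_speaking=True, user=left))
```

Given the same situation, with the user walking away mid-sentence, only the Level 3 branch changes the avatar's behaviour, which is the point of the example above.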

Building toward Level 3 is among the most ambitious problems in applied video research, and getting there will take sustained, compounding work across data, models, and systems engineering, something that Synthesia has an excellent track record in.

View original source — The Next Web ↗