v1.0.13 · l-0003 · May 11, 2026 · filed under lens

The avatar rig

How LENS builds the companion's face. MPFB, Azure visemes, and the 600ms latency budget that decides whether you believe it.

The companion avatar is the most psychologically important surface in the app. It sits in the room, you look at it, it looks back. The fidelity decides whether you believe the instrument is paying attention to you. Below the threshold of belief, the conversation never gets started; above it, the rest of the app inherits the trust.

The rig

The base mesh is built on MPFB (MakeHuman Plugin For Blender), an open-source Blender add-on for parametric human bodies. MPFB gives a single rig with FACS-mapped blendshapes: the action units that cinema and behavioural science have used for roughly fifty years to encode facial expression. The rig exports to USDZ for placement in RealityKit and to glTF for the web viewer.

We don't try to make the avatar look photoreal. Photoreal is the wrong target — it falls into the uncanny valley before it gets there. We aim for handcrafted: simplified geometry, rim lighting that holds steady, eyes that meet yours but don't try to imitate skin. The companion is allowed to look like a companion, not like a person.

What drives the face

The avatar's mouth is driven by Azure Speech's viseme stream. Azure breaks a TTS utterance into phonetic units and emits a stream of viseme events, each carrying one of 22 viseme IDs and the audio offset at which it applies. We map those 22 visemes onto MPFB's blendshape weights via a hand-tuned matrix that compensates for MPFB's bias toward neutral expression. The result is lip-sync that lands within ~40ms of the audio playback.
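A minimal sketch of that mapping, in Python for readability. The blendshape names, the matrix values, and the `gain` parameter are all assumptions made for illustration; the real matrix is hand-tuned and its contents are not reproduced here. The only fixed point is that Azure's viseme IDs run from 0 to 21, with 0 meaning silence.

```python
from typing import Dict, List

NUM_VISEMES = 22  # Azure viseme IDs run 0..21

# Hypothetical blendshape names; the rig's actual names are not given in the text.
BLENDSHAPES: List[str] = ["jaw_open", "lips_pucker", "lips_wide", "lips_closed"]


def build_matrix() -> List[List[float]]:
    """Return a NUM_VISEMES x len(BLENDSHAPES) weight matrix.

    Every value here is illustrative, not the tuned production matrix.
    """
    matrix = [[0.0] * len(BLENDSHAPES) for _ in range(NUM_VISEMES)]
    matrix[0] = [0.0, 0.0, 0.0, 0.0]   # viseme 0: silence, face at rest
    matrix[1] = [0.6, 0.0, 0.3, 0.0]   # illustrative open-vowel shape
    matrix[7] = [0.2, 0.9, 0.0, 0.0]   # illustrative rounded-lip shape
    matrix[21] = [0.0, 0.0, 0.0, 1.0]  # illustrative closed-lip shape
    return matrix


def blendshape_weights(
    viseme_id: int, matrix: List[List[float]], gain: float = 1.0
) -> Dict[str, float]:
    """Look up one viseme's row and scale it.

    `gain` stands in for the compensation against a neutral-leaning rig
    (an assumption about how that compensation might be applied).
    Results are clamped to the 0..1 range blendshapes expect.
    """
    if not 0 <= viseme_id < NUM_VISEMES:
        raise ValueError(f"viseme id out of range: {viseme_id}")
    row = matrix[viseme_id]
    return {
        name: min(1.0, max(0.0, weight * gain))
        for name, weight in zip(BLENDSHAPES, row)
    }
```

Each incoming viseme event selects one row; the audio offset on the event decides when those weights are applied to the rig.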

The voice itself is ElevenLabs. The conversation — the words being said — comes from Claude Sonnet 4.5, dispatched through voir-proxy with a system prompt that holds the companion's register. The audio synthesis can run while Claude is still composing the next sentence, so the gap between the user finishing speaking and the avatar replying is dominated by network round-trips, not by TTS latency.
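The overlap between composing and synthesis depends on cutting the model's token stream into sentences as they complete, so each one can be handed to TTS while the next is still arriving. A rough sketch of that chunking step follows; the function name and the punctuation rule are assumptions, and no real Claude or ElevenLabs call is shown.

```python
from typing import Iterable, Iterator

# Naive sentence terminators; a production splitter would also need to
# handle abbreviations, decimals, and quoted speech.
SENTENCE_END = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as its closing punctuation arrives."""
    buffer = ""
    for token in tokens:
        buffer += token
        stripped = buffer.strip()
        if stripped.endswith(SENTENCE_END):
            yield stripped
            buffer = ""
    # Flush whatever is left when the stream ends mid-sentence.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    stream = ["Hel", "lo", " there", ".", " How", " are", " you", "?"]
    for sentence in sentences_from_tokens(stream):
        # In the pipeline described above, this is where a sentence
        # would be queued for synthesis.
        print(sentence)
```

Because the generator yields on the first terminator it sees, the first sentence can go to synthesis long before the full reply exists.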

The latency budget

A conversation feels alive when the response lands within a listener's tolerance. Telephony research puts that at ~600ms. The instrument's budget breaks down roughly like:

STT (Apple Speech, on-device): ~120ms
Claude first-token: ~280ms (the long pole)
TTS first-chunk (ElevenLabs streaming): ~80ms
Avatar viseme rig kickoff: ~20ms
Audio playback latency (iOS AVAudioEngine): ~30ms

Total floor: roughly 530ms in the best case. We've measured median end-to-end at ~720ms on Wi-Fi with an iPhone 15. The Claude leg is by far the most variable. We can shrink the budget further by streaming Claude's response token-by-token directly into ElevenLabs' websocket interface, which removes the wait-for-complete-sentence step — that's on the roadmap.
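The arithmetic behind those numbers, as a small sketch. The figures are the ones listed above; the dictionary keys and helper names are made up for the example.

```python
from typing import Dict

# Per-stage best-case latencies in milliseconds, taken from the budget above.
BUDGET_MS: Dict[str, int] = {
    "stt": 120,
    "llm_first_token": 280,
    "tts_first_chunk": 80,
    "viseme_kickoff": 20,
    "audio_playback": 30,
}

TOLERANCE_MS = 600        # the listener-tolerance figure cited above
MEASURED_MEDIAN_MS = 720  # measured median on Wi-Fi, iPhone 15


def total_floor(budget: Dict[str, int]) -> int:
    """Sum of all stages: the best-case end-to-end latency."""
    return sum(budget.values())


def long_pole(budget: Dict[str, int]) -> str:
    """Name of the single slowest stage."""
    return max(budget, key=lambda stage: budget[stage])


def headroom(latency_ms: int, tolerance_ms: int = TOLERANCE_MS) -> int:
    """Milliseconds to spare (positive) or overshoot (negative)."""
    return tolerance_ms - latency_ms


if __name__ == "__main__":
    floor = total_floor(BUDGET_MS)
    print(f"floor: {floor}ms, long pole: {long_pole(BUDGET_MS)}")
    print(f"headroom at floor: {headroom(floor)}ms")
    print(f"headroom at measured median: {headroom(MEASURED_MEDIAN_MS)}ms")
```

The floor leaves 70ms of slack under the 600ms tolerance, while the measured median overshoots it by 120ms, which is why the Claude leg is the one worth attacking.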

Where the rig lives

Cross-mode persistence is handled by VoirCompanionStore — a single source of truth for the companion's current pose, mood, conversation history, and place-in-space anchor. When you open SpaciAR, the companion is already there. When you swap to TrainAR, it keeps watching. This is a small but expensive guarantee: the rig has to survive view-controller swaps, kit context transitions, and ARKit session restarts without resetting.
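The shape of that guarantee can be sketched as a shared store that outlives any individual mode. This is not VoirCompanionStore itself: the class names, fields, and `ModeView` wrapper below are hypothetical, and the real store lives on-device rather than in Python.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class CompanionState:
    """The four things the store is said to own; field types are guesses."""
    pose: str = "idle"
    mood: str = "neutral"
    history: List[str] = field(default_factory=list)
    anchor: Optional[Tuple[float, float, float]] = None  # place-in-space


class CompanionStore:
    """One shared instance, created lazily and never rebuilt by a mode."""
    _shared: Optional["CompanionStore"] = None

    def __init__(self) -> None:
        self.state = CompanionState()

    @classmethod
    def shared(cls) -> "CompanionStore":
        if cls._shared is None:
            cls._shared = cls()
        return cls._shared


class ModeView:
    """Stand-in for a mode such as SpaciAR or TrainAR.

    A mode borrows the store; it never owns or resets the state, so
    tearing a mode down leaves the companion untouched.
    """

    def __init__(self, name: str) -> None:
        self.name = name
        self.store = CompanionStore.shared()

    def say(self, line: str) -> None:
        self.store.state.history.append(f"{self.name}: {line}")
```

Opening one mode, changing the mood, and then opening a second mode shows both reading the same state, which is the property the rig depends on across view swaps and session restarts.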

What the web can show

Everything described above happens on the device. The Lens page shows a static placeholder for the avatar plus a written description of the pipeline. The avatar mouthing along to its own page-headline is the only animated trace on the web — a tiny demonstration of fidelity without trying to compete with the on-device experience.