VOIR · RESEARCH REGISTRY · OPEN CORPUS

A reading of the field.

Eighty-eight short papers on the techniques and instruments shaping augmented reality, immersive computing, and on-device perception. The lab observes; the lab summarizes; the lab declines to disclose how the methods are combined behind the wall.

VOLUME: 88 entries
WINDOW: 2025 – 2026
DOMAINS: ten
FORMAT: survey · digest · field note

RP-050 · MMXXVI · v · 14

λ·V Multimodal & Vision-Language

When the Model Should Not Speak: Refusal in Spatial AI

VLMs generate fluent text whether or not the underlying perception supports it. The paper examines refusal training and uncertainty-aware decoding for spatial tasks. The lab argues that interactive AR favors short refusals over long confabulations, and proposes a per-token entropy gate.

refusal training
uncertainty quantification
entropy gating
abstention
calibration
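
A per-token entropy gate of the kind described above might look like the following. This is a minimal sketch, not the lab's method: the threshold `max_entropy`, the refusal string, and the function names are all illustrative assumptions.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def gate_tokens(tokens, distributions, max_entropy=1.0, refusal="[unsure]"):
    """Emit tokens until one distribution exceeds the entropy threshold.

    max_entropy and refusal are hypothetical values chosen for illustration.
    """
    out = []
    for tok, probs in zip(tokens, distributions):
        if token_entropy(probs) > max_entropy:
            out.append(refusal)  # stop with a short refusal instead of continuing
            break
        out.append(tok)
    return out
```

The gate cuts generation at the first uncertain token, which matches the preference for a short refusal over a long confabulation.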

RP-010 · MMXXVI · v · 12

λ·I Localization & SLAM

The Cold-Start Problem in Persistent AR

First-launch re-localization is the most user-hostile moment in persistent AR: nothing is known about the room until the device sweeps it. The paper digests four cold-start strategies: dense priming sweeps, gravity-aligned mini-maps, server-side coarse priors, and IMU-only blind tracking until the first feature lock. None is satisfying alone; the lab argues for a hybrid with explicit user signaling on the first session.

cold start
gravity-aligned map
blind tracking
feature priming
VIO

RP-040 · MMXXVI · v · 09

λ·IV Embodied RL & World Models

When Not to Use RL: A Decision Note

Reinforcement learning is tempting for spatial tasks but is rarely the cheapest answer. The paper catalogues five tasks where the lab considered RL and chose classical control or imitation instead, with rationale. A decision tree is offered: if the reward is dense, the demonstrations exist, and the simulator is faithful, RL is reasonable; otherwise, it is not.

classical control
MPC
imitation learning
decision tree
RL

RP-076 · MMXXVI · v · 06

λ·VIII Interaction, Hand & Display

Eye Tracking on Mobile: The Unsolved Case

Eye tracking on phone front cameras is feasible at coarse resolution; gaze prediction within a held device is dominated by head pose, not eyeball pose. The paper reviews mobile-gaze methods (GazeCapture, OpenGaze) and their characteristic failure modes.

GazeCapture
OpenGaze
EyeNet
appearance-based gaze
calibration-free gaze

RP-030 · MMXXVI · v · 02

λ·III On-device Vision

Distillation Pipelines: Teacher → Student → Edge

Knowledge distillation has become a default tool for shipping research-grade models to mobile. The paper reviews logit distillation, feature distillation, attention transfer, and self-distillation. A finding: distillation works best when the student architecture is similar to the teacher; cross-family distillation is brittle.

logit distillation
feature distillation
attention transfer
self-distillation
DistilBERT-style
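
Logit distillation, the first technique listed above, can be sketched as a temperature-softened KL divergence between teacher and student outputs. The temperature value and function names are assumptions for illustration; the entry does not specify a loss.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def logit_distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge.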

RP-084 · MMXXVI · iv · 30

λ·IX Rendering & Optics

Volumetric Rendering at Phone Frame Rates

Volumetric AR effects (smoke, light shafts, transparency) remain expensive on mobile. The paper indexes density-grid rendering, sparse voxel approaches (NSVF), and signed-distance volumes. Findings: small volumes are cheap; large volumes are not yet tractable.

density grid
NSVF
SDF rendering
ray marching
volume rendering

RP-068 · MMXXVI · iv · 25

λ·VII Sensors, VIO & Calibration

Polarization as a Cue: A Mostly-Forgotten Channel

Polarization-aware imaging is mature in microscopy, fragile in mobile. The paper reviews polarimetric phones (Sony IMX-class with pixel polarizers), specular separation (PolDepth), and shape-from-polarization. Filed as a watchlist item: not deployed, of interest.

polarimetric imaging
PolDepth
shape-from-polarization
Sony IMX
specular separation

RP-020 · MMXXVI · iv · 22

λ·II Reconstruction & Radiance Fields

On-Device Splatting: A 2026 Status Note

Real-time on-device 3DGS rendering crossed the 30 fps threshold on flagship-class mobile NPUs in late 2025. The paper sketches the bottlenecks that remain — splat sort, alpha compositing, memory bandwidth on tiled GPUs — and the techniques that have been tried (hierarchical splat LODs, view-frustum culling, adaptive radius). It does not describe shipped lab pipelines.

3DGS
hierarchical LOD
view-frustum culling
tiled GPU
alpha compositing
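
Two of the bottlenecks named above, splat sort and alpha compositing, can be illustrated for a single pixel. This is a toy sketch under simplifying assumptions (each splat is reduced to a depth, a colour, and an opacity; the early-exit threshold is hypothetical), not a description of any shipped pipeline.

```python
def composite_front_to_back(splats):
    """Composite one pixel from splats given as (depth, (r, g, b), alpha).

    Splats are sorted near-to-far; each contributes its colour weighted by
    its alpha and by the transmittance left over from the splats in front.
    """
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for _, rgb, alpha in sorted(splats, key=lambda s: s[0]):
        weight = transmittance * alpha
        color = [c + weight * ch for c, ch in zip(color, rgb)]
        transmittance *= 1.0 - alpha
        if transmittance < 1e-4:  # pixel is effectively opaque, stop early
            break
    return tuple(color), transmittance
```

The per-pixel sort is what makes the step order-dependent, which is why sorting shows up alongside compositing in the list of bottlenecks.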

RP-058 · MMXXVI · iv · 18

λ·VI Privacy & On-device ML

On-Device LLMs and the Memory-Cost Frontier

Compact LLMs in roughly the 1–4B parameter range (Phi-3.5-mini, Gemma-2B, Llama-3.2-1B, Qwen-2.5-1.5B) now fit within 1–3 GB of RAM on flagship phones. The paper reports tokens-per-second under thermal load, INT4 quantization, and KV-cache compression. The privacy benefit is substantial; the latency cost is not yet zero.

Phi-3.5
Gemma-2B
Llama-3.2
Qwen-2.5
KV-cache

RP-009 · MMXXVI · iv · 05

λ·I Localization & SLAM

Drift Compensation by Magnetic Anomaly: A Field Note

Indoor magnetic fields are highly non-uniform and surprisingly stable — a fingerprint that is reproducible to ~30 cm in steel-framed buildings. The paper reviews the magnetic-fingerprint literature (IndoorAtlas-style approaches, learned magnetic descriptors) and reports a fusion experiment combining VIO with a continuous magnetic prior. The fusion adds 2 ms of compute per frame and reduces long-session drift in featureless corridors by ~38%.

magnetic fingerprinting
VIO
Kalman filter
IndoorAtlas-style
particle filter
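
The entry does not say how the magnetic prior is fused with VIO. One simple possibility, shown here purely as an assumption, is a scalar Kalman measurement update along a corridor axis, with the magnetic fingerprint treated as a position measurement whose standard deviation is about 0.3 m.

```python
def kalman_update(x_pred, var_pred, z, var_meas):
    """Fuse a predicted position (from VIO) with a measurement (magnetic fix).

    All quantities are scalars along one axis; variances are in m^2.
    """
    gain = var_pred / (var_pred + var_meas)
    x = x_pred + gain * (z - x_pred)
    var = (1.0 - gain) * var_pred
    return x, var

# Hypothetical numbers: VIO has drifted to a 1 m^2 variance, and the
# magnetic fingerprint is assumed good to ~0.3 m (variance 0.09 m^2).
position, variance = kalman_update(x_pred=10.0, var_pred=1.0, z=10.8, var_meas=0.09)
```

The more the VIO estimate has drifted, the larger the gain, so the magnetic prior matters most in long featureless stretches.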

RP-049 · MMXXVI · iv · 01

λ·V Multimodal & Vision-Language

Cross-Modal Distillation: VLM → Detector → Tracker

VLMs are increasingly used as teachers for downstream task-specific models. The paper reviews CLIP→detector distillation (RegionCLIP, GLIP), VLM→segmentation (LISA), and VLM→tracker pipelines. A note: cross-modal distillation is more sensitive to the prompt distribution than to the teacher's underlying quality.

RegionCLIP
GLIP
LISA
PaliGemma
distillation

RP-090 · MMXXVI · iii · 30

λ·X Compute & Inference Systems

On-Device LLM Inference: KV-Cache, Speculative Decoding, and the Latency Wall

On-device LLM inference at >10 tok/s on flagship phones is now routine for sub-2B models; latency at first-token remains the dominant user-perceived delay. The paper reviews KV-cache compression, speculative decoding, and prompt prefix caching. Findings: prefix caching is the highest-leverage trick for interactive AR.

KV-cache
speculative decoding
prompt cache
FlashAttention
MQA

RP-039 · MMXXVI · iii · 26

λ·IV Embodied RL & World Models

Eval Beyond Success Rate: Trajectory-Quality Metrics for Spatial Agents

Success rate is a thin metric for spatial agents. The paper proposes auxiliary metrics — path efficiency, exploration coverage, obstacle margin, time-to-first-error — and reports them on three open benchmarks. A digest of how published work would have changed under richer evaluation is included.

SPL
exploration coverage
DTW
path efficiency
trajectory metrics
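
SPL (success weighted by path length), listed in the tags, is one standard way to fold path efficiency into success rate. A minimal sketch follows; the episode tuple layout is an assumption made for the example.

```python
def spl(episodes):
    """Success weighted by Path Length.

    episodes: list of (success, shortest_path_len, taken_path_len).
    A failed episode scores 0; a successful one scores shortest / max(taken, shortest).
    """
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)
    return total / len(episodes)
```

An agent that always succeeds but wanders twice the optimal distance scores 0.5 here, while plain success rate would report 1.0.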

RP-019 · MMXXVI · iii · 20

λ·II Reconstruction & Radiance Fields

From Splats to Geometry: The Mesh-Extraction Problem

3D Gaussian Splatting renders beautifully but produces clouds, not surfaces. The paper reviews mesh-extraction strategies (SuGaR, 2DGS, GS2Mesh, Poisson reconstruction over splat centroids). Each pays a different fidelity tax. The lab notes the open problem of producing watertight, editable, semantically labeled meshes from a splat cloud at interactive rates.

SuGaR
2DGS
GS2Mesh
Poisson reconstruction
marching cubes

RP-075 · MMXXVI · iii · 12

λ·VIII Interaction, Hand & Display

Voice as Modal Glue in Spatial UI

Voice is the connective tissue of spatial UI when both hands are committed. The paper reviews on-device wake-word detection, intent recognition, and the design tension between push-to-talk and always-listening modes. Findings: ambient voice mode is rarely justified.

wake word
intent recognition
push-to-talk
VAD
ASR

RP-029 · MMXXVI · iii · 08

λ·III On-device Vision

Hand Pose: 21 Keypoints and the Occlusion Cliff

Mobile hand-pose models (MediaPipe Hands, RTMPose-Hand, HandFormer) achieve sub-pixel accuracy on isolated hands. Two-hand interaction collapses accuracy by 30–50% under self-occlusion. The paper reviews occlusion-aware training data (InterHand2.6M, AssemblyHands) and inference-time decoupling.

MediaPipe Hands
RTMPose-Hand
HandFormer
InterHand2.6M
AssemblyHands

RP-008 · MMXXVI · ii · 27

λ·I Localization & SLAM

Cooperative Anchors: When Two Devices See the Same Room

When two devices hold anchors in the same physical room, their anchor poses can be aligned by visual co-observation, ICP on partial reconstructions, or by exchanging encrypted descriptor sketches. The paper enumerates threat models for each. The lab finds that observation-only alignment (no descriptor exchange) is feasible to ~3 cm registration on shared planar surfaces. Privacy considerations are discussed in companion paper RP-061.

ICP
visual co-observation
descriptor sketch
Procrustes alignment
shared anchors

RP-067 · MMXXVI · ii · 22

λ·VII Sensors, VIO & Calibration

Calibrating Multi-Camera Phones: Stereo, Wide, Tele

Modern phones host 2–4 cameras with different intrinsics. The paper reviews multi-camera factory calibration, ad-hoc stereo from wide/tele pairs, and learned cross-camera depth (DUSt3R, MASt3R). Findings: learned methods are now competitive with classical stereo on close-range subjects.

DUSt3R
MASt3R
stereo
Zhang's method
checkerboard cal

RP-048 · MMXXVI · ii · 15

λ·V Multimodal & Vision-Language

Multimodal Agents and the Gesture Channel

Beyond text and image, multimodal agents are beginning to absorb gesture as a query channel. The paper reviews tap-on-image, point-on-image, and free-hand input as VLM prompts. Pointing tokens (Molmo-style) are evaluated as a substrate for spatial gesture.

pointing tokens
tap-on-image
Molmo
GUI agents
spatial prompts

RP-083 · MMXXVI · ii · 09

λ·IX Rendering & Optics

Foveated Rendering on Mobile: A 2026 Snapshot

Fixed-foveated rendering is shipping on mobile-class HMDs; eye-tracked foveation requires gaze and is rarer. The paper reports compute savings on representative scenes and notes the strong dependency on the foveation kernel design. Aggressive foveation creates visible boundaries during head motion.

foveated rendering
eye-tracked foveation
VRS
QCom Adreno
latency

RP-038 · MMXXVI · ii · 04

λ·IV Embodied RL & World Models

Latent Imagination at Phone-Class Compute

Running a world-model rollout at interactive rates on a phone NPU is feasible only with aggressive latent compression. The paper sketches the trade-offs between rollout horizon, latent dimensionality, and rollout count under a 16 ms budget. The lab does not deploy world-model rollouts in production; the survey is for completeness.

DreamerV3
latent dynamics
rollout horizon
NPU inference
MPC
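
The trade-off between rollout horizon and rollout count under the 16 ms budget reduces to simple arithmetic if rollouts are assumed to run sequentially at a fixed per-step cost. Both assumptions, and the per-step cost used below, are hypothetical; batched NPU execution would change the picture.

```python
def max_rollouts(budget_ms, horizon, step_ms):
    """How many sequential rollouts of `horizon` steps fit in the budget."""
    return int(budget_ms // (horizon * step_ms))

# With an assumed 0.5 ms per latent step, doubling the horizon halves
# the number of rollouts that fit inside a 16 ms frame.
short_horizon = max_rollouts(16.0, horizon=8, step_ms=0.5)
long_horizon = max_rollouts(16.0, horizon=16, step_ms=0.5)
```

Shrinking the latent dimensionality lowers `step_ms`, which is the lever that buys back either horizon or rollout count.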

RP-028 · MMXXVI · i · 30

λ·III On-device Vision

Pose Estimation: Top-Down vs. Bottom-Up at Phone Scale

Human pose estimation on mobile splits into top-down (HRNet-class) and bottom-up (OpenPose-class) families. The paper benchmarks on a 30-subject home-fitness dataset and finds bottom-up wins under occlusion; top-down wins on isolated subjects. Latency budgets force a hybrid in practice.

HRNet
OpenPose
MoveNet
MediaPipe Pose
RTMPose

RP-089 · MMXXVI · i · 26

λ·X Compute & Inference Systems

Compiler Stacks for On-Device ML in 2026

Mobile ML compiler stacks have consolidated: Core ML (Apple), QNN (Qualcomm), TensorFlow Lite + XNNPACK, ONNX Runtime, and PyTorch ExecuTorch. The paper benchmarks the same model across each and reports the effort cost of supporting all paths. Findings: ONNX Runtime is the most predictable; vendor-specific paths are the fastest.

Core ML
QNN
TFLite
ONNX Runtime
ExecuTorch

RP-007 · MMXXVI · i · 19

λ·I Localization & SLAM

Multi-Floor Traversals and the Vertical-Axis Problem

Modern SLAM frontends assume locally planar motion, which collapses on stairwells. The paper reviews three corrections — explicit floor estimation, IMU-only altitude during stair regions, and barometric anchoring — across a 14-building dataset. A simple fusion of barometric altitude with visual scale recovery yields the best vertical accuracy at the cost of barometer warm-up.

floor segmentation
IMU integration
barometric pressure
visual scale
VINS
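
Barometric anchoring depends on converting pressure to altitude. A minimal sketch using the standard international barometric formula is shown below; the choice of reference pressure and the function names are assumptions, and the paper's actual fusion with visual scale is not reproduced here.

```python
def pressure_to_altitude_m(pressure_hpa, reference_hpa=1013.25):
    """Altitude above the reference pressure level, standard-atmosphere model."""
    return 44330.0 * (1.0 - (pressure_hpa / reference_hpa) ** (1.0 / 5.255))

def height_change_m(pressure_start_hpa, pressure_end_hpa):
    """Relative height gained between two readings (positive means upward)."""
    return pressure_to_altitude_m(pressure_end_hpa) - pressure_to_altitude_m(pressure_start_hpa)
```

Near sea level a drop of 1 hPa corresponds to roughly 8 m of climb, so a single floor is only a fraction of a hectopascal, which is why sensor warm-up and drift matter.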

RP-057 · MMXXVI · i · 11

λ·VI Privacy & On-device ML

Differentially Private SfM: Can a Map Be Made Without a Photograph?

Recent work on privacy-preserving structure-from-motion (DP-SfM, line-cloud SfM, hashed-feature SfM) attempts to map a space without retaining recoverable imagery. The paper reviews where these methods succeed (sparse texture, repetitive environments) and where they collapse (fine detail, dynamic objects).

DP-SfM
line-cloud SfM
hashed feature
encrypted SfM
MPC SfM

RP-018 · MMXXVI · i · 08

λ·II Reconstruction & Radiance Fields

Implicit vs. Explicit: Where the Field Has Settled

By early 2026 the architectural debate between implicit (NeRF-style) and explicit (point-, splat-, voxel-based) representations has cooled. Explicit representations win on speed, editability, and storage; implicit representations win on continuous derivatives and surface extraction. The paper argues neither is a winner; the production answer is a translation step between the two on demand.

NeRF
3DGS
voxel grid
marching cubes
neural-explicit hybrid

RP-074 · MMXXV · xii · 30

λ·VIII Interaction, Hand & Display

Comfort and Cybersickness: A Practitioner Index

AR comfort literature is large and contradictory. The paper synthesizes a short index: vection cues, vergence-accommodation conflict, FoV transitions, and display-latency variance. Each is mapped to a measurable display or render parameter.

cybersickness
vection
vergence-accommodation
FoV transition
SSQ

RP-047 · MMXXV · xii · 21

λ·V Multimodal & Vision-Language

Captioning the Room: Where Models Lie Confidently

VLM hallucinations are predictable on indoor scenes: confident misnaming of generic furniture, fabricated text on labels, and hallucinated counts. The paper proposes a hallucination-detection protocol using consistency over multiple crops and entropy thresholds.

hallucination detection
VQA consistency
entropy thresholding
POPE
CHAIR
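
The multi-crop consistency half of the protocol can be sketched as a majority vote over the answers a VLM gives for several crops of the same object. The agreement threshold and function names are hypothetical, and the entropy-threshold half is omitted.

```python
from collections import Counter

def crop_agreement(answers):
    """Return the most common answer and the fraction of crops that gave it."""
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)

def flag_hallucination(answers, min_agreement=0.6):
    """Flag the caption when crops disagree too much (threshold is assumed)."""
    _, agreement = crop_agreement(answers)
    return agreement < min_agreement
```

A name that survives across crops is more likely grounded in the image; one that changes with every crop is a candidate hallucination.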

RP-037 · MMXXV · xii · 18

λ·IV Embodied RL & World Models

Hierarchical Policies and the Long-Horizon Problem

Long-horizon embodied tasks (clean the room, prepare the workspace) defeat flat policies. The paper reviews hierarchical RL (FuN, options framework, HiTUT) and language-as-policy approaches (SayCan, Code-as-Policies, VoxPoser). The temptation to use an LLM as the high-level controller is acknowledged; failure modes are catalogued.

hierarchical RL
FuN
options
SayCan
VoxPoser

RP-088 · MMXXV · xii · 15

λ·X Compute & Inference Systems

Thermal Throttling and Sustained Workloads

Smartphone SoCs sustain peak compute for 30–90 seconds before thermal throttling forces a step-down. The paper measures the throttle curve under a representative AR workload (camera + detection + render at 30 fps) and reports the steady-state compute envelope. Steady-state is roughly 60% of peak.

thermal throttle
SoC sustained
DVFS
battery drain
AR workload

RP-017 · MMXXV · xii · 11

λ·II Reconstruction & Radiance Fields

Dynamic Scene Reconstruction: People in the Frame

Most reconstruction pipelines assume the scene is static. People in the frame appear as ghost geometry. The paper inventories solutions: explicit human masks (Mask-RCNN, SAM-class), motion segmentation, and neural decomposition (D-NeRF, K-Planes-Dynamic). It notes a finding: a two-second masking pre-pass costs less compute than letting the optimizer fight ghost residuals.

Mask-RCNN
SAM
D-NeRF
K-Planes-Dynamic
motion segmentation

RP-066 · MMXXV · xii · 09

λ·VII Sensors, VIO & Calibration

Thermal Drift in Phone IMUs: A Long Session Note

Phone IMUs drift with chassis temperature. The paper reports thermal-drift profiles from a 60-minute heating session and proposes online thermal-bias estimation. Worth filing: drift is non-monotonic across SoC throttle events.

thermal drift
IMU bias
online estimation
SoC throttle
EKF

RP-027 · MMXXV · xii · 04

λ·III On-device Vision

Open-Vocabulary Detection: Owl, GroundingDINO, YOLO-World

Open-vocabulary detection accepts a free-text query and returns boxes. Owl-ViT, GroundingDINO, YOLO-World, and DETR-class variants converge on similar capabilities at very different cost points. The paper indexes per-class accuracy on furniture, signage, and tools. The query-language design is found to dominate downstream behavior.

Owl-ViT
GroundingDINO
YOLO-World
DETR
open-vocab

RP-082 · MMXXV · xi · 26

λ·IX Rendering & Optics

Display Pipelines: ATW, Reprojection, and the Late Latch

Asynchronous timewarp (ATW), late-latch poses, and asynchronous reprojection are the three rendering tricks that hide latency on AR displays. The paper reviews each and reports the residual visible artifacts at 90 Hz vs. 120 Hz refresh. Higher refresh narrows the artifact window non-linearly.

ATW
late-latch
asynchronous reprojection
vsync
double buffering

RP-056 · MMXXV · xi · 19

λ·VI Privacy & On-device ML

Anchor Privacy: What a Persistent Anchor Reveals About a Room

A persistent anchor is, in principle, a sparse pose; in practice the descriptor cloud surrounding it can be inverted to recover textured geometry. The paper reviews descriptor inversion attacks (TURF, MapAttack, Pittinverse) and counter-measures (descriptor truncation, randomized features, encrypted anchors).

descriptor inversion
TURF
MapAttack
encrypted anchor
feature noise

RP-006 · MMXXV · xi · 04

λ·I Localization & SLAM

Re-localization in the Dark: Low-Light SLAM Failure Modes

Below ~10 lux, descriptor extractors based on FAST/BRIEF degrade rapidly; learned descriptors (SuperPoint, R2D2, DISK) hold longer but consume an order of magnitude more inference compute. The lab notes a counterintuitive observation: increasing exposure helps the front-end but hurts the back-end (motion blur dominates re-projection error). A short discussion of event-camera complements is included.

FAST
BRIEF
SuperPoint
R2D2
DISK

RP-046 · MMXXV · x · 26

λ·V Multimodal & Vision-Language

Long-Context Video Understanding: A 2025 Snapshot

Video-VLMs (Video-LLaVA, VideoChat2, LongVA, LLaVA-OneVision) extend single-frame reasoning across temporal context. The paper benchmarks them on indoor activity-recognition and event-localization tasks. Performance drops sharply past 30-second windows; positional-encoding tricks help only modestly.

Video-LLaVA
VideoChat2
LongVA
LLaVA-OneVision
RoPE

RP-073 · MMXXV · x · 21

λ·VIII Interaction, Hand & Display

Latency Budgets for Glasses-Class AR

Motion-to-photon latency below ~20 ms is the threshold for comfortable AR. The paper decomposes the budget across capture, detection, pose update, render, and display. The current bottleneck on mobile-class compute is the detection-and-pose stage; the display is rarely the limiting factor.

motion-to-photon
frame budget
render-on-warp
ATW
latency
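
The budget decomposition can be expressed as a small check that sums per-stage latencies against the ~20 ms threshold and names the largest stage. The stage timings below are invented for illustration and are not measurements from the paper.

```python
def budget_report(stages_ms, limit_ms=20.0):
    """Return (total, slack, bottleneck) for a dict of stage -> milliseconds."""
    total = sum(stages_ms.values())
    bottleneck = max(stages_ms, key=stages_ms.get)
    return total, limit_ms - total, bottleneck

# Hypothetical per-stage timings for one frame.
example = {"capture": 4.0, "detection": 7.0, "pose": 3.0, "render": 4.0, "display": 1.0}
total_ms, slack_ms, slowest = budget_report(example)
```

Negative slack means the frame misses the motion-to-photon target, and the bottleneck name indicates where to cut first.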

RP-026 · MMXXV · x · 15

λ·III On-device Vision

Quantization-Aware Training in 2025: INT8, INT4, and the Calibration Tax

INT8 post-training quantization (PTQ) is a solved problem for most CNN-class detectors; INT4 is not. The paper reviews quantization strategies, both quantization-aware (LSQ) and post-training (OmniQuant, GPTQ-class), and reports calibration-set sensitivity. A finding worth filing: detector heads and backbone sometimes prefer different bit widths.

LSQ
OmniQuant
GPTQ
AWQ
PTQ
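
For orientation, the simplest form of INT8 quantization is symmetric per-tensor scaling. The sketch below is a generic illustration, assumed for this note rather than taken from any of the methods reviewed; the calibration here is just the maximum absolute value of the tensor.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to the range [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0.0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to approximate real values."""
    return [qi * scale for qi in q]
```

The round-trip error is bounded by half a quantization step, so a single outlier that inflates `scale` degrades every other value; that is the calibration sensitivity in miniature.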

RP-036 · MMXXV · x · 09

λ·IV Embodied RL & World Models

Imitation Learning for Spatial Tasks: BC, IQL, and the Gold-Standard Demo

Behavior cloning (BC) remains the simplest imitation-learning baseline; conservative offline-RL methods (CQL, IQL, AWAC) close the gap to online RL when demonstrations are scarce. The paper reports per-task efficiency on a 200-demo handheld interior dataset. Demo quality dominates demo quantity.

BC
IQL
CQL
AWAC
DAgger

RP-016 · MMXXV · x · 02

λ·II Reconstruction & Radiance Fields

Scene Layout from a Single Image: Manhattan and Beyond

Recovering a coarse scene layout (walls, floor, ceiling, openings) from a single RGB frame is a problem that rests on strong geometric assumptions. Methods that assume Manhattan-world geometry (LayoutNet, HorizonNet, AtlantaNet) fail on the non-orthogonal architecture common in older buildings. The paper notes a lab finding: the failure manifold of these methods correlates with construction era more than with image quality.

LayoutNet
HorizonNet
AtlantaNet
Manhattan-world
RoomNet

RP-065 · MMXXV · ix · 25

λ·VII Sensors, VIO & Calibration

Depth Sensors: The LiDAR-Class Comeback

Time-of-flight (ToF) and structured-light sensors on mobile (iPad Pro / iPhone Pro LiDAR-class, Android ToF) provide low-resolution but absolute-scale depth. The paper reviews their use in VIO scale recovery, mesh seeding, and people-segmentation. Conclusion: helpful, not transformative.

LiDAR
ToF
structured light
depth sensor
scale recovery

RP-005 · MMXXV · ix · 12

λ·I Localization & SLAM

Loop Closure Without a Prior: A Pose-Graph Note

When the device returns to a previously visited region without a place-recognition prior, the residual error per loop accumulates against the inverse covariance of the back-end pose graph. Sparse bundle adjustment (SBA) and incremental smoothing (iSAM2) handle this in classical pipelines, but the paper notes both struggle with degenerate motion (pure rotation) common in headset use. A factor-graph perspective with on-manifold updates is sketched.

pose graph
SBA
iSAM2
GTSAM
on-manifold optimization

RP-081 · MMXXV · ix · 11

λ·IX Rendering & Optics

Reflection Probes and Mirror Surfaces

Mirror surfaces in AR scenes are routinely mishandled. The paper reviews reflection-probe methods, screen-space reflections (SSR), and ray-marched reflections at mobile rates. Findings: the gap between physically correct and acceptable is wide; users tolerate qualitative reflections.

SSR
reflection probe
ray marching
BRDF
real-time GI

RP-055 · MMXXV · ix · 08

λ·VI Privacy & On-device ML

Voice and Microphone Hygiene in AR Sessions

AR sessions frequently keep the microphone open. The paper reviews on-device wake-word detection (Porcupine-class, RNN-T-class), VAD with explicit user signaling, and audio hashing for command recognition without transcript retention. The trade-off is intent latency vs. retained audio.

wake-word detection
VAD
RNN-T
audio hashing
on-device ASR

RP-025 · MMXXV · viii · 29

λ·III On-device Vision

Detection Under Domain Shift: Interior vs. Exterior

Object detectors trained on COCO underperform on interiors by 12–18 mAP points. The paper reviews fine-tuning, domain randomization, and additional indoor data, both captured (SUN-RGBD, ScanNet) and synthetic (Hypersim), as remediation. Synthetic data narrows but does not close the gap; the missing factor is texture diversity, not geometry.

COCO
SUN-RGBD
Hypersim
ScanNet
domain randomization

RP-035 · MMXXV · viii · 21

λ·IV Embodied RL & World Models

Reward Shaping in Spatial Tasks: A Cautionary Note

Reward shaping in 3D navigation is a perennial source of policy pathology. The paper catalogs shaping bugs that have appeared in published work (loop-back exploits, time-pressure compensation, distance-only rewards rewarding wall-hugging). A defensive checklist is offered.

reward shaping
PPO
potential-based shaping
intrinsic motivation
RND
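
Potential-based shaping, listed in the tags, is the classical defence against loop-back exploits: the shaping term is a difference of potentials, so in the undiscounted case it sums to zero around any closed loop. A minimal sketch follows; the potential values and state names are hypothetical.

```python
def shaping_term(phi_s, phi_next, gamma=0.99):
    """Potential-based shaping bonus F(s, s') = gamma * phi(s') - phi(s)."""
    return gamma * phi_next - phi_s

def shaped_reward(reward, phi_s, phi_next, gamma=0.99):
    """Environment reward plus the shaping bonus."""
    return reward + shaping_term(phi_s, phi_next, gamma)

# Hypothetical potentials for three states visited in a loop a -> b -> c -> a.
phi = {"a": 1.0, "b": 3.0, "c": 2.0}
loop = ["a", "b", "c", "a"]
loop_bonus = sum(shaping_term(phi[s], phi[n], gamma=1.0) for s, n in zip(loop, loop[1:]))
```

With `gamma=1.0` the bonus collected around the loop cancels exactly, so circling back earns nothing extra; an ad hoc bonus that is not a potential difference offers no such guarantee.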

RP-087 · MMXXV · viii · 19

λ·X Compute & Inference Systems

Frame Budget Engineering for AR Apps

Hitting 30 fps on phone-class compute is an exercise in subtraction. The paper digests the typical frame: capture, ISP, detection, segmentation, pose, render, composite, display. Each stage has a budget; the paper reports the typical envelope on flagship 2025 devices.

frame budget
ISP latency
render-thread
GPU contention
thermal throttle

RP-015 · MMXXV · viii · 17

λ·II Reconstruction & Radiance Fields

Photogrammetry at Phone Scale: Lighting Failure Modes

Photogrammetric reconstruction from handheld phone footage fails in three ways: specular surfaces, transparent surfaces, and uniform-textured walls. The paper reviews mitigations (multi-view photometric stereo, polarimetric capture, learned depth priors) and notes that none is a complete answer. A taxonomy of interior failure cases by surface type is provided as a lookup.

photogrammetry
photometric stereo
polarimetric capture
learned priors
MVS

RP-072 · MMXXV · viii · 12

λ·VIII Interaction, Hand & Display

Optical See-Through Displays: A 2025 Snapshot

Waveguide optics, geometric optics, and free-form combiners coexist in 2025-era HMDs. The paper indexes field-of-view, eye-relief, eyebox, and ambient-light tolerance. Worth filing: no current display class clears 60° FoV at sustained day-bright contrast.

waveguide
geometric combiner
free-form optic
FoV
ambient contrast

RP-045 · MMXXV · viii · 07

λ·V Multimodal & Vision-Language

Tool Use in Vision-Language Pipelines

VLMs have been wired into tool-use pipelines (function calling, structured output, JSON-mode). The paper reviews indoor tasks that benefit (calibration, measurement, captioning a known scene) and tasks that do not (real-time tracking). A short note on output-token budgeting under interactive constraints is included.

function calling
JSON mode
ReAct
tool use
structured output

RP-080 · MMXXV · vii · 29

λ·IX Rendering & Optics

Light Estimation: From Spherical Harmonics to LightStages

Real-time light estimation in AR is dominated by spherical harmonic regression from the live frame. The paper compares ARKit's environment textures, PointAR-style estimators, and learned HDR-from-LDR methods. A note: harsh point lights remain difficult.

spherical harmonics
ARKit environment
PointAR
HDR estimation
light probe

RP-054 · MMXXV · vii · 22

λ·VI Privacy & On-device ML

Face Blurring: When the Detector Is the Privacy Mechanism

Real-time face blurring on AR streams depends on the underlying face detector. The paper reviews YuNet, BlazeFace, RetinaFace-Mobile, and MediaPipe Face Detector under occlusion, profile views, and low light. The privacy guarantee is bounded by the detector's recall, not its precision.

YuNet
BlazeFace
RetinaFace
MediaPipe Face
privacy filter
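
Why recall bounds the guarantee can be seen with a short calculation. Assuming, for illustration only, that detector misses are independent from frame to frame, the chance that a face appears unblurred in at least one frame grows quickly with the number of frames.

```python
def exposure_probability(recall, frames):
    """P(at least one missed detection) over `frames` frames.

    Assumes independent per-frame misses, which real detectors violate:
    misses tend to cluster on hard poses, so treat this as a rough guide.
    """
    return 1.0 - recall ** frames

one_second_at_30fps = exposure_probability(0.99, 30)
```

Even a detector with 99% per-frame recall leaves a face exposed in roughly a quarter of one-second clips under this assumption, whereas false positives only blur too much.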

RP-064 · MMXXV · vii · 18

λ·VII Sensors, VIO & Calibration

Magnetic Sensors as a Sixth Channel

The magnetometer is the most under-used sensor on a smartphone. Indoor magnetic anomalies are stable and locally distinctive. The paper reviews magnetic-fingerprint navigation (IndoorAtlas, learned magnetic embeddings) and discusses why this signal is not yet a default in VIO.

magnetometer
magnetic fingerprint
IndoorAtlas
EKF
particle filter

RP-004 · MMXXV · vii · 03

λ·I Localization & SLAM

Learned Place Recognition: NetVLAD's Successors

NetVLAD remains the load-bearing place-recognition baseline despite being nearly a decade old. Successors (Patch-NetVLAD, MixVPR, MegaLoc, AnyLoc) improve recall on changing-condition benchmarks (day/night, summer/winter), but the gain narrows on the indoor case. The paper hypothesizes that interior featurelessness defeats the assumption that landmarks are stable. A ranked digest with per-method memory footprint and inference cost is included for practitioners.

NetVLAD
Patch-NetVLAD
MixVPR
MegaLoc
AnyLoc

RP-024 · MMXXV · vi · 25

λ·III On-device Vision

Monocular Depth on the CPU: A Cold Look

Mobile depth estimation runs primarily on the NPU but degrades gracefully to CPU when the NPU is contended (camera ISP under load, video encode). The paper profiles MiDaS-Small, ZoeDepth-NK, and Depth Anything-Mobile under CPU-only conditions. Latency variance dominates; mean latency is the wrong metric.

MiDaS-Small
ZoeDepth
Depth Anything
CPU inference
latency p99

RP-044 · MMXXV · vi · 18

λ·V Multimodal & Vision-Language

Multimodal Retrieval: When the Query Is the Room

Retrieval-augmented generation (RAG) has been adapted to spatial corpora. The paper reviews CLIP-feature retrieval, multi-vector retrieval (ColBERT-style), and hybrid sparse-dense retrieval over interior image stores. A finding worth filing: spatial retrieval benefits more from coarse layout features than from fine descriptors.

CLIP retrieval
ColBERT
BM25
RAG
multi-vector

RP-071 · MMXXV · vi · 14

λ·VIII Interaction, Hand & Display

Haptic Feedback in AR: When the Phone Replaces the Glove

Mobile haptics — Taptic Engine, Android haptic API — provide a thin but useful feedback channel for AR confirmations. The paper reviews haptic-pattern design language (sharp vs. soft, single vs. paired) and reports user-detection thresholds at typical phone-holding postures.

Taptic Engine
Android haptic API
tactile pattern
vibrotactile
psychophysics

RP-014 · MMXXV · vi · 08

λ·II Reconstruction & Radiance Fields

Depth Anything v2 and the Great Depth Backbone Shift

Universal monocular-depth backbones (MiDaS, DPT, Depth Anything, ZoeDepth, Marigold) have moved from research curiosities to default scene-understanding components. The paper reports per-pixel scale-invariant error on indoor benchmarks and notes the backbone of choice now changes monthly. A short discussion of distillation paths to mobile is included.

MiDaS
DPT
Depth Anything v2
ZoeDepth
Marigold

RP-034 · MMXXV · vi · 04

λ·IV Embodied RL & World Models

JEPA, V-JEPA, and the Self-Supervised Vision Wave

Joint Embedding Predictive Architectures (JEPA) propose self-supervised learning by predicting in a learned representation rather than pixel space. V-JEPA extended this to video. The paper compares JEPA-class methods to MAE, DINOv2, and contrastive baselines on transfer to navigation and detection. Findings are mixed; the regime where JEPA wins is narrower than headline claims.

JEPA
V-JEPA
MAE
DINOv2
contrastive learning

RP-053 · MMXXV · v · 30

λ·VI Privacy & On-device ML

Differential Privacy for Spatial Data: ε, δ, and the Floor Plan

Differential privacy guarantees on spatial data are weaker than the literature implies. The paper reviews local-DP, central-DP, and shuffle-DP applied to room-scale telemetry. A finding: floor plans leak through low-ε mechanisms more than the worst-case theory suggests; ε ≤ 1 is required for meaningful protection.

differential privacy
local DP
shuffle DP
Gaussian mechanism
Laplace mechanism
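
The Laplace mechanism, listed in the tags, adds noise with scale sensitivity/ε to a numeric query. A minimal sketch is given below; the sensitivity value and the room-area example are hypothetical, and the noise is drawn as the difference of two exponential samples, which is Laplace-distributed.

```python
import random

def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Release `value` with Laplace noise of scale sensitivity / epsilon."""
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return value + noise

# Hypothetical query: a room area in square metres with assumed sensitivity 1.0.
rng = random.Random(0)
noisy_area = laplace_mechanism(18.5, sensitivity=1.0, epsilon=1.0, rng=rng)
```

Halving ε doubles the noise scale, which is the cost of the stronger protection that the finding above calls for.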

RP-003 · MMXXV · v · 21

λ·I Localization & SLAM

GNSS-Denied Indoor: When the Phone Stops Believing the Sky

Inside steel-framed buildings, GNSS pseudo-ranges drift faster than the IMU bias. The Kalman filter must be told to stop trusting the satellite stream. The paper reviews fault-detection-and-exclusion (FDE) heuristics, magnetometer cross-checks, and barometric altimetry as supplementary signals. A finding worth filing: pedestrian dead-reckoning (PDR) augmented by step-length estimation outperforms naive IMU integration by a factor of four on multi-floor traversals.

GNSS FDE
IMU integration
PDR
magnetometer
barometric altimetry
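
Pedestrian dead-reckoning advances the position by one estimated step length along the current heading for each detected step. The sketch below assumes step detection, step-length estimation, and heading are already available; those inputs, and the function name, are assumptions for the example.

```python
import math

def pdr_track(start_xy, steps):
    """Integrate (step_length_m, heading_rad) pairs into a 2D path."""
    x, y = start_xy
    path = [(x, y)]
    for length, heading in steps:
        x += length * math.cos(heading)
        y += length * math.sin(heading)
        path.append((x, y))
    return path
```

Because error enters once per step rather than once per IMU sample, drift grows with distance walked instead of with the square of elapsed time, which is the usual argument for PDR over naive double integration.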

RP-063 · MMXXV · v · 20

λ·VII Sensors, VIO & Calibration

Rolling-Shutter Compensation in Mobile VIO

Most mobile cameras use rolling shutter; classical VIO assumes global shutter. The paper reviews per-row pose interpolation, learned RS undistortion (DeepRS, RS-NeRF), and the residual error after either. Findings: per-row interpolation suffices below 1 m/s motion; learned methods help on faster motion.

rolling shutter
DeepRS
RS-NeRF
BASALT-RS
per-row interpolation
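
Per-row pose interpolation assigns each image row its own capture time and interpolates the camera pose to that time. The sketch below handles translation only and assumes a linear readout; rotation would need spherical interpolation, and all parameter names are hypothetical.

```python
def row_timestamp(row, image_height, frame_start_s, readout_s):
    """Capture time of a row, assuming rows are read out at a constant rate."""
    return frame_start_s + readout_s * row / image_height

def interpolate_position(p0, p1, t0, t1, t):
    """Linear interpolation of camera position between two timestamped poses."""
    a = (t - t0) / (t1 - t0)
    return tuple((1.0 - a) * u + a * v for u, v in zip(p0, p1))
```

A feature observed on the middle row of a frame with a 30 ms readout is then projected with the pose from 15 ms into the frame rather than the pose at frame start.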

RP-079 · MMXXV · v · 13

λ·IX Rendering & Optics

Differentiable Rendering for AR: An Index

Differentiable rendering (Mitsuba 3, nvdiffrast, GAN-style implicit renderers) underpins most modern view synthesis. The paper indexes them by feature: physically-based, mesh, splat, neural. The lab uses these for ablation rather than runtime; runtime renderers are simpler.

Mitsuba 3
nvdiffrast
PyTorch3D
differentiable rendering
BRDF

RP-023 · MMXXV · v · 07

λ·III On-device Vision

Real-Time Instance Segmentation: Beyond YOLO-Seg

Instance segmentation on mobile is dominated by YOLO-seg variants for cost reasons, but the gap to research-grade transformer methods (Mask2Former, MaskDINO) remains visible. The paper inventories where the gap matters (occluded thin structures) and where it does not (large convex objects).

YOLO-seg
Mask2Former
MaskDINO
OneFormer
Cascade-RCNN

RP-086MMXXV · v · 04

λ·XCompute & Inference Systems

INT8 vs. FP16: Where the Crossover Sits in 2025

INT8 quantization is now the default for mobile inference; FP16 remains common on newer NPU silicon with native FP16 paths. The paper reports per-op latency for representative kernels and notes that the crossover depends on the operator mix more than on the hardware.

INT8
FP16
BF16
operator fusion
tensor-core

RP-013MMXXV · iv · 29

λ·IIReconstruction & Radiance Fields

Monocular Mesh-from-Video: A Practitioner Digest

Recovering a watertight mesh from a single moving camera is now feasible offline; doing so in real time on a phone is not. The paper reviews COLMAP-class structure-from-motion, Atlas-class learned reconstruction, NeuralRecon, MonoSDF, and SimpleRecon. Findings: learned methods are faster but hallucinate textureless walls; classical methods are slower but honest about uncertainty.

COLMAP
NeuralRecon
MonoSDF
SimpleRecon
Atlas

RP-043MMXXV · iv · 23

λ·VMultimodal & Vision-Language

Small Vision-Language Models: Phi-Vision, MiniCPM-V, Idefics-Mobile

Sub-3B parameter VLMs are now within the inference budget of high-end mobile devices. The paper benchmarks Phi-Vision, MiniCPM-V, Idefics-Mobile, and Moondream on indoor caption-and-ground tasks. Findings: small VLMs are surprisingly capable on naming, mediocre on counting, and weak on spatial reasoning.

Phi-Vision
MiniCPM-V
Idefics
Moondream
VLM benchmark

RP-070MMXXV · iv · 18

λ·VIIIInteraction, Hand & Display

Phone-as-Pointer: A Pre-Glasses Interaction Note

Before head-worn AR is universal, the phone is the pointer. The paper reviews ARKit/ARCore pointer abstractions, ray-casting from the phone center, and laser-pointer-style interaction. The dominant failure mode is fatigue under long sessions.
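
Ray-casting from the phone center reduces, in the simplest case, to a ray-plane intersection. The sketch below is a generic geometric illustration and not the ARKit or ARCore API; the function name `ray_plane_hit` and its parameters are hypothetical, and the plane is assumed to be given by a point and a normal.

```python
def ray_plane_hit(origin, direction, plane_point, plane_normal, eps=1e-9):
    """Return the point where a ray meets a plane, or None.

    origin and direction describe the pointer ray (for example the
    camera position and its forward axis). None is returned when the
    ray is parallel to the plane or the plane lies behind the origin.
    """
    denom = sum(d * n for d, n in zip(direction, plane_normal))
    if abs(denom) < eps:
        return None  # ray runs parallel to the plane
    offset = tuple(p - o for p, o in zip(plane_point, origin))
    t = sum(v * n for v, n in zip(offset, plane_normal)) / denom
    if t < 0.0:
        return None  # intersection is behind the phone
    return tuple(o + t * d for o, d in zip(origin, direction))
```

A phone held 1.5 m above a floor at y = 0 and tilted down along (0, -0.6, -0.8) would, under these assumptions, hit the floor 2 m ahead of the user.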

ARKit ray
ARCore
ray casting
pointer interaction
Fitts' law

RP-033MMXXV · iv · 15

λ·IVEmbodied RL & World Models

Sim-to-Real Without a Real Robot: Phone-Held Embodied Agents

Most embodied-agent literature assumes a robot. A handheld device acting as a 'partial agent' (the human is the actuator) inverts the problem: perception is rich, motor primitives are gestures, success is user-confirmed. The paper sketches an evaluation protocol for handheld embodied tasks (find the door, follow the path, locate the tool).

sim-to-real
Habitat-Matterport
embodied AI
imitation learning
policy gradient

RP-002MMXXV · iv · 08

λ·ILocalization & SLAM

Anchor Durability Across Sessions: A Survey of Persistence Strategies

Persistent anchoring across application launches is approached three ways in current systems: re-localize against a stored sparse map, re-localize against a learned descriptor cloud, or re-localize against a server-side spatial graph. Each pays a different price. Sparse maps are small and brittle to lighting; descriptor clouds are robust to lighting but heavy on storage; server graphs centralize what should be local. The paper enumerates the trade space and notes that none of the three copes with moved or replaced soft furnishings, the most common interior change.

sparse map
NetVLAD descriptors
MegaLoc
ARKit Persistent Anchors
spatial graph

RP-052MMXXV · iv · 04

λ·VIPrivacy & On-device ML

Federated Learning at Phone Scale: A Reality Check

Federated learning (FL) has been demonstrated at scale by major platforms (Gboard, Siri-style on-device updates). The paper reviews FedAvg, FedProx, SCAFFOLD, and client-drift mitigation. Findings: FL works for narrow tasks (next-word prediction, wake-word) and remains brittle for spatial tasks where client distributions differ wildly.
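
The aggregation step of FedAvg can be sketched as a sample-weighted mean of client parameters. This is a minimal illustration on flat parameter lists; the function name `fedavg` and the list-of-floats model representation are assumptions made here, and client sampling, local training, and secure aggregation are omitted.

```python
def fedavg(client_params, client_sizes):
    """Average client parameter vectors, weighted by local sample count.

    client_params is a list of equal-length float lists (one per client);
    client_sizes holds the number of local training examples per client.
    """
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("at least one client must hold data")
    n_params = len(client_params[0])
    return [
        sum(params[i] * size for params, size in zip(client_params, client_sizes)) / total
        for i in range(n_params)
    ]
```

When client distributions differ widely, as the abstract notes for spatial tasks, this weighted mean can sit far from every client's local optimum, which is the drift that FedProx and SCAFFOLD try to correct.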

FedAvg
FedProx
SCAFFOLD
client drift
federated learning

RP-078MMXXV · iii · 26

λ·IXRendering & Optics

Shadow Estimation for AR Objects

Synthetic objects without shadows look pasted on. The paper reviews learned shadow synthesis (Shadow-AR, ShadowGAN), classical shadow-casting from a single light estimate, and probe-based environmental capture. Findings: a single dominant light estimate suffices for most interior scenes.
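
Classical shadow-casting from a single light estimate can be illustrated by projecting object vertices onto the ground plane along the light direction. The sketch assumes a directional light and a horizontal ground plane; `project_to_ground` and its `ground_y` parameter are hypothetical names introduced here.

```python
def project_to_ground(vertex, light_dir, ground_y=0.0):
    """Project a vertex onto the plane y = ground_y along light_dir.

    light_dir is the direction the light travels and must point
    downward (negative y). The vertex is assumed to sit above the plane.
    """
    x, y, z = vertex
    dx, dy, dz = light_dir
    if dy >= 0.0:
        raise ValueError("light direction must have a negative y component")
    t = (ground_y - y) / dy  # distance along the light ray to the plane
    return (x + t * dx, ground_y, z + t * dz)
```

Projecting every vertex of a mesh this way yields a flat shadow polygon that can be darkened and blurred before compositing.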

shadow synthesis
Shadow-AR
light estimation
environment probe
AR rendering

RP-022MMXXV · iii · 19

λ·IIIOn-device Vision

Segment Anything at Phone Scale: SAM, MobileSAM, EfficientSAM

SAM redefined what zero-shot segmentation could do. Its mobile descendants — MobileSAM, EfficientSAM, EdgeSAM — recover most of the quality at a fraction of the cost. The paper reports per-prompt latency and IoU on indoor classes. A short note: SAM-class models are *too* general for tracked-object pipelines; a tighter prompt budget is essential.
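
For reference, the IoU figure reported for masks is the ratio of overlapping pixels to pixels covered by either mask. A minimal sketch on nested lists of 0/1 values follows; the function name `mask_iou` and the convention of returning 1.0 for two empty masks are assumptions made here.

```python
def mask_iou(mask_a, mask_b):
    """Intersection over union of two equally sized binary masks."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            if a and b:
                intersection += 1
            if a or b:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0
```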

SAM
MobileSAM
EfficientSAM
EdgeSAM
IoU

RP-012MMXXV · iii · 15

λ·IIReconstruction & Radiance Fields

NeRF in 2025: From Vanilla to Instant-NGP and Beyond

Neural Radiance Fields (NeRF) defined a generation of view-synthesis work. Five years on, the field has split three ways: hash-grid encodings (Instant-NGP) for speed; tensor decompositions (TensoRF, K-Planes) for memory; and discrete primitives (3DGS) for both. The paper treats this as a Pareto front rather than a winner-takes-all contest, and indexes which method dominates which corner of the (training-time, render-time, memory) cube.
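
The Pareto framing can be made concrete with a small dominance check over (training-time, render-time, memory) triples, where lower is better on every axis. The sketch below is illustrative only: `dominates` and `pareto_front` are names introduced here, and any numbers passed in are the caller's own, since no measurements from the paper are reproduced.

```python
def dominates(a, b):
    """True if cost tuple a is no worse than b everywhere and better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(costs):
    """Return the names of entries that no other entry dominates.

    costs maps a method name to a tuple of costs, for example
    (training_time, render_time, memory). Lower is better throughout.
    """
    return [
        name
        for name, cost in costs.items()
        if not any(dominates(other, cost) for key, other in costs.items() if key != name)
    ]
```

A method that is slower to train, slower to render, and larger than some competitor drops off the front; every other method wins at least one corner of the cube.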

NeRF
Instant-NGP
TensoRF
K-Planes
3DGS

RP-042MMXXV · iii · 11

λ·VMultimodal & Vision-Language

Grounding the Frame: From CLIP to GroundingDINO to Molmo

Grounding — answering 'where is X in this image?' — has matured rapidly. The paper compares CLIP-based zero-shot pointing, GroundingDINO box outputs, GLIP, KOSMOS-2 grounded captioning, and Molmo's pointing tokens. Findings: pointing-token methods are surprisingly accurate for a one-shot output but lossy under occlusion.

CLIP
GroundingDINO
GLIP
KOSMOS-2
Molmo

RP-062MMXXV · iii · 04

λ·VIISensors, VIO & Calibration

Camera-IMU Time Synchronization: A Persistent Tax

Sub-millisecond camera-IMU time alignment is a precondition for tight VIO. The paper reviews hardware-level synchronization, software estimation (Kalibr-style), and learned correction. A finding: published methods assume static alignment; in practice phones drift across thermal cycles.
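
A coarse software estimate of the camera-IMU offset can be obtained by cross-correlating two rotation-rate signals, one from the gyroscope and one derived from camera motion. The sketch below is a simplified illustration and not the Kalibr procedure: `best_lag` is a hypothetical name, both signals are assumed to be resampled to a common rate, and the result is only sample-accurate, so sub-millisecond alignment would need further refinement.

```python
def best_lag(reference, signal, max_lag):
    """Return the integer lag (in samples) that best aligns signal to reference.

    A positive result means signal is delayed relative to reference.
    Both inputs are sequences of rotation-rate magnitudes sampled at
    the same rate; only overlapping samples contribute to the score.
    """
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, ref_value in enumerate(reference):
            j = i + lag
            if 0 <= j < len(signal):
                score += ref_value * signal[j]
        if score > best_score:
            best, best_score = lag, score
    return best
```

Multiplying the returned lag by the sample period gives the offset in seconds. Because the abstract reports thermal drift, such an estimate would presumably need to be repeated during a session rather than computed once.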

Kalibr
VIO synchronization
thermal drift
OpenVINS
BASALT

RP-032MMXXV · ii · 26

λ·IVEmbodied RL & World Models

Indoor Navigation Policies: From PointGoal to ObjectNav

Habitat and AI2-THOR have anchored embodied-agent research for half a decade. The paper benchmarks recurrent (LSTM-based) and transformer-based policies on PointGoal and ObjectNav, and reports a sim-to-real gap that has narrowed but not closed. Real-world deployment fails for reasons (clutter, thin obstacles, glass) that simulation under-models.

Habitat
AI2-THOR
PointGoal
ObjectNav
transformer policy

RP-051MMXXV · ii · 19

λ·VIPrivacy & On-device ML

On-Device by Default: A Position Note

The lab's posture is that spatial data — room geometry, faces, voice — is processed on device unless the operator explicitly opts in to upload. The paper reviews the technical cost of this position (model size, compute envelope, storage) against the cost of the alternative (network egress, server-side retention, breach surface).

on-device ML
Core ML
TensorFlow Lite
ONNX Runtime
edge inference

RP-001MMXXV · ii · 14

λ·ILocalization & SLAM

Drift Persists: Re-evaluating Visual-Inertial Odometry on Long Sessions

Across the visual-inertial odometry literature, the headline accuracy figure is computed over sessions under three minutes. The lab finds that error compounds non-linearly past the ten-minute mark in featureless interiors. The dominant failure mode is not gyroscope bias but loss of feature continuity during fast yaw; loop-closure rescues are rare without a place-recognition prior. The paper contrasts MSCKF, OKVIS, and VINS-Mono against an extended ORB-SLAM3 baseline on a 90-minute interior walkthrough.

VIO
MSCKF
OKVIS
VINS-Mono
ORB-SLAM3

RP-069MMXXV · ii · 11

λ·VIIIInteraction, Hand & Display

Gaze, Pinch, and the Glasses-Era Interaction Vocabulary

Hands-free spatial UI is converging on a small vocabulary: gaze for targeting, pinch for selection, palm rotations for scaling, and voice for naming. The paper reviews input studies from major HMD platforms and notes the surprising consistency of pinch-as-select across vendors.

gaze tracking
pinch detection
spatial UI
Fitts' law
hand pose

RP-085MMXXV · ii · 08

λ·XCompute & Inference Systems

Mobile NPU Inventory: Apple ANE, Qualcomm Hexagon, Tensor TPU

The mobile NPU landscape is a three-horse race in early 2025: Apple's Neural Engine, Qualcomm's Hexagon NPU, and Google's Tensor TPU. The paper benchmarks representative networks (YOLOv8n, MobileSAM, MiDaS-Small, Phi-3.5-mini) across each. Findings: the gap is narrower than vendor claims, wider than developer experience suggests.

ANE
Hexagon NPU
Tensor TPU
Core ML
QNN

RP-021MMXXV · ii · 03

λ·IIIOn-device Vision

YOLO at the Edge: From v8 to v11 on Mobile NPUs

The YOLO family has continued its quiet competence streak through versions 8–11. The paper benchmarks v8n, v9c, v10s, and v11n at 320×320 on three mobile NPU classes. Findings: v11n offers the strongest cost/accuracy ratio for interior-object detection; v8n remains the most predictable in latency variance. Not all gains generalize away from COCO.

YOLOv8
YOLOv9
YOLOv10
YOLOv11
COCO benchmark

RP-041MMXXV · i · 28

λ·VMultimodal & Vision-Language

Vision-Language Models for the Room: A Field Survey

VLMs (CLIP, BLIP-2, LLaVA, Qwen-VL, InternVL, Florence-2) have absorbed a generation of separate captioning, VQA, and grounding models. The paper inventories indoor-scene tasks where VLMs now dominate (object naming, attribute extraction, scene description) and tasks where they still underperform (spatial relations, counting, fine pose).

CLIP
BLIP-2
LLaVA
Qwen-VL
InternVL

RP-011MMXXV · i · 22

λ·IIReconstruction & Radiance Fields

Gaussian Splatting at the Edge: A Compute Inventory

3D Gaussian Splatting (3DGS) reconstructs scenes as a cloud of anisotropic Gaussians and renders them at real-time rates on desktop GPUs. The paper inventories what is needed to deliver the same on mobile NPUs: a 4–10× reduction in splat count, INT8 quantization of opacity and SH coefficients, and a tiled rasterizer. The lab notes the open problem is not rendering but training: optimization currently demands 30+ minutes on consumer hardware.
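
The INT8 quantization mentioned for SH coefficients can be sketched with a symmetric per-group scale. This is one common scheme, chosen here for illustration; the paper does not say which scheme it assumes, and the names `quantize_symmetric` and `dequantize` are hypothetical.

```python
def quantize_symmetric(values):
    """Map floats to integers in [-127, 127] with one shared scale.

    Returns (quantized, scale). The scale is chosen so the largest
    magnitude in the group maps to 127.
    """
    peak = max(abs(v) for v in values)
    scale = peak / 127.0 if peak > 0.0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integer codes."""
    return [q * scale for q in quantized]
```

Each value then costs one byte instead of four, at a reconstruction error of at most half the scale. Opacity, which lies in [0, 1], could use an unsigned variant of the same idea.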

3DGS
spherical harmonics
tiled rasterization
INT8 quantization
anisotropic Gaussians

RP-077MMXXV · i · 15

λ·IXRendering & Optics

Compositing the Real and the Synthetic

AR rendering is fundamentally a compositing problem: the synthetic image must match the live image in exposure, color, and motion blur. The paper reviews real-time tone mapping, color-matching neural networks, and motion-blur transfer. The match is rarely perfect; user tolerance for the gap is wider than the literature claims.

compositing
tone mapping
color match
motion blur
ACES

RP-031MMXXV · i · 12

λ·IVEmbodied RL & World Models

World Models in 2025: DreamerV3 and Beyond

World models — agents that learn a latent forward dynamics model and plan inside it — have re-entered the mainstream after DreamerV3 demonstrated cross-domain generalization. The paper reviews the V1–V3 progression, IRIS, TWM, and the JEPA family. The transition from pixel-space to latent-space planning is treated as the field's central inflection.

DreamerV3
IRIS
TWM
JEPA
world model

RP-061MMXXV · i · 08

λ·VIISensors, VIO & Calibration

IMU Calibration in 2025: Stationary, Allan Variance, and Online

IMU calibration splits into factory-stationary methods (Allan variance for noise characterization), in-situ stationary refinement, and online calibration during use. The paper reviews each and notes that online calibration has matured to the point where dedicated stationary calibration adds little for consumer devices. Allan-variance work remains essential for sensor selection.
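
For readers unfamiliar with the quantity, the Allan variance at a given averaging window can be computed from a stationary recording as below. The sketch uses the simple non-overlapping estimator; the function name `allan_variance` and the choice of estimator are assumptions made here, and overlapping estimators are often preferred on real data.

```python
def allan_variance(samples, cluster_size):
    """Non-overlapping Allan variance for one averaging window.

    samples is a stationary recording of one sensor axis at a fixed
    rate; cluster_size is the number of samples per averaging window,
    so the averaging time is cluster_size times the sample period.
    """
    n_clusters = len(samples) // cluster_size
    if n_clusters < 2:
        raise ValueError("need at least two full clusters")
    means = [
        sum(samples[i * cluster_size:(i + 1) * cluster_size]) / cluster_size
        for i in range(n_clusters)
    ]
    squared_diffs = sum((b - a) ** 2 for a, b in zip(means, means[1:]))
    return squared_diffs / (2.0 * (n_clusters - 1))
```

Evaluating this over a range of cluster sizes and plotting the square root against averaging time gives the Allan deviation curve used to compare sensors.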

Allan variance
IMU bias
online calibration
factory cal
Kalibr