VOIR · RESEARCH REGISTRY · OPEN CORPUS

A reading of the field.

Eighty-eight short papers on the techniques and instruments shaping augmented reality, immersive computing, and on-device perception. The lab observes; the lab summarizes; the lab declines to disclose how the methods are combined behind the wall.

VOLUME: 88 entries
WINDOW: 2025 – 2026
DOMAINS: nine
FORMAT: survey · digest · field note


RP-050 · MMXXVI · v · 14
λ·V · Multimodal & Vision-Language

When the Model Should Not Speak: Refusal in Spatial AI

VLMs generate fluent text whether or not the underlying perception supports it. The paper examines refusal training and uncertainty-aware decoding for spatial tasks. The lab argues that interactive AR favors *short refusals* over long confabulations, and proposes a per-token entropy gate.

  • refusal training
  • uncertainty quantification
  • entropy gating
  • abstention
  • calibration
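
For RP-050, a per-token entropy gate could be sketched roughly as follows. This is a minimal illustration, not the lab's implementation: the function names, the 1.0-nat threshold, and the refusal string are all assumptions made for the example.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def gated_decode(step_distributions, vocab, max_entropy=1.0, refusal="I can't tell."):
    """Greedy decode that aborts with a short refusal at the first
    step whose entropy exceeds max_entropy (hypothetical threshold)."""
    tokens = []
    for probs in step_distributions:
        if token_entropy(probs) > max_entropy:
            return refusal
        best = max(range(len(probs)), key=probs.__getitem__)
        tokens.append(vocab[best])
    return " ".join(tokens)
```

In practice the threshold would need calibration per model and task; the sketch only shows where the gate sits in the decode loop.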

RP-010 · MMXXVI · v · 12
λ·I · Localization & SLAM

The Cold-Start Problem in Persistent AR

First-launch re-localization is the most user-hostile moment in persistent AR: nothing is known about the room until the device sweeps it. The paper digests four cold-start strategies: dense priming sweeps, gravity-aligned mini-maps, server-side coarse priors, and IMU-only blind tracking until the first feature lock. None is satisfying alone; the lab argues for a hybrid with explicit user signaling on the first session.

  • cold start
  • gravity-aligned map
  • blind tracking
  • feature priming
  • VIO

RP-040 · MMXXVI · v · 09
λ·IV · Embodied RL & World Models

When Not to Use RL: A Decision Note

Reinforcement learning is tempting for spatial tasks but is rarely the cheapest answer. The paper catalogues five tasks where the lab considered RL and chose classical control or imitation instead, with rationale. A decision tree is offered: if the reward is dense, the demonstrations exist, and the simulator is faithful, RL is reasonable; otherwise, it is not.

  • classical control
  • MPC
  • imitation learning
  • decision tree
  • RL
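
The decision rule stated in RP-040 can be written down as a three-way conjunction. The sketch below encodes only what the abstract says; the `TaskProfile` type and its field names are hypothetical, and the paper's full decision tree may have more branches.

```python
from dataclasses import dataclass

@dataclass
class TaskProfile:
    dense_reward: bool          # is the reward signal dense?
    demonstrations_exist: bool  # are demonstrations available?
    faithful_simulator: bool    # does the simulator match reality?

def rl_is_reasonable(task: TaskProfile) -> bool:
    """RL is reasonable only if all three conditions hold;
    otherwise prefer classical control or imitation."""
    return (task.dense_reward
            and task.demonstrations_exist
            and task.faithful_simulator)
```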

RP-076 · MMXXVI · v · 06
λ·VIII · Interaction, Hand & Display

Eye Tracking on Mobile: The Unsolved Case

Eye tracking with phone front cameras is feasible at coarse resolution; on a handheld device, gaze prediction is dominated by head pose, not eyeball pose. The paper reviews mobile-gaze methods (GazeCapture, OpenGaze) and their characteristic failure modes.

  • GazeCapture
  • OpenGaze
  • EyeNet
  • appearance-based gaze
  • calibration-free gaze

RP-030 · MMXXVI · v · 02
λ·III · On-device Vision

Distillation Pipelines: Teacher → Student → Edge

Knowledge distillation has become a default tool for shipping research-grade models to mobile. The paper reviews logit distillation, feature distillation, attention transfer, and self-distillation. A finding: distillation works best when the student architecture is similar to the teacher; cross-family distillation is brittle.

  • logit distillation
  • feature distillation
  • attention transfer
  • self-distillation
  • DistilBERT-style
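
As a concrete reference for the logit-distillation variant named in RP-030, the following sketch computes the commonly used temperature-softened KL loss between teacher and student logits. The temperature of 2.0 and the function names are assumptions for illustration; the paper's own settings are not stated.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    so gradient magnitudes stay comparable across temperatures."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl
```

In training this term is usually mixed with the ordinary task loss; the mixing weight is another tunable not covered here.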

RP-084 · MMXXVI · iv · 30
λ·IX · Rendering & Optics

Volumetric Rendering at Phone Frame Rates

Volumetric AR effects (smoke, light shafts, transparency) remain expensive on mobile. The paper indexes density-grid rendering, sparse voxel approaches (NSVF), and signed-distance volumes. Findings: small volumes are cheap; large volumes are not yet tractable.

  • density grid
  • NSVF
  • SDF rendering
  • ray marching
  • volume rendering
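
The cost argument in RP-084 follows from how ray marching works: every ray pays for every sample it takes through the volume. A minimal single-ray, single-channel sketch of the standard emission-absorption accumulation is given below; the function name and the scalar colour are simplifications assumed for the example.

```python
import math

def march(densities, colors, step):
    """Front-to-back compositing along one ray.
    densities: extinction coefficient at each sample
    colors:    scalar emitted colour at each sample
    step:      distance between samples
    Returns (accumulated radiance, accumulated opacity)."""
    transmittance = 1.0
    radiance = 0.0
    for sigma, color in zip(densities, colors):
        alpha = 1.0 - math.exp(-sigma * step)  # opacity of this segment
        radiance += transmittance * alpha * color
        transmittance *= 1.0 - alpha
    return radiance, 1.0 - transmittance
```

The loop runs once per sample per ray, so work grows with both volume extent and screen coverage, which is consistent with small volumes being cheap and large ones not.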

RP-068 · MMXXVI · iv · 25
λ·VII · Sensors, VIO & Calibration

Polarization as a Cue: A Mostly-Forgotten Channel

Polarization-aware imaging is mature in microscopy, fragile in mobile. The paper reviews polarimetric phones (Sony IMX-class with pixel polarizers), specular separation (PolDepth), and shape-from-polarization. Filed as a watchlist item: not deployed, of interest.

  • polarimetric imaging
  • PolDepth
  • shape-from-polarization
  • Sony IMX
  • specular separation

RP-020 · MMXXVI · iv · 22
λ·II · Reconstruction & Radiance Fields

On-Device Splatting: A 2026 Status Note

Real-time on-device 3DGS rendering crossed the 30 fps threshold on flagship-class mobile NPUs in late 2025. The paper sketches the bottlenecks that remain — splat sort, alpha compositing, memory bandwidth on tiled GPUs — and the techniques that have been tried (hierarchical splat LODs, view-frustum culling, adaptive radius). It does not describe shipped lab pipelines.

  • 3DGS
  • hierarchical LOD
  • view-frustum culling
  • tiled GPU
  • alpha compositing

RP-058 · MMXXVI · iv · 18
λ·VI · Privacy & On-device ML

On-Device LLMs and the Memory-Cost Frontier

Small LLMs in the 1B–4B parameter range (Phi-3.5-mini, Gemma-2B, Llama-3.2-1B, Qwen-2.5-1.5B) now fit on flagship phones with 1–3 GB of RAM. The paper reports tokens-per-second under thermal load, INT4 quantization, and KV-cache compression. The privacy benefit is substantial; the latency cost is not yet zero.

  • Phi-3.5
  • Gemma-2B
  • Llama-3.2
  • Qwen-2.5
  • KV-cache

RP-009 · MMXXVI · iv · 05
λ·I · Localization & SLAM

Drift Compensation by Magnetic Anomaly: A Field Note

Indoor magnetic fields are highly non-uniform and surprisingly stable — a fingerprint that is reproducible to ~30 cm in steel-framed buildings. The paper reviews the magnetic-fingerprint literature (IndoorAtlas-style approaches, learned magnetic descriptors) and reports a fusion experiment combining VIO with a continuous magnetic prior. The fusion adds 2 ms of compute per frame and reduces long-session drift in featureless corridors by ~38%.

  • magnetic fingerprinting
  • VIO
  • Kalman filter
  • IndoorAtlas-style
  • particle filter

RP-049 · MMXXVI · iv · 01
λ·V · Multimodal & Vision-Language

Cross-Modal Distillation: VLM → Detector → Tracker

VLMs are increasingly used as teachers for downstream task-specific models. The paper reviews CLIP→detector distillation (RegionCLIP, GLIP), VLM→segmentation (LISA), and VLM→tracker pipelines. A note: cross-modal distillation is more sensitive to the prompt distribution than to the teacher's underlying quality.

  • RegionCLIP
  • GLIP
  • LISA
  • PaliGemma
  • distillation

RP-090 · MMXXVI · iii · 30
λ·X · Compute & Inference Systems

On-Device LLM Inference: KV-Cache, Speculative Decoding, and the Latency Wall

On-device LLM inference at >10 tok/s on flagship phones is now routine for sub-2B models; time to first token remains the dominant user-perceived delay. The paper reviews KV-cache compression, speculative decoding, and prompt prefix caching. Findings: prefix caching is the highest-leverage trick for interactive AR.

  • KV-cache
  • speculative decoding
  • prompt cache
  • FlashAttention
  • MQA

RP-039 · MMXXVI · iii · 26
λ·IV · Embodied RL & World Models

Eval Beyond Success Rate: Trajectory-Quality Metrics for Spatial Agents

Success rate is a thin metric for spatial agents. The paper proposes auxiliary metrics — path efficiency, exploration coverage, obstacle margin, time-to-first-error — and reports them on three open benchmarks. A digest of how published work would have changed under richer evaluation is included.

  • SPL
  • exploration coverage
  • DTW
  • path efficiency
  • trajectory metrics
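
Two of the metrics listed for RP-039 have compact definitions. The sketch below computes per-episode path efficiency and SPL (success weighted by path length) in its usual form; the tuple layout of an episode is an assumption made for the example.

```python
def path_efficiency(shortest_m, taken_m):
    """Ratio of shortest-path length to the path actually taken, capped at 1."""
    return shortest_m / max(taken_m, shortest_m)

def spl(episodes):
    """Success weighted by Path Length.
    episodes: list of (success, shortest_path_m, agent_path_m)."""
    total = 0.0
    for success, shortest_m, taken_m in episodes:
        if success:
            total += path_efficiency(shortest_m, taken_m)
    return total / len(episodes)
```

A failed episode contributes zero regardless of how short its path was, which is why SPL cannot be gamed by stopping early.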

RP-019 · MMXXVI · iii · 20
λ·II · Reconstruction & Radiance Fields

From Splats to Geometry: The Mesh-Extraction Problem

3D Gaussian Splatting renders beautifully but produces clouds, not surfaces. The paper reviews mesh-extraction strategies (SuGaR, 2DGS, GS2Mesh, Poisson reconstruction over splat centroids). Each pays a different fidelity tax. The lab notes the open problem of producing watertight, editable, semantically labeled meshes from a splat cloud at interactive rates.

  • SuGaR
  • 2DGS
  • GS2Mesh
  • Poisson reconstruction
  • marching cubes

RP-075 · MMXXVI · iii · 12
λ·VIII · Interaction, Hand & Display

Voice as Modal Glue in Spatial UI

Voice is the connective tissue of spatial UI when both hands are committed. The paper reviews on-device wake-word detection, intent recognition, and the design tension between push-to-talk and always-listening modes. Findings: ambient voice mode is rarely justified.

  • wake word
  • intent recognition
  • push-to-talk
  • VAD
  • ASR

RP-029 · MMXXVI · iii · 08
λ·III · On-device Vision

Hand Pose: 21 Keypoints and the Occlusion Cliff

Mobile hand-pose models (MediaPipe Hands, RTMPose-Hand, HandFormer) achieve sub-pixel accuracy on isolated hands. Two-hand interaction collapses accuracy by 30–50% under self-occlusion. The paper reviews occlusion-aware training data (InterHand2.6M, AssemblyHands) and inference-time decoupling.

  • MediaPipe Hands
  • RTMPose-Hand
  • HandFormer
  • InterHand2.6M
  • AssemblyHands

RP-008 · MMXXVI · ii · 27
λ·I · Localization & SLAM

Cooperative Anchors: When Two Devices See the Same Room

When two devices hold anchors in the same physical room, their anchor poses can be aligned by visual co-observation, ICP on partial reconstructions, or by exchanging encrypted descriptor sketches. The paper enumerates threat models for each. The lab finds that observation-only alignment (no descriptor exchange) is feasible to ~3 cm registration on shared planar surfaces. Privacy considerations are discussed in companion paper RP-061.

  • ICP
  • visual co-observation
  • descriptor sketch
  • Procrustes alignment
  • shared anchors
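
For the planar case mentioned in RP-008, aligning two devices' observations of the same points reduces to a rigid 2D Procrustes fit. The sketch below is a closed-form least-squares solution for rotation and translation; it assumes point correspondences are already known, which the paper's co-observation step would have to supply, and it is not presented as the lab's method.

```python
import math

def align_2d(src, dst):
    """Least-squares rigid alignment of corresponding 2D points.
    Returns (theta, (tx, ty)) such that R(theta) * src + t ~= dst."""
    n = len(src)
    csx = sum(p[0] for p in src) / n
    csy = sum(p[1] for p in src) / n
    cdx = sum(p[0] for p in dst) / n
    cdy = sum(p[1] for p in dst) / n
    s_dot = s_cross = 0.0
    for (ax, ay), (bx, by) in zip(src, dst):
        ax, ay, bx, by = ax - csx, ay - csy, bx - cdx, by - cdy
        s_dot += ax * bx + ay * by
        s_cross += ax * by - ay * bx
    theta = math.atan2(s_cross, s_dot)
    c, s = math.cos(theta), math.sin(theta)
    # Translation maps the rotated source centroid onto the target centroid.
    tx = cdx - (c * csx - s * csy)
    ty = cdy - (s * csx + c * csy)
    return theta, (tx, ty)
```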

RP-067 · MMXXVI · ii · 22
λ·VII · Sensors, VIO & Calibration

Calibrating Multi-Camera Phones: Stereo, Wide, Tele

Modern phones host 2–4 cameras with different intrinsics. The paper reviews multi-camera factory calibration, ad-hoc stereo from wide/tele pairs, and learned cross-camera depth (DUSt3R, MASt3R). Findings: learned methods are now competitive with classical stereo on close-range subjects.

  • DUSt3R
  • MASt3R
  • stereo
  • Zhang's method
  • checkerboard calibration

RP-048 · MMXXVI · ii · 15
λ·V · Multimodal & Vision-Language

Multimodal Agents and the Gesture Channel

Beyond text and image, multimodal agents are beginning to absorb gesture as a query channel. The paper reviews tap-on-image, point-on-image, and free-hand input as VLM prompts. Pointing tokens (Molmo-style) are evaluated as a substrate for spatial gesture.

  • pointing tokens
  • tap-on-image
  • Molmo
  • GUI agents
  • spatial prompts

RP-083 · MMXXVI · ii · 09
λ·IX · Rendering & Optics

Foveated Rendering on Mobile: A 2026 Snapshot

Fixed-foveated rendering is shipping on mobile-class HMDs; eye-tracked foveation requires gaze and is rarer. The paper reports compute savings on representative scenes and notes the strong dependency on the foveation kernel design. Aggressive foveation creates visible boundaries during head motion.

  • foveated rendering
  • eye-tracked foveation
  • VRS
  • Qualcomm Adreno
  • latency

RP-038 · MMXXVI · ii · 04
λ·IV · Embodied RL & World Models

Latent Imagination at Phone-Class Compute

Running a world-model rollout at interactive rates on a phone NPU is feasible only with aggressive latent compression. The paper sketches the trade-offs between rollout horizon, latent dimensionality, and rollout count under a 16 ms budget. The lab does not deploy world-model rollouts in production; the survey is for completeness.

  • DreamerV3
  • latent dynamics
  • rollout horizon
  • NPU inference
  • MPC

RP-028 · MMXXVI · i · 30
λ·III · On-device Vision

Pose Estimation: Top-Down vs. Bottom-Up at Phone Scale

Human pose estimation on mobile splits into top-down (HRNet-class) and bottom-up (OpenPose-class) families. The paper benchmarks on a 30-subject home-fitness dataset and finds bottom-up wins under occlusion; top-down wins on isolated subjects. Latency budgets force a hybrid in practice.

  • HRNet
  • OpenPose
  • MoveNet
  • MediaPipe Pose
  • RTMPose

RP-089 · MMXXVI · i · 26
λ·X · Compute & Inference Systems

Compiler Stacks for On-Device ML in 2026

Mobile ML compiler stacks have consolidated: Core ML (Apple), QNN (Qualcomm), TensorFlow Lite + XNNPACK, ONNX Runtime, and PyTorch ExecuTorch. The paper benchmarks the same model across each and reports the effort cost of supporting all paths. Findings: ONNX Runtime is the most predictable; vendor-specific paths are the fastest.

  • Core ML
  • QNN
  • TFLite
  • ONNX Runtime
  • ExecuTorch

RP-007 · MMXXVI · i · 19
λ·I · Localization & SLAM

Multi-Floor Traversals and the Vertical-Axis Problem

Modern SLAM frontends assume locally planar motion, which collapses on stairwells. The paper reviews three corrections — explicit floor estimation, IMU-only altitude during stair regions, and barometric anchoring — across a 14-building dataset. A simple fusion of barometric altitude with visual scale recovery yields the best vertical accuracy at the cost of barometer warm-up.

  • floor segmentation
  • IMU integration
  • barometric pressure
  • visual scale
  • VINS
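
The barometric anchoring reviewed in RP-007 rests on converting pressure to height. The sketch below uses the standard international barometric formula; the function names are hypothetical, and the fusion with visual scale recovery described in the abstract is not reproduced here.

```python
SEA_LEVEL_PA = 101325.0  # standard-atmosphere reference pressure

def pressure_to_altitude_m(pressure_pa, reference_pa=SEA_LEVEL_PA):
    """International barometric formula: altitude above the reference level."""
    return 44330.0 * (1.0 - (pressure_pa / reference_pa) ** (1.0 / 5.255))

def height_gain_m(start_pa, current_pa):
    """Relative climb since the start reading; positive when pressure drops."""
    return pressure_to_altitude_m(current_pa) - pressure_to_altitude_m(start_pa)
```

Only the relative reading matters for floor changes, so weather-driven offsets in absolute pressure largely cancel over a short session.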

RP-057 · MMXXVI · i · 11
λ·VI · Privacy & On-device ML

Differentially Private SfM: Can a Map Be Made Without a Photograph?

Recent work on privacy-preserving structure-from-motion (DP-SfM, line-cloud SfM, hashed-feature SfM) attempts to map a space without retaining recoverable imagery. The paper reviews where these methods succeed (sparse texture, repetitive environments) and where they collapse (fine detail, dynamic objects).

  • DP-SfM
  • line-cloud SfM
  • hashed feature
  • encrypted SfM
  • MPC SfM

RP-018 · MMXXVI · i · 08
λ·II · Reconstruction & Radiance Fields

Implicit vs. Explicit: Where the Field Has Settled

By early 2026 the architectural debate between implicit (NeRF-style) and explicit (point-, splat-, voxel-based) representations has cooled. Explicit representations win on speed, editability, and storage; implicit representations win on continuous derivatives and surface extraction. The paper argues neither is a winner; the production answer is a translation step between the two on demand.

  • NeRF
  • 3DGS
  • voxel grid
  • marching cubes
  • neural-explicit hybrid

RP-074 · MMXXV · xii · 30
λ·VIII · Interaction, Hand & Display

Comfort and Cybersickness: A Practitioner Index

AR comfort literature is large and contradictory. The paper synthesizes a short index: vection cues, vergence-accommodation conflict, FoV transitions, and display-latency variance. Each is mapped to a measurable display or render parameter.

  • cybersickness
  • vection
  • vergence-accommodation
  • FoV transition
  • SSQ

RP-047 · MMXXV · xii · 21
λ·V · Multimodal & Vision-Language

Captioning the Room: Where Models Lie Confidently

VLM hallucinations are predictable on indoor scenes: confident misnaming of generic furniture, fabricated text on labels, and hallucinated counts. The paper proposes a hallucination-detection protocol using consistency over multiple crops and entropy thresholds.

  • hallucination detection
  • VQA consistency
  • entropy thresholding
  • POPE
  • CHAIR

RP-037 · MMXXV · xii · 18
λ·IV · Embodied RL & World Models

Hierarchical Policies and the Long-Horizon Problem

Long-horizon embodied tasks (clean the room, prepare the workspace) defeat flat policies. The paper reviews hierarchical RL (FuN, options framework, HiTUT) and language-as-policy approaches (SayCan, Code-as-Policies, VoxPoser). The temptation to use an LLM as the high-level controller is acknowledged; failure modes are catalogued.

  • hierarchical RL
  • FuN
  • options
  • SayCan
  • VoxPoser

RP-088 · MMXXV · xii · 15
λ·X · Compute & Inference Systems

Thermal Throttling and Sustained Workloads

Smartphone SoCs sustain peak compute for 30–90 seconds before thermal throttling forces a step-down. The paper measures the throttle curve under a representative AR workload (camera + detection + render at 30 fps) and reports the steady-state compute envelope. Steady-state is roughly 60% of peak.

  • thermal throttle
  • SoC sustained
  • DVFS
  • battery drain
  • AR workload

RP-017 · MMXXV · xii · 11
λ·II · Reconstruction & Radiance Fields

Dynamic Scene Reconstruction: People in the Frame

Most reconstruction pipelines assume the scene is static. People in the frame appear as ghost geometry. The paper inventories solutions: explicit human masks (Mask-RCNN, SAM-class), motion segmentation, and neural decomposition (D-NeRF, K-Planes-Dynamic). It notes a finding that a two-second masking pre-pass costs less compute than letting the optimizer fight ghost residuals.

  • Mask-RCNN
  • SAM
  • D-NeRF
  • K-Planes-Dynamic
  • motion segmentation

RP-066 · MMXXV · xii · 09
λ·VII · Sensors, VIO & Calibration

Thermal Drift in Phone IMUs: A Long Session Note

Phone IMUs drift with chassis temperature. The paper reports thermal-drift profiles from a 60-minute heating session and proposes online thermal-bias estimation. Worth filing: drift is non-monotonic across SoC throttle events.

  • thermal drift
  • IMU bias
  • online estimation
  • SoC throttle
  • EKF

RP-027 · MMXXV · xii · 04
λ·III · On-device Vision

Open-Vocabulary Detection: Owl, GroundingDINO, YOLO-World

Open-vocabulary detection accepts a free-text query and returns boxes. Owl-ViT, GroundingDINO, YOLO-World, and DETR-class variants converge on similar capabilities at very different cost points. The paper indexes per-class accuracy on furniture, signage, and tools. The query-language design is found to dominate downstream behavior.

  • Owl-ViT
  • GroundingDINO
  • YOLO-World
  • DETR
  • open-vocab

RP-082 · MMXXV · xi · 26
λ·IX · Rendering & Optics

Display Pipelines: ATW, Reprojection, and the Late Latch

Asynchronous timewarp (ATW), late-latched poses, and asynchronous reprojection are the three rendering tricks that hide latency on AR displays. The paper reviews each and reports the residual visible artifacts at 90 Hz vs. 120 Hz refresh. Higher refresh narrows the artifact window non-linearly.

  • ATW
  • late-latch
  • asynchronous reprojection
  • vsync
  • double buffering

RP-056 · MMXXV · xi · 19
λ·VI · Privacy & On-device ML

Anchor Privacy: What a Persistent Anchor Reveals About a Room

A persistent anchor is, in principle, a sparse pose; in practice the descriptor cloud surrounding it can be inverted to recover textured geometry. The paper reviews descriptor inversion attacks (TURF, MapAttack, Pittinverse) and counter-measures (descriptor truncation, randomized features, encrypted anchors).

  • descriptor inversion
  • TURF
  • MapAttack
  • encrypted anchor
  • feature noise

RP-006 · MMXXV · xi · 04
λ·I · Localization & SLAM

Re-localization in the Dark: Low-Light SLAM Failure Modes

Below ~10 lux, descriptor extractors based on FAST/BRIEF degrade rapidly; learned descriptors (SuperPoint, R2D2, DISK) hold longer but consume an order of magnitude more inference. The lab notes a counterintuitive observation: increasing exposure helps the front-end but hurts the back-end (motion blur dominates re-projection error). A short discussion of event-camera complements is included.

  • FAST
  • BRIEF
  • SuperPoint
  • R2D2
  • DISK

RP-046 · MMXXV · x · 26
λ·V · Multimodal & Vision-Language

Long-Context Video Understanding: A 2025 Snapshot

Video-VLMs (Video-LLaVA, VideoChat2, LongVA, LLaVA-OneVision) extend single-frame reasoning across temporal context. The paper benchmarks them on indoor activity-recognition and event-localization tasks. Performance drops sharply past 30-second windows; positional-encoding tricks help only modestly.

  • Video-LLaVA
  • VideoChat2
  • LongVA
  • LLaVA-OneVision
  • RoPE

RP-073 · MMXXV · x · 21
λ·VIII · Interaction, Hand & Display

Latency Budgets for Glasses-Class AR

Motion-to-photon latency below ~20 ms is the threshold for comfortable AR. The paper decomposes the budget across capture, detection, pose update, render, and display. The current bottleneck on mobile-class compute is the detection-and-pose stage; the display is rarely the limiting factor.

  • motion-to-photon
  • frame budget
  • render-on-warp
  • ATW
  • latency
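
The budget decomposition in RP-073 amounts to summing per-stage latencies and locating the largest one. The helper below is a hypothetical sketch of that bookkeeping; it contains no measured values, and the 20 ms default simply mirrors the threshold quoted in the abstract.

```python
def budget_report(stage_ms, budget_ms=20.0):
    """Summarize a motion-to-photon budget.
    stage_ms: dict mapping stage name -> measured latency in milliseconds."""
    total = sum(stage_ms.values())
    bottleneck = max(stage_ms, key=stage_ms.get)
    return {
        "total_ms": total,
        "headroom_ms": budget_ms - total,  # negative means over budget
        "bottleneck": bottleneck,
    }
```

Feeding it per-stage measurements for capture, detection, pose update, render, and display yields the total, the remaining headroom, and the stage to optimize first.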

RP-026 · MMXXV · x · 15
λ·III · On-device Vision

Quantization-Aware Training in 2025: INT8, INT4, and the Calibration Tax

INT8 post-training quantization (PTQ) is a solved problem for most CNN-class detectors; INT4 is not. The paper reviews QAT (LSQ) alongside calibration-based post-training methods (OmniQuant, GPTQ-class) and reports calibration-set sensitivity. A finding worth filing: detector heads and backbones sometimes prefer different bit widths.

  • LSQ
  • OmniQuant
  • GPTQ
  • AWQ
  • PTQ
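
The calibration-set sensitivity mentioned in RP-026 is easiest to see in the simplest scheme, symmetric per-tensor INT8 quantization, where a single scale is derived from the largest magnitude observed. The sketch below is a generic illustration under that assumption, not any of the named methods.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to the range [-127, 127].
    The scale comes from the largest magnitude in `values`, so an
    unrepresentative calibration set directly distorts it."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0.0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to approximate real values."""
    return [x * scale for x in q]
```

A single outlier in the calibration data inflates the scale and coarsens every other value, which is one reason heads and backbones can prefer different treatment.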

RP-036 · MMXXV · x · 09
λ·IV · Embodied RL & World Models

Imitation Learning for Spatial Tasks: BC, IQL, and the Gold-Standard Demo

Behavior cloning (BC) remains the simplest imitation-learning baseline; conservative offline-RL methods (CQL, IQL, AWAC) close the gap to online RL when demonstrations are scarce. The paper reports per-task efficiency on a 200-demo handheld interior dataset. Demo quality dominates demo quantity.

  • BC
  • IQL
  • CQL
  • AWAC
  • DAgger

RP-016 · MMXXV · x · 02
λ·II · Reconstruction & Radiance Fields

Scene Layout from a Single Image: Manhattan and Beyond

Recovering a coarse scene layout (walls, floor, ceiling, openings) from a single RGB frame is a problem that leans heavily on prior assumptions. Methods that assume Manhattan-world geometry (LayoutNet, HorizonNet, AtlantaNet) fail on non-orthogonal architectures common in older buildings. The paper notes a lab finding: the failure manifold of these methods correlates with construction era more than it correlates with image quality.

  • LayoutNet
  • HorizonNet
  • AtlantaNet
  • Manhattan-world
  • RoomNet

RP-065 · MMXXV · ix · 25
λ·VII · Sensors, VIO & Calibration

Depth Sensors: The LiDAR-Class Comeback

Time-of-flight (ToF) and structured-light sensors on mobile (iPad Pro / iPhone Pro LiDAR-class, Android ToF) provide low-resolution but absolute-scale depth. The paper reviews their use in VIO scale recovery, mesh seeding, and people-segmentation. Conclusion: helpful, not transformative.

  • LiDAR
  • ToF
  • structured light
  • depth sensor
  • scale recovery

RP-005 · MMXXV · ix · 12
λ·I · Localization & SLAM

Loop Closure Without a Prior: A Pose-Graph Note

When the device returns to a previously visited region without a place-recognition prior, the residual error per loop accumulates against the inverse covariance of the back-end pose graph. Sparse bundle adjustment (SBA) and incremental smoothing (iSAM2) handle this in classical pipelines, but the paper notes both struggle with degenerate motion (pure rotation) common in headset use. A factor-graph perspective with on-manifold updates is sketched.

  • pose graph
  • SBA
  • iSAM2
  • GTSAM
  • on-manifold optimization

RP-081 · MMXXV · ix · 11
λ·IX · Rendering & Optics

Reflection Probes and Mirror Surfaces

Mirror surfaces in AR scenes are routinely mishandled. The paper reviews reflection-probe methods, screen-space reflections (SSR), and ray-marched reflections at mobile rates. Findings: the gap between physically correct and acceptable is wide; users tolerate qualitative reflections.

  • SSR
  • reflection probe
  • ray marching
  • BRDF
  • real-time GI

RP-055 · MMXXV · ix · 08
λ·VI · Privacy & On-device ML

Voice and Microphone Hygiene in AR Sessions

AR sessions frequently keep the microphone open. The paper reviews on-device wake-word detection (Porcupine-class, RNN-T-class), VAD with explicit user signaling, and audio hashing for command recognition without transcript retention. The trade-off is intent latency vs. retained audio.

  • wake-word detection
  • VAD
  • RNN-T
  • audio hashing
  • on-device ASR

RP-025 · MMXXV · viii · 29
λ·III · On-device Vision

Detection Under Domain Shift: Interior vs. Exterior

Object detectors trained on COCO underperform on interiors by 12–18 mAP points. The paper reviews fine-tuning, domain randomization, and synthetic data (SUN-RGBD, Hypersim, ScanNet) as remediation. Synthetic data narrows but does not close the gap; the missing factor is texture diversity, not geometry.

  • COCO
  • SUN-RGBD
  • Hypersim
  • ScanNet
  • domain randomization

RP-035 · MMXXV · viii · 21
λ·IV · Embodied RL & World Models

Reward Shaping in Spatial Tasks: A Cautionary Note

Reward shaping in 3D navigation is a perennial source of policy pathology. The paper catalogues shaping bugs that have appeared in published work (loop-back exploits, time-pressure compensation, distance-only rewards that reward wall-hugging). A defensive checklist is offered.

  • reward shaping
  • PPO
  • potential-based shaping
  • intrinsic motivation
  • RND

RP-087 · MMXXV · viii · 19
λ·X · Compute & Inference Systems

Frame Budget Engineering for AR Apps

Hitting 30 fps on phone-class compute is an exercise in subtraction. The paper digests the typical frame: capture, ISP, detection, segmentation, pose, render, composite, display. Each stage has a budget; the paper reports the typical envelope on flagship 2025 devices.

  • frame budget
  • ISP latency
  • render-thread
  • GPU contention
  • thermal throttle

RP-015 · MMXXV · viii · 17
λ·II · Reconstruction & Radiance Fields

Photogrammetry at Phone Scale: Lighting Failure Modes

Photogrammetric reconstruction from handheld phone footage fails in three ways: specular surfaces, transparent surfaces, and uniform-textured walls. The paper reviews mitigations (multi-view photometric stereo, polarimetric capture, learned depth priors) and notes that none is a complete answer. A taxonomy of interior failure cases by surface type is provided as a lookup.

  • photogrammetry
  • photometric stereo
  • polarimetric capture
  • learned priors
  • MVS

RP-072 · MMXXV · viii · 12
λ·VIII · Interaction, Hand & Display

Optical See-Through Displays: A 2025 Snapshot

Waveguide optics, geometric optics, and free-form combiners coexist in 2025-era HMDs. The paper indexes field-of-view, eye-relief, eyebox, and ambient-light tolerance. Worth filing: no current display class clears 60° FoV at sustained day-bright contrast.

  • waveguide
  • geometric combiner
  • free-form optic
  • FoV
  • ambient contrast

RP-045 · MMXXV · viii · 07
λ·V · Multimodal & Vision-Language

Tool Use in Vision-Language Pipelines

VLMs have been wired into tool-use pipelines (function calling, structured output, JSON-mode). The paper reviews indoor tasks that benefit (calibration, measurement, captioning a known scene) and tasks that do not (real-time tracking). A short note on output-token budgeting under interactive constraints is included.

  • function calling
  • JSON mode
  • ReAct
  • tool use
  • structured output

RP-080 · MMXXV · vii · 29
λ·IX · Rendering & Optics

Light Estimation: From Spherical Harmonics to LightStages

Real-time light estimation in AR is dominated by spherical harmonic regression from the live frame. The paper compares ARKit's environment textures, PointAR-style estimators, and learned HDR-from-LDR methods. A note: harsh point lights remain difficult.

  • spherical harmonics
  • ARKit environment
  • PointAR
  • HDR estimation
  • light probe

RP-054 · MMXXV · vii · 22
λ·VI · Privacy & On-device ML

Face Blurring: When the Detector Is the Privacy Mechanism

Real-time face blurring on AR streams depends on the underlying face detector. The paper reviews YuNet, BlazeFace, RetinaFace-Mobile, and MediaPipe Face Detector under occlusion, profile views, and low light. The privacy guarantee is bounded by the detector's recall, not its precision.

  • YuNet
  • BlazeFace
  • RetinaFace
  • MediaPipe Face
  • privacy filter

RP-064 · MMXXV · vii · 18
λ·VII · Sensors, VIO & Calibration

Magnetic Sensors as a Sixth Channel

The magnetometer is the most under-used sensor on a smartphone. Indoor magnetic anomalies are stable and globally unique. The paper reviews magnetic-fingerprint navigation (IndoorAtlas, learned magnetic embeddings) and discusses why this signal is not yet a default in VIO.

  • magnetometer
  • magnetic fingerprint
  • IndoorAtlas
  • EKF
  • particle filter

RP-004 · MMXXV · vii · 03
λ·I · Localization & SLAM

Learned Place Recognition: NetVLAD's Successors

NetVLAD remains the load-bearing place-recognition baseline despite dating from 2016. Its successors (Patch-NetVLAD, MixVPR, MegaLoc, AnyLoc) improve recall on changing-condition benchmarks (day/night, summer/winter), but the gain narrows on the indoor case. The paper hypothesizes that interior featurelessness defeats the assumption that landmarks are stable. A ranked digest with per-method memory footprint and inference cost is included for practitioners.

  • NetVLAD
  • Patch-NetVLAD
  • MixVPR
  • MegaLoc
  • AnyLoc

RP-024 · MMXXV · vi · 25
λ·III · On-device Vision

Monocular Depth on the CPU: A Cold Look

Mobile depth estimation runs primarily on the NPU but degrades gracefully to CPU when the NPU is contended (camera ISP under load, video encode). The paper profiles MiDaS-Small, ZoeDepth-NK, and Depth Anything-Mobile under CPU-only conditions. Latency variance dominates; mean latency is the wrong metric.

  • MiDaS-Small
  • ZoeDepth
  • Depth Anything
  • CPU inference
  • latency p99
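
The claim in RP-024 that mean latency is the wrong metric can be illustrated with a nearest-rank percentile. The helper below is a generic sketch; the function name is hypothetical and no figures from the paper are reproduced.

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]
```

With a long-tailed latency trace, `percentile(trace, 99)` can sit far above the mean, and it is the tail that produces visible frame drops.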

RP-044 · MMXXV · vi · 18
λ·V · Multimodal & Vision-Language

Multimodal Retrieval: When the Query Is the Room

Retrieval-augmented generation (RAG) has been adapted to spatial corpora. The paper reviews CLIP-feature retrieval, multi-vector retrieval (ColBERT-style), and hybrid sparse-dense retrieval over interior image stores. A finding worth filing: spatial retrieval benefits more from coarse layout features than from fine descriptors.

  • CLIP retrieval
  • ColBERT
  • BM25
  • RAG
  • multi-vector

RP-071 · MMXXV · vi · 14
λ·VIII · Interaction, Hand & Display

Haptic Feedback in AR: When the Phone Replaces the Glove

Mobile haptics — Taptic Engine, Android haptic API — provide a thin but useful feedback channel for AR confirmations. The paper reviews haptic-pattern design language (sharp vs. soft, single vs. paired) and reports user-detection thresholds at typical phone-holding postures.

  • Taptic Engine
  • Android haptic API
  • tactile pattern
  • vibrotactile
  • psychophysics

RP-014 · MMXXV · vi · 08
λ·II · Reconstruction & Radiance Fields

Depth Anything v2 and the Great Depth Backbone Shift

Universal monocular-depth backbones (MiDaS, DPT, Depth Anything, ZoeDepth, Marigold) have moved from research curiosities to default scene-understanding components. The paper reports per-pixel scale-invariant error on indoor benchmarks and notes the backbone of choice now changes monthly. A short discussion of distillation paths to mobile is included.

  • MiDaS
  • DPT
  • Depth Anything v2
  • ZoeDepth
  • Marigold

RP-034 · MMXXV · vi · 04
λ·IV · Embodied RL & World Models

JEPA, V-JEPA, and the Self-Supervised Vision Wave

Joint Embedding Predictive Architectures (JEPA) propose self-supervised learning by predicting in a learned representation rather than pixel space. V-JEPA extended this to video. The paper compares JEPA-class methods to MAE, DINOv2, and contrastive baselines on transfer to navigation and detection. Findings are mixed; the regime where JEPA wins is narrower than headline claims.

  • JEPA
  • V-JEPA
  • MAE
  • DINOv2
  • contrastive learning

RP-053 · MMXXV · v · 30
λ·VI · Privacy & On-device ML

Differential Privacy for Spatial Data: ε, δ, and the Floor Plan

Differential privacy guarantees on spatial data are weaker than the literature implies. The paper reviews local-DP, central-DP, and shuffle-DP applied to room-scale telemetry. A finding: floor plans leak through low-ε mechanisms more than the worst-case theory suggests; ε ≤ 1 is required for meaningful protection.

  • differential privacy
  • local DP
  • shuffle DP
  • Gaussian mechanism
  • Laplace mechanism
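
To make the role of ε in RP-053 concrete, the sketch below applies the Laplace mechanism to a single scalar, with noise scale equal to sensitivity divided by ε. The function names and the choice of a scalar query are assumptions for illustration; room-scale telemetry would involve higher-dimensional queries and composition across releases.

```python
import random

def laplace_scale(sensitivity, epsilon):
    """Noise scale b = sensitivity / epsilon; smaller epsilon means more noise."""
    if epsilon <= 0.0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon

def privatize(value, sensitivity, epsilon, rng=random):
    """Add Laplace(0, b) noise, sampled as the difference of two
    independent exponential draws with mean b."""
    b = laplace_scale(sensitivity, epsilon)
    noise = rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)
    return value + noise
```

At ε = 1 and unit sensitivity the noise has a standard deviation of about 1.4 units, which gives a sense of how much accuracy the stricter setting costs.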

RP-003 · MMXXV · v · 21
λ·I · Localization & SLAM

GNSS-Denied Indoor: When the Phone Stops Believing the Sky

Inside steel-framed buildings, GNSS pseudo-ranges drift faster than the IMU bias. The Kalman filter must be told to stop trusting the satellite stream. The paper reviews fault-detection-and-exclusion (FDE) heuristics, magnetometer cross-checks, and barometric altimetry as supplementary signals. A finding worth filing: pedestrian dead-reckoning (PDR) augmented by step-length estimation outperforms naive IMU integration by a factor of four on multi-floor traversals.

  • GNSS FDE
  • IMU integration
  • PDR
  • magnetometer
  • barometric altimetry
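
The PDR approach in RP-003 replaces double integration of acceleration with one position update per detected step. The sketch below uses the Weinberg step-length model as one common choice; the paper does not say which estimator it used, and the gain `k` is a hypothetical per-user constant that would need calibration.

```python
import math

def weinberg_step_length(a_max, a_min, k=0.5):
    """Weinberg model: step length from the vertical-acceleration
    range within one step (requires a_max >= a_min)."""
    return k * (a_max - a_min) ** 0.25

def dead_reckon(start_xy, steps):
    """Advance a 2D position by a sequence of (length_m, heading_rad) steps.
    Heading is measured counter-clockwise from the x-axis."""
    x, y = start_xy
    for length_m, heading_rad in steps:
        x += length_m * math.cos(heading_rad)
        y += length_m * math.sin(heading_rad)
    return x, y
```

Because error grows per step rather than with the square of elapsed time, this formulation drifts far more slowly than naive IMU integration.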

RP-063 · MMXXV · v · 20
λ·VII · Sensors, VIO & Calibration

Rolling-Shutter Compensation in Mobile VIO

Most mobile cameras use rolling shutter; classical VIO assumes global shutter. The paper reviews per-row pose interpolation, learned RS undistortion (DeepRS, RS-NeRF), and the residual error after either. Findings: per-row interpolation suffices below 1 m/s motion; learned methods help on faster motion.

  • rolling shutter
  • DeepRS
  • RS-NeRF
  • BASALT-RS
  • per-row interpolation
RP-079 MMXXV · v · 13
λ·IX Rendering & Optics

Differentiable Rendering for AR: An Index

Differentiable renderers (Mitsuba 3, nvdiffrast, GAN-style implicit renderers) underpin most modern view synthesis. The paper indexes them by type: physically-based, mesh, splat, neural. The lab uses them for ablation rather than at runtime; runtime renderers are simpler.

  • Mitsuba 3
  • nvdiffrast
  • PyTorch3D
  • differentiable rendering
  • BRDF
RP-023 MMXXV · v · 07
λ·III On-device Vision

Real-Time Instance Segmentation: Beyond YOLO-Seg

Instance segmentation on mobile is dominated by YOLO-seg variants for cost reasons, but the gap to research-grade transformer methods (Mask2Former, MaskDINO) remains visible. The paper inventories where the gap matters (occluded thin structures) and where it does not (large convex objects).

  • YOLO-seg
  • Mask2Former
  • MaskDINO
  • OneFormer
  • Cascade-RCNN
RP-086 MMXXV · v · 04
λ·X Compute & Inference Systems

INT8 vs. FP16: Where the Crossover Sits in 2025

INT8 quantization is now the default for mobile inference; FP16 remains common on newer NPU silicon with native FP16 paths. The paper reports per-op latency for representative kernels and notes that the crossover depends on the operator mix more than on the hardware.
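
As background for the comparison above, this is a minimal sketch of symmetric per-tensor INT8 quantization, showing what the INT8 path gives up in precision. It assumes a single scale per tensor and a clamp to [-127, 127]. Production toolchains typically add per-channel scales and calibration, which are omitted here.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: returns (int codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


weights = [-1.0, -0.31, 0.0, 0.02, 0.5, 1.0]  # illustrative values only
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
# Rounding to the nearest code bounds the error by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The round-trip error stays within half a step (`scale / 2`), so small weights such as 0.02 lose the most precision in relative terms when the tensor has large outliers.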

  • INT8
  • FP16
  • BF16
  • operator fusion
  • tensor-core
RP-013 MMXXV · iv · 29
λ·II Reconstruction & Radiance Fields

Monocular Mesh-from-Video: A Practitioner Digest

Recovering a watertight mesh from a single moving camera is now feasible offline; doing so in real time on a phone is not. The paper reviews COLMAP-class structure-from-motion, Atlas-class learned reconstruction, NeuralRecon, MonoSDF, and SimpleRecon. Findings: learned methods are faster but hallucinate textureless walls; classical methods are slower but honest about uncertainty.

  • COLMAP
  • NeuralRecon
  • MonoSDF
  • SimpleRecon
  • Atlas
RP-043 MMXXV · iv · 23
λ·V Multimodal & Vision-Language

Small Vision-Language Models: Phi-Vision, MiniCPM-V, Idefics-Mobile

Sub-3B-parameter VLMs are now within the inference budget of high-end mobile devices. The paper benchmarks Phi-Vision, MiniCPM-V, Idefics-Mobile, and Moondream on indoor caption-and-ground tasks. Findings: small VLMs are surprisingly capable on naming, mediocre on counting, and weak on spatial reasoning.

  • Phi-Vision
  • MiniCPM-V
  • Idefics
  • Moondream
  • VLM benchmark
RP-070 MMXXV · iv · 18
λ·VIII Interaction, Hand & Display

Phone-as-Pointer: A Pre-Glasses Interaction Note

Before head-worn AR is universal, the phone is the pointer. The paper reviews ARKit/ARCore pointer abstractions, ray-casting from the phone center, and laser-pointer-style interaction. The dominant failure mode is fatigue over long sessions.
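
The geometry behind ray-casting from the phone center reduces to a ray-plane intersection. The sketch below is plain vector math under assumed conventions (y-up world, floor at y = 0). It does not call or imitate the ARKit or ARCore hit-test APIs, and all names are illustrative.

```python
def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def ray_plane_hit(origin, direction, plane_point, plane_normal, eps=1e-9):
    """Point where a ray meets a plane, or None if there is no forward hit."""
    denom = dot(direction, plane_normal)
    if abs(denom) < eps:  # ray runs parallel to the plane
        return None
    offset = tuple(p - o for p, o in zip(plane_point, origin))
    t = dot(offset, plane_normal) / denom
    if t < 0:  # plane lies behind the phone
        return None
    return tuple(o + t * d for o, d in zip(origin, direction))


# Phone held 1.5 m above the floor, pointing forward and down at 45 degrees.
floor_point, floor_normal = (0.0, 0.0, 0.0), (0.0, 1.0, 0.0)
hit = ray_plane_hit((0.0, 1.5, 0.0), (0.0, -1.0, -1.0), floor_point, floor_normal)
```

The direction vector does not need to be normalized for the hit point to be correct; only the value of `t` changes.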

  • ARKit ray
  • ARCore
  • ray casting
  • pointer interaction
  • Fitts' law
RP-033 MMXXV · iv · 15
λ·IV Embodied RL & World Models

Sim-to-Real Without a Real Robot: Phone-Held Embodied Agents

Most embodied-agent literature assumes a robot. A handheld device acting as a 'partial agent' (the human is the actuator) inverts the problem: perception is rich, motor primitives are gestures, success is user-confirmed. The paper sketches an evaluation protocol for handheld embodied tasks (find the door, follow the path, locate the tool).

  • sim-to-real
  • Habitat-Matterport
  • embodied AI
  • imitation learning
  • policy gradient
RP-002 MMXXV · iv · 08
λ·I Localization & SLAM

Anchor Durability Across Sessions: A Survey of Persistence Strategies

Persistent anchoring across application launches is approached three ways in current systems: re-localize against a stored sparse map, re-localize against a learned descriptor cloud, or re-localize against a server-side spatial graph. Each pays a different price. Sparse maps are small and brittle to lighting; descriptor clouds are robust to lighting but heavy on storage; server graphs centralize what should be local. The paper enumerates the trade space and notes none of the three handles soft furnishings — the most common interior change.

  • sparse map
  • NetVLAD descriptors
  • MegaLoc
  • ARKit Persistent Anchors
  • spatial graph
RP-052 MMXXV · iv · 04
λ·VI Privacy & On-device ML

Federated Learning at Phone Scale: A Reality Check

Federated learning (FL) has been demonstrated at scale by major platforms (Gboard, Siri-style on-device updates). The paper reviews FedAvg, FedProx, SCAFFOLD, and client-drift mitigation. Findings: FL works for narrow tasks (next-word prediction, wake-word) and remains brittle for spatial tasks where client distributions differ wildly.
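
The aggregation step of FedAvg is a sample-weighted mean of client parameters. The sketch below shows only that step, with models flattened to plain lists; client selection, local training, and secure aggregation are out of scope. The names are illustrative.

```python
def fedavg(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local sample counts."""
    if len(client_weights) != len(client_sizes) or not client_weights:
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


# Two hypothetical clients: the second holds three times as much data.
global_weights = fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

Because the weights follow sample counts, a few data-rich clients with unusual rooms can pull the global model toward their distribution, which is one way to read the client-drift problem the entry describes.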

  • FedAvg
  • FedProx
  • SCAFFOLD
  • client drift
  • federated learning
RP-078 MMXXV · iii · 26
λ·IX Rendering & Optics

Shadow Estimation for AR Objects

Synthetic objects without shadows look pasted on. The paper reviews learned shadow synthesis (Shadow-AR, ShadowGAN), classical shadow-casting from a single light estimate, and probe-based environmental capture. Findings: a single dominant light estimate suffices for most interior scenes.

  • shadow synthesis
  • Shadow-AR
  • light estimation
  • environment probe
  • AR rendering
RP-022 MMXXV · iii · 19
λ·III On-device Vision

Segment Anything at Phone Scale: SAM, MobileSAM, EfficientSAM

SAM redefined what zero-shot segmentation could do. Its mobile descendants — MobileSAM, EfficientSAM, EdgeSAM — recover most of the quality at a fraction of the cost. The paper reports per-prompt latency and IoU on indoor classes. A short note: SAM-class models are *too* general for tracked-object pipelines; a tighter prompt budget is essential.
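
For reference, the IoU reported above is the ratio of overlapping to combined mask area. This is a minimal sketch on binary masks stored as nested lists. Treating two empty masks as a perfect match is a convention assumed here, not something the entry specifies.

```python
def mask_iou(mask_a, mask_b):
    """Intersection over union of two same-shaped binary masks."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for pa, pb in zip(row_a, row_b):
            if pa and pb:
                intersection += 1
            if pa or pb:
                union += 1
    # Assumed convention: two empty masks count as identical.
    return intersection / union if union else 1.0


predicted = [[1, 1], [0, 0]]
reference = [[1, 0], [1, 0]]
score = mask_iou(predicted, reference)  # 1 shared pixel out of 3 covered
```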

  • SAM
  • MobileSAM
  • EfficientSAM
  • EdgeSAM
  • IoU
RP-012 MMXXV · iii · 15
λ·II Reconstruction & Radiance Fields

NeRF in 2025: From Vanilla to Instant-NGP and Beyond

Neural Radiance Fields (NeRF) defined a generation of view-synthesis work. Five years on, the field has split three ways: hash-grid encodings (Instant-NGP) for speed; tensor decompositions (TensoRF, K-Planes) for memory; and discrete primitives (3DGS) for both. The paper treats this as a Pareto front rather than a winner-takes-all contest, and indexes which method dominates which corner of the (training-time, render-time, memory) cube.
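
The Pareto-front framing can be made mechanical: a method stays on the front unless some other method is at least as good on every axis and strictly better on one. In the sketch below the labels and cost triples are placeholders, not measurements from the paper, and lower is better on all three axes.

```python
def dominates(p, q):
    """True if p is no worse than q on every axis and better on at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def pareto_front(costs):
    """Names of entries that no other entry dominates."""
    return [
        name for name, p in costs.items()
        if not any(dominates(q, p) for other, q in costs.items() if other != name)
    ]


# Placeholder (training-time, render-time, memory) costs in arbitrary units.
placeholder_costs = {
    "method_a": (1, 5, 5),
    "method_b": (5, 1, 5),
    "method_c": (5, 5, 1),
    "method_d": (6, 6, 6),
}
front = pareto_front(placeholder_costs)
```

With these placeholder numbers three methods each own one corner of the cube and the fourth is dominated on every axis.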

  • NeRF
  • Instant-NGP
  • TensoRF
  • K-Planes
  • 3DGS
RP-042 MMXXV · iii · 11
λ·V Multimodal & Vision-Language

Grounding the Frame: From CLIP to GroundingDINO to Molmo

Grounding — answering 'where is X in this image?' — has matured rapidly. The paper compares CLIP-based zero-shot pointing, GroundingDINO box outputs, GLIP, KOSMOS-2 grounded captioning, and Molmo's pointing tokens. Findings: pointing-token methods are surprisingly accurate for a one-shot output but lossy under occlusion.

  • CLIP
  • GroundingDINO
  • GLIP
  • KOSMOS-2
  • Molmo
RP-062 MMXXV · iii · 04
λ·VII Sensors, VIO & Calibration

Camera-IMU Time Synchronization: A Persistent Tax

Sub-millisecond camera-IMU time alignment is a precondition for tight VIO. The paper reviews hardware-level synchronization, software estimation (Kalibr-style), and learned correction. A finding: published methods assume a static offset; in practice the offset on phones drifts across thermal cycles.
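
A coarse software estimate of the offset can be obtained by sliding one rotation-rate signal against the other and keeping the lag with the highest correlation. The sketch below assumes both signals are already resampled to a common rate. It is a whole-sample illustration only and is not how Kalibr estimates the offset. The signal values are synthetic.

```python
def best_lag(ref, sig, max_lag):
    """Lag (in samples) at which sig[i + lag] best matches ref[i]."""
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, r in enumerate(ref):
            j = i + lag
            if 0 <= j < len(sig):
                score += r * sig[j]
        if score > best_score:
            best, best_score = lag, score
    return best


# Synthetic example: a rotation-rate pulse, and the same pulse 3 samples late.
gyro_rate = [0.0] * 10 + [1.0, 2.0, 3.0, 2.0, 1.0] + [0.0] * 10
visual_rate = [0.0] * 3 + gyro_rate[:-3]
lag_samples = best_lag(gyro_rate, visual_rate, max_lag=5)
offset_seconds = lag_samples * 0.005  # assuming a 200 Hz common sample rate
```

Re-running an estimate like this periodically during a session is one simple way to observe the thermal drift the entry points to.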

  • Kalibr
  • VIO synchronization
  • thermal drift
  • OpenVINS
  • BASALT
RP-032 MMXXV · ii · 26
λ·IV Embodied RL & World Models

Indoor Navigation Policies: From PointGoal to ObjectNav

Habitat and AI2-THOR have anchored embodied-agent research for half a decade. The paper benchmarks recurrent (LSTM-based) and transformer-based policies on PointGoal and ObjectNav, and reports a sim-to-real gap that has narrowed but not closed. Real-world deployment fails for reasons (clutter, thin obstacles, glass) that simulation under-models.

  • Habitat
  • AI2-THOR
  • PointGoal
  • ObjectNav
  • transformer policy
RP-051 MMXXV · ii · 19
λ·VI Privacy & On-device ML

On-Device by Default: A Position Note

The lab's posture is that spatial data — room geometry, faces, voice — is processed on device unless the operator explicitly opts in to upload. The paper reviews the technical cost of this position (model size, compute envelope, storage) against the cost of the alternative (network egress, server-side retention, breach surface).

  • on-device ML
  • Core ML
  • TensorFlow Lite
  • ONNX Runtime
  • edge inference
RP-001 MMXXV · ii · 14
λ·I Localization & SLAM

Drift Persists: Re-evaluating Visual-Inertial Odometry on Long Sessions

Across the visual-inertial odometry literature the headline accuracy figure is computed over sessions under three minutes. The lab finds error compounds non-linearly past the ten-minute mark in featureless interiors. The dominant failure mode is not gyroscope bias but loss of feature continuity during fast yaw; loop-closure rescues are rare without a place-recognition prior. The paper contrasts MSCKF, OKVIS, and VINS-Mono against an extended ORB-SLAM3 baseline on a 90-minute interior walkthrough.

  • VIO
  • MSCKF
  • OKVIS
  • VINS-Mono
  • ORB-SLAM3
RP-069 MMXXV · ii · 11
λ·VIII Interaction, Hand & Display

Gaze, Pinch, and the Glasses-Era Interaction Vocabulary

Hands-free spatial UI is converging on a small vocabulary: gaze for targeting, pinch for selection, palm rotations for scaling, and voice for naming. The paper reviews input studies from major HMD platforms and notes the surprising consistency of pinch-as-select across vendors.

  • gaze tracking
  • pinch detection
  • spatial UI
  • Fitts' law
  • hand pose
RP-085 MMXXV · ii · 08
λ·X Compute & Inference Systems

Mobile NPU Inventory: Apple ANE, Qualcomm Hexagon, Tensor TPU

The mobile NPU landscape is a three-horse race in early 2025: Apple's Neural Engine, Qualcomm's Hexagon NPU, and Google's Tensor TPU. The paper benchmarks representative networks (YOLOv8n, MobileSAM, MiDaS-Small, Phi-3.5-mini) across each. Findings: the gap is narrower than vendor claims, wider than developer experience suggests.

  • ANE
  • Hexagon NPU
  • Tensor TPU
  • Core ML
  • QNN
RP-021 MMXXV · ii · 03
λ·III On-device Vision

YOLO at the Edge: From v8 to v11 on Mobile NPUs

The YOLO family has continued its quiet competence streak through versions 8–11. The paper benchmarks v8n, v9c, v10s, and v11n at 320×320 on three mobile NPU classes. Findings: v11n offers the strongest cost/accuracy ratio for interior-object detection; v8n remains the most predictable in latency variance. Not all gains generalize away from COCO.

  • YOLOv8
  • YOLOv9
  • YOLOv10
  • YOLOv11
  • COCO benchmark
RP-041 MMXXV · i · 28
λ·V Multimodal & Vision-Language

Vision-Language Models for the Room: A Field Survey

VLMs (CLIP, BLIP-2, LLaVA, Qwen-VL, InternVL, Florence-2) have absorbed a generation of separate captioning, VQA, and grounding models. The paper inventories indoor-scene tasks where VLMs now dominate (object naming, attribute extraction, scene description) and tasks where they still underperform (spatial relations, counting, fine pose).

  • CLIP
  • BLIP-2
  • LLaVA
  • Qwen-VL
  • InternVL
RP-011 MMXXV · i · 22
λ·II Reconstruction & Radiance Fields

Gaussian Splatting at the Edge: A Compute Inventory

3D Gaussian Splatting (3DGS) reconstructs scenes as a cloud of anisotropic Gaussians and renders them at real-time rates on desktop GPUs. The paper inventories what is needed to deliver the same on mobile NPUs: a 4–10× reduction in splat count, INT8 quantization of opacity and SH coefficients, and a tiled rasterizer. The lab notes the open problem is not rendering but training: optimization currently demands 30+ minutes on consumer hardware.
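
A back-of-envelope memory budget shows why splat count and INT8 storage both matter. The sketch assumes the parameter layout commonly used for 3DGS (position 3, scale 3, rotation quaternion 4, opacity 1, and degree-3 spherical harmonics at 16 coefficients per color channel) and ignores quantization metadata. The one-million-splat scene is a hypothetical size, not a figure from the paper.

```python
def splat_bytes(sh_degree: int = 3, int8_opacity_sh: bool = False) -> int:
    """Bytes per splat under the assumed parameter layout."""
    geometry = 3 + 3 + 4                    # position, scale, rotation
    appearance = 1 + 3 * (sh_degree + 1) ** 2  # opacity plus RGB SH coefficients
    if int8_opacity_sh:
        # Geometry stays FP32; opacity and SH are stored as one byte each.
        return geometry * 4 + appearance * 1
    return (geometry + appearance) * 4


def scene_megabytes(num_splats: int, **kwargs) -> float:
    """Total scene size in megabytes (10**6 bytes)."""
    return num_splats * splat_bytes(**kwargs) / 1e6


desktop_mb = scene_megabytes(1_000_000)                      # FP32, full count
mobile_mb = scene_megabytes(250_000, int8_opacity_sh=True)   # 4x fewer, INT8
```

Under these assumptions a splat shrinks from 236 to 89 bytes, and combining that with a 4× count reduction takes the hypothetical scene from 236 MB to roughly 22 MB.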

  • 3DGS
  • spherical harmonics
  • tiled rasterization
  • INT8 quantization
  • anisotropic Gaussians
RP-077 MMXXV · i · 15
λ·IX Rendering & Optics

Compositing the Real and the Synthetic

AR rendering is fundamentally a compositing problem: the synthetic image must match the live image in exposure, color, and motion blur. The paper reviews real-time tone mapping, color-matching neural networks, and motion-blur transfer. The match is rarely perfect; user tolerance for the gap is wider than the literature claims.

  • compositing
  • tone mapping
  • color match
  • motion blur
  • ACES
RP-031 MMXXV · i · 12
λ·IV Embodied RL & World Models

World Models in 2025: DreamerV3 and Beyond

World models, agents that learn a latent forward dynamics model and plan inside it, have re-entered the mainstream after DreamerV3 demonstrated cross-domain generalization. The paper reviews the Dreamer V1–V3 progression, IRIS, TWM, and the JEPA family. The transition from pixel-space to latent-space planning is treated as the field's central inflection.

  • DreamerV3
  • IRIS
  • TWM
  • JEPA
  • world model
RP-061 MMXXV · i · 08
λ·VII Sensors, VIO & Calibration

IMU Calibration in 2025: Stationary, Allan Variance, and Online

IMU calibration splits into factory-stationary methods (Allan variance for noise characterization), in-situ stationary refinement, and online calibration during use. The paper reviews each and notes that online calibration has matured to where dedicated stationary calibration adds little for consumer devices. Allan-variance work remains essential for sensor selection.
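
For the noise-characterization step, a minimal sketch of the non-overlapping Allan variance at a single cluster size follows. The averaging time is the cluster size times the sample period. Overlapping estimators and the fit of noise terms to the resulting curve are omitted, and the alternating test signal is synthetic.

```python
def allan_variance(samples, cluster_size):
    """Non-overlapping Allan variance for clusters of cluster_size samples."""
    num_clusters = len(samples) // cluster_size
    if num_clusters < 2:
        raise ValueError("need at least two full clusters")
    means = [
        sum(samples[i * cluster_size:(i + 1) * cluster_size]) / cluster_size
        for i in range(num_clusters)
    ]
    # Half the mean squared difference between consecutive cluster averages.
    squared_diffs = [(means[i + 1] - means[i]) ** 2 for i in range(num_clusters - 1)]
    return 0.5 * sum(squared_diffs) / len(squared_diffs)


# Synthetic rate signal that alternates +1, -1 at every sample.
alternating = [1.0, -1.0] * 8
avar_tau1 = allan_variance(alternating, 1)  # consecutive samples differ by 2
avar_tau2 = allan_variance(alternating, 2)  # pairs average out to zero
```

Sweeping `cluster_size` over powers of two and plotting the square root against averaging time gives the familiar Allan deviation curve used to compare sensors.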

  • Allan variance
  • IMU bias
  • online calibration
  • factory cal
  • Kalibr