Eighty-eight short papers on the techniques and instruments shaping augmented reality, immersive computing, and on-device perception. The lab observes; the lab summarizes; the lab declines to disclose how the methods are combined behind the wall.
VOLUME
88 entries
WINDOW
2025 – 2026
DOMAINS
ten
FORMAT
survey · digest · field note
RP-050 · MMXXVI · v · 14
λ·V · Multimodal & Vision-Language
When the Model Should Not Speak: Refusal in Spatial AI
VLMs generate fluent text whether or not the underlying perception supports it. The paper examines refusal training and uncertainty-aware decoding for spatial tasks. The lab argues that interactive AR favors *short refusals* over long confabulations, and proposes a per-token entropy gate.
refusal training
uncertainty quantification
entropy gating
abstention
calibration
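A minimal sketch of the per-token entropy gate named in the abstract above, assuming greedy decoding over precomputed next-token distributions. The threshold `max_entropy`, the refusal string, and the toy vocabulary are hypothetical; the paper's actual gate is not described in this index.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def gated_decode(steps, vocab, max_entropy=1.0, refusal="I can't tell."):
    """Greedy decode that aborts with a short refusal once entropy exceeds the gate.

    steps: list of per-token probability distributions (stand-in for model output).
    max_entropy: hypothetical threshold in nats, not a value from the paper.
    """
    words = []
    for probs in steps:
        if token_entropy(probs) > max_entropy:
            return refusal  # prefer a short refusal over a long confabulation
        best = max(range(len(probs)), key=probs.__getitem__)
        words.append(vocab[best])
    return " ".join(words)

vocab = ["chair", "table", "lamp", "sofa"]
confident = [[0.97, 0.01, 0.01, 0.01], [0.02, 0.95, 0.02, 0.01]]
uncertain = [[0.97, 0.01, 0.01, 0.01], [0.25, 0.25, 0.25, 0.25]]
print(gated_decode(confident, vocab))
print(gated_decode(uncertain, vocab))
```

The second sequence hits a uniform distribution (entropy ln 4 ≈ 1.39 nats), so the gate fires and the partial output is discarded in favor of the refusal.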
RP-010 · MMXXVI · v · 12
λ·I · Localization & SLAM
The Cold-Start Problem in Persistent AR
First-launch re-localization is the most user-hostile moment in persistent AR: nothing is known about the room until the device sweeps it. The paper digests four cold-start strategies: dense priming sweeps, gravity-aligned mini-maps, server-side coarse priors, and IMU-only blind tracking until the first feature lock. None is satisfying alone; the lab argues for a hybrid with explicit user signaling on the first session.
cold start
gravity-aligned map
blind tracking
feature priming
VIO
RP-040 · MMXXVI · v · 09
λ·IV · Embodied RL & World Models
When Not to Use RL: A Decision Note
Reinforcement learning is tempting for spatial tasks but is rarely the cheapest answer. The paper catalogues five tasks where the lab considered RL and chose classical control or imitation instead, with rationale. A decision tree is offered: if the reward is dense, the demonstrations exist, and the simulator is faithful, RL is reasonable; otherwise, it is not.
classical control
MPC
imitation learning
decision tree
RL
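The decision rule quoted in the abstract above can be written down directly. This is a sketch of that one rule only; the `TaskProfile` fields are hypothetical names, and the paper's full decision tree may contain branches not listed in this index.

```python
from dataclasses import dataclass

@dataclass
class TaskProfile:
    """Hypothetical description of a candidate task."""
    reward_is_dense: bool
    demonstrations_exist: bool
    simulator_is_faithful: bool

def rl_is_reasonable(task: TaskProfile) -> bool:
    """RL is reasonable only if all three conditions hold; otherwise it is not."""
    return (task.reward_is_dense
            and task.demonstrations_exist
            and task.simulator_is_faithful)

print(rl_is_reasonable(TaskProfile(True, True, True)))
print(rl_is_reasonable(TaskProfile(True, True, False)))
```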
RP-076 · MMXXVI · v · 06
λ·VIII · Interaction, Hand & Display
Eye Tracking on Mobile: The Unsolved Case
Eye tracking on phone front cameras is feasible at coarse resolution; gaze prediction on a handheld device is dominated by head pose, not eyeball pose. The paper reviews mobile-gaze methods (GazeCapture, OpenGaze) and their characteristic failure modes.
GazeCapture
OpenGaze
EyeNet
appearance-based gaze
calibration-free gaze
RP-030 · MMXXVI · v · 02
λ·III · On-device Vision
Distillation Pipelines: Teacher → Student → Edge
Knowledge distillation has become a default tool for shipping research-grade models to mobile. The paper reviews logit distillation, feature distillation, attention transfer, and self-distillation. A finding: distillation works best when the student architecture is similar to the teacher; cross-family distillation is brittle.
logit distillation
feature distillation
attention transfer
self-distillation
DistilBERT-style
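As an illustration of the logit-distillation variant listed above, here is a sketch of the standard temperature-scaled loss: the KL divergence between softened teacher and student distributions, scaled by T². The temperature value is an assumption, and nothing here reflects the paper's pipelines.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """T^2 * KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl

teacher = [4.0, 1.0, 0.5]
print(distillation_loss(teacher, [4.0, 1.0, 0.5]))  # identical logits: zero loss
print(distillation_loss(teacher, [0.5, 1.0, 4.0]))  # mismatched logits: positive loss
```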
RP-084 · MMXXVI · iv · 30
λ·IX · Rendering & Optics
Volumetric Rendering at Phone Frame Rates
Volumetric AR effects (smoke, light shafts, transparency) remain expensive on mobile. The paper indexes density-grid rendering, sparse voxel approaches (NSVF), and signed-distance volumes. Findings: small volumes are cheap; large volumes are not yet tractable.
density grid
NSVF
SDF rendering
ray marching
volume rendering
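To make the cost structure of density-grid rendering concrete, here is a sketch of emission-absorption ray marching along a single ray, assuming densities and scalar colors have already been sampled from the grid at a fixed step. The early-termination cutoff is a hypothetical value. Cost grows with the number of samples per ray, which is consistent with small volumes being cheap and large ones not.

```python
import math

def march_ray(densities, colors, step, min_transmittance=1e-3):
    """Front-to-back compositing of samples along one ray.

    densities: sampled extinction coefficients along the ray.
    colors: sampled scalar radiance at the same points.
    step: distance between samples.
    Returns (accumulated radiance, remaining transmittance).
    """
    transmittance = 1.0
    radiance = 0.0
    for sigma, color in zip(densities, colors):
        alpha = 1.0 - math.exp(-sigma * step)  # opacity of this segment
        radiance += transmittance * alpha * color
        transmittance *= 1.0 - alpha
        if transmittance < min_transmittance:  # early exit once nearly opaque
            break
    return radiance, transmittance

radiance, remaining = march_ray([0.5] * 4, [1.0] * 4, step=0.5)
print(radiance, remaining)
```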
RP-068 · MMXXVI · iv · 25
λ·VII · Sensors, VIO & Calibration
Polarization as a Cue: A Mostly-Forgotten Channel
Polarization-aware imaging is mature in microscopy, fragile in mobile. The paper reviews polarimetric phones (Sony IMX-class with pixel polarizers), specular separation (PolDepth), and shape-from-polarization. Filed as a watchlist item: not deployed, of interest.
polarimetric imaging
PolDepth
shape-from-polarization
Sony IMX
specular separation
RP-020 · MMXXVI · iv · 22
λ·II · Reconstruction & Radiance Fields
On-Device Splatting: A 2026 Status Note
Real-time on-device 3DGS rendering crossed the 30 fps threshold on flagship-class mobile NPUs in late 2025. The paper sketches the bottlenecks that remain — splat sort, alpha compositing, memory bandwidth on tiled GPUs — and the techniques that have been tried (hierarchical splat LODs, view-frustum culling, adaptive radius). It does not describe shipped lab pipelines.
3DGS
hierarchical LOD
view-frustum culling
tiled GPU
alpha compositing
RP-058 · MMXXVI · iv · 18
λ·VI · Privacy & On-device ML
On-Device LLMs and the Memory-Cost Frontier
Sub-4B parameter LLMs (Phi-3.5-mini, Gemma-2B, Llama-3.2-1B, Qwen-2.5-1.5B) now fit in 1–3 GB of RAM on flagship phones. The paper reports tokens-per-second under thermal load, INT4 quantization, and KV-cache compression. The privacy benefit is substantial; the latency cost is not yet zero.
Phi-3.5
Gemma-2B
Llama-3.2
Qwen-2.5
KV-cache
RP-009 · MMXXVI · iv · 05
λ·I · Localization & SLAM
Drift Compensation by Magnetic Anomaly: A Field Note
Indoor magnetic fields are highly non-uniform and surprisingly stable — a fingerprint that is reproducible to ~30 cm in steel-framed buildings. The paper reviews the magnetic-fingerprint literature (IndoorAtlas-style approaches, learned magnetic descriptors) and reports a fusion experiment combining VIO with a continuous magnetic prior. The fusion adds 2 ms of compute per frame and reduces long-session drift in featureless corridors by ~38%.
VLMs are increasingly used as teachers for downstream task-specific models. The paper reviews CLIP→detector distillation (RegionCLIP, GLIP), VLM→segmentation (LISA), and VLM→tracker pipelines. A note: cross-modal distillation is more sensitive to the prompt distribution than to the teacher's underlying quality.
RegionCLIP
GLIP
LISA
PaliGemma
distillation
RP-090 · MMXXVI · iii · 30
λ·X · Compute & Inference Systems
On-Device LLM Inference: KV-Cache, Speculative Decoding, and the Latency Wall
On-device LLM inference at >10 tok/s on flagship phones is now routine for sub-2B models; latency at first-token remains the dominant user-perceived delay. The paper reviews KV-cache compression, speculative decoding, and prompt prefix caching. Findings: prefix caching is the highest-leverage trick for interactive AR.
KV-cache
speculative decoding
prompt cache
FlashAttention
MQA
RP-039 · MMXXVI · iii · 26
λ·IV · Embodied RL & World Models
Eval Beyond Success Rate: Trajectory-Quality Metrics for Spatial Agents
Success rate is a thin metric for spatial agents. The paper proposes auxiliary metrics — path efficiency, exploration coverage, obstacle margin, time-to-first-error — and reports them on three open benchmarks. A digest of how published work would have changed under richer evaluation is included.
SPL
exploration coverage
DTW
path efficiency
trajectory metrics
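SPL (Success weighted by Path Length), listed among the tags above, is one widely used path-efficiency metric. A minimal sketch of its standard definition follows; the episode tuples are made-up inputs, not results from the paper.

```python
def spl(episodes):
    """Success weighted by Path Length.

    episodes: list of (success, shortest_path_length, agent_path_length).
    Each successful episode contributes shortest / max(agent, shortest);
    failed episodes contribute zero.
    """
    if not episodes:
        return 0.0
    total = 0.0
    for success, shortest, taken in episodes:
        if not success:
            continue
        longest = max(taken, shortest)
        total += shortest / longest if longest > 0 else 1.0
    return total / len(episodes)

# Illustrative episodes: optimal success, detoured success, failure.
episodes = [(True, 10.0, 10.0), (True, 10.0, 20.0), (False, 10.0, 5.0)]
print(spl(episodes))
```

All three episodes would score identically on a reach-the-goal check for the first two, while SPL separates the optimal path from the one that took twice as long.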
RP-019 · MMXXVI · iii · 20
λ·II · Reconstruction & Radiance Fields
From Splats to Geometry: The Mesh-Extraction Problem
3D Gaussian Splatting renders beautifully but produces clouds, not surfaces. The paper reviews mesh-extraction strategies (SuGaR, 2DGS, GS2Mesh, Poisson reconstruction over splat centroids). Each pays a different fidelity tax. The lab notes the open problem of producing watertight, editable, semantically labeled meshes from a splat cloud at interactive rates.
SuGaR
2DGS
GS2Mesh
Poisson reconstruction
marching cubes
RP-075 · MMXXVI · iii · 12
λ·VIII · Interaction, Hand & Display
Voice as Modal Glue in Spatial UI
Voice is the connective tissue of spatial UI when both hands are committed. The paper reviews on-device wake-word detection, intent recognition, and the design tension between push-to-talk and always-listening modes. Findings: ambient voice mode is rarely justified.
wake word
intent recognition
push-to-talk
VAD
ASR
RP-029 · MMXXVI · iii · 08
λ·III · On-device Vision
Hand Pose: 21 Keypoints and the Occlusion Cliff
Mobile hand-pose models (MediaPipe Hands, RTMPose-Hand, HandFormer) achieve sub-pixel accuracy on isolated hands. Two-hand interaction collapses accuracy by 30–50% under self-occlusion. The paper reviews occlusion-aware training data (InterHand2.6M, AssemblyHands) and inference-time decoupling.
MediaPipe Hands
RTMPose-Hand
HandFormer
InterHand2.6M
AssemblyHands
RP-008 · MMXXVI · ii · 27
λ·I · Localization & SLAM
Cooperative Anchors: When Two Devices See the Same Room
When two devices hold anchors in the same physical room, their anchor poses can be aligned by visual co-observation, ICP on partial reconstructions, or by exchanging encrypted descriptor sketches. The paper enumerates threat models for each. The lab finds that observation-only alignment (no descriptor exchange) is feasible to ~3 cm registration on shared planar surfaces. Privacy considerations are discussed in companion paper RP-061.
Modern phones host 2–4 cameras with different intrinsics. The paper reviews multi-camera factory calibration, ad-hoc stereo from wide/tele pairs, and learned cross-camera depth (DUSt3R, MASt3R). Findings: learned methods are now competitive with classical stereo on close-range subjects.
DUSt3R
MASt3R
stereo
Zhang's method
checkerboard calibration
RP-048 · MMXXVI · ii · 15
λ·V · Multimodal & Vision-Language
Multimodal Agents and the Gesture Channel
Beyond text and image, multimodal agents are beginning to absorb gesture as a query channel. The paper reviews tap-on-image, point-on-image, and free-hand input as VLM prompts. Pointing tokens (Molmo-style) are evaluated as a substrate for spatial gesture.
pointing tokens
tap-on-image
Molmo
GUI agents
spatial prompts
RP-083 · MMXXVI · ii · 09
λ·IX · Rendering & Optics
Foveated Rendering on Mobile: A 2026 Snapshot
Fixed-foveated rendering is shipping on mobile-class HMDs; eye-tracked foveation requires gaze and is rarer. The paper reports compute savings on representative scenes and notes the strong dependency on the foveation kernel design. Aggressive foveation creates visible boundaries during head motion.
foveated rendering
eye-tracked foveation
VRS
Qualcomm Adreno
latency
RP-038 · MMXXVI · ii · 04
λ·IV · Embodied RL & World Models
Latent Imagination at Phone-Class Compute
Running a world-model rollout at interactive rates on a phone NPU is feasible only with aggressive latent compression. The paper sketches the trade-offs between rollout horizon, latent dimensionality, and rollout count under a 16 ms budget. The lab does not deploy world-model rollouts in production; the survey is for completeness.
DreamerV3
latent dynamics
rollout horizon
NPU inference
MPC
RP-028 · MMXXVI · i · 30
λ·III · On-device Vision
Pose Estimation: Top-Down vs. Bottom-Up at Phone Scale
Human pose estimation on mobile splits into top-down (HRNet-class) and bottom-up (OpenPose-class) families. The paper benchmarks on a 30-subject home-fitness dataset and finds bottom-up wins under occlusion; top-down wins on isolated subjects. Latency budgets force a hybrid in practice.
HRNet
OpenPose
MoveNet
MediaPipe Pose
RTMPose
RP-089 · MMXXVI · i · 26
λ·X · Compute & Inference Systems
Compiler Stacks for On-Device ML in 2026
Mobile ML compiler stacks have consolidated: Core ML (Apple), QNN (Qualcomm), TensorFlow Lite + XNNPACK, ONNX Runtime, and PyTorch ExecuTorch. The paper benchmarks the same model across each and reports the effort cost of supporting all paths. Findings: ONNX Runtime is the most predictable; vendor-specific paths are the fastest.
Core ML
QNN
TFLite
ONNX Runtime
ExecuTorch
RP-007 · MMXXVI · i · 19
λ·I · Localization & SLAM
Multi-Floor Traversals and the Vertical-Axis Problem
Modern SLAM frontends assume locally planar motion, which collapses on stairwells. The paper reviews three corrections — explicit floor estimation, IMU-only altitude during stair regions, and barometric anchoring — across a 14-building dataset. A simple fusion of barometric altitude with visual scale recovery yields the best vertical accuracy at the cost of barometer warm-up.
floor segmentation
IMU integration
barometric pressure
visual scale
VINS
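A sketch of the barometric side of the fusion mentioned in the abstract above, using the standard international barometric formula to turn pressure into altitude. The blending function and its weight are purely illustrative assumptions; the index does not say how the paper combines barometric altitude with visual scale recovery.

```python
def pressure_to_altitude(pressure_hpa, reference_hpa=1013.25):
    """International barometric formula: altitude in metres relative to the reference pressure."""
    return 44330.0 * (1.0 - (pressure_hpa / reference_hpa) ** (1.0 / 5.255))

def fuse_vertical(baro_delta_m, visual_delta_m, baro_weight=0.7):
    """Hypothetical weighted blend of two vertical-displacement estimates."""
    return baro_weight * baro_delta_m + (1.0 - baro_weight) * visual_delta_m

# Climbing a stairwell: pressure drops by about 0.4 hPa.
baro_delta = pressure_to_altitude(1012.85) - pressure_to_altitude(1013.25)
print(baro_delta)
print(fuse_vertical(baro_delta, visual_delta_m=3.0))
```

Near sea level, 1 hPa corresponds to roughly 8.3 m, so floor-level resolution needs sub-hPa stability from the sensor.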
RP-057 · MMXXVI · i · 11
λ·VI · Privacy & On-device ML
Differentially Private SfM: Can a Map Be Made Without a Photograph?
Recent work on privacy-preserving structure-from-motion (DP-SfM, line-cloud SfM, hashed-feature SfM) attempts to map a space without retaining recoverable imagery. The paper reviews where these methods succeed (sparse texture, repetitive environments) and where they collapse (fine detail, dynamic objects).
DP-SfM
line-cloud SfM
hashed feature
encrypted SfM
MPC SfM
RP-018 · MMXXVI · i · 08
λ·II · Reconstruction & Radiance Fields
Implicit vs. Explicit: Where the Field Has Settled
By early 2026 the architectural debate between implicit (NeRF-style) and explicit (point-, splat-, voxel-based) representations has cooled. Explicit representations win on speed, editability, and storage; implicit representations win on continuous derivatives and surface extraction. The paper argues neither is a winner; the production answer is a translation step between the two on demand.
NeRF
3DGS
voxel grid
marching cubes
neural-explicit hybrid
RP-074 · MMXXV · xii · 30
λ·VIII · Interaction, Hand & Display
Comfort and Cybersickness: A Practitioner Index
AR comfort literature is large and contradictory. The paper synthesizes a short index: vection cues, vergence-accommodation conflict, FoV transitions, and display-latency variance. Each is mapped to a measurable display or render parameter.
cybersickness
vection
vergence-accommodation
FoV transition
SSQ
RP-047 · MMXXV · xii · 21
λ·V · Multimodal & Vision-Language
Captioning the Room: Where Models Lie Confidently
VLM hallucinations are predictable on indoor scenes: confident misnaming of generic furniture, fabricated text on labels, and hallucinated counts. The paper proposes a hallucination-detection protocol using consistency over multiple crops and entropy thresholds.
hallucination detection
VQA consistency
entropy thresholding
POPE
CHAIR
RP-037 · MMXXV · xii · 18
λ·IV · Embodied RL & World Models
Hierarchical Policies and the Long-Horizon Problem
Long-horizon embodied tasks (clean the room, prepare the workspace) defeat flat policies. The paper reviews hierarchical RL (FuN, options framework, HiTUT) and language-as-policy approaches (SayCan, Code-as-Policies, VoxPoser). The temptation to use an LLM as the high-level controller is acknowledged; failure modes are catalogued.
hierarchical RL
FuN
options
SayCan
VoxPoser
RP-088 · MMXXV · xii · 15
λ·X · Compute & Inference Systems
Thermal Throttling and Sustained Workloads
Smartphone SoCs sustain peak compute for 30–90 seconds before thermal throttling forces a step-down. The paper measures the throttle curve under a representative AR workload (camera + detection + render at 30 fps) and reports the steady-state compute envelope. Steady-state is roughly 60% of peak.
thermal throttle
SoC sustained
DVFS
battery drain
AR workload
RP-017 · MMXXV · xii · 11
λ·II · Reconstruction & Radiance Fields
Dynamic Scene Reconstruction: People in the Frame
Most reconstruction pipelines assume the scene is static. People in the frame appear as ghost geometry. The paper inventories solutions: explicit human masks (Mask-RCNN, SAM-class), motion segmentation, and neural decomposition (D-NeRF, K-Planes-Dynamic). It notes a finding that a two-second masking pre-pass costs less compute than letting the optimizer fight ghost residuals.
Mask-RCNN
SAM
D-NeRF
K-Planes-Dynamic
motion segmentation
RP-066 · MMXXV · xii · 09
λ·VII · Sensors, VIO & Calibration
Thermal Drift in Phone IMUs: A Long Session Note
Phone IMUs drift with chassis temperature. The paper reports thermal-drift profiles from a 60-minute heating session and proposes online thermal-bias estimation. Worth filing: drift is non-monotonic across SoC throttle events.
Open-vocabulary detection accepts a free-text query and returns boxes. Owl-ViT, GroundingDINO, YOLO-World, and DETR-class variants converge on similar capabilities at very different cost points. The paper indexes per-class accuracy on furniture, signage, and tools. The query-language design is found to dominate downstream behavior.
Owl-ViT
GroundingDINO
YOLO-World
DETR
open-vocab
RP-082 · MMXXV · xi · 26
λ·IX · Rendering & Optics
Display Pipelines: ATW, Reprojection, and the Late Latch
Asynchronous timewarp (ATW), late-latch poses, and asynchronous reprojection are the three rendering tricks that hide latency on AR displays. The paper reviews each and reports the residual visible artifacts at 90 Hz vs. 120 Hz refresh. Higher refresh narrows the artifact window non-linearly.
ATW
late-latch
asynchronous reprojection
vsync
double buffering
RP-056 · MMXXV · xi · 19
λ·VI · Privacy & On-device ML
Anchor Privacy: What a Persistent Anchor Reveals About a Room
A persistent anchor is, in principle, a sparse pose; in practice the descriptor cloud surrounding it can be inverted to recover textured geometry. The paper reviews descriptor inversion attacks (TURF, MapAttack, Pittinverse) and counter-measures (descriptor truncation, randomized features, encrypted anchors).
descriptor inversion
TURF
MapAttack
encrypted anchor
feature noise
RP-006 · MMXXV · xi · 04
λ·I · Localization & SLAM
Re-localization in the Dark: Low-Light SLAM Failure Modes
Below ~10 lux, descriptor extractors based on FAST/BRIEF degrade rapidly; learned descriptors (SuperPoint, R2D2, DISK) hold longer but consume an order of magnitude more inference compute. The lab notes a counterintuitive observation: increasing exposure helps the front-end but hurts the back-end (motion blur dominates re-projection error). A short discussion of event-camera complements is included.
FAST
BRIEF
SuperPoint
R2D2
DISK
RP-046 · MMXXV · x · 26
λ·V · Multimodal & Vision-Language
Long-Context Video Understanding: A 2025 Snapshot
Video-VLMs (Video-LLaVA, VideoChat2, LongVA, LLaVA-OneVision) extend single-frame reasoning across temporal context. The paper benchmarks them on indoor activity-recognition and event-localization tasks. Performance drops sharply past 30-second windows; positional-encoding tricks help only modestly.
Video-LLaVA
VideoChat2
LongVA
LLaVA-OneVision
RoPE
RP-073 · MMXXV · x · 21
λ·VIII · Interaction, Hand & Display
Latency Budgets for Glasses-Class AR
Motion-to-photon latency below ~20 ms is the threshold for comfortable AR. The paper decomposes the budget across capture, detection, pose update, render, and display. The current bottleneck on mobile-class compute is the detection-and-pose stage; the display is rarely the limiting factor.
motion-to-photon
frame budget
render-on-warp
ATW
latency
RP-026 · MMXXV · x · 15
λ·III · On-device Vision
Quantization-Aware Training in 2025: INT8, INT4, and the Calibration Tax
INT8 post-training quantization (PTQ) is a solved problem for most CNN-class detectors; INT4 is not. The paper reviews QAT (LSQ) alongside PTQ strategies (OmniQuant, GPTQ-class) and reports calibration-set sensitivity. A finding worth filing: detector heads and backbones sometimes prefer different bit widths.
LSQ
OmniQuant
GPTQ
AWQ
PTQ
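To illustrate why the calibration set matters and why INT4 is harder than INT8, here is a sketch of symmetric per-tensor quantization with a max-abs scale. This is a generic textbook scheme, not LSQ, OmniQuant, or GPTQ, and the sample values are made up.

```python
def calibrate_scale(calibration_values, num_bits=8):
    """Pick a symmetric scale so the largest calibration magnitude maps to qmax."""
    qmax = 2 ** (num_bits - 1) - 1
    peak = max(abs(v) for v in calibration_values)
    return peak / qmax if peak > 0 else 1.0

def quantize(values, scale, num_bits=8):
    """Round to the integer grid and clamp to the symmetric range."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]

def dequantize(quantized, scale):
    return [q * scale for q in quantized]

def max_error(values, num_bits):
    """Worst-case reconstruction error when calibrating on the values themselves."""
    scale = calibrate_scale(values, num_bits)
    restored = dequantize(quantize(values, scale, num_bits), scale)
    return max(abs(a - b) for a, b in zip(values, restored))

weights = [0.1, 0.33, -0.72, 1.0]
print(max_error(weights, 8), max_error(weights, 4))
```

If the calibration set misses the true peak, values beyond it are clamped; with only 15 usable levels at INT4, both clamping and rounding error are far more visible than at INT8.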
RP-036 · MMXXV · x · 09
λ·IV · Embodied RL & World Models
Imitation Learning for Spatial Tasks: BC, IQL, and the Gold-Standard Demo
Behavior cloning (BC) remains the simplest imitation-learning baseline; conservative offline-RL methods (CQL, IQL, AWAC) close the gap to online RL when demonstrations are scarce. The paper reports per-task efficiency on a 200-demo handheld interior dataset. Demo quality dominates demo quantity.
BC
IQL
CQL
AWAC
DAgger
RP-016 · MMXXV · x · 02
λ·II · Reconstruction & Radiance Fields
Scene Layout from a Single Image: Manhattan and Beyond
Recovering a coarse scene layout (walls, floor, ceiling, openings) from a single RGB frame is a problem that leans heavily on geometric assumptions. Methods that assume Manhattan-world geometry (LayoutNet, HorizonNet, AtlantaNet) fail on non-orthogonal architectures common in older buildings. The paper notes a lab finding: the failure manifold of these methods correlates with construction era more than it correlates with image quality.
LayoutNet
HorizonNet
AtlantaNet
Manhattan-world
RoomNet
RP-065 · MMXXV · ix · 25
λ·VII · Sensors, VIO & Calibration
Depth Sensors: The LiDAR-Class Comeback
Time-of-flight (ToF) and structured-light sensors on mobile (iPad Pro / iPhone Pro LiDAR-class, Android ToF) provide low-resolution but absolute-scale depth. The paper reviews their use in VIO scale recovery, mesh seeding, and people-segmentation. Conclusion: helpful, not transformative.
LiDAR
ToF
structured light
depth sensor
scale recovery
RP-005 · MMXXV · ix · 12
λ·I · Localization & SLAM
Loop Closure Without a Prior: A Pose-Graph Note
When the device returns to a previously visited region without a place-recognition prior, the residual error per loop accumulates against the inverse covariance of the back-end pose graph. Sparse bundle adjustment (SBA) and incremental smoothing (iSAM2) handle this in classical pipelines, but the paper notes both struggle with degenerate motion (pure rotation) common in headset use. A factor-graph perspective with on-manifold updates is sketched.
pose graph
SBA
iSAM2
GTSAM
on-manifold optimization
RP-081 · MMXXV · ix · 11
λ·IX · Rendering & Optics
Reflection Probes and Mirror Surfaces
Mirror surfaces in AR scenes are routinely mishandled. The paper reviews reflection-probe methods, screen-space reflections (SSR), and ray-marched reflections at mobile rates. Findings: the gap between physically correct and acceptable is wide; users tolerate qualitative reflections.
SSR
reflection probe
ray marching
BRDF
real-time GI
RP-055 · MMXXV · ix · 08
λ·VI · Privacy & On-device ML
Voice and Microphone Hygiene in AR Sessions
AR sessions frequently keep the microphone open. The paper reviews on-device wake-word detection (Porcupine-class, RNN-T-class), VAD with explicit user signaling, and audio hashing for command recognition without transcript retention. The trade-off is intent latency vs. retained audio.
wake-word detection
VAD
RNN-T
audio hashing
on-device ASR
RP-025 · MMXXV · viii · 29
λ·III · On-device Vision
Detection Under Domain Shift: Interior vs. Exterior
Object detectors trained on COCO underperform on interiors by 12–18 mAP points. The paper reviews fine-tuning, domain randomization, and synthetic data (SUN-RGBD, Hypersim, ScanNet) as remediation. Synthetic data narrows but does not close the gap; the missing factor is texture diversity, not geometry.
COCO
SUN-RGBD
Hypersim
ScanNet
domain randomization
RP-035 · MMXXV · viii · 21
λ·IV · Embodied RL & World Models
Reward Shaping in Spatial Tasks: A Cautionary Note
Reward shaping in 3D navigation is a perennial source of policy pathology. The paper catalogs shaping bugs that have appeared in published work (loop-back exploits, time-pressure compensation, distance-only rewards rewarding wall-hugging). A defensive checklist is offered.
reward shaping
PPO
potential-based shaping
intrinsic motivation
RND
RP-087 · MMXXV · viii · 19
λ·X · Compute & Inference Systems
Frame Budget Engineering for AR Apps
Hitting 30 fps on phone-class compute is an exercise in subtraction. The paper digests the typical frame: capture, ISP, detection, segmentation, pose, render, composite, display. Each stage has a budget; the paper reports the typical envelope on flagship 2025 devices.
frame budget
ISP latency
render-thread
GPU contention
thermal throttle
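The arithmetic behind a per-stage frame budget can be sketched as follows, assuming the stages named in the abstract above run strictly in sequence with no pipelining. The per-stage millisecond values are placeholders for illustration only; they are not the envelope the paper reports.

```python
STAGES = ("capture", "ISP", "detection", "segmentation",
          "pose", "render", "composite", "display")

def frame_budget_ms(fps):
    """Total time available per frame at the target rate."""
    return 1000.0 / fps

def check_budget(stage_ms, fps=30.0):
    """Return (total stage time, remaining slack) in milliseconds."""
    total = sum(stage_ms.values())
    return total, frame_budget_ms(fps) - total

# Placeholder timings, one per stage (hypothetical, not measured).
example = dict(zip(STAGES, (2.0, 3.0, 9.0, 6.0, 3.0, 7.0, 1.5, 1.0)))
total, slack = check_budget(example, fps=30.0)
print(f"total {total:.1f} ms, slack {slack:.1f} ms")
```

A negative slack means a stage has to be cut, shortened, or moved off the critical path, which is the subtraction the abstract refers to.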
RP-015 · MMXXV · viii · 17
λ·II · Reconstruction & Radiance Fields
Photogrammetry at Phone Scale: Lighting Failure Modes
Photogrammetric reconstruction from handheld phone footage fails in three ways: specular surfaces, transparent surfaces, and uniform-textured walls. The paper reviews mitigations (multi-view photometric stereo, polarimetric capture, learned depth priors) and notes that none is a complete answer. A taxonomy of interior failure cases by surface type is provided as a lookup.
photogrammetry
photometric stereo
polarimetric capture
learned priors
MVS
RP-072 · MMXXV · viii · 12
λ·VIII · Interaction, Hand & Display
Optical See-Through Displays: A 2025 Snapshot
Waveguide optics, geometric optics, and free-form combiners coexist in 2025-era HMDs. The paper indexes field-of-view, eye-relief, eyebox, and ambient-light tolerance. Worth filing: no current display class clears 60° FoV at sustained day-bright contrast.
waveguide
geometric combiner
free-form optic
FoV
ambient contrast
RP-045 · MMXXV · viii · 07
λ·V · Multimodal & Vision-Language
Tool Use in Vision-Language Pipelines
VLMs have been wired into tool-use pipelines (function calling, structured output, JSON-mode). The paper reviews indoor tasks that benefit (calibration, measurement, captioning a known scene) and tasks that do not (real-time tracking). A short note on output-token budgeting under interactive constraints is included.
function calling
JSON mode
ReAct
tool use
structured output
RP-080 · MMXXV · vii · 29
λ·IX · Rendering & Optics
Light Estimation: From Spherical Harmonics to LightStages
Real-time light estimation in AR is dominated by spherical harmonic regression from the live frame. The paper compares ARKit's environment textures, PointAR-style estimators, and learned HDR-from-LDR methods. A note: harsh point lights remain difficult.
spherical harmonics
ARKit environment
PointAR
HDR estimation
light probe
RP-054 · MMXXV · vii · 22
λ·VI · Privacy & On-device ML
Face Blurring: When the Detector Is the Privacy Mechanism
Real-time face blurring on AR streams depends on the underlying face detector. The paper reviews YuNet, BlazeFace, RetinaFace-Mobile, and MediaPipe Face Detector under occlusion, profile views, and low light. The privacy guarantee is bounded by the detector's recall, not its precision.
YuNet
BlazeFace
RetinaFace
MediaPipe Face
privacy filter
RP-064 · MMXXV · vii · 18
λ·VII · Sensors, VIO & Calibration
Magnetic Sensors as a Sixth Channel
The magnetometer is the most under-used sensor on a smartphone. Indoor magnetic anomalies are stable and locally distinctive. The paper reviews magnetic-fingerprint navigation (IndoorAtlas, learned magnetic embeddings) and discusses why this signal is not yet a default in VIO.
magnetometer
magnetic fingerprint
IndoorAtlas
EKF
particle filter
RP-004 · MMXXV · vii · 03
λ·I · Localization & SLAM
Learned Place Recognition: NetVLAD's Successors
NetVLAD remains the load-bearing place-recognition baseline despite being roughly nine years old. Successors (Patch-NetVLAD, MixVPR, MegaLoc, AnyLoc) improve recall on changing-condition benchmarks (day/night, summer/winter), but the gain narrows on the indoor case. The paper hypothesizes that interior featurelessness defeats the assumption that landmarks are stable. A ranked digest with per-method memory footprint and inference cost is included for practitioners.
NetVLAD
Patch-NetVLAD
MixVPR
MegaLoc
AnyLoc
RP-024 · MMXXV · vi · 25
λ·III · On-device Vision
Monocular Depth on the CPU: A Cold Look
Mobile depth estimation runs primarily on the NPU but degrades gracefully to CPU when the NPU is contended (camera ISP under load, video encode). The paper profiles MiDaS-Small, ZoeDepth-NK, and Depth Anything-Mobile under CPU-only conditions. Latency variance dominates; mean latency is the wrong metric.
MiDaS-Small
ZoeDepth
Depth Anything
CPU inference
latency p99
RP-044 · MMXXV · vi · 18
λ·V · Multimodal & Vision-Language
Multimodal Retrieval: When the Query Is the Room
Retrieval-augmented generation (RAG) has been adapted to spatial corpora. The paper reviews CLIP-feature retrieval, multi-vector retrieval (ColBERT-style), and hybrid sparse-dense retrieval over interior image stores. A finding worth filing: spatial retrieval benefits more from coarse layout features than from fine descriptors.
CLIP retrieval
ColBERT
BM25
RAG
multi-vector
RP-071 · MMXXV · vi · 14
λ·VIII · Interaction, Hand & Display
Haptic Feedback in AR: When the Phone Replaces the Glove
Mobile haptics — Taptic Engine, Android haptic API — provide a thin but useful feedback channel for AR confirmations. The paper reviews haptic-pattern design language (sharp vs. soft, single vs. paired) and reports user-detection thresholds at typical phone-holding postures.
Taptic Engine
Android haptic API
tactile pattern
vibrotactile
psychophysics
RP-014 · MMXXV · vi · 08
λ·II · Reconstruction & Radiance Fields
Depth Anything v2 and the Great Depth Backbone Shift
Universal monocular-depth backbones (MiDaS, DPT, Depth Anything, ZoeDepth, Marigold) have moved from research curiosities to default scene-understanding components. The paper reports per-pixel scale-invariant error on indoor benchmarks and notes the backbone of choice now changes monthly. A short discussion of distillation paths to mobile is included.
MiDaS
DPT
Depth Anything v2
ZoeDepth
Marigold
RP-034 · MMXXV · vi · 04
λ·IV · Embodied RL & World Models
JEPA, V-JEPA, and the Self-Supervised Vision Wave
Joint Embedding Predictive Architectures (JEPA) propose self-supervised learning by predicting in a learned representation rather than pixel space. V-JEPA extended this to video. The paper compares JEPA-class methods to MAE, DINOv2, and contrastive baselines on transfer to navigation and detection. Findings are mixed; the regime where JEPA wins is narrower than headline claims.
JEPA
V-JEPA
MAE
DINOv2
contrastive learning
RP-053 · MMXXV · v · 30
λ·VI · Privacy & On-device ML
Differential Privacy for Spatial Data: ε, δ, and the Floor Plan
Differential privacy guarantees on spatial data are weaker than the literature implies. The paper reviews local-DP, central-DP, and shuffle-DP applied to room-scale telemetry. A finding: floor plans leak through low-ε mechanisms more than the worst-case theory suggests; ε ≤ 1 is required for meaningful protection.
differential privacy
local DP
shuffle DP
Gaussian mechanism
Laplace mechanism
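For readers weighing the ε ≤ 1 claim above, here is a sketch of the Laplace mechanism, whose noise scale is sensitivity divided by ε. The room-area value and the sensitivity are hypothetical; the sampler uses the fact that the difference of two independent exponential draws is Laplace-distributed.

```python
import random

def laplace_scale(sensitivity, epsilon):
    """Noise scale b = sensitivity / epsilon for the Laplace mechanism."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two Exp(1/scale) draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)

def privatize(value, sensitivity, epsilon, rng):
    """Release a numeric value with epsilon-DP Laplace noise."""
    return value + laplace_noise(laplace_scale(sensitivity, epsilon), rng)

rng = random.Random(0)
room_area_m2 = 18.5  # hypothetical telemetry value
for eps in (0.5, 1.0, 4.0):
    print(eps, round(privatize(room_area_m2, sensitivity=1.0, epsilon=eps, rng=rng), 2))
```

Halving ε doubles the expected noise magnitude, which is the utility cost a low-ε requirement implies.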
RP-003 · MMXXV · v · 21
λ·I · Localization & SLAM
GNSS-Denied Indoor: When the Phone Stops Believing the Sky
Inside steel-framed buildings, GNSS pseudo-ranges drift faster than the IMU bias. The Kalman filter must be told to stop trusting the satellite stream. The paper reviews fault-detection-and-exclusion (FDE) heuristics, magnetometer cross-checks, and barometric altimetry as supplementary signals. A finding worth filing: pedestrian dead-reckoning (PDR) augmented by step-length estimation outperforms naive IMU integration by a factor of four on multi-floor traversals.
GNSS FDE
IMU integration
PDR
magnetometer
barometric altimetry
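A minimal sketch of pedestrian dead-reckoning with step-length estimation, as contrasted with naive IMU integration in the abstract above. It assumes steps and headings have already been detected upstream. The Weinberg step-length model and its gain `k` are one common choice and an assumption here; the index does not say which estimator the paper uses.

```python
import math

def weinberg_step_length(acc_max, acc_min, k=0.5):
    """Weinberg model: step length from vertical-acceleration range within one step.

    k is a per-user calibration gain (hypothetical default).
    """
    return k * max(acc_max - acc_min, 0.0) ** 0.25

def pdr_track(start_xy, steps):
    """Accumulate position from (step_length_m, heading_rad) pairs.

    Heading is measured clockwise from north; x is east, y is north.
    """
    x, y = start_xy
    path = [(x, y)]
    for length, heading in steps:
        x += length * math.sin(heading)
        y += length * math.cos(heading)
        path.append((x, y))
    return path

step = weinberg_step_length(acc_max=12.0, acc_min=8.0)
path = pdr_track((0.0, 0.0), [(step, 0.0)] * 4 + [(step, math.pi / 2)] * 2)
print(path[-1])
```

Position error grows with the number of steps rather than with the square of elapsed time, which is the usual argument for PDR over double-integrating accelerometer readings.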
RP-063 · MMXXV · v · 20
λ·VII · Sensors, VIO & Calibration
Rolling-Shutter Compensation in Mobile VIO
Most mobile cameras use rolling shutter; classical VIO assumes global shutter. The paper reviews per-row pose interpolation, learned RS undistortion (DeepRS, RS-NeRF), and the residual error after either. Findings: per-row interpolation suffices below 1 m/s motion; learned methods help on faster motion.
rolling shutter
DeepRS
RS-NeRF
BASALT-RS
per-row interpolation
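Per-row pose interpolation, as named in the abstract above, can be sketched in two steps: assign each image row its own capture time, then interpolate the camera pose to that time. The sketch below handles translation only with linear interpolation; rotation would normally use spherical interpolation and is omitted. The readout time and poses are illustrative assumptions.

```python
def row_timestamp(row, num_rows, frame_start_s, readout_s):
    """Capture time of one row, assuming a constant top-to-bottom readout."""
    return frame_start_s + readout_s * row / num_rows

def interpolate_position(t, t0, p0, t1, p1):
    """Linear interpolation of camera position between two timestamped poses."""
    w = (t - t0) / (t1 - t0)
    return tuple(a + w * (b - a) for a, b in zip(p0, p1))

# Camera moving at 1 m/s along x during a 30 ms readout (hypothetical values).
t_row = row_timestamp(row=540, num_rows=1080, frame_start_s=0.0, readout_s=0.030)
position = interpolate_position(t_row, 0.0, (0.0, 0.0, 0.0), 0.030, (0.030, 0.0, 0.0))
print(t_row, position)
```

At 1 m/s the camera moves 3 cm between the first and last row in this example, which gives a sense of the error a global-shutter assumption leaves in place.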
RP-079 · MMXXV · v · 13
λ·IX · Rendering & Optics
Differentiable Rendering for AR: An Index
Differentiable rendering (Mitsuba 3, nvdiffrast, GAN-style implicit renderers) underpins most modern view synthesis. The paper indexes them by feature: physically-based, mesh, splat, neural. The lab uses these for ablation rather than runtime; runtime renderers are simpler.
Mitsuba 3
nvdiffrast
PyTorch3D
differentiable rendering
BRDF
RP-023 · MMXXV · v · 07
λ·III · On-device Vision
Real-Time Instance Segmentation: Beyond YOLO-Seg
Instance segmentation on mobile is dominated by YOLO-seg variants for cost reasons, but the gap to research-grade transformer methods (Mask2Former, MaskDINO) remains visible. The paper inventories where the gap matters (occluded thin structures) and where it does not (large convex objects).
YOLO-seg
Mask2Former
MaskDINO
OneFormer
Cascade-RCNN
RP-086 · MMXXV · v · 04
λ·X · Compute & Inference Systems
INT8 vs. FP16: Where the Crossover Sits in 2025
INT8 quantization is now the default for mobile inference; FP16 remains common for newer NPU silicon with native FP16 paths. The paper reports per-op latency for representative kernels and notes that the crossover depends on the operator mix more than on the hardware.
INT8
FP16
BF16
operator fusion
tensor-core
RP-013 · MMXXV · iv · 29
λ·II · Reconstruction & Radiance Fields
Monocular Mesh-from-Video: A Practitioner Digest
Recovering a watertight mesh from a single moving camera is now feasible offline; doing so in real time on a phone is not. The paper reviews COLMAP-class structure-from-motion, Atlas-class learned reconstruction, NeuralRecon, MonoSDF, and SimpleRecon. Findings: learned methods are faster but hallucinate textureless walls; classical methods are slower but honest about uncertainty.
COLMAP
NeuralRecon
MonoSDF
SimpleRecon
Atlas
RP-043MMXXV · iv · 23
λ·VMultimodal & Vision-Language
Small Vision-Language Models: Phi-Vision, MiniCPM-V, Idefics-Mobile
Sub-3B parameter VLMs are now within the inference budget of high-end mobile devices. The paper benchmarks Phi-Vision, MiniCPM-V, Idefics-Mobile, and Moondream on indoor caption-and-ground tasks. Findings: small VLMs are surprisingly capable on naming, mediocre on counting, and weak on spatial reasoning.
Phi-Vision
MiniCPM-V
Idefics
Moondream
VLM benchmark
RP-070MMXXV · iv · 18
λ·VIIIInteraction, Hand & Display
Phone-as-Pointer: A Pre-Glasses Interaction Note
Before head-worn AR is universal, the phone is the pointer. The paper reviews ARKit/ARCore pointer abstractions, ray-casting from the phone center, and laser-pointer-style interaction. The dominant failure mode is fatigue under long sessions.
ARKit ray
ARCore
ray casting
pointer interaction
Fitts' law
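The ray-casting idea in the phone-as-pointer entry above can be sketched as a plain ray-plane intersection: a ray leaves the phone center along its forward axis and is tested against a detected plane such as the floor. This is a geometric illustration only, not the ARKit or ARCore API, both of which expose their own ray-cast calls; `ray_plane_hit` and its parameters are hypothetical names.

```python
import numpy as np

def ray_plane_hit(origin, direction, plane_point, plane_normal, eps=1e-9):
    """Return the point where a ray hits a plane, or None if it misses.

    origin, direction: ray start (e.g. phone center) and pointing direction.
    plane_point, plane_normal: any point on the plane and its normal.
    """
    d = direction / np.linalg.norm(direction)
    denom = float(np.dot(plane_normal, d))
    if abs(denom) < eps:
        return None  # ray is parallel to the plane
    t = float(np.dot(plane_normal, plane_point - origin)) / denom
    if t < 0.0:
        return None  # plane lies behind the phone
    return origin + t * d

# Phone held 1.5 m above the floor, tilted 45 degrees downward and forward (-z).
hit = ray_plane_hit(
    origin=np.array([0.0, 1.5, 0.0]),
    direction=np.array([0.0, -1.0, -1.0]),
    plane_point=np.zeros(3),
    plane_normal=np.array([0.0, 1.0, 0.0]),
)
```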
RP-033MMXXV · iv · 15
λ·IVEmbodied RL & World Models
Sim-to-Real Without a Real Robot: Phone-Held Embodied Agents
Most embodied-agent literature assumes a robot. A handheld device acting as a 'partial agent' (the human is the actuator) inverts the problem: perception is rich, motor primitives are gestures, success is user-confirmed. The paper sketches an evaluation protocol for handheld embodied tasks (find the door, follow the path, locate the tool).
sim-to-real
Habitat-Matterport
embodied AI
imitation learning
policy gradient
RP-002MMXXV · iv · 08
λ·ILocalization & SLAM
Anchor Durability Across Sessions: A Survey of Persistence Strategies
Persistent anchoring across application launches is approached three ways in current systems: re-localize against a stored sparse map, re-localize against a learned descriptor cloud, or re-localize against a server-side spatial graph. Each pays a different price. Sparse maps are small and brittle to lighting; descriptor clouds are robust to lighting but heavy on storage; server graphs centralize what should be local. The paper enumerates the trade space and notes that none of the three copes with moved or replaced soft furnishings, the most common interior change.
sparse map
NetVLAD descriptors
MegaLoc
ARKit Persistent Anchors
spatial graph
RP-052MMXXV · iv · 04
λ·VIPrivacy & On-device ML
Federated Learning at Phone Scale: A Reality Check
Federated learning (FL) has been demonstrated at scale by major platforms (Gboard, Siri-style on-device updates). The paper reviews FedAvg, FedProx, SCAFFOLD, and client-drift mitigation. Findings: FL works for narrow tasks (next-word prediction, wake-word) and remains brittle for spatial tasks where client distributions differ wildly.
FedAvg
FedProx
SCAFFOLD
client drift
federated learning
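For readers unfamiliar with the aggregation step behind FedAvg in the federated-learning entry above, a minimal sketch follows: the server averages client parameter vectors, weighting each by its local sample count. It assumes parameters are flattened into equal-length 1-D arrays; `fedavg` is a hypothetical helper, and the local training loop, client sampling, and secure aggregation are omitted.

```python
import numpy as np

def fedavg(client_params, client_sizes):
    """Sample-count-weighted average of client parameter vectors.

    client_params: list of 1-D arrays, one per client, all the same length.
    client_sizes:  number of local training samples held by each client.
    """
    sizes = np.asarray(client_sizes, dtype=np.float64)
    stacked = np.stack(client_params).astype(np.float64)
    return (sizes[:, None] * stacked).sum(axis=0) / sizes.sum()

# Two clients; the second holds three times as much data as the first.
global_params = fedavg(
    [np.array([0.0, 0.0]), np.array([1.0, 1.0])],
    [1, 3],
)
```

When client distributions differ widely, as the entry reports for spatial tasks, this average can sit far from any single client's optimum, which is the client-drift problem that FedProx and SCAFFOLD try to mitigate.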
RP-078MMXXV · iii · 26
λ·IXRendering & Optics
Shadow Estimation for AR Objects
Synthetic objects without shadows look pasted on. The paper reviews learned shadow synthesis (Shadow-AR, ShadowGAN), classical shadow-casting from a single light estimate, and probe-based environmental capture. Findings: a single dominant light estimate suffices for most interior scenes.
shadow synthesis
Shadow-AR
light estimation
environment probe
AR rendering
RP-022MMXXV · iii · 19
λ·IIIOn-device Vision
Segment Anything at Phone Scale: SAM, MobileSAM, EfficientSAM
SAM redefined what zero-shot segmentation could do. Its mobile descendants — MobileSAM, EfficientSAM, EdgeSAM — recover most of the quality at a fraction of the cost. The paper reports per-prompt latency and IoU on indoor classes. A short note: SAM-class models are *too* general for tracked-object pipelines; a tighter prompt budget is essential.
SAM
MobileSAM
EfficientSAM
EdgeSAM
IoU
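The IoU figure reported in the segmentation entry above is the standard intersection-over-union between a predicted mask and a ground-truth mask. A minimal sketch, assuming boolean masks of equal shape; treating two empty masks as a score of 1.0 is a convention chosen here, not something the paper specifies.

```python
import numpy as np

def mask_iou(pred, target):
    """Intersection-over-union of two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # assumption: two empty masks count as a perfect match
    return float(np.logical_and(pred, target).sum() / union)

# Two 2x2 squares on a 4x4 grid, offset by one column.
a = np.zeros((4, 4), dtype=bool)
b = np.zeros((4, 4), dtype=bool)
a[0:2, 0:2] = True
b[0:2, 1:3] = True
score = mask_iou(a, b)  # overlap of 2 pixels over a union of 6
```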
RP-012MMXXV · iii · 15
λ·IIReconstruction & Radiance Fields
NeRF in 2025: From Vanilla to Instant-NGP and Beyond
Neural Radiance Fields (NeRF) defined a generation of view-synthesis work. Five years on, the field has bifurcated: hash-grid encodings (Instant-NGP) for speed; tensor decompositions (TensoRF, K-Planes) for memory; and discrete primitives (3DGS) for both. The paper treats this as a Pareto front rather than a winner-takes-all, and indexes which method dominates which corner of the (training-time, render-time, memory) cube.
NeRF
Instant-NGP
TensoRF
K-Planes
3DGS
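The Pareto-front framing in the NeRF entry above can be made concrete with a small dominance check: a method stays on the front if no other method is at least as good on every axis and strictly better on one. The sketch assumes lower is better on all three axes (training time, render time, memory); the method names and cost tuples are placeholders, not measurements from the paper.

```python
import numpy as np

def pareto_front(costs):
    """Return the names whose cost tuples are not dominated by any other entry.

    costs: dict mapping a method name to a tuple of costs (lower is better).
    """
    front = []
    for name, c in costs.items():
        c = np.asarray(c, dtype=float)
        dominated = False
        for other, o in costs.items():
            if other == name:
                continue
            o = np.asarray(o, dtype=float)
            # `other` dominates if it is no worse everywhere and better somewhere.
            if np.all(o <= c) and np.any(o < c):
                dominated = True
                break
        if not dominated:
            front.append(name)
    return front

# Placeholder (training-time, render-time, memory) costs in arbitrary units.
example = {
    "method_a": (1, 5, 5),
    "method_b": (5, 1, 5),
    "method_c": (5, 5, 1),
    "method_d": (6, 6, 6),
}
front = pareto_front(example)
```

Here `method_d` drops out because `method_a` beats it on every axis, while the other three each own one corner of the cube.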
RP-042MMXXV · iii · 11
λ·VMultimodal & Vision-Language
Grounding the Frame: From CLIP to GroundingDINO to Molmo
Grounding — answering 'where is X in this image?' — has matured rapidly. The paper compares CLIP-based zero-shot pointing, GroundingDINO box outputs, GLIP, KOSMOS-2 grounded captioning, and Molmo's pointing tokens. Findings: pointing-token methods are surprisingly accurate for a one-shot output but lossy under occlusion.
CLIP
GroundingDINO
GLIP
KOSMOS-2
Molmo
RP-062MMXXV · iii · 04
λ·VIISensors, VIO & Calibration
Camera-IMU Time Synchronization: A Persistent Tax
Sub-millisecond camera-IMU time alignment is a precondition for tight VIO. The paper reviews hardware-level synchronization, software estimation (Kalibr-style), and learned correction. A finding: published methods assume static alignment; in practice phones drift across thermal cycles.
Kalibr
VIO synchronization
thermal drift
OpenVINS
BASALT
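One coarse way to picture the software estimation in the time-synchronization entry above is to cross-correlate the angular-rate magnitude seen by the camera (for example from frame-to-frame rotation) with the gyroscope magnitude, and take the lag of the correlation peak. This is only a sketch of that coarse step, not Kalibr's optimization; it assumes both signals are already resampled to a common rate, and `estimate_time_offset` is a hypothetical name.

```python
import numpy as np

def estimate_time_offset(cam_rate, imu_rate, dt):
    """Estimate how many seconds `imu_rate` lags behind `cam_rate`.

    cam_rate, imu_rate: 1-D angular-rate magnitudes sampled every `dt` seconds.
    A positive result means the IMU stream is delayed relative to the camera.
    """
    a = np.asarray(cam_rate, dtype=float)
    b = np.asarray(imu_rate, dtype=float)
    a = a - a.mean()
    b = b - b.mean()
    corr = np.correlate(b, a, mode="full")
    # Index i of the full correlation corresponds to lag i - (len(a) - 1).
    lag = int(np.argmax(corr)) - (len(a) - 1)
    return lag * dt
```

The resolution of this estimate is one sample period, so reaching the sub-millisecond alignment the entry calls for would need sub-sample refinement or a joint optimization, and the thermal drift noted above means the estimate would have to be refreshed rather than computed once.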
RP-032MMXXV · ii · 26
λ·IVEmbodied RL & World Models
Indoor Navigation Policies: From PointGoal to ObjectNav
Habitat and AI2-THOR have anchored embodied-agent research for half a decade. The paper benchmarks recurrent (LSTM-based) and transformer-based policies on PointGoal and ObjectNav, and reports a sim-to-real gap that has narrowed but not closed. Real-world deployment fails for reasons (clutter, thin obstacles, glass) that simulation under-models.
Habitat
AI2-THOR
PointGoal
ObjectNav
transformer policy
RP-051MMXXV · ii · 19
λ·VIPrivacy & On-device ML
On-Device by Default: A Position Note
The lab's posture is that spatial data — room geometry, faces, voice — is processed on device unless the operator explicitly opts in to upload. The paper reviews the technical cost of this position (model size, compute envelope, storage) against the cost of the alternative (network egress, server-side retention, breach surface).
on-device ML
Core ML
TensorFlow Lite
ONNX Runtime
edge inference
RP-001MMXXV · ii · 14
λ·ILocalization & SLAM
Drift Persists: Re-evaluating Visual-Inertial Odometry on Long Sessions
Across the visual-inertial odometry literature the headline accuracy figure is computed over sessions under three minutes. The lab finds error compounds non-linearly past the ten-minute mark in featureless interiors. The dominant failure mode is not gyroscope bias but loss of feature continuity during fast yaw; loop-closure rescues are rare without a place-recognition prior. The paper contrasts MSCKF, OKVIS, and VINS-Mono against an extended ORB-SLAM3 baseline on a 90-minute interior walkthrough.
VIO
MSCKF
OKVIS
VINS-Mono
ORB-SLAM3
RP-069MMXXV · ii · 11
λ·VIIIInteraction, Hand & Display
Gaze, Pinch, and the Glasses-Era Interaction Vocabulary
Hands-free spatial UI is converging on a small vocabulary: gaze for targeting, pinch for selection, palm rotations for scaling, and voice for naming. The paper reviews input studies from major HMD platforms and notes the surprising consistency of pinch-as-select across vendors.
gaze tracking
pinch detection
spatial UI
Fitts' law
hand pose
RP-085MMXXV · ii · 08
λ·XCompute & Inference Systems
Mobile NPU Inventory: Apple ANE, Qualcomm Hexagon, Tensor TPU
The mobile NPU landscape is a three-horse race in early 2025: Apple's Neural Engine, Qualcomm's Hexagon NPU, and Google's Tensor TPU. The paper benchmarks representative networks (YOLOv8n, MobileSAM, MiDaS-Small, Phi-3.5-mini) across each. Findings: the gap is narrower than vendor claims, wider than developer experience suggests.
ANE
Hexagon NPU
Tensor TPU
Core ML
QNN
RP-021MMXXV · ii · 03
λ·IIIOn-device Vision
YOLO at the Edge: From v8 to v11 on Mobile NPUs
The YOLO family has continued its quiet competence streak through versions 8–11. The paper benchmarks v8n, v9c, v10s, and v11n at 320×320 on three mobile NPU classes. Findings: v11n offers the strongest cost/accuracy ratio for interior-object detection; v8n remains the most predictable in latency variance. Not all gains generalize away from COCO.
YOLOv8
YOLOv9
YOLOv10
YOLOv11
COCO benchmark
RP-041MMXXV · i · 28
λ·VMultimodal & Vision-Language
Vision-Language Models for the Room: A Field Survey
VLMs (CLIP, BLIP-2, LLaVA, Qwen-VL, InternVL, Florence-2) have absorbed a generation of separate captioning, VQA, and grounding models. The paper inventories indoor-scene tasks where VLMs now dominate (object naming, attribute extraction, scene description) and tasks where they still underperform (spatial relations, counting, fine pose).
CLIP
BLIP-2
LLaVA
Qwen-VL
InternVL
RP-011MMXXV · i · 22
λ·IIReconstruction & Radiance Fields
Gaussian Splatting at the Edge: A Compute Inventory
3D Gaussian Splatting (3DGS) reconstructs scenes as a cloud of anisotropic Gaussians and renders them at real-time rates on desktop GPUs. The paper inventories what is needed to deliver the same on mobile NPUs: a 4–10× reduction in splat count, INT8 quantization of opacity and SH coefficients, and a tiled rasterizer. The lab notes the open problem is not rendering but training: optimization currently demands 30+ minutes on consumer hardware.
3DGS
spherical harmonics
tiled rasterization
INT8 quantization
anisotropic Gaussians
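The INT8 quantization of SH coefficients mentioned in the Gaussian-splatting entry above can be sketched as symmetric per-column quantization: each coefficient column gets its own scale so that its largest magnitude maps to 127. The (N, 48) layout, meaning 16 degree-3 SH coefficients for each of three color channels, and the function names are assumptions made for illustration; the paper does not specify its scheme.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-column INT8 quantization, so that x is close to q * scale."""
    x = np.asarray(x, dtype=np.float32)
    max_abs = np.max(np.abs(x), axis=0)
    # Columns that are all zero get a scale of 1 to avoid division by zero.
    scale = np.where(max_abs > 0, max_abs / 127.0, 1.0).astype(np.float32)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float32 values from INT8 codes and per-column scales."""
    return q.astype(np.float32) * scale

# Placeholder data standing in for the SH coefficients of 1000 splats.
rng = np.random.default_rng(0)
sh = rng.normal(size=(1000, 48)).astype(np.float32)
q, scale = quantize_int8(sh)
max_err = np.abs(dequantize_int8(q, scale) - sh).max(axis=0)
```

Storage falls from four bytes to one byte per coefficient plus one scale per column, and the round-trip error in each column is bounded by roughly half that column's scale.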
RP-077MMXXV · i · 15
λ·IXRendering & Optics
Compositing the Real and the Synthetic
AR rendering is fundamentally a compositing problem: the synthetic image must match the live image in exposure, color, and motion blur. The paper reviews real-time tone mapping, color-matching neural networks, and motion-blur transfer. The match is rarely perfect; user tolerance for the gap is wider than the literature claims.
compositing
tone mapping
color match
motion blur
ACES
RP-031MMXXV · i · 12
λ·IVEmbodied RL & World Models
World Models in 2025: DreamerV3 and Beyond
World models — agents that learn a latent forward dynamics model and plan inside it — have re-entered the mainstream after DreamerV3 demonstrated cross-domain generalization. The paper reviews the V1–V3 progression, IRIS, TWM, and the JEPA family. The transition from pixel-space to latent-space planning is treated as the field's central inflection.
DreamerV3
IRIS
TWM
JEPA
world model
RP-061MMXXV · i · 08
λ·VIISensors, VIO & Calibration
IMU Calibration in 2025: Stationary, Allan Variance, and Online
IMU calibration splits into factory-stationary methods (Allan variance for noise characterization), in-situ stationary refinement, and online calibration during use. The paper reviews each and notes that online calibration has matured to where dedicated stationary calibration adds little for consumer devices. Allan-variance work remains essential for sensor selection.