ARIA-NBV: Target-Aware Next-Best-View Planning
Quality-driven NBV with ASE, EFM3D, Relative Reconstruction Improvement, and VIN-style candidate scoring
1 Abstract
Next-Best-View (NBV) planning addresses the fundamental challenge of autonomous viewpoint selection in active 3D reconstruction: maximizing reconstruction quality while minimizing acquisition cost (e.g. number of views, traversed distance, capture time).
Classical NBV methods rely on hand-crafted criteria, limited action spaces, or per-scene optimized representations. While learning-based NBV methods like GenNBV [1] have improved generalization by leveraging reinforcement learning, they still optimize for geometric coverage as a proxy for reconstruction quality. Since coverage maximization does not necessarily correlate with improved reconstruction quality, these methods can struggle in complex scenes with occlusions and fine details. Directly optimizing reconstruction quality, as pioneered by VIN-NBV [2], has shown significant improvements by predicting Relative Reconstruction Improvement (RRI) to quantify the fitness of candidate viewpoints. However, even VIN-NBV’s generalization capabilities are limited to simpler object-centric NBV scenarios because it does not leverage pre-trained foundation models with rich 3D spatial understanding.
This project aims to develop an NBV system that adopts VIN-NBV’s key insight, directly optimizing reconstruction quality rather than proxies like coverage, while using a pre-trained egocentric foundation model as a backbone to provide strong priors for 3D spatial reasoning in complex indoor scenes. We adapt the EVL (Egocentric Voxel Lifting) 3D EFM [3], which is pre-trained on the Aria Synthetic Environments (ASE) dataset, a large-scale synthetic egocentric dataset with 100k indoor scenes, to provide rich 3D feature volumes that capture scene geometry, semantics, and free-space priors. On top of this frozen backbone, we train a lightweight RRI prediction head that introspects both the current scene representation and the candidate viewpoints to score the fitness of each candidate view.
2 Project Vision and Goals
2.1 Done
- Develop an oracle RRI computation pipeline using ASE visibility data, semi-dense point clouds, GT meshes, and the generated rri_metrics API contracts.
- Directly optimize reconstruction quality rather than surrogate coverage metrics, following RRI-based policies as per VIN-NBV.
- Develop computational tools to efficiently simulate candidate viewpoints and their expected observations using PyTorch3D, EFM3D, and ATEK.
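
To make the oracle RRI concrete, the following is a minimal sketch, assuming RRI is defined as in VIN-NBV: the relative reduction in Chamfer distance to the ground truth after adding the points observed from a candidate view. It is not the project's rri_metrics API. The function names `chamfer_distance` and `oracle_rri` are hypothetical, the GT mesh is assumed to be available as points sampled from its surface, and the brute-force nearest-neighbour search is only suitable for small point sets.

```python
import numpy as np


def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    # Pairwise squared distances, shape (N, M). Brute force: O(N * M) memory.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    # Mean nearest-neighbour distance in each direction, then summed.
    a_to_b = np.sqrt(d2.min(axis=1)).mean()
    b_to_a = np.sqrt(d2.min(axis=0)).mean()
    return float(a_to_b + b_to_a)


def oracle_rri(
    current_pts: np.ndarray,
    candidate_pts: np.ndarray,
    gt_pts: np.ndarray,
    eps: float = 1e-12,
) -> float:
    """Relative reduction in Chamfer distance to the GT after adding a view.

    current_pts:   points reconstructed so far (assumed non-empty)
    candidate_pts: points that the candidate view would contribute
    gt_pts:        points sampled from the GT surface (an assumption here)
    """
    cd_before = chamfer_distance(current_pts, gt_pts)
    merged = np.concatenate([current_pts, candidate_pts], axis=0)
    cd_after = chamfer_distance(merged, gt_pts)
    # Guard against division by zero when the reconstruction is already exact.
    return (cd_before - cd_after) / max(cd_before, eps)


# Toy example: the GT is four points on a line, and half of it is reconstructed.
gt = np.array([[0.0, 0, 0], [1.0, 0, 0], [2.0, 0, 0], [3.0, 0, 0]])
current = gt[:2]
rri_new_region = oracle_rri(current, gt[2:], gt)  # view covers the missing half
rri_redundant = oracle_rri(current, gt[:2], gt)   # view re-observes known points
```

In this toy case the view that covers the missing half receives a higher RRI than the redundant view, which is the ranking signal an RRI predictor would be trained to imitate.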
2.2 WIP
- Train an RRI predictor head on top of a frozen EFM backbone that introspects the current reconstruction and a candidate pose via imitation learning on oracle RRIs.
- Leverage EVL’s 3D foundation features (voxel occupancy, centerness, semantic channels, and oriented bounding box (OBB) priors) as state embeddings for RRI estimation and NBV decision making.
- Add entity-aware reconstruction tracking through EVL’s OBB detection capabilities.
2.3 Future Work
- Extend towards human-in-the-loop AR guidance, where entity-aware RRI weighting delivers task-specific view suggestions.