ARIA-NBV: Target-Aware Next-Best-View Planning

Quality-driven NBV with ASE, EFM3D, Relative Reconstruction Improvement, and VIN-style candidate scoring

Author

Jan Duchscherer, Munich University of Applied Sciences

Published

June 23, 2026

1 Abstract

Next-Best-View (NBV) planning addresses the fundamental challenge of autonomous viewpoint selection in active 3D reconstruction: maximizing reconstruction quality while minimizing acquisition cost (e.g. number of views, traversed distance, capture time).

Classical NBV methods rely on hand-crafted criteria, limited action spaces, or per-scene optimized representations. While learning-based NBV methods like GenNBV [1] have improved generalization by leveraging reinforcement learning, they still optimize for geometric coverage as a proxy for reconstruction quality. Since coverage maximization does not necessarily correlate with improved reconstruction quality, these methods can struggle in complex scenes with occlusions and fine details. Directly optimizing reconstruction quality, as pioneered by VIN-NBV [2], has shown significant improvements by predicting Relative Reconstruction Improvement (RRI) to quantify the fitness of candidate viewpoints. However, even VIN-NBV’s generalization capabilities are limited to simpler object-centric NBV scenarios because it does not leverage pre-trained foundation models with rich 3D spatial understanding.

This project aims to develop an NBV system that adopts VIN-NBV’s key insight of directly optimizing reconstruction quality rather than proxies like coverage, while leveraging a pre-trained egocentric foundation model as a backbone to provide strong priors for 3D spatial reasoning in complex indoor scenes. We adapt the EVL (Egocentric Voxel Lifting) 3D EFM [3], which is pre-trained on the Aria Synthetic Environments (ASE) dataset, a large-scale synthetic egocentric dataset with 100k indoor scenes, to provide rich 3D feature volumes that capture scene geometry, semantics, and free-space priors. On top of this frozen backbone, we train a lightweight RRI prediction head that introspects both the current scene representation and the candidate viewpoints to estimate the fitness of each candidate view.

2 Project Vision and Goals

2.1 Done

  • Develop an oracle RRI computation pipeline using ASE visibility data, semi-dense point clouds, GT meshes, and the generated rri_metrics API contracts.
  • Directly optimize reconstruction quality rather than surrogate coverage metrics, following RRI-based policies as per VIN-NBV.
  • Develop computational tools to efficiently simulate candidate viewpoints and their expected observations using PyTorch3D, EFM3D, and ATEK.
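
To make the oracle RRI target concrete, the sketch below follows the usual RRI form from VIN-NBV [2]: RRI(q) = (CD(R_base, R_GT) - CD(R_base ∪ q, R_GT)) / CD(R_base, R_GT), where CD is the Chamfer distance to the ground truth. It is a minimal illustration, not the project's rri_metrics API. The function names are hypothetical, the ground truth is assumed to be a sampled point set rather than a mesh, and the brute-force distance computation is only suitable for small point sets.

```python
import numpy as np


def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    # Pairwise Euclidean distances, shape (N, M). Brute force: O(N * M) memory.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Mean nearest-neighbour distance in both directions.
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())


def oracle_rri(current: np.ndarray, candidate_points: np.ndarray, gt: np.ndarray) -> float:
    """Relative reduction of the Chamfer distance after adding a candidate view.

    current:          points reconstructed so far, shape (N, 3)
    candidate_points: points the candidate view would observe, shape (K, 3)
    gt:               points sampled from the ground-truth surface, shape (M, 3)
    """
    cd_before = chamfer_distance(current, gt)
    if cd_before == 0.0:
        # Reconstruction already matches the ground truth; nothing to improve.
        return 0.0
    merged = np.concatenate([current, candidate_points], axis=0)
    cd_after = chamfer_distance(merged, gt)
    return (cd_before - cd_after) / cd_before


# Toy example (hypothetical data): the ground truth is the four corners of a unit square.
gt = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0]], dtype=float)
current = gt[:2]      # two corners are already reconstructed
informative = gt[2:]  # a candidate that observes the two missing corners
redundant = gt[:2]    # a candidate that only re-observes known points

rri_informative = oracle_rri(current, informative, gt)
rri_redundant = oracle_rri(current, redundant, gt)
```

In this toy setup the informative candidate removes the entire remaining error, while the redundant candidate leaves it unchanged, which is the ranking an RRI-based policy relies on.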

2.2 WIP

  • Train an RRI predictor head on top of a frozen EFM backbone via imitation learning on oracle RRIs; the head introspects the current reconstruction and a candidate pose.
  • Leverage EVL’s 3D foundation features (voxel occupancy, centerness, semantic channels, and oriented bounding box (OBB) priors) as state embeddings for RRI estimation and NBV decision making.
  • Track reconstruction progress per entity using EVL’s OBB detection capabilities.
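
Once a predictor head outputs one RRI estimate per candidate pose, a one-step NBV decision reduces to ranking a finite candidate set. The sketch below illustrates that greedy selection under stated assumptions: the function name, the 4x4 pose layout, and the optional validity mask (for example, to exclude candidates in collision) are hypothetical and not taken from the project's code.

```python
import numpy as np


def select_next_best_view(candidate_poses, predicted_rri, valid_mask=None):
    """Return (index, pose) of the candidate with the highest predicted RRI.

    candidate_poses: array of shape (C, 4, 4), one camera pose per candidate
    predicted_rri:   array of shape (C,), one score per candidate
    valid_mask:      optional boolean array of shape (C,); False entries are skipped
    """
    poses = np.asarray(candidate_poses)
    scores = np.asarray(predicted_rri, dtype=float)
    if scores.ndim != 1 or scores.shape[0] != poses.shape[0] or scores.size == 0:
        raise ValueError("need exactly one score per candidate and at least one candidate")
    if valid_mask is not None:
        # Invalid candidates can never win the argmax.
        scores = np.where(np.asarray(valid_mask, dtype=bool), scores, -np.inf)
    best = int(np.argmax(scores))
    if not np.isfinite(scores[best]):
        raise ValueError("no valid candidate available")
    return best, poses[best]


# Toy example (hypothetical scores): three candidates with identity poses.
poses = np.stack([np.eye(4)] * 3)
scores = [0.10, 0.40, 0.25]

best_idx, best_pose = select_next_best_view(poses, scores)
masked_idx, _ = select_next_best_view(poses, scores, valid_mask=[True, False, True])
```

Greedy selection is the simplest policy that an RRI score supports; it looks only one step ahead and does not account for the traversed distance between views.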

2.3 Future Work

  • Extend towards human-in-the-loop AR guidance, where entity-aware RRI weighting delivers task-specific view suggestions.

3 Documentation Navigation

The current thesis direction is owned by the thesis roadmap, research questions, and canonical project memory. The seminar paper records historical implemented evidence from the earlier project phase and should not override the current thesis contract.

3.1 Paper

3.2 Project Slides

3.2.1 Project Presentations

3.3 Setup & Installation

3.4 Thesis State

3.5 Theory & Background

3.6 Dataset & Resources

3.7 Literature Reviews

  • Literature Review: Entry point and local LaTeX corpus
  • VIN-NBV: Direct quality optimization with RRI
  • GenNBV: Continuous action spaces and RL approaches
  • EFM3D & EVL: Egocentric foundation models and voxel lifting
  • SceneScript: Structured scene language and entity representation

3.8 Implementation Contracts

  • API Reference: Generated package contracts for datasets, immutable VIN offline stores, target selection, rollout Zarr, finite-candidate generation, rendering, RRI metrics, and VIN one-step scoring.
  • Setup Instructions: Environment, cache, and smoke commands for local validation.
  • Architecture Diagrams: Generated context diagrams and package-level views.

References

[1] X. Chen, Q. Li, T. Wang, T. Xue, and J. Pang, “GenNBV: Generalizable next-best-view policy for active 3D reconstruction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 16436–16445. Available: https://openaccess.thecvf.com/content/CVPR2024/html/Chen_GenNBV_Generalizable_Next-Best-View_Policy_for_Active_3D_Reconstruction_CVPR_2024_paper.html
[2] N. Frahm et al., “VIN-NBV: A view introspection network for next-best-view selection,” 2025. Available: https://arxiv.org/abs/2505.06219
[3] J. Straub, D. DeTone, T. Shen, N. Yang, C. Sweeney, and R. Newcombe, “EFM3D: A benchmark for measuring progress towards 3D egocentric foundation models,” 2024. Available: https://arxiv.org/abs/2406.10224