ARIA-NBV: Target-Aware Next-Best-View Planning

Quality-driven NBV with ASE, EFM3D, Relative Reconstruction Improvement, and VIN-style candidate scoring

Author

Jan Duchscherer, Munich University of Applied Sciences

Published

June 23, 2026

1 Abstract

Next-Best-View (NBV) planning addresses the fundamental challenge of autonomous viewpoint selection in active 3D reconstruction: choosing the next sensor pose so as to maximize acquisition quality while minimizing acquisition cost (e.g., number of views, traversed distance, capture time).

Classical NBV methods rely on hand-crafted criteria, limited action spaces, or per-scene optimized representations. Learning-based NBV methods such as GenNBV [1] have improved generalization by leveraging reinforcement learning, but they still optimize geometric coverage as a proxy for reconstruction quality. Because maximizing coverage does not necessarily improve reconstruction quality, these methods can struggle in complex scenes with occlusions and fine details. Directly optimizing reconstruction quality, as pioneered by VIN-NBV [2], has shown significant improvements by predicting the Relative Reconstruction Improvement (RRI) of each candidate viewpoint as a measure of its fitness. However, VIN-NBV’s generalization is limited to simpler object-centric NBV scenarios because it does not leverage pre-trained foundation models with rich 3D spatial understanding.
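
For orientation, RRI can be written as the relative reduction in reconstruction error after adding a candidate view. The form below assumes Chamfer distance (CD) as the error metric; the symbols are illustrative and not taken from this page: $\mathcal{P}_t$ is the current reconstruction, $\mathcal{P}_q$ the points that candidate view $q$ would add, and $\mathcal{M}_{\mathrm{GT}}$ the ground-truth surface.

$$
\mathrm{RRI}(q) = \frac{\mathrm{CD}(\mathcal{P}_t, \mathcal{M}_{\mathrm{GT}}) - \mathrm{CD}(\mathcal{P}_t \cup \mathcal{P}_q, \mathcal{M}_{\mathrm{GT}})}{\mathrm{CD}(\mathcal{P}_t, \mathcal{M}_{\mathrm{GT}})}
$$

Under this definition, a positive value means the candidate view reduces the reconstruction error, and a value of 1 would mean the error is removed entirely.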

This project aims to develop an NBV system that adopts VIN-NBV’s key insight, directly optimizing reconstruction quality rather than proxies like coverage, and uses a pre-trained egocentric foundation model (EFM) as backbone to provide strong priors for 3D spatial reasoning in complex indoor scenes. We adapt the EVL (Egocentric Voxel Lifting) 3D EFM [3], which is pre-trained on the Aria Synthetic Environments (ASE) dataset (a large-scale synthetic egocentric dataset with 100k indoor scenes), to provide rich 3D feature volumes that capture scene geometry, semantics, and free-space priors. On top of this frozen backbone, we train a lightweight RRI prediction head that introspects both the current scene representation and the candidate viewpoints to score the fitness of each candidate view.
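
The one-step selection rule implied above (score every candidate view, then take the best) can be sketched as follows. This is a minimal illustration, not the project's API: `CandidatePose`, `select_next_best_view`, and `toy_predictor` are hypothetical names, and the toy predictor merely stands in for a learned RRI head.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class CandidatePose:
    """Hypothetical candidate viewpoint: position (x, y, z) plus yaw in radians."""

    position: tuple[float, float, float]
    yaw: float


def select_next_best_view(
    candidates: Sequence[CandidatePose],
    predict_rri: Callable[[CandidatePose], float],
) -> tuple[CandidatePose, float]:
    """Return the candidate with the highest predicted RRI and that score."""
    if not candidates:
        raise ValueError("need at least one candidate pose")
    scored = [(predict_rri(pose), pose) for pose in candidates]
    best_score, best_pose = max(scored, key=lambda pair: pair[0])
    return best_pose, best_score


def toy_predictor(pose: CandidatePose) -> float:
    """Toy stand-in for a learned head (assumption): prefers poses far from the origin."""
    x, y, z = pose.position
    return (x * x + y * y + z * z) ** 0.5


candidates = [
    CandidatePose((0.0, 0.0, 1.5), 0.0),
    CandidatePose((2.0, 0.0, 1.5), 1.57),
    CandidatePose((0.5, 0.5, 1.5), 3.14),
]
best_pose, best_score = select_next_best_view(candidates, toy_predictor)
```

Keeping the predictor behind a plain callable means the same selection code works with an oracle scorer or a learned one.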

2 Project Vision and Goals

2.1 Done

Develop an oracle RRI computation pipeline using ASE visibility data, semi-dense point clouds, GT meshes, and the generated rri_metrics API contracts.
Directly optimize reconstruction quality rather than surrogate coverage metrics, following RRI-based policies as per VIN-NBV.
Develop computational tools to efficiently simulate candidate viewpoints and their expected observations using PyTorch3D, EFM3D, and ATEK.
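
As a rough illustration of what an oracle RRI computation involves, the sketch below compares a reconstruction against ground-truth points before and after adding the points a candidate view would contribute. It is a brute-force, standard-library toy and not the project's `rri_metrics` implementation: the function names are hypothetical, ground truth is assumed to be given as sampled points rather than a mesh, and the Chamfer distance uses one common convention (mean nearest-neighbor distance in both directions, summed).

```python
import math
from typing import Sequence

Point = tuple[float, float, float]


def mean_nn_distance(source: Sequence[Point], target: Sequence[Point]) -> float:
    """Mean distance from each source point to its nearest target point."""
    return sum(min(math.dist(p, q) for q in target) for p in source) / len(source)


def chamfer_distance(recon: Sequence[Point], gt: Sequence[Point]) -> float:
    """Symmetric Chamfer distance: accuracy term plus completeness term."""
    accuracy = mean_nn_distance(recon, gt)      # reconstruction -> ground truth
    completeness = mean_nn_distance(gt, recon)  # ground truth -> reconstruction
    return accuracy + completeness


def oracle_rri(
    current: Sequence[Point],
    candidate_points: Sequence[Point],
    gt: Sequence[Point],
) -> float:
    """Relative drop in Chamfer distance after merging the candidate's points."""
    cd_before = chamfer_distance(current, gt)
    if cd_before == 0.0:
        return 0.0  # nothing left to improve
    cd_after = chamfer_distance(list(current) + list(candidate_points), gt)
    return (cd_before - cd_after) / cd_before


# Toy scene: ground truth is four points on a line, half of them already observed.
gt_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
current_points = gt_points[:2]

rri_full = oracle_rri(current_points, gt_points[2:], gt_points)
rri_partial = oracle_rri(current_points, gt_points[2:3], gt_points)
```

In this toy scene the candidate that observes both missing points reaches an RRI of 1.0, while the one that observes only a single missing point scores lower. A real pipeline would replace the quadratic nearest-neighbor search with a vectorized or tree-based one.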

2.2 WIP

Train an RRI predictor head on top of a frozen EFM backbone that introspects the current reconstruction and a candidate pose via imitation learning on oracle RRIs.
Leverage EVL’s 3D foundation features (voxel occupancy, centerness, semantic channels, and oriented bounding box (OBB) priors) as state embeddings for RRI estimation and NBV decision making.
Track reconstruction progress per entity using EVL’s OBB detection capabilities.
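
The training setup in the first item (a frozen backbone with a small trainable head supervised by oracle RRIs) can be sketched in miniature as below. Everything here is an assumption made for illustration: `frozen_backbone` is a fixed hand-written feature map standing in for EVL, the head is linear, the labels are synthetic, and plain squared-error regression with SGD is swapped in for whatever loss the actual predictor uses.

```python
import random
from typing import Sequence


def frozen_backbone(raw: Sequence[float]) -> list[float]:
    """Hypothetical stand-in for a frozen feature extractor; it has no trainable state."""
    return [raw[0], raw[1], raw[0] * raw[1]]


class LinearRRIHead:
    """Lightweight trainable head: predicted RRI = w . features + b."""

    def __init__(self, n_features: int) -> None:
        self.weights = [0.0] * n_features
        self.bias = 0.0

    def predict(self, features: Sequence[float]) -> float:
        return sum(w * f for w, f in zip(self.weights, features)) + self.bias

    def sgd_step(self, features: Sequence[float], target: float, lr: float) -> None:
        """One SGD update on the loss 0.5 * (prediction - target)**2."""
        error = self.predict(features) - target
        self.weights = [w - lr * error * f for w, f in zip(self.weights, features)]
        self.bias -= lr * error


def mean_squared_error(head: LinearRRIHead, xs, ys) -> float:
    return sum((head.predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)
raw_inputs = [(rng.random(), rng.random()) for _ in range(64)]
# Synthetic "oracle RRI" labels (assumption), chosen so a linear head can fit them.
oracle_labels = [0.2 * a + 0.5 * b + 0.1 * a * b + 0.05 for a, b in raw_inputs]

# Features are computed once; only the head's parameters change during training.
features = [frozen_backbone(raw) for raw in raw_inputs]

head = LinearRRIHead(n_features=3)
loss_before = mean_squared_error(head, features, oracle_labels)
for _ in range(200):
    for x, y in zip(features, oracle_labels):
        head.sgd_step(x, y, lr=0.1)
loss_after = mean_squared_error(head, features, oracle_labels)
```

Because the backbone is frozen, its features can be precomputed and cached, so each training step touches only the head's few parameters.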

2.3 Future Work

Extend towards human-in-the-loop AR guidance, where entity-aware RRI weighting delivers task-specific view suggestions.

3 Documentation Navigation

The current thesis direction is owned by the thesis roadmap, research questions, and canonical project memory. The seminar paper records historical implemented evidence from the earlier project phase and should not override the current thesis contract.

3.1 Paper

Project Paper, Typst Source: Historical implemented evidence from the earlier project phase.

3.2 Project Slides

3.2.1 Project Presentations

Presentation 01, typst-src
Presentation 02, typst-src
Presentation 04, typst-src: Final presentation of the earlier project phase.

3.3 Setup & Installation

Setup Instructions: Environment setup and dependencies

3.4 Thesis State

Project Roadmap: Milestones and timeline
Research Questions: Open problems and directions

3.5 Theory & Background

NBV Background: Problem framing and prior work
RRI Theory: Mathematical formulation and properties of RRI
Surface Reconstruction Metrics: Accuracy, completeness, Chamfer distance
Semi-Dense Point Clouds: SLAM-based reconstruction signals

3.6 Dataset & Resources

Aria Synthetic Environments (ASE) Dataset: Modalities, splits, mesh availability
Resources & Tools: External links to libraries, tools, datasets
Glossary: Project terminology

3.7 Literature Reviews

Literature Review: Entry point and local LaTeX corpus
VIN-NBV: Direct quality optimization with RRI
GenNBV: Continuous action spaces and RL approaches
EFM3D & EVL: Egocentric foundation models and voxel lifting
SceneScript: Structured scene language and entity representation

3.8 Implementation Contracts

API Reference: Generated package contracts for datasets, immutable VIN offline stores, target selection, rollout Zarr, finite-candidate generation, rendering, RRI metrics, and VIN one-step scoring.
Setup Instructions: Environment, cache, and smoke commands for local validation.
Architecture Diagrams: Generated context diagrams and package-level views.

References

[1] X. Chen, Q. Li, T. Wang, T. Xue, and J. Pang, “GenNBV: Generalizable next-best-view policy for active 3D reconstruction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 16436–16445. Available: https://openaccess.thecvf.com/content/CVPR2024/html/Chen_GenNBV_Generalizable_Next-Best-View_Policy_for_Active_3D_Reconstruction_CVPR_2024_paper.html

[2] N. Frahm et al., “VIN-NBV: A view introspection network for next-best-view selection,” 2025. Available: https://arxiv.org/abs/2505.06219

[3] J. Straub, D. DeTone, T. Shen, N. Yang, C. Sweeney, and R. Newcombe, “EFM3D: A benchmark for measuring progress towards 3D egocentric foundation models,” 2024. Available: https://arxiv.org/abs/2406.10224
