1 GenNBV: Generalizable Next-Best-View Policy for Active 3D Reconstruction
[1] introduce GenNBV, an end-to-end reinforcement-learning policy that plans in a continuous five-degree-of-freedom view space.
Links: Project Page, GitHub
- Free-space control: Actions are sampled from a learned 5D Gaussian over \((x,y,z,\text{yaw},\text{pitch})\), enabling collision-aware trajectories without hand-crafted candidates.
- Multi-source embedding: Probabilistic occupancy, RGB semantics, and action history are fused into a shared state representation that predicts reconstruction progress.
- Training: In Isaac Gym; 24h on an NVIDIA Tesla V100 (16GB VRAM)
- Benchmarking: Established a benchmark in the Isaac Gym simulator with the Houses3K (3k textured 3D building models) and OmniObject3D (4k high-fidelity scans of objects across 190 categories) datasets; GenNBV achieves 98.3% AUC and a 97.1% final coverage ratio on unseen building-scale scenes.
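Here AUC summarizes how quickly coverage accumulates over a trajectory (area under the per-step coverage-ratio curve). A minimal NumPy sketch under that assumption; the benchmark's exact normalization may differ:

```python
import numpy as np

def coverage_auc(coverage_ratios):
    """Normalized area under a per-step coverage-ratio curve.

    coverage_ratios: one CR_t value per planning step, in percent [0, 100].
    Returns a percentage; assumes unit spacing between steps. Illustrative
    definition only, not necessarily the benchmark's exact formula.
    """
    cr = np.asarray(coverage_ratios, dtype=float)
    if cr.size < 2:
        return float(cr[-1]) if cr.size else 0.0
    # Trapezoidal area, normalized by the best attainable area (100% at every step).
    return float(np.trapz(cr, dx=1.0) / (100.0 * (cr.size - 1)) * 100.0)

# Coverage that ramps up quickly scores a higher AUC.
print(coverage_auc([20, 60, 85, 95, 97]))  # ~74.6
```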
1.1 Local Sources
- main.tex – full CVPR submission.
- Supp/supp.tex – supplemental material and AUC curves.
- Figure/Fig_Teaser_v5.pdf – teaser used in our docs.
- README.md – build instructions and compilation notes.
1.2 Problem Setup and I/O
Each time step \(t\) yields an observation \[ o_t = \{I_{1:t}, D_{1:t}, a_{1:t}, p_t\}, \] where \(I\) and \(D\) are the RGB and depth frames, \(a_{1:t} \in \mathcal{A} \subset \text{SE}(3)\) are the past actions, and \(p_t \in \mathbb{R}^3\) is the current camera position. The policy applies an encoder \(\phi\) to produce a state embedding \(s_t=\phi(o_t)\) and samples the next viewpoint from a Gaussian policy, \[ a_t \sim \pi_\theta(a\,|\,s_t) = \mathcal{N}\!\left(\mu_\theta(s_t),\,\Sigma_\theta(s_t)\right), \] yielding a 5D camera pose \((x, y, z, \text{yaw}, \text{pitch})\) that is executed by a CrazyFlie UAV in simulation.
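A minimal PyTorch sketch of this sampling step, assuming a diagonal Gaussian whose mean and log-standard-deviation come from the policy network; the bounds and names below are illustrative, not the paper's:

```python
import torch

def sample_view(mu, log_std, low, high):
    """Sample a 5D view (x, y, z, yaw, pitch) from a diagonal Gaussian policy.

    mu, log_std : (5,) tensors predicted from the state embedding s_t.
    low, high   : (5,) tensors bounding the free-space action range (assumed).
    """
    dist = torch.distributions.Normal(mu, log_std.exp())
    action = dist.sample()                      # a_t ~ N(mu_theta(s_t), Sigma_theta(s_t))
    log_prob = dist.log_prob(action).sum()      # kept for the PPO probability ratio later
    action = torch.clamp(action, low, high)     # keep the pose inside the workspace
    return action, log_prob

# Illustrative bounds: position in metres, yaw/pitch in radians.
low = torch.tensor([-5.0, -5.0, 0.5, -3.14, -1.2])
high = torch.tensor([5.0, 5.0, 5.0, 3.14, 1.2])
a_t, logp = sample_view(torch.zeros(5), torch.full((5,), -1.0), low, high)
```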
GenNBV’s geometric backbone maintains a probabilistic occupancy grid \(F_t^G\) that is updated by accumulating log-odds evidence along each sensor ray \(z_j\), with a constant increment \(C\) reflecting ray confidence (see Sec. 1.3). Voxels are thresholded into \(\{\text{occupied}, \text{free}, \text{unknown}\}\) to summarize coverage and guide planning.
1.3 State Embedding and Policy Architecture
Geometric encoder:
Depth maps are back-projected to obtain a 3D point cloud, which is voxelized and mapped to a three-state occupancy grid \(F_t^G \in \{\text{occupied}, \text{free}, \text{unknown}\}^{H\times W \times D}\); each voxel’s state records whether it has been observed as occupied, observed as free, or remains unknown (i.e., not yet observed). \(F_t^G\) is updated at each step using Bresenham’s line algorithm, tracing rays from the camera origin to the back-projected 3D points:
\[ \log \mathrm{Odd}(v_i \mid z_j) = \log \mathrm{Odd}(v_i) + C, \]
where \(v_i\) denotes the \(i\)-th voxel and \(\log \mathrm{Odd}(v_i \mid z_j)\) is its occupancy log-odds after observing that the \(j\)-th camera ray passes through it. The three states are obtained by thresholding the log-odds values. The encoder itself is a simple MLP applied to the flattened occupancy grid, producing \(s_t^G\).
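A minimal NumPy sketch of this update; uniform half-voxel stepping along each ray stands in for exact Bresenham traversal, and the log-odds increments and thresholds are illustrative assumptions (standard occupancy mapping with a decrement for traversed space and an increment for the hit voxel), not the paper's constants:

```python
import numpy as np

# Illustrative log-odds increments and clamping bounds (not from the paper).
L_OCC, L_FREE = 0.85, -0.4
L_MIN, L_MAX = -4.0, 4.0

def update_occupancy(logodds, origin, points, voxel_size=0.1):
    """Carve free space along each camera ray and mark ray endpoints occupied.

    logodds : (H, W, D) float array of per-voxel log-odds.
    origin  : (3,) camera position in world coordinates.
    points  : (N, 3) back-projected depth points in world coordinates.
    """
    for p in points:
        direction = p - origin
        length = np.linalg.norm(direction)
        if length < 1e-6:
            continue
        # Step at half-voxel resolution along the ray (stand-in for Bresenham).
        n_steps = int(np.ceil(length / (0.5 * voxel_size)))
        ts = np.linspace(0.0, 1.0, n_steps, endpoint=False)
        free_voxels = np.unique(
            np.floor((origin + ts[:, None] * direction) / voxel_size).astype(int), axis=0)
        for v in free_voxels:
            if all(0 <= v[i] < logodds.shape[i] for i in range(3)):
                logodds[tuple(v)] = np.clip(logodds[tuple(v)] + L_FREE, L_MIN, L_MAX)
        end = tuple(np.floor(p / voxel_size).astype(int))
        if all(0 <= end[i] < logodds.shape[i] for i in range(3)):
            logodds[end] = np.clip(logodds[end] + L_OCC, L_MIN, L_MAX)
    return logodds

def classify(logodds, tau=1.0):
    """Threshold log-odds into {0: unknown, 1: free, 2: occupied}."""
    states = np.zeros(logodds.shape, dtype=np.int64)  # unknown by default
    states[logodds <= -tau] = 1                       # free
    states[logodds >= tau] = 2                        # occupied
    return states
```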
Semantic encoder: A two-layer CNN followed by a one-layer MLP consumes the concatenated grayscale frames \(I_{t-k:t}\) to produce \(s_t^S\).
Action history encoder: Recent 5D poses are linearly projected to \(s_t^A\).
State embedding and policy network: The concatenated vector \(s_t = \mathrm{Linear}(s_t^G; s_t^S; s_t^A)\) feeds a three-layer MLP that outputs the mean and variance of the continuous action distribution.
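A compact PyTorch sketch of this pipeline; the grid resolution, frame count, history length, and hidden widths are assumed for illustration and may differ from the paper's exact dimensions:

```python
import torch
import torch.nn as nn

class GenNBVPolicySketch(nn.Module):
    """Geometric, semantic, and action-history encoders fused into a Gaussian policy head."""

    def __init__(self, grid_shape=(20, 20, 10), n_frames=4, hist_len=8, hidden=256):
        super().__init__()
        grid_dim = grid_shape[0] * grid_shape[1] * grid_shape[2]
        # Geometric encoder: MLP over the flattened occupancy grid -> s_t^G
        self.geo = nn.Sequential(nn.Flatten(), nn.Linear(grid_dim, hidden), nn.ReLU())
        # Semantic encoder: two-layer CNN + one-layer MLP over stacked grayscale frames -> s_t^S
        self.sem = nn.Sequential(
            nn.Conv2d(n_frames, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 4 * 4, hidden), nn.ReLU())
        # Action-history encoder: linear projection of the recent 5D poses -> s_t^A
        self.act = nn.Sequential(nn.Flatten(), nn.Linear(hist_len * 5, hidden), nn.ReLU())
        # Fusion layer + three-layer policy MLP -> mean and log-std of the 5D Gaussian
        self.fuse = nn.Linear(3 * hidden, hidden)
        self.policy = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 10))  # 5 means + 5 log-stds

    def forward(self, occ_grid, frames, action_hist):
        s_t = torch.cat([self.geo(occ_grid), self.sem(frames), self.act(action_hist)], dim=-1)
        mu, log_std = self.policy(torch.relu(self.fuse(s_t))).chunk(2, dim=-1)
        return mu, log_std

# Batch of 1: occupancy grid, 4 stacked 64x64 grayscale frames, last 8 poses (5 DoF each).
net = GenNBVPolicySketch()
mu, log_std = net(torch.zeros(1, 20, 20, 10), torch.zeros(1, 4, 64, 64), torch.zeros(1, 8, 5))
```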
1.4 Reward Design and Optimization
Coverage is measured by thresholding \(F_t^G\) to count occupied voxels \(\tilde N_t\) relative to the ground-truth surface voxel count \(N^*\): \[ \text{CR}_t = \frac{\tilde N_t}{N^*}\cdot 100\%. \] The primary reward is the per-step gain in coverage (a temporal-difference-style signal), \[ r^{\text{CR}}_{t+1} = \text{CR}_{t+1} - \text{CR}_t, \] augmented with penalties for collisions and overly long trajectories. Policy parameters are optimized with proximal policy optimization (PPO) using the clipped surrogate \[ L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\big(\eta_t(\theta) A_t,\; \mathrm{clip}(\eta_t(\theta), 1-\epsilon, 1+\epsilon) A_t\big)\right], \] where \(\eta_t(\theta)=\tfrac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\text{old}}}(a_t\mid s_t)}\) is the probability ratio and \(A_t\) is the advantage estimate.
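A short PyTorch sketch of the reward and the clipped surrogate; advantage estimation is omitted and the penalty weight is an illustrative placeholder, not the paper's value:

```python
import torch

def coverage_reward(cr_next, cr_curr, collided=False, collision_penalty=1.0):
    """Per-step reward: coverage gain, minus an illustrative collision penalty."""
    return (cr_next - cr_curr) - (collision_penalty if collided else 0.0)

def ppo_clip_loss(log_prob_new, log_prob_old, advantages, eps=0.2):
    """Clipped PPO surrogate L^CLIP; returned negated so a minimizer maximizes it."""
    ratio = torch.exp(log_prob_new - log_prob_old)             # eta_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.mean(torch.min(unclipped, clipped))

# Example with a batch of three transitions.
loss = ppo_clip_loss(torch.tensor([-1.0, -0.5, -2.0]),   # log pi_theta(a_t | s_t)
                     torch.tensor([-1.1, -0.6, -1.9]),   # log pi_theta_old(a_t | s_t)
                     torch.tensor([0.8, -0.3, 1.2]))     # advantage estimates A_t
```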
1.5 Notes and Limitations
- Coverage-centric rewards ignore reconstruction fidelity; surfaces with equal occupancy weight receive identical priority.
- Semantic reasoning is limited to appearance cues and lacks task-driven surface weighting.
- Architecture lacks explicit 3D spatial reasoning.
- Doesn’t leverage powerful geometric embeddings (e.g. Fourier features).
- Primitive architecture: simple MLPs and CNNs without leveraging pre-trained models or advanced architectures.