1 Research Questions
This section documents the key research questions that guide our investigation into semantic Next-Best-View (NBV) planning.
1.1 RRI Computation Strategies
Q1.1: Which oracle RRI formulation is most predictive of actual reconstruction improvement?
- How can we effectively compute the RRI for candidate views in the ASE dataset given its limitations?
- Which metrics should be used to quantify the similarity between point clouds?
- What metrics might be more suitable than Chamfer Distance for this task? (A Chamfer Distance / completeness sketch follows this list.)
- How can we sample point clouds from candidate views that are comparable to the existing semi-dense SLAM point clouds in ASE, so that the RRI can be computed in a meaningful way?
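As a reference point for the metric discussion above, here is a minimal sketch of Chamfer Distance and a threshold-based completeness score between two point clouds. It assumes both clouds are NumPy arrays in the same frame; the function names and the threshold value are illustrative, not part of the ASE pipeline.

```python
# Minimal sketch: Chamfer Distance and threshold-based completeness between
# two point clouds given as (N, 3) / (M, 3) NumPy arrays in the same frame.
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(pc_a: np.ndarray, pc_b: np.ndarray) -> float:
    """Symmetric Chamfer Distance: mean squared nearest-neighbour distance
    from A to B plus from B to A."""
    d_ab, _ = cKDTree(pc_b).query(pc_a)  # each point in A -> nearest point in B
    d_ba, _ = cKDTree(pc_a).query(pc_b)  # each point in B -> nearest point in A
    return float(np.mean(d_ab ** 2) + np.mean(d_ba ** 2))

def completeness(pc_pred: np.ndarray, pc_gt: np.ndarray, tau: float = 0.05) -> float:
    """Fraction of ground-truth points with a predicted point within tau metres.
    Thresholded scores like this (or an F-score) are often more robust to
    density differences than raw Chamfer Distance."""
    d, _ = cKDTree(pc_pred).query(pc_gt)
    return float(np.mean(d < tau))
```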
1.2 RRI Prediction Model Architectures
- Should we explicitly project features into the frame of the candidate view, or rather use learnable Fourier features or other positional encodings? (A projection sketch follows this list.)
- Could we initialize the weights so that they correspond to the projection operation and then allow fine-tuning? This is a TODO for the master's thesis.
- What input and output formulations could work best for the RRI prediction model? There are many features from the EFM pipeline that could be useful:
- Occupancy probabilities, centerness scores, visibility scores for known points in the semi-dense SLAM point cloud…
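To make the first option concrete, below is a minimal sketch of explicitly projecting known semi-dense SLAM points into a candidate camera frame with a pinhole model; per-point features (occupancy, centerness, visibility) could then be scattered into a candidate-view feature map at the resulting pixel locations. The pose and intrinsics conventions are assumptions, not the EFM/ASE conventions.

```python
# Sketch: transform world points into a candidate camera frame and project them.
# T_world_cam is assumed to be the camera-to-world pose; K a (3, 3) pinhole matrix.
import numpy as np

def project_points(points_w: np.ndarray, T_world_cam: np.ndarray, K: np.ndarray):
    """points_w: (N, 3) world points. Returns (N, 2) pixel coordinates,
    (N,) depths, and a mask for points in front of the camera."""
    T_cam_world = np.linalg.inv(T_world_cam)
    pts_h = np.hstack([points_w, np.ones((points_w.shape[0], 1))])  # homogeneous
    pts_c = (T_cam_world @ pts_h.T).T[:, :3]                        # camera frame
    z = pts_c[:, 2]
    valid = z > 1e-6
    uv = (K @ pts_c.T).T
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)
    return uv, z, valid
```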
1.3 EFMs as NBV Backbone
Q3.1: Can EFMs like EVL serve as effective encoders for NBV planning?
Rationale: EFMs provide semantic understanding that present methods lack, which should enable better generalization to complex scenes. Egocentric platforms like Aria glasses, or even an iPhone with LiDAR, offer rich multi-modal data that can be leveraged to obtain a better scene representation.
Q3.2: How do we efficiently handle streaming point cloud updates using an EFM backbone?
- How can we incrementally update the scene representation as new observations arrive?
- How can we re-use previously computed features to avoid redundant computations?
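One possible answer to the re-use question is to cache features at a spatial granularity and recompute only where new observations land. The sketch below uses a hypothetical encode_fn and a simple voxel grid; whether such a cache maps cleanly onto the EFM backbone's internals is exactly the open question.

```python
# Sketch: voxel-keyed feature cache for streaming point cloud updates.
# `encode_fn` (maps an (M, 3) point array to a feature) and the voxel size
# are placeholders, not part of any existing pipeline.
import numpy as np

class VoxelFeatureCache:
    def __init__(self, encode_fn, voxel_size: float = 0.2):
        self.encode_fn = encode_fn
        self.voxel_size = voxel_size
        self.points = {}    # voxel index -> list of points
        self.features = {}  # voxel index -> cached feature

    def _voxel_key(self, p: np.ndarray):
        return tuple(np.floor(p / self.voxel_size).astype(int))

    def update(self, new_points: np.ndarray):
        """Insert new points and re-encode only the voxels they touch."""
        dirty = set()
        for p in new_points:
            key = self._voxel_key(p)
            self.points.setdefault(key, []).append(p)
            dirty.add(key)
        for key in dirty:
            self.features[key] = self.encode_fn(np.asarray(self.points[key]))
        return dirty

# Example: cache = VoxelFeatureCache(encode_fn=lambda pts: pts.mean(axis=0))
```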
1.4 Entity-Aware NBV
Q4.1:
- Can we compute something like a reconstruction completeness score per entity to guide NBV selection?
- Can RRI be expressed both per entity and per scene to improve NBV selection?
- Can the NBV selection be conditioned on both the standard per-pose RRI and the per-entity reconstruction score for the entities of interest?
- Multi-modal data: How does integrating other modalities (encoded bounding boxes, object semantics such as segmentation maps) alongside the point cloud / RGB-D embeddings improve NBV policy performance?
Hypothesis: Entity-aware NBV allows task-specific prioritization: the model can attend to the objects of interest and thereby improve the reconstruction quality for those. This is especially useful in human-in-the-loop scenarios, allowing the user to specify “I want this table scanned well”. It should also leverage pre-trained semantic and geometric understanding, since entity-level representations come with rich priors about typical shapes.
Proposed Formulation:
\[ \text{Fitness}_{\text{total}}(c, \mathcal{E}) = \sum_{e \in \mathcal{E}} w_e \cdot \text{RRI}_e(c) + \lambda \cdot \text{RRI}_{\text{scene}}(c) \]
- \(\text{RRI}_e(c)\): Reconstruction improvement for entity \(e\) given candidate view \(c\)
- \(\text{RRI}_{\text{scene}}(c)\): Scene-level reconstruction improvement for candidate view \(c\)
- \(w_e\): Importance weight (user-specified or uncertainty-based)
- \(\lambda\): Weight of the scene-level term
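A direct transcription of this formulation, assuming per-candidate RRI scores are already available (the dictionary keys and numbers are purely illustrative):

```python
# Fitness_total(c, E) = sum_e w_e * RRI_e(c) + lambda * RRI_scene(c)
def total_fitness(rri_entity: dict, rri_scene: float,
                  weights: dict, lam: float = 1.0) -> float:
    entity_term = sum(weights.get(e, 1.0) * rri for e, rri in rri_entity.items())
    return entity_term + lam * rri_scene

# Example: prioritise the table the user asked to have scanned well.
fitness = total_fitness(
    rri_entity={"table_01": 0.40, "chair_02": 0.10},
    rri_scene=0.25,
    weights={"table_01": 2.0},  # user-specified importance
    lam=0.5,
)
```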
1.5 Continuous vs. Discrete Action Spaces
Q5.1: Should NBV prediction output discrete view selection or continuous pose regression?
- How can we combine the benefits of both approaches, e.g., by sampling a batch of candidate views from a continuous distribution whose parameters are regressed, followed by discrete selection among these candidates (see the sketch after this list)?
- How can we ensure collision-free views in a continuous action space, e.g., by restricting sampled candidates to known free space?
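A minimal sketch of this hybrid idea: regress the parameters of a continuous pose distribution, sample candidates from it, drop candidates outside known free space, and select the best one discretely. `score_rri` and `is_in_free_space` are hypothetical callables standing in for the learned predictor and the map; they are not part of any existing method.

```python
# Hybrid continuous/discrete NBV selection (sketch).
import numpy as np

def hybrid_nbv(mean_pose: np.ndarray, std_pose: np.ndarray,
               score_rri, is_in_free_space, n_samples: int = 64, rng=None):
    """mean_pose, std_pose: (5,) arrays over (x, y, z, theta, phi)."""
    rng = rng if rng is not None else np.random.default_rng()
    candidates = rng.normal(mean_pose, std_pose, size=(n_samples, 5))
    candidates = [c for c in candidates if is_in_free_space(c)]  # collision filter
    if not candidates:
        return mean_pose  # fall back to the distribution mode
    scores = [score_rri(c) for c in candidates]
    return candidates[int(np.argmax(scores))]
```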
1.5.1 VIN-NBV: Classification Over Discrete Candidates
Approach:
- Sample \(N\) random candidate poses (e.g., \(N=100\))
- Compute RRI for each
- Select the argmax (a sketch of this loop follows below)
Rationale:
- Avoids regression difficulties (multi-modal distributions)
- Easier to train (classification > regression for complex distributions)
- Guarantees collision-free poses (if sampled carefully)
Limitations:
- Requires many samples for good coverage
- May miss optimal view between samples
- No smooth trajectory planning
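For reference, the sample-and-rank loop above as a short sketch; `sample_random_pose` and `predict_rri` are hypothetical stand-ins for the candidate sampler and the learned RRI predictor, not the published VIN-NBV code.

```python
# Discrete candidate selection (sketch): sample N poses, score each, take the argmax.
import numpy as np

def select_nbv_discrete(sample_random_pose, predict_rri, n_candidates: int = 100):
    candidates = [sample_random_pose() for _ in range(n_candidates)]
    scores = np.array([predict_rri(pose) for pose in candidates])
    return candidates[int(np.argmax(scores))]
```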
1.5.2 GenNBV: Continuous 5-DoF Regression
Approach:
- Predict \((x, y, z, \theta, \phi)\) directly via regression
- Use RL (PPO) to handle multi-modal distributions (a policy-head sketch follows at the end of this subsection)
Advantages:
- Smooth, continuous trajectories
- Theoretically optimal (not limited by sampling)
- Suitable for real-time robotics
Challenges:
- Harder to train (regression on multi-modal targets)
- May predict colliding poses (need explicit constraints)
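A minimal sketch of what such a continuous 5-DoF policy head could look like: the network regresses the mean of a Gaussian over \((x, y, z, \theta, \phi)\) with a learned state-independent standard deviation, from which PPO can sample actions and compute log-probabilities. Layer sizes are illustrative and not taken from the GenNBV paper.

```python
# Continuous 5-DoF pose policy head (sketch), PyTorch.
import torch
import torch.nn as nn

class ContinuousPoseHead(nn.Module):
    def __init__(self, feat_dim: int = 256, pose_dim: int = 5):
        super().__init__()
        self.mean = nn.Linear(feat_dim, pose_dim)
        self.log_std = nn.Parameter(torch.zeros(pose_dim))  # state-independent std

    def forward(self, scene_feat: torch.Tensor):
        mean = self.mean(scene_feat)
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        action = dist.rsample()                      # differentiable sample
        return action, dist.log_prob(action).sum(-1)
```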
1.6 LLM Integration for NBV
- LLM integration: How can language models translate coverage maps and semantic annotations into natural-language explanations or high-level navigation strategies, allowing for easier human-machine (NBV agent) interaction? Could state-of-the-art VLAs like Gemini be employed to obtain OBBs for entity-level representations of the scene?
- Might this also allow high-level planning and qualitative feedback on the current reconstruction status?
- Could the LLM's high-level planning capabilities be used to guide the NBV policy towards task-relevant areas?