Research Questions

1 Research Questions

This section documents the key research questions that guide our investigation into semantic Next-Best-View (NBV) planning.

1.1 1. RRI Computation Strategies

Q1.1: Which oracle RRI formulation is most predictive of actual reconstruction improvement?

  • How can we effectively compute the RRI for candidate views in the ASE dataset given its limitations?
  • Which metrics should be used to quantify the similarity between point clouds?
  • What metrics might be more suitable than Chamfer Distance for this task? (A minimal Chamfer-based oracle-RRI sketch follows this list.)
  • How can we sample point clouds from candidate views that are comparable to the existing semi-dense SLAM point clouds in ASE, so that the RRI can be computed in a meaningful way?
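To make the oracle-RRI question above concrete, here is a minimal sketch of one possible formulation, assuming a ground-truth point cloud is available (oracle setting) and using symmetric Chamfer Distance as the error metric; the function names and the normalization are illustrative, not a fixed design.

```python
# Hedged sketch of an oracle RRI: relative reduction in Chamfer error when the
# points observed from a candidate view are fused into the current reconstruction.
# Assumptions: ground-truth points are available (oracle setting); clouds are (N, 3) arrays.
import numpy as np
from scipy.spatial import cKDTree


def chamfer_distance(p: np.ndarray, q: np.ndarray) -> float:
    """Symmetric Chamfer Distance between two point clouds."""
    d_pq, _ = cKDTree(q).query(p)  # nearest neighbour in q for each point of p
    d_qp, _ = cKDTree(p).query(q)  # nearest neighbour in p for each point of q
    return d_pq.mean() + d_qp.mean()


def oracle_rri(current_pc: np.ndarray, candidate_pc: np.ndarray, gt_pc: np.ndarray) -> float:
    """Reconstruction-error improvement from adding candidate_pc, relative to the error before."""
    err_before = chamfer_distance(current_pc, gt_pc)
    err_after = chamfer_distance(np.vstack([current_pc, candidate_pc]), gt_pc)
    return (err_before - err_after) / max(err_before, 1e-9)
```

Other error metrics (e.g. completeness/accuracy at a distance threshold, or point-to-mesh distance) would slot into the same structure, which is exactly the comparison the questions above ask for.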

1.2 2. RRI Prediction Model Architectures

  • Should we explicitly project features into the frame of the candidate view, or rather use learnable Fourier features or other positional encodings? (A learnable-Fourier-feature sketch follows this list.)
  • Could we initialize the weights so that they correspond to the projection operation, but then allow fine-tuning? This is certainly a TODO for the master's thesis.
  • What input and output formulations could work best for the RRI prediction model? There are many features from the EFM pipeline that could be useful:
    • Occupancy probabilities, centerness scores, visibility scores for known points in the semi-dense SLAM point cloud…
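As one purely illustrative option for the positional-encoding question above, the sketch below shows a learnable Fourier feature module (assuming PyTorch); because the frequency matrix is a parameter, it could in principle be initialized to mimic a fixed projection and then fine-tuned, as suggested above. This is not the thesis architecture.

```python
# Hedged sketch: learnable Fourier features for encoding candidate-view poses or
# 3D point positions (assumed PyTorch; dimensions are placeholders).
import torch
import torch.nn as nn


class LearnableFourierFeatures(nn.Module):
    def __init__(self, in_dim: int = 3, num_freqs: int = 64, scale: float = 1.0):
        super().__init__()
        # The frequency matrix is a parameter, so it can be initialized to a chosen
        # basis (e.g. one mimicking a fixed projection) and then fine-tuned.
        self.freqs = nn.Parameter(scale * torch.randn(in_dim, num_freqs))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., in_dim) positions or pose parameters.
        proj = 2 * torch.pi * x @ self.freqs                 # (..., num_freqs)
        return torch.cat([proj.sin(), proj.cos()], dim=-1)   # (..., 2 * num_freqs)
```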

1.3 3. EFMs as NBV Backbone

Q3.1: Can EFMs like EVL serve as effective encoders for NBV planning?

Rationale: EFMs provide semantic understanding that present methods lack. This should enable better generalization to complex scenes. Egocentric platforms like Aria glasses, or even an iPhone with LiDAR, offer rich multi-modal data that can be leveraged to obtain a better scene representation.

Q3.2: How do we efficiently handle streaming point cloud updates using an EFM backbone?

  • How can we incrementally update the scene representation as new observations arrive?
  • How can we re-use previously computed features to avoid redundant computation? (A voxel-cache sketch follows this list.)
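One possible caching strategy is sketched below under strong simplifying assumptions (a voxel-hashed scene and a generic point encoder supplied by the caller; nothing here is taken from the EFM pipeline): re-encode only the voxels that received new points.

```python
# Hedged sketch: incremental per-voxel feature cache for streaming point clouds.
# Only voxels touched by an update are re-encoded; all other features are reused.
import numpy as np

VOXEL_SIZE = 0.1  # metres (assumed resolution)


class IncrementalSceneFeatures:
    def __init__(self, encoder):
        self.encoder = encoder   # any callable mapping an (N, 3) array to a feature vector
        self.points = {}         # voxel index (tuple) -> list of points
        self.features = {}       # voxel index (tuple) -> cached feature

    def update(self, new_points: np.ndarray) -> None:
        keys = np.floor(new_points / VOXEL_SIZE).astype(np.int64)
        dirty = set()
        for key, point in zip(map(tuple, keys), new_points):
            self.points.setdefault(key, []).append(point)
            dirty.add(key)
        # Recompute features only for voxels that changed in this update.
        for key in dirty:
            self.features[key] = self.encoder(np.asarray(self.points[key]))
```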

1.4 4. Entity-Aware NBV

Q4.1: Can we compute a per-entity reconstruction completeness score to guide NBV selection?

  • Can the RRI be expressed both per entity and per scene to improve NBV selection?
  • Can the NBV selection be conditioned on both the standard per-pose RRI and the per-entity reconstruction score for the entities of interest?
  • Multi-modal data: How does integrating other modalities (encoded bounding boxes, object semantics such as segmentation maps) alongside the point cloud / RGB-D embeddings improve NBV policy performance?

Hypothesis: Entity-aware NBV allows task-specific prioritization: the model can attend to the objects of interest and thereby improve the reconstruction quality for those objects. This is especially useful in human-in-the-loop scenarios, allowing the user to specify “I want this table scanned well”. It should also allow us to leverage pre-trained semantic and geometric understanding, since entity-level representations come with rich priors about typical shapes.

Proposed Formulation:

\[ \text{Fitness}_{\text{total}}(c, \mathcal{E}) = \sum_{e \in \mathcal{E}} w_e \cdot \text{RRI}_e(c) + \lambda \cdot \text{RRI}_{\text{scene}}(c) \]

  • \(\text{RRI}_e(c)\): Reconstruction improvement score for entity \(e\) under candidate \(c\)
  • \(w_e\): Importance weight (user-specified or uncertainty-based)
  • \(\text{RRI}_{\text{scene}}(c)\): Scene-level reconstruction improvement for candidate \(c\)
  • \(\lambda\): Trade-off weight between entity-specific and scene-level improvement
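In code, the proposed fitness is a plain weighted sum; the sketch below assumes the per-entity and scene-level RRI predictions for a candidate are already available as numbers (all names are illustrative).

```python
# Hedged sketch of Fitness_total(c, E) for one candidate view c.
def total_fitness(rri_per_entity: dict, weights: dict, rri_scene: float, lam: float = 0.5) -> float:
    entity_term = sum(weights[e] * rri_per_entity[e] for e in rri_per_entity)
    return entity_term + lam * rri_scene


# Example: prioritize the table the user asked to have scanned well.
fitness = total_fitness({"table": 0.4, "chair": 0.1}, {"table": 1.0, "chair": 0.2}, rri_scene=0.25)
```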

1.5 5. Continuous vs. Discrete Action Spaces

Q5.1: Should NBV prediction output discrete view selection or continuous pose regression?

  • How can we combine the benefits of both approaches, e.g. by sampling a batch of candidate views from a continuous distribution whose parameters are regressed, followed by discrete selection among these candidates? (See the sketch after this list.)
  • How can we ensure collision-free views in a continuous action space, e.g. by restricting candidates to known free space?
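The hybrid idea above could look roughly like the following sketch: regress the parameters of a continuous pose distribution, sample a batch of candidates, keep only those in known free space, and select the one with the highest predicted RRI. The three injected callables are placeholders for components that do not exist yet.

```python
# Hedged sketch of a hybrid NBV step: continuous proposal distribution + discrete selection.
import numpy as np


def hybrid_nbv_step(scene_features,
                    predict_pose_distribution,  # scene_features -> (mean, std) over pose params
                    predicted_rri,              # (scene_features, pose) -> float
                    is_in_free_space,           # pose -> bool, from the occupancy / free-space map
                    num_candidates: int = 100):
    # 1. Regress a continuous distribution over poses (here: a diagonal Gaussian).
    mean, std = predict_pose_distribution(scene_features)
    candidates = np.random.normal(mean, std, size=(num_candidates, len(mean)))

    # 2. Collision handling: only keep candidates inside known free space.
    valid = [pose for pose in candidates if is_in_free_space(pose)]

    # 3. Discrete selection among the sampled candidates.
    scores = [predicted_rri(scene_features, pose) for pose in valid]
    return valid[int(np.argmax(scores))]
```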

1.5.1 VIN-NBV: Classification Over Discrete Candidates

Approach:

  • Sample \(N\) random candidate poses (e.g., \(N=100\))
  • Compute RRI for each
  • Select argmax

Why they did this:

  • Avoids regression difficulties (multi-modal distributions)
  • Easier to train (classification > regression for complex distributions)
  • Guarantees collision-free poses (if sampled carefully)

Limitations:

  • Requires many samples for good coverage
  • May miss optimal view between samples
  • No smooth trajectory planning

1.5.2 GenNBV: Continuous 5-DoF Regression

Approach:

  • Predict \((x, y, z, \theta, \phi)\) directly via regression (a pose-interpretation sketch follows this list)
  • Use RL (PPO) to handle multi-modal distributions
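For reference, one common way to interpret such a 5-DoF action is sketched below: a position plus yaw/pitch converted into a camera-to-world matrix. The z-up, no-roll, looks-along-minus-z conventions are assumptions here, not GenNBV's actual definition.

```python
# Hedged sketch: interpret a 5-DoF action (x, y, z, yaw, pitch) as a camera pose.
# Assumes a z-up world, no roll, and a camera that looks along -z (OpenGL convention).
import numpy as np


def pose_from_5dof(x, y, z, yaw, pitch):
    forward = np.array([np.cos(pitch) * np.cos(yaw),
                        np.cos(pitch) * np.sin(yaw),
                        np.sin(pitch)])
    right = np.cross(forward, [0.0, 0.0, 1.0])   # degenerate if looking straight up/down
    right /= np.linalg.norm(right)
    up = np.cross(right, forward)

    cam_to_world = np.eye(4)
    cam_to_world[:3, :3] = np.stack([right, up, -forward], axis=1)  # columns: camera x, y, z axes
    cam_to_world[:3, 3] = [x, y, z]
    return cam_to_world
```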

Advantages:

  • Smooth, continuous trajectories
  • Theoretically optimal (not limited by sampling)
  • Suitable for real-time robotics

Challenges:

  • Harder to train (regression on multi-modal targets)
  • May predict colliding poses (need explicit constraints)

1.6 6. LLM Integration for NBV

  • LLM integration: How can language models translate coverage maps and semantic annotations into natural-language explanations or high-level navigation strategies, to allow for easier human-machine (NBV agent) interaction? Could state-of-the-art VLMs like Gemini be employed to obtain OBBs for entity-level representations of the scene? (A toy prompt-serialization sketch follows this list.)
    • Might this also allow high-level planning and getting qualitative feedback on the current reconstruction status?
    • Could the LLM's high-level planning capabilities be used to guide the NBV policy towards task-relevant areas?
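As a toy example of the first point, per-entity reconstruction status could be serialized into a plain-text prompt for an LLM or VLM; the field names and prompt wording below are assumptions for illustration only.

```python
# Hedged sketch: turn per-entity reconstruction status into a natural-language
# prompt so an LLM can explain coverage or propose a high-level next step.
def coverage_prompt(entity_status: dict) -> str:
    lines = [f"- {name}: about {score:.0%} reconstructed" for name, score in entity_status.items()]
    return ("Current per-entity reconstruction status:\n"
            + "\n".join(lines)
            + "\nWhich entity should the agent scan next, and why?")


print(coverage_prompt({"table": 0.35, "sofa": 0.80}))
```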