Master Thesis Research Questions

This page states the research questions for the master’s thesis phase of ARIA-NBV. The central question is:

Can ARIA-NBV perform target-conditioned, RRI-based multi-step NBV by training a finite-candidate value model $Q_{H,\theta}$ that predicts bounded cumulative target-specific RRI for a target of interest and improves endpoint target reconstruction quality, after first measuring oracle-lookahead headroom over one-step selection under a fixed acquisition budget?

The questions are tied to the dated roadmap, the RRI theory page, the finite-candidate rollout and Q_H contract, the candidate sampling and target-selection theory page, the RRI metric API, the VIN model API, and the current ASE/EVL stack [1], [2]. The thesis builds on VIN-NBV’s quality-driven candidate-ranking objective [3], while treating GenNBV [4], Hestia [5], and SceneScript [6] as comparison and extension references.

1.1 Objectives

The advisor-facing contract is deliberately compact:

validate the ASE offline-store, geometry, candidate-label, invalidity, and oracle-RRI contracts before scale-up;
define target-specific RRI under V1 OBS-SEL / PRED-Q / GT-EVAL;
train and validate the learned one-step target-conditioned scorer as the required myopic baseline and control for planning, not as a standalone research question;
validate finite candidate mixtures, branch/beam rollout support, and stochastic rollout-data diversity before the first $Q_H$ training run;
train $Q_{H,\theta}$ as a finite-candidate value model, first implemented as a candidate-to-state query Transformer over target, ray-aware map, local EVL, history, budget, and candidate tokens;
treat scale as an RQ4 support and evidence protocol: move from small trusted subsets toward ASE-wide and external mesh/oracle-compatible settings without replacing target RRI by proxy objectives;
treat online discrete $Q_H$ as RQ5 after stable offline $Q_H$, and continuous target-then-pose actor-critic as RQ6 after online finite-candidate evidence;
evaluate all learned selected actions with the oracle;
report scenes, snippets, targets, trajectories, rollout seeds, transitions, invalid gaps, and coverage gaps separately as shared evidence constraints for empirical claims.

1.2 Definitions

The core terms used by the research questions are maintained in the shared glossary, which is generated from docs/typst/shared/glossary.typ for Quarto, Typst, and KG ingestion.

Related implementation nodes remain VINv3, EVL literature notes, and candidate generation.

1.3 Thesis Boundary

The thesis should remain explicit about what is already implemented, what must be completed before scale-up, what is the hard quantitative thesis core, and what remains lower-priority escalation or future work. This boundary protects the proposal and final thesis from presenting ARIA-NBV as a finished continuous-control or real-device RL policy before the evidence exists.

Status	Work included
Already implemented substrate	Scene-level oracle RRI, immutable VIN offline-store paths, VINv3-style one-step scoring, candidate generation, rollout scaffolding, rollout Zarr record/store path, and Rerun offline inspection.
Prerequisite and evidence protocol	Proposal freeze, M1 contract report, V0/V1 target contracts, candidate-label/frame guards, invalidity masks/reasons, rollout/Q storage schema, LRZ deterministic sharding/storage gates, scene-level splits, coverage reporting, and ablation reporting.
Hard quantitative thesis core	Oracle target-task sampling for rollout/data-generation labels; target-conditioned learned one-step scorer as the myopic control; mixed target-aware candidate sets; random-valid, oracle-greedy/lookahead, and oracle-scored temperature-softmax rollouts; finite-candidate $Q_{H,\theta}$ value model, first implemented as candidate-to-state queries; explicit scale reporting.
RQ4 support and scale research	Candidate/rollout support, ASE-wide finite-candidate offline scale, and external mesh/oracle-compatible expansion only when target-RRI supervision is preserved.
RQ5 online bridge	Online discrete $Q_H$ over the same finite-candidate action contract after offline $Q_H$ is trusted.
RQ6 lower-priority escalation	Continuous target-then-pose actor-critic and simulator-backed control after online finite-candidate evidence exists; quantitative continuous-control results are time-permitting, not required for the finite-candidate thesis result.
Future-work extensions	Gumbel-Top-k as preferred later rollout-diversity evidence when time permits, SceneScript-style semantic memory, VLM/global planning, real-device guidance, human-in-the-loop AR, imitation-learning variants beyond planned RRI + $Q_H$, and proxy-objective comparisons.

1.4 Proposal Freeze Gate

Before M1 scale-up, the proposal, roadmap, research questions, and literature pages must state the same advisor-facing contract:

ARIA-NBV uses target-specific RRI as the thesis utility signal, with scene RRI and acquisition cost reported separately;
oracle target-task sampling owns the first rollout/data-generation target pools and labels; V1 OBS-SEL / PRED-Q / GT-EVAL is mandatory before making actor-visible deployable-input claims;
target-conditioned finite-candidate $Q_H$ value learning is a hard thesis deliverable after the M1/M3/M4 gates, with candidate-to-state cross-attention as the first implementation and candidate-candidate self-attention only as an ablation; any blocker must be documented explicitly;
scaling beyond the current ASE mesh subset must preserve mesh/oracle target-RRI supervision for thesis-grade evidence; proxy objectives remain contrast or discussion signals;
RQ5 online discrete $Q_H$ is the first bridge after stable offline finite-candidate evidence; RQ6 continuous target-then-pose actor-critic is a later headroom test, not a substitute for the M5 $Q_H$ result;
proposal-critical citations must resolve through docs/references.bib to primary paper, dataset, or library metadata rather than generated comments or Wikipedia references.

1.5 RQ1 - Method Objective and Reward Contract

Question. Can ARIA-NBV learn target-conditioned, finite-candidate, RRI-based multi-step NBV in complex indoor ASE scenes, improving endpoint quality of a selected target under a fixed view budget while keeping the training return, endpoint evaluation metric, and acquisition costs separate?

Working hypothesis. Endpoint target-quality gain should be the main equal-budget trajectory metric. The default finite-horizon return for rollout ranking and $Q_H$ training is root-normalized additive target gain, while state-relative target RRI remains a one-step diagnostic and VIN-compatible label. Log target-error gain and scalarized motion or feasibility costs are explicit ablations after the quality-only baseline is stable, not silent replacements for the root-normalized reward.

For a rollout rooted at state $s_0$, target $e$, horizon $H$, and valid action sequence $\tau=(a_0,\ldots,a_{H-1})$, let $q_t=q_{t,a_t}$ be the selected candidate view and $\mathcal{P}_{q_t}$ its fused candidate geometry. The accumulated counterfactual point state is

\[ \mathcal{P}_t = \mathcal{P}_0\cup\bigcup_{k=0}^{t-1}\mathcal{P}_{q_k}, \qquad t=1,\ldots,H, \]

so $\mathcal{P}_H=\mathcal{P}_0\cup\bigcup_{k=0}^{H-1}\mathcal{P}_{q_k}$. Let $C_e(\mathcal{P})$ denote the oracle-only crop operator that filters an accumulated point set to the matched target region used for target-RRI labels and evaluation. For V1 actor inputs, this crop is not actor-visible. The target-cropped point-mesh oracle error is

\[ \Delta_t^e= d(C_e(\mathcal{P}_t),M_e) = D_{\mathcal{P}\to M,t}^e + D_{M\to\mathcal{P},t}^e, \]

where $D_{\mathcal{P}\to M,t}^e$ is point-to-mesh accuracy and $D_{M\to\mathcal{P},t}^e$ is mesh-to-point completeness for the cropped points and the matched target-specific ground-truth surface $M_e$. The target surface is used only for oracle supervision and evaluation.
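The error definition above can be illustrated with a minimal sketch. It makes several assumptions that are not part of the thesis contract: the target surface $M_e$ is approximated by sampled surface points, the crop operator $C_e$ is a hypothetical axis-aligned box filter, and both directional terms are aggregated as mean nearest-neighbor distances. The actual oracle may use a different distance and crop.

```python
import math


def crop_to_box(points, box_min, box_max):
    """Hypothetical stand-in for C_e: keep points inside an axis-aligned box."""
    return [
        p for p in points
        if all(lo <= c <= hi for c, lo, hi in zip(p, box_min, box_max))
    ]


def mean_nearest_distance(source, target):
    """Mean distance from each source point to its nearest target point."""
    if not source or not target:
        return math.inf  # assumption: an empty crop has unbounded error
    return sum(min(math.dist(s, t) for t in target) for s in source) / len(source)


def target_error(points, surface_samples, box_min, box_max):
    """Delta^e as accuracy (P -> M) plus completeness (M -> P) on cropped points."""
    cropped = crop_to_box(points, box_min, box_max)
    accuracy = mean_nearest_distance(cropped, surface_samples)
    completeness = mean_nearest_distance(surface_samples, cropped)
    return accuracy + completeness


# Toy target: two surface samples; the far point (5, 5, 5) is cropped away.
surface = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
box_min, box_max = (-1.0, -1.0, -1.0), (2.0, 2.0, 2.0)
points_0 = [(0.0, 0.0, 0.1), (5.0, 5.0, 5.0)]
points_1 = points_0 + [(1.0, 0.0, 0.1)]  # union with one selected view

error_0 = target_error(points_0, surface, box_min, box_max)
error_1 = target_error(points_1, surface, box_min, box_max)
print(error_0, error_1)
```

In this toy case the added view covers the second surface sample, so the completeness term drops and the total error decreases.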

The endpoint metric is

\[ J_{e,\Delta}^{(H)}(\tau)= \frac{\Delta_0^e-\Delta_H^e}{\Delta_0^e+\varepsilon}. \]

Conceptually, $J_{e,\Delta}^{(H)}$ is the fraction of the initial target error removed after $H$ selected views. It is relative to the rollout root $\mathcal{P}_0$ and is comparable only under a fixed horizon or fixed acquisition budget.

The default immediate rollout/$Q_H$ reward is root-normalized against the initial target error:

\[ r_{t,\mathrm{root}}^e = \frac{\Delta_t^e-\Delta_{t+1}^e}{\Delta_0^e+\varepsilon}. \]

The finite-horizon learning return is

\[ G_0^{(H)}(\tau)= \sum_{t=0}^{H-1}\gamma^t r_{t,\mathrm{root}}^e. \]

With $\gamma=1$, the sum telescopes to $(\Delta_0^e-\Delta_H^e)/(\Delta_0^e+\varepsilon)$, so the cumulative reward equals the endpoint gain $J_{e,\Delta}^{(H)}(\tau)$ up to floating-point rounding. The return is used for rollout ranking and Bellman-style training; endpoint gain is still reported as the fixed-budget evaluation metric. The state-relative one-step RRI $(\Delta_t^e-\Delta_{t+1}^e)/(\Delta_t^e+\varepsilon)$ remains stored as a diagnostic and VIN compatibility label, not as the default $Q_H$ reward. Valid actions may have negative reward if they worsen target distance; invalid candidates are hard-masked constraints, not low-RRI examples.
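As a minimal sketch of the reward contract, the following computes the root-normalized rewards, the discounted return, the endpoint gain, and the state-relative diagnostic from a list of target errors $\Delta_0^e,\ldots,\Delta_H^e$. The error values and the default `eps` are illustrative assumptions, not thesis settings.

```python
def root_normalized_rewards(errors, eps=1e-6):
    """r_t = (Delta_t - Delta_{t+1}) / (Delta_0 + eps) for t = 0..H-1."""
    root = errors[0] + eps
    return [(errors[t] - errors[t + 1]) / root for t in range(len(errors) - 1)]


def discounted_return(rewards, gamma=1.0):
    """G_0 = sum_t gamma^t * r_t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


def endpoint_gain(errors, eps=1e-6):
    """J = (Delta_0 - Delta_H) / (Delta_0 + eps)."""
    return (errors[0] - errors[-1]) / (errors[0] + eps)


def state_relative_rri(errors, eps=1e-6):
    """One-step diagnostic: (Delta_t - Delta_{t+1}) / (Delta_t + eps)."""
    return [
        (errors[t] - errors[t + 1]) / (errors[t] + eps)
        for t in range(len(errors) - 1)
    ]


# Illustrative errors for H = 3; the second view slightly worsens the target.
errors = [0.8, 0.5, 0.55, 0.2]
rewards = root_normalized_rewards(errors)

print(rewards)                          # the middle reward is negative
print(discounted_return(rewards, 1.0))  # matches endpoint_gain(errors)
print(discounted_return(rewards, 0.9))  # differs once gamma < 1
print(endpoint_gain(errors))
print(state_relative_rri(errors))
```

The undiscounted return and the endpoint gain agree because the per-step numerators telescope, which is why discounting is the only thing that separates the training return from the evaluation metric in this formulation.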

The log-gain companion remains an explicit ablation:

\[ L_e^{(H)}(\tau)= \log(\Delta_0^e+\varepsilon)-\log(\Delta_H^e+\varepsilon) = \sum_{t=0}^{H-1} \left[ \log(\Delta_t^e+\varepsilon) - \log(\Delta_{t+1}^e+\varepsilon) \right]. \]

This telescoping form may reduce stage-scale sensitivity, but unlike the root-normalized gain it is not bounded above by 1, and it is sensitive to $\varepsilon$ for near-solved targets. The exact $\gamma$, $\varepsilon$, clipping policy, near-solved-target eligibility rule, and scalarized ablation such as $G_0^{(H)}(\tau)-\lambda C(\tau)$ remain supervisor-facing open details.
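The $\varepsilon$ sensitivity can be made concrete with a small numeric sketch. The error sequence and the two `eps` values are illustrative assumptions chosen to show a near-solved target; they are not proposed thesis settings.

```python
import math


def log_gain(errors, eps):
    """L = log(Delta_0 + eps) - log(Delta_H + eps)."""
    return math.log(errors[0] + eps) - math.log(errors[-1] + eps)


def stepwise_log_gain(errors, eps):
    """Same quantity written as the telescoping per-step sum."""
    return sum(
        math.log(errors[t] + eps) - math.log(errors[t + 1] + eps)
        for t in range(len(errors) - 1)
    )


def root_gain(errors, eps):
    """Root-normalized endpoint gain for comparison."""
    return (errors[0] - errors[-1]) / (errors[0] + eps)


# A near-solved target: the final error is close to zero.
near_solved = [0.8, 0.1, 1e-4]

for eps in (1e-3, 1e-6):
    print(eps, log_gain(near_solved, eps), root_gain(near_solved, eps))
```

For this sequence the log gain changes by more than 2 nats between the two `eps` values, while the root-normalized gain stays just below 1 in both cases.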

Evaluation nodes. Report endpoint target-quality gain, cumulative target-root gain, diagnostic target RRI, optional log target-error gain, scene RRI, view count, path length, invalid-action rate, and runtime. Different view budgets should be compared by fixed-budget tables or quality-cost curves rather than raw endpoint gain alone.

1.6 RQ2 - Offline Finite-Candidate Q_H Planning

Question. How does the trained finite-candidate value model $Q_{H,\theta}$, first implemented as candidate-to-state query attention, perform against learned one-step target scoring, oracle greedy/lookahead planning, and oracle-scored stochastic rollout references under the same acquisition and candidate budgets?

Working hypothesis. Bounded planning should improve target endpoint gain especially when the locally best target view is not the best first step, for example around occlusion, future invalidity, history-dependent candidate regeneration, or when the target becomes visible after an intermediate move. M5 therefore first estimates oracle-lookahead headroom:

\[ \Delta_{\mathrm{look}} = J_{e,\Delta}^{(H)}(\pi_{\mathrm{oracle\text{-}look}}) - J_{e,\Delta}^{(H)}(\pi_{\mathrm{oracle\text{-}1}}). \]

Here $\pi_{\mathrm{oracle\text{-}look}}$ selects the first action of a bounded $H$-step oracle search that maximizes the same root-normalized return used for $Q_H$ training; endpoint gain evaluates the resulting trajectory. If $\Delta_{\mathrm{look}}\approx 0$, the thesis reports no measurable non-myopic headroom for the evaluated candidate distribution, horizon, branch factor, target set, and split. This is a setup-specific negative result, not a general claim that target-specific RRI is myopic. If $\Delta_{\mathrm{look}}>0$, $Q_{H,\theta}$ should learn enough bounded return structure from ASE oracle rollout traces to recover measurable headroom over the learned one-step target-conditioned scorer.
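A toy sketch of the headroom measurement follows. Everything in it is hypothetical: three candidate views `A`, `B`, `C`, a hand-written `toy_error` in which `C` only pays off once the setup view `B` has been fused, exhaustive search in place of a bounded branch schedule, and $\gamma=1$ so that maximizing the root-normalized return is the same as minimizing the final error.

```python
from itertools import permutations

EPS = 1e-6
VIEWS = ("A", "B", "C")


def toy_error(selected):
    """Hypothetical oracle error after fusing a set of selected views."""
    chosen = set(selected)
    error = 1.0
    if "A" in chosen:
        error -= 0.3  # locally best single view
    if "B" in chosen:
        error -= 0.1  # weak on its own, but a useful setup view
    if "C" in chosen:
        error -= 0.5 if "B" in chosen else 0.05
    return error


def endpoint_gain(sequence):
    root = toy_error(())
    return (root - toy_error(sequence)) / (root + EPS)


def greedy_rollout(horizon):
    """One-step oracle: always take the view with the lowest next error."""
    chosen = []
    for _ in range(horizon):
        remaining = [v for v in VIEWS if v not in chosen]
        chosen.append(min(remaining, key=lambda v: toy_error(chosen + [v])))
    return tuple(chosen)


def lookahead_rollout(horizon):
    """Replan each step: search all tails, execute only the first action."""
    chosen = []
    for step in range(horizon):
        remaining = [v for v in VIEWS if v not in chosen]
        tails = permutations(remaining, horizon - step)
        best_tail = min(tails, key=lambda tail: toy_error(chosen + list(tail)))
        chosen.append(best_tail[0])
    return tuple(chosen)


greedy = greedy_rollout(2)
look = lookahead_rollout(2)
headroom = endpoint_gain(look) - endpoint_gain(greedy)
print(greedy, look, headroom)
```

Here the greedy rollout spends its first view on `A`, while the lookahead rollout accepts the weaker setup view `B` to unlock `C`, which gives a positive $\Delta_{\mathrm{look}}$ for this constructed case only.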

Q_H formulation. $Q_H$ is the glossary-level concept; the learned network in the thesis is the finite-candidate value model $Q_{H,\theta}(s_t^{\mathrm{cf0}},z_e,q_{t,i})$, first implemented as compact candidate-to-state query attention. The first actor-visible rollout state is $s_t^{\mathrm{cf0}}=(F_0^{\mathrm{EVL}},\mathcal{P}_t^{\mathrm{semi/fused}},M_t^{\mathrm{ray}},F_t^{\mathrm{DINO@pt}},\mathcal{Q}_t,m_t,\rho_t,z_e,h_t,b_t)$: root local EVL context, accumulated semidense/fused counterfactual points, a ray-aware occupied/free/unknown memory updated by selected geometry, an optional visibility-gated logged-frame DINO-on-point feature bank, finite candidate table, validity mask, invalid-reason metadata, actor-visible target descriptor, selected-view history, and remaining budget. The first $Q_{H,\theta}$ result may start with the implemented VIN/EVL heads plus semidense geometry only; the first planned representation upgrade is ray-aware geometry and candidate queries, while $F_t^{\mathrm{DINO@pt}}$ is a later appearance ablation, not required evidence before the baseline. GT meshes, GT crops, and oracle RRI values are supervision and evaluation only. The model applies hard validity masks and emits one bounded-horizon value per finite candidate-table action in continuous return units; CORAL remains scoped to the one-step scorer unless an explicit ordinal ablation is reported. The formal ArgTopK recursion, branch-factor schedule, replay row fields, and DQN/Double-DQN backup shapes live on the finite-candidate rollout and Q_H contract, while ArgTopK, beam, and stochastic branching knobs are treated as RQ4 rollout-support controls.

Required comparisons.

random-valid selection as a lower reference;
one-step oracle greedy and bounded oracle-RRI lookahead under equal candidate and view budget;
learned one-step target-conditioned scorer as the required myopic model control;
oracle-scored temperature-softmax traces for rollout data diversity;
Gumbel-Top-k traces as preferred later diversity evidence when schedule permits;
$Q_{H,\theta}$ trained from ASE oracle rollout traces, with all selected actions oracle-evaluated.

Gate ordering. Deterministic oracle-greedy and bounded oracle-lookahead rollouts must be trusted before stochastic branching is used as training data. Random-valid, oracle-greedy/lookahead, and oracle-scored temperature-softmax traces are mandatory before the first $Q_{H,\theta}$ training run. Gumbel-Top-k, CQL [7], BCQ [8], Decision Transformer [9], and IQL [10] remain later evidence or ablation references until the M5 result is stable.

One-step scorer evidence gate. Before $Q_{H,\theta}$ is interpreted as a planning result, the learned one-step target-conditioned scorer must have held-out ranking evidence, oracle-evaluated model-selected rollouts, calibration and stage-shift diagnostics, and Rerun visualizations of representative successes and failures. This scorer is not the final thesis claim, but it is the required myopic control.

Success bar. Learned $Q_{H,\theta}$ selected actions must be oracle-evaluated under equal acquisition and candidate budgets. The primary comparison is endpoint target-quality gain; the secondary training-aligned comparison is cumulative root-normalized target gain. When $\Delta_{\mathrm{look}}>0$, $Q_{H,\theta}$ should beat the learned one-step target-conditioned scorer and one-step model/greedy selection, while bounded oracle lookahead remains the upper reference and random-valid remains the lower reference. Also report the recovered headroom fraction:

\[ \eta_Q = \frac{ J_{e,\Delta}^{(H)}(\pi_Q) - J_{e,\Delta}^{(H)}(\pi_{\mathrm{learned\text{-}1}}) }{ J_{e,\Delta}^{(H)}(\pi_{\mathrm{oracle\text{-}look}}) - J_{e,\Delta}^{(H)}(\pi_{\mathrm{learned\text{-}1}}) +\varepsilon }. \]

Report $\eta_Q$ only when the denominator is positive and larger than a predefined minimum effect threshold; otherwise report raw endpoint gains and classify the comparison as no measurable headroom. The exact effect-size threshold remains an advisor-facing open decision.
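The reporting rule for $\eta_Q$ can be sketched as a guarded ratio. The `min_effect` argument stands in for the still-open effect-size threshold and is a placeholder, as are the example gains.

```python
def recovered_headroom(j_q, j_learned_1, j_oracle_look, min_effect, eps=1e-6):
    """Return eta_Q, or None when there is no measurable headroom to report."""
    headroom = j_oracle_look - j_learned_1
    if headroom <= 0 or headroom <= min_effect:
        return None  # report raw endpoint gains instead
    return (j_q - j_learned_1) / (headroom + eps)


# Q_H recovers about half of the gap between the one-step scorer and lookahead.
print(recovered_headroom(0.5, 0.4, 0.6, min_effect=0.05))

# The gap is below the threshold, so eta_Q is not reported.
print(recovered_headroom(0.41, 0.40, 0.42, min_effect=0.05))
```

Returning `None` rather than a number keeps small-denominator cases out of aggregate tables, where a noisy ratio could otherwise dominate the mean.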

1.7 RQ3 - Actor-Visible Target Representation

Question. Which actor-visible target descriptor and matching protocol are sufficient for target-conditioned one-step scoring and $Q_{H,\theta}$ without leaking ground-truth target annotations into the actor input?

Working hypothesis. RQ3 owns target inputs, not the Q-network architecture. The first path should use observed/predicted OBB geometry plus support signals. A compact actor-visible crop descriptor is the first planned ablation and may be promoted if it is cheap and leakage-safe. The strongest near-term crop descriptor candidate is semidense/fused target support with a target/context shell and ray-aware support; compressed DINO-on-point features are added only after logged visibility gating works. Entity tokens remain later ablations.

Main protocol. The thesis protocol is OBS-SEL / PRED-Q / GT-EVAL. Target selection and model inputs use actor-visible predicted or observed descriptors. GT target crops and GT OBBs define labels and evaluation only. V0 may use GT OBB input for sanity or upper-bound runs; oracle target-task sampling remains the first rollout label path, and V1 is mandatory before deployable actor-input claims.

Actor-visible target hypotheses are sourced from the EFM3D/EVL stack or the observed support derived from it: predicted or tracked OBBs, semantic class probabilities, confidence-style scores, projected area, voxel support, semidense/fused point support, ray-aware target/context support, and optional visibility-gated logged DINO-on-point descriptors. Cube R-CNN-style detections can be evaluated as auxiliary target proposals or ROI descriptors after ARIA/ASE adaptation; they are not the default scene memory.

First target input. The minimum first-path descriptor is the observed/predicted OBB center, extents, orientation, class, confidence, projected area, relative pose, semidense point support, and EVL support. The first ablation adds a compact crop descriptor computed only from actor-visible spatial data inside the observed/predicted crop, such as pooled EVL/voxel features or point statistics. GT crops, GT mesh geometry, and GT OBBs are never used to build V1 actor-visible descriptors.

Descriptor source	Role
GT OBB	V0 sanity or upper-bound input only; never the main V1 actor input.
Predicted/observed OBB	Mandatory V1 actor-visible target record.
Crop descriptor	First V1 ablation: pooled EVL/voxel features, semidense/fused point statistics, ray-aware target/context support, or visibility-gated logged DINO-on-point descriptors inside the predicted/observed crop.
Entity token	Later learned target/entity embedding ablation.
Target point	Candidate-generation look-at point, such as OBB center, support centroid, missing-surface centroid, or expected-RRI centroid.

Descriptor source and descriptor encoding stay separate. Encoders may expose relative target translation, extents, 6D/R6D-style relative orientation, class-probability or semantic embeddings, projected-area/support scalars, and Fourier-feature or MLP encoders for continuous geometry. Crop descriptors enter only from actor-visible spatial evidence when the OBB-level contract is stable.

First target metric. The first target-aware oracle metric is GT-OBB-cropped target RRI. Main thesis claims require graduating from V0 GT target input to V1 observed/predicted target input before target-conditioned scorer or $Q_{H,\theta}$ results are reported as actor-visible.

Matching rule. Observed targets are matched to GT target labels by compatible class, OBB IoU, visibility/support, projected area, and semidense or EVL point support. Let $\mu(\hat e,e)$ denote this compact match score between an actor-visible target record $\hat e$ and a GT target candidate $e$:

\[ e^\star=\arg\max_{e\in\mathcal E}\mu(\hat e,e), \qquad \mu_1=\mu(\hat e,e^\star), \qquad \mu_2=\max_{e\in\mathcal E\setminus\{e^\star\}}\mu(\hat e,e). \]

A match is accepted iff

\[ \mu_1\ge \tau_\mu \qquad\text{and}\qquad \mu_1-\mu_2\ge \tau_{\mathrm{gap}}. \]

Unmatched targets, unsupported targets, and ambiguous multi-object matches are target-invalid protocol cases, not low target-RRI examples. Any criterion beyond compatible class, OBB IoU, visibility/support, projected area, and point support must be named from an observed M3 failure mode.
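The acceptance rule can be sketched as follows, assuming the match scores $\mu(\hat e,e)$ have already been computed per GT candidate. The thresholds and target identifiers are placeholders, and treating a single-candidate pool as having $\mu_2=-\infty$ is an assumption the text does not settle.

```python
def match_target(scores, tau_mu, tau_gap):
    """Return the accepted GT target id, or None for a target-invalid case.

    scores maps each GT candidate id to mu(e_hat, e).
    """
    if not scores:
        return None  # unmatched: no GT candidate at all
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_id, mu_1 = ranked[0]
    # Assumption: with a single candidate the gap test passes trivially.
    mu_2 = ranked[1][1] if len(ranked) > 1 else float("-inf")
    if mu_1 >= tau_mu and mu_1 - mu_2 >= tau_gap:
        return best_id
    return None


TAU_MU, TAU_GAP = 0.6, 0.2  # placeholder thresholds

print(match_target({"chair_1": 0.82, "chair_2": 0.41}, TAU_MU, TAU_GAP))  # accepted
print(match_target({"chair_1": 0.82, "chair_2": 0.78}, TAU_MU, TAU_GAP))  # ambiguous
print(match_target({"chair_1": 0.30}, TAU_MU, TAU_GAP))                   # too weak
```

A `None` result should be routed to the target-invalid bookkeeping rather than written as a low target-RRI label.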

Connected implementation surfaces. Target fields should enter through data-handling containers, then flow into the VIN model API, $Q_{H,\theta}$ replay rows, and diagnostics rather than through ad hoc dictionaries.

1.8 RQ4 - Candidate, Rollout, and Scale Support

Question. Do mixed target-centric and default exploration candidates, controlled branch/beam rollout support, and scale-controlled replay evidence improve target-conditioned NBV relative to pure generic sampling and pure target-point sampling?

Working hypothesis. A mixed sampler should beat both a pure generic shell and a pure TARGET_POINT sampler when target viewpoints need diversity: target-centric candidates focus on the object or region of interest, while default candidates preserve exploratory views and reduce overfitting to a single look-at pattern.

Candidate families. The thesis-core vocabulary is the package-supported finite candidate table exposed by CandidateViewGeneratorConfig, SamplingStrategy, the code-level ViewDirectionMode enum in aria_nbv.pose_generation.types, and CandidateSamplingResult:

TARGET_POINT candidates that look at a predicted/observed target point or OBB center;
RADIAL_AWAY and RADIAL_TOWARDS shell candidates around the current reference pose for default exploration and augmentation;
FORWARD_RIG candidates that preserve trajectory-forward view direction;
position sampling with UNIFORM_SPHERE and FORWARD_POWERSPHERICAL;
bounded yaw, pitch, and roll jitter superimposed on the chosen base orientation;
categorical-probability mixtures over those supported families, with candidate strategy provenance retained per sampled row.

Frontier, missing-surface, expected-RRI heatmap, and GT look-at samplers are not first-path thesis candidates unless they are implemented, validated, and clearly marked as upper-bound or stretch evidence. Target-centric candidate generation uses the V1 observed/predicted target record unless the run is explicitly marked as V0 or upper-bound.
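A categorical mixture with per-row provenance can be sketched without touching the package API. The function below is hypothetical and does not use `CandidateViewGeneratorConfig` or `ViewDirectionMode`; it only borrows the family names as plain strings, and the mixture weights are illustrative.

```python
import random
from collections import Counter

# Illustrative mixture weights over the supported view-direction families.
MIXTURE = {
    "TARGET_POINT": 0.5,
    "RADIAL_AWAY": 0.2,
    "RADIAL_TOWARDS": 0.2,
    "FORWARD_RIG": 0.1,
}


def sample_strategy_mixture(weights, num_candidates, seed):
    """Draw one family per candidate row and keep it as provenance."""
    rng = random.Random(seed)  # seeded so candidate sets are reproducible
    families = list(weights)
    draws = rng.choices(
        families,
        weights=[weights[name] for name in families],
        k=num_candidates,
    )
    return [
        {"candidate_index": index, "strategy": name}
        for index, name in enumerate(draws)
    ]


rows = sample_strategy_mixture(MIXTURE, num_candidates=64, seed=7)
print(Counter(row["strategy"] for row in rows))
```

Keeping the strategy name on every row is what later allows valid fraction, target visibility, and RRI distributions to be broken down by candidate family.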

Evaluation nodes. Report candidate realism, valid fraction, invalid reason distribution, target visibility, target RRI distribution, scene RRI distribution, selected-view diversity, and strategy provenance for every candidate set. Hard masks and reason codes are required before any scorer or $Q_{H,\theta}$ consumes the table.

Rollout-support knobs. Branch factor, beam width, and stochastic branch selection are dataset-support controls for $Q_{H,\theta}$, not replacement objectives. Deterministic ArgTopK expands a bounded set of valid high-score candidates; oracle-scored temperature-softmax and later Gumbel-Top-k make the branching stochastic to widen rollout support without changing the target-RRI label. These knobs belong with candidate/action-space evidence because they decide which finite action rows the value model can learn from.
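The two stochastic branching modes can be sketched over a masked score vector. This is an illustration under assumptions: the scores, mask, and temperature are made up, and the Gumbel-Top-k variant uses the standard trick of adding Gumbel noise to the tempered scores and keeping the top `k`, which samples `k` distinct candidates without replacement.

```python
import math
import random


def masked_softmax(scores, valid, temperature):
    """Temperature softmax that assigns zero probability to invalid rows."""
    tempered = [s / temperature for s, ok in zip(scores, valid) if ok]
    if not tempered:
        raise ValueError("no valid candidate to branch on")
    peak = max(tempered)  # subtract the max for numerical stability
    weights = [
        math.exp(s / temperature - peak) if ok else 0.0
        for s, ok in zip(scores, valid)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def gumbel_top_k(scores, valid, k, temperature, rng):
    """Pick up to k distinct valid candidates via Gumbel-perturbed scores."""
    perturbed = []
    for index, (score, ok) in enumerate(zip(scores, valid)):
        if not ok:
            continue  # invalid rows never enter the branch set
        uniform = max(rng.random(), 1e-300)  # avoid log(0)
        gumbel = -math.log(-math.log(uniform))
        perturbed.append((score / temperature + gumbel, index))
    perturbed.sort(reverse=True)
    return [index for _, index in perturbed[:k]]


scores = [0.3, 0.9, 0.1, 0.5]        # illustrative oracle scores
valid = [True, False, True, True]    # candidate 1 is hard-masked

probs = masked_softmax(scores, valid, temperature=0.5)
picks = gumbel_top_k(scores, valid, k=2, temperature=0.5, rng=random.Random(0))
print(probs, picks)
```

In both modes the mask is applied before any randomness, so branching widens support only among valid rows and leaves the target-RRI labels untouched.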

Scale-support ladder. RQ4 also owns support growth before RQ5 online interaction:

small trusted subset for geometry, target matching, invalidity, and replay correctness;
scene-level held-out subset or full ASE GT-mesh subset with exact reporting of scenes, snippets, targets, trajectories, rollout seeds, transitions, and missing coverage gaps;
broader external mesh/oracle-compatible datasets or simulator substrates only if they preserve target-specific point-mesh supervision and actor-visible target inputs.

The implementation budget and concrete compute envelope are internal planning details. Public evidence must report achieved coverage, missing coverage, split boundaries, oracle throughput, and whether comparisons reuse identical roots, candidates, and budgets.

1.9 Shared Scale Evidence Beyond Small Trusted Subsets

Protocol question. Which scaling path preserves target-RRI supervision while moving from small trusted subsets to thesis-grade evidence and, if needed, beyond the limit of 100 mesh-supervised ASE scenes?

Working hypothesis. Scaling should proceed in causal order. First, scale finite-candidate offline labels, rollout roots, targets, candidates, and transitions inside the ASE mesh-supervised subset. Second, if the limit of 100 GT-mesh scenes becomes the bottleneck, expand only to external substrates that can provide comparable mesh/oracle target-RRI labels. Third, and only then, hand off to RQ5 online discrete $Q_H$ interaction to test whether additional oracle interaction improves the finite-candidate policy. Coverage, uncertainty, or semantic proxy labels are valid contrast signals, but they do not replace mesh/oracle target RRI for thesis-grade comparisons.

Scale ladder.

small trusted subset for geometry, target matching, invalidity, and replay correctness;
scene-level held-out subset or full ASE GT-mesh subset with exact reporting of scenes, snippets, targets, trajectories, rollout seeds, transitions, and missing coverage gaps;
broader external mesh/oracle-compatible datasets or simulator substrates if they preserve target-specific point-mesh supervision and actor-visible target inputs;
RQ5 online discrete $Q_H$ in the ASE or external mesh/oracle loop after offline finite-candidate evidence is stable.

The implementation budget and concrete compute envelope are internal planning details. Public evidence must instead report achieved coverage, missing coverage, split boundaries, oracle throughput, and whether comparisons reuse identical roots, candidates, and budgets. Scale axes are reported separately: scenes, snippets, anchor poses per trajectory, candidate sets, candidate-distribution variants, targets per snippet, rollout seeds, transitions, and stage/calibration bins. Architecture, rollout, scorer, or $Q_H$ conclusions must not be compared across runs that silently change scene, snippet, target, or candidate coverage.

1.10 RQ5 - Online Discrete Q_H Bridge

Question. After the planned offline finite-candidate $Q_H$ result, does online interaction over the same discrete finite-candidate action contract improve endpoint gain or recovered headroom enough to justify the extra data collection loop?

Working hypothesis. Online discrete $Q_H$ is the first post-offline bridge because it preserves the finite action table, hard masks, target-RRI reward, and oracle re-evaluation contract. It is an RQ5 extension after stable offline $Q_H$, not a replacement for the offline evidence gate.

Online-discrete gate.

run online discrete $Q_H$ with the same finite action contract and oracle re-evaluation;
compare offline fitted $Q_H$ against online-updated finite-candidate policy under matched roots, targets, candidates, and acquisition budget;
report whether online interaction changes endpoint gain, cumulative root-normalized target gain, invalid-action handling, or recovered headroom.

1.11 RQ6 - Continuous and Simulator Escalation

Question. After finite-candidate and online-discrete evidence, do continuous or hierarchical target-then-pose policies have measurable headroom over the best finite-candidate policy under the same target-RRI objective?

Working hypothesis. Continuous action spaces require online training first. Continuous policies are plausible improvements over simple algorithmic candidate generation because they are not restricted to the hand-designed finite candidate table, but they should not be used to rescue a failed target-RRI, rollout, offline $Q_H$, or online-discrete contract. Hestia-style hierarchy [5] and GenNBV-style continuous control [4] are relevant design references. Imitation-learning variants are deferred beyond the planned RRI + $Q_H$ approach.

Escalation ladder.

continuous target-then-pose actor-critic only after online interaction, reward speed, invalid-action handling, and evaluation are thesis-grade;
simulator-backed online RL with Habitat, Isaac, or external datasets only if the mesh/oracle target-RRI supervision contract can be preserved;
optional semantic-global planning over grounded portals, frontiers, entities, and regions with SceneScript-style or open-vocabulary entity memory [6] and real-device guidance as future work.

1.12 Shared Evidence and Protocol Constraints

Invalidity remains a shared constraint that makes every empirical answer interpretable. Scale support is part of RQ4, but coverage reporting remains a shared evidence contract across all empirical RQs.

Invalidity protocol. Collision, out-of-bounds poses, no-depth cases, bad frusta, outside-EVL-extent cases, and candidates that cannot produce a meaningful oracle/evaluation sample remain hard masks plus explicit reason codes. Low immediate target visibility or support is a diagnostic or label-quality flag, not a mask, unless evaluation is impossible; hard-masking it would remove valid setup actions and make non-myopic planning myopic by construction. Invalid cases are never encoded as the lowest RRI bin. $Q_{H,\theta}$ masks invalid candidates before argmax, softmax, loss targets, and bootstrap maximization. Report invalid fraction, invalid-reason distribution, target-visible fraction, rank metrics with and without masks, and the effect of mask handling on selected-action oracle evaluation. Whether invalidity should become a learned signal through validity heads, scalar penalties, or continuous-action feasibility models remains a supervisor-facing escalation decision.
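Masking before argmax and before bootstrap maximization can be sketched as below. The backup shown is a plain max backup chosen for illustration, not the DQN/Double-DQN shape defined on the rollout contract page, and treating a state with no valid next action as terminal is an assumption.

```python
def masked_argmax(q_values, valid):
    """Index of the best valid candidate, or None if every row is masked."""
    best = None
    for index, (value, ok) in enumerate(zip(q_values, valid)):
        if ok and (best is None or value > q_values[best]):
            best = index
    return best


def bootstrap_target(reward, next_q_values, next_valid, gamma, steps_left):
    """Finite-horizon target: reward plus the masked max over the next table.

    steps_left counts the current action, so the last step has no bootstrap.
    """
    if steps_left <= 1:
        return reward
    next_best = masked_argmax(next_q_values, next_valid)
    if next_best is None:
        return reward  # assumption: no valid next action ends the rollout
    return reward + gamma * next_q_values[next_best]


q_next = [0.2, 0.9, 0.4]
valid_next = [True, False, True]  # the highest value belongs to an invalid row

print(masked_argmax(q_next, valid_next))
print(bootstrap_target(0.1, q_next, valid_next, gamma=1.0, steps_left=3))
print(bootstrap_target(0.1, q_next, valid_next, gamma=1.0, steps_left=1))
```

Without the mask the backup would bootstrap from the invalid row's value of 0.9, which is exactly the leakage the protocol is meant to prevent.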

Coverage protocol. Every empirical claim reports the actual mesh-supervised coverage used: scenes, snippets, targets, trajectories, anchor poses, candidates, rollout seeds, transitions, split boundaries, invalid gaps, and missing coverage gaps. The final target remains the full set of 100 GT-mesh ASE scenes and 4,608 snippet windows when feasible; a scene-level held-out subset is acceptable only with exact coverage reporting and scene-level train/validation/test boundaries. Sample-level splitting across snippets from the same scene is not acceptable for final claims.
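One way to enforce scene-level boundaries is to derive the split from the scene identifier alone, so that every snippet inherits its scene's split. The sketch below is hypothetical: the hash-based assignment, the split fractions, and the identifiers are assumptions, not the project's split implementation.

```python
import hashlib


def scene_split(scene_id, val_fraction=0.1, test_fraction=0.1):
    """Deterministically map a scene id to train, val, or test."""
    digest = hashlib.sha256(scene_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    if bucket < test_fraction:
        return "test"
    if bucket < test_fraction + val_fraction:
        return "val"
    return "train"


def assign_splits(snippets):
    """snippets is a list of (scene_id, snippet_id) pairs."""
    return {(scene, snippet): scene_split(scene) for scene, snippet in snippets}


def scenes_leak(assignments):
    """True if any scene has snippets in more than one split."""
    seen = {}
    for (scene, _snippet), split in assignments.items():
        if seen.setdefault(scene, split) != split:
            return True
    return False


snippets = [("scene_001", "a"), ("scene_001", "b"), ("scene_002", "a")]
assignments = assign_splits(snippets)
print(assignments, scenes_leak(assignments))
```

A check like `scenes_leak` can run as a guard before any ranking or rollout metric is computed, so that a sample-level split cannot slip into a reported table.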

Equal-budget protocol. Unless stated otherwise, equal budget means equal selected-view horizon $H$, equal candidate count $N_q$ per decision step, equal candidate-generation distribution, and matched validity constraints. Path length, runtime, and oracle evaluation count are reported separately; path/time-constrained variants are explicit ablations.

Storage and scale protocol. The Zarr-first rollout/Q store should avoid duplicating raw ASE/ATEK assets: full meshes remain external path/hash/version references, high-detail target crops are stored once per target with crop metadata, and rollout rows reference those assets. LRZ deterministic sharding, Slurm/DSS staging, resume-safe writes, and storage-budget reporting are hard gates before full-scale generation.

Ablation protocol. Required ablation axes are target representation, candidate sampler mixture and rollout-support branching, one-step scorer versus $Q_{H,\theta}$, invalid-mask handling, surface reconstruction input, CORAL variant, auxiliary regression, candidate-relative pose encoding, stage-aware calibration, and storage/scale gates. Rollout and $Q_H$ data should use the Zarr-first rollout/Q store before large-scale generation.

1.13 Research Matrix

Question	Roadmap gate	Primary evidence	Main linked surface
RQ1	M1, M5	endpoint target-quality gain, cumulative target-root gain, diagnostic target RRI, acquisition-cost curves	RRI theory
RQ2	M4, M5	one-step scorer evidence gate, oracle lookahead, rollout traces, endpoint gain, cumulative target-root gain, diagnostic target RRI, and $Q_{H,\theta}$ success bar	Rollout/Q_H contract
RQ3	M3, M4	actor-visible target encoding and crop-descriptor ablation	EVL notes
RQ4	M2 to M7	mixed candidate sampler, branch/beam support, stochastic rollout diversity, validity statistics, ASE scale, external mesh/oracle-compatible expansion, and coverage reports	Candidate API
RQ5	M6	online discrete $Q_H$ over the same finite-candidate action contract after offline $Q_H$ is trusted	Roadmap M6
RQ6	M6	time-permitting continuous target-then-pose policy, hierarchy, or simulator escalation after finite-candidate and online-discrete evidence	Roadmap M6
Protocol	M1 to M7	invalidity masks/reason metrics, scene-level splits, no sample leakage, coverage reports, calibration/stage-shift diagnostics, Zarr/LRZ gates, and ablation tables	M1 contract report

1.14 KG-Friendly Question Writing

Each research question is written as a graph node with:

a stable section anchor;
a one-sentence definition or hypothesis;
aliases when terminology can drift;
links to internal docs, API surfaces, and roadmap gates;
citations through docs/references.bib when a question depends on prior work.

New thesis docs and public docstrings should follow the same pattern so later KG extraction can connect concepts, code, experiments, and evidence without inferring relationships from prose alone.
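One way to make this pattern checkable is to hold each question as a small record and lint it before KG extraction. The sketch below is hypothetical: `QuestionNode` and its `problems` method are illustrative names, not part of the ARIA-NBV package, and the checks cover only the fields listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuestionNode:
    """Hypothetical record mirroring the fields of a KG-friendly research question."""

    anchor: str               # stable section anchor
    definition: str           # one-sentence definition or hypothesis
    aliases: tuple = ()       # alternative names when terminology can drift
    links: tuple = ()         # internal docs, API surfaces, roadmap gates
    citations: tuple = ()     # keys expected to resolve in docs/references.bib

    def problems(self) -> list:
        """Return human-readable issues; an empty list means the node passes."""
        issues = []
        if not self.anchor or any(ch.isspace() for ch in self.anchor):
            issues.append("anchor must be a non-empty token without whitespace")
        if not self.definition.strip():
            issues.append("definition is empty")
        if not self.links:
            issues.append("no links to docs, API surfaces, or roadmap gates")
        return issues
```

A node whose `problems()` list is empty has an anchor, a definition, and at least one explicit link, so an extractor can build edges from the `links` and `citations` fields instead of parsing prose.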

References

[1]

Meta Platforms Inc., “Aria synthetic environments dataset.” [Online]. Available: https://facebookresearch.github.io/projectaria_tools/docs/open_datasets/aria_synthetic_environments_dataset

[2]

J. Straub, D. DeTone, T. Shen, N. Yang, C. Sweeney, and R. Newcombe, “EFM3D: A benchmark for measuring progress towards 3D egocentric foundation models.” 2024. Available: https://arxiv.org/abs/2406.10224

[3]

N. Frahm et al., “VIN-NBV: A view introspection network for next-best-view selection.” 2025. Available: https://arxiv.org/abs/2505.06219

[4]

X. Chen, Q. Li, T. Wang, T. Xue, and J. Pang, “GenNBV: Generalizable next-best-view policy for active 3D reconstruction,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 16436–16445. Available: https://openaccess.thecvf.com/content/CVPR2024/html/Chen_GenNBV_Generalizable_Next-Best-View_Policy_for_Active_3D_Reconstruction_CVPR_2024_paper.html

[5]

C.-Y. Lu et al., “Hestia: Voxel-face-aware hierarchical next-best-view acquisition for efficient 3D reconstruction,” in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2026. Available: https://openaccess.thecvf.com/content/WACV2026/papers/Lu_Hestia_Voxel-Face-Aware_Hierarchical_Next-Best-View_Acquisition_for_Efficient_3D_Reconstruction_WACV_2026_paper.pdf

[6]

A. Avetisyan et al., “SceneScript: Reconstructing scenes with an autoregressive structured language model.” 2024. Available: https://arxiv.org/abs/2403.13064

[7]

A. Kumar, A. Zhou, G. Tucker, and S. Levine, “Conservative Q-learning for offline reinforcement learning,” in Advances in neural information processing systems, 2020, pp. 1179–1191. Available: https://papers.nips.cc/paper/2020/hash/0d2b2061826a5df3221116a5085a6052-Abstract.html

[8]

S. Fujimoto, D. Meger, and D. Precup, “Off-policy deep reinforcement learning without exploration,” in Proceedings of the 36th international conference on machine learning, in Proceedings of machine learning research, vol. 97. PMLR, 2019, pp. 2052–2062. Available: https://proceedings.mlr.press/v97/fujimoto19a.html

[9]

L. Chen et al., “Decision transformer: Reinforcement learning via sequence modeling,” in Advances in neural information processing systems, 2021, pp. 15084–15097. Available: https://papers.nips.cc/paper_files/paper/2021/hash/7f489f642a0ddb10272b5c31057f0663-Abstract.html

[10]

I. Kostrikov, A. Nair, and S. Levine, “Offline reinforcement learning with implicit Q-learning.” 2021. Available: https://arxiv.org/abs/2110.06169

--- title: "Master Thesis Research Questions" phase: thesis audience: advisor status: current owner: jan format: html --- # Master Thesis Research Questions {#research-questions} This page states the research questions for the master's thesis phase of **ARIA-NBV**. The central question is: > Can ARIA-NBV perform target-conditioned, RRI-based **multi**-step NBV by training > a finite-candidate value model $Q_{H,\theta}$ that predicts bounded > cumulative target-specific RRI for a target of interest and improves endpoint > target reconstruction quality, after first measuring oracle-lookahead > headroom over one-step selection under a fixed acquisition budget? The questions are tied to the dated [roadmap](roadmap.qmd), the [RRI theory page](../theory/rri_theory.qmd), the [finite-candidate rollout and Q_H contract](../theory/rl_planning.qmd), the [candidate sampling and target-selection theory page](../theory/candidate_sampling_target_selection.qmd), the [RRI metric API](../../reference/rri_metrics.qmd), the [VIN model API](../../reference/vin.model_v3.qmd), and the current ASE/EVL stack [@ProjectAria-ASE-2025; @EFM3D-straub2024]. The thesis builds on VIN-NBV's quality-driven candidate-ranking objective [@VIN-NBV-frahm2025], while treating GenNBV [@GenNBV-chen2024], Hestia [@Hestia-lu2026], and SceneScript [@SceneScript-avetisyan2024] as comparison and extension references. 
## Objectives {#rq-objectives} The advisor-facing contract is deliberately compact: - validate the ASE offline-store, geometry, candidate-label, invalidity, and oracle-RRI contracts before scale-up; - define target-specific RRI under V1 {{< gls observed-target-selection >}} / {{< gls predicted-target-q >}} / {{< gls ground-truth-target-evaluation >}}; - train and validate the learned one-step target-conditioned scorer as the required myopic baseline and control for planning, not as a standalone research question; - validate finite candidate mixtures, branch/beam rollout support, and stochastic rollout-data diversity before the first $Q_H$ training run; - train $Q_{H,\theta}$ as a finite-candidate value model, first implemented as a candidate-to-state query Transformer over target, ray-aware map, local EVL, history, budget, and candidate tokens; - treat scale as an RQ4 support and evidence protocol: move from small trusted subsets toward ASE-wide and external mesh/oracle-compatible settings without replacing target RRI by proxy objectives; - treat online discrete $Q_H$ as RQ5 after stable offline $Q_H$, and continuous target-then-pose actor-critic as RQ6 after online finite-candidate evidence; - evaluate all learned selected actions with the oracle; - report scenes, snippets, targets, trajectories, rollout seeds, transitions, invalid gaps, and coverage gaps separately as shared evidence constraints for empirical claims. 
## Definitions {#rq-definitions} The core terms used by the research questions are maintained in the shared [glossary](../glossary.qmd), which is generated from `docs/typst/shared/glossary.typ` for Quarto, Typst, and KG ingestion: - {{< glsfull target-of-interest >}} - {{< glsfull target-specific-rri >}} - {{< glsfull acquisition-cost >}} - {{< glsfull target-conditioned-scorer >}} - {{< glsfull finite-horizon-q-function >}} - {{< glsfull validity-mask >}} - {{< glsfull observed-target-selection >}} - {{< glsfull predicted-target-q >}} - {{< glsfull ground-truth-target-evaluation >}} Related implementation nodes remain [VINv3](../../reference/aria_nbv.vin.model_v3.VinModelV3.qmd), [EVL literature notes](../literature/efm3d.qmd), and [candidate generation](../../reference/aria_nbv.pose_generation.CandidateViewGenerator.qmd). ## Thesis Boundary {#rq-thesis-boundary} The thesis should remain explicit about what is already implemented, what must be completed before scale-up, what is the hard quantitative thesis core, and what remains lower-priority escalation or future work. This boundary protects the proposal and final thesis from presenting ARIA-NBV as a finished continuous-control or real-device RL policy before the evidence exists. | Status | Work included | |---|---| | Already implemented substrate | Scene-level oracle RRI, immutable VIN offline-store paths, VINv3-style one-step scoring, candidate generation, rollout scaffolding, rollout Zarr record/store path, and Rerun offline inspection. | | Prerequisite and evidence protocol | Proposal freeze, M1 contract report, V0/V1 target contracts, candidate-label/frame guards, invalidity masks/reasons, rollout/Q storage schema, LRZ deterministic sharding/storage gates, scene-level splits, coverage reporting, and ablation reporting. 
| | Hard quantitative thesis core | Oracle target-task sampling for rollout/data-generation labels; target-conditioned learned one-step scorer as the myopic control; mixed target-aware candidate sets; random-valid, oracle-greedy/lookahead, and oracle-scored temperature-softmax rollouts; finite-candidate $Q_{H,\theta}$ value model, first implemented as candidate-to-state queries; explicit scale reporting. | | RQ4 support and scale research | Candidate/rollout support, ASE-wide finite-candidate offline scale, and external mesh/oracle-compatible expansion only when target-RRI supervision is preserved. | | RQ5 online bridge | Online discrete $Q_H$ over the same finite-candidate action contract after offline $Q_H$ is trusted. | | RQ6 lower-priority escalation | Continuous target-then-pose actor-critic and simulator-backed control after online finite-candidate evidence exists; quantitative continuous-control results are time-permitting, not required for the finite-candidate thesis result. | | Future-work extensions | Gumbel-Top-k as preferred later rollout-diversity evidence when time permits, SceneScript-style semantic memory, VLM/global planning, real-device guidance, human-in-the-loop AR, imitation-learning variants beyond planned RRI + $Q_H$, and proxy-objective comparisons. 
| ## Proposal Freeze Gate {#rq-proposal-freeze-gate} Before M1 scale-up, the proposal, roadmap, research questions, and literature pages must state the same advisor-facing contract: - ARIA-NBV uses target-specific RRI as the thesis utility signal, with scene RRI and acquisition cost reported separately; - oracle target-task sampling owns the first rollout/data-generation target pools and labels; V1 OBS-SEL / PRED-Q / GT-EVAL is mandatory before making actor-visible deployable-input claims; - target-conditioned finite-candidate $Q_H$ value learning is a hard thesis deliverable after the M1/M3/M4 gates, with candidate-to-state cross-attention as the first implementation and candidate-candidate self-attention only as an ablation; any blocker must be documented explicitly; - scaling beyond the current ASE mesh subset must preserve mesh/oracle target-RRI supervision for thesis-grade evidence; proxy objectives remain contrast or discussion signals; - RQ5 online discrete $Q_H$ is the first bridge after stable offline finite-candidate evidence; RQ6 continuous target-then-pose actor-critic is a later headroom test, not a substitute for the M5 $Q_H$ result; - proposal-critical citations must resolve through `docs/references.bib` to primary paper, dataset, or library metadata rather than generated comments or Wikipedia references. ## RQ1 - Method Objective and Reward Contract {#rq1-objective} **Question.** Can ARIA-NBV learn target-conditioned, finite-candidate, RRI-based multi-step NBV in complex indoor ASE scenes, improving endpoint quality of a selected target under a fixed view budget while keeping the training return, endpoint evaluation metric, and acquisition costs separate? **Working hypothesis.** Endpoint target-quality gain should be the main equal-budget trajectory metric. 
The default finite-horizon return for rollout ranking and $Q_H$ training is root-normalized additive target gain, while state-relative target RRI remains a one-step diagnostic and VIN-compatible label. Log target-error gain and scalarized motion or feasibility costs are explicit ablations after the quality-only baseline is stable, not silent replacements for the root-normalized reward. For a rollout rooted at state $s_0$, target $e$, horizon $H$, and valid action sequence $\tau=(a_0,\ldots,a_{H-1})$, let $q_t=q_{t,a_t}$ be the selected candidate view and $\mathcal{P}_{q_t}$ its fused candidate geometry. The accumulated counterfactual point state is $$ \mathcal{P}_t = \mathcal{P}_0\cup\bigcup_{k=0}^{t-1}\mathcal{P}_{q_k}, \qquad t=1,\ldots,H, $$ so $\mathcal{P}_H=\mathcal{P}_0\cup\bigcup_{k=0}^{H-1}\mathcal{P}_{q_k}$. Let $C_e(\mathcal{P})$ denote the oracle-only crop operator that filters an accumulated point set to the matched target region used for target-RRI labels and evaluation. For V1 actor inputs, this crop is not actor-visible. The target-cropped point-mesh oracle error is $$ \Delta_t^e= d(C_e(\mathcal{P}_t),M_e) = D_{\mathcal{P}\to M,t}^e + D_{M\to\mathcal{P},t}^e, $$ where $D_{\mathcal{P}\to M,t}^e$ is point-to-mesh accuracy and $D_{M\to\mathcal{P},t}^e$ is mesh-to-point completeness for the cropped points and the matched target-specific ground-truth surface. The target surface is used only for oracle supervision and evaluation. The endpoint metric is $$ J_{e,\Delta}^{(H)}(\tau)= \frac{\Delta_0^e-\Delta_H^e}{\Delta_0^e+\varepsilon}. $$ Conceptually, $J_{e,\Delta}^{(H)}$ is the fraction of the initial target error removed after $H$ selected views. It is relative to the rollout root $\mathcal{P}_0$ and is comparable only under a fixed horizon or fixed acquisition budget. The default immediate rollout/Q_H reward is root-normalized against the initial target error: $$ r_{t,\mathrm{root}}^e = \frac{\Delta_t^e-\Delta_{t+1}^e}{\Delta_0^e+\varepsilon}. 
$$ The finite-horizon learning return is $$ G_0^{(H)}(\tau)= \sum_{t=0}^{H-1}\gamma^t r_{t,\mathrm{root}}^e. $$ With $\gamma=1$, this cumulative reward equals endpoint gain up to numerical epsilon. The return is used for rollout ranking and Bellman-style training; endpoint gain is still reported as the fixed-budget evaluation metric. The state-relative one-step RRI $(\Delta_t^e-\Delta_{t+1}^e)/(\Delta_t^e+\varepsilon)$ remains stored as a diagnostic and VIN compatibility label, not as the default $Q_H$ reward. Valid actions may have negative reward if they worsen target distance; invalid candidates are hard-masked constraints, not low-RRI examples. The log-gain companion remains an explicit ablation: $$ L_e^{(H)}(\tau)= \log(\Delta_0^e+\varepsilon)-\log(\Delta_H^e+\varepsilon) = \sum_{t=0}^{H-1} \left[ \log(\Delta_t^e+\varepsilon) - \log(\Delta_{t+1}^e+\varepsilon) \right]. $$ This telescoping form may reduce stage-scale sensitivity, but it is not bounded to $[0,1]$ and is sensitive to $\varepsilon$ for near-solved targets. The exact $\gamma$, $\varepsilon$, clipping policy, near-solved-target eligibility rule, and scalarized ablation such as $G_0^{(H)}(\tau)-\lambda C(\tau)$ remain supervisor-facing open details. **Evaluation nodes.** Report endpoint target-quality gain, cumulative target-root gain, diagnostic target RRI, optional log target-error gain, scene RRI, view count, path length, invalid-action rate, and runtime. Different view budgets should be compared by fixed-budget tables or quality-cost curves rather than raw endpoint gain alone. ## RQ2 - Offline Finite-Candidate Q_H Planning {#rq2-offline-qh} **Question.** How does the trained finite-candidate value model $Q_{H,\theta}$, first implemented as candidate-to-state query attention, perform against learned one-step target scoring, oracle greedy/lookahead planning, and oracle-scored stochastic rollout references under the same acquisition and candidate budgets? 
**Working hypothesis.** Bounded planning should improve target endpoint gain especially when the locally best target view is not the best first step, for example around occlusion, future invalidity, history-dependent candidate regeneration, or when the target becomes visible after an intermediate move. M5 therefore first estimates oracle-lookahead headroom: $$ \Delta_{\mathrm{look}} = J_{e,\Delta}^{(H)}(\pi_{\mathrm{oracle\text{-}look}}) - J_{e,\Delta}^{(H)}(\pi_{\mathrm{oracle\text{-}1}}). $$ Here $\pi_{\mathrm{oracle\text{-}look}}$ selects the first action of a bounded $H$-step oracle search that maximizes the same root-normalized return used for $Q_H$ training; endpoint gain evaluates the resulting trajectory. If $\Delta_{\mathrm{look}}\approx 0$, the thesis reports no measurable non-myopic headroom for the evaluated candidate distribution, horizon, branch factor, target set, and split. This is a setup-specific negative result, not a general claim that target-specific RRI is myopic. If $\Delta_{\mathrm{look}}>0$, $Q_{H,\theta}$ should learn enough bounded return structure from ASE oracle rollout traces to recover measurable headroom over the learned one-step target-conditioned scorer. **Q_H formulation.** $Q_H$ is the glossary-level concept; the learned network in the thesis is the finite-candidate value model $Q_{H,\theta}(s_t^{\mathrm{cf0}},z_e,q_{t,i})$, first implemented as compact candidate-to-state query attention. 
The first actor-visible rollout state is $s_t^{\mathrm{cf0}}=(F_0^{\mathrm{EVL}},\mathcal{P}_t^{\mathrm{semi/fused}},M_t^{\mathrm{ray}},F_t^{\mathrm{DINO@pt}},\mathcal{Q}_t,m_t,\rho_t,z_e,h_t,b_t)$: root local EVL context, accumulated semidense/fused counterfactual points, a ray-aware occupied/free/unknown memory updated by selected geometry, an optional visibility-gated logged-frame DINO-on-point feature bank, finite candidate table, validity mask, invalid-reason metadata, actor-visible target descriptor, selected-view history, and remaining budget. The first $Q_{H,\theta}$ result may start with the implemented VIN/EVL heads plus semidense geometry only; the first planned representation upgrade is ray-aware geometry and candidate queries, while $F_t^{\mathrm{DINO@pt}}$ is a later appearance ablation, not required evidence before the baseline. GT meshes, GT crops, and oracle RRI values are supervision and evaluation only. The model applies hard validity masks and emits one bounded-horizon value per finite candidate-table action in continuous return units; CORAL remains scoped to the one-step scorer unless an explicit ordinal ablation is reported. The formal ArgTopK recursion, branch-factor schedule, replay row fields, and DQN/Double-DQN backup shapes live on the [finite-candidate rollout and Q_H contract](../theory/rl_planning.qmd), while ArgTopK, beam, and stochastic branching knobs are treated as RQ4 rollout-support controls. **Required comparisons.** - random-valid selection as a lower reference; - one-step oracle greedy and bounded oracle-RRI lookahead under equal candidate and view budget; - learned one-step target-conditioned scorer as the required myopic model control; - oracle-scored temperature-softmax traces for rollout data diversity; - Gumbel-Top-k traces as preferred later diversity evidence when schedule permits; - $Q_{H,\theta}$ trained from ASE oracle rollout traces, with all selected actions oracle-evaluated. 
**Gate ordering.** Deterministic oracle-greedy and bounded oracle-lookahead rollouts must be trusted before stochastic branching is used as training data. Random-valid, oracle-greedy/lookahead, and oracle-scored temperature-softmax traces are mandatory before the first $Q_{H,\theta}$ training run. Gumbel-Top-k, CQL [@CQL-kumar2020], BCQ [@BCQ-fujimoto2019], Decision Transformer [@DecisionTransformer-chen2021], and IQL [@IQL-kostrikov2021] remain later evidence or ablation references unless the M5 result is stable. **One-step scorer evidence gate.** Before $Q_{H,\theta}$ is interpreted as a planning result, the learned one-step target-conditioned scorer must have held-out ranking evidence, oracle-evaluated model-selected rollouts, calibration and stage-shift diagnostics, and Rerun visualizations of representative successes and failures. This scorer is not the final thesis claim, but it is the required myopic control. **Success bar.** Learned $Q_{H,\theta}$ selected actions must be oracle-evaluated under equal acquisition and candidate budgets. The primary comparison is endpoint target-quality gain; the secondary training-aligned comparison is cumulative root-normalized target gain. When $\Delta_{\mathrm{look}}>0$, $Q_{H,\theta}$ should beat the learned one-step target-conditioned scorer and one-step model/greedy selection, while bounded oracle lookahead remains the upper reference and random-valid remains the lower reference. Also report the recovered headroom fraction: $$ \eta_Q = \frac{ J_{e,\Delta}^{(H)}(\pi_Q) - J_{e,\Delta}^{(H)}(\pi_{\mathrm{learned\text{-}1}}) }{ J_{e,\Delta}^{(H)}(\pi_{\mathrm{oracle\text{-}look}}) - J_{e,\Delta}^{(H)}(\pi_{\mathrm{learned\text{-}1}}) +\varepsilon }. $$ Report $\eta_Q$ only when the denominator is positive and larger than a predefined minimum effect threshold; otherwise report raw endpoint gains and classify the comparison as no measurable headroom. The exact effect-size threshold remains an advisor-facing open decision. 
## RQ3 - Actor-Visible Target Representation {#rq3-representation} **Question.** Which actor-visible target descriptor and matching protocol are sufficient for target-conditioned one-step scoring and $Q_{H,\theta}$ without leaking ground-truth target annotations into the actor input? **Working hypothesis.** RQ3 owns target inputs, not the Q-network architecture. The first path should use observed/predicted OBB geometry plus support signals. A compact actor-visible crop descriptor is the first planned ablation and may be promoted if it is cheap and leakage-safe. The strongest near-term crop descriptor candidate is semidense/fused target support with a target/context shell and ray-aware support; compressed DINO-on-point features are added only after logged visibility gating works. Entity tokens remain later ablations. **Main protocol.** The thesis protocol is {{< gls observed-target-selection >}} / {{< gls predicted-target-q >}} / {{< gls ground-truth-target-evaluation >}}. Target selection and model inputs use actor-visible predicted or observed descriptors. GT target crops and GT OBBs define labels and evaluation only. V0 may use GT OBB input for sanity or upper-bound runs; oracle target-task sampling remains the first rollout label path, and V1 is mandatory before deployable actor-input claims. Actor-visible target hypotheses are sourced from the EFM3D/EVL stack or the observed support derived from it: predicted or tracked OBBs, semantic class probabilities, confidence-style scores, projected area, voxel support, semidense/fused point support, ray-aware target/context support, and optional visibility-gated logged DINO-on-point descriptors. Cube R-CNN-style detections can be evaluated as auxiliary target proposals or ROI descriptors after ARIA/ASE adaptation; they are not the default scene memory. 
**First target input.** The minimum first-path descriptor is the observed/predicted OBB center, extents, orientation, class, confidence, projected area, relative pose, semidense point support, and EVL support. The first ablation adds a compact crop descriptor computed only from actor-visible spatial data inside the observed/predicted crop, such as pooled EVL/voxel features or point statistics. GT crops, GT mesh geometry, and GT OBBs are never used to build V1 actor-visible descriptors. | Descriptor source | Role | |---|---| | GT OBB | V0 sanity or upper-bound input only; never the main V1 actor input. | | Predicted/observed OBB | Mandatory V1 actor-visible target record. | | Crop descriptor | First V1 ablation: pooled EVL/voxel features, semidense/fused point statistics, ray-aware target/context support, or visibility-gated logged DINO-on-point descriptors inside the predicted/observed crop. | | Entity token | Later learned target/entity embedding ablation. | | Target point | Candidate-generation look-at point, such as OBB center, support centroid, missing-surface centroid, or expected-RRI centroid. | Descriptor source and descriptor encoding stay separate. Encoders may expose relative target translation, extents, 6D/R6D-style relative orientation, class-probability or semantic embeddings, projected-area/support scalars, and Fourier-feature or MLP encoders for continuous geometry. Crop descriptors enter only from actor-visible spatial evidence when the OBB-level contract is stable. **First target metric.** The first target-aware oracle metric is GT-OBB-cropped target RRI. Main thesis claims require graduating from V0 GT target input to V1 observed/predicted target input before target-conditioned scorer or $Q_{H,\theta}$ results are reported as actor-visible. **Matching rule.** Observed targets are matched to GT target labels by compatible class, OBB IoU, visibility/support, projected area, and semidense or EVL point support. 
Let $\mu(\hat e,e)$ denote this compact match score between an actor-visible target record $\hat e$ and a GT target candidate $e$: $$ e^\star=\arg\max_{e\in\mathcal E}\mu(\hat e,e), \qquad \mu_1=\mu(\hat e,e^\star), \qquad \mu_2=\max_{e\in\mathcal E\setminus\{e^\star\}}\mu(\hat e,e). $$ A match is accepted iff $$ \mu_1\ge \tau_\mu \qquad\text{and}\qquad \mu_1-\mu_2\ge \tau_{\mathrm{gap}}. $$ Unmatched targets, unsupported targets, and ambiguous multi-object matches are target-invalid protocol cases, not low target-RRI examples. Any criterion beyond compatible class, OBB IoU, visibility/support, projected area, and point support must be named from an observed M3 failure mode. **Connected implementation surfaces.** Target fields should enter through data-handling containers, then flow into the [VIN model API](../../reference/aria_nbv.vin.model_v3.VinModelV3.qmd), $Q_{H,\theta}$ replay rows, and diagnostics rather than through ad hoc dictionaries. ## RQ4 - Candidate, Rollout, and Scale Support {#rq4-support} **Question.** Do mixed target-centric and default exploration candidates, controlled branch/beam rollout support, and scale-controlled replay evidence improve target-conditioned NBV relative to pure generic sampling and pure target-point sampling? **Working hypothesis.** A mixed sampler should beat both a pure generic shell and a pure `TARGET_POINT` sampler when target viewpoints need diversity: target-centric candidates focus on the object or region of interest, while default candidates preserve exploratory views and reduce overfitting to a single look-at pattern. 
**Candidate families.** The thesis-core vocabulary is the package-supported finite candidate table exposed by [CandidateViewGeneratorConfig](../../reference/aria_nbv.pose_generation.CandidateViewGeneratorConfig.qmd), [SamplingStrategy](../../reference/aria_nbv.pose_generation.SamplingStrategy.qmd), the code-level `ViewDirectionMode` enum in `aria_nbv.pose_generation.types`, and [CandidateSamplingResult](../../reference/aria_nbv.pose_generation.CandidateSamplingResult.qmd): - `TARGET_POINT` candidates that look at a predicted/observed target point or OBB center; - `RADIAL_AWAY` and `RADIAL_TOWARDS` shell candidates around the current reference pose for default exploration and augmentation; - `FORWARD_RIG` candidates that preserve trajectory-forward view direction; - position sampling with `UNIFORM_SPHERE` and `FORWARD_POWERSPHERICAL`; - bounded yaw, pitch, and roll jitter superimposed on the chosen base orientation; - categorical-probability mixtures over those supported families, with candidate strategy provenance retained per sampled row. Frontier, missing-surface, expected-RRI heatmap, and GT look-at samplers are not first-path thesis candidates unless they are implemented, validated, and clearly marked as upper-bound or stretch evidence. Target-centric candidate generation uses the V1 observed/predicted target record unless the run is explicitly marked as V0 or upper-bound. **Evaluation nodes.** Candidate realism, valid fraction, invalid reason distribution, target visibility, target RRI distribution, scene RRI distribution, selected-view diversity, and strategy provenance for every candidate set. Hard masks and reason codes are required before any scorer or $Q_{H,\theta}$ consumes the table. **Rollout-support knobs.** Branch factor, beam width, and stochastic branch selection are dataset-support controls for $Q_{H,\theta}$, not replacement objectives. 
Deterministic ArgTopK expands a bounded set of valid high-score candidates; oracle-scored temperature-softmax and later Gumbel-Top-k make the branching stochastic to widen rollout support without changing the target-RRI label. These knobs belong with candidate/action-space evidence because they decide which finite action rows the value model can learn from. **Scale-support ladder.** RQ4 also owns support growth before RQ5 online interaction: - small trusted subset for geometry, target matching, invalidity, and replay correctness; - scene-level held-out subset or full ASE GT-mesh subset with exact reporting of scenes, snippets, targets, trajectories, rollout seeds, transitions, and missing coverage gaps; - broader external mesh/oracle-compatible datasets or simulator substrates only if they preserve target-specific point-mesh supervision and actor-visible target inputs. The implementation budget and concrete compute envelope are internal planning details. Public evidence must report achieved coverage, missing coverage, split boundaries, oracle throughput, and whether comparisons reuse identical roots, candidates, and budgets. ## Shared Scale Evidence Beyond Small Trusted Subsets {#rq-scale-evidence} **Protocol question.** Which scaling path preserves target-RRI supervision while moving from small trusted subsets to thesis-grade evidence and, if needed, beyond the 100 mesh-supervised ASE scene limit? **Working hypothesis.** Scaling should proceed in causal order. First, scale finite-candidate offline labels, rollout roots, targets, candidates, and transitions inside the ASE mesh-supervised subset. Second, if the 100 GT-mesh scene limit becomes the bottleneck, expand only to external substrates that can provide comparable mesh/oracle target-RRI labels. Third, hand off only then to RQ5 online discrete $Q_H$ interaction to test whether additional oracle interaction improves the finite-candidate policy. 
Coverage, uncertainty, or semantic proxy labels are valid contrast signals, but they do not replace mesh/oracle target RRI for thesis-grade comparisons. **Scale ladder.** - small trusted subset for geometry, target matching, invalidity, and replay correctness; - scene-level held-out subset or full ASE GT-mesh subset with exact reporting of scenes, snippets, targets, trajectories, rollout seeds, transitions, and missing coverage gaps; - broader external mesh/oracle-compatible datasets or simulator substrates if they preserve target-specific point-mesh supervision and actor-visible target inputs; - RQ5 online discrete $Q_H$ in the ASE or external mesh/oracle loop after offline finite-candidate evidence is stable. The implementation budget and concrete compute envelope are internal planning details. Public evidence must instead report achieved coverage, missing coverage, split boundaries, oracle throughput, and whether comparisons reuse identical roots, candidates, and budgets. Scale axes are reported separately: scenes, snippets, anchor poses per trajectory, candidate sets, candidate-distribution variants, targets per snippet, rollout seeds, transitions, and stage/calibration bins. Architecture, rollout, scorer, or $Q_H$ conclusions must not be compared across runs that silently change scene, snippet, target, or candidate coverage. ## RQ5 - Online Discrete Q_H Bridge {#rq5-online} **Question.** After the planned offline finite-candidate $Q_H$ result, does online interaction over the same discrete finite-candidate action contract improve endpoint gain or recovered headroom enough to justify the extra data collection loop? **Working hypothesis.** Online discrete $Q_H$ is the first post-offline bridge because it preserves the finite action table, hard masks, target-RRI reward, and oracle re-evaluation contract. It is an RQ5 extension after stable offline $Q_H$, not a replacement for the offline evidence gate. 
**Online-discrete gate.** - online discrete $Q_H$ with the same finite action contract and oracle re-evaluation; - compare offline fitted $Q_H$ against online-updated finite-candidate policy under matched roots, targets, candidates, and acquisition budget; - report whether online interaction changes endpoint gain, cumulative root-normalized target gain, invalid-action handling, or recovered headroom. ## RQ6 - Continuous and Simulator Escalation {#rq6-continuous} **Question.** After finite-candidate and online-discrete evidence, do continuous or hierarchical target-then-pose policies have measurable headroom over the best finite-candidate policy under the same target-RRI objective? **Working hypothesis.** Continuous action spaces require online training first. Continuous policies are plausible improvements over simple algorithmic candidate generation because they are not restricted to the hand-designed finite candidate table, but they should not be used to rescue a failed target-RRI, rollout, offline $Q_H$, or online-discrete contract. Hestia-style hierarchy [@Hestia-lu2026] and GenNBV-style continuous control [@GenNBV-chen2024] are relevant design references. Imitation-learning variants are deferred beyond the planned RRI + $Q_H$ approach. **Escalation ladder.** - continuous target-then-pose actor-critic only after online interaction, reward speed, invalid-action handling, and evaluation are thesis-grade; - simulator-backed online RL with Habitat, Isaac, or external datasets only if the mesh/oracle target-RRI supervision contract can be preserved; - optional semantic-global planning over grounded portals, frontiers, entities, and regions with SceneScript-style or open-vocabulary entity memory [@SceneScript-avetisyan2024] and real-device guidance as future work. ## Shared Evidence and Protocol Constraints {#rq-evidence-protocol} Invalidity remains a shared constraint that makes every empirical answer interpretable. 
Scale support is part of RQ4, but coverage reporting remains a shared evidence contract across all empirical RQs.

**Invalidity protocol.** Collision, out-of-bounds poses, no-depth cases, bad frusta, outside-EVL-extent cases, and candidates that cannot produce a meaningful oracle/evaluation sample remain hard masks plus explicit reason codes. Low immediate target visibility or support is a diagnostic or label-quality flag unless evaluation is impossible; treating it as a hard mask would hide valid setup actions and make non-myopic planning myopic by construction. Invalid cases are never encoded as the lowest RRI bin. $Q_{H,\theta}$ masks invalid candidates before argmax, softmax, loss targets, and bootstrap maximization. Report invalid fraction, invalid-reason distribution, target-visible fraction, rank metrics with and without masks, and the effect of mask handling on selected-action oracle evaluation. Whether invalidity should become a learned signal through validity heads, scalar penalties, or continuous-action feasibility models remains a supervisor-facing escalation decision.

**Coverage protocol.** Every empirical claim reports the actual mesh-supervised coverage used: scenes, snippets, targets, trajectories, anchor poses, candidates, rollout seeds, transitions, split boundaries, invalid gaps, and missing coverage gaps. The final target remains the full 100 GT-mesh ASE scenes and 4,608 snippet windows when feasible; a scene-level held-out subset is acceptable only with exact coverage reporting and scene-level train/validation/test boundaries. Sample-level splitting across snippets from the same scene is not acceptable for final claims.

**Equal-budget protocol.** Unless stated otherwise, equal budget means equal selected-view horizon $H$, equal candidate count $N_q$ per decision step, equal candidate-generation distribution, and matched validity constraints.
Path length, runtime, and oracle evaluation count are reported separately; path/time-constrained variants are explicit ablations.

**Storage and scale protocol.** The Zarr-first rollout/Q store should avoid duplicating raw ASE/ATEK assets: full meshes remain external path/hash/version references, high-detail target crops are stored once per target with crop metadata, and rollout rows reference those assets. LRZ deterministic sharding, Slurm/DSS staging, resume-safe writes, and storage-budget reporting are hard gates before full-scale generation.

**Ablation protocol.** Required ablation axes are target representation, candidate sampler mixture and rollout-support branching, one-step scorer versus $Q_{H,\theta}$, invalid-mask handling, surface reconstruction input, CORAL variant, auxiliary regression, candidate-relative pose encoding, stage-aware calibration, and storage/scale gates. Rollout and $Q_H$ data should use the [Zarr-first rollout/Q store](../theory/rl_planning.qmd) before large-scale generation.

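The hard-mask rule in the invalidity protocol of this section can be illustrated with a minimal sketch. It is not the ARIA-NBV implementation: `InvalidReason`, `masked_argmax`, `masked_softmax`, and `invalid_fraction` are hypothetical names introduced here, and the reason codes only mirror the cases listed in the protocol. The point it shows is that invalid candidates are dropped from selection and normalization rather than being assigned a low score.

```python
import math
from enum import Enum
from typing import List, Optional, Sequence


class InvalidReason(Enum):
    """Hypothetical reason codes mirroring the cases named in the protocol."""
    COLLISION = "collision"
    OUT_OF_BOUNDS = "out_of_bounds"
    NO_DEPTH = "no_depth"
    BAD_FRUSTUM = "bad_frustum"
    OUTSIDE_EVL_EXTENT = "outside_evl_extent"


Reasons = Sequence[Optional[InvalidReason]]  # None marks a valid candidate


def masked_argmax(q_values: Sequence[float], reasons: Reasons) -> Optional[int]:
    """Index of the best valid candidate, or None if every candidate is masked."""
    valid = [i for i, reason in enumerate(reasons) if reason is None]
    if not valid:
        return None
    return max(valid, key=lambda i: q_values[i])


def masked_softmax(q_values: Sequence[float], reasons: Reasons) -> List[float]:
    """Softmax over valid candidates only; masked candidates get probability 0."""
    valid = [i for i, reason in enumerate(reasons) if reason is None]
    probs = [0.0] * len(q_values)
    if not valid:
        return probs
    peak = max(q_values[i] for i in valid)  # subtract the max for stability
    weights = {i: math.exp(q_values[i] - peak) for i in valid}
    total = sum(weights.values())
    for i, weight in weights.items():
        probs[i] = weight / total
    return probs


def invalid_fraction(reasons: Reasons) -> float:
    """Share of candidates carrying any invalidity reason code."""
    if not reasons:
        return 0.0
    return sum(reason is not None for reason in reasons) / len(reasons)


if __name__ == "__main__":
    q = [0.2, 0.9, 0.5, 0.1]  # made-up predicted values for four candidates
    reasons = [None, InvalidReason.COLLISION, None, InvalidReason.NO_DEPTH]
    print(masked_argmax(q, reasons))   # candidate 1 scores highest but is masked
    print(masked_softmax(q, reasons))
    print(invalid_fraction(reasons))
```

Keeping the reason code next to each candidate, instead of collapsing it into a boolean, is what makes the invalid-reason distribution reportable alongside the invalid fraction.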
## Research Matrix {#rq-matrix}

| Question | Roadmap gate | Primary evidence | Main linked surface |
|---|---|---|---|
| [RQ1](#rq1-objective) | M1, M5 | endpoint target-quality gain, cumulative target-root gain, diagnostic target RRI, acquisition-cost curves | [RRI theory](../theory/rri_theory.qmd) |
| [RQ2](#rq2-offline-qh) | M4, M5 | one-step scorer evidence gate, oracle lookahead, rollout traces, endpoint gain, cumulative target-root gain, diagnostic target RRI, and $Q_{H,\theta}$ success bar | [Rollout/Q_H contract](../theory/rl_planning.qmd) |
| [RQ3](#rq3-representation) | M3, M4 | actor-visible target encoding and crop-descriptor ablation | [EVL notes](../literature/efm3d.qmd) |
| [RQ4](#rq4-support) | M2 to M7 | mixed candidate sampler, branch/beam support, stochastic rollout diversity, validity statistics, ASE scale, external mesh/oracle-compatible expansion, and coverage reports | [Candidate API](../../reference/aria_nbv.pose_generation.CandidateViewGenerator.qmd) |
| [RQ5](#rq5-online) | M6 | online discrete $Q_H$ over the same finite-candidate action contract after offline $Q_H$ is trusted | [Roadmap M6](roadmap.qmd#roadmap-m6) |
| [RQ6](#rq6-continuous) | M6 | time-permitting continuous target-then-pose policy, hierarchy, or simulator escalation after finite-candidate and online-discrete evidence | [Roadmap M6](roadmap.qmd#roadmap-m6) |
| [Protocol](#rq-evidence-protocol) | M1 to M7 | invalidity masks/reason metrics, scene-level splits, no sample leakage, coverage reports, calibration/stage-shift diagnostics, Zarr/LRZ gates, and ablation tables | [M1 contract report](m1_contract_report.qmd) |

## KG-Friendly Question Writing {#rq-kg-writing}

Each research question is written as a graph node with:

- a stable section anchor;
- a one-sentence definition or hypothesis;
- aliases when terminology can drift;
- links to internal docs, API surfaces, and roadmap gates;
- citations through `docs/references.bib` when a question depends on prior work.

New thesis docs and public docstrings should follow the same pattern so later KG extraction can connect concepts, code, experiments, and evidence without inferring relationships from prose alone.