Candidate-View Dependence

1 Candidate-View Dependence

This page owns the theory contract for treating ARIA-NBV candidate views as a structured finite set rather than as independent rows. It extracts the architecture implications of the candidate-view-dependence design note and connects them to RRI theory, candidate sampling, finite-candidate rollout and $Q_H$, and SCONE/FisherRF support channels.

1.1 Core Claim

For a fixed actor-visible state $s$, target $e$, and candidate-generation process, target-specific RRI can be viewed as a scalar utility field over feasible camera poses:

\[ F_{s,e}:\mathrm{SE}(3)\rightarrow \mathbb{R}, \qquad F_{s,e}(q)=\mathrm{RRI}_e(s,q). \]

The seminar/VIN-style scorer learned point samples of this field independently:

\[ \hat r_i=f_\theta(s,e,q_i). \]

ARIA-NBV should test whether candidate-set context helps by modeling:

\[ \hat r_i=f_\theta(s,e,q_i,\mathcal Q_t,m_t), \qquad \mathcal Q_t=\{q_{t,i}\}_{i=1}^{N_q}\subset \mathrm{SE}(3), \]

where $m_t$ collects the validity mask and candidate metadata. The scientific claim is not that Transformers are universally better. The claim is that a finite candidate table is an unordered sampled view of a local utility field, and its geometry, redundancy, support gaps, and masks can carry information that independent scoring discards.

1.2 Smoothness Caveat

The RRI field is not globally smooth. It can change abruptly when a target surface enters or leaves the frustum, a ray crosses an occlusion boundary, a pose becomes invalid, a target crop loses support, or a wall/collision constraint changes. The correct assumption is:

\[ \text{RRI is locally structured and, within a stable visibility regime, often locally smooth.} \]

For nearby poses in a stable regime, compare candidates by a local Lie-algebra displacement:

\[ \xi_{ij}=\log(q_i^{-1}q_j)\in \mathfrak{se}(3). \]

A first-order local approximation is:

\[ F(q_i\exp(\xi)) \approx F(q_i) + \nabla_\xi F(q_i)^\top \xi. \]

This motivates candidate interaction, but it also forbids blind smoothness losses across occlusion, invalidity, or support-boundary changes.
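A minimal NumPy sketch of the displacement $\xi_{ij}=\log(q_i^{-1}q_j)$. It assumes that poses are stored as 4x4 homogeneous matrices, that twists are ordered as (translation part, rotation part), and that rotation angles stay strictly below $\pi$. The helper names `hat`, `so3_log`, `se3_log`, and `relative_twist` are illustrative only and are not part of any ARIA-NBV interface.

```python
import numpy as np

def hat(w):
    """Map a 3-vector to its skew-symmetric matrix."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def so3_log(R):
    """Rotation vector of R (assumes the rotation angle is below pi)."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    vee = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    if theta < 1e-8:
        return 0.5 * vee  # small-angle limit of theta / (2 sin theta)
    return theta / (2.0 * np.sin(theta)) * vee

def se3_log(T):
    """Return xi = (rho, omega) for a 4x4 homogeneous transform T."""
    R, t = T[:3, :3], T[:3, 3]
    omega = so3_log(R)
    theta = np.linalg.norm(omega)
    W = hat(omega)
    if theta < 1e-8:
        V_inv = np.eye(3) - 0.5 * W + (W @ W) / 12.0
    else:
        coeff = (1.0 - theta * np.sin(theta) / (2.0 * (1.0 - np.cos(theta)))) / theta**2
        V_inv = np.eye(3) - 0.5 * W + coeff * (W @ W)
    return np.concatenate([V_inv @ t, omega])

def relative_twist(q_i, q_j):
    """xi_ij = log(q_i^{-1} q_j) for two 4x4 poses."""
    return se3_log(np.linalg.inv(q_i) @ q_j)
```

The norm of `relative_twist(q_i, q_j)` mixes metres and radians, so any distance built on it needs an explicit scale choice between the translational and rotational parts.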

1.3 Why Independent Candidate Scoring Is Limited

Independent candidate scoring cannot represent these candidate-population facts:

several candidates are near-duplicates;
a candidate lies on the same target-bearing arc as another candidate;
the sampled set contains no good target-facing view;
a candidate is an outlier relative to the proposal distribution;
nearby poses in $\mathrm{SE}(3)$ should often have correlated RRI;
one candidate covers target-local support that other candidates miss.

VIN-NBV provides the direct precedent for sampled-view RRI prediction [1], but ARIA-NBV’s additional question is whether the finite candidate table itself contains exploitable structure beyond pointwise scoring.

1.4 Literature Foundations

foundation	key idea	ARIA-NBV transfer
Deep Sets [2]	Set functions should respect permutation symmetry.	Candidate-row order must not change the semantic output.
Set Transformer [3]	Attention can model interactions among set elements while preserving set symmetry.	Masked candidate self-attention is an interaction ablation after independent candidate-to-state calibration.
FisherRF [4]	View utility is expected information gain, and overlapping views have diminishing returns.	Candidate value can depend on target-local information overlap with other candidates and existing evidence.
Multipoint expected improvement [5]	Batch Bayesian optimization treats selected points jointly, not as independent top-$q$ points.	A candidate table is a batch of possible measurements; redundancy and diversity matter.
Adaptive submodularity [6]	Active sensing often has diminishing returns under partial observations.	Support, visibility, and information channels can behave submodularly even when RRI does not.
Conditional Neural Processes [7]	A model can infer a latent function from a set of context/query locations.	Candidate coordinates and features reveal the local design distribution sampled from the utility field.
Geometric deep learning [8]	Architectures should respect the structure and symmetries of physical data.	Use a permutation-equivariant set model with $\mathrm{SE}(3)$-aware relative geometry and $\mathbb S^2$ visibility memory.

The resulting model family is:

\[ x_i=\phi(s,e,q_i), \qquad \{z_i\}_{i=1}^{N_q} = \mathrm{SetEncoder}(\{x_i\}_{i=1}^{N_q},m_t), \qquad \hat r_i=h(z_i). \]

The required symmetry is permutation equivariance:

\[ f(\Pi X)=\Pi f(X), \]

where $\Pi$ permutes candidate rows.
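The equivariance requirement can be checked numerically. The sketch below uses a single masked DeepSets-style layer as a stand-in for $\mathrm{SetEncoder}$; the layer, its weights `W_self` and `W_ctx`, and the mean pooling are assumptions chosen for illustration, not the ARIA-NBV encoder.

```python
import numpy as np

def masked_deepsets_layer(X, mask, W_self, W_ctx):
    """Each row sees itself plus the mean of all valid rows; invalid rows output zero."""
    m = mask.astype(X.dtype)[:, None]
    ctx = (X * m).sum(axis=0) / max(m.sum(), 1.0)  # masked mean over valid rows
    Z = np.tanh(X @ W_self + ctx @ W_ctx)
    return Z * m

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))                       # six candidate tokens
mask = np.array([1, 1, 0, 1, 1, 0], dtype=bool)   # two padded rows
W_self = rng.normal(size=(4, 8))
W_ctx = rng.normal(size=(4, 8))

perm = rng.permutation(6)
Z = masked_deepsets_layer(X, mask, W_self, W_ctx)
Z_perm = masked_deepsets_layer(X[perm], mask[perm], W_self, W_ctx)
print(np.allclose(Z_perm, Z[perm]))  # equivariance: f(Pi X) == Pi f(X)
```

The mask must be permuted together with the rows; permuting only `X` would silently break the check.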

1.5 Fisher And SCONE Overlap View

FisherRF makes the redundancy argument explicit. Let $F_t$ be the accumulated information and $F_q$ the candidate’s Fisher information, with diagonal entries $f_{t,j}$ and $f_{q,j}$ and a regularization term $\lambda>0$. Under a diagonal/Laplace approximation:

\[ \mathrm{EIG}(q) \approx \frac12 \sum_j \log\left( 1+ \frac{f_{q,j}}{f_{t,j}+\lambda} \right). \]

Two candidates that observe the same uncertain region are not independent:

\[ U(\{q_i,q_j\})\neq U(q_i)+U(q_j). \]
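A small numeric illustration of this non-additivity under the diagonal approximation. It assumes that the Fisher contributions of two views simply add when both are taken, and the values of `f_t`, `f_i`, `f_j`, `f_k`, and `lam` are invented for the example.

```python
import numpy as np

def diag_eig(f_q, f_t, lam=1e-3):
    """Diagonal expected information gain: 0.5 * sum_j log(1 + f_q_j / (f_t_j + lam))."""
    return 0.5 * np.sum(np.log1p(f_q / (f_t + lam)))

f_t = np.array([0.1, 0.1, 5.0])   # accumulated information per parameter
f_i = np.array([1.0, 0.0, 0.0])   # candidate i informs parameter 0
f_j = np.array([1.0, 0.0, 0.0])   # candidate j informs the same parameter
f_k = np.array([0.0, 1.0, 0.0])   # candidate k informs a different parameter

overlap_joint = diag_eig(f_i + f_j, f_t)
overlap_sum = diag_eig(f_i, f_t) + diag_eig(f_j, f_t)
disjoint_joint = diag_eig(f_i + f_k, f_t)
disjoint_sum = diag_eig(f_i, f_t) + diag_eig(f_k, f_t)
print(overlap_joint < overlap_sum, np.isclose(disjoint_joint, disjoint_sum))
```

In this toy model the joint gain is strictly below the sum for overlapping views and exactly additive for disjoint ones, which is the diminishing-returns behavior the text describes.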

SCONE/Fisher-inspired support vectors can expose that overlap. For candidate $i$, define a target-local support vector:

\[ f_i(v) = p_{\mathrm{surf}}(v) p_{\mathrm{vis}}(v,q_i) p_{\mathrm{novel}}(v,q_i). \]

Pairwise redundancy can be summarized as:

\[ o_{ij} = \frac{ \sum_v f_i(v)f_j(v) }{ \sqrt{\sum_v f_i(v)^2}\sqrt{\sum_v f_j(v)^2}+\epsilon }. \]

This can be an attention bias or diagnostic channel; it should not replace target-RRI labels.
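The support vector and the redundancy score translate directly into array operations. This is a sketch under the assumption that $p_{\mathrm{surf}}$, $p_{\mathrm{vis}}$, and $p_{\mathrm{novel}}$ are already available as dense arrays over $N_v$ target-local cells; the function names are hypothetical.

```python
import numpy as np

def support_vectors(p_surf, p_vis, p_novel):
    """f_i(v) = p_surf(v) * p_vis(v, q_i) * p_novel(v, q_i); result has shape (N_q, N_v)."""
    return p_surf[None, :] * p_vis * p_novel

def support_overlap(F, eps=1e-8):
    """Cosine-style redundancy o_ij between all candidate pairs; shape (N_q, N_q)."""
    gram = F @ F.T
    norms = np.sqrt(np.diag(gram))
    return gram / (np.outer(norms, norms) + eps)

p_surf = np.ones(3)
p_vis = np.array([[1.0, 0.0, 0.0],    # candidates 0 and 1 see the same cell
                  [1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],    # candidate 2 sees a different cell
                  [0.0, 0.0, 0.0]])   # candidate 3 has no support
p_novel = np.ones((4, 3))
O = support_overlap(support_vectors(p_surf, p_vis, p_novel))
```

Because $\epsilon$ sits in the denominator, a candidate with an all-zero support vector gets overlap 0 with every row, including itself, rather than a division error.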

1.6 Candidate Interaction: Where It Helps

The physical one-step label for a candidate is independent of the other sampled candidates:

\[ r_i=F_{s,e}(q_i). \]

Changing other rows in $\mathcal Q_t$ does not change the oracle immediate RRI for $q_i$. This makes set context risky for absolute scalar regression. Candidate interaction can still help in four bounded ways.

use	mechanism
Local field shape	Nearby candidates in $\mathrm{SE}(3)$ and support space can share evidence.
Ranking	NBV needs $\arg\max_i r_i$; relative comparisons can be easier than global RRI calibration.
Candidate-set degeneracy	A set model can detect that all candidates are low-support, near-duplicate, or invalid-dominated.
Finite-horizon $Q_H$	Future value legitimately depends on the candidate-generation process, masks, branch support, and successor candidate tables.

The case for interaction is therefore strongest for $Q_H$, but it is still worth testing for the myopic scorer as a controlled ablation.

1.7 Recommended Architecture

Use an independent calibrated base plus candidate-to-state residual context. This preserves absolute RRI semantics while allowing target, map, history, and budget evidence to alter finite-horizon value:

\[ \hat r_i^{\mathrm{abs}} = f_{\mathrm{abs}}(s,e,q_i), \]

\[ \delta_i^H = f_{\mathrm{query}}(x_i,\{T_e,M_t^{\mathrm{ray}},H_t,B_t,E_0^{\mathrm{EVL}}\}), \]

\[ \boxed{ Q_{H,i} = \hat r_i^{\mathrm{abs}} + \delta_i^H }. \]

The residual can be regularized, but it should not be exactly mean-centred within each sampled candidate set. Adding or duplicating an unrelated valid row would change the per-set mean and therefore shift TD targets for other physical candidates.
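A toy calculation makes the mean-centring problem concrete. The residual values are invented; `centre_within_set` is a hypothetical helper that subtracts the per-set mean.

```python
import numpy as np

def centre_within_set(delta):
    """Subtract the per-set mean from every residual."""
    return delta - delta.mean()

delta = np.array([0.30, 0.10, -0.10])     # residuals for three physical candidates
centred = centre_within_set(delta)

delta_aug = np.append(delta, 0.90)        # one unrelated valid row is added
centred_aug = centre_within_set(delta_aug)

shift = centred_aug[:3] - centred         # change seen by the original candidates
print(shift)
```

Every original candidate moves by the same offset even though nothing about its pose or its evidence changed, whereas the uncentred residuals in `delta` are untouched by the extra row.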

Written with explicit arguments, $Q_H$ becomes an uncentred residual value head:

\[ Q_H(s,e,q_i;\mathcal Q_t,m_t) = \hat r_\psi^e(s,q_i) + \delta_{\theta,i}^{H}(s,e,q_i,M_t^{\mathrm{ray}},h_t). \]

Candidate-candidate self-attention remains useful for policy logits, diversity, top-$k$ branch construction, or a separately evaluated context ablation, but it is not the default physical value estimator.

1.8 Candidate Token And Pairwise Features

Candidate tokens should contain more than pose:

\[ x_i= \left[ \phi_{\mathrm{pose}}(q_i), \phi_{\mathrm{target}}(q_i,e), \phi_{\mathrm{support}}(q_i,e), \phi_{\mathrm{frustum}}(q_i), \phi_{\mathrm{history}}(q_i,h_t), \phi_{\mathrm{dir}}(q_i,e), \phi_{\mathrm{valid}}(m_i,\rho_i), \phi_{\mathrm{strategy}}(i) \right]. \]

Useful absolute pose features include:

\[ t_i^{\mathrm{rel}}, \qquad t_i-t_e, \qquad R_i^{6D}, \qquad R_i^\top(t_e-t_i). \]

For attention from candidate $i$ to candidate or context item $j$, use relative $\mathrm{SE}(3)$ features:

\[ \xi_{ij}=\log(q_i^{-1}q_j)\in\mathfrak{se}(3), \]

\[ b_{ij} = \phi_{\mathrm{rel}}(\xi_{ij},d_{ij},\theta_{ij},o_{ij}), \]

as an attention bias:

\[ \alpha_{ij} = \operatorname{softmax}_j \left( \frac{Q_iK_j^\top}{\sqrt d} + b_{ij} \right). \]
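The biased softmax can be sketched in a few lines of NumPy. The hard key mask is an addition taken from the mask-isolation requirement elsewhere on this page, and the sketch assumes at least one valid key per query; `biased_attention_weights` is an illustrative name.

```python
import numpy as np

def biased_attention_weights(Q, K, bias, key_mask):
    """alpha_ij = softmax_j(Q_i K_j^T / sqrt(d) + b_ij), with invalid keys forced to zero."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d) + bias
    logits = np.where(key_mask[None, :], logits, -np.inf)  # hard mask on keys
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

Q = np.zeros((2, 4))
K = np.zeros((3, 4))
key_mask = np.array([True, True, False])
bias = np.zeros((2, 3))
bias[0, 1] = np.log(3.0)   # geometric bias favouring key 1 for query 0
alpha = biased_attention_weights(Q, K, bias, key_mask)
```

With identical content logits, the additive bias alone sets the ratio between keys: query 0 weights key 1 three times as heavily as key 0, and the masked key receives exactly zero weight.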

Directional memory gives a complementary $\mathbb S^2$ signal:

\[ M_{\mathrm{dir}}(v) = \sum_{k<t} w_k(v)d_k(v)d_k(v)^\top, \]

\[ \nu_i(v) = 1- \frac{ d_i(v)^\top M_{\mathrm{dir}}(v)d_i(v) }{ \operatorname{tr}M_{\mathrm{dir}}(v)+\epsilon }. \]
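For a single target-local cell $v$, the directional memory and novelty score reduce to a weighted outer-product sum and a quadratic form. The sketch assumes unit-length viewing directions stored as rows; the function names are hypothetical.

```python
import numpy as np

def directional_memory(dirs, weights):
    """M_dir = sum_k w_k d_k d_k^T for past directions dirs (K, 3) and weights (K,)."""
    return np.einsum("k,ki,kj->ij", weights, dirs, dirs)

def directional_novelty(d_i, M_dir, eps=1e-8):
    """nu_i = 1 - d_i^T M_dir d_i / (tr M_dir + eps)."""
    return 1.0 - d_i @ M_dir @ d_i / (np.trace(M_dir) + eps)

past_dirs = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])   # cell seen twice along x
M = directional_memory(past_dirs, np.ones(2))
nu_same = directional_novelty(np.array([1.0, 0.0, 0.0]), M)   # repeats a seen direction
nu_orth = directional_novelty(np.array([0.0, 1.0, 0.0]), M)   # orthogonal direction
nu_empty = directional_novelty(np.array([1.0, 0.0, 0.0]), np.zeros((3, 3)))
```

A repeated direction scores near 0, an orthogonal one scores 1, and an empty memory also yields 1 because $\epsilon$ keeps the ratio defined.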

Pose relations, support overlap, and directional novelty are different channels and should be logged separately.

1.9 Losses That Use Candidate Structure

Do not rely only on pointwise MSE or CORAL-style ordinal loss.

The absolute loss preserves physical calibration:

\[ \mathcal L_{\mathrm{abs}} = \sum_{i\in\mathcal A_t} \ell(\hat r_i,r_i). \]

The pairwise ranking loss directly trains candidate selection:

\[ \mathcal L_{\mathrm{rank}} = \sum_{i,j\in\mathcal A_t} \mathbf 1[r_i>r_j] \log\left( 1+\exp(-(\hat r_i-\hat r_j)) \right). \]
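A direct NumPy transcription of the pairwise loss, restricted to the valid set $\mathcal A_t$ through a boolean mask. The name `pairwise_rank_loss` and the example values are illustrative.

```python
import numpy as np

def pairwise_rank_loss(r_hat, r, valid):
    """Sum over valid pairs with r_i > r_j of log(1 + exp(-(r_hat_i - r_hat_j)))."""
    idx = np.flatnonzero(valid)
    rh, rt = r_hat[idx], r[idx]
    diff = rh[:, None] - rh[None, :]          # r_hat_i - r_hat_j
    pairs = rt[:, None] > rt[None, :]         # indicator 1[r_i > r_j]
    return float(np.sum(pairs * np.logaddexp(0.0, -diff)))

r = np.array([0.3, 0.1, 0.9])
valid = np.array([True, True, False])          # third row is masked out
good = pairwise_rank_loss(np.array([1.0, 0.0, -5.0]), r, valid)   # correct order
bad = pairwise_rank_loss(np.array([0.0, 1.0, -5.0]), r, valid)    # reversed order
```

`np.logaddexp(0, -x)` evaluates $\log(1+e^{-x})$ without overflow for large score gaps.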

A listwise loss compares oracle and predicted soft top-candidate distributions:

\[ p_i^\star = \frac{\exp(r_i/\tau)} {\sum_{j\in\mathcal A_t}\exp(r_j/\tau)}, \qquad \hat p_i = \frac{\exp(\hat r_i/\tau)} {\sum_{j\in\mathcal A_t}\exp(\hat r_j/\tau)}, \]

\[ \mathcal L_{\mathrm{list}} = \mathrm{KL}(p^\star\|\hat p). \]
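The listwise loss can be computed in log space to avoid underflow at small $\tau$. The sketch assumes a shared temperature for oracle and predicted distributions, as in the formula; the default `tau=0.1` is an arbitrary illustrative value.

```python
import numpy as np

def log_softmax(x, tau):
    """Numerically stable log of softmax(x / tau)."""
    z = x / tau
    z = z - z.max()
    return z - np.log(np.sum(np.exp(z)))

def listwise_kl(r_hat, r, valid, tau=0.1):
    """KL(p_star || p_hat) over the valid candidates."""
    log_p_star = log_softmax(r[valid], tau)
    log_p_hat = log_softmax(r_hat[valid], tau)
    return float(np.sum(np.exp(log_p_star) * (log_p_star - log_p_hat)))

r = np.array([0.1, 0.4, 0.2])
valid = np.ones(3, dtype=bool)
kl_exact = listwise_kl(r, r, valid)              # perfect prediction
kl_offset = listwise_kl(r + 0.5, r, valid)       # constant offset on every score
kl_reversed = listwise_kl(r[::-1].copy(), r, valid)
```

The softmax is shift-invariant, so a prediction that is off by a constant still gets zero listwise loss. This is why the listwise term cannot replace $\mathcal L_{\mathrm{abs}}$ when calibrated RRI values are required.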

A smoothness regularizer is valid only within similar visibility regimes:

\[ w_{ij} = \exp\left( -\frac{\|\log(q_i^{-1}q_j)\|^2}{2\sigma_q^2} \right) \mathbf 1[\text{similar visibility regime}], \]

\[ \mathcal L_{\mathrm{smooth}} = \sum_{i,j}w_{ij}(\hat r_i-\hat r_j)^2. \]

Do not smooth across invalidity, occlusion, frustum, or target-support boundaries. Weight this loss down when frustum/target overlap is low.
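A sketch of the gated regularizer, assuming the pairwise twists $\xi_{ij}$ are precomputed and that each candidate carries an integer visibility-regime label. How such labels are assigned is not specified here, so `regime_id` is a placeholder for whatever regime test is actually used.

```python
import numpy as np

def smoothness_loss(r_hat, twists, regime_id, sigma_q=0.5):
    """Sum_ij w_ij (r_hat_i - r_hat_j)^2 with w_ij gated by equal regime labels.

    twists: (N, N, 6) array of xi_ij; regime_id: (N,) hypothetical integer labels.
    """
    sq_dist = np.sum(twists**2, axis=-1)
    same_regime = regime_id[:, None] == regime_id[None, :]
    w = np.exp(-sq_dist / (2.0 * sigma_q**2)) * same_regime
    diff = r_hat[:, None] - r_hat[None, :]
    return float(np.sum(w * diff**2))

r_hat = np.array([0.0, 1.0])
twists = np.zeros((2, 2, 6))                     # two coincident poses
loss_same = smoothness_loss(r_hat, twists, np.array([0, 0]))
loss_cross = smoothness_loss(r_hat, twists, np.array([0, 1]))
```

Two coincident poses with different predictions are penalized when they share a regime and left alone when they do not, which is the behavior required at occlusion and support boundaries.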

1.10 Invariance And Robustness Tests

Candidate-set models must pass row-level invariance checks before any gain is trusted.

test	required behavior
Row shuffle	$f(\Pi X)=\Pi f(X)$.
Duplicate row	$\hat r(q_i;\mathcal Q_t)\approx\hat r(q_i;\mathcal Q_t\uplus\{q_i\})$ for the original row, where $\uplus$ appends a second copy of $q_i$ to the candidate table.
Mask isolation	Invalid/padded rows must not affect valid outputs except through explicit valid-count features.
Candidate-family robustness	Performance should not collapse when mixture proportions shift.
Valid-count sensitivity	Scores should remain calibrated across different numbers of valid rows.

The duplicate test is a diagnostic first, not necessarily a hard loss. It catches attention-normalization artifacts that can corrupt absolute RRI semantics.
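The duplicate-row diagnostic can be written as a model-agnostic probe. The sketch assumes a scorer with signature `score_fn(X, mask) -> scores`, which is a hypothetical interface, and compares two toy scorers: one independent per row, one that subtracts a mean-pooled set context.

```python
import numpy as np

def duplicate_row_drift(score_fn, X, mask, i):
    """Largest change in the original rows' scores after appending a copy of row i."""
    base = score_fn(X, mask)
    X_dup = np.vstack([X, X[i:i + 1]])
    mask_dup = np.append(mask, mask[i])
    dup = score_fn(X_dup, mask_dup)
    return float(np.max(np.abs(dup[:len(X)] - base)))

def independent_scorer(X, mask):
    """Toy scorer that looks at each row in isolation."""
    return X.sum(axis=1)

def mean_context_scorer(X, mask):
    """Toy scorer that subtracts a mean-pooled context over valid rows."""
    ctx = X[mask].mean(axis=0)
    return X.sum(axis=1) - ctx.sum()

X = np.array([[1.0, 0.0], [0.0, 0.0]])
mask = np.array([True, True])
drift_independent = duplicate_row_drift(independent_scorer, X, mask, 0)
drift_context = duplicate_row_drift(mean_context_scorer, X, mask, 0)
```

The independent scorer shows zero drift, while the mean-pooled scorer shifts every score because the duplicate changes the pooled context. Softmax attention has the same sensitivity, since a duplicated key takes a larger share of the normalization.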

1.11 Experimental Ladder

Use a staged ladder so gains can be attributed:

stage	model	purpose
A0	Independent candidate scorer $\hat r_i=f(x_i)$	Seminar/VIN-style baseline.
A1	Candidate-to-state query $u_i=\mathrm{CrossAttn}(x_i,\{T_e,M_t^{\mathrm{ray}},H_t,B_t,E_0^{\mathrm{EVL}}\})$	Test target/map/history context without candidate-candidate value coupling.
A2	DeepSets context $g=\rho(\sum_{j\in\mathcal A_t}\phi(x_j))$, $\hat r_i=f(x_i,g)$	Test whether global candidate-set context helps.
A3	Masked Set Transformer $Z=\mathrm{SetTransformer}(X,m_t)$, $\hat r_i=h(Z_i)$	Test candidate-candidate interaction.
A4	Set Transformer plus $\mathrm{SE}(3)$ relative bias	Test geometric interaction.
A5	Fisher/SCONE overlap bias $o_{ij}^{\mathrm{target}},o_{ij}^{\mathrm{frustum}},o_{ij}^{\mathrm{dir}}$	Test information-overlap structure.
A6	Uncentred residual $Q_H$ over finite candidates	Main finite-horizon use case.

The evaluation should report more than validation loss:

Spearman rank correlation,
Kendall $\tau$,
top-1 and top-3 oracle hit,
oracle RRI of the selected candidate,
expected-RRI calibration,
duplicate-row robustness,
row-shuffle equivariance,
candidate-family robustness,
valid-count sensitivity,
free-shell versus realistic-sampler split.
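The ranking metrics in this list can be computed per candidate table with a few NumPy helpers. The sketch assumes no tied scores (ties would need average ranks) and uses the simple tau-a form of Kendall's $\tau$; all function names and example values are illustrative.

```python
import numpy as np

def ranks(x):
    """Zero-based ranks of x, assuming no ties."""
    order = np.argsort(x)
    out = np.empty(len(x))
    out[order] = np.arange(len(x))
    return out

def spearman(r_hat, r):
    """Spearman rank correlation as the Pearson correlation of ranks."""
    return float(np.corrcoef(ranks(r_hat), ranks(r))[0, 1])

def kendall_tau(r_hat, r):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    n = len(r)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += np.sign(r_hat[i] - r_hat[j]) * np.sign(r[i] - r[j])
    return float(total / (n * (n - 1) / 2))

def top_k_hit(r_hat, r, k):
    """1 if the oracle-best candidate is among the k highest predicted scores."""
    return int(np.argmax(r) in np.argsort(-r_hat)[:k])

def selected_oracle_rri(r_hat, r):
    """Oracle RRI of the candidate the model would select."""
    return float(r[np.argmax(r_hat)])

r = np.array([0.1, 0.4, 0.2, 0.3])        # oracle RRI
r_hat = np.array([0.0, 0.3, 0.5, 0.2])    # predicted RRI
```

Reporting `selected_oracle_rri` alongside the rank correlations matters because a model can rank most of the table well and still pick a poor top candidate, as in this example where the top-1 pick misses the oracle best.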

For $Q_H$, report endpoint target gain, oracle-lookahead recovery, per-scene win rate, per-target support bin, and per-candidate-family selected frequency.

1.12 Failure Modes And Mitigations

risk	failure mode	mitigation
Absolute-label contamination	The same candidate receives different predicted immediate RRI or TD value depending on unrelated sampled rows.	Keep an independent absolute head and use candidate-to-state residuals before candidate-candidate context.
Generator overfitting	The model learns candidate-mixture quirks rather than geometry.	Vary ordering and mixture proportions; evaluate on held-out candidate generators.
Duplicate instability	Adding duplicate candidates changes scores.	Add duplicate stress tests and inspect attention normalization.
Mask leakage	Invalid rows influence valid scores.	Enforce hard masks in attention, argmax, losses, and bootstrap targets.
Shortcut learning	The model learns strategy priors instead of target-RRI geometry.	Report per-strategy RRI histograms and with/without-strategy ablations.

1.13 Thesis Framing

The thesis-safe framing is:

The independent VIN-style scorer estimates point samples of a view-utility field over $\mathrm{SE}(3)$. ARIA-NBV first tests whether candidate-to-state queries over target, map, history, and budget evidence improve finite-horizon value prediction, then ablates whether candidate-set context adds useful interaction without corrupting absolute target-gain calibration.

This is grounded by VIN-NBV’s RRI-based sampled-view ranking [1], FisherRF’s expected-information-gain view of candidate measurements [4], Deep Sets and Set Transformer set modeling [2], [3], and geometric deep learning’s symmetry principle [8]. The implementation should stay conservative: retain the calibrated independent scorer, use candidate-to-state context for the first residual value head, and let candidate-set interaction affect policy context or residual ablations only after duplicate-row and valid-count tests pass.

References

[1] N. Frahm et al., “VIN-NBV: A view introspection network for next-best-view selection.” 2025. Available: https://arxiv.org/abs/2505.06219

[2] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. Salakhutdinov, and A. J. Smola, “Deep sets,” in Advances in Neural Information Processing Systems, 2017. Available: https://papers.nips.cc/paper_files/paper/2017/hash/f22e4747da1aa27e363d86d40ff442fe-Abstract.html

[3] J. Lee, Y. Lee, J. Kim, A. R. Kosiorek, S. Choi, and Y. W. Teh, “Set transformer: A framework for attention-based permutation-invariant neural networks,” in Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research, vol. 97. PMLR, 2019, pp. 3744–3753. Available: https://proceedings.mlr.press/v97/lee19d.html

[4] W. Jiang, B. Lei, and K. Daniilidis, “FisherRF: Active view selection and uncertainty quantification for radiance fields using Fisher information.” 2024. Available: https://arxiv.org/abs/2311.17874

[5] S. Marmin, C. Chevalier, and D. Ginsbourger, “Differentiating the multipoint expected improvement for optimal batch design.” 2015. Available: https://arxiv.org/abs/1503.05509

[6] D. Golovin and A. Krause, “Adaptive submodularity: Theory and applications in active learning and stochastic optimization,” Journal of Artificial Intelligence Research, vol. 42, pp. 427–486, 2011, doi: 10.1613/jair.3278.

[7] M. Garnelo et al., “Conditional neural processes.” 2018. Available: https://arxiv.org/abs/1807.01613

[8] M. M. Bronstein, J. Bruna, T. Cohen, and P. Veličković, “Geometric deep learning: Grids, groups, graphs, geodesics, and gauges.” 2021. Available: https://arxiv.org/abs/2104.13478
