1 vin.model_v3

VIN v3 one-step RRI scorer with evidence-backed components.

This module implements the active VIN one-step baseline. It predicts per-row RRI scores for a finite candidate set and is used as a myopic scorer/control, not as the thesis finite-horizon value model. The most reliable signal in the current implementation comes from pose encoding, EVL voxel evidence, and semidense projection coverage, so v3 keeps a compact deterministic path with optional trajectory context behind a config flag:

1) Pose encoding (R6D + LFF): Candidate poses are expressed in the reference rig frame T_rig_ref_cam = T_world_rig_ref^{-1} * T_world_cam and encoded as translation plus rotation-6D with Learnable Fourier Features.
2) Scene field (fixed channels): The voxel field concatenates occ_pr, cent_pr, counts_norm, occ_input, free_input, and new_surface_prior. We normalize counts as counts_norm = log1p(n) / log1p(max(n)) and define unknown = 1 - counts_norm, new_surface_prior = unknown * occ_pr. This compact field was stable in sweep diagnostics and supports voxel-validity gating.
3) Global context (pose-conditioned attention): A pooled voxel grid is attended by pose embeddings, with LFF positional keys in the reference rig frame.
4) Semidense projection stats (VIN-NBV proxy): We project semidense points into each candidate view to compute coverage, empty fraction, visibility fraction, and depth moments. These features act as a lightweight proxy for frustum attention, are concatenated into the scorer input, and drive a tiny CNN over the projection grid for richer cues.
5) Voxel projection FiLM: Pooled voxel centers are projected into candidate views and summarized; this drives a light FiLM modulation of the global feature (kept as the only view-conditioned modulation).
6) Optional trajectory context (disabled by default): Snippet rig poses can be encoded and attended by candidate embeddings to provide motion context, mirroring the v2 path without forcing it on the baseline.
7) CORAL head: A shallow MLP plus CORAL ordinal head produces per-candidate RRI scores.
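
The derived scene-field channels from the list above can be sketched in a few lines of NumPy. This is a minimal illustration of the stated formulas, not the module's actual code: the function name `scene_field_extras` is hypothetical, and returning zeros for an all-zero count grid (where the formula would be 0/0) is an assumption made here.

```python
import numpy as np


def scene_field_extras(counts, occ_pr):
    """Derive counts_norm, unknown, and new_surface_prior from raw channels.

    Hypothetical helper: it follows the formulas in the text, not the
    module's real implementation.
    """
    counts = np.asarray(counts, dtype=np.float64)
    occ_pr = np.asarray(occ_pr, dtype=np.float64)

    # counts_norm = log1p(n) / log1p(max(n))
    denom = np.log1p(counts.max())
    if denom > 0:
        counts_norm = np.log1p(counts) / denom
    else:
        # Assumption: an unobserved grid (all counts zero) maps to 0, not NaN.
        counts_norm = np.zeros_like(counts)

    unknown = 1.0 - counts_norm           # 1 where nothing was observed
    new_surface_prior = unknown * occ_pr  # occupied but not yet observed
    return counts_norm, unknown, new_surface_prior


counts = np.array([0.0, 1.0, 3.0])
occ_pr = np.array([0.8, 0.5, 0.9])
counts_norm, unknown, new_surface_prior = scene_field_extras(counts, occ_pr)
print(counts_norm)        # approximately [0.0, 0.5, 1.0]
print(new_surface_prior)  # approximately [0.8, 0.25, 0.0]
```

The log scaling compresses large observation counts, so a voxel seen once already counts as half observed in this toy example, and new_surface_prior is largest where predicted occupancy is high but evidence is missing.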

The forward contract is actor-visible: candidate poses, EVL features, and semidense observations are inputs; oracle RRI labels are training/evaluation targets. Target-conditioned rollout data may reuse the same architecture family, but matched GT targets and target crops must remain outside actor inputs.

Frame-consistency: Candidate generation applies rotate_yaw_cw90 (a local +Z roll) to poses for UI alignment. EVL backbone outputs do not use this convention. VinModelV3 therefore undoes this rotation before computing pose features. If apply_cw90_correction is enabled, callers must pre-correct p3d_cameras and set cw90_corrected=True.
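
As a rough sketch of the order of operations described above, the snippet below undoes a local roll and then computes relative-pose features. Several things are assumptions for illustration: the helper names are hypothetical, the roll is modelled as a right-multiplication by a rotation about the camera's own +Z axis, the clockwise sign (-90 degrees) is a guess, and R6D is taken as the first two columns of the rotation matrix (one common convention). The Learnable Fourier Features step is omitted.

```python
import numpy as np

# Assumption: "cw90" corresponds to -90 degrees about the local +Z axis.
CW90_ANGLE = -np.pi / 2


def rot_z(angle_rad):
    """4x4 homogeneous rotation about +Z."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    T = np.eye(4)
    T[:2, :2] = [[c, -s], [s, c]]
    return T


def undo_local_roll(T_world_cam_rolled, angle_rad=CW90_ANGLE):
    """Undo a roll that was applied in the camera's own frame.

    A local rotation is a right-multiplication, so it is undone by
    right-multiplying with the inverse rotation.
    """
    return T_world_cam_rolled @ rot_z(-angle_rad)


def relative_pose_features(T_world_rig_ref, T_world_cam):
    """Translation plus rotation-6D of the camera in the reference rig frame."""
    # T_rig_ref_cam = T_world_rig_ref^{-1} * T_world_cam
    T_rig_ref_cam = np.linalg.inv(T_world_rig_ref) @ T_world_cam
    translation = T_rig_ref_cam[:3, 3]
    # R6D convention assumed here: first two rotation columns, concatenated.
    r6d = T_rig_ref_cam[:3, :2].T.reshape(-1)
    return np.concatenate([translation, r6d])  # shape (9,)


# Toy poses: reference rig at x = 1, camera yawed by 0.3 rad at (2, 1, 0).
T_world_rig_ref = np.eye(4)
T_world_rig_ref[:3, 3] = [1.0, 0.0, 0.0]
T_world_cam = rot_z(0.3)
T_world_cam[:3, 3] = [2.0, 1.0, 0.0]

rolled = T_world_cam @ rot_z(CW90_ANGLE)  # what candidate generation would emit
recovered = undo_local_roll(rolled)       # correction before pose features
features = relative_pose_features(T_world_rig_ref, recovered)
```

Because the roll acts in the camera frame, undoing it changes only the rotation part of the pose; the translation in the reference rig frame is the same with or without the correction.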

NOTE: VIN inputs are typically VinSnippetView instances whose points_world has shape (N, 4) or (N, 5): columns (x, y, z, 1/sigma_d), plus an optional n_obs column. This file enforces the required XYZ + reliability channel contract to avoid silent failure modes.
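
A shape check of this kind might look like the sketch below. The function name `split_points_world` and the error message are hypothetical; only the (N, 4) / (N, 5) layout comes from the note above.

```python
import numpy as np


def split_points_world(points_world):
    """Validate the points_world layout and split it into named channels.

    Hypothetical helper illustrating the XYZ + reliability contract:
    columns are (x, y, z, 1/sigma_d) with an optional fifth n_obs column.
    """
    pts = np.asarray(points_world, dtype=np.float64)
    if pts.ndim != 2 or pts.shape[1] not in (4, 5):
        # Fail loudly rather than silently misreading channels.
        raise ValueError(
            f"points_world must have shape (N, 4) or (N, 5), got {pts.shape}"
        )
    xyz = pts[:, :3]
    inv_sigma_d = pts[:, 3]
    n_obs = pts[:, 4] if pts.shape[1] == 5 else None
    return xyz, inv_sigma_d, n_obs


# Two points with the optional n_obs column present.
points = np.array([
    [0.0, 1.0, 2.0, 10.0, 3.0],
    [1.0, 1.0, 1.0, 5.0, 7.0],
])
xyz, inv_sigma_d, n_obs = split_points_world(points)
```

Raising on a bare (N, 3) array is the point of the contract: without the reliability channel, downstream features would otherwise be computed from the wrong columns without any error.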

Candidate orientation uses R6D pose features. Accumulated target visibility, if added for target-conditioned value learning, should be represented as a separate directional memory over view directions, not folded into the pose encoding.

1.1 Classes

| Name | Description |
| --- | --- |
| VinModelV3Config | Configuration for `VinModelV3` (streamlined one-step VIN baseline). |
| VinModelV3 | VIN-Core head for one-step RRI prediction. |
