1 vin.model_v3
VIN v3 one-step RRI scorer with evidence-backed components.
This module implements the active VIN one-step baseline. It predicts per-row RRI scores for a finite candidate set and is used as a myopic scorer/control, not as the thesis finite-horizon value model. The most reliable signal in the current implementation comes from pose encoding, EVL voxel evidence, and semidense projection coverage, so v3 keeps a compact deterministic path with optional trajectory context behind a config flag:
Pose encoding (R6D + LFF): Candidate poses are expressed in the reference rig frame T_rig_ref_cam = T_world_rig_ref^{-1} * T_world_cam and encoded as translation plus rotation-6D with Learnable Fourier Features.
Scene field (fixed channels): The voxel field concatenates occ_pr, cent_pr, counts_norm, occ_input, free_input, and new_surface_prior. We normalize counts as counts_norm = log1p(n) / log1p(max(n)) and define unknown = 1 - counts_norm, new_surface_prior = unknown * occ_pr. This compact field was stable in sweep diagnostics and supports voxel-validity gating.
Global context (pose-conditioned attention): A pooled voxel grid is attended by pose embeddings, with LFF positional keys in the reference rig frame.
Semidense projection stats (VIN-NBV proxy): We project semidense points into each candidate view to compute coverage, empty fraction, visibility fraction, and depth moments. These features act as a lightweight proxy for frustum attention, are concatenated into the scorer input, and drive a tiny CNN over the projection grid for richer cues.
Voxel projection FiLM: Pooled voxel centers are projected into candidate views and summarized; this drives a light FiLM modulation of the global feature (kept as the only view-conditioned modulation).
Optional trajectory context (disabled by default): Snippet rig poses can be encoded and attended by candidate embeddings to provide motion context, mirroring the v2 path without forcing it on the baseline.
CORAL head: A shallow MLP plus CORAL ordinal head produces per-candidate RRI scores.
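The scene-field item in the list above gives closed-form definitions for its derived channels, so those can be sketched directly. The following is a minimal NumPy illustration, not the module's own code: the function name `build_scene_field`, the channel-first layout, and the handling of an all-zero count grid are assumptions made for the example.

```python
import numpy as np


def build_scene_field(occ_pr, cent_pr, counts, occ_input, free_input):
    """Stack the six fixed channels into one channel-first array.

    Hypothetical helper: the name and the (6, ...) layout are assumptions.
    """
    counts = np.asarray(counts, dtype=np.float64)
    max_count = counts.max()
    if max_count > 0:
        # counts_norm = log1p(n) / log1p(max(n)), as defined in the text.
        counts_norm = np.log1p(counts) / np.log1p(max_count)
    else:
        # Assumption: with no observations at all, every voxel is unknown.
        counts_norm = np.zeros_like(counts)

    unknown = 1.0 - counts_norm
    new_surface_prior = unknown * np.asarray(occ_pr, dtype=np.float64)

    channels = [occ_pr, cent_pr, counts_norm, occ_input, free_input, new_surface_prior]
    return np.stack([np.asarray(c, dtype=np.float64) for c in channels], axis=0)
```

Under this definition, a voxel observed as often as the most-observed voxel has `unknown = 0` and contributes nothing to `new_surface_prior`, while an unobserved voxel passes `occ_pr` through unchanged.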
The forward contract is actor-visible: candidate poses, EVL features, and semidense observations are inputs; oracle RRI labels are training/evaluation targets. Target-conditioned rollout data may reuse the same architecture family, but matched GT targets and target crops must remain outside actor inputs.
Frame-consistency: Candidate generation applies rotate_yaw_cw90 (a local +Z roll) to poses for UI alignment. EVL backbone outputs do not use this convention. VinModelV3 therefore undoes this rotation before computing pose features. If apply_cw90_correction is enabled, callers must pre-correct p3d_cameras and set cw90_corrected=True.
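To make the frame-consistency step concrete, here is a hedged sketch of applying and undoing a local roll about the camera +Z axis on a 4x4 camera-to-world matrix. It assumes that "local" means right-multiplication and that clockwise corresponds to an angle of -90 degrees; neither detail is confirmed by the text, and `apply_cw90` / `undo_cw90` are illustrative names, not the module's functions.

```python
import numpy as np

# Assumption: "cw90" is a -90 degree rotation about the local +Z axis.
CW90_RAD = -np.pi / 2


def local_z_roll(angle_rad):
    """Homogeneous 4x4 rotation about +Z by angle_rad."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    T = np.eye(4)
    T[:2, :2] = [[c, -s], [s, c]]
    return T


def apply_cw90(T_world_cam):
    """Roll the camera about its own +Z axis (right-multiplication)."""
    return T_world_cam @ local_z_roll(CW90_RAD)


def undo_cw90(T_world_cam_rolled):
    """Invert apply_cw90 by rolling back through the opposite angle."""
    return T_world_cam_rolled @ local_z_roll(-CW90_RAD)
```

Because the roll is applied in the camera's own frame, it changes neither the camera position nor its +Z axis; only the in-plane X and Y axes rotate, which is why leaving it uncorrected would skew rotation features without moving the candidate.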
NOTE: VIN inputs are typically VinSnippetView instances whose points_world has shape (N, 4) or (N, 5): the columns are (x, y, z, 1/sigma_d), with n_obs as the optional fifth column. This file enforces the required XYZ + reliability channel contract to avoid silent failure modes.
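A shape check of the kind this note describes might look like the sketch below. It is an illustration only: `split_points_world` is a hypothetical helper, and the exact checks and error messages in the real module may differ.

```python
import numpy as np


def split_points_world(points_world):
    """Validate an (N, 4) or (N, 5) point array and split it into parts.

    Returns (xyz, inv_sigma_d, n_obs), where n_obs is None for 4-column input.
    Hypothetical helper written for illustration.
    """
    points = np.asarray(points_world)
    if points.ndim != 2 or points.shape[1] not in (4, 5):
        raise ValueError(
            f"points_world must have shape (N, 4) or (N, 5), got {points.shape}"
        )
    xyz = points[:, :3]
    inv_sigma_d = points[:, 3]  # reliability channel, 1/sigma_d
    n_obs = points[:, 4] if points.shape[1] == 5 else None
    return xyz, inv_sigma_d, n_obs
```

Failing loudly on a bare (N, 3) array is the point of such a check: without it, a missing reliability column could be mistaken for a coordinate or silently broadcast.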
Candidate orientation uses R6D pose features. Accumulated target visibility, if added for target-conditioned value learning, should be represented as a separate directional memory over view directions, not folded into the pose encoding.
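For reference, the raw pose features that feed the encoder can be sketched as follows, using the relative transform T_rig_ref_cam = T_world_rig_ref^{-1} * T_world_cam from the pose-encoding description. This is a minimal NumPy sketch under stated assumptions: the function names are hypothetical, the 6D rotation is taken as the first two columns of the rotation matrix (some libraries use the first two rows instead), and the Learnable Fourier Features stage is omitted.

```python
import numpy as np


def relative_pose(T_world_rig_ref, T_world_cam):
    """Express the candidate camera pose in the reference rig frame."""
    return np.linalg.inv(T_world_rig_ref) @ T_world_cam


def pose_features(T_rig_ref_cam):
    """Return a 9-vector: translation (3) followed by rotation-6D (6).

    Assumption: rotation-6D is the first two columns of R, concatenated.
    """
    R = T_rig_ref_cam[:3, :3]
    t = T_rig_ref_cam[:3, 3]
    r6d = np.concatenate([R[:, 0], R[:, 1]])
    return np.concatenate([t, r6d])
```

The third rotation column is dropped because it can be recovered as the cross product of the first two, so the 6D form stays continuous without redundancy. A separate directional memory, if added later, would sit alongside this 9-vector rather than extend it.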
1.1 Classes
| Name | Description |
|---|---|
| VinModelV3Config | Configuration for VinModelV3 (streamlined one-step VIN baseline). |
| VinModelV3 | VIN-Core head for one-step RRI prediction. |