1 vin.model_v3

VIN v3 one-step RRI scorer with evidence-backed components.

This module implements the active VIN one-step baseline. It predicts per-row RRI scores for a finite candidate set and is used as a myopic scorer/control, not as the thesis finite-horizon value model. The most reliable signal in the current implementation comes from pose encoding, EVL voxel evidence, and semidense projection coverage, so v3 keeps a compact deterministic path with optional trajectory context behind a config flag:

  1. Pose encoding (R6D + LFF): Candidate poses are expressed in the reference rig frame T_rig_ref_cam = T_world_rig_ref^{-1} * T_world_cam and encoded as translation plus rotation-6D with Learnable Fourier Features.

  2. Scene field (fixed channels): The voxel field concatenates occ_pr, cent_pr, counts_norm, occ_input, free_input, and new_surface_prior. We normalize counts as counts_norm = log1p(n) / log1p(max(n)) and define unknown = 1 - counts_norm, new_surface_prior = unknown * occ_pr. This compact field was stable in sweep diagnostics and supports voxel-validity gating.

  3. Global context (pose-conditioned attention): Pose embeddings attend over a pooled voxel grid whose keys carry LFF positional encodings in the reference rig frame.

  4. Semidense projection stats (VIN-NBV proxy): We project semidense points into each candidate view to compute coverage, empty fraction, visibility fraction, and depth moments. These features act as a lightweight proxy for frustum attention, are concatenated into the scorer input, and drive a tiny CNN over the projection grid for richer cues.

  5. Voxel projection FiLM: Pooled voxel centers are projected into candidate views and summarized; this drives a light FiLM modulation of the global feature (kept as the only view-conditioned modulation).

  6. Optional trajectory context (disabled by default): Snippet rig poses can be encoded so that candidate embeddings attend over them for motion context. This mirrors the v2 path without forcing it on the baseline.

  7. CORAL head: A shallow MLP plus CORAL ordinal head produces per-candidate RRI scores.
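The formulas in steps 1 and 2 can be sketched in NumPy as follows. This is a minimal illustration and not the module's code: the function names are hypothetical, poses are assumed to be 4x4 homogeneous matrices, the rotation-6D layout is assumed to be the first two columns of the rotation matrix (some libraries use rows instead), and the guard against an all-zero count grid is an added assumption.

```python
import numpy as np


def relative_pose(T_world_rig_ref: np.ndarray, T_world_cam: np.ndarray) -> np.ndarray:
    """T_rig_ref_cam = T_world_rig_ref^{-1} @ T_world_cam for 4x4 poses."""
    return np.linalg.inv(T_world_rig_ref) @ T_world_cam


def pose_to_t_r6d(T: np.ndarray) -> np.ndarray:
    """Return a 9-vector: translation (3) followed by rotation-6D (6).

    Assumption: R6D is the first two columns of R; the real module may differ.
    """
    R, t = T[:3, :3], T[:3, 3]
    return np.concatenate([t, R[:, 0], R[:, 1]])


def surface_prior(counts: np.ndarray, occ_pr: np.ndarray):
    """Compute counts_norm, unknown, and new_surface_prior per voxel."""
    denom = max(np.log1p(counts.max()), 1e-8)  # assumed guard for an empty grid
    counts_norm = np.log1p(counts) / denom
    unknown = 1.0 - counts_norm
    return counts_norm, unknown, unknown * occ_pr


# Usage: a candidate 2 m along +Y from the reference rig, same orientation.
T_ref = np.eye(4)
T_ref[:3, 3] = [1.0, 0.0, 0.0]
T_cam = np.eye(4)
T_cam[:3, 3] = [1.0, 2.0, 0.0]
pose_feat = pose_to_t_r6d(relative_pose(T_ref, T_cam))

counts = np.array([0, 1, 3])
occ_pr = np.array([0.8, 0.4, 0.5])
counts_norm, unknown, new_surface_prior = surface_prior(counts, occ_pr)
```

Under these assumptions, a never-observed voxel (count 0) keeps its full occupancy probability as new_surface_prior, while the most-observed voxel contributes zero. In the real model the 9-vector would then pass through Learnable Fourier Features, which are omitted here.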

The forward contract is actor-visible: candidate poses, EVL features, and semidense observations are inputs; oracle RRI labels are training/evaluation targets. Target-conditioned rollout data may reuse the same architecture family, but matched GT targets and target crops must remain outside actor inputs.

Frame-consistency: Candidate generation applies rotate_yaw_cw90 (a local +Z roll) to poses for UI alignment. EVL backbone outputs do not use this convention. VinModelV3 therefore undoes this rotation before computing pose features. If apply_cw90_correction is enabled, callers must pre-correct p3d_cameras and set cw90_corrected=True.
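To illustrate what undoing a local +Z roll means, the sketch below applies a roll by right-multiplying a camera-to-world pose and then inverts it. The helper names are hypothetical and do not reflect the actual rotate_yaw_cw90 implementation; in particular, the sign of the 90-degree rotation is an assumption.

```python
import numpy as np


def local_z_roll(angle_rad: float) -> np.ndarray:
    """4x4 transform that rotates about the camera's own +Z axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    T = np.eye(4)
    T[:2, :2] = [[c, -s], [s, c]]
    return T


# Assumption: the UI roll is -90 degrees about local +Z; the real sign may differ.
ROLL_CW90 = local_z_roll(-np.pi / 2)


def apply_cw90(T_world_cam: np.ndarray) -> np.ndarray:
    """A local rotation right-multiplies the camera-to-world pose."""
    return T_world_cam @ ROLL_CW90


def undo_cw90(T_world_cam_rolled: np.ndarray) -> np.ndarray:
    """Invert the roll; a pure-rotation transform is inverted by its transpose."""
    return T_world_cam_rolled @ ROLL_CW90.T
```

Because the roll is local and has no translation component, it leaves the camera position unchanged and only affects orientation, which is why it must be removed before computing rotation-dependent pose features.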

NOTE: VIN inputs are typically VinSnippetView instances whose points_world has shape (N, 4) or (N, 5): columns (x, y, z, 1/sigma_d), plus an optional n_obs column. This file enforces the required XYZ + reliability channel contract to avoid silent failure modes.
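A shape check of this kind might look like the following sketch. The function name split_points_world is hypothetical, and NumPy arrays stand in for whatever tensor type the module actually uses.

```python
import numpy as np


def split_points_world(points_world: np.ndarray):
    """Validate the (N, 4) / (N, 5) layout and split it into named parts."""
    if points_world.ndim != 2 or points_world.shape[1] not in (4, 5):
        raise ValueError(
            f"points_world must be (N, 4) or (N, 5), got {tuple(points_world.shape)}"
        )
    xyz = points_world[:, :3]          # world-frame positions
    inv_sigma_d = points_world[:, 3]   # reliability channel, 1/sigma_d
    n_obs = points_world[:, 4] if points_world.shape[1] == 5 else None
    return xyz, inv_sigma_d, n_obs
```

Failing loudly on a bare (N, 3) array prevents a missing reliability channel from being silently misread.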

Candidate orientation uses R6D pose features. Accumulated target visibility, if added for target-conditioned value learning, should be represented as a separate directional memory over view directions, not folded into the pose encoding.

1.1 Classes

Name              Description
VinModelV3Config  Configuration for VinModelV3 (streamlined one-step VIN baseline).
VinModelV3        VIN-Core head for one-step RRI prediction.