VIN v2 Feature + Encoding Proposals

1 Scope

This document summarizes concrete feature sets and encoding schemas to explore for VinModelV2, with a focus on mitigating EVL’s local voxel grid limits and improving RRI prediction quality. It is intended as an implementation-oriented checklist for ablations and incremental upgrades.

2 Context and constraints

  • EVL exposes a local, fixed-size voxel grid (default ~4x4x4 m) aligned to the last pose and anchored in a gravity-aligned frame. This local extent is expected but means many candidates can be partially or fully out-of-bounds. [2]
  • VinModelV2 currently uses pose R6D + translation, scene field channels from EVL heads/evidence, and pose-conditioned global pooling. See oracle_rri/oracle_rri/vin/model_v2.py.
  • VIN predicts RRI as an ordinal regression problem (Wikipedia: Ordinal regression) via CORAL [3].

3 Design goals

  • Provide high-signal scene cues for RRI without over-reliance on heavy neck features.
  • Remain robust when candidates fall outside the EVL voxel extent.
  • Preserve interpretability (feature channels map to clear geometry/evidence semantics).
  • Keep additions incremental and ablation-friendly.

4 Proposed feature bundles

The bundles below are additive. Start with P0/P1; add P2/P3 only if needed; P4 can be always-on, and P5 is optional when appearance priors are desired.

4.1 P0: Coverage + occupancy core (baseline++)

Inputs (from EvlBackboneOutput):

  • occ_pr, occ_input, free_input, counts_norm, unknown, new_surface_prior

Candidate-level scalars:

  • valid_frac (fraction of in-bounds samples),
  • center_in_bounds (center validity),
  • signed distance to voxel bounds (x/y/z).

Motivation: Explicitly communicates what is known vs. unknown in the local grid, and when the voxel context is unreliable.
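
A minimal sketch of the out-of-bounds scalars, assuming gravity-aligned, axis-aligned voxel bounds and candidate positions expressed in the voxel-grid frame (function and argument names are illustrative); valid_frac itself comes from counting in-bounds field samples during pooling:

import torch
from torch import Tensor

def oob_scalars(cand_pos: Tensor, bounds_min: Tensor, bounds_max: Tensor) -> Tensor:
    """Out-of-bounds scalars for candidate centers.

    cand_pos:   (B, N, 3) candidate positions in the voxel-grid frame.
    bounds_min / bounds_max: (3,) grid extents in the same frame.
    Returns (B, N, 4): signed distance to the nearest bound per axis
    (negative = outside) plus a binary center_in_bounds flag.
    """
    signed = torch.minimum(cand_pos - bounds_min, bounds_max - cand_pos)  # (B, N, 3)
    center_in_bounds = (signed >= 0).all(dim=-1, keepdim=True).float()    # (B, N, 1)
    return torch.cat([signed, center_in_bounds], dim=-1)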

4.2 P1: Boundary + uncertainty cues (derived channels)

Derived channels:

  • surface_boundary = occ_pr * free_input
  • uncertainty = occ_pr * (1 - occ_pr)

Motivation: Lightweight proxies for surface complexity and model uncertainty, without adding neck features.
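
A minimal sketch of the derived channels, assuming occ_pr and free_input are co-registered single-channel volumes of shape (B, 1, D, H, W) from EvlBackboneOutput:

import torch
from torch import Tensor

def derived_channels(occ_pr: Tensor, free_input: Tensor) -> Tensor:
    """Stack boundary and uncertainty proxies as extra scene-field channels."""
    surface_boundary = occ_pr * free_input   # high where predicted occupancy borders observed free space
    uncertainty = occ_pr * (1.0 - occ_pr)    # Bernoulli variance of the occupancy estimate
    return torch.cat([surface_boundary, uncertainty], dim=1)  # concat along the channel dim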

4.3 P2: Compressed neck features

Inputs:

  • occ_feat (and optionally obb_feat) with 1x1x1 Conv compression (e.g., 64 -> 8/16 channels).

Motivation: Adds richer 3D semantics while keeping compute in check. Use only if P0/P1 saturate.
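
A minimal sketch of the compression, assuming occ_feat is a (B, 64, D, H, W) neck feature volume (the output channel count is an ablation knob):

import torch
from torch import nn, Tensor

class NeckCompressor(nn.Module):
    """1x1x1 convolution that shrinks neck channels before scene-field fusion."""

    def __init__(self, in_channels: int = 64, out_channels: int = 8):
        super().__init__()
        self.proj = nn.Conv3d(in_channels, out_channels, kernel_size=1)

    def forward(self, occ_feat: Tensor) -> Tensor:
        # (B, 64, D, H, W) -> (B, 8, D, H, W); the spatial layout is unchanged.
        return self.proj(occ_feat)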

4.4 P3: Entity-aware cues (OBB-aware VIN)

Inputs:

  • obb_pred, obb_pred_probs_full (decoded OBBs + class probs).

Candidate-level scalars:

  • distance to nearest OBB center,
  • view alignment to OBB axes,
  • fraction of the candidate view frustum (Wikipedia: View frustum) intersecting the top-k OBBs,
  • top-k semantic class probabilities.

Motivation: Supports entity-aware NBV objectives and task-specific weighting.
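
A minimal sketch of two of the scalars (distance to the nearest OBB center and view alignment), assuming decoded OBB centers and unit candidate forward axes in a shared frame (names illustrative); the frustum-intersection fraction needs the full OBB geometry and is omitted:

import torch
import torch.nn.functional as F
from torch import Tensor

def entity_scalars(cand_pos: Tensor, cand_forward: Tensor, obb_centers: Tensor) -> Tensor:
    """Distance and view alignment to the nearest decoded OBB.

    cand_pos:     (B, N, 3) candidate positions.
    cand_forward: (B, N, 3) unit viewing directions.
    obb_centers:  (B, M, 3) decoded OBB centers (from obb_pred).
    Returns (B, N, 2): [distance to nearest OBB center, cosine view alignment to it].
    """
    diff = obb_centers[:, None] - cand_pos[:, :, None]      # (B, N, M, 3) candidate -> OBB offsets
    dist = diff.norm(dim=-1)                                # (B, N, M)
    nearest_dist, nearest_idx = dist.min(dim=-1)            # (B, N)
    gather_idx = nearest_idx[..., None, None].expand(-1, -1, -1, 3)
    nearest_dir = F.normalize(torch.gather(diff, 2, gather_idx).squeeze(2), dim=-1)  # (B, N, 3)
    align = (nearest_dir * cand_forward).sum(dim=-1)        # cosine between view axis and OBB direction
    return torch.stack([nearest_dist, align], dim=-1)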

4.5 P4: Semidense projection features (always-on, voxel-independent)

Inputs:

  • points/p3s_world (semi-dense SLAM points, i.e. a point cloud; Wikipedia: Point cloud) plus camera intrinsics and candidate poses for projection.

Per-candidate features:

  • projected coverage/empty fraction (F_empty),
  • depth statistics (mean/variance/percentiles),
  • visibility density per view (histogram or angular bins).

Motivation: These features do not depend on the voxel extent, so they remain informative even when the EVL grid is small. They mirror VIN-NBV’s view-projection features and can be computed for every candidate. [1]

4.5.1 Theory: projection-based coverage as an RRI proxy

Given a candidate camera with intrinsics/extrinsics and a point cloud of the current reconstruction, we project 3D points into the candidate image plane using a standard camera model (pinhole or fisheye; Wikipedia: Pinhole camera model). Let I be the image grid, and M(q) the set of pixels hit by at least one projected 3D point for candidate q. A simple coverage proxy is:

\[ F_{\text{empty}}(q) = 1 - \frac{|M(q)|}{|I|} \]

Low coverage (high F_empty) indicates that large parts of the candidate image would reveal new geometry, which correlates with potential reconstruction improvement. This is the core intuition behind VIN-NBV’s coverage feature and remains useful even when voxel features are unreliable.

4.5.2 Suggested injection point (VinModelV2)

Treat semidense projection features as per-candidate scalars and inject them in two places:

  1. FiLM-modulate the pose-conditioned global voxel context (global_feat) so the semidense coverage cues can reweight voxel features when the EVL grid is sparse or out-of-bounds.
  2. Concatenate the raw projection features alongside pose_enc and global_feat before the MLP head.

The current VinModelV2 implementation uses a lightweight FiLM [6] (linear -> γ/β) plus a direct concat:

# VinModelV2._forward_impl
sem_proj = self._semidense_projection_features(...)          # per-candidate projection scalars
gamma, beta = self.sem_proj_film(sem_proj).chunk(2, dim=-1)   # linear layer -> FiLM γ/β
global_feat = global_feat * (1.0 + gamma) + beta              # modulate the pooled voxel context
parts.append(sem_proj)                                        # direct concat before the MLP head

This keeps the scene field untouched and avoids reintroducing heavy 3D volumes.

4.5.3 Practical recipe (using EfmPointsView::collapse_points)

Use EfmPointsView.collapse_points() to collapse time and subsample the point cloud before projection:

import torch
from torch import Tensor

from efm3d.aria.camera import CameraTW
from efm3d.aria.pose import PoseTW

# EfmPointsView is the project-internal semi-dense points view (import path omitted).

def project_semidense_features(
    points_view: "EfmPointsView",
    candidate_poses_world_cam: PoseTW,   # world <- cam (B, N, 12)
    candidate_camera: CameraTW,          # intrinsics (per candidate or shared)
    image_size: tuple[int, int],         # (H, W)
    max_points: int = 50000,
) -> Tensor:
    """Per-candidate (F_empty, depth mean, depth variance), shape (B, N, 3)."""
    # 1) Collapse points across time (subsampled)
    pts_world = points_view.collapse_points(max_points=max_points)  # (K, 3)
    if pts_world.numel() == 0:
        return torch.zeros((candidate_poses_world_cam.shape[0],
                            candidate_poses_world_cam.shape[1],
                            3), device=pts_world.device)

    # 2) Transform into each candidate camera frame
    pose_cam_world = candidate_poses_world_cam.inverse()  # cam <- world
    pts_cam = pose_cam_world[:, :, None] * pts_world      # (B, N, K, 3)

    # 3) Project to pixels + validity mask (in-front + in-image + in-radius)
    p2d, valid = candidate_camera.project(pts_cam)        # (B, N, K, 2), (B, N, K)

    # 4) Coverage / empty fraction (one possible realization using scatter)
    H, W = image_size
    in_bounds = valid & (p2d[..., 0] >= 0) & (p2d[..., 0] < W) & (p2d[..., 1] >= 0) & (p2d[..., 1] < H)
    pix = p2d[..., 1].long() * W + p2d[..., 0].long()               # flat pixel index, (B, N, K)
    idx = torch.where(in_bounds, pix, torch.full_like(pix, H * W))  # invalid points -> sink bin
    covered = torch.zeros(*idx.shape[:2], H * W + 1, device=idx.device)
    covered.scatter_(-1, idx, 1.0)                                  # mark each hit pixel once
    coverage = covered[..., : H * W].sum(dim=-1) / float(H * W)     # |M(q)| / |I|
    f_empty = 1.0 - coverage                                        # (B, N)

    # 5) Depth statistics over in-bounds points
    depth = pts_cam[..., 2]
    num = in_bounds.sum(dim=-1).clamp(min=1)
    depth_mean = (depth * in_bounds).sum(dim=-1) / num
    depth_var = ((depth - depth_mean[..., None]) ** 2 * in_bounds).sum(dim=-1) / num

    return torch.stack([f_empty, depth_mean, depth_var], dim=-1)    # (B, N, 3)

Notes:

  • CameraTW.project supports both pinhole and fisheye models and returns a validity mask (in-front + in-image + in-radius). This avoids manual distortion handling. See external/efm3d/efm3d/aria/camera.py for details.
  • If candidate_camera is shared across candidates, broadcast the camera to (B, N, ...) or slice per candidate frame index.
  • As an alternative to the scatter used above, coverage can also be computed with torch.bincount on p2d_int = (y * W + x) with per-candidate batching.

4.5.4 Optional: PointNeXt-S semidense encoder (global point embedding)

For a stronger geometric prior than simple projection stats, we can encode the semi-dense point cloud with PointNeXt-S (small, pretrained, robust to non-uniform sampling). We subsample to ~3k points (from the padded 50k) and encode the point cloud once per snippet, then concatenate the embedding to the VIN head input.

Key references:

  • PointNeXt paper: [4]
  • OpenPoints model zoo (pretrained PointNeXt-S configs/weights): [5]

Implementation notes:

  • The encoder is optional and only active if VinModelV2Config.point_encoder is set.
  • Use EfmPointsView.collapse_points(max_points=3000) before encoding.
  • Transform points into the reference rig frame for consistent pose alignment.
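
A minimal wiring sketch for the optional encoder, assuming point_encoder is a module (e.g. a PointNeXt-S wrapper from the OpenPoints zoo [5], not reproduced here) that maps (B, K, 3) points to a (B, D) embedding; the helper name and the PoseTW broadcast over points are assumptions that follow the projection recipe above:

import torch
from torch import nn, Tensor

def encode_semidense_points(
    point_encoder: nn.Module,   # assumed interface: (B, K, 3) -> (B, D)
    points_view,                # EfmPointsView (project-internal type)
    pose_rig_world,             # PoseTW, reference rig <- world
    num_candidates: int,
) -> Tensor:
    """Encode the snippet point cloud once and broadcast it to all candidates."""
    pts_world = points_view.collapse_points(max_points=3000)  # (K, 3) subsampled
    pts_rig = pose_rig_world * pts_world                      # align to the reference rig frame
    emb = point_encoder(pts_rig[None])                        # (1, D), one pass per snippet
    return emb[:, None].expand(-1, num_candidates, -1)        # (1, N, D), concat to the head input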

4.5.5 Optional: encoder for past ego trajectory (global trajectory embedding)

Use a tiny transformer to encode the past ego-trajectory frames in the reference rig frame.

@dataclass(slots=True)
class EfmTrajectoryView:
    """World-frame rig trajectory aligned to snippet frames."""

    t_world_rig: PoseTW
    """Rig SE(3) poses per frame (world←rig)."""
    time_ns: Tensor
    """``Tensor["F", int64]`` pose timestamps."""
    gravity_in_world: Tensor
    """``Tensor["3", float32]`` gravity vector in world frame (aligned to [0,0,-9.81])."""

4.6 P5: 2D token reuse (appearance priors)

Inputs:

  • feat2d_upsampled or token2d from EVL outputs.

Mechanism:

  • Project semidense points into the current RGB frames, sample 2D features at the projected pixels, then reproject to the candidate view or pool directly per candidate.

Motivation: Adds appearance/texture cues without expanding the voxel grid.
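
A minimal sketch of the 2D sampling step, assuming feat2d is a (B, C, H, W) feature map (e.g. feat2d_upsampled) and p2d are pixel coordinates from CameraTW.project; names are illustrative:

import torch
import torch.nn.functional as F
from torch import Tensor

def sample_2d_features(feat2d: Tensor, p2d: Tensor, image_size: tuple[int, int]) -> Tensor:
    """Sample per-point 2D features at projected pixel locations.

    feat2d: (B, C, H, W) 2D features for the current RGB frame.
    p2d:    (B, K, 2) projected pixel coordinates (x, y).
    Returns (B, K, C) per-point appearance features (pool per candidate afterwards).
    """
    H, W = image_size
    # Normalize pixel coordinates to [-1, 1]; out-of-image points sample zeros.
    grid = torch.stack(
        [p2d[..., 0] / (W - 1) * 2 - 1, p2d[..., 1] / (H - 1) * 2 - 1], dim=-1
    )                                                                      # (B, K, 2)
    sampled = F.grid_sample(feat2d, grid[:, :, None], align_corners=True)  # (B, C, K, 1)
    return sampled[..., 0].permute(0, 2, 1)                                # (B, K, C)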

5 Encoding schemas

5.1 Candidate-relative positional keys

Idea: Build pos_grid in the candidate frame so attention keys and candidate queries share the same spatial basis. This should improve alignment when the candidate is far from the reference frame.

  • Implementation anchor: VinModelV2Config.tf_pos_grid_in_candidate_frame (TODO in model_v2.py).
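
A minimal sketch of the idea, assuming voxel_centers_world holds the (T, 3) world-frame coordinates of the pooled voxel tokens and following the same PoseTW broadcasting pattern as the projection recipe above (names illustrative):

import torch
from torch import Tensor

def candidate_frame_pos_grid(voxel_centers_world: Tensor, candidate_poses_world_cam) -> Tensor:
    """Express voxel-token positions in each candidate's camera frame.

    voxel_centers_world:       (T, 3) world-frame centers of the pooled voxel tokens.
    candidate_poses_world_cam: PoseTW, world <- cam, shape (B, N).
    Returns (B, N, T, 3) positions usable as positional features for the attention keys.
    """
    pose_cam_world = candidate_poses_world_cam.inverse()     # cam <- world
    return pose_cam_world[:, :, None] * voxel_centers_world  # broadcast over the T token centers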

5.2 Hybrid pose encoding (R6D + shell)

Idea: Combine the current R6D+translation encoding with shell features (direction u, forward f, radius r, alignment). This keeps rotation/translation expressivity while adding geometry cues independent of voxel extent.
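
A minimal sketch of the shell features, assuming candidate positions, unit forward axes, and a scene anchor (e.g. the voxel-grid center) are given in a common frame (names illustrative):

import torch
from torch import Tensor

def shell_features(cand_pos: Tensor, cand_forward: Tensor, anchor: Tensor) -> Tensor:
    """Shell encoding of a candidate pose relative to a scene anchor.

    cand_pos:     (B, N, 3) candidate camera centers.
    cand_forward: (B, N, 3) unit forward (viewing) axes.
    anchor:       (B, 3) anchor point, e.g. the voxel-grid center.
    Returns (B, N, 8): [direction u (3), forward f (3), radius r (1), alignment (1)].
    """
    offset = cand_pos - anchor[:, None]                    # (B, N, 3)
    r = offset.norm(dim=-1, keepdim=True)                  # radius from anchor to candidate
    u = offset / r.clamp(min=1e-6)                         # unit direction anchor -> candidate
    align = (-u * cand_forward).sum(dim=-1, keepdim=True)  # 1 when the candidate looks back at the anchor
    return torch.cat([u, cand_forward, r, align], dim=-1)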

5.3 Distance-to-voxel features

Idea: Add explicit scalars for candidate distance to voxel center and normalized signed distance to bounds (x/y/z). This provides a direct indicator of voxel-context reliability.

5.4 Multi-scale pooling

Idea: Pool the voxel field at 2-3 grid sizes and concatenate. This helps when only a subregion of the voxel grid is informative.
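
A minimal sketch, assuming the scene field is a (B, C, D, H, W) volume (the pooling sizes are illustrative):

import torch
import torch.nn.functional as F
from torch import Tensor

def multiscale_pool(scene_field: Tensor, sizes: tuple[int, ...] = (1, 2, 4)) -> Tensor:
    """Pool a (B, C, D, H, W) voxel field at several grid sizes and concatenate.

    Returns (B, C * sum(s**3 for s in sizes)) multi-scale context features.
    """
    pooled = [
        F.adaptive_avg_pool3d(scene_field, s).flatten(start_dim=1)  # (B, C * s^3)
        for s in sizes
    ]
    return torch.cat(pooled, dim=-1)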

5.5 Pose-conditioned attention diagnostics

The global pooling uses multi-head attention (Wikipedia: Attention (machine learning)). Exposing attention weights lets us assess whether candidates attend to spatially meaningful voxels and whether out-of-grid candidates produce diffuse (uninformative) attention.

5.5.1 Theory: attention concentration as a diagnostic

Let A(q) be the attention matrix for candidate q (queries = candidate pose tokens, keys = pooled voxel tokens). If attention is well-aligned, A(q) should concentrate on a few spatial tokens. If the candidate is out-of-grid or the positional encoding is misaligned, A(q) becomes uniform.

A simple diagnostic is attention entropy:

\[ H(q) = -\sum_i A_i(q) \log (A_i(q) + \varepsilon) \]

Low entropy indicates focused attention; high entropy suggests that the attention mechanism has no spatial anchor for that candidate.

5.5.2 Implementation sketch (PoseConditionedGlobalPool)

torch.nn.MultiheadAttention can return per-head attention weights by passing need_weights=True and average_attn_weights=False:

# oracle_rri/oracle_rri/vin/model_v2.py
attn_out, attn_weights = self.attn(
    queries, keys, keys,
    need_weights=True,
    average_attn_weights=False,  # keep per-head weights
)
# attn_weights: (B, num_heads, N, T)

From this, compute diagnostics per candidate:

weights = attn_weights.mean(dim=1)  # (B, N, T) average heads
entropy = -(weights * (weights + 1e-9).log()).sum(dim=-1)  # (B, N)
peak = weights.max(dim=-1).values  # (B, N)

Log entropy and peak alongside valid_frac to check whether attention degenerates when coverage is low.

6 Out-of-voxel mitigation recipe

A simple gating strategy can prevent over-confident predictions when voxel coverage is low:

  1. Compute valid_frac and center_in_bounds.
  2. If valid_frac < tau (e.g., 0.2), blend voxel features with P4/P5 features (sketched after this list):
    • feat = alpha * voxel_feat + (1 - alpha) * semidense_feat where alpha = clamp(valid_frac / tau, 0, 1).
  3. Add a binary OOB indicator as input to the head.
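
A minimal sketch of the blend in step 2, assuming per-candidate valid_frac and voxel/semidense feature vectors already projected to a common width (names illustrative):

import torch
from torch import Tensor

def gated_blend(voxel_feat: Tensor, semidense_feat: Tensor, valid_frac: Tensor, tau: float = 0.2) -> Tensor:
    """Blend voxel and semidense features based on voxel coverage.

    voxel_feat / semidense_feat: (B, N, D) per-candidate features.
    valid_frac:                  (B, N) fraction of in-bounds field samples.
    """
    alpha = (valid_frac / tau).clamp(0.0, 1.0)[..., None]  # 1 when voxel coverage is adequate
    return alpha * voxel_feat + (1.0 - alpha) * semidense_feat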

7 Suggested ablation order

  1. P0 baseline++ (coverage + occupancy + explicit OOB scalars).
  2. P1 derived channels.
  3. P2 compressed neck features (if P1 gains plateau).
  4. P4 semidense projection features (always-on).
  5. P3 entity-aware cues.
  6. P5 2D token reuse (heavier pipeline but potentially strong gains).

8 Implementation anchors

  • Scene field construction: oracle_rri/oracle_rri/vin/model_v2.py::_build_scene_field_v2.
  • Global pooling + positional keys: PoseConditionedGlobalPool in model_v2.py.
  • EVL feature contract: oracle_rri/oracle_rri/vin/types.py::EvlBackboneOutput.
  • VIN documentation: VIN on EVL.

References

[1]
N. Frahm et al., “VIN-NBV: A view introspection network for next-best-view selection.” 2025. Available: https://arxiv.org/abs/2505.06219
[2]
J. Straub, D. DeTone, T. Shen, N. Yang, C. Sweeney, and R. Newcombe, “EFM3D: A benchmark for measuring progress towards 3D egocentric foundation models.” 2024. Available: https://arxiv.org/abs/2406.10224
[3]
W. Cao, V. Mirjalili, and S. Raschka, “Rank consistent ordinal regression for neural networks with application to age estimation.” 2019. Available: https://arxiv.org/abs/1901.07884
[4]
G. Qian et al., “PointNeXt: Revisiting PointNet++ with improved training and scaling strategies.” 2022. Available: https://arxiv.org/abs/2206.04670
[5]
OpenPoints contributors, “OpenPoints: PointNeXt model zoo.” [Online]. Available: https://guochengqian.github.io/PointNeXt/modelzoo/
[6]
E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. Courville, “FiLM: Visual reasoning with a general conditioning layer,” in Proceedings of the AAAI conference on artificial intelligence, 2018. Available: https://arxiv.org/abs/1709.07871