VIN v2 Feature + Encoding Proposals
1 Scope
This document summarizes concrete feature sets and encoding schemas to explore for VinModelV2, with a focus on mitigating EVL’s local voxel grid limits and improving RRI prediction quality. It is intended as an implementation-oriented checklist for ablations and incremental upgrades.
Related context:
- Implementation overview: VIN on EVL
- EFM3D/EVL background: EFM3D
- VIN-NBV paper summary: VIN-NBV [1]
2 Context and constraints
- EVL exposes a local, fixed-size voxel grid (default ~4x4x4 m) aligned to the last pose and anchored in a gravity-aligned frame. This local extent is expected but means many candidates can be partially or fully out-of-bounds. [2]
- VinModelV2 currently uses pose R6D + translation, scene field channels from EVL heads/evidence, and pose-conditioned global pooling. See oracle_rri/oracle_rri/vin/model_v2.py.
- VIN predicts RRI as an ordinal regression problem via CORAL [3].
3 Design goals
- Provide high-signal scene cues for RRI without over-reliance on heavy neck features.
- Remain robust when candidates fall outside the EVL voxel extent.
- Preserve interpretability (feature channels map to clear geometry/evidence semantics).
- Keep additions incremental and ablation-friendly.
4 Proposed feature bundles
The bundles below are additive. Start with P0/P1; add P2/P3 only if needed; P4 can be always-on, and P5 is optional when appearance priors are desired.
4.1 P0: Coverage + occupancy core (baseline++)
Inputs (from EvlBackboneOutput):
occ_pr, occ_input, free_input, counts_norm, unknown, new_surface_prior
Candidate-level scalars:
- valid_frac (fraction of in-bounds samples),
- center_in_bounds (center validity),
- signed distance to voxel bounds (x/y/z).
Motivation: Explicitly communicates what is known vs. unknown in the local grid, and when the voxel context is unreliable.
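A minimal sketch of the candidate-level scalars, assuming candidate positions and per-candidate sample points are available in the gravity-aligned voxel frame and the grid extent is given as axis-aligned corners (all names here are illustrative, not the VinModelV2 API):

import torch
from torch import Tensor

def oob_scalars(t_world: Tensor, bounds_min: Tensor, bounds_max: Tensor,
                sample_points: Tensor) -> dict[str, Tensor]:
    """Candidate-level out-of-bounds scalars for a local voxel grid.

    t_world:        (B, N, 3) candidate positions in the voxel (gravity-aligned) frame.
    bounds_min/max: (3,) voxel extent corners.
    sample_points:  (B, N, S, 3) per-candidate sample points (e.g., ray samples).
    """
    # Fraction of per-candidate samples that land inside the voxel extent.
    inside = ((sample_points >= bounds_min) & (sample_points <= bounds_max)).all(dim=-1)
    valid_frac = inside.float().mean(dim=-1)  # (B, N)

    # Whether the candidate center itself is inside the extent.
    center_in_bounds = ((t_world >= bounds_min) & (t_world <= bounds_max)).all(dim=-1).float()

    # Signed distance to the nearer bound per axis (positive = inside the grid).
    signed_dist = torch.minimum(t_world - bounds_min, bounds_max - t_world)  # (B, N, 3)

    return {"valid_frac": valid_frac,
            "center_in_bounds": center_in_bounds,
            "signed_dist_xyz": signed_dist}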
4.2 P1: Boundary + uncertainty cues (derived channels)
Derived channels:
- surface_boundary = occ_pr * free_input
- uncertainty = occ_pr * (1 - occ_pr)
Motivation: Lightweight proxies for surface complexity and model uncertainty, without adding neck features.
4.3 P2: Compressed neck features
Inputs:
occ_feat (and optionally obb_feat) with 1x1x1 Conv compression (e.g., 64 -> 8/16 channels).
Motivation: Adds richer 3D semantics while keeping compute in check. Use only if P0/P1 saturate.
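A sketch of the 1x1x1 compression, assuming occ_feat is a (B, 64, D, H, W) volume; the module name, normalization, and channel counts are illustrative:

import torch.nn as nn

# Pointwise 3D conv: mixes channels only, keeps spatial resolution and compute low.
compress_occ_feat = nn.Sequential(
    nn.Conv3d(in_channels=64, out_channels=8, kernel_size=1),
    nn.GroupNorm(num_groups=2, num_channels=8),
    nn.SiLU(),
)
# occ_feat: (B, 64, D, H, W) -> (B, 8, D, H, W), then concatenated into the scene field.
# occ_feat_small = compress_occ_feat(occ_feat)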
4.4 P3: Entity-aware cues (OBB-aware VIN)
Inputs:
obb_pred, obb_pred_probs_full (decoded OBBs + class probs).
Candidate-level scalars:
- distance to nearest OBB center,
- view alignment to OBB axes,
- fraction of the candidate view frustum intersecting top-k OBBs,
- top-k semantic class probabilities.
Motivation: Supports entity-aware NBV objectives and task-specific weighting.
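A sketch for two of the scalars (distance to the nearest OBB center and view alignment to it), assuming decoded OBB centers, candidate positions, and candidate forward directions as plain tensors; the frustum-intersection and class-probability terms follow the same per-candidate reduction pattern. Function and tensor names are illustrative:

import torch
import torch.nn.functional as F
from torch import Tensor

def obb_cues(cand_pos: Tensor, cand_fwd: Tensor, obb_centers: Tensor) -> Tensor:
    """Entity-aware candidate scalars from decoded OBB centers.

    cand_pos:    (B, N, 3) candidate positions.
    cand_fwd:    (B, N, 3) candidate viewing directions.
    obb_centers: (B, K, 3) decoded OBB centers.
    Returns (B, N, 2): [distance to nearest OBB, cosine alignment to it].
    """
    # Pairwise candidate -> OBB-center offsets: (B, N, K, 3).
    offsets = obb_centers[:, None, :, :] - cand_pos[:, :, None, :]
    dists = offsets.norm(dim=-1)                       # (B, N, K)
    min_dist, idx = dists.min(dim=-1)                  # (B, N)

    # Offset vector to the nearest OBB center per candidate.
    nearest = torch.gather(offsets, 2, idx[..., None, None].expand(-1, -1, 1, 3)).squeeze(2)
    to_obb = F.normalize(nearest, dim=-1)              # (B, N, 3)

    # Cosine alignment between viewing direction and the nearest object.
    align = (F.normalize(cand_fwd, dim=-1) * to_obb).sum(dim=-1)  # (B, N)
    return torch.stack([min_dist, align], dim=-1)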
4.5 P4: Semidense projection features (always-on, voxel-independent)
Inputs:
points / p3s_world (semi-dense SLAM points, i.e. a point cloud) plus camera intrinsics and candidate poses for projection.
Per-candidate features:
- projected coverage/empty fraction (F_empty),
- depth statistics (mean/variance/percentiles),
- visibility density per view (histogram or angular bins).
Motivation: These features are independent of the voxel extent and should help even when the EVL grid is small. They mirror VIN-NBV’s view-projection features and can be used for every candidate. [1]
4.5.1 Theory: projection-based coverage as an RRI proxy
Given a candidate camera with intrinsics/extrinsics and a point cloud of the current reconstruction, we project 3D points into the candidate image plane using a standard camera model (pinhole or fisheye). Let I be the image grid, and M(q) the set of pixels hit by at least one projected 3D point for candidate q. A simple coverage proxy is:
\[ F_{\text{empty}}(q) = 1 - \frac{|M(q)|}{|I|} \]
Low coverage (high F_empty) indicates that large parts of the candidate image would reveal new geometry, which correlates with potential reconstruction improvement. This is the core intuition behind VIN-NBV’s coverage feature and remains useful even when voxel features are unreliable.
4.5.2 Suggested injection point (VinModelV2)
Treat semidense projection features as per-candidate scalars and inject them in two places:
- FiLM-modulate the pose-conditioned global voxel context (global_feat) so the semidense coverage cues can reweight voxel features when the EVL grid is sparse or out-of-bounds.
- Concatenate the raw projection features alongside pose_enc and global_feat before the MLP head.
The current VinModelV2 implementation uses a lightweight FiLM (linear -> γ/β) plus a direct concat:
# VinModelV2._forward_impl
sem_proj = self._semidense_projection_features(...)
gamma, beta = self.sem_proj_film(sem_proj).chunk(2, dim=-1)
global_feat = global_feat * (1.0 + gamma) + beta
parts.append(sem_proj)

This keeps the scene field untouched and avoids reintroducing heavy 3D volumes.
4.5.3 Practical recipe (using EfmPointsView::collapse_points)
Use EfmPointsView.collapse_points() to collapse time and subsample the point cloud before projection:
import torch
from torch import Tensor

from efm3d.aria.camera import CameraTW
from efm3d.aria.pose import PoseTW


def project_semidense_features(
    points_view: EfmPointsView,
    candidate_poses_world_cam: PoseTW,   # world <- cam (B, N, 12)
    candidate_camera: CameraTW,          # intrinsics (per candidate or shared)
    image_size: tuple[int, int],         # (H, W)
    max_points: int = 50000,
) -> Tensor:
    # 1) Collapse points across time (subsampled)
    pts_world = points_view.collapse_points(max_points=max_points)  # (K, 3)
    if pts_world.numel() == 0:
        return torch.zeros(
            (candidate_poses_world_cam.shape[0],
             candidate_poses_world_cam.shape[1],
             3),
            device=pts_world.device,
        )
    # 2) Transform into each candidate camera frame
    pose_cam_world = candidate_poses_world_cam.inverse()  # cam <- world
    pts_cam = pose_cam_world[:, :, None] * pts_world      # (B, N, K, 3)
    # 3) Project to pixels + validity mask
    p2d, valid = candidate_camera.project(pts_cam)        # (B, N, K, 2), (B, N, K)
    # 4) Coverage / empty fraction (sketch)
    H, W = image_size
    in_bounds = (
        valid
        & (p2d[..., 0] >= 0) & (p2d[..., 0] < W)
        & (p2d[..., 1] >= 0) & (p2d[..., 1] < H)
    )
    # Convert to integer pixel coords and build a sparse occupancy mask.
    # Use scatter or bincount for efficiency.
    # Compute |M(q)| / (H*W) and derive F_empty.
    # Also collect depth statistics from pts_cam[..., 2] where in_bounds is True.
    ...

Notes:
- CameraTW.project supports both pinhole and fisheye models and returns a validity mask (in-front + in-image + in-radius). This avoids manual distortion handling. See external/efm3d/efm3d/aria/camera.py for details.
- If candidate_camera is shared across candidates, broadcast the camera to (B, N, ...) or slice per candidate frame index.
- For performance, compute coverage using torch.bincount on p2d_int = (y * W + x) with per-candidate batching (see the sketch below).
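A sketch of the bincount-based coverage computation from the last note, assuming the p2d and in_bounds tensors produced by the 4.5.3 sketch; the function name is illustrative:

import torch
from torch import Tensor

def coverage_from_projection(p2d: Tensor, in_bounds: Tensor, image_size: tuple[int, int]) -> Tensor:
    """Per-candidate empty fraction F_empty from projected points.

    p2d:       (B, N, K, 2) pixel coordinates (x, y).
    in_bounds: (B, N, K) validity mask (in-front + in-image).
    Returns (B, N) empty fraction in [0, 1].
    """
    H, W = image_size
    B, N, K, _ = p2d.shape
    x = p2d[..., 0].long().clamp(0, W - 1)
    y = p2d[..., 1].long().clamp(0, H - 1)

    # Fold the candidate index into the pixel index so one bincount covers all candidates.
    cand_idx = torch.arange(B * N, device=p2d.device).view(B, N, 1)
    flat = cand_idx * (H * W) + y * W + x                        # (B, N, K)
    flat = flat[in_bounds]                                       # keep valid hits only

    hits = torch.bincount(flat, minlength=B * N * H * W)
    covered = (hits.view(B, N, H * W) > 0).float().sum(dim=-1)   # |M(q)| per candidate
    return 1.0 - covered / float(H * W)                          # F_empty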
4.5.4 Optional: PointNeXt-S semidense encoder (global point embedding)
For a stronger geometric prior than simple projection stats, we can encode the semi-dense point cloud with PointNeXt-S (small, pretrained, robust to non-uniform sampling). We subsample to ~3k points (from the padded 50k) and encode the point cloud once per snippet, then concatenate the embedding to the VIN head input.
Implementation notes:
- The encoder is optional and only active if VinModelV2Config.point_encoder is set.
- Use EfmPointsView.collapse_points(max_points=3000) before encoding.
- Transform points into the reference rig frame for consistent pose alignment (see the sketch below).
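A minimal integration sketch; self.point_encoder stands in for a PointNeXt-S backbone (or any point encoder mapping a point set to a fixed-size embedding), and point_encoder, t_rig_world, parts, and num_candidates are illustrative names rather than existing VinModelV2 attributes:

# VinModelV2._forward_impl (sketch; names below are assumptions, not the current API)
if self.point_encoder is not None:
    # Collapse time and subsample the semi-dense cloud to ~3k points.
    pts_world = points_view.collapse_points(max_points=3000)    # (P, 3)
    # Express points in the reference rig frame (t_rig_world: rig <- world).
    pts_rig = t_rig_world * pts_world                            # (P, 3)
    # One embedding per snippet, broadcast to all N candidates before the head.
    point_emb = self.point_encoder(pts_rig[None])                # (1, D)
    parts.append(point_emb[:, None, :].expand(-1, num_candidates, -1))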
4.5.5 FiLM-style modulation of the global voxel context (recommended injection)
Rather than only late-fusing the PointNeXt embedding, use it to modulate the voxel-derived global context. This is a feature-wise linear modulation (FiLM) style injection [6] that conditions the global context on the semidense geometry without disrupting the interpretable voxel channels.
Motivation (conceptual):
- The voxel field provides local occupancy evidence; the semidense embedding summarizes global geometry. We want the global geometry to shape how the voxel context is interpreted rather than merely append another vector at the end.
- FiLM-style modulation implements content-aware gating: the point cloud decides which global-context features should be amplified, suppressed, or shifted. This is analogous to conditional normalization, but keeps the base representation intact and interpretable.
- It is parameter-efficient and stable: one linear layer produces scale/shift and can be paired with GroupNorm for controlled statistics.
Formulation (per snippet, broadcast to candidates):
Let g ∈ R^{B×N×C} be the global voxel context and p ∈ R^{B×D} the PointNeXt embedding. We compute:
\[ \gamma, \beta = W p, \quad g' = \text{GN}\big( g \odot (1 + \gamma) + \beta \big) \]
where γ, β ∈ R^{B×C} are broadcast across candidates and GN is GroupNorm.
Implementation sketch:
# VinModelV2.__init__
self.point_film = nn.Linear(point_dim, 2 * field_dim)
self.point_film_norm = nn.GroupNorm(num_groups=4, num_channels=field_dim)
# VinModelV2._forward_impl
gamma, beta = self.point_film(semidense_feat).chunk(2, dim=-1)
global_feat = global_feat * (1.0 + gamma[:, None, :]) + beta[:, None, :]
global_feat = self.point_film_norm(global_feat.transpose(1, 2)).transpose(1, 2)

Why this helps:
- When the voxel grid misses geometry (out-of-bounds candidates), the point embedding can down-weight unreliable global features.
- When semidense points indicate rich structure, the modulation amplifies the relevant global-context channels, improving RRI discrimination.
4.5.6 Optional: encoder for past ego trajectory (global trajectory embedding)
Use a tiny transformer encoder to embed the past ego-trajectory frames, expressed in the reference rig frame.
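A minimal sketch of such an encoder, assuming each frame is summarized as a 12-D flattened pose (rotation matrix + translation) in the reference rig frame; the class name and dimensions are illustrative, and the EfmTrajectoryView it would consume is defined just below:

import torch
import torch.nn as nn
from torch import Tensor

class TinyTrajectoryEncoder(nn.Module):
    """Encodes F past rig poses into a single embedding vector."""

    def __init__(self, pose_dim: int = 12, dim: int = 64, num_layers: int = 2):
        super().__init__()
        self.embed = nn.Linear(pose_dim, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, dim_feedforward=2 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, poses_rig: Tensor) -> Tensor:
        """poses_rig: (B, F, 12) past rig poses in the reference rig frame -> (B, dim)."""
        tokens = self.embed(poses_rig)                                      # (B, F, dim)
        tokens = torch.cat([self.cls.expand(tokens.shape[0], -1, -1), tokens], dim=1)
        return self.encoder(tokens)[:, 0]                                   # CLS-token summary

The resulting embedding can be fused like the PointNeXt embedding in 4.5.5 (FiLM or late concat).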
from dataclasses import dataclass

from torch import Tensor

from efm3d.aria.pose import PoseTW


@dataclass(slots=True)
class EfmTrajectoryView:
    """World-frame rig trajectory aligned to snippet frames."""

    t_world_rig: PoseTW
    """Rig SE(3) poses per frame (world←rig)."""
    time_ns: Tensor
    """``Tensor["F", int64]`` pose timestamps."""
    gravity_in_world: Tensor
    """``Tensor["3", float32]`` gravity vector in world frame (aligned to [0,0,-9.81])."""

4.6 P5: 2D token reuse (appearance priors)
Inputs:
feat2d_upsampled or token2d from EVL outputs.
Mechanism:
- Project semidense points into current RGB frames, sample 2D features, reproject to candidate view or pool directly per candidate.
Motivation: Adds appearance/texture cues without expanding the voxel grid.
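A sketch of the sampling step, assuming 2D features feat2d for the current RGB frame and projected pixel coordinates p2d with a validity mask from the projection machinery in 4.5.3; the function name and the mean-pooling choice are illustrative:

import torch
import torch.nn.functional as F
from torch import Tensor

def sample_2d_features(feat2d: Tensor, p2d: Tensor, valid: Tensor,
                       image_size: tuple[int, int]) -> Tensor:
    """Sample 2D appearance features at projected semidense points.

    feat2d: (B, C, H, W) EVL 2D features for the current RGB frame.
    p2d:    (B, K, 2) pixel coordinates (x, y) of projected points.
    valid:  (B, K) projection validity mask.
    Returns (B, C) mean-pooled appearance feature over valid points.
    """
    H, W = image_size
    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    grid = torch.stack([p2d[..., 0] / (W - 1) * 2 - 1,
                        p2d[..., 1] / (H - 1) * 2 - 1], dim=-1)                 # (B, K, 2)
    sampled = F.grid_sample(feat2d, grid[:, :, None, :], align_corners=True)    # (B, C, K, 1)
    sampled = sampled.squeeze(-1) * valid[:, None, :].float()                   # zero-out invalid hits
    denom = valid.float().sum(dim=-1, keepdim=True).clamp(min=1.0)              # (B, 1)
    return sampled.sum(dim=-1) / denom                                          # (B, C)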
5 Encoding schemas
5.1 Candidate-relative positional keys
Idea: Build pos_grid in the candidate frame so attention keys and candidate queries share the same spatial basis. This should improve alignment when the candidate is far from the reference frame.
- Implementation anchor: VinModelV2Config.tf_pos_grid_in_candidate_frame (TODO in model_v2.py). A sketch of the candidate-frame grid follows below.
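A sketch of building the positional grid in the candidate frame, assuming voxel-token centers in the world/voxel frame and candidate poses as PoseTW world<-cam; vox_centers_world, t_world_cam, and pos_mlp are illustrative names:

# Sketch: candidate-relative positional keys.
# vox_centers_world: (T, 3) voxel-token centers in the world/voxel frame.
# t_world_cam:       PoseTW, (B, N) candidate poses (world <- cam).
t_cam_world = t_world_cam.inverse()                      # cam <- world
# Express every voxel token in each candidate's camera frame: (B, N, T, 3).
pos_grid_cand = t_cam_world[:, :, None] * vox_centers_world
# Attention keys and candidate queries now share the same (candidate) spatial basis.
pos_keys = self.pos_mlp(pos_grid_cand)                   # (B, N, T, C_pos)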
5.2 Hybrid pose encoding (R6D + shell)
Idea: Combine the current R6D+translation encoding with shell features (direction u, forward f, radius r, alignment). This keeps rotation/translation expressivity while adding geometry cues independent of voxel extent.
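A sketch of the shell features, assuming candidate positions and forward directions plus a scene anchor point (e.g., the voxel-grid center); u is the unit direction from the anchor to the candidate, r its distance, f the viewing direction, and alignment the cosine between f and the direction back toward the anchor. Names are illustrative:

import torch
import torch.nn.functional as F
from torch import Tensor

def shell_features(cand_pos: Tensor, cand_fwd: Tensor, anchor: Tensor) -> Tensor:
    """Shell encoding of a candidate pose relative to a scene anchor.

    cand_pos: (B, N, 3) candidate positions.
    cand_fwd: (B, N, 3) candidate viewing directions.
    anchor:   (B, 3) anchor point (e.g., voxel-grid center).
    Returns (B, N, 8): [u (3), f (3), r (1), alignment (1)].
    """
    offset = cand_pos - anchor[:, None, :]           # (B, N, 3)
    r = offset.norm(dim=-1, keepdim=True)            # radius to anchor
    u = F.normalize(offset, dim=-1)                  # outward direction
    f = F.normalize(cand_fwd, dim=-1)                # viewing direction
    align = (f * (-u)).sum(dim=-1, keepdim=True)     # 1 = looking straight at the anchor
    return torch.cat([u, f, r, align], dim=-1)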
5.3 Distance-to-voxel features
Idea: Add explicit scalars for candidate distance to voxel center and normalized signed distance to bounds (x/y/z). This provides a direct indicator of voxel-context reliability.
5.4 Multi-scale pooling
Idea: Pool the voxel field at 2-3 grid sizes and concatenate. This helps when only a subregion of the voxel grid is informative.
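A sketch using adaptive average pooling at a few grid sizes, assuming the scene field is a (B, C, D, H, W) volume; the function name and output sizes are illustrative:

import torch
import torch.nn.functional as F
from torch import Tensor

def multiscale_pool(scene_field: Tensor, sizes: tuple[int, ...] = (1, 2, 4)) -> Tensor:
    """Pool a (B, C, D, H, W) scene field at several resolutions and concatenate.

    Returns (B, C * sum(s**3 for s in sizes)) multi-scale context.
    """
    B, C = scene_field.shape[:2]
    pooled = [F.adaptive_avg_pool3d(scene_field, output_size=s).reshape(B, -1) for s in sizes]
    return torch.cat(pooled, dim=-1)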
5.5 Pose-conditioned attention diagnostics
The global pooling uses multi-head attention. Exposing attention weights lets us assess whether candidates attend to spatially meaningful voxels and whether out-of-grid candidates produce diffuse (uninformative) attention.
5.5.1 Theory: attention concentration as a diagnostic
Let A(q) be the attention matrix for candidate q (queries = candidate pose tokens, keys = pooled voxel tokens). If attention is well-aligned, A(q) should concentrate on a few spatial tokens. If the candidate is out-of-grid or the positional encoding is misaligned, A(q) becomes uniform.
A simple diagnostic is attention entropy:
\[ H(q) = -\sum_i A_i(q) \log\big(A_i(q) + \varepsilon\big) \]
Low entropy indicates focused attention; high entropy suggests that the attention mechanism has no spatial anchor for that candidate.
5.5.2 Implementation sketch (PoseConditionedGlobalPool)
torch.nn.MultiheadAttention can return per-head attention weights by passing need_weights=True and average_attn_weights=False:
# oracle_rri/oracle_rri/vin/model_v2.py
attn_out, attn_weights = self.attn(
    queries, keys, keys,
    need_weights=True,
    average_attn_weights=False,  # keep per-head weights
)
# attn_weights: (B, num_heads, N, T)

From this, compute diagnostics per candidate:
weights = attn_weights.mean(dim=1) # (B, N, T) average heads
entropy = -(weights * (weights + 1e-9).log()).sum(dim=-1) # (B, N)
peak = weights.max(dim=-1).values  # (B, N)

Log entropy and peak alongside valid_frac to check whether attention degenerates when coverage is low.
6 Out-of-voxel mitigation recipe
A simple gating strategy can prevent over-confident predictions when voxel coverage is low:
- Compute valid_frac and center_in_bounds.
- If valid_frac < tau (e.g., 0.2), blend voxel features with P4/P5 features: feat = alpha * voxel_feat + (1 - alpha) * semidense_feat, where alpha = clamp(valid_frac / tau, 0, 1) (a sketch follows this list).
- Add a binary OOB indicator as input to the head.
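A sketch of the blend, assuming voxel_feat and semidense_feat are per-candidate feature vectors of equal width and tau is a config threshold; a projection layer would be needed if the widths differ, and the function name is illustrative:

import torch
from torch import Tensor

def blend_voxel_semidense(voxel_feat: Tensor, semidense_feat: Tensor,
                          valid_frac: Tensor, tau: float = 0.2) -> Tensor:
    """Soft gate between voxel and semidense features based on voxel coverage.

    voxel_feat, semidense_feat: (B, N, C) per-candidate features.
    valid_frac:                 (B, N) fraction of in-bounds samples.
    """
    # alpha -> 1 when voxel coverage is good, -> 0 when the candidate is mostly out of the grid.
    alpha = (valid_frac / tau).clamp(0.0, 1.0)[..., None]    # (B, N, 1)
    oob_indicator = (valid_frac < tau).float()[..., None]    # binary OOB flag for the head
    blended = alpha * voxel_feat + (1.0 - alpha) * semidense_feat
    return torch.cat([blended, oob_indicator], dim=-1)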
7 Suggested ablation order
- P0 baseline++ (coverage + occupancy + explicit OOB scalars).
- P1 derived channels.
- P2 compressed neck features (if P1 gains plateau).
- P4 semidense projection features (always-on).
- P3 entity-aware cues.
- P5 2D token reuse (heavier pipeline but potentially strong gains).
8 Implementation anchors
- Scene field construction: oracle_rri/oracle_rri/vin/model_v2.py::_build_scene_field_v2.
- Global pooling + positional keys: PoseConditionedGlobalPool in model_v2.py.
- EVL feature contract: oracle_rri/oracle_rri/vin/types.py::EvlBackboneOutput.
- VIN documentation: VIN on EVL.