Next-Best-View Planning with Foundation Models

VCML Seminar: RRI-based NBV Prediction using EFM3D Backbones

Author

Jan Duchscherer, Munich University of Applied Sciences

Published

January 1, 2026

1 Abstract

Next-Best-View (NBV) planning addresses the fundamental challenge of autonomous viewpoint selection in active 3D reconstruction: maximizing reconstruction quality while minimizing acquisition cost (e.g., number of views, traversed distance, capture time).

Classical NBV methods rely on hand-crafted criteria, limited action spaces, or per-scene optimized representations. While learning-based NBV methods like GenNBV [1] have improved generalization by leveraging reinforcement learning, they still optimize geometric coverage as a proxy for reconstruction quality. Since coverage maximization does not necessarily correlate with improved reconstruction quality, these methods can struggle in complex scenes with occlusions and fine details. Directly optimizing reconstruction quality, as pioneered by VIN-NBV [2], has shown significant improvements by predicting Relative Reconstruction Improvement (RRI) to quantify the fitness of candidate viewpoints. However, VIN-NBV's generalization remains largely limited to simple object-centric NBV scenarios, as it does not leverage pre-trained foundation models with rich 3D spatial understanding.

This project aims to develop an NBV system that integrates VIN-NBV's key insight, directly optimizing reconstruction quality rather than proxies like coverage, while leveraging a pre-trained egocentric foundation model as backbone to provide strong priors for 3D spatial reasoning in complex indoor scenes. We adapt the EVL (Egocentric Voxel Lifting) model from EFM3D [3], pre-trained on the Aria Synthetic Environments (ASE) dataset (a large-scale synthetic egocentric dataset with 100k indoor scenes), to provide rich 3D feature volumes that capture scene geometry, semantics, and free-space priors. On top of this frozen backbone, we train a lightweight RRI prediction head that introspects the current scene representation together with each candidate viewpoint to score its fitness.

2 Project Vision and Goals

  • Directly optimize reconstruction quality rather than surrogate coverage metrics, following RRI-based policies as per VIN-NBV.
  • Develop an oracle RRI computation pipeline using ASE visibility data, semi-dense point clouds and GT meshes.
  • Develop computational tools to efficiently simulate candidate viewpoints and their expected observations utilizing implementations in EFM3D and ATEK.
  • Train an RRI predictor head on top of a frozen EFM backbone that introspects the current reconstruction and a candidate pose via imitation learning on oracle RRIs.
  • Leverage EVL’s 3D foundation features—voxel occupancy, centerness, semantic channels, and OBB priors—as state embeddings for RRI estimation and NBV decision making.
  • Track reconstruction progress per entity through EVL’s OBB detection capabilities.
  • Extend towards human-in-the-loop AR guidance, where entity-aware RRI weighting delivers task-specific view suggestions.

3 Documentation Navigation

3.1 Setup & Installation

3.2 Project State

3.3 Theory & Background

3.4 Dataset & Resources

3.5 Literature Reviews

  • Literature Review: Entry point and local LaTeX corpus
  • VIN-NBV: Direct quality optimization with RRI
  • GenNBV: Continuous action spaces and RL approaches
  • EFM3D & EVL: Egocentric foundation models and voxel lifting
  • SceneScript: Structured scene language and entity representation

3.6 Implementation Guides

3.7 External Implementations

3.8 Project Slides

4 VIN v2 Feature + Encoding Proposals

This section embeds the implementation checklist from VIN v2 Feature + Encoding Proposals for quick reference.

4.1 Scope

This document summarizes concrete feature sets and encoding schemas to explore for VinModelV2, with a focus on mitigating EVL’s local voxel grid limits and improving RRI prediction quality. It is intended as an implementation-oriented checklist for ablations and incremental upgrades.

4.2 Context and constraints

  • EVL exposes a local, fixed-size voxel grid (default ~4x4x4 m) aligned to the last pose and anchored in a gravity-aligned frame. This local extent is by design, but it means many candidate viewpoints can be partially or fully out-of-bounds. [3]
  • VinModelV2 currently uses pose R6D + translation, scene field channels from EVL heads/evidence, and pose-conditioned global pooling. See oracle_rri/oracle_rri/vin/model_v2.py.
  • VIN predicts RRI as an ordinal regression problem via CORAL [4] (see the sketch after this list).
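
A CORAL-style ordinal head reduces to a shared linear layer with K-1 ordered biases, decoded by counting exceeded thresholds. The sketch below is illustrative, based on [4]; the class name and dimensions are assumptions, not the project's actual head.

import torch
from torch import nn

class CoralHead(nn.Module):
    """Minimal CORAL ordinal head: one shared weight vector, K-1 rank-specific biases."""

    def __init__(self, in_dim: int, num_bins: int):
        super().__init__()
        self.fc = nn.Linear(in_dim, 1, bias=False)              # shared weights across ranks
        self.biases = nn.Parameter(torch.zeros(num_bins - 1))   # one bias per rank threshold

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # Logits for P(rank > k); monotone in k by construction (rank consistency).
        return self.fc(feat) + self.biases

def coral_rank(logits: torch.Tensor) -> torch.Tensor:
    """Decode the predicted ordinal bin as the number of thresholds exceeded."""
    return (torch.sigmoid(logits) > 0.5).sum(dim=-1)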

4.3 Design goals

  • Provide high-signal scene cues for RRI without over-reliance on heavy neck features.
  • Remain robust when candidates fall outside the EVL voxel extent.
  • Preserve interpretability (feature channels map to clear geometry/evidence semantics).
  • Keep additions incremental and ablation-friendly.

4.4 Proposed feature bundles

The bundles below are additive. Start with P0/P1; add P2/P3 only if needed; P4 can be always-on, and P5 is optional when appearance priors are desired.

4.4.1 P0: Coverage + occupancy core (baseline++)

Inputs (from EvlBackboneOutput):

  • occ_pr, occ_input, free_input, counts_norm, unknown, new_surface_prior

Candidate-level scalars:

  • valid_frac (fraction of in-bounds samples),
  • center_in_bounds (center validity),
  • signed distance to voxel bounds (x/y/z).

Motivation: Explicitly communicates what is known vs. unknown in the local grid, and when the voxel context is unreliable.
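
A minimal sketch of these candidate-level scalars, assuming axis-aligned grid bounds grid_min/grid_max in the same gravity-aligned frame as the candidate sample points; the function and argument names are illustrative:

import torch

def oob_scalars(samples: torch.Tensor, center: torch.Tensor,
                grid_min: torch.Tensor, grid_max: torch.Tensor) -> torch.Tensor:
    """samples: (N, S, 3) per-candidate sample points; center: (N, 3) candidate centers;
    grid_min/grid_max: (3,) voxel bounds. Returns (N, 5) = [valid_frac, center_in_bounds, sdist_xyz]."""
    inside = ((samples >= grid_min) & (samples <= grid_max)).all(dim=-1)    # (N, S)
    valid_frac = inside.float().mean(dim=-1)                                # fraction of in-bounds samples
    center_in = ((center >= grid_min) & (center <= grid_max)).all(dim=-1).float()
    # Per-axis signed distance of the center to the nearer bound (positive inside,
    # negative outside), normalized by half the grid extent.
    half_extent = 0.5 * (grid_max - grid_min)
    sdist = torch.minimum(center - grid_min, grid_max - center) / half_extent  # (N, 3)
    return torch.cat([valid_frac[:, None], center_in[:, None], sdist], dim=-1)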

4.4.2 P1: Boundary + uncertainty cues (derived channels)

Derived channels:

  • surface_boundary = occ_pr * free_input
  • uncertainty = occ_pr * (1 - occ_pr)

Motivation: Lightweight proxies for surface complexity and model uncertainty, without adding neck features.
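
Both channels are elementwise products of existing EVL outputs; a two-line sketch (tensor names follow the bundle above):

# occ_pr, free_input: (B, 1, D, H, W) occupancy probability / free-space evidence volumes
surface_boundary = occ_pr * free_input     # high where predicted surface borders observed free space
uncertainty = occ_pr * (1.0 - occ_pr)      # Bernoulli variance of the occupancy probability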

4.4.3 P2: Compressed neck features

Inputs:

  • occ_feat (and optionally obb_feat) with 1x1x1 Conv compression (e.g., 64 -> 8/16 channels).

Motivation: Adds richer 3D semantics while keeping compute in check. Use only if P0/P1 saturate.
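
The compression is a single 1x1x1 3D convolution; a sketch assuming occ_feat has 64 channels (the target width is an ablation choice):

from torch import nn

# 1x1x1 convolution mixes channels per voxel without changing spatial resolution.
compress_occ = nn.Conv3d(in_channels=64, out_channels=8, kernel_size=1)

# occ_feat: (B, 64, D, H, W) -> (B, 8, D, H, W)
occ_feat_small = compress_occ(occ_feat)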

4.4.4 P3: Entity-aware cues (OBB-aware VIN)

Inputs:

  • obb_pred, obb_pred_probs_full (decoded OBBs + class probs).

Candidate-level scalars:

  • distance to nearest OBB center,
  • view alignment to OBB axes,
  • fraction of the candidate view frustum intersecting the top-k OBBs,
  • top-k semantic class probabilities.

Motivation: Supports entity-aware NBV objectives and task-specific weighting.
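
A sketch for the first two scalars, assuming decoded OBB centers in the world frame and candidate camera positions/forward axes. As a simplification, view alignment is computed against the direction to each OBB center rather than the OBB axes, and all names are illustrative:

import torch

def obb_view_scalars(cam_pos: torch.Tensor, cam_fwd: torch.Tensor,
                     obb_centers: torch.Tensor) -> torch.Tensor:
    """cam_pos, cam_fwd: (N, 3); obb_centers: (M, 3). Returns (N, 2) = [min_dist, best_alignment]."""
    diff = obb_centers[None, :, :] - cam_pos[:, None, :]     # (N, M, 3) camera -> OBB center
    dist = diff.norm(dim=-1)                                 # (N, M)
    min_dist = dist.min(dim=-1).values                       # distance to nearest OBB center
    dirs = diff / dist.clamp_min(1e-6)[..., None]            # unit directions to each center
    align = (dirs * cam_fwd[:, None, :]).sum(dim=-1)         # cosine between view dir and center dir
    best_align = align.max(dim=-1).values
    return torch.stack([min_dist, best_align], dim=-1)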

4.4.5 P4: Semidense projection features (always-on, voxel-independent)

Inputs:

  • points/p3s_world (the semi-dense SLAM point cloud) plus camera intrinsics and candidate poses for projection.

Per-candidate features:

  • projected coverage/empty fraction (F_empty),
  • depth statistics (mean/variance/percentiles),
  • visibility density per view (histogram or angular bins).

Motivation: These features are independent of the voxel extent and should help even when the EVL grid is small. They mirror VIN-NBV’s view-projection features and can be computed for every candidate. [2]

4.4.5.1 Theory: projection-based coverage as an RRI proxy

Given a candidate camera with intrinsics/extrinsics and a point cloud of the current reconstruction, we project 3D points into the candidate image plane using a standard camera model (pinhole or fisheye). Let I be the image grid, and M(q) the set of pixels hit by at least one projected 3D point for candidate q. A simple coverage proxy is:

\[ F_{\text{empty}}(q) = 1 - \frac{|M(q)|}{|I|} \]

Low coverage (high F_empty) indicates that large parts of the candidate image would reveal new geometry, which correlates with potential reconstruction improvement. This is the core intuition behind VIN-NBV’s coverage feature and remains useful even when voxel features are unreliable.

4.4.5.2 Suggested injection point (VinModelV2)

Treat semidense projection features as per-candidate scalars and concatenate them alongside pose_enc and global_feat before the MLP head:

# VinModelV2.__init__
self.semiden_mlp = nn.Sequential(
    nn.Linear(sd_in_dim, sd_hidden_dim),
    nn.GELU(),
    nn.Linear(sd_hidden_dim, sd_out_dim),
)

# VinModelV2._forward_impl
semiden_feat = self._semidense_features(
    semidense_points_world=semidense_points_world,
    candidate_poses_world_cam=pose_world_cam,
    candidate_cameras=candidate_cameras,
)
semiden_feat = self.semiden_mlp(semiden_feat)
parts.append(semiden_feat)

This keeps the scene field untouched and avoids reintroducing heavy 3D volumes.

4.4.5.3 Practical recipe (using EfmPointsView::collapse_points)

Use EfmPointsView.collapse_points() to collapse time and subsample the point cloud before projection:

import torch
from torch import Tensor

from efm3d.aria.camera import CameraTW
from efm3d.aria.pose import PoseTW
# EfmPointsView is the project's typed semi-dense points view
# (see oracle_rri/oracle_rri/data/efm_views.py).

def project_semidense_features(
    points_view: EfmPointsView,
    candidate_poses_world_cam: PoseTW,   # world <- cam (B, N, 12)
    candidate_camera: CameraTW,          # intrinsics (per candidate or shared)
    image_size: tuple[int, int],         # (H, W)
    max_points: int = 50000,
) -> Tensor:
    # 1) Collapse points across time (subsampled)
    pts_world = points_view.collapse_points(max_points=max_points)  # (K, 3)
    if pts_world.numel() == 0:
        return torch.zeros((candidate_poses_world_cam.shape[0],
                            candidate_poses_world_cam.shape[1],
                            3), device=pts_world.device)

    # 2) Transform into each candidate camera frame
    pose_cam_world = candidate_poses_world_cam.inverse()  # cam <- world
    pts_cam = pose_cam_world[:, :, None] * pts_world      # (B, N, K, 3)

    # 3) Project to pixels + validity mask
    p2d, valid = candidate_camera.project(pts_cam)        # (B, N, K, 2), (B, N, K)

    # 4) Coverage / empty fraction
    H, W = image_size
    in_bounds = valid & (p2d[..., 0] >= 0) & (p2d[..., 0] < W) & (p2d[..., 1] >= 0) & (p2d[..., 1] < H)
    # Convert to integer pixel coords and mark hit pixels with a per-candidate bincount.
    x = p2d[..., 0].long().clamp(0, W - 1)
    y = p2d[..., 1].long().clamp(0, H - 1)
    B, N, K = in_bounds.shape
    pix = y * W + x                                                   # (B, N, K) flat pixel ids
    offset = torch.arange(B * N, device=pix.device).view(B, N, 1) * (H * W)
    hits = (pix + offset)[in_bounds]                                  # keep in-bounds hits only
    covered = torch.bincount(hits, minlength=B * N * H * W).view(B, N, H * W) > 0
    f_empty = 1.0 - covered.float().sum(dim=-1) / float(H * W)        # |M(q)| / |I| -> F_empty

    # 5) Depth statistics from pts_cam[..., 2] where in_bounds is True.
    z = pts_cam[..., 2].masked_fill(~in_bounds, float("nan"))
    depth_mean = torch.nanmean(z, dim=-1)
    depth_var = torch.nanmean(z * z, dim=-1) - depth_mean**2
    depth_mean = depth_mean.nan_to_num(0.0)
    depth_var = depth_var.clamp_min(0.0).nan_to_num(0.0)

    return torch.stack([f_empty, depth_mean, depth_var], dim=-1)      # (B, N, 3)

Notes:

  • CameraTW.project supports both pinhole and fisheye models and returns a validity mask (in-front + in-image + in-radius). This avoids manual distortion handling. See external/efm3d/efm3d/aria/camera.py for details.
  • If candidate_camera is shared across candidates, broadcast the camera to (B, N, ...) or slice per candidate frame index.
  • For performance, compute coverage using torch.bincount on p2d_int = (y * W + x) with per-candidate batching.

4.4.5.4 Optional: PointNeXt-S semidense encoder (global point embedding)

For a stronger geometric prior than simple projection stats, we can encode the semi-dense point cloud with PointNeXt-S (small, pretrained, robust to non-uniform sampling). We subsample to ~3k points (from the padded 50k) and encode the point cloud once per snippet, then concatenate the embedding to the VIN head input.

Key references:

  • PointNeXt paper: [5]
  • OpenPoints model zoo (pretrained PointNeXt-S configs/weights): [6]

Implementation notes:

  • The encoder is optional and only active if VinModelV2Config.point_encoder is set.
  • Use EfmPointsView.collapse_points(max_points=3000) before encoding.
  • Transform points into the reference rig frame for consistent pose alignment.

4.4.5.5 Optional: encoder for past ego trajectory (global point embedding)

Use a tiny transformer to encode the past ego-trajectory frames, expressed in the reference rig frame.

from dataclasses import dataclass

from torch import Tensor

from efm3d.aria.pose import PoseTW


@dataclass(slots=True)
class EfmTrajectoryView:
    """World-frame rig trajectory aligned to snippet frames."""

    t_world_rig: PoseTW
    """Rig SE(3) poses per frame (world←rig)."""
    time_ns: Tensor
    """``Tensor["F", int64]`` pose timestamps."""
    gravity_in_world: Tensor
    """``Tensor["3", float32]`` gravity vector in world frame (aligned to [0,0,-9.81])."""

4.4.6 P5: 2D token reuse (appearance priors)

Inputs:

  • feat2d_upsampled or token2d from EVL outputs.

Mechanism:

  • Project semidense points into current RGB frames, sample 2D features, reproject to candidate view or pool directly per candidate.

Motivation: Adds appearance/texture cues without expanding the voxel grid.
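
A sampling sketch using torch.nn.functional.grid_sample, assuming feat2d is upsampled to the image resolution and p2d holds projected pixel coordinates of the semidense points with a validity mask; tensor names are illustrative:

import torch
import torch.nn.functional as F

# feat2d: (B, C, H, W) upsampled 2D features; p2d: (B, K, 2) projected pixel coords;
# valid: (B, K) projection validity mask; H, W: image height and width.
grid = torch.empty_like(p2d)
grid[..., 0] = 2.0 * p2d[..., 0] / (W - 1) - 1.0                  # x -> [-1, 1]
grid[..., 1] = 2.0 * p2d[..., 1] / (H - 1) - 1.0                  # y -> [-1, 1]
sampled = F.grid_sample(feat2d, grid[:, :, None, :], align_corners=True)  # (B, C, K, 1)
sampled = sampled[..., 0].permute(0, 2, 1)                         # (B, K, C) per-point features
valid_f = valid.float()
# Mean-pool appearance features over the points visible in the candidate view.
pooled = (sampled * valid_f[..., None]).sum(dim=1) / valid_f.sum(dim=1, keepdim=True).clamp_min(1.0)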

4.5 Encoding schemas

4.5.1 Candidate-relative positional keys

Idea: Build pos_grid in the candidate frame so attention keys and candidate queries share the same spatial basis. This should improve alignment when the candidate is far from the reference frame.

  • Implementation anchor: VinModelV2Config.tf_pos_grid_in_candidate_frame (TODO in model_v2.py).
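
A construction sketch under the PoseTW conventions used above: transform the voxel-token centers into the candidate camera frame and build the positional keys from those coordinates.

# voxel_centers_world: (T, 3) centers of the pooled voxel tokens in the world frame;
# pose_world_cam: PoseTW of one candidate (world <- cam).
pose_cam_world = pose_world_cam.inverse()               # cam <- world
pos_grid_cand = pose_cam_world * voxel_centers_world    # (T, 3) voxel centers in the candidate frame
# Feed pos_grid_cand into the key positional encoding so keys and the candidate query
# share the same spatial basis.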

4.5.2 Hybrid pose encoding (R6D + shell)

Idea: Combine the current R6D+translation encoding with shell features (direction u, forward f, radius r, alignment). This keeps rotation/translation expressivity while adding geometry cues independent of voxel extent.
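
A sketch of the shell features, assuming a scene anchor point (e.g., the voxel-grid center) plus candidate camera positions and forward axes; names are illustrative:

import torch

def shell_features(cam_pos: torch.Tensor, cam_fwd: torch.Tensor, anchor: torch.Tensor) -> torch.Tensor:
    """cam_pos, cam_fwd: (N, 3); anchor: (3,). Returns (N, 8) = [u (3), f (3), r, alignment]."""
    offset = cam_pos - anchor                                   # anchor -> camera
    r = offset.norm(dim=-1, keepdim=True)                       # radius
    u = offset / r.clamp_min(1e-6)                              # unit direction from anchor to camera
    f = cam_fwd / cam_fwd.norm(dim=-1, keepdim=True).clamp_min(1e-6)
    alignment = (-u * f).sum(dim=-1, keepdim=True)              # cosine: does the view point back at the anchor?
    return torch.cat([u, f, r, alignment], dim=-1)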

4.5.3 Distance-to-voxel features

Idea: Add explicit scalars for candidate distance to voxel center and normalized signed distance to bounds (x/y/z). This provides a direct indicator of voxel-context reliability.

4.5.4 Multi-scale pooling

Idea: Pool the voxel field at 2-3 grid sizes and concatenate. This helps when only a subregion of the voxel grid is informative.
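
A sketch using adaptive average pooling at three grid sizes (the sizes are an ablation choice):

import torch
import torch.nn.functional as F

# scene_field: (B, C, D, H, W) voxel feature volume.
pooled = [
    F.adaptive_avg_pool3d(scene_field, output_size=s).flatten(start_dim=2)   # (B, C, s^3)
    for s in (1, 2, 4)
]
multi_scale = torch.cat(pooled, dim=-1)   # (B, C, 1 + 8 + 64) concatenated scales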

4.5.5 Pose-conditioned attention diagnostics

The global pooling uses multi-head attention. Exposing attention weights lets us assess whether candidates attend to spatially meaningful voxels and whether out-of-grid candidates produce diffuse (uninformative) attention.

4.5.5.1 Theory: attention concentration as a diagnostic

Let A(q) be the attention matrix for candidate q (queries = candidate pose tokens, keys = pooled voxel tokens). If attention is well-aligned, A(q) should concentrate on a few spatial tokens. If the candidate is out-of-grid or the positional encoding is misaligned, A(q) becomes uniform.

A simple diagnostic is attention entropy:

\[ H(q) = -\sum_i A_i(q) \log\bigl(A_i(q) + \varepsilon\bigr) \]

Low entropy indicates focused attention; high entropy suggests that the attention mechanism has no spatial anchor for that candidate.

4.5.5.2 Implementation sketch (PoseConditionedGlobalPool)

torch.nn.MultiheadAttention can return per-head attention weights by passing need_weights=True and average_attn_weights=False:

# oracle_rri/oracle_rri/vin/model_v2.py
attn_out, attn_weights = self.attn(
    queries, keys, keys,
    need_weights=True,
    average_attn_weights=False,  # keep per-head weights
)
# attn_weights: (B, num_heads, N, T)

From this, compute diagnostics per candidate:

weights = attn_weights.mean(dim=1)  # (B, N, T) average heads
entropy = -(weights * (weights + 1e-9).log()).sum(dim=-1)  # (B, N)
peak = weights.max(dim=-1).values  # (B, N)

Log entropy and peak alongside valid_frac to check whether attention degenerates when coverage is low.

4.6 Out-of-voxel mitigation recipe

A simple gating strategy can prevent over-confident predictions when voxel coverage is low:

  1. Compute valid_frac and center_in_bounds.
  2. If valid_frac < tau (e.g., 0.2), blend voxel features with P4/P5 features:
    • feat = alpha * voxel_feat + (1 - alpha) * semidense_feat where alpha = clamp(valid_frac / tau, 0, 1).
  3. Add a binary OOB indicator as input to the head.
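
A sketch of the gating step, following the blend formula in step 2; it assumes the voxel and semidense feature vectors have already been projected to the same width, and tau is a hyperparameter:

import torch

def gate_features(voxel_feat: torch.Tensor, semidense_feat: torch.Tensor,
                  valid_frac: torch.Tensor, tau: float = 0.2) -> torch.Tensor:
    """voxel_feat, semidense_feat: (N, C); valid_frac: (N,). Returns (N, C + 1)."""
    alpha = (valid_frac / tau).clamp(0.0, 1.0)[:, None]           # confidence in the voxel context
    feat = alpha * voxel_feat + (1.0 - alpha) * semidense_feat    # blend toward voxel-independent cues
    oob = (valid_frac < tau).float()[:, None]                     # binary OOB indicator
    return torch.cat([feat, oob], dim=-1)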

4.7 Suggested ablation order

  1. P0 baseline++ (coverage + occupancy + explicit OOB scalars).
  2. P1 derived channels.
  3. P2 compressed neck features (if P1 gains plateau).
  4. P4 semidense projection features (always-on).
  5. P3 entity-aware cues.
  6. P5 2D token reuse (heavier pipeline but potentially strong gains).

4.8 Implementation anchors

  • Scene field construction: oracle_rri/oracle_rri/vin/model_v2.py::_build_scene_field_v2.
  • Global pooling + positional keys: PoseConditionedGlobalPool in model_v2.py.
  • EVL feature contract: oracle_rri/oracle_rri/vin/types.py::EvlBackboneOutput.
  • VIN documentation: VIN on EVL.

5 External Code & Notebooks

external/
├── ATEK/                    # Aria toolkit for data processing
├── efm3d/                   # EFM3D model implementation
├── projectaria_tools/       # ASE dataset access utilities
└── scenescript/             # Scene specification language tools

notebooks/
├── ase_atek_data_exploration.ipynb    # ASE dataset exploration
├── download_ase_meshes.ipynb          # GT mesh acquisition
├── ase_atek_exploration.ipynb         # ATEK integration testing
├── ase_exploration.ipynb              # Basic ASE data analysis
└── inference.ipynb                    # Model inference experiments

6 Codebase Trees with Module Notes

oracle_rri/                         # Project root (this repo)
├── main.py                        # Builds default configs and diagnostics
├── examples/simplified_dataset_usage.py
├── scripts/
│   ├── get_context.py             # AST-based symbol scanner (feeds docs)
│   └── run_efm3d_on_ase.py        # Helper to invoke EVL on ASE shards
├── oracle_rri/                    # Package code
│   ├── analysis/depth_debugger.py # Depth→mesh distance + stats
│   ├── configs/path_config.py     # Filesystem paths + helpers
│   ├── data/                      # Typed views and dataset helpers
│   │   ├── efm_dataset.py         # WebDataset→typed EFM snippets
│   │   ├── efm_views.py           # Camera/trajectory/points/OBB typed wrappers
│   │   ├── downloader.py          # ASE mesh + ATEK tar downloader
│   │   ├── metadata.py            # Scene metadata parsing/filtering
│   │   ├── plotting.py            # Plotly helpers for snippets/meshes
│   │   └── utils.py               # Scene id parsing, validation
│   ├── data_handling/             # New entrypoints and CLI over data
│   │   ├── dataset.py             # Higher-level dataset wrapper + mesh pairing
│   │   ├── downloader.py          # CLI-friendly download orchestrator
│   │   ├── metadata.py            # Metadata cache/load utilities
│   │   └── cli.py                 # Minimal CLI shims
│   ├── pose_generation/           # Candidate pose sampling rules
│   │   ├── candidate_generation.py
│   │   ├── candidate_generation_rules.py
│   │   ├── plotting.py
│   │   └── types.py
│   ├── rendering/                 # Candidate depth rendering backends
│   │   ├── candidate_depth_renderer.py
│   │   ├── efm3d_depth_renderer.py
│   │   ├── pytorch3d_depth_renderer.py
│   │   └── plotting.py
│   ├── views/candidate_rendering.py # Point-cloud rendering from candidate poses
│   ├── visualization/candidate_app.py # Streamlit UI for sampling/RRI inspection
│   ├── viz/mesh_viz.py            # Trimesh/Plotly/Streamlit mesh + PC viz
│   ├── utils/                     # Shared utilities
│   │   ├── base_config.py         # Pydantic factory base + TOML IO
│   │   ├── console.py             # Structured logging
│   │   ├── frames.py              # Coordinate-frame helpers
│   │   ├── rich_summary.py        # Rich-style summaries
│   │   └── summary.py             # Text summaries
│   └── streamlit_app.py           # Legacy Streamlit dashboard entrypoint
├── tests/                         # Pytest suites
│   ├── data_handling/             # Dataset/downloader/metadata tests
│   ├── pose_generation/           # Sampling rule tests
│   ├── rendering/                 # Depth rendering tests
│   ├── views/                     # Candidate rendering tests
│   ├── test_candidate_rendering.py
│   ├── test_console.py
│   ├── test_efm_dataset.py
│   ├── test_mesh_cropping.py
│   ├── test_plotting_semidense.py
│   └── test_pose_generation.py
├── SIMPLIFIED_DATA_HANDLING.md    # How-to for minimal data usage
├── download_config.example.toml   # Downloader config template
├── environment.yml                # Conda environment
├── pyproject.toml / uv.lock       # Python packaging
└── notebooks/ase_oracle_rri*.py   # Helper notebooks/scripts
external/efm3d/
├── data/
│   ├── dataverse_url_parser.py      # Split/download helper for WDS tars + manifests
│   └── download_ase_mesh.py         # Fetch ASE meshes (CLI stub)
├── efm3d/
│   ├── aria/
│   │   ├── aria_constants.py        # Key names for all snippet tensors
│   │   ├── camera.py                # CameraTW calibration tensors + projections
│   │   ├── obb.py                   # ObbTW bounding-box tensor wrapper + ops
│   │   ├── pose.py                  # PoseTW SE(3) math & interpolations
│   │   ├── projection_utils.py      # Fisheye/pinhole project & unproject
│   │   └── tensor_wrapper.py        # Smart tensor wrapper base + collate helpers
│   ├── dataset/
│   │   ├── atek_vrs_dataset.py      # Stream VRS into EVL-form snippets
│   │   ├── atek_wds_dataset.py      # WDS stream → fixed-length snippets
│   │   ├── augmentation.py          # Photometric/point jitter/drop augmentations
│   │   ├── efm_model_adaptor.py     # Map ATEK fields into EVL schema + poses
│   │   ├── vrs_dataset.py           # VRS reader dataset with configs
│   │   └── wds_dataset.py           # WebDataset loader producing EVL tensors
│   ├── inference/
│   │   ├── eval.py                  # Accuracy/completeness AUC utilities
│   │   ├── fuse.py                  # Depth/feature fusion into points/voxels
│   │   ├── model.py                 # Load frozen EVL checkpoints for inference
│   │   ├── pipeline.py              # End-to-end EVL inference driver
│   │   ├── track.py                 # Trajectory utilities for snippets
│   │   └── viz.py                   # Simple matplotlib plotting helpers
│   ├── model/
│   │   ├── cnn.py                   # Build EVL 2D CNN backbone
│   │   ├── dinov2_utils.py          # DinoV2 backbone + schedulers/weight decay
│   │   ├── dpt.py                   # DPT depth head backed by DinoV2
│   │   ├── evl.py                   # EVL model wrapper
│   │   ├── evl_train.py             # Training loop orchestration
│   │   ├── image_tokenizer.py       # Image → DinoV2 tokens
│   │   ├── lifter.py                # Lift 2D tokens into 3D volumes
│   │   └── video_backbone.py        # Video backbones (DinoV2 variants)
│   ├── thirdparty/mmdetection3d/
│   │   └── iou3d.py                 # Rotated 3D IoU + CUDA NMS bindings
│   └── utils/
│       ├── common.py                # Nearest-neighbour helpers
│       ├── depth.py                 # Depth → point cloud conversion
│       ├── detection_utils.py       # Heatmap ⇄ OBB conversions + NMS
│       ├── evl_loss.py              # OBB/occupancy loss assembly
│       ├── file_utils.py            # Calibration/mesh/trajectory/file loaders
│       ├── gravity.py               # Align gravity / fix orientation
│       ├── image.py                 # Image color/resize/overlay utilities
│       ├── image_sampling.py        # Patch extraction + feature sampling
│       ├── marching_cubes.py        # Occupancy → mesh via marching cubes
│       ├── mesh_utils.py            # Mesh IO + sampling + writing
│       ├── obb_csv_writer.py        # Write OBBs to CSV/TSV
│       ├── obb_io.py                # Load OBBs and convert to wrappers
│       ├── obb_matchers.py          # IOU-based OBB matching
│       ├── obb_metrics.py           # OBB IoU metrics and summaries
│       ├── obb_trackers.py          # Track OBBs over time
│       ├── obb_utils.py             # OBB validity, transforms, pruning
│       ├── pointcloud.py            # PC downsample/visualise/save helpers
│       ├── ray.py                   # Ray grids + intersections + sampling
│       ├── reconstruction.py        # GT occupancy building + TV/occ loss
│       ├── render.py                # OpenGL-based viz primitives
│       ├── rescale.py               # Resize/rescale cameras, depth, OBBs
│       ├── viz.py                   # EGL viewer + rendering utilities
│       ├── voxel.py                 # Voxel grid creation/erosion helpers
│       └── voxel_sampling.py        # Sample voxels / coordinate transforms
├── eval.py                          # Top-level evaluation script stub
├── infer.py                         # Inference entrypoint
└── train.py                         # Training entrypoint + LR scheduler helper

References

[1]
X. Chen, Q. Li, T. Wang, T. Xue, and J. Pang, “GenNBV: Generalizable next-best-view policy for active 3D reconstruction.” 2024. Available: https://arxiv.org/abs/2402.16174
[2]
N. Frahm et al., “VIN-NBV: A view introspection network for next-best-view selection.” 2025. Available: https://arxiv.org/abs/2505.06219
[3]
J. Straub, D. DeTone, T. Shen, N. Yang, C. Sweeney, and R. Newcombe, “EFM3D: A benchmark for measuring progress towards 3D egocentric foundation models.” 2024. Available: https://arxiv.org/abs/2406.10224
[4]
W. Cao, V. Mirjalili, and S. Raschka, “Rank consistent ordinal regression for neural networks with application to age estimation.” 2019. Available: https://arxiv.org/abs/1901.07884
[5]
G. Qian et al., “PointNeXt: Revisiting PointNet++ with improved training and scaling strategies.” 2022. Available: https://arxiv.org/abs/2206.04670
[6]
OpenPoints contributors, “OpenPoints: PointNeXt model zoo.” [Online]. Available: https://guochengqian.github.io/PointNeXt/modelzoo/