1 EFM3D & EVL

1.1 Overview

  • Proposed Method: EVL was designed as a baseline 3D egocentric foundation model that leverages strong priors from various egocentric modalities and inherits foundational capabilities from any (frozen) 2D backbone.
    • 86.6 M non-trainable parameters (DinoV2.5)
    • 16.7 M trainable parameters (3D U-Net + heads)
    • predicts occupancy grids and gravity-aligned OBBs in one forward pass.
  • Training Objectives: 3D surface reconstruction and 3D OBB detection. OBBs have 7 DoF (center, dimensions, yaw) and are predicted via a centerness map plus per-voxel offset/size/yaw regression.
  • Dataset:
    • Trained (and evaluated) on ASE dataset
      • 3D OBBs and GT meshes were released for the ASE dataset as part of [1].
      • 3M synthetic OBB instances across 100K scenes and 43 classes.
    • Evaluated on real-world ADT (surface estimation) and AEO (OBB detection) with GT meshes and OBBs respectively.

1.2 Input & Output Formulation

1.2.1 Inputs

EVL consumes \(T\) posed frames (\(T=10\), randomly sampled 1 s snippet at 10 fps from ASE) from each stream \(s \in S\) (RGB, SLAM-L, SLAM-R), together with camera intrinsics/extrinsics and semi-dense SLAM points carrying visibility metadata [1]:

\[ \mathcal{X}_{in} = \Big\{(\mathbf{I}_t^s, \mathbf{K}^s, \mathbf{T}_t^s)\Big\}_{t=1,s \in S}^{T} \cup \mathcal{P}_t^{\text{semi}}. \]

Semi-dense points & visibility: Each snippet provides a time-aligned semi-dense point cloud \(\mathcal{P}_t^{\text{semi}}\) with per-observation visibility metadata (camera index and timestamp). The lifter converts this to two binary masks:

\[ \mathcal{M}_{\text{surf}}(\mathbf{v}) = \mathbb{1}\{\exists\, \mathbf{p}\in\mathcal{P}_t^{\text{semi}} : \mathbf{p} \text{ falls inside voxel } \mathbf{v}\}, \]

\[ \mathcal{M}_{\text{free}}(\mathbf{v}) = \mathbb{1}\{\exists\, (s,t,\mathbf{p}) : \mathbf{v} \text{ lies on the ray segment from camera } c_t^s \text{ to } \mathbf{p} \}. \]

To compute \(\mathcal{M}_{\text{free}}\), EVL back-projects each observed point for every camera that saw it, samples a fixed number of evenly spaced depths along the camera ray, and marks every voxel up to (but not including) the surface hit as free space; the surface voxel itself belongs to \(\mathcal{M}_{\text{surf}}\) (voxel/occ_input, efm3d/model/lifter.py:180). The same projection pass accumulates a visibility count per voxel (voxel/counts), indicating how many valid samples support each location; these counts are normalized or thresholded to serve as confidence weights during learning.
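The sketch below illustrates how the two masks can be rasterized from semi-dense points under simplified assumptions: visibility metadata is already flattened into per-observation (point, camera-center) pairs, the world-to-voxel transform is a plain 4x4 matrix, and `num_ray_samples` is an illustrative parameter. It is not the repository's lifter code (efm3d/model/lifter.py), just a minimal restatement of the logic.

```python
import torch

def point_and_freespace_masks(points_world, cam_centers_world, T_voxel_world,
                              voxel_extent, grid_dhw, num_ray_samples=32):
    """Minimal sketch of the two lifter masks (illustrative, not the efm3d code).

    points_world:      (N, 3) semi-dense points in the world frame.
    cam_centers_world: (N, 3) center of the camera that observed each point.
    T_voxel_world:     (4, 4) world -> voxel-frame transform.
    voxel_extent:      (x_min, x_max, y_min, y_max, z_min, z_max) in meters.
    grid_dhw:          (D, H, W) voxel resolution.
    """
    D, H, W = grid_dhw
    x0, x1, y0, y1, z0, z1 = voxel_extent
    lo = torch.tensor([x0, y0, z0])
    size = torch.tensor([x1 - x0, y1 - y0, z1 - z0])
    dims = torch.tensor([W, H, D])

    def to_voxel_idx(p_world):
        # world -> voxel frame -> integer voxel indices (ix, iy, iz)
        p_h = torch.cat([p_world, torch.ones_like(p_world[:, :1])], dim=-1)
        p_vox = (T_voxel_world @ p_h.T).T[:, :3]
        return ((p_vox - lo) / size * dims).floor().long()

    surf = torch.zeros(D, H, W, dtype=torch.bool)
    free = torch.zeros(D, H, W, dtype=torch.bool)

    # M_surf: voxels that contain at least one semi-dense point.
    idx = to_voxel_idx(points_world)
    valid = ((idx >= 0) & (idx < dims)).all(dim=-1)
    ix, iy, iz = idx[valid].unbind(-1)
    surf[iz, iy, ix] = True

    # M_free: voxels on the ray from the observing camera to the point,
    # sampled at evenly spaced depths that stop before the surface hit.
    alphas = torch.linspace(0.0, 1.0, num_ray_samples + 1)[:-1]
    for a in alphas:
        samples = cam_centers_world + a * (points_world - cam_centers_world)
        idx = to_voxel_idx(samples)
        valid = ((idx >= 0) & (idx < dims)).all(dim=-1)
        ix, iy, iz = idx[valid].unbind(-1)
        free[iz, iy, ix] = True

    return surf, free
```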

1.2.2 EVL data flow

  1. 2D backbone: Frozen DinoV2.5 extracts per-frame feature maps with \(F\) channels for each stream \(s \in S\).
  2. Voxel lifting: Each voxel center in the gravity-aligned grid (anchored at the last RGB pose) is projected onto the \(|S| \times T\) feature maps to sample 2D features using bilinear interpolation. The projections are described by \((\mathbf{K}^s,\mathbf{T}_t^s)\) (efm3d/model/lifter.py:398). This yields a \(|S| \times T \times F \times D \times H \times W\) feature volume.
  3. Temporal/stream aggregation: Lifted features are aggregated (mean, variance) across \(t\) and \(s \rightarrow 2F \times D \times H \times W\) feature volume.
  4. Geometric priors: Point and free-space masks from \(\mathcal{P}_{\text{semi}}\) are appended, yielding a \((2F+2) \times D \times H \times W\) volume before the 3D U-Net.
  5. 3D backbone & heads: A 3D InvResnet FPN U-Net processes the volume and feeds the surface and detection heads.
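As a rough illustration of steps 2 and 3, the sketch below projects voxel centers with a plain pinhole model and pools with a masked mean/variance. The actual lifter handles fisheye calibration, batching, and chunked projection, so treat this as a conceptual sketch rather than the efm3d implementation.

```python
import torch
import torch.nn.functional as F

def lift_and_aggregate(feats_2d, K, T_cam_world, voxel_centers_world):
    """Sketch of voxel lifting + mean/variance aggregation (pinhole model assumed).

    feats_2d:            (S, T, F, h, w) frozen-backbone feature maps.
    K:                   (S, 3, 3) pinhole intrinsics per stream.
    T_cam_world:         (S, T, 4, 4) world -> camera transforms.
    voxel_centers_world: (D, H, W, 3) gravity-aligned voxel-center positions.
    """
    S, T, Fc, h, w = feats_2d.shape
    D, H, W = voxel_centers_world.shape[:3]
    pts = voxel_centers_world.reshape(-1, 3)                       # (N, 3)
    pts_h = torch.cat([pts, torch.ones_like(pts[:, :1])], dim=-1)  # (N, 4)

    lifted = torch.zeros(S, T, Fc, pts.shape[0])
    valid = torch.zeros(S, T, 1, pts.shape[0])
    for s in range(S):
        for t in range(T):
            p_cam = (T_cam_world[s, t] @ pts_h.T)[:3]              # (3, N)
            uvz = K[s] @ p_cam
            uv = uvz[:2] / uvz[2].clamp(min=1e-6)                  # pixel coordinates
            # normalize to [-1, 1] for grid_sample, mark in-frustum voxels
            grid = torch.stack([uv[0] / (w - 1) * 2 - 1,
                                uv[1] / (h - 1) * 2 - 1], dim=-1)  # (N, 2)
            in_view = (p_cam[2] > 0) & (grid.abs() <= 1).all(dim=-1)
            sampled = F.grid_sample(feats_2d[s, t][None], grid[None, None],
                                    align_corners=True)            # (1, F, 1, N)
            lifted[s, t] = sampled[0, :, 0] * in_view.float()
            valid[s, t, 0] = in_view.float()

    # masked mean/variance over streams and time -> 2F channels
    counts = valid.sum(dim=(0, 1)).clamp(min=1)                    # (1, N)
    mean = lifted.sum(dim=(0, 1)) / counts                         # (F, N)
    var = ((lifted - mean) ** 2 * valid).sum(dim=(0, 1)) / counts  # (F, N)
    feat_vol = torch.cat([mean, var], dim=0).reshape(2 * Fc, D, H, W)
    counts_vol = valid.sum(dim=(0, 1)).reshape(D, H, W)
    return feat_vol, counts_vol
```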

1.2.3 Outputs

The network anchors a gravity-aligned voxel grid \(\mathcal{V}\) in front of the last RGB pose and outputs dense 3D tensors, which are then post-processed into sparse OBB detections:

\[ \mathcal{V}_{out} = \left\{\mathbf{V}_{\text{occ}}, \mathbf{V}_{\text{cent}}, \mathbf{V}_{\text{bbox}}, \mathbf{V}_{\text{class}}\right\}\; \rightarrow\; \mathcal{O} = \{\mathbf{b}_i^{3D}, c_i, s_i\}_{i=1}^{N} \]

Outputs – 3D Surface Regression

  • \(\mathbf{V}_{\text{occ}} \in \mathbb{R}^{D\times H\times W}\): Per-voxel occupancy logits whose sigmoid approximates surface probability; supervised with free/surface/occupied samples from GT depth (efm3d/utils/reconstruction.py:37).

Outputs – 3D OBB Detection

  • \(\mathbf{V}_{\text{cent}} \in [0,1]^{D\times H\times W}\): Centerness scores marking probable box centers prior to 3D NMS (efm3d/utils/evl_loss.py:120).
  • \(\mathbf{V}_{\text{bbox}} \in \mathbb{R}^{7 \times D \times H \times W}\): Regression channels for box size, center offset, and yaw in the gravity-aligned voxel frame (efm3d/model/evl.py:200).
  • \(\mathbf{V}_{\text{class}} \in \mathbb{R}^{C \times D \times H \times W}\): Class logits over \(C=43\) semantic categories; softmax delivers per-voxel class probabilities.
  • \(\mathcal{O} = \{\mathbf{b}_i^{3D}, c_i, s_i\}\): Sparse oriented boxes after centerness thresholding and NMS, including 7-DoF geometry, semantic ID, and confidence (efm3d/model/evl.py:137).
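A minimal sketch of the dense-to-sparse decoding is shown below, assuming batch size 1 (batch dimension squeezed out) and omitting the 3D NMS step that EVL applies via simple_nms3d; the threshold and top-k values are illustrative, not the configured ones.

```python
import torch

def decode_obbs(cent_pr, bbox_pr, clas_pr, cent_thresh=0.3, top_k=128):
    """Sketch of centerness thresholding + top-k selection (NMS omitted).

    cent_pr: (1, D, H, W) centerness probabilities.
    bbox_pr: (7, D, H, W) per-voxel box parameters (sizes, offsets, yaw).
    clas_pr: (C, D, H, W) per-voxel class probabilities.
    """
    D, H, W = cent_pr.shape[1:]
    scores = cent_pr.reshape(-1)
    idx = torch.nonzero(scores > cent_thresh).squeeze(-1)
    idx = idx[scores[idx].argsort(descending=True)][:top_k]

    # voxel indices of the kept centers (flattened order is z, y, x)
    z = idx // (H * W)
    y = (idx % (H * W)) // W
    x = idx % W

    boxes = bbox_pr.reshape(7, -1)[:, idx].T                   # (K, 7) size/offset/yaw
    probs = clas_pr.reshape(clas_pr.shape[0], -1)[:, idx].T    # (K, C)
    classes = probs.argmax(dim=-1)
    return torch.stack([x, y, z], dim=-1), boxes, classes, scores[idx]
```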

1.3 Batch Dictionary Overview

Check out EVL Training Format for more info.

Ground-Truth Tensors

  • rgb/distance_m: Ray distances from GT depth maps, used to sample free/surface/occupied points for occupancy supervision.
  • points/p3s_world, points/dist_std: Semi-dense SLAM points with uncertainty, feeding the point and free-space masks.
  • obbs/padded_snippet: Gravity-aligned GT boxes (ObbTW) with per-frame padding, semantic labels, and projected 2D boxes efm3d/dataset/efm_model_adaptor.py:480.
  • snippet/t_world_snippet, pose/t_snippet_rig: Pose anchoring the gravity-aligned snippet frame in world coordinates.

Other tensors

  • rgb/img, slaml/img, slamr/img: Normalized image tensors of shape \([B, T, C, H, W]\).
  • rgb/calib, slaml/calib, slamr/calib: CameraTW wrappers holding per-frame intrinsics/extrinsics.
  • rgb/t_snippet_rig, slaml/t_snippet_rig, slamr/t_snippet_rig: Camera-to-snippet poses (PoseTW) for each stream.
  • pose/t_snippet_rig, pose/t_world_rig: Rig trajectory expressed in snippet/world frames.
  • points/dist_std, points/inv_dist_std: Per-point distance uncertainty (standard deviation and its inverse) for the semi-dense points.
  • voxel/feat: Lifted 3D feature volume (channels × \(D \times H \times W\)) before the 3D U-Net.
  • voxel/counts: Normalized projection counts per voxel (observation frequency).
  • voxel/occ_input: Concatenated surface/free-space masks.
  • voxel/pts_world: World coordinates of voxel centers.
  • voxel/T_world_voxel: Pose of the voxel grid with respect to world frame.
  • voxel/selectT: Frame index used to anchor the voxel grid.
  • occ_pr, cent_pr, bbox_pr, clas_pr: Raw head outputs prior to thresholding/NMS.
  • obbs_pr_nms, cent_pr_nms: Post-processed detection outputs (sparse OBBs and filtered centerness).

You’re looking at one mixed “debug dictionary” that EVL returns: it contains (i) 2D backbone artifacts, (ii) the lifted voxel grid + observation masks, (iii) the refined 3D neck features, and (iv) the task heads (occupancy + 3D OBB detection). The keys are set mostly in Lifter.forward() (the voxel/* and rgb/* keys) and EVL.forward() + EVL.post_process() (neck + heads + decoded boxes). (GitHub)

Below I explain each entry you printed and give a use / don’t-use recommendation for a VIN-style RRI scorer.


2 The short, skeptical answer to “should we use head outputs instead of neck features?”

Yes — for a minimal VIN, it’s reasonable to start with head outputs, because:

  • occ_pr is already an explicit surface/occupancy reconstruction signal (bounded, low‑dimensional). (GitHub)
  • The lifter provides explicit observation evidence (voxel/counts, voxel/occ_input + free-space mask inside voxel/feat) that’s highly aligned with NBV/RRI intuition: choose views that cover unknown or weakly observed regions. (GitHub)

But: don’t throw away neck features permanently. Neck features are the “pre-decision” representation that heads compress into very few channels. If performance stalls, re‑introduce neck features via a 1×1×1 compression. (GitHub)


3 Key-by-key explanation and “should VIN use it?”

I’ll group them by function. Shapes refer to your printout (they depend on config/checkpoint).

3.1 A) Core voxel geometry / coordinate contract

3.1.1 voxel/T_world_voxel — PoseTW (B, 12)

What it is: The pose of the voxel grid in world coordinates (world←voxel). The voxel grid is anchored to the last frame of the snippet and gravity-aligned (roll/pitch ≈ 0; yaw-only). (GitHub) Use for VIN? Must-use. You need it to map candidate-frustum 3D sample points (world) into voxel coordinates for trilinear sampling.

3.1.2 voxel_extent — (6,)

What it is: [x_min, x_max, y_min, y_max, z_min, z_max] defining the voxel grid’s metric bounds in the voxel frame (meters). EVL passes this around explicitly. (GitHub) Use for VIN? Must-use. Required to normalize voxel coordinates and implement valid masks for sampling.

3.1.3 voxel/pts_world — (B, D*H*W, 3) (here 110592 = 48³)

What it is: The world coordinates of every voxel center (a flattened voxel grid). It’s generated from voxel_extent and T_world_voxel. (GitHub) Use for VIN? Usually no. It’s redundant (you can generate the same points on demand). Only useful for debugging/visualization or if you want a one-time precomputed voxel-center point cloud.

3.1.4 voxel/selectT — (B,) int

What it is: The selected time index used to anchor the voxel grid (by default T-1). (GitHub) Use for VIN? No (training), yes (debug). You typically only need T_world_voxel directly.


3.2 B) Observation evidence injected into 3D (high value for NBV/RRI)

3.2.1 voxel/counts — (B, D, H, W) int64

What it is: For each voxel, how many snippet frames/streams produced a valid projection into the image during lifting. It is computed as a sum over valid projection masks. (GitHub) Use for VIN? Yes (strongly recommended). This is a direct proxy for coverage / observation density inside EVL’s voxel volume.

Practical note: Normalize it to [0,1] by dividing by the maximum possible count (≈ T * #streams_used), converted to float.
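For example, a sketch of that normalization (the true maximum depends on how many streams were actually lifted, so this is an approximate upper bound):

```python
def coverage_from_counts(counts, num_frames, num_streams):
    """Normalize voxel/counts (B, D, H, W) to a [0, 1] coverage map (sketch)."""
    max_count = float(num_frames * num_streams)   # e.g. T * |S|
    return (counts.float() / max_count).clamp(max=1.0)
```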

3.2.2 voxel/counts_m — (B, D, H, W) int64

What it is: A “masked/debug” variant of the counts (EVL explicitly says it’s passed for debugging). (GitHub) Use for VIN? No for learning, unless you’ve verified it’s identical to what you want. Prefer voxel/counts + explicit masking logic in your VIN.

3.2.3 voxel/occ_input — (B, 1, D, H, W) float (mask)

What it is: A binary occupancy mask derived from the input 3D points (semi-dense/GT points in the batch): EVL voxelizes the point cloud and turns voxels with any points into 1.0. (GitHub) Use for VIN? Yes. This is extremely aligned with the oracle’s “current reconstruction” (semi-dense points), and it helps predict where RRI can still improve.

3.2.4 voxel/feat — (B, F, D, H, W) (here F=34)

What it is: The raw lifted voxel feature volume before the neck. It is:

  • lifted + aggregated 2D features (here 32 channels),
  • concatenated with point_masks and free_masks (2 extra channels). (GitHub)

Use for VIN?

  • Not as-is, unless you compress it. It’s a fat tensor that’s less refined than the neck, and you’ll pay for sampling it.
  • But: the last channel is the free-space mask (from ray samples up to observed points), which is valuable and otherwise not exposed as its own key in your printout. (GitHub)

Actionable recommendation: For a head-centric VIN, extract just:

  • occ_input = voxel/occ_input (occupied evidence),
  • free_input = voxel/feat[:, -1:] (free-space evidence),
  • counts = voxel/counts (coverage), and optionally ignore the rest of voxel/feat.
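A sketch of this extraction, assuming the returned dictionary is named `out` and that the last channel of voxel/feat is indeed the free-space mask (worth verifying against the lifter); the default `num_frames`/`num_streams` are illustrative:

```python
import torch

def build_evidence_stack(out, num_frames=10, num_streams=3):
    """Sketch: minimal voxel-aligned evidence stack from EVL's output dict."""
    occ_input = out["voxel/occ_input"]                  # (B, 1, D, H, W) occupied evidence
    free_input = out["voxel/feat"][:, -1:]              # (B, 1, D, H, W) free-space evidence
    max_count = float(num_frames * num_streams)         # approximate upper bound on counts
    coverage = (out["voxel/counts"].float() / max_count).clamp(max=1.0).unsqueeze(1)
    return torch.cat([occ_input, free_input, coverage], dim=1)   # (B, 3, D, H, W)
```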

3.3 C) 3D neck features (rich, but heavier)

3.3.1 neck/occ_feat — (B, 64, D, H, W)

What it is: The refined 3D feature volume right before the occupancy head. EVL stores it explicitly. (GitHub) Use for VIN? Optional. If you want “maximum information” from EVL, this is the most stable attachment point. But it’s heavier than head outputs.

If you use it: compress channels with a 1×1×1 conv (e.g., 64 → 16/32) before any pooling/sampling.
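For instance (a sketch; `out` is the returned dictionary and the 16-channel target width is an arbitrary choice):

```python
import torch.nn as nn

# 1x1x1 channel compression ahead of any pooling/sampling of the neck volume.
compress = nn.Conv3d(in_channels=64, out_channels=16, kernel_size=1)
neck_small = compress(out["neck/occ_feat"])   # (B, 16, D, H, W)
```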

3.3.2 neck/obb_feat — (B, 64, D, H, W)

What it is: Refined 3D features right before the OBB detection head. (GitHub) Use for VIN? Optional / future entity-aware. For pure geometry-RRI, you can omit it initially.


3.4 D) Occupancy head output (surface reconstruction head)

3.4.1 occ_pr — (B, 1, D, H, W) float32

What it is: The occupancy prediction after sigmoid: occ_pr = sigmoid(occ_logits). (GitHub) Use for VIN? Yes. This is the single most sensible “head output” for an RRI predictor.

How to use:

  • Global pooling (mean/max) for a snippet descriptor.
  • Candidate-conditioned frustum sampling (sample voxels along candidate view rays) to estimate how much unknown/occupied structure lies ahead.
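A sketch of the global-pooling option is shown below, computing mean/max occupancy plus mean binary entropy as a crude uncertainty cue (the choice of statistics is illustrative; candidate-conditioned frustum sampling is sketched in Section 4):

```python
import torch

def occ_descriptor(occ_pr, eps=1e-6):
    """Snippet-level descriptor from occ_pr (B, 1, D, H, W) (sketch)."""
    entropy = -(occ_pr * occ_pr.clamp(min=eps).log()
                + (1 - occ_pr) * (1 - occ_pr).clamp(min=eps).log())
    return torch.cat([
        occ_pr.mean(dim=(2, 3, 4)),   # expected occupied fraction
        occ_pr.amax(dim=(2, 3, 4)),   # strongest surface evidence
        entropy.mean(dim=(2, 3, 4)),  # mean binary entropy as an uncertainty cue
    ], dim=1)                         # (B, 3)
```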

3.5 E) OBB detection head raw grids (dense, per-voxel predictions)

These are the “dense detection maps” before decoding into box lists.

3.5.1 cent_pr — (B, 1, D, H, W)

What it is: Center probability map after sigmoid: cent_pr = sigmoid(cent_logits). (GitHub) Use for VIN? Maybe. It encodes “objectness / box centers.”

  • If your RRI is purely geometric, it’s optional.
  • If you want an object-biased NBV later, it’s a nice dense cue.

3.5.2 bbox_pr — (B, 7, D, H, W)

What it is: Per-voxel bounding box parameters (EVL comment: height, width, depth, offset_h, offset_w, offset_d, yaw). It is post-processed with bounded transforms: sizes from sigmoid into [bbox_min, bbox_max], offsets with tanh scaled by offset_max, yaw with tanh scaled by yaw_max. (GitHub) Use for VIN? No for v0.1. It’s harder to use correctly and tends to overcomplicate a minimal scorer. Keep it for entity-aware extensions if needed.
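For reference, a sketch of what those bounded transforms look like; the constants (bbox_min, bbox_max, offset_max, yaw_max) are illustrative placeholders, not EVL's configured values:

```python
import math
import torch

def decode_bbox_channels(bbox_logits, bbox_min=0.05, bbox_max=5.0,
                         offset_max=0.5, yaw_max=math.pi):
    """Sketch of the bounded box-parameter transforms described above.

    bbox_logits: (B, 7, D, H, W) raw regression channels
                 (height, width, depth, offset_h, offset_w, offset_d, yaw).
    """
    sizes = bbox_min + torch.sigmoid(bbox_logits[:, 0:3]) * (bbox_max - bbox_min)
    offsets = torch.tanh(bbox_logits[:, 3:6]) * offset_max   # relative to the voxel cell
    yaw = torch.tanh(bbox_logits[:, 6:7]) * yaw_max          # radians
    return torch.cat([sizes, offsets, yaw], dim=1)
```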

3.5.3 clas_pr — (B, 29, D, H, W)

What it is: Per-voxel semantic class probabilities via softmax over classes. (GitHub) Use for VIN? No for geometry-only v0.1. For entity-aware NBV, you could use it, but you’ll likely prefer the decoded OBB list + class probs instead of this dense map.


3.6 F) Decoded OBB outputs (post-process + NMS; token-like)

3.6.1 obbs_pr_nms — ObbTW (B, K, 34) (here K=128)

What it is: Top-K decoded predicted boxes (after NMS) in voxel coordinates (this is the output right after voxel2obb(...); simple_nms3d(...)). (GitHub) Use for VIN? Future (entity-aware). Useful if you want candidate scoring based on predicted objects, but not required for basic RRI prediction.

3.6.2 cent_pr_nms — (B, 1, D, H, W)

What it is: The center map after applying 3D NMS suppression. (GitHub) Use for VIN? No. Keep for debugging.

3.6.3 obbs/pred — ObbTW (B, K, 34)

What it is: Predicted OBBs transformed into snippet coordinates (EVL computes T_snippet←voxel and transforms the NMS boxes). (GitHub) Use for VIN? Future (entity-aware). This is the version you’d want if you need consistency with your snippet/rig reference frames.

3.6.4 obbs/pred_viz — ObbTW (B, K, 34)

What it is: A visualization-friendly variant (typically same boxes but potentially adjusted for plotting conventions). (GitHub) Use for VIN? No.

3.6.5 obbs/pred/probs_full — list length B of Tensor

What it is: Per-box full class probability vectors corresponding to decoded boxes (EVL stores them as a Python list per batch element). (GitHub) Use for VIN? Future. If you treat boxes as tokens, you’ll use these for semantic weighting.

3.6.6 obbs/pred/sem_id_to_name — dict

What it is: Class-id to name mapping for visualization/debug. (GitHub) Use for VIN? No (not a feature).

(Your probs_ful_viz key looks like a typo in the printout; EVL stores a probs_full list — verify naming on your side.) (GitHub)


3.7 G) 2D backbone artifacts (almost always “don’t use” for VIN)

3.7.1 rgb/feat2d_upsampled — (B, T, C, H, W) (here 1×20×32×288×288)

What it is: The upsampled 2D features used for lifting into voxels. EVL also returns these for visualization/debug. (GitHub) Use for VIN? No. Too heavy and breaks the “VIN purely on 3D voxel repr” principle.

3.7.2 rgb/token2d — list of tensors

What it is: Debug/visualization copies of 2D backbone outputs; EVL may detach and move these to CPU when needed (especially multi-layer features). (GitHub) Use for VIN? No. Risk of device mismatches + unnecessary memory.


4 Concrete “what to use” recommendation for VIN v0.1 (head-centric, RRI-focused)

If you want a minimal but inductive-bias-aligned feature set:

4.1 Use these voxel-aligned volumes (all sampleable in candidate frusta)

  • occ_pr (1ch): predicted occupancy probability. (GitHub)
  • voxel/occ_input (1ch): observed occupied evidence from semi-dense points. (GitHub)
  • free_input = voxel/feat[:, -1:] (1ch): observed free-space evidence (ray sampled). (GitHub)
  • counts = voxel/counts normalized to [0,1]: observation coverage. (GitHub)

Optional:

  • cent_pr (1ch): objectness density cue. (GitHub)

4.2 Plus the required coordinate contract

  • voxel/T_world_voxel, voxel_extent. (GitHub)

4.3 Skip for now

  • bbox_pr, clas_pr, decoded OBBs (unless you explicitly do entity-aware NBV). (GitHub)
  • all rgb/* artifacts. (GitHub)

This gives you a 4–5 channel 48³ grid — extremely manageable — that still captures the things RRI cares about: occupied vs free vs unknown + how well it’s been observed.
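A sketch of the candidate-frustum sampling this coordinate contract enables, assuming T_world_voxel has already been converted from its PoseTW representation to a 4x4 matrix and ignoring half-voxel alignment details:

```python
import torch
import torch.nn.functional as F

def sample_voxel_features(vol, pts_world, T_world_voxel, voxel_extent):
    """Trilinearly sample a (B, C, D, H, W) volume at world-frame points (sketch).

    pts_world:     (B, N, 3) candidate-frustum sample points in the world frame.
    T_world_voxel: (B, 4, 4) voxel -> world transform as a matrix.
    voxel_extent:  (x_min, x_max, y_min, y_max, z_min, z_max) in meters.
    """
    B, N, _ = pts_world.shape
    x0, x1, y0, y1, z0, z1 = voxel_extent
    lo = pts_world.new_tensor([x0, y0, z0])
    hi = pts_world.new_tensor([x1, y1, z1])

    # world -> voxel frame
    T_voxel_world = torch.inverse(T_world_voxel)                           # (B, 4, 4)
    pts_h = torch.cat([pts_world, torch.ones_like(pts_world[..., :1])], dim=-1)
    pts_vox = torch.einsum("bij,bnj->bni", T_voxel_world, pts_h)[..., :3]

    # metric voxel coords -> normalized [-1, 1] grid_sample coords (x, y, z order)
    grid = (pts_vox - lo) / (hi - lo) * 2 - 1                              # (B, N, 3)
    valid = (grid.abs() <= 1).all(dim=-1, keepdim=True).float()            # inside the grid

    sampled = F.grid_sample(vol, grid.view(B, N, 1, 1, 3),
                            mode="bilinear", align_corners=True)           # (B, C, N, 1, 1)
    return sampled[..., 0, 0].transpose(1, 2) * valid                      # (B, N, C)
```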


5 Why this aligns with RRI and your oracle pipeline

Your oracle RRI is ultimately measuring the change in point-to-mesh distances after adding candidate-view points. The EVL outputs above provide:

  • What’s already reconstructed: occ_input (points) + free_input (observed empty space). (GitHub)
  • What EVL believes exists / surfaces: occ_pr. (GitHub)
  • Where the model had coverage: counts. (GitHub)

And all of these are in the same gravity-aligned voxel frame, so frustum sampling is straightforward and cheap. (GitHub)


5.1 EVL Architecture

  • Frozen 2D foundation encoder: VideoBackboneDinov2 upsamples DinoV2.5 tokens to full resolution per stream, handling RGB and SLAM cameras with optional vignette correction (efm3d/model/video_backbone.py:132).
  • Egocentric voxel lifter: The Lifter class samples voxel centers into every frame, aggregates valid projections per stream, and appends binary point/free-space masks derived from semi-dense point clouds (efm3d/model/lifter.py:180). Aggregation uses a masked mean over time/streams while tracking valid sample counts (efm3d/model/lifter.py:379), and the voxel frame is gravity-aligned via inertial cues (efm3d/model/lifter.py:345).
  • 3D neck and heads: EVL instantiates an InvResnetFpn3d neck followed by task heads for occupancy, centerness, box regression, and classification (efm3d/model/evl.py:35). Bounding boxes are parameterized as sizes, offsets, and yaw relative to the voxel cell, with logits converted through constrained activations (efm3d/model/evl.py:200).
  • Post-processing and persistence: Voxel logits are thinned with 3D NMS and converted to sparse OBB predictions in snippet coordinates (efm3d/model/evl.py:137). Sequence-level fusion is handled downstream by volumetric and tracking utilities (efm3d/inference/fuse.py:55).

5.2 Multi-task Losses & Training Signals

  • Occupancy supervision combines sigmoid cross-entropy with total-variation regularization to encourage smooth surfaces (efm3d/utils/evl_loss.py:150).
  • OBB training blends focal loss on centerness/class logits, constrained box regression, and a rotated 3D IoU loss computed on gravity-aligned 7-DoF boxes (efm3d/utils/evl_loss.py:90).
  • Semi-dense point clouds and free-space samples provide geometric supervision signals by rasterizing both into the voxel lattice and masking invalid voxels, ensuring supervision focuses on observed regions (efm3d/model/lifter.py:262).
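A hedged sketch of how such an occupancy objective could be assembled (masking and the TV weight are illustrative; the actual loss lives in efm3d/utils/evl_loss.py):

```python
import torch
import torch.nn.functional as F

def occupancy_loss(occ_logits, occ_target, valid_mask, tv_weight=0.01):
    """Sketch: masked sigmoid cross-entropy plus total-variation regularization.

    occ_logits: (B, 1, D, H, W) raw head output (pre-sigmoid).
    occ_target: (B, 1, D, H, W) 1 for surface/occupied samples, 0 for free space.
    valid_mask: (B, 1, D, H, W) 1 where depth/point evidence supervises the voxel.
    """
    bce = F.binary_cross_entropy_with_logits(occ_logits, occ_target, reduction="none")
    bce = (bce * valid_mask).sum() / valid_mask.sum().clamp(min=1)

    # total variation: penalize differences between neighboring voxels along each axis
    p = torch.sigmoid(occ_logits)
    tv = ((p[..., 1:, :, :] - p[..., :-1, :, :]).abs().mean()
          + (p[..., :, 1:, :] - p[..., :, :-1, :]).abs().mean()
          + (p[..., :, :, 1:] - p[..., :, :, :-1]).abs().mean())
    return bce + tv_weight * tv
```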

5.3 Assumptions & Failure Modes

  • Accurate poses & calibration: Lifting assumes precise extrinsics/intrinsics; pose drift or intrinsics errors misplace features and degrade both heads.
  • Static scenes: Dynamic objects violate the “semi-dense point = stable surface” assumption and can pollute masks; temporal filtering or motion segmentation is required in real deployments.
  • Mask imbalance: Free space vastly outweighs surface voxels; improper loss weighting can bias the model toward predicting empty space.
  • Resolution vs memory: A larger \(D\times H\times W\) improves detail but grows memory cubically; mixed precision and stream-wise channel reduction are used in practice. A small version of EVL can run inference on an RTX 3080.
  • No inherent streaming capability: EVL is not designed for online streaming inference in which previous encodings could be reused.

5.4 EVL as a Backbone for VIN-style RRI Prediction

  1. Scene state encoding: Use EVL’s per-snippet voxel/feat, voxel/counts, and voxel/occ_input tensors as latent inputs for a VIN head. Counts expose coverage; occupancy logits and entropy/variance maps (computed across time or streams) become uncertainty cues; centerness and class volumes provide object-local importance signals [2].
  2. Safety-constrained action space: Treat high \(\mathbf{V}_{\text{occ}}\) probabilities, fused OBB volumes, and accumulated free-space masks as hard feasibility constraints when sampling NBV poses. Reject candidate waypoints that fall inside solids, intersect predicted support surfaces, or violate robot clearance limits before scoring.
  3. Coverage & persistence tracking: Persist EVL outputs into a global log-odds occupancy grid or mesh via VolumeFusion (efm3d/inference/fuse.py:187). Maintain per-voxel coverage (voxel/counts), visitation timestamps, and uncertainty, enabling RRI computation, diminishing-return penalties, and stop criteria.
  4. Candidate view scoring: Use EVL ray utilities (ray_obb_intersection, sample_depths_in_grid) to simulate visibility from proposed poses (efm3d/utils/ray.py:99). Features such as expected newly-seen surface area, occlusion probability, and semantic novelty can augment VIN inputs and correlate with RRI or information gain.
  5. Entity-aware & task-aware RRI: Combine \(\mathbf{V}_{\text{class}}\) with tracked OBB semantics to compute class-conditional RRIs, aligning with GenNBV’s multi-source embeddings while retaining VIN’s reconstruction-driven objective [3]. Weight RRIs according to task priorities (e.g., movable assets, structural safety checks).
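As a concrete (hypothetical) starting point, a per-candidate gain could combine these cues as follows; the channel indices follow the Section 4 stack and are illustrative:

```python
import torch

def candidate_gain(sampled_feats, occ_idx=0, free_idx=1, cov_idx=2, occ_pr_idx=3):
    """Score one candidate view from frustum-sampled voxel channels (sketch).

    sampled_feats: (N, C) features sampled along the candidate's rays, e.g. one
                   batch element of the trilinear sampler sketched in Section 4.
    """
    observed = sampled_feats[:, occ_idx] + sampled_feats[:, free_idx]   # any evidence
    unknown = 1 - observed.clamp(max=1)                                 # never observed
    predicted_surface = sampled_feats[:, occ_pr_idx]                    # EVL's belief
    low_coverage = 1 - sampled_feats[:, cov_idx]                        # weakly observed
    # heuristic gain: surface the model expects in regions that are unknown or weakly covered
    return (predicted_surface * torch.maximum(unknown, low_coverage)).mean()
```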

5.5 Implementation Sources

These components give us a ready-made scene encoder whose volumetric features, semantic predictions, and ray utilities map cleanly onto our NBV pipeline: EVL provides the shared backbone, VIN supplies the lightweight RRI head, and ASE meshes deliver oracle supervision for training.

References

[1] J. Straub, D. DeTone, T. Shen, N. Yang, C. Sweeney, and R. Newcombe, “EFM3D: A benchmark for measuring progress towards 3D egocentric foundation models,” 2024. Available: https://arxiv.org/abs/2406.10224
[2] N. Frahm et al., “VIN-NBV: A view introspection network for next-best-view selection,” 2025. Available: https://arxiv.org/abs/2505.06219
[3] X. Chen, Q. Li, T. Wang, T. Xue, and J. Pang, “GenNBV: Generalizable next-best-view policy for active 3D reconstruction,” 2024. Available: https://arxiv.org/abs/2402.16174