Project Aria
1 Project Aria: Actor-Visible Egocentric Sensing
Primary source. Project Aria: A New Tool for Egocentric Multi-Modal AI Research [1].
Local source. main.tex, device.tex, tools.tex, and mps.tex.
Related ARIA-NBV pages. EFM3D/EVL, ASE dataset, semi-dense point clouds, and oracle RRI API.
1.1 Core contribution
Project Aria is not an NBV planner. It is the egocentric sensing contract behind ARIA-NBV. The paper introduces a wearable research device with calibrated, time-aligned multimodal streams, plus a tooling/MPS stack that provides trajectories, calibration, semi-dense mapping, gaze, and related perception products [1].
For ARIA-NBV, the key point is boundary-setting: Project Aria-style observations define what the actor may plausibly see, while ASE meshes and GT annotations provide offline supervision/evaluation only.
1.2 Verified paper signals
| signal | source-backed detail | ARIA-NBV relevance |
|---|---|---|
| Sensor suite | The device includes RGB, SLAM, eye-tracking, IMU, audio, and other sensor streams with calibration/time-alignment requirements. | Candidate scoring must respect calibrated egocentric camera streams rather than assuming perfect RGB-D input. |
| VRS/tooling | Project Aria records and exposes sensor data through VRS and Project Aria tools. | Cache lineage should preserve stream identity, calibration, and source versions. |
| MPS trajectories | MPS produces closed-loop trajectories and pose products. | Provides the actor-visible pose/reconstruction state for ASE-style snippets. |
| Online calibration | The MPS/tooling stack handles online calibration products. | Frame lineage and calibration source must be stored with rollouts and offline labels. |
| Semi-dense point clouds | MPS produces semi-dense maps rather than dense GT geometry. | ARIA-NBV’s current reconstruction proxy is semi-dense point support, not a full mesh. |
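The time-alignment requirement in the table can be made concrete with a small pairing check. The following is a minimal sketch, assuming each stream exposes a sorted list of integer nanosecond timestamps; the function names and the tolerance parameter are hypothetical and are not part of Project Aria tools or MPS.

```python
from bisect import bisect_left
from typing import List, Optional, Sequence


def nearest_index(sorted_ts_ns: Sequence[int], query_ns: int) -> int:
    """Return the index of the timestamp closest to query_ns (ties go to the earlier one)."""
    if not sorted_ts_ns:
        raise ValueError("timestamp list is empty")
    i = bisect_left(sorted_ts_ns, query_ns)
    if i == 0:
        return 0
    if i == len(sorted_ts_ns):
        return len(sorted_ts_ns) - 1
    before, after = sorted_ts_ns[i - 1], sorted_ts_ns[i]
    return i if (after - query_ns) < (query_ns - before) else i - 1


def pair_streams(
    ref_ts_ns: Sequence[int],
    other_ts_ns: Sequence[int],
    tolerance_ns: int,
) -> List[Optional[int]]:
    """For each reference timestamp, give the index of the matching frame in the
    other stream, or None when no frame lies within tolerance_ns (hypothetical policy)."""
    pairs: List[Optional[int]] = []
    for t in ref_ts_ns:
        j = nearest_index(other_ts_ns, t)
        pairs.append(j if abs(other_ts_ns[j] - t) <= tolerance_ns else None)
    return pairs
```

Unpaired reference frames come back as `None`, so a cache builder can drop or flag them rather than silently pairing frames that are far apart in time.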
1.3 ARIA-NBV adoption
The actor-visible state should stay limited to deployment-plausible evidence:
| actor-visible input | examples |
|---|---|
| calibrated streams | RGB/SLAM images, intrinsics, extrinsics, time alignment |
| pose and history | current/historical rig poses, MTD, selected view history |
| reconstruction proxy | semi-dense PC, visibility/support metadata |
| learned state | EVL occupancy/evidence, predicted OBB support |
| candidates | candidate poses, candidate cameras, feasibility and mask metadata |
GT meshes, GT OBBs, GT masks, and dense target crops are offline oracle/evaluation assets. They can produce labels, but they should not enter the actor-visible input for the main OBS-SEL / PRED-Q / GT-EVAL protocol.
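One way to enforce this separation in code is to classify every sample key as either actor-visible or oracle-only and to reject oracle keys at the actor boundary. The sketch below is illustrative only: the key names are hypothetical labels loosely mirroring the table above and are not an ARIA-NBV schema.

```python
from typing import Any, Dict, Mapping, Tuple

# Hypothetical key names, grouped after the actor-visible table; not a real schema.
ACTOR_VISIBLE_KEYS = frozenset({
    "calibrated_streams",
    "pose_history",
    "reconstruction_proxy",
    "learned_state",
    "candidates",
})
# Offline oracle/evaluation assets that must never reach the actor.
ORACLE_ONLY_KEYS = frozenset({"gt_mesh", "gt_obbs", "gt_masks", "dense_target_crops"})


def split_actor_and_oracle(
    sample: Mapping[str, Any],
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split a sample into actor-visible inputs and oracle-only assets.

    Keys that belong to neither group are rejected so that new fields must be
    classified explicitly before they can be used.
    """
    unknown = set(sample) - ACTOR_VISIBLE_KEYS - ORACLE_ONLY_KEYS
    if unknown:
        raise KeyError(f"unclassified keys: {sorted(unknown)}")
    actor = {k: v for k, v in sample.items() if k in ACTOR_VISIBLE_KEYS}
    oracle = {k: v for k, v in sample.items() if k in ORACLE_ONLY_KEYS}
    return actor, oracle


def assert_no_oracle_leak(actor_input: Mapping[str, Any]) -> None:
    """Raise if any oracle-only key is present in the actor input."""
    leaked = ORACLE_ONLY_KEYS & set(actor_input)
    if leaked:
        raise ValueError(f"oracle assets leaked into actor input: {sorted(leaked)}")
```

Calling `assert_no_oracle_leak` immediately before target selection and candidate scoring keeps the check close to the places where a leak would matter most.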
1.4 Do not adopt
- Do not assume dense depth at inference.
- Do not treat MPS/semi-dense points as ground truth; they are observed reconstruction evidence.
- Do not leak ASE meshes or GT object boxes into target selection or candidate scoring.
- Do not claim real-time AR guidance before incremental state updates and lightweight scoring are demonstrated.
1.5 Open risks / caveats
- Calibration, frame, and timestamp mistakes can yield recordings that look plausible but produce invalid labels.
- Semi-dense point support can be sparse or biased, so any reported RRI should state which reconstruction proxy it was computed against.
- Any rollout/Q store should preserve source lineage: stream, pose, calibration, mesh/version hash, and candidate-generation config.
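The lineage fields listed above could be stored as a small immutable record with a stable digest, so that rollouts and labels built from different sources never share a key. This is a minimal sketch under assumed field names; `SourceLineage` and `lineage_digest` are hypothetical and do not describe an existing ARIA-NBV store.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SourceLineage:
    """Hypothetical lineage record; every field is an opaque string identifier."""
    stream_id: str           # which recorded stream the observation came from
    pose_source: str         # e.g. an identifier for the trajectory product used
    calibration_source: str  # e.g. factory vs. online calibration, plus version
    mesh_version_hash: str   # hash of the offline mesh used for labels
    candidate_config: str    # serialized candidate-generation config


def lineage_digest(lineage: SourceLineage) -> str:
    """Return a SHA-256 hex digest of the record.

    Keys are sorted and separators fixed so the digest does not depend on
    field order or whitespace.
    """
    payload = json.dumps(asdict(lineage), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Storing the digest next to each rollout or Q entry makes it cheap to detect when a cached label was produced under a different calibration or candidate configuration.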