1 SceneScript: Structured Scene Language As Semantic Bridge
Primary source. SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model [1].
Local source. main.tex, sections/dataset.tex, and sections/structured_scene_language.tex.
Related ARIA-NBV pages. Project Aria, EFM3D/EVL, and target-aware research questions.
1.0.1 Core contribution
SceneScript represents indoor scenes as a structured language of geometric commands. Instead of producing only a dense mesh or point cloud, the model predicts an editable sequence of entities such as walls, doors, windows, and object bounding boxes [1].
For ARIA-NBV, the relevance is semantic structure, not an immediate replacement for the planner. SceneScript shows that ASE-scale data, MPS-style point clouds, and typed scene primitives can support a compact global representation that might later guide target selection or semantic subgoals.
1.0.2 Verified paper signals
| signal | source-backed detail | ARIA-NBV relevance |
|---|---|---|
| Structured language | Scenes are encoded as commands such as make_wall, make_door, make_window, and make_bbox. | Useful stretch representation for named targets, layout regions, and semantic planning. |
| Input evidence | The model consumes point clouds from Project Aria / MPS-style reconstruction, discretized to a fixed spatial resolution. | Aligns with ARIA-NBV’s semi-dense observed-state premise. |
| Architecture | A sparse 3D encoder feeds an autoregressive Transformer decoder with grammar/type constraints. | Suggests a future sequence representation, not a required thesis component. |
| ASE scale | SceneScript uses large-scale ASE scenes and annotations. | Supports ASE as the semantic/global planning ecosystem around ARIA-NBV. |
| Editability | Structured commands can be inspected and edited more directly than opaque dense features. | Useful for human-facing target definitions and future semantic diagnostics. |
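
The input-evidence row above mentions discretizing point clouds to a fixed spatial resolution. A minimal sketch of that step follows, assuming simple floor-based voxel indexing. The function name `voxelize` and the `voxel_size` value are illustrative assumptions, not taken from the SceneScript code or paper.

```python
import math
from typing import Iterable

Point = tuple[float, float, float]
Voxel = tuple[int, int, int]


def voxelize(points: Iterable[Point], voxel_size: float) -> list[Voxel]:
    """Map each point to an integer voxel index and drop duplicate voxels."""
    if voxel_size <= 0:
        raise ValueError("voxel_size must be positive")
    occupied = {
        (
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        )
        for x, y, z in points
    }
    # Sort so the output is deterministic regardless of input order.
    return sorted(occupied)


# Hypothetical points in metres; the 0.05 resolution is an assumed value.
points = [
    (0.01, 0.02, 0.00),
    (0.03, 0.04, 0.01),  # falls in the same voxel as the first point
    (0.26, 0.02, 0.00),
    (-0.01, 0.00, 0.00),  # negative coordinates floor to index -1
]
voxels = voxelize(points, voxel_size=0.05)
print(voxels)
```

Deduplicating by voxel index is what makes the input sparse: the encoder sees occupied cells rather than every raw point.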
Example command types:
| command | role |
|---|---|
| make_wall | architectural wall primitive |
| make_door / make_window | openings attached to walls |
| make_bbox | object-level bounding box/entity |
| make_prim | extensible primitive command in later variants |
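
To make the command table concrete, here is a small parsing sketch. It assumes a plain-text serialization of the form `command, key=value, ...` with numeric values. That layout, the parameter names (`id`, `height`, `wall_id`, `width`, `scale_x`), and the helper `parse_command` are hypothetical and chosen only for illustration; the actual SceneScript token format may differ.

```python
from collections import Counter
from dataclasses import dataclass, field

# Command names listed in the table above.
KNOWN_COMMANDS = {"make_wall", "make_door", "make_window", "make_bbox", "make_prim"}


@dataclass
class Command:
    name: str
    params: dict[str, float] = field(default_factory=dict)


def parse_command(line: str) -> Command:
    """Parse one 'name, key=value, ...' line (assumed format) into a Command."""
    name, *pairs = [part.strip() for part in line.split(",")]
    if name not in KNOWN_COMMANDS:
        raise ValueError(f"unknown command: {name!r}")
    params: dict[str, float] = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {pair!r}")
        params[key.strip()] = float(value)
    return Command(name, params)


# Hypothetical three-command scene.
script = """\
make_wall, id=0, height=2.7
make_door, id=1, wall_id=0, width=0.9
make_bbox, id=2, scale_x=0.5
"""
commands = [parse_command(line) for line in script.splitlines() if line.strip()]
counts = Counter(command.name for command in commands)
print(counts)
```

Because each entity is a typed record, it can be inspected, counted, or edited individually, which is the editability property noted in the signals table.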
1.0.3 ARIA-NBV adoption
- Stretch/bridge: SceneScript is a semantic/global representation layer to consider only after observed target selection, target RRI labels, finite-candidate rollouts, and Q_H training are stable.
- Proposal/diagnostic: typed scene entities can help organize reports by walls, objects, doors, or rooms.
- Future target interface: structured commands could become a human-readable target source, but each command must be matched to observed/predicted OBBs or other support evidence before it becomes actor-visible.
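
The matching requirement in the last bullet can be sketched as an overlap gate. This is a minimal illustration under strong assumptions: boxes are axis-aligned rather than oriented, overlap is measured by volumetric IoU, and the names `Box`, `iou`, `supported_targets`, and the `min_iou` threshold are all hypothetical rather than part of ARIA-NBV or SceneScript.

```python
from dataclasses import dataclass

Vec3 = tuple[float, float, float]


@dataclass(frozen=True)
class Box:
    """Axis-aligned box, used here as a simplification of an OBB."""
    min_corner: Vec3
    max_corner: Vec3

    def volume(self) -> float:
        result = 1.0
        for lo, hi in zip(self.min_corner, self.max_corner):
            result *= max(0.0, hi - lo)
        return result


def iou(a: Box, b: Box) -> float:
    """Volumetric intersection-over-union of two axis-aligned boxes."""
    inter = 1.0
    for a_lo, a_hi, b_lo, b_hi in zip(
        a.min_corner, a.max_corner, b.min_corner, b.max_corner
    ):
        inter *= max(0.0, min(a_hi, b_hi) - max(a_lo, b_lo))
    union = a.volume() + b.volume() - inter
    return inter / union if union > 0 else 0.0


def supported_targets(
    candidates: dict[str, Box], observed: list[Box], min_iou: float = 0.5
) -> list[str]:
    """Keep only candidates that overlap some observed box by at least min_iou."""
    return [
        name
        for name, box in candidates.items()
        if any(iou(box, obs) >= min_iou for obs in observed)
    ]


# Hypothetical entities from structured commands and one observed box.
candidates = {
    "chair": Box((0.0, 0.0, 0.0), (1.0, 1.0, 1.0)),
    "table": Box((5.0, 5.0, 0.0), (6.0, 6.0, 1.0)),
}
observed = [Box((0.1, 0.0, 0.0), (1.1, 1.0, 1.0))]
visible = supported_targets(candidates, observed)
print(visible)
```

Only the chair passes the gate here, because the table has no observed support. A real implementation would need oriented-box overlap and a threshold chosen against the protocol's own evidence.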
1.0.4 Do not adopt
- Do not make SceneScript a thesis-core dependency.
- Do not use GT semantic commands or GT entity layouts as actor-visible inputs in the main ARIA-NBV protocol.
- Do not assume high-level layout correctness is enough for fine-detail target-RRI supervision.
- Do not replace geometric validity, candidate masks, or mesh-supervised RRI with language-level scene completeness.
1.0.5 Open risks / caveats
- Structured language can miss fine geometry that matters for Chamfer-style reconstruction quality.
- Command vocabularies and discretization choices constrain what can be represented.
- SceneScript is valuable as a future semantic/global planning bridge, but the current thesis must first prove target-conditioned quality-driven NBV on ASE/EFM evidence.
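
The first caveat can be illustrated with a toy Chamfer computation. This sketch assumes a symmetric Chamfer distance built from mean nearest-neighbour Euclidean distances; other variants use squared distances, and the exact metric used for ARIA-NBV's RRI supervision is not specified here. The point sets are invented for illustration.

```python
import math

Point = tuple[float, float, float]


def chamfer_distance(a: list[Point], b: list[Point]) -> float:
    """Symmetric Chamfer distance using mean nearest-neighbour distances."""

    def one_sided(src: list[Point], dst: list[Point]) -> float:
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_sided(a, b) + one_sided(b, a)


# Hypothetical surfaces: 'fine' has a small protrusion that the coarse,
# primitive-level approximation does not represent.
fine = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, 0.2)]
coarse = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]

error = chamfer_distance(fine, coarse)
print(error)
```

The coarse set matches the overall layout exactly, yet the distance is nonzero because the protrusion has no counterpart. That gap is the kind of fine detail a command-level representation can leave out.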