SceneScript

1 SceneScript: Structured Scene Language As Semantic Bridge

Primary source. SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model [1].

Local source. main.tex, sections/dataset.tex, and sections/structured_scene_language.tex.

Related ARIA-NBV pages. Project Aria, EFM3D/EVL, and target-aware research questions.

1.0.1 Core contribution

SceneScript represents indoor scenes as a structured language of geometric commands. Instead of producing only a dense mesh or point cloud, the model predicts an editable sequence of entities such as walls, doors, windows, and object bounding boxes [1].

For ARIA-NBV, the relevance is semantic structure rather than an immediate planner replacement. SceneScript shows that ASE-scale data, MPS-style point clouds, and typed scene primitives can support a compact global representation that might later guide target selection or semantic subgoals.

1.0.2 Verified paper signals

| signal | source-backed detail | ARIA-NBV relevance |
|---|---|---|
| Structured language | Scenes are encoded as commands such as make_wall, make_door, make_window, and make_bbox. | Useful stretch representation for named targets, layout regions, and semantic planning. |
| Input evidence | The model consumes point clouds from Project Aria / MPS-style reconstruction, discretized to a fixed spatial resolution. | Aligns with ARIA-NBV’s semi-dense observed-state premise. |
| Architecture | A sparse 3D encoder feeds an autoregressive Transformer decoder with grammar/type constraints. | Suggests a future sequence representation, not a required thesis component. |
| ASE scale | SceneScript uses large-scale ASE scenes and annotations. | Supports ASE as the semantic/global planning ecosystem around ARIA-NBV. |
| Editability | Structured commands can be inspected and edited more directly than opaque dense features. | Useful for human-facing target definitions and future semantic diagnostics. |
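The "discretized to a fixed spatial resolution" step in the Input evidence row can be illustrated with a minimal sketch. This is not SceneScript's preprocessing code: the function name `discretize` and the 0.05 m default voxel size are assumptions chosen for illustration, and the sketch only snaps points to an integer grid and removes duplicates.

```python
import math
from typing import Iterable

Point = tuple[float, float, float]
Voxel = tuple[int, int, int]


def discretize(points: Iterable[Point], voxel_size: float = 0.05) -> list[Voxel]:
    """Snap points to an integer voxel grid and drop duplicate voxels.

    voxel_size is in the same unit as the points (assumed metres here).
    """
    if voxel_size <= 0:
        raise ValueError("voxel_size must be positive")
    # floor() keeps negative coordinates in the correct cell (-0.1 -> -1, not 0).
    occupied = {
        (
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        )
        for x, y, z in points
    }
    return sorted(occupied)
```

A sparse 3D encoder would then operate on the occupied voxel indices only, which is why a semi-dense point cloud is a natural input for this kind of model.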

Example command types:

| command | role |
|---|---|
| make_wall | architectural wall primitive |
| make_door / make_window | openings attached to walls |
| make_bbox | object-level bounding box/entity |
| make_prim | extensible primitive command in later variants |
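To make the idea of an editable, typed command sequence concrete, here is a minimal parsing sketch. The command names come from the table above; everything else is hypothetical. In particular, the comma-separated `key=value` text layout, the parameter names in the usage note, and the `Command` / `parse_command` helpers are assumptions for illustration, not the paper's serialization format or API.

```python
from dataclasses import dataclass
from typing import Dict, Union

# Command names taken from the table above.
KNOWN_COMMANDS = {"make_wall", "make_door", "make_window", "make_bbox", "make_prim"}

ParamValue = Union[float, str]


@dataclass
class Command:
    """One typed scene entity: a command name plus named parameters."""
    name: str
    params: Dict[str, ParamValue]


def parse_command(line: str) -> Command:
    """Parse a hypothetical 'name, key=value, ...' line into a Command."""
    head, *rest = [part.strip() for part in line.split(",")]
    if head not in KNOWN_COMMANDS:
        raise ValueError(f"unknown command: {head!r}")
    params: Dict[str, ParamValue] = {}
    for item in rest:
        key, sep, raw = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        try:
            params[key.strip()] = float(raw)  # numeric parameter
        except ValueError:
            params[key.strip()] = raw.strip()  # symbolic parameter, e.g. a class label
    return Command(head, params)
```

For example, `parse_command("make_bbox, id=3, class=sofa, scale_x=1.5")` yields a `Command` named `make_bbox` whose parameters can be inspected or edited as ordinary dictionary entries, which is the property the Editability row refers to.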

1.0.3 ARIA-NBV adoption

  • Stretch/bridge: SceneScript is a semantic/global representation layer to adopt only after observed target selection, target RRI labels, finite-candidate rollouts, and Q_H training are stable.
  • Proposal/diagnostic: typed scene entities can help organize reports by walls, objects, doors, or rooms.
  • Future target interface: structured commands could become a human-readable target source, but they must be matched to observed or predicted oriented bounding boxes (OBBs), or to support evidence, before becoming actor-visible.
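The matching requirement in the last bullet could be prototyped as an overlap gate. The sketch below is an assumption-laden simplification: it treats boxes as axis-aligned (real OBBs carry a rotation), and the names `iou`, `match_to_observed`, and the 0.25 threshold are hypothetical choices, not part of ARIA-NBV or SceneScript.

```python
from typing import Optional, Sequence

Vec3 = tuple[float, float, float]
Box = tuple[Vec3, Vec3]  # (min corner, max corner), axis-aligned simplification


def volume(box: Box) -> float:
    """Volume of a box; zero if the box is empty or inverted on any axis."""
    lo, hi = box
    v = 1.0
    for a, b in zip(lo, hi):
        v *= max(0.0, b - a)
    return v


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    lo = tuple(max(a[0][i], b[0][i]) for i in range(3))
    hi = tuple(min(a[1][i], b[1][i]) for i in range(3))
    inter = volume((lo, hi))
    union = volume(a) + volume(b) - inter
    return inter / union if union > 0 else 0.0


def match_to_observed(
    commanded: Box, observed: Sequence[Box], min_iou: float = 0.25
) -> Optional[int]:
    """Return the index of the best-overlapping observed box, or None.

    A commanded target with no sufficiently overlapping observation
    stays hidden from the actor.
    """
    best_idx, best_score = None, min_iou
    for idx, box in enumerate(observed):
        score = iou(commanded, box)
        if score >= best_score:
            best_idx, best_score = idx, score
    return best_idx
```

Returning `None` rather than the nearest box is deliberate in this sketch: it keeps an unmatched language-level entity from leaking into actor-visible inputs.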

1.0.4 Do not adopt

  • Do not make SceneScript a thesis-core dependency.
  • Do not use GT semantic commands or GT entity layouts as actor-visible inputs in the main ARIA-NBV protocol.
  • Do not assume high-level layout correctness is enough for fine-detail target-RRI supervision.
  • Do not replace geometric validity, candidate masks, or mesh-supervised RRI with language-level scene completeness.

1.0.5 Open risks / caveats

  • Structured language can miss fine geometry that matters for Chamfer-style reconstruction quality.
  • Command vocabularies and discretization choices constrain what can be represented.
  • SceneScript is valuable as a future semantic/global planning bridge, but the current thesis must first prove target-conditioned quality-driven NBV on ASE/EFM evidence.
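The first caveat can be made tangible with a small Chamfer-style check. This is a brute-force sketch under stated assumptions: it uses the mean nearest-neighbour Euclidean distance in each direction and sums the two (conventions differ on squaring and averaging), and the point sets are toy data, not ASE geometry.

```python
import math
from typing import Sequence

Point = tuple[float, float, float]


def one_sided(src: Sequence[Point], dst: Sequence[Point]) -> float:
    """Mean distance from each point in src to its nearest point in dst."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)


def chamfer(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer-style distance (sum of both one-sided means)."""
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    return one_sided(a, b) + one_sided(b, a)


# Toy data: a coarse wall-like primitive versus the same surface with one
# extra point standing in for fine detail the primitive does not capture.
coarse = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
detailed = coarse + [(0.5, 0.3, 0.5)]

if __name__ == "__main__":
    print(chamfer(coarse, coarse))    # identical sets give zero
    print(chamfer(coarse, detailed))  # the missing detail gives a non-zero error
```

Every point of the coarse primitive lies on the detailed surface, yet the distance is non-zero because the detail point has no nearby counterpart. A layout that is correct at the command level can therefore still score poorly on a Chamfer-style metric.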

References

[1] A. Avetisyan et al., “SceneScript: Reconstructing scenes with an autoregressive structured language model.” 2024. Available: https://arxiv.org/abs/2403.13064