SceneScript

1 SceneScript: Structured Scene Language As Semantic Bridge

Primary source. SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model [1].

Local source. main.tex, sections/dataset.tex, and sections/structured_scene_language.tex.

Related ARIA-NBV pages. Project Aria, EFM3D/EVL, and target-aware research questions.

1.0.1 Core contribution

SceneScript represents indoor scenes as a structured language of geometric commands. Instead of producing only a dense mesh or point cloud, the model predicts an editable sequence of entities such as walls, doors, windows, and object bounding boxes [1].

For ARIA-NBV, the relevance is semantic structure rather than an immediate replacement for the planner. SceneScript shows that Aria Synthetic Environments (ASE) scale data, Machine Perception Services (MPS) style point clouds, and typed scene primitives can support a compact global representation that might later guide target selection or semantic subgoals.

1.0.2 Verified paper signals

signal	source-backed detail	ARIA-NBV relevance
Structured language	Scenes are encoded as commands such as `make_wall`, `make_door`, `make_window`, and `make_bbox`.	Useful stretch representation for named targets, layout regions, and semantic planning.
Input evidence	The model consumes point clouds from Project Aria / MPS-style reconstruction, discretized to a fixed spatial resolution.	Aligns with ARIA-NBV’s semi-dense observed-state premise.
Architecture	A sparse 3D encoder feeds an autoregressive Transformer decoder with grammar/type constraints.	Suggests a future sequence representation, not a required thesis component.
ASE scale	SceneScript uses large-scale ASE scenes and annotations.	Supports ASE as the semantic/global planning ecosystem around ARIA-NBV.
Editability	Structured commands can be inspected and edited more directly than opaque dense features.	Useful for human-facing target definitions and future semantic diagnostics.
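The "Input evidence" row mentions discretizing point clouds to a fixed spatial resolution. A minimal sketch of that idea, assuming simple floor-based voxel binning; the function name `voxelize` and the 0.05 m resolution are illustrative assumptions and not SceneScript's actual preprocessing:

```python
import math


def voxelize(points, resolution):
    """Map 3D points to the sorted set of occupied integer voxel indices.

    Hypothetical helper: each coordinate is divided by the voxel size and
    floored, so all points inside one voxel collapse to one index.
    """
    if resolution <= 0:
        raise ValueError("resolution must be positive")
    voxels = {tuple(math.floor(c / resolution) for c in p) for p in points}
    return sorted(voxels)


# Two nearby points share a voxel; the third lands in a different one.
cloud = [(0.01, 0.02, 0.03), (0.04, 0.01, 0.02), (0.26, 0.0, 0.0)]
occupied = voxelize(cloud, resolution=0.05)  # 0.05 m is an assumed value
```

Binning like this bounds the input size by the number of occupied voxels rather than the number of raw points, which is one reason a fixed resolution pairs naturally with a sparse 3D encoder.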

Example command types:

command	role
`make_wall`	architectural wall primitive
`make_door` / `make_window`	openings attached to walls
`make_bbox`	object-level bounding box/entity
`make_prim`	extensible primitive command in later variants
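To illustrate why typed commands are easy to inspect and edit, the sketch below parses command lines into small records and groups them by type. The comma-separated `key=value` layout and the parameter names (`id`, `height`, `class`) are assumptions made for this example; only the command names come from the table above.

```python
from dataclasses import dataclass, field


@dataclass
class SceneCommand:
    """One parsed command: a type such as make_wall plus its parameters."""
    name: str
    params: dict = field(default_factory=dict)


def parse_command(line):
    """Parse 'make_wall, id=0, height=2.7' into a SceneCommand.

    Assumed layout: command name first, then comma-separated key=value pairs.
    Values that look numeric become floats; everything else stays a string.
    """
    parts = [p.strip() for p in line.split(",") if p.strip()]
    if not parts:
        raise ValueError("empty command line")
    name, params = parts[0], {}
    for part in parts[1:]:
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {part!r}")
        key, value = key.strip(), value.strip()
        try:
            params[key] = float(value)
        except ValueError:
            params[key] = value
    return SceneCommand(name, params)


def group_by_type(lines):
    """Group parsed commands by command name, e.g. all walls together."""
    grouped = {}
    for line in lines:
        cmd = parse_command(line)
        grouped.setdefault(cmd.name, []).append(cmd)
    return grouped


scene = group_by_type([
    "make_wall, id=0, height=2.7",
    "make_bbox, id=1, class=chair",
    "make_wall, id=2, height=2.7",
])
```

Grouping by type is the kind of operation that would let reports be organized by walls, objects, or doors without touching dense features.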

1.0.3 ARIA-NBV adoption

Stretch/bridge: treat SceneScript as a semantic/global representation layer to add only once observed target selection, target RRI labels, finite-candidate rollouts, and Q_H (finite-horizon Q-function) training are stable.
Proposal/diagnostic: typed scene entities can help organize reports by walls, objects, doors, or rooms.
Future target interface: structured commands could become a human-readable target source, but each command must be matched to an observed or predicted oriented bounding box (OBB), or to other support evidence, before it becomes actor-visible.
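The matching requirement in the last item can be sketched as an overlap gate: a command-derived box becomes a candidate target only if some observed box supports it. This is a simplified illustration that assumes axis-aligned boxes instead of true oriented boxes; the names `aabb_iou` and `supported_targets` and the 0.25 threshold are hypothetical, not part of the ARIA-NBV protocol.

```python
import math


def aabb_volume(box):
    """Volume of an axis-aligned box given as (min_xyz, max_xyz)."""
    return math.prod(hi - lo for lo, hi in zip(box[0], box[1]))


def aabb_iou(a, b):
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for lo_a, hi_a, lo_b, hi_b in zip(a[0], a[1], b[0], b[1]):
        overlap = min(hi_a, hi_b) - max(lo_a, lo_b)
        if overlap <= 0:
            return 0.0  # disjoint along this axis
        inter *= overlap
    union = aabb_volume(a) + aabb_volume(b) - inter
    return inter / union if union > 0 else 0.0


def supported_targets(command_boxes, observed_boxes, min_iou=0.25):
    """Keep only command entities whose box overlaps some observed box.

    Returns a dict mapping entity name to its best IoU with observed evidence.
    """
    kept = {}
    for name, box in command_boxes.items():
        best = max((aabb_iou(box, obs) for obs in observed_boxes), default=0.0)
        if best >= min_iou:
            kept[name] = best
    return kept


commands = {
    "chair": ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0)),
    "table": ((5.0, 5.0, 5.0), (6.0, 6.0, 6.0)),  # no observed support
}
observed = [((0.5, 0.0, 0.0), (1.5, 1.0, 1.0))]
visible = supported_targets(commands, observed)
```

Here only the chair survives the gate, since the table has no overlapping observed evidence. A real implementation would need a proper oriented-box overlap test.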

1.0.4 Do not adopt

Do not make SceneScript a thesis-core dependency.
Do not use ground-truth (GT) semantic commands or GT entity layouts as actor-visible inputs in the main ARIA-NBV protocol.
Do not assume high-level layout correctness is enough for fine-detail target-RRI supervision.
Do not replace geometric validity, candidate masks, or mesh-supervised RRI with language-level scene completeness.

1.0.5 Open risks / caveats

Structured language can miss fine geometry that matters for Chamfer-style reconstruction quality.
Command vocabularies and discretization choices constrain what can be represented.
SceneScript is valuable as a future semantic/global planning bridge, but the current thesis must first prove target-conditioned quality-driven NBV on ASE/EFM evidence.
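The first caveat can be made concrete with a Chamfer-style distance between point sets. The sketch below uses one common variant (mean nearest-neighbor Euclidean distance in both directions, summed); other definitions use squared distances or different normalization, and this brute-force version is an illustration rather than the metric implementation used in the thesis.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two non-empty 3D point lists.

    Brute force O(len(a) * len(b)); fine for small illustrative inputs.
    """
    if not a or not b:
        raise ValueError("both point sets must be non-empty")

    def one_way(src, dst):
        # Mean distance from each source point to its nearest target point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


# A coarse proxy (one point) versus geometry with an extra detail point.
coarse = [(0.0, 0.0, 0.0)]
detailed = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
gap = chamfer_distance(coarse, detailed)
```

The coarse set covers one detailed point exactly, yet the distance stays nonzero because the extra detail point has no nearby counterpart. A layout-level primitive can be correct in the same sense and still leave residual error on fine geometry.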

References

[1] A. Avetisyan et al., “SceneScript: Reconstructing scenes with an autoregressive structured language model,” 2024. Available: https://arxiv.org/abs/2403.13064
