1 SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model

SceneScript (Avetisyan et al., 2024) represents a paradigm shift in scene understanding by using structured language to describe complete 3D layouts.

1.2 Key Contributions

  • Structured Scene Language: Represents scenes as sequences of language-like commands (walls, doors, windows, objects)
  • Autoregressive Transformer: Generates scene descriptions token-by-token, enabling streaming and partial updates
  • Aria Synthetic Environments (ASE): Introduces 100K synthetic indoor scenes for training
  • Entity-Level Representation: Provides explicit geometric primitives (hulls, bounding boxes) for architectural elements
Figure 1: SceneScript Commands

1.2.0.1 Scene Representation

The SceneScript language encodes a scene as a sequence of parameterized commands, for example:

```plaintext
make_wall, id=0, a_x=-2.56, a_y=6.16, a_z=0.0, b_x=5.07, b_y=6.16, b_z=0.0, height=3.26, thickness=0.0
make_door, id=1000, wall0_id=2, wall1_id=4, position_x=-1.51, position_y=1.84, position_z=1.01, width=1.82, height=2.02
make_window, id=2000, wall0_id=0, position_x=4.45, position_y=6.16, position_z=1.64, width=1.01, height=2.12
```

SceneScript Language Format
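
For concreteness, a minimal parser for this command format could look like the sketch below; parse_command and its type-inference rule are illustrative assumptions, not part of the released SceneScript code.

```python
def parse_command(line: str) -> dict:
    """Parse one SceneScript command, e.g. 'make_wall, id=0, a_x=-2.56, ...',
    into a dict of typed parameters (illustrative sketch)."""
    def parse_value(key: str, value: str):
        if key == "id" or key.endswith("_id"):
            return int(value)        # identifiers and wall references are integers
        try:
            return float(value)      # geometric parameters are floats (meters/radians)
        except ValueError:
            return value             # e.g. the 'class' field of make_bbox stays a string

    parts = [p.strip() for p in line.split(",")]
    entity = {"command": parts[0]}
    for kv in parts[1:]:
        key, value = kv.split("=")
        entity[key] = parse_value(key, value)
    return entity
```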

1.2.0.2 Architecture

SceneScript uses a two-stage architecture combining geometric encoding with autoregressive structured language generation:

1.2.0.2.1 Sparse 3D ResNet Encoder

Input Processing:

  • Point Cloud Discretization: Input point clouds are discretized at 5 cm resolution
  • Sparse Representation: Uses torchsparse for efficient handling of sparse 3D data
  • Preprocessing: Points are normalized and voxelized

Architecture:

  • ResNet-Style Encoder: Sparse 3D ResNet built from sparse convolutions
  • Downsampling: 5 down-convolutional layers with kernel size 3 and stride 2
  • Point Reduction: Reduces the number of active sites by roughly 1000×
  • Parameters: ~20M optimizable parameters

Output: The sparse feature tensor is converted to a sequence of features \(\mathbf{F} \in \mathbb{R}^{N \times d_{model}}\), where \(N\) is the number of non-empty voxels (much smaller than the input point count)

Coordinate Encoding:

  • Active-site coordinates are appended to the feature vectors: \(\mathbf{f}_i \leftarrow \text{cat}(\mathbf{f}_i, \mathbf{c}_i)\)
  • Features are sorted lexicographically by coordinate
  • This provides positional information for the transformer decoder
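
The preprocessing and coordinate-encoding steps can be sketched in plain numpy as follows; this is a simplified stand-in for the torchsparse-based pipeline, with mean-pooling per voxel as an assumption.

```python
import numpy as np

def voxelize_and_sort(points: np.ndarray, feats: np.ndarray, res: float = 0.05):
    """Discretize points to 5 cm voxels, mean-pool features per voxel,
    append the voxel coordinates, and sort lexicographically (sketch)."""
    coords = np.floor(points / res).astype(np.int64)                # (N_p, 3) voxel indices
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)  # active sites
    pooled = np.zeros((len(uniq), feats.shape[1]))
    np.add.at(pooled, inverse, feats)
    pooled /= np.bincount(inverse, minlength=len(uniq))[:, None]    # mean-pool per voxel
    # f_i <- cat(f_i, c_i): append active-site coordinates to the features
    pooled = np.concatenate([pooled, uniq.astype(pooled.dtype)], axis=1)
    # Lexicographic sort by (x, y, z); np.lexsort treats the last key as primary
    order = np.lexsort((uniq[:, 2], uniq[:, 1], uniq[:, 0]))
    return pooled[order], uniq[order]

# Per the paper, the XYZ coordinates themselves serve as input features:
# features, coords = voxelize_and_sort(points, points)
```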

1.2.0.2.2 Transformer Decoder

Token Types: SceneScript uses typed tokens, where each token has both a value and a type:

  • Value Tokens: Discretized parameters (e.g., a wall corner position)
  • Type Tokens: Indicate the semantic role (e.g., MAKE_WALL_A_X, COMMAND, PART, STOP)

Embeddings: Each token embedding is the sum of its position, value, and type embeddings:

    embedding = position_emb + value_emb + type_emb
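
As a sketch, this can be implemented with three learned embedding tables summed together; the number of token types below is an assumption for illustration, while the other sizes follow the figures reported later in this section.

```python
import torch
import torch.nn as nn

class TokenEmbedding(nn.Module):
    """position_emb + value_emb + type_emb (illustrative sketch)."""
    def __init__(self, n_values: int = 2048, n_types: int = 64,
                 max_len: int = 2048, d_model: int = 512):
        super().__init__()
        self.value_emb = nn.Embedding(n_values, d_model)
        self.type_emb = nn.Embedding(n_types, d_model)   # n_types is an assumed size
        self.pos_emb = nn.Embedding(max_len, d_model)

    def forward(self, values: torch.Tensor, types: torch.Tensor) -> torch.Tensor:
        # values, types: (batch, seq_len) integer tensors
        positions = torch.arange(values.size(1), device=values.device)
        return self.value_emb(values) + self.type_emb(types) + self.pos_emb(positions)
```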


Autoregressive Generation:

  1. Start with the <START> token
  2. For each timestep \(t\):
     • Feed the sequence \(\{\text{token}_1, ..., \text{token}_t\}\) to the decoder
     • Attend to the encoded point cloud features (cross-attention)
     • Predict the next token value via a softmax over the \(N_{bins} + 6\) vocabulary entries (value bins plus special tokens)
     • Decode the type of the next token from the grammar rules
  3. Generate until the <STOP> token is produced

Type-Guided Decoding:

  • Entity generation follows a strict grammar:

    <PART> → <CMD> → param_1 → param_2 → … → param_n → <PART>

  • The type token determines which parameters are valid next (e.g., after make_wall, expect a_x, a_y, …)
  • This enforces structural consistency during generation
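
Putting the generation steps and the grammar constraint together, the decoding loop looks roughly like the sketch below; `model` and `grammar` are placeholder interfaces, and greedy argmax stands in for the nucleus sampling used at inference.

```python
import torch

def generate(model, encoder_feats, grammar, max_len: int = 2048):
    """Grammar-guided autoregressive decoding (illustrative sketch).
    model(tokens, types, encoder_feats) -> logits over the token vocabulary;
    grammar.next_type(types) -> type of the next token (e.g. MAKE_WALL_A_X after MAKE_WALL)."""
    tokens, types = [grammar.START], [grammar.TYPE_START]
    while len(tokens) < max_len:
        logits = model(torch.tensor([tokens]), torch.tensor([types]), encoder_feats)
        next_token = int(logits[0, -1].argmax())       # greedy; nucleus sampling adds diversity
        if next_token == grammar.STOP:
            break
        tokens.append(next_token)
        types.append(grammar.next_type(types))         # the type comes from the grammar, not the model
    return tokens, types
```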

Decoder Details:

  • 8 transformer decoder layers (fixed architecture)
  • 8 attention heads for multi-head attention
  • Feature dimension: \(d_{model} = 512\)
  • Parameters: ~35M optimizable parameters
  • Vocabulary size: 2048 tokens (also max sequence length)
  • Causal self-attention mask for autoregressive generation
  • Cross-attention to encoded point cloud features
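
For reference, a decoder with these hyperparameters could be instantiated in PyTorch roughly as below; this is an illustrative configuration, not the authors' implementation.

```python
import torch.nn as nn

# 8 layers, 8 heads, d_model = 512; the causal self-attention mask and the
# cross-attention to the encoder's point-cloud features are applied in the forward pass.
decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=8)
```
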
Figure 2: SceneScript Diagram

1.2.0.3 Entity Types and Parameters

SceneScript defines four primitive types with specific parameter sets:

1.2.0.3.1 Wall Entity
PARAMS = {
    'id': int,                                  # Unique identifier
    'a_x': float, 'a_y': float, 'a_z': float,   # Corner A (meters)
    'b_x': float, 'b_y': float, 'b_z': float,   # Corner B (meters)
    'height': float,                            # Wall height (meters)
    'thickness': float,                         # Always 0.0 (legacy)
}
  • Defines 3D line segment with vertical extrusion
  • Implicitly defines floor/ceiling at z=0 and z=height
1.2.0.3.2 Door Entity
PARAMS = {
    'id': int,
    'wall0_id': int,                            # Parent wall reference
    'wall1_id': int,                            # (duplicate for legacy)
    'position_x': float, 'position_y': float, 'position_z': float,  # Center
    'width': float,                             # Door width
    'height': float,                            # Door height
}
  • Attached to parent wall
  • Oriented parallel to wall
1.2.0.3.3 Window Entity
  • Inherits same parameters as Door
  • Differentiated only by command type
1.2.0.3.4 Bbox Entity (Objects)
PARAMS = {
    'id': int,
    'class': str,                               # Object category (e.g., 'chair', 'table')
    'position_x': float, 'position_y': float, 'position_z': float,  # Center
    'angle_z': float,                           # Rotation around Z-axis (radians)
    'scale_x': float, 'scale_y': float, 'scale_z': float,           # Bounding box size
}
  • Represents furniture and objects
  • Oriented bounding box with Z-axis rotation
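
To make the geometric interpretation concrete, a wall entity can be expanded into the four corners of its vertical quad; the helper below is an illustrative sketch, not part of SceneScript itself.

```python
import numpy as np

def wall_corners(wall: dict) -> np.ndarray:
    """Four corners of the vertical quad defined by a make_wall entity."""
    a = np.array([wall['a_x'], wall['a_y'], wall['a_z']])
    b = np.array([wall['b_x'], wall['b_y'], wall['b_z']])
    up = np.array([0.0, 0.0, wall['height']])
    return np.stack([a, b, b + up, a + up])   # bottom edge a -> b, then top edge
```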

1.2.0.4 Input/Output Formulation

Input:

  • Point Cloud: \(\mathbf{P} \in \mathbb{R}^{N_p \times 3}\) from MPS semi-dense SLAM
  • Discretization: Points discretized to 5cm resolution
  • Features: XYZ coordinates serve as input features

Output:

  • Token Sequence: \(\{\text{tok}_1, \text{tok}_2, ..., \text{tok}_T\}\) where \(T \leq T_{max} = 2048\)

  • Structured Language: Each entity encoded as:

    <PART> <CMD> <param_1> <param_2> ... <param_n> <PART> ...
  • Example:

    <PART> make_wall 42 156 200 0 198 312 200 0 255 0 <PART>

    (Discretized integer tokens, later undiscretized to continuous floats)

Tokenization:

  • Integer parameters: \(t = \text{int}(x)\)
  • Float parameters: \(t = \text{round}(x / \text{res})\) where res = 5cm resolution
  • Vocabulary: 2048 tokens maximum
  • Not BPE-based (unlike typical NLP tokenizers); a custom uniform discretization scheme is used instead
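
A minimal sketch of this discretization, following the formulas above; the assumption that scenes are first translated to non-negative coordinates is only implied by the post-processing steps.

```python
RES = 0.05  # 5 cm resolution

def tokenize(value: float) -> int:
    # Each float parameter maps to a 5 cm bin: t = round(x / res)
    # (scenes are assumed to be translated to non-negative coordinates first).
    return int(round(value / RES))

def untokenize(token: int) -> float:
    # Post-processing: undiscretize back to continuous values.
    return token * RES
```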

Post-Processing:

  1. Parse token sequence into entities
  2. Undiscretize parameters back to continuous values
  3. Assign doors/windows to nearest wall
  4. Translate back to original coordinate frame
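
Step 3 can be sketched as a point-to-segment distance test; this is an illustrative implementation of "nearest wall", not the paper's exact assignment rule.

```python
import numpy as np

def nearest_wall_id(position: np.ndarray, walls: list) -> int:
    """Return the id of the wall whose base segment (a -> b) is closest to `position`."""
    def distance(w):
        a = np.array([w['a_x'], w['a_y'], w['a_z']])
        b = np.array([w['b_x'], w['b_y'], w['b_z']])
        ab = b - a
        t = np.clip(np.dot(position - a, ab) / np.dot(ab, ab), 0.0, 1.0)
        return np.linalg.norm(position - (a + t * ab))
    return min(walls, key=distance)['id']
```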

1.2.0.5 Training Methodology

Dataset: 100K ASE synthetic scenes

  • Train/Val/Test: 80K / 10K / 10K split
  • Augmentation: Random rotations, translations, point subsampling

Loss Function:

  • Cross-Entropy: On discretized token predictions
  • Type-Aware: Separate losses for each token type
  • Weighted: Higher weight on command tokens vs parameters

Training Strategy:

  • Teacher Forcing: During training, feed ground truth previous tokens
  • Nucleus Sampling: At inference, use nucleus sampling (top-p) for diversity
  • Greedy Decoding: Quantitative results decoded greedily
  • Augmentation: Random z-axis rotation (360°), random point subsampling (up to 500K points)

Optimization:

  • Optimizer: AdamW with learning rate 1e-4 (1e-3 for the image-only encoder variant)
  • Batch Size: 64 scenes (effective batch size, distributed across multiple nodes)
  • Training Time: ~3-4 days (hardware not specified in paper)
  • Convergence: ~200K iterations
  • Loss: Standard cross-entropy on next token prediction
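
A condensed sketch of one teacher-forced training step under these settings; batching, distributed training, and any type-aware loss weighting are omitted, and the tensor layout is an assumption.

```python
import torch.nn.functional as F

def training_step(model, optimizer, batch):
    """One teacher-forced step: predict token t+1 from ground-truth tokens 1..t."""
    points, tokens, types = batch                            # tokens/types: (B, T) integer tensors
    logits = model(points, tokens[:, :-1], types[:, :-1])    # (B, T-1, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))        # next-token cross-entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```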

1.2.1 Relevance to Our Work

SceneScript provides a semantic backbone for NBV planning:

  1. Entity-Level Primitives: Explicit geometric hulls enable per-entity RRI computation
  2. Streaming Capability: Autoregressive generation allows incremental scene updates as more views are acquired
  3. Structured Representation: Entity parameters provide targets for reconstruction quality metrics
  4. Semantic Awareness: Different entity types (walls vs doors vs objects) can be weighted differently for NBV

Key Insight: SceneScript’s structured representation enables entity-aware NBV planning where we can:

  • Compute RRI separately for each wall, door, or object
  • Prioritize views that improve reconstruction of specific entities
  • Integrate user-specified importance weights per entity type
  • Track reconstruction completeness at entity-level granularity

NBV Integration Strategy:

  1. Run SceneScript on partial point cloud → get predicted entities
  2. Compare predicted vs (incrementally estimated) ground truth entities
  3. Compute per-entity reconstruction error
  4. Candidate view RRI = expected reduction in entity errors visible from that view
  5. Select view maximizing weighted sum of entity-level RRI
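
A high-level sketch of this strategy; every helper here (scenescript_predict, entity_error, reference_estimate, visible_entities, expected_error_reduction) is a placeholder for a component of our planned pipeline, not an existing API.

```python
def select_next_best_view(point_cloud, candidate_views, class_weights):
    """Entity-aware NBV selection following steps 1-5 above (placeholder helpers)."""
    entities = scenescript_predict(point_cloud)                    # step 1
    errors = {e['id']: entity_error(e, reference_estimate(e))      # steps 2-3
              for e in entities}

    def view_rri(view):                                            # step 4
        return sum(class_weights.get(e.get('class', 'layout'), 1.0)
                   * expected_error_reduction(view, e, errors[e['id']])
                   for e in visible_entities(view, entities))

    return max(candidate_views, key=view_rri)                      # step 5
```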

In short, beyond the entity-level capabilities listed above, SceneScript could also serve as a semantic encoder backbone for NBV prediction.

1.2.2 Human-in-the-Loop Local Corrections of 3D Scene Layouts via Infilling

This recent work [1] extends SceneScript with interactive refinement capabilities, enabling human-in-the-loop corrections.

1.2.2.1 Key Contributions

  • Multi-Task SceneScript: Jointly trains global scene prediction and local infilling tasks
  • Interactive Refinement: Users can click erroneous regions for localized re-generation
  • Improved Local Accuracy: Significantly better reconstruction quality in user-specified regions
  • Beyond Training Distribution: Enables layouts that diverge from training data through iterative refinement

1.2.2.2 Human-in-the-Loop Workflow

  1. Initial Prediction: Generate full scene layout from point cloud
  2. User Feedback: Identify regions requiring correction
  3. Local Infilling: Re-generate only the problematic area while maintaining global consistency
  4. Iterative Refinement: Repeat until satisfactory

1.2.2.3 Relevance to Our Work

This work demonstrates:

  • Interactive Scene Understanding: Users can specify regions/entities of interest
  • Partial Updates: Ability to refine specific parts without full re-processing
  • Quality-Aware Refinement: System can focus computational resources where needed

Application to NBV: We can extend this paradigm to NBV planning where users select entities requiring higher reconstruction quality, and the system prioritizes views that improve those specific regions.

References

[1]
C. Xie et al., “Human-in-the-loop local corrections of 3D scene layouts via infilling.” 2025. Available: https://arxiv.org/abs/2503.11806