1 SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model

SceneScript (Avetisyan et al., 2024) represents a paradigm shift in scene understanding by using structured language to describe complete 3D layouts.

1.2 Key Contributions

  • Structured Scene Language: Represents scenes as sequences of language-like commands (walls, doors, windows, objects)
  • Autoregressive Transformer: Generates scene descriptions token-by-token, enabling streaming and partial updates
  • Aria Synthetic Environments (ASE): Introduces 100K synthetic indoor scenes for training
  • Entity-Level Representation: Provides explicit geometric primitives (hulls, bounding boxes) for architectural elements
Figure 1: SceneScript Commands

1.2.0.1 Scene Representation

The SceneScript language encodes a scene as a sequence of parameterized commands, for example:

```plaintext
make_wall, id=0, a_x=-2.56, a_y=6.16, a_z=0.0, b_x=5.07, b_y=6.16, b_z=0.0, height=3.26, thickness=0.0
make_door, id=1000, wall0_id=2, wall1_id=4, position_x=-1.51, position_y=1.84, position_z=1.01, width=1.82, height=2.02
make_window, id=2000, wall0_id=0, position_x=4.45, position_y=6.16, position_z=1.64, width=1.01, height=2.12
```

SceneScript Language Format
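
For concreteness, a minimal parser for this command format could look like the sketch below; parse_command and its type-inference rule are illustrative assumptions, not part of the released SceneScript code.

```python
def parse_command(line: str) -> dict:
    """Parse one SceneScript command, e.g. 'make_wall, id=0, a_x=-2.56, ...',
    into a dict of typed parameters (illustrative sketch)."""
    def parse_value(key: str, value: str):
        if key == "id" or key.endswith("_id"):
            return int(value)        # identifiers and wall references are integers
        try:
            return float(value)      # geometric parameters are floats (meters/radians)
        except ValueError:
            return value             # e.g. the 'class' field of make_bbox stays a string

    parts = [p.strip() for p in line.split(",")]
    entity = {"command": parts[0]}
    for kv in parts[1:]:
        key, value = kv.split("=")
        entity[key] = parse_value(key, value)
    return entity
```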

1.2.0.2 Architecture

SceneScript uses a two-stage architecture combining geometric encoding with autoregressive structured language generation:

1.2.0.2.1 Sparse 3D ResNet Encoder

Input Processing:

  • Point Cloud Discretization: Input point clouds are discretized at 5 cm resolution
  • Sparse Representation: Uses torchsparse for efficient handling of sparse 3D data
  • Preprocessing: Points are normalized and voxelized

Architecture:

  • ResNet-Style Encoder: Sparse 3D ResNet built from sparse convolutions
  • Downsampling: 5 down-convolutional layers with kernel size 3 and stride 2
  • Point Reduction: Reduces the number of active sites by roughly 1000×
  • Parameters: ~20M optimizable parameters

Output: The sparse feature tensor is converted to a sequence of features \(\mathbf{F} \in \mathbb{R}^{N \times d_{model}}\), where \(N\) is the number of non-empty voxels (much smaller than the input point count)

Coordinate Encoding:

  • Active-site coordinates are appended to the feature vectors: \(\mathbf{f}_i \leftarrow \text{cat}(\mathbf{f}_i, \mathbf{c}_i)\)
  • Features are sorted lexicographically by coordinate
  • This provides positional information for the transformer decoder
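
The preprocessing and coordinate-encoding steps can be sketched in plain numpy as follows; this is a simplified stand-in for the torchsparse-based pipeline, with mean-pooling per voxel as an assumption.

```python
import numpy as np

def voxelize_and_sort(points: np.ndarray, feats: np.ndarray, res: float = 0.05):
    """Discretize points to 5 cm voxels, mean-pool features per voxel,
    append the voxel coordinates, and sort lexicographically (sketch)."""
    coords = np.floor(points / res).astype(np.int64)                # (N_p, 3) voxel indices
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)  # active sites
    pooled = np.zeros((len(uniq), feats.shape[1]))
    np.add.at(pooled, inverse, feats)
    pooled /= np.bincount(inverse, minlength=len(uniq))[:, None]    # mean-pool per voxel
    # f_i <- cat(f_i, c_i): append active-site coordinates to the features
    pooled = np.concatenate([pooled, uniq.astype(pooled.dtype)], axis=1)
    # Lexicographic sort by (x, y, z); np.lexsort treats the last key as primary
    order = np.lexsort((uniq[:, 2], uniq[:, 1], uniq[:, 0]))
    return pooled[order], uniq[order]

# Per the paper, the XYZ coordinates themselves serve as input features:
# features, coords = voxelize_and_sort(points, points)
```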

1.2.0.2.2 Transformer Decoder

Token Types: SceneScript uses typed tokens, where each token has both a value and a type:

  • Value Tokens: Discretized parameters (e.g., a wall corner position)
  • Type Tokens: Indicate the semantic role (e.g., MAKE_WALL_A_X, COMMAND, PART, STOP)

Embeddings: Each token embedding is the sum of its position, value, and type embeddings:

    embedding = position_emb + value_emb + type_emb
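
As a sketch, this can be implemented with three learned embedding tables summed together; the number of token types below is an assumption for illustration, while the other sizes follow the figures reported later in this section.

```python
import torch
import torch.nn as nn

class TokenEmbedding(nn.Module):
    """position_emb + value_emb + type_emb (illustrative sketch)."""
    def __init__(self, n_values: int = 2048, n_types: int = 64,
                 max_len: int = 2048, d_model: int = 512):
        super().__init__()
        self.value_emb = nn.Embedding(n_values, d_model)
        self.type_emb = nn.Embedding(n_types, d_model)   # n_types is an assumed size
        self.pos_emb = nn.Embedding(max_len, d_model)

    def forward(self, values: torch.Tensor, types: torch.Tensor) -> torch.Tensor:
        # values, types: (batch, seq_len) integer tensors
        positions = torch.arange(values.size(1), device=values.device)
        return self.value_emb(values) + self.type_emb(types) + self.pos_emb(positions)
```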


Autoregressive Generation:

  1. Start with the <START> token
  2. For each timestep \(t\):
     • Feed the sequence \(\{\text{token}_1, ..., \text{token}_t\}\) to the decoder
     • Attend to the encoded point cloud features (cross-attention)
     • Predict the next token value via a softmax over the \(N_{bins} + 6\) vocabulary entries (value bins plus special tokens)
     • Decode the type of the next token from the grammar rules
  3. Generate until the <STOP> token is produced

Type-Guided Decoding:

  • Entity generation follows a strict grammar:

    <PART> → <CMD> → param_1 → param_2 → … → param_n → <PART>

  • The type token determines which parameters are valid next (e.g., after make_wall, expect a_x, a_y, …)
  • This enforces structural consistency during generation
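
Putting the generation steps and the grammar constraint together, the decoding loop looks roughly like the sketch below; `model` and `grammar` are placeholder interfaces, and greedy argmax stands in for the nucleus sampling used at inference.

```python
import torch

def generate(model, encoder_feats, grammar, max_len: int = 2048):
    """Grammar-guided autoregressive decoding (illustrative sketch).
    model(tokens, types, encoder_feats) -> logits over the token vocabulary;
    grammar.next_type(types) -> type of the next token (e.g. MAKE_WALL_A_X after MAKE_WALL)."""
    tokens, types = [grammar.START], [grammar.TYPE_START]
    while len(tokens) < max_len:
        logits = model(torch.tensor([tokens]), torch.tensor([types]), encoder_feats)
        next_token = int(logits[0, -1].argmax())       # greedy; nucleus sampling adds diversity
        if next_token == grammar.STOP:
            break
        tokens.append(next_token)
        types.append(grammar.next_type(types))         # the type comes from the grammar, not the model
    return tokens, types
```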

Decoder Details:

  • 8 transformer decoder layers (fixed architecture)
  • 8 attention heads for multi-head attention
  • Feature dimension: \(d_{model} = 512\)
  • Parameters: ~35M optimizable parameters
  • Vocabulary size: 2048 tokens (also max sequence length)
  • Causal self-attention mask for autoregressive generation
  • Cross-attention to encoded point cloud features
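
For reference, a decoder with these hyperparameters could be instantiated in PyTorch roughly as below; this is an illustrative configuration, not the authors' implementation.

```python
import torch.nn as nn

# 8 layers, 8 heads, d_model = 512; the causal self-attention mask and the
# cross-attention to the encoder's point-cloud features are applied in the forward pass.
decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=8)
```
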
Figure 2: SceneScript Diagram

1.2.0.3 Entity Types and Parameters

SceneScript defines four primitive types with specific parameter sets:

1.2.0.3.1 Wall Entity
PARAMS = {
    'id': int,                                  # Unique identifier
    'a_x': float, 'a_y': float, 'a_z': float,   # Corner A (meters)
    'b_x': float, 'b_y': float, 'b_z': float,   # Corner B (meters)
    'height': float,                            # Wall height (meters)
    'thickness': float,                         # Always 0.0 (legacy)
}
  • Defines 3D line segment with vertical extrusion
  • Implicitly defines floor/ceiling at z=0 and z=height
1.2.0.3.2 Door Entity
PARAMS = {
    'id': int,
    'wall0_id': int,                            # Parent wall reference
    'wall1_id': int,                            # (duplicate for legacy)
    'position_x': float, 'position_y': float, 'position_z': float,  # Center
    'width': float,                             # Door width
    'height': float,                            # Door height
}
  • Attached to parent wall
  • Oriented parallel to wall
1.2.0.3.3 Window Entity
  • Inherits same parameters as Door
  • Differentiated only by command type
1.2.0.3.4 Bbox Entity (Objects)
PARAMS = {
    'id': int,
    'class': str,                               # Object category (e.g., 'chair', 'table')
    'position_x': float, 'position_y': float, 'position_z': float,  # Center
    'angle_z': float,                           # Rotation around Z-axis (radians)
    'scale_x': float, 'scale_y': float, 'scale_z': float,           # Bounding box size
}
  • Represents furniture and objects
  • Oriented bounding box with Z-axis rotation
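
To make the geometric interpretation concrete, a wall entity can be expanded into the four corners of its vertical quad; the helper below is an illustrative sketch, not part of SceneScript itself.

```python
import numpy as np

def wall_corners(wall: dict) -> np.ndarray:
    """Four corners of the vertical quad defined by a make_wall entity."""
    a = np.array([wall['a_x'], wall['a_y'], wall['a_z']])
    b = np.array([wall['b_x'], wall['b_y'], wall['b_z']])
    up = np.array([0.0, 0.0, wall['height']])
    return np.stack([a, b, b + up, a + up])   # bottom edge a -> b, then top edge
```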

1.2.0.4 Input/Output Formulation

Input:

  • Point Cloud: \(\mathbf{P} \in \mathbb{R}^{N_p \times 3}\) from MPS semi-dense SLAM
  • Discretization: Points discretized to 5cm resolution
  • Features: XYZ coordinates serve as input features

Output:

  • Token Sequence: \(\{\text{tok}_1, \text{tok}_2, ..., \text{tok}_T\}\) where \(T \leq T_{max} = 2048\)

  • Structured Language: Each entity encoded as:

    <PART> <CMD> <param_1> <param_2> ... <param_n> <PART> ...
  • Example:

    <PART> make_wall 42 156 200 0 198 312 200 0 255 0 <PART>

    (Discretized integer tokens, later undiscretized to continuous floats)

Tokenization:

  • Integer parameters: \(t = \text{int}(x)\)
  • Float parameters: \(t = \text{round}(x / \text{res})\) where res = 5cm resolution
  • Vocabulary: 2048 tokens maximum
  • Not BPE-based (unlike typical NLP tokenizers); a custom uniform discretization scheme is used instead
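
A minimal sketch of this discretization, following the formulas above; the assumption that scenes are first translated to non-negative coordinates is only implied by the post-processing steps.

```python
RES = 0.05  # 5 cm resolution

def tokenize(value: float) -> int:
    # Each float parameter maps to a 5 cm bin: t = round(x / res)
    # (scenes are assumed to be translated to non-negative coordinates first).
    return int(round(value / RES))

def untokenize(token: int) -> float:
    # Post-processing: undiscretize back to continuous values.
    return token * RES
```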

Post-Processing:

  1. Parse token sequence into entities
  2. Undiscretize parameters back to continuous values
  3. Assign doors/windows to nearest wall
  4. Translate back to original coordinate frame
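
Step 3 can be sketched as a point-to-segment distance test; this is an illustrative implementation of "nearest wall", not the paper's exact assignment rule.

```python
import numpy as np

def nearest_wall_id(position: np.ndarray, walls: list) -> int:
    """Return the id of the wall whose base segment (a -> b) is closest to `position`."""
    def distance(w):
        a = np.array([w['a_x'], w['a_y'], w['a_z']])
        b = np.array([w['b_x'], w['b_y'], w['b_z']])
        ab = b - a
        t = np.clip(np.dot(position - a, ab) / np.dot(ab, ab), 0.0, 1.0)
        return np.linalg.norm(position - (a + t * ab))
    return min(walls, key=distance)['id']
```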

1.2.0.5 Training Methodology

Dataset: 100K ASE synthetic scenes

  • Train/Val/Test: 80K / 10K / 10K split
  • Augmentation: Random rotations, translations, point subsampling

Loss Function:

  • Cross-Entropy: On discretized token predictions
  • Type-Aware: Separate losses for each token type
  • Weighted: Higher weight on command tokens vs parameters

Training Strategy:

  • Teacher Forcing: During training, feed ground truth previous tokens
  • Nucleus Sampling: At inference, use nucleus sampling (top-p) for diversity
  • Greedy Decoding: Quantitative results decoded greedily
  • Augmentation: Random z-axis rotation (360°), random point subsampling (up to 500K points)

Optimization:

  • Optimizer: AdamW with learning rate 1e-4 (1e-3 for the image-only encoder variant)
  • Batch Size: 64 scenes (effective batch size, distributed across multiple nodes)
  • Training Time: ~3-4 days (hardware not specified in paper)
  • Convergence: ~200K iterations
  • Loss: Standard cross-entropy on next token prediction
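
A condensed sketch of one teacher-forced training step under these settings; batching, distributed training, and any type-aware loss weighting are omitted, and the tensor layout is an assumption.

```python
import torch.nn.functional as F

def training_step(model, optimizer, batch):
    """One teacher-forced step: predict token t+1 from ground-truth tokens 1..t."""
    points, tokens, types = batch                            # tokens/types: (B, T) integer tensors
    logits = model(points, tokens[:, :-1], types[:, :-1])    # (B, T-1, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))        # next-token cross-entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```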

1.2.1 Relevance to Our Work

SceneScript provides a semantic backbone for NBV planning:

  1. Entity-Level Primitives: Explicit geometric hulls enable per-entity RRI computation
  2. Streaming Capability: Autoregressive generation allows incremental scene updates as more views are acquired
  3. Structured Representation: Entity parameters provide targets for reconstruction quality metrics
  4. Semantic Awareness: Different entity types (walls vs doors vs objects) can be weighted differently for NBV

Key Insight: SceneScript’s structured representation enables entity-aware NBV planning where we can:

  • Compute RRI separately for each wall, door, or object
  • Prioritize views that improve reconstruction of specific entities
  • Integrate user-specified importance weights per entity type
  • Track reconstruction completeness at entity-level granularity

NBV Integration Strategy:

  1. Run SceneScript on partial point cloud → get predicted entities
  2. Compare predicted vs (incrementally estimated) ground truth entities
  3. Compute per-entity reconstruction error
  4. Candidate view RRI = expected reduction in entity errors visible from that view
  5. Select view maximizing weighted sum of entity-level RRI
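
A high-level sketch of this strategy; every helper here (scenescript_predict, entity_error, reference_estimate, visible_entities, expected_error_reduction) is a placeholder for a component of our planned pipeline, not an existing API.

```python
def select_next_best_view(point_cloud, candidate_views, class_weights):
    """Entity-aware NBV selection following steps 1-5 above (placeholder helpers)."""
    entities = scenescript_predict(point_cloud)                    # step 1
    errors = {e['id']: entity_error(e, reference_estimate(e))      # steps 2-3
              for e in entities}

    def view_rri(view):                                            # step 4
        return sum(class_weights.get(e.get('class', 'layout'), 1.0)
                   * expected_error_reduction(view, e, errors[e['id']])
                   for e in visible_entities(view, entities))

    return max(candidate_views, key=view_rri)                      # step 5
```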

In short, beyond the entity-level capabilities listed above, SceneScript could also serve as a semantic encoder backbone for NBV prediction.

1.2.2 Human-in-the-Loop Local Corrections of 3D Scene Layouts via Infilling

This recent work [1] extends SceneScript with interactive refinement capabilities, enabling human-in-the-loop corrections.

1.2.2.1 Key Contributions

  • Multi-Task SceneScript: Jointly trains global scene prediction and local infilling tasks
  • Interactive Refinement: Users can click erroneous regions for localized re-generation
  • Improved Local Accuracy: Significantly better reconstruction quality in user-specified regions
  • Beyond Training Distribution: Enables layouts that diverge from training data through iterative refinement

1.2.2.2 Human-in-the-Loop Workflow

  1. Initial Prediction: Generate full scene layout from point cloud
  2. User Feedback: Identify regions requiring correction
  3. Local Infilling: Re-generate only the problematic area while maintaining global consistency
  4. Iterative Refinement: Repeat until satisfactory

1.2.2.3 Relevance to Our Work

This work demonstrates:

  • Interactive Scene Understanding: Users can specify regions/entities of interest
  • Partial Updates: Ability to refine specific parts without full re-processing
  • Quality-Aware Refinement: System can focus computational resources where needed

Application to NBV: We can extend this paradigm to NBV planning where users select entities requiring higher reconstruction quality, and the system prioritizes views that improve those specific regions.

References

[1]
C. Xie et al., “Human-in-the-loop local corrections of 3D scene layouts via infilling.” 2025. Available: https://arxiv.org/abs/2503.11806