1 SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model
SceneScript [@SceneScript-avetisyan2024] reframes scene understanding by describing complete 3D layouts as sequences of structured, language-like commands.
1.1 Local Sources
- `main.tex` – ECCV manuscript entry point.
- `sections/dataset.tex` – ASE dataset specification used for training SceneScript.
- `sections/structured_scene_language.tex` – command grammar & token definitions.
- `figs/Main-Figure.pdf` – canonical teaser referenced in this page.
- `supp.tex` – supplemental experiments and qualitative visualizations.
1.2 Key Contributions
- Structured Scene Language: Represents scenes as sequences of language-like commands (walls, doors, windows, objects)
- Autoregressive Transformer: Generates scene descriptions token-by-token, enabling streaming and partial updates
- Aria Synthetic Environments (ASE): Introduces 100K synthetic indoor scenes for training
- Entity-Level Representation: Provides explicit geometric primitives (hulls, bounding boxes) for architectural elements
1.2.0.1 Scene Representation
The SceneScript language encodes:
```plaintext
make_wall, id=0, a_x=-2.56, a_y=6.16, a_z=0.0, b_x=5.07, b_y=6.16, b_z=0.0, height=3.26, thickness=0.0
make_door, id=1000, wall0_id=2, wall1_id=4, position_x=-1.51, position_y=1.84, position_z=1.01, width=1.82, height=2.02
make_window, id=2000, wall0_id=0, position_x=4.45, position_y=6.16, position_z=1.64, width=1.01, height=2.12
```
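As a concrete illustration, here is a minimal sketch that parses one such command line into a Python dict (our own helper, handling numeric fields only; not the paper's released parser):

```python
def parse_command(line: str) -> dict:
    """Parse one SceneScript command line, e.g. 'make_wall, id=0, a_x=-2.56, ...'."""
    parts = [p.strip() for p in line.split(",")]
    entity = {"command": parts[0]}
    for kv in parts[1:]:
        key, value = kv.split("=")
        # ids and wall references are integers, everything else here is a float
        entity[key] = int(value) if key.endswith("id") else float(value)
    return entity

wall = parse_command(
    "make_wall, id=0, a_x=-2.56, a_y=6.16, a_z=0.0, "
    "b_x=5.07, b_y=6.16, b_z=0.0, height=3.26, thickness=0.0"
)
print(wall["command"], wall["height"])  # make_wall 3.26
```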
1.2.0.2 Architecture
SceneScript uses a two-stage architecture combining geometric encoding with autoregressive structured language generation:
1.2.0.2.1 Sparse 3D ResNet Encoder
**Input Processing**:
- **Point Cloud Discretization**: Input point clouds discretized to **5cm resolution**
- **Sparse Representation**: Uses `torchsparse` for efficient handling of sparse 3D data
- **Preprocessing**: Points normalized and voxelized
**Architecture**:
- **ResNet-Style Encoder**: Sparse 3D ResNet using sparse convolutions
- **Downsampling**: **5 down convolutional layers** with kernel size 3 and stride 2
- **Point Reduction**: Reduces number of active sites by ~1000×
- **Parameters**: ~20M optimizable parameters
**Output**: Sparse feature tensor, flattened into a sequence of features $\mathbf{F} \in \mathbb{R}^{N \times d_{model}}$, where $N$ is the number of non-empty voxels (much smaller than the input point count)
**Coordinate Encoding**:
- Active site coordinates appended to feature vectors: $\mathbf{f}_i \leftarrow \text{cat}(\mathbf{f}_i, \mathbf{c}_i)$
- Features sorted lexicographically by coordinate
- Provides positional information for the transformer decoder (see the preprocessing sketch below)
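A minimal NumPy sketch of this preprocessing, assuming 5 cm voxels and mean-pooled per-voxel features (a simplification; the paper's encoder uses torchsparse sparse convolutions rather than pooling):

```python
import numpy as np

def voxelize_and_order(points: np.ndarray, feats: np.ndarray, res: float = 0.05):
    """points: (N, 3) metric coordinates, feats: (N, d) per-point features.
    Returns per-voxel features with voxel coordinates appended,
    sorted lexicographically by coordinate."""
    coords = np.floor(points / res).astype(np.int64)           # discretize to 5 cm voxels
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)
    # average-pool point features into their voxel (stand-in for the sparse encoder)
    pooled = np.zeros((len(uniq), feats.shape[1]))
    np.add.at(pooled, inverse, feats)
    pooled /= np.bincount(inverse)[:, None]
    # append active-site coordinates, then sort lexicographically by (x, y, z)
    out = np.concatenate([pooled, uniq.astype(pooled.dtype)], axis=1)
    order = np.lexsort((uniq[:, 2], uniq[:, 1], uniq[:, 0]))
    return out[order]

pts = np.random.rand(10_000, 3) * 5.0          # toy 5 m cube of points
features = voxelize_and_order(pts, pts.copy()) # XYZ coordinates used as input features
```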
1.2.0.2.2 Transformer Decoder
**Token Types**:
SceneScript uses **typed tokens** where each token has both a *value* and a *type*:
- **Value Tokens**: Discretized parameters (e.g., wall corner position)
- **Type Tokens**: Indicate semantic role (e.g., `MAKE_WALL_A_X`, `COMMAND`, `PART`, `STOP`)
**Embeddings**: Each token's embedding is the sum of learned position, value, and type embeddings:
```plaintext
embedding = position_emb + value_emb + type_emb
```
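A minimal PyTorch sketch of this summed embedding (the number of type tokens and the module layout are our own assumptions; dimensions follow the decoder details below):

```python
import torch
import torch.nn as nn

class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size=2048, n_types=32, max_len=2048, d_model=512):
        super().__init__()
        self.value_emb = nn.Embedding(vocab_size, d_model)  # discretized parameter value
        self.type_emb = nn.Embedding(n_types, d_model)      # semantic role, e.g. MAKE_WALL_A_X
        self.pos_emb = nn.Embedding(max_len, d_model)       # position in the sequence

    def forward(self, values, types):
        # values, types: (batch, seq_len) integer tensors
        pos = torch.arange(values.size(1), device=values.device)
        return self.value_emb(values) + self.type_emb(types) + self.pos_emb(pos)
```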
**Autoregressive Generation**:
1. Start with `<START>` token
2. For each timestep $t$:
- Feed sequence $\{\text{token}_1, ..., \text{token}_t\}$ to decoder
- Attend to encoded point cloud features (cross-attention)
- Predict the next token value via a softmax over the vocabulary of $N_{bins}$ value bins plus 6 special tokens
- Decode type of next token based on grammar rules
3. Generate until the `<STOP>` token (a decoding-loop sketch follows)
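A minimal sketch of the greedy decoding loop, assuming a `decoder(tokens, memory)` callable that returns per-position logits and cross-attends to the encoded point-cloud features (names, special-token ids, and batch size 1 are illustrative):

```python
import torch

START, STOP = 0, 1  # illustrative special-token ids

@torch.no_grad()
def generate(decoder, memory, max_len=2048):
    """Greedy autoregressive decoding sketch.
    decoder(tokens, memory) -> (batch, seq, vocab) logits; memory = encoded point-cloud features."""
    tokens = torch.tensor([[START]])
    for _ in range(max_len - 1):
        logits = decoder(tokens, memory)                      # cross-attends to point-cloud features
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True) # greedy choice of next value token
        tokens = torch.cat([tokens, next_tok], dim=1)
        if next_tok.item() == STOP:
            break
    return tokens
```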
**Type-Guided Decoding**:
- Entity generation follows a strict grammar:
  - The type token determines the valid next parameters (e.g., after `make_wall`, expect `a_x`, `a_y`, ...)
  - Enforces structural consistency during generation (see the grammar sketch below)
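The grammar can be viewed as a lookup from the current command to the ordered parameter types it expects; a toy sketch (the table below is our own illustration based on the entity definitions on this page, not the paper's released grammar):

```python
# Ordered parameter types expected after each command token (illustrative subset)
GRAMMAR = {
    "make_wall":   ["a_x", "a_y", "a_z", "b_x", "b_y", "b_z", "height", "thickness"],
    "make_door":   ["wall0_id", "wall1_id", "position_x", "position_y", "position_z", "width", "height"],
    "make_window": ["wall0_id", "position_x", "position_y", "position_z", "width", "height"],
}

def next_token_type(command: str, n_params_emitted: int) -> str:
    """Return the type of the next token, or 'PART' when the entity is complete."""
    params = GRAMMAR[command]
    return params[n_params_emitted] if n_params_emitted < len(params) else "PART"
```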
**Decoder Details**:
- 8 transformer decoder layers (fixed architecture)
- 8 attention heads for multi-head attention
- Feature dimension: \(d_{model} = 512\)
- Parameters: ~35M optimizable parameters
- Vocabulary size: 2048 tokens (also max sequence length)
- Causal self-attention mask for autoregressive generation
- Cross-attention to encoded point cloud features
1.2.0.3 Entity Types and Parameters
SceneScript defines four primitive types with specific parameter sets:
1.2.0.3.1 Wall Entity
```python
PARAMS = {
    'id': int,           # Unique identifier
    'a_x': float, 'a_y': float, 'a_z': float,  # Corner A (meters)
    'b_x': float, 'b_y': float, 'b_z': float,  # Corner B (meters)
    'height': float,     # Wall height (meters)
    'thickness': float,  # Always 0.0 (legacy)
}
```
- Defines a 3D line segment with vertical extrusion (see the corner-computation sketch below)
- Implicitly defines floor/ceiling at z=0 and z=height
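A minimal sketch turning wall parameters into the four corners of the extruded quad (a helper of our own, not from the paper):

```python
import numpy as np

def wall_corners(wall: dict) -> np.ndarray:
    """Return the 4 corners (in order) of the vertical quad defined by a make_wall entity."""
    a = np.array([wall["a_x"], wall["a_y"], wall["a_z"]])
    b = np.array([wall["b_x"], wall["b_y"], wall["b_z"]])
    up = np.array([0.0, 0.0, wall["height"]])
    return np.stack([a, b, b + up, a + up])  # floor edge, then ceiling edge

corners = wall_corners({"a_x": -2.56, "a_y": 6.16, "a_z": 0.0,
                        "b_x": 5.07, "b_y": 6.16, "b_z": 0.0,
                        "height": 3.26})
```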
1.2.0.3.2 Door Entity
```python
PARAMS = {
    'id': int,
    'wall0_id': int,     # Parent wall reference
    'wall1_id': int,     # (duplicate for legacy)
    'position_x': float, 'position_y': float, 'position_z': float,  # Center
    'width': float,      # Door width
    'height': float,     # Door height
}
```
- Attached to parent wall
- Oriented parallel to the wall
1.2.0.3.3 Window Entity
- Inherits same parameters as Door
- Differentiated only by command type
1.2.0.3.4 Bbox Entity (Objects)
```python
PARAMS = {
    'id': int,
    'class': str,        # Object category (e.g., 'chair', 'table')
    'position_x': float, 'position_y': float, 'position_z': float,  # Center
    'angle_z': float,    # Rotation around Z-axis (radians)
    'scale_x': float, 'scale_y': float, 'scale_z': float,  # Bounding box size
}
```
- Represents furniture and objects
- Oriented bounding box with Z-axis rotation (see the corner sketch below)
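A minimal sketch recovering the 8 corners of the oriented box from these parameters (again our own helper, not the paper's code):

```python
import numpy as np

def bbox_corners(box: dict) -> np.ndarray:
    """Return the 8 corners of an oriented (Z-rotated) bounding box entity."""
    c, s = np.cos(box["angle_z"]), np.sin(box["angle_z"])
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])      # rotation about Z
    half = 0.5 * np.array([box["scale_x"], box["scale_y"], box["scale_z"]])
    center = np.array([box["position_x"], box["position_y"], box["position_z"]])
    signs = np.array([[x, y, z] for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)])
    return center + (signs * half) @ R.T
```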
1.2.0.4 Input/Output Formulation
Input:
- Point Cloud: \(\mathbf{P} \in \mathbb{R}^{N_p \times 3}\) from Aria Machine Perception Services (MPS) semi-dense SLAM
- Discretization: Points discretized to 5cm resolution
- Features: XYZ coordinates serve as input features
Output:
Token Sequence: \(\{\text{tok}_1, \text{tok}_2, ..., \text{tok}_T\}\) where \(T \leq T_{max} = 2048\)
Structured Language: Each entity encoded as:
```plaintext
<PART> <CMD> <param_1> <param_2> ... <param_n> <PART> ...
```
Example:
```plaintext
<PART> make_wall 42 156 200 0 198 312 200 0 255 0 <PART>
```
(Discretized integer tokens, later undiscretized to continuous floats)
Tokenization:
- Integer parameters: \(t = \text{int}(x)\)
- Float parameters: \(t = \text{round}(x / \text{res})\) where res = 5cm resolution
- Vocabulary: 2048 tokens maximum
- Not BPE-based (unlike typical NLP tokenizers); uses a custom discretization scheme (see the sketch below)
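A minimal sketch of the discretization and its inverse, assuming value tokens are offset past the 6 special tokens (the offset convention is our own assumption):

```python
RES = 0.05          # 5 cm discretization resolution
N_SPECIAL = 6       # START, STOP, PART, ... reserved low token ids

def discretize(x: float, is_int: bool = False) -> int:
    """Map a parameter to an integer token id (offset past the special tokens)."""
    t = int(x) if is_int else int(round(x / RES))
    return t + N_SPECIAL

def undiscretize(t: int, is_int: bool = False) -> float:
    """Inverse mapping used during post-processing."""
    v = t - N_SPECIAL
    return float(v) if is_int else v * RES

assert abs(undiscretize(discretize(3.26)) - 3.25) < 1e-9  # quantized to the nearest 5 cm
```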
Post-Processing:
- Parse token sequence into entities
- Undiscretize parameters back to continuous values
- Assign doors/windows to the nearest wall (see the assignment sketch below)
- Translate back to original coordinate frame
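A minimal sketch of the nearest-wall assignment using 2D point-to-segment distance (our own implementation of the idea, not the paper's code):

```python
import numpy as np

def nearest_wall_id(point: np.ndarray, walls: list) -> int:
    """Assign a door/window center to the wall whose 2D segment it is closest to."""
    best_id, best_d = None, np.inf
    for w in walls:
        a = np.array([w["a_x"], w["a_y"]])
        b = np.array([w["b_x"], w["b_y"]])
        p = point[:2]
        t = np.clip(np.dot(p - a, b - a) / max(np.dot(b - a, b - a), 1e-9), 0.0, 1.0)
        d = np.linalg.norm(p - (a + t * (b - a)))  # distance to the segment, not the infinite line
        if d < best_d:
            best_id, best_d = w["id"], d
    return best_id
```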
1.2.0.5 Training Methodology
Dataset: 100K ASE synthetic scenes
- Train/Val/Test: 80K / 10K / 10K split
- Augmentation: Random rotations, translations, point subsampling
Loss Function:
- Cross-Entropy: On discretized token predictions
- Type-Aware: Separate losses for each token type
- Weighted: Higher weight on command tokens vs parameters
Training Strategy:
- Teacher Forcing: During training, feed ground truth previous tokens
- Nucleus Sampling: At inference, nucleus (top-p) sampling is used for diversity (see the sketch after this list)
- Greedy Decoding: Quantitative results decoded greedily
- Augmentation: Random z-axis rotation (360°), random point subsampling (up to 500K points)
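A minimal sketch of nucleus (top-p) sampling over a single logits vector (standard formulation, not specific to SceneScript):

```python
import torch

def nucleus_sample(logits: torch.Tensor, top_p: float = 0.9) -> int:
    """Sample the next token from the smallest set of tokens whose probability mass exceeds top_p."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative - sorted_probs < top_p  # always keeps at least the top token
    sorted_probs[~keep] = 0.0
    sorted_probs /= sorted_probs.sum()
    choice = torch.multinomial(sorted_probs, 1)
    return sorted_idx[choice].item()
```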
Optimization:
- Optimizer: AdamW with learning rate 1e-4 (1e-3 for the image-only encoder variant)
- Batch Size: 64 scenes (effective batch size, distributed across multiple nodes)
- Training Time: ~3-4 days (hardware not specified in paper)
- Convergence: ~200K iterations
- Loss: Standard cross-entropy on next-token prediction (a training-step sketch follows)
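A minimal sketch of one teacher-forced training step with per-token loss weighting, assuming a `model(points, inputs)` callable that returns next-token logits (the signature and the exact weighting are our own assumptions):

```python
import torch
import torch.nn.functional as F

def training_step(model, points, target_tokens, token_weights):
    """Teacher-forced next-token prediction with per-token weighting.
    target_tokens: (batch, T) ground-truth sequence;
    token_weights: (batch, T), e.g. higher on command tokens than on parameters."""
    inputs, targets = target_tokens[:, :-1], target_tokens[:, 1:]
    logits = model(points, inputs)                      # (batch, T-1, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), reduction="none")
    loss = (loss * token_weights[:, 1:].reshape(-1)).mean()
    return loss
```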
1.2.1 Relevance to Our Work
SceneScript provides a semantic backbone for NBV planning:
- Entity-Level Primitives: Explicit geometric hulls enable per-entity RRI computation
- Streaming Capability: Autoregressive generation allows incremental scene updates as more views are acquired
- Structured Representation: Entity parameters provide targets for reconstruction quality metrics
- Semantic Awareness: Different entity types (walls vs doors vs objects) can be weighted differently for NBV
- Backbone for NBV: Potential to serve as a semantic encoder for NBV prediction
Key Insight: SceneScript’s structured representation enables entity-aware NBV planning where we can:
- Compute RRI separately for each wall, door, or object
- Prioritize views that improve reconstruction of specific entities
- Integrate user-specified importance weights per entity type
- Track reconstruction completeness at entity-level granularity
NBV Integration Strategy:
- Run SceneScript on partial point cloud → get predicted entities
- Compare predicted vs (incrementally estimated) ground truth entities
- Compute per-entity reconstruction error
- Candidate view RRI = expected reduction in entity errors visible from that view
- Select the view maximizing the weighted sum of entity-level RRI (a toy selection sketch follows)
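A toy sketch of this selection rule; every data structure and name here is hypothetical and stands in for whatever RRI estimator we end up using:

```python
def select_next_best_view(candidate_views, entity_errors, expected_errors_after, importance):
    """Pick the view with the largest expected weighted reduction in per-entity reconstruction error.
    entity_errors: {entity_id: current error}
    expected_errors_after: {view_id: {entity_id: predicted error if that view is added}}
    importance: {entity_id: user-specified weight}  (all names here are our own)."""
    def rri(view_id):
        after = expected_errors_after[view_id]
        return sum(importance.get(e, 1.0) * (err - after.get(e, err))
                   for e, err in entity_errors.items())
    return max(candidate_views, key=rri)
```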
1.2.2 Human-in-the-Loop Local Corrections of 3D Scene Layouts via Infilling
This recent work extends SceneScript with interactive refinement capabilities, enabling human-in-the-loop corrections.
1.2.2.1 Key Contributions
- Multi-Task SceneScript: Jointly trains global scene prediction and local infilling tasks
- Interactive Refinement: Users can click erroneous regions for localized re-generation
- Improved Local Accuracy: Significantly better reconstruction quality in user-specified regions
- Beyond Training Distribution: Enables layouts that diverge from training data through iterative refinement
1.2.2.2 Human-in-the-Loop Workflow
- Initial Prediction: Generate full scene layout from point cloud
- User Feedback: Identify regions requiring correction
- Local Infilling: Re-generate only the problematic area while maintaining global consistency
- Iterative Refinement: Repeat until satisfactory (a toy region-selection sketch follows)
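A toy sketch of the region-selection step, where entities near a user's click are flagged for re-generation (entirely our own illustration of the workflow, not the paper's interface):

```python
import numpy as np

def entities_to_refill(entities, click_xy, radius=1.0):
    """Return ids of entities whose (x, y) location falls inside the clicked region."""
    selected = []
    for e in entities:
        if "position_x" in e:                      # doors, windows, bboxes
            xy = np.array([e["position_x"], e["position_y"]])
        else:                                      # walls: use the segment midpoint
            xy = 0.5 * np.array([e["a_x"] + e["b_x"], e["a_y"] + e["b_y"]])
        if np.linalg.norm(xy - np.asarray(click_xy)) < radius:
            selected.append(e["id"])
    return selected
```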
1.2.2.3 Relevance to Our Work
This work demonstrates:
- Interactive Scene Understanding: Users can specify regions/entities of interest
- Partial Updates: Ability to refine specific parts without full re-processing
- Quality-Aware Refinement: System can focus computational resources where needed
Application to NBV: We can extend this paradigm to NBV planning where users select entities requiring higher reconstruction quality, and the system prioritizes views that improve those specific regions.
