language:
- en
license: other
license_name: autodesk-non-commercial-3d-generative-v1.0
tags:
- wala
- voxel-to-3d
Model Card for WaLa-VX16-1B
This model is part of the Wavelet Latent Diffusion (WaLa) paper. It generates high-quality 3D shapes with detailed geometry and complex structures from 16³ voxel inputs.
Model Details
Model Description
WaLa-VX16-1B is a large-scale 3D generative model trained on a dataset of over 10 million publicly available 3D shapes. It generates a wide range of high-quality 3D shapes from 16³ voxel inputs in 2-4 seconds. The model combines a wavelet-based compact latent encoding with an architecture of roughly one billion parameters (approximately 906.9 million) to improve geometric detail and structural plausibility.
- Developed by: Aditya Sanghi, Aliasghar Khani, Chinthala Pradyumna Reddy, Arianna Rampini, Derek Cheung, Kamal Rahimi Malekshan, Kanika Madan, Hooman Shayani
- Model type: 3D Generative Model
- License: Autodesk Non-Commercial (3D Generative) v1.0
For more information, please see the Project Page and the paper.
Model Sources
Uses
Direct Use
This model is released by Autodesk for academic and research purposes only, specifically the theoretical exploration and demonstration of the WaLa 3D generative framework. Please see here for inference instructions.
Out-of-Scope Use
The model should not be used for:
- Commercial purposes
- Creation of load-bearing physical objects the failure of which could cause property damage or personal injury
- Any usage not in compliance with the license, in particular the "Acceptable Use" section
Bias, Risks, and Limitations
Bias
The model may inherit biases present in the publicly available training datasets, which could lead to uneven representation of certain object types or styles.
The model's performance may degrade for object categories or styles that are underrepresented in the training data.
Risks and Limitations
- The quality of the generated 3D output may be impacted by the low resolution of the input voxel grid.
- The model may occasionally generate implausible shapes, especially when the input voxel grid is ambiguous or lacks detail. Even theoretically plausible shapes should not be relied upon for real-world structural soundness.
How to Get Started with the Model
Please refer to the instructions here
Training Details
Training Data
The model was trained on a dataset of over 10 million 3D shapes aggregated from 19 publicly available sub-datasets: ModelNet, ShapeNet, SMPL, Thingi10K, SMAL, COMA, House3D, ABC, Fusion 360, 3D-FUTURE, BuildingNet, DeformingThings4D, FG3D, Toys4K, ABO, Infinigen, Objaverse, and two subsets of ObjaverseXL (Thingiverse and GitHub).
Training Procedure
Preprocessing
Each 3D shape in the dataset was converted into a truncated signed distance function (TSDF) with a resolution of 256³. The TSDF was then decomposed using a discrete wavelet transform to create the wavelet-tree representation used by the model. For voxel conditioning, each shape was downsampled to a 16³ voxel grid.
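The conditioning step described above can be illustrated with a minimal NumPy sketch. It is not the authors' pipeline: the helper names (`truncate_sdf`, `downsample_occupancy`, `sphere_sdf`), the truncation value, and the "occupied if any fine cell is on or inside the surface" rule are all assumptions made for illustration, and the demo uses a 64³ grid rather than 256³ to keep memory use small.

```python
import numpy as np

def truncate_sdf(sdf, truncation):
    """Clamp signed distances to [-truncation, truncation] to form a TSDF."""
    return np.clip(sdf, -truncation, truncation)

def downsample_occupancy(tsdf, out_res=16):
    """Reduce a cubic TSDF grid to an out_res^3 boolean voxel grid.

    Assumed rule (not from the model card): a coarse voxel is occupied
    if any fine cell inside it lies on or inside the surface (tsdf <= 0).
    """
    res = tsdf.shape[0]
    if tsdf.shape != (res, res, res) or res % out_res != 0:
        raise ValueError("expected a cubic grid whose size is a multiple of out_res")
    f = res // out_res
    occupied = tsdf <= 0
    blocks = occupied.reshape(out_res, f, out_res, f, out_res, f)
    return blocks.any(axis=(1, 3, 5))

def sphere_sdf(res, radius):
    """Signed distance to a centered sphere, sampled at cell centers of [-1, 1]^3."""
    c = (np.arange(res) + 0.5) / res * 2.0 - 1.0
    x, y, z = np.meshgrid(c, c, c, indexing="ij")
    return np.sqrt(x**2 + y**2 + z**2) - radius

# Hypothetical toy input standing in for a real shape.
sdf = sphere_sdf(64, radius=0.5)
tsdf = truncate_sdf(sdf, truncation=0.1)
voxels = downsample_occupancy(tsdf, out_res=16)
print(voxels.shape, voxels.dtype)
```

The wavelet decomposition of the TSDF is a separate step and is not shown here.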
Training Hyperparameters
- Training regime: Please refer to the paper.
Speeds, Sizes, Times
- The model contains approximately 906.9 million parameters.
- The model can generate shapes within 2-4 seconds.
Evaluation
Testing Data, Factors & Metrics
Testing Data
The model was evaluated on the Google Scanned Objects (GSO) dataset and a validation set from the training data (MAS validation data).
Factors
The evaluation considered various factors such as the quality of generated shapes, the ability to capture fine details and complex structures, and the model's performance across different object categories.
Metrics
The model was evaluated using the following metrics:
- Intersection over Union (IoU)
- Light Field Distance (LFD)
- Chamfer Distance (CD)
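As a rough illustration of two of these metrics, the sketch below computes IoU between boolean voxel grids and a symmetric Chamfer distance between point sets using NumPy. The function names are hypothetical, and the Chamfer convention used here (mean squared nearest-neighbor distance, summed over both directions) is an assumption; conventions vary, and the paper should be consulted for the exact definition, sampling, and normalization behind the reported numbers. LFD is omitted because it relies on rendered light-field descriptors.

```python
import numpy as np

def voxel_iou(a, b):
    """Intersection over Union of two boolean occupancy grids of equal shape."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # both grids empty: treat as a perfect match
    return float(np.logical_and(a, b).sum() / union)

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between point sets p (N, 3) and q (M, 3).

    Assumed convention: mean squared nearest-neighbor distance from p to q
    plus the same from q to p. Brute force, so only suitable for small sets.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    d2 = ((p[:, None, :] - q[None, :, :]) ** 2).sum(axis=-1)
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

# Toy grids: two slabs that overlap in one of three layers.
a = np.zeros((4, 4, 4), dtype=bool)
b = np.zeros((4, 4, 4), dtype=bool)
a[:2] = True
b[1:3] = True
print(voxel_iou(a, b))  # 16 shared cells out of 48 in the union, i.e. 1/3

# Toy point sets one unit apart.
print(chamfer_distance([[0.0, 0.0, 0.0]], [[1.0, 0.0, 0.0]]))  # 1 + 1 = 2.0
```

Higher IoU is better, while lower LFD and CD are better.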
Results
The voxel-to-3D model achieved the following results on the GSO dataset:
- LFD: 1544.67
- IoU: 0.8285
- CD: 0.0020
On the MAS validation dataset:
- LFD: 1874.41
- IoU: 0.75739
- CD: 0.0020
Technical Specifications
Model Architecture and Objective
The model uses a modified U-ViT architecture. It employs a wavelet-based compact latent encoding to capture both coarse and fine details of 3D shapes from voxel inputs. The input 16³ voxel grid is processed by a ResNet-based convolutional encoder, and the encoder output is then downsampled to produce the condition latent vectors.
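To give some intuition for how a wavelet representation separates coarse and fine detail, here is a minimal one-level 3D Haar transform in NumPy. This is an illustration only: the model card does not specify the wavelet filter or the number of decomposition levels, so the Haar filter and the helper names (`haar_split`, `haar_merge`, `haar3d`, `ihaar3d`) are assumptions, not the model's actual encoding.

```python
import numpy as np

def haar_split(x, axis):
    """Split x along one axis into low-pass and high-pass halves (orthonormal Haar)."""
    x = np.moveaxis(x, axis, 0)
    even, odd = x[0::2], x[1::2]
    low = (even + odd) / np.sqrt(2.0)
    high = (even - odd) / np.sqrt(2.0)
    return np.moveaxis(low, 0, axis), np.moveaxis(high, 0, axis)

def haar_merge(low, high, axis):
    """Invert haar_split along the given axis."""
    low = np.moveaxis(low, axis, 0)
    high = np.moveaxis(high, axis, 0)
    x = np.empty((2 * low.shape[0],) + low.shape[1:], dtype=low.dtype)
    x[0::2] = (low + high) / np.sqrt(2.0)
    x[1::2] = (low - high) / np.sqrt(2.0)
    return np.moveaxis(x, 0, axis)

def haar3d(vol):
    """One-level 3D Haar transform; returns 8 subbands keyed 'LLL' .. 'HHH'.

    Character i of each key says whether axis i was low- or high-pass filtered.
    All dimensions of vol must be even.
    """
    bands = {"": np.asarray(vol, dtype=float)}
    for axis in range(3):
        nxt = {}
        for key, v in bands.items():
            lo, hi = haar_split(v, axis)
            nxt[key + "L"] = lo
            nxt[key + "H"] = hi
        bands = nxt
    return bands

def ihaar3d(bands):
    """Reconstruct the volume from the 8 subbands produced by haar3d."""
    for axis in (2, 1, 0):
        prefixes = {k[:-1] for k in bands}
        bands = {p: haar_merge(bands[p + "L"], bands[p + "H"], axis) for p in prefixes}
    return bands[""]

# Toy volume standing in for a TSDF grid.
rng = np.random.default_rng(0)
vol = rng.standard_normal((8, 8, 8))
bands = haar3d(vol)
print(sorted(bands), bands["LLL"].shape)
print(np.allclose(ihaar3d(bands), vol))
```

The `LLL` band is a half-resolution coarse approximation of the volume, and the other seven bands hold the detail needed to reconstruct it exactly.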
Compute Infrastructure
Hardware
[TBD]
Citation
[Citation information to be added after paper publication]