---
language:
- en
license: other
license_name: autodesk-non-commercial-3d-generative-v1.0
tags:
- wala
- depth-map-to-3d
pipeline_tag: image-to-3d
---

# Model Card for WaLa-DM6-1B

This model was released with the Wavelet Latent Diffusion (WaLa) paper. It generates high-quality 3D shapes with detailed geometry and complex structures from six-view depth map inputs.

## Model Details

### Model Description

WaLa-DM6-1B is a large-scale 3D generative model trained on more than 10 million publicly available 3D shapes. It generates a wide range of high-quality 3D shapes from six-view depth map inputs within 4 seconds. The model combines a wavelet-based compact latent encoding with a billion-parameter architecture to improve geometric detail and structural plausibility.

- **Developed by:** Aditya Sanghi, Aliasghar Khani, Chinthala Pradyumna Reddy, Arianna Rampini, Derek Cheung, Kamal Rahimi Malekshan, Kanika Madan, Hooman Shayani
- **Model type:** 3D Generative Model
- **License:** Autodesk Non-Commercial (3D Generative) v1.0

For more information, see the [Project Page](https://autodeskailab.github.io/WaLaProject) and [the paper](https://arxiv.org/abs/2411.08017).

### Model Sources

- **Project Page:** [WaLa](https://autodeskailab.github.io/WaLaProject)
- **Repository:** [Github](https://github.com/AutodeskAILab/WaLa)
- **Paper:** [ArXiv](https://arxiv.org/abs/2411.08017)
- **Demo:** [Colab](https://colab.research.google.com/drive/1W5zPXw9xWNpLTlU5rnq7g3jtIA2BX6aC?usp=sharing)

## Uses

### Direct Use

This model is released by Autodesk and is intended only for academic and research purposes, namely the theoretical exploration and demonstration of the WaLa 3D generative framework. See the [depth-map-to-3D section of the repository README](https://github.com/AutodeskAILab/WaLa?tab=readme-ov-file#depth-map-to-3d) for inference instructions.

### Out-of-Scope Use

The model should not be used for:

- Commercial purposes
- Creation of load-bearing physical objects whose failure could cause property damage or personal injury
- Any use that does not comply with the [license](https://huggingface.co/ADSKAILab/WaLa-DM6-1B/blob/main/LICENSE.md), in particular its "Acceptable Use" section

## Bias, Risks, and Limitations

### Bias

- The model may inherit biases present in the publicly-available training datasets, which could lead to uneven representation of certain object types or styles. 

- The model's performance may degrade for object categories or styles that are underrepresented in the training data. 

### Risks and Limitations

- The quality of the generated 3D output may be impacted by the quality and accuracy of the input depth maps.
- The model may occasionally generate implausible shapes, especially when the input depth maps are ambiguous or of low quality. Even theoretically plausible shapes should not be relied upon for real-world structural soundness.

## How to Get Started with the Model

Please refer to the [getting-started instructions](https://github.com/AutodeskAILab/WaLa?tab=readme-ov-file#getting-started) in the repository.

## Training Details

### Training Data

The model was trained on a dataset of over 10 million 3D shapes aggregated from 19 publicly available sub-datasets: ModelNet, ShapeNet, SMPL, Thingi10K, SMAL, COMA, House3D, ABC, Fusion 360, 3D-FUTURE, BuildingNet, DeformingThings4D, FG3D, Toys4K, ABO, Infinigen, Objaverse, and two subsets of ObjaverseXL (Thingiverse and GitHub).

### Training Procedure

#### Preprocessing

Each 3D shape in the dataset was converted into a truncated signed distance function (TSDF) at a resolution of 256³. The TSDF was then decomposed with a discrete wavelet transform to produce the wavelet-tree representation used by the model. For depth map conditioning, six views were chosen from a set of predetermined viewpoints so that together they cover the entire object.
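
The two preprocessing steps above (truncation, then wavelet decomposition) can be sketched in one dimension. This is a minimal illustration and not the authors' pipeline: it uses a Haar wavelet, which is an assumption made for simplicity because this card does not name the wavelet family, and it runs on a toy 8-sample signed distance field instead of a 256³ grid. The helper names `truncate_sdf`, `haar_step`, and `inverse_haar_step` are hypothetical.

```python
import math


def truncate_sdf(sdf, trunc):
    """Clamp signed distances to [-trunc, trunc], giving a truncated SDF."""
    return [max(-trunc, min(trunc, d)) for d in sdf]


def haar_step(signal):
    """One level of the orthonormal Haar transform on an even-length list.

    Returns (coarse, detail): scaled pairwise sums and differences.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    pairs = list(zip(signal[0::2], signal[1::2]))
    coarse = [(a + b) / s for a, b in pairs]
    detail = [(a - b) / s for a, b in pairs]
    return coarse, detail


def inverse_haar_step(coarse, detail):
    """Invert haar_step, recovering the original samples."""
    s = math.sqrt(2.0)
    signal = []
    for c, d in zip(coarse, detail):
        signal.extend([(c + d) / s, (c - d) / s])
    return signal


# Toy 1D "shape": the interval [2, 5]; distances are negative inside it.
sdf = [max(2 - x, x - 5) for x in range(8)]
tsdf = truncate_sdf(sdf, trunc=1.0)

coarse, detail = haar_step(tsdf)
restored = inverse_haar_step(coarse, detail)
print("TSDF:    ", tsdf)
print("coarse:  ", coarse)
print("detail:  ", detail)
print("restored:", restored)
```

Applying `haar_step` again to `coarse` yields the next, coarser level, which is the idea behind a multi-level wavelet tree. A 3D version would apply the same sum-and-difference step along each of the three axes.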

#### Training Hyperparameters

- **Training regime:** Please refer to the paper.

#### Speeds, Sizes, Times

- The model contains approximately 956 million parameters.
- The model can generate shapes within 4 seconds.
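
As a rough guide to what the parameter count above means for memory, the following sketch estimates the size of the weights alone. It assumes dense storage in float32 or float16; this card does not state the checkpoint precision, and activations and other runtime buffers are not counted. The function name `weight_bytes` is hypothetical.

```python
def weight_bytes(num_params, bytes_per_param):
    """Memory needed to hold the weights alone, in bytes."""
    return num_params * bytes_per_param


PARAMS = 956_000_000  # approximate parameter count stated in this card

for name, width in (("float32", 4), ("float16", 2)):
    gib = weight_bytes(PARAMS, width) / 2**30
    print(f"{name}: {gib:.2f} GiB")
```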

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

The model was evaluated on the Google Scanned Objects (GSO) dataset and a validation set from the training data (MAS validation data).

#### Factors

The evaluation considered various factors such as the quality of generated shapes, the ability to capture fine details and complex structures, and the model's performance across different object categories.

#### Metrics

The model was evaluated using the following metrics:
- Intersection over Union (IoU)
- Light Field Distance (LFD)
- Chamfer Distance (CD)
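
For readers unfamiliar with these metrics, here is a minimal pure-Python sketch of IoU and Chamfer Distance. It is not the evaluation code used for the reported numbers. Chamfer Distance has several common variants; this sketch assumes the mean of squared nearest-neighbour distances, summed over both directions, and the paper's exact convention and point sampling may differ. LFD is omitted because it requires rendering the shapes. The function names are hypothetical.

```python
def voxel_iou(a, b):
    """IoU between two collections of occupied voxel indices (i, j, k)."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty shapes are treated as identical.
    return len(a & b) / len(union) if union else 1.0


def chamfer_distance(p, q):
    """Symmetric Chamfer distance between two non-empty point lists.

    Uses the mean squared nearest-neighbour distance in each direction
    and sums the two directions (brute force, O(len(p) * len(q))).
    """
    def sq_dist(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))

    def one_way(src, dst):
        return sum(min(sq_dist(s, d) for d in dst) for s in src) / len(src)

    return one_way(p, q) + one_way(q, p)


# Two 2-voxel shapes that share one voxel.
shape_a = [(0, 0, 0), (1, 0, 0)]
shape_b = [(1, 0, 0), (2, 0, 0)]
print("IoU:", voxel_iou(shape_a, shape_b))

# Two small point clouds sampled from the shapes' surfaces.
points_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
points_b = [(0.0, 0.1, 0.0), (1.0, 0.1, 0.0)]
print("CD:", chamfer_distance(points_a, points_b))
```

Higher IoU is better, while lower CD and LFD are better.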

### Results

The multi-view (Depth 6) to 3D model achieved the following results on the GSO dataset:
- LFD: 1122.61
- IoU: 0.91245
- CD: 0.00125

On the MAS validation dataset:
- LFD: 1358.82
- IoU: 0.85986
- CD: 0.00129

## Technical Specifications

### Model Architecture and Objective

The model uses a U-ViT architecture with modifications. It employs a wavelet-based compact latent encoding to capture both coarse and fine details of 3D shapes from multi-view depth inputs. Each selected depth map is processed individually by the DINOv2 encoder, which produces a sequence of latent vectors for that view. The latent vectors from all six views are then concatenated to form the final conditional latent vectors.
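
The per-view encoding and concatenation described above can be sketched as follows. This is a shape-level illustration only: `encode_view` is a hypothetical stand-in that turns each image patch into a tiny two-number token, not the DINOv2 encoder, and concatenating along the sequence axis is an assumption, since this card does not say which axis is used.

```python
def encode_view(depth_map, patch=2):
    """Hypothetical stand-in encoder: one token per image patch.

    Each token is [mean depth, max depth] of its patch. A real encoder
    would return learned feature vectors instead.
    """
    height, width = len(depth_map), len(depth_map[0])
    tokens = []
    for r in range(0, height, patch):
        for c in range(0, width, patch):
            vals = [
                depth_map[i][j]
                for i in range(r, min(r + patch, height))
                for j in range(c, min(c + patch, width))
            ]
            tokens.append([sum(vals) / len(vals), max(vals)])
    return tokens


def build_condition(views):
    """Encode each of the six views separately, then concatenate the
    resulting token sequences along the sequence axis (assumed)."""
    if len(views) != 6:
        raise ValueError("expected exactly six depth maps")
    condition = []
    for view in views:
        condition.extend(encode_view(view))
    return condition


# Six toy 4x4 depth maps, each filled with a constant depth value.
views = [[[float(k)] * 4 for _ in range(4)] for k in range(6)]
condition = build_condition(views)
print(len(condition), "tokens of dimension", len(condition[0]))
```

With six views of N tokens each, the conditioning sequence under this assumption has 6N tokens, and the feature dimension is unchanged.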

### Compute Infrastructure

#### Hardware

The model was trained on NVIDIA H100 GPUs.

## Citation
```
@misc{sanghi2024waveletlatentdiffusionwala,
      title={Wavelet Latent Diffusion (Wala): Billion-Parameter 3D Generative Model with Compact Wavelet Encodings}, 
      author={Aditya Sanghi and Aliasghar Khani and Pradyumna Reddy and Arianna Rampini and Derek Cheung and Kamal Rahimi Malekshan and Kanika Madan and Hooman Shayani},
      year={2024},
      eprint={2411.08017},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.08017}, 
}
```