---
datasets:
- DL3DV/DL3DV-10K-Sample
language:
- en
---

# **Difix3D+: Improving 3D Reconstructions with Single-Step Diffusion Models**

CVPR 2025 (Oral)

[**Code**](https://github.com/nv-tlabs/Difix3D) | [**Project Page**](https://research.nvidia.com/labs/toronto-ai/difix3d/) | [**Paper**](https://arxiv.org/abs/2503.01774)

## Description:

Difix is a single-step image diffusion model trained to enhance rendered novel views and remove the artifacts caused by underconstrained regions of the underlying 3D representation. The technology behind Difix is based on the concepts outlined in the paper [Difix3D+: Improving 3D Reconstructions with Single-Step Diffusion Models](https://arxiv.org/abs/2503.01774).

Difix has two operation modes:

* Offline mode: Used during the reconstruction phase to clean up pseudo-training views rendered from the reconstruction, which are then distilled back into the 3D representation. This greatly enhances underconstrained regions and improves overall 3D representation quality.
* Online mode: Acts as a neural enhancer during inference, effectively removing residual artifacts arising from imperfect 3D supervision and the limited capacity of current reconstruction models.

Difix is an all-encompassing solution: a single model compatible with both NeRF and 3DGS representations.
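
Conceptually, the offline mode is a progressive loop: render pseudo-training views from the current reconstruction, clean them with Difix, and distill them back as additional supervision. The sketch below illustrates this flow only; every name in it (`render`, `add_training_view`, `optimize`) is a placeholder, not the actual API of the Difix3D repository.

```python
def difix3d_offline_loop(recon, difix, novel_cameras, rounds=3):
    """Schematic Difix3D offline loop (placeholder API, not the real one).

    recon: a NeRF or 3DGS reconstruction being optimized
    difix: the single-step Difix image diffusion model
    """
    for _ in range(rounds):
        for cam in novel_cameras:
            rendered = recon.render(cam)          # artifact-prone novel view
            fixed = difix(rendered)               # single-step artifact removal
            recon.add_training_view(cam, fixed)   # distill the fixed view back into 3D
        recon.optimize()                          # continue reconstruction training
    return recon
```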

**This model is ready for research and development/non-commercial use only.**

**Model Developer:** NVIDIA

**Model Versions:** difix_ref

**Deployment Geography:** Global

### License/Terms of Use:

The use of the model and code is governed by the NVIDIA License. Additional Information: [LICENSE.md · stabilityai/sd-turbo at main](https://huggingface.co/stabilityai/sd-turbo/blob/main/LICENSE.md)

### Use Case:

Difix is intended for Physical AI developers looking to enhance and improve their neural reconstruction pipelines. The model takes an image as input and outputs a fixed image.

**Release Date:** GitHub: [June 2025](https://github.com/nv-tlabs/Difix3D)

## Model Architecture

**Architecture Type**: UNet

**Network Architecture**: A latent diffusion-based UNet coupled with a variational autoencoder (VAE).

## Input

**Input Type(s)**: Image

**Input Format(s)**: Red, Green, Blue (RGB)

**Input Parameters**: Two-Dimensional (2D)

**Other Properties Related to Input**:

* Specific Resolution: 576 px × 1024 px

## Output

**Output Type(s)**: Image

**Output Format(s)**: Red, Green, Blue (RGB)

**Output Parameters**: Two-Dimensional (2D)

**Other Properties Related to Output**:

* Specific Resolution: 576 px × 1024 px
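
Because the working resolution is fixed at 576 × 1024, inputs generally need to be resized and converted to a batched RGB tensor first. A minimal PyTorch sketch follows; the [-1, 1] normalization is an assumption based on common latent-diffusion conventions, not a confirmed detail of the released code.

```python
import torch
from PIL import Image
from torchvision import transforms

# Resize to the model's 576x1024 working resolution and convert to a tensor.
# The [-1, 1] normalization below is an assumed convention; check the
# Difix3D repository for the authoritative transform.
preprocess = transforms.Compose([
    transforms.Resize((576, 1024)),
    transforms.ToTensor(),                                 # uint8 HWC -> float CHW in [0, 1]
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),   # [0, 1] -> [-1, 1]
])

image = Image.open("rendered_view.png").convert("RGB")
batch = preprocess(image).unsqueeze(0)                     # shape: (1, 3, 576, 1024)
```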

## Software Integration

**Runtime Engine(s)**: PyTorch

**Supported Hardware Microarchitecture Compatibility**:

* NVIDIA Ampere
* NVIDIA Hopper

**Note**: We have tested with FP32 precision.

## Inference

**Acceleration Engine**: [PyTorch](https://pytorch.org/)

**Test Hardware**:

* A100
* H100

**Operating System(s):** Linux (We have not tested on other operating systems.)

**System Requirements and Performance:**

This model requires X GB of GPU VRAM.

The following table shows the inference time for a single generation across different NVIDIA GPU hardware:

| GPU Hardware | Inference Runtime |
|--------------|-------------------|
| NVIDIA A100  | 0.355 sec         |
| NVIDIA H100  | 0.223 sec         |
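
For context on how such numbers can be reproduced, the sketch below times a single forward pass with CUDA events, which avoids the pitfalls of host-side timers on asynchronous GPU work. The `model` and `batch` arguments stand in for a loaded Difix model and a preprocessed input.

```python
import torch

def time_single_generation(model, batch, warmup=3, iters=10):
    """Average seconds per forward pass, measured with CUDA events."""
    with torch.no_grad():
        for _ in range(warmup):                    # warm up kernels and allocator
            model(batch)
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            model(batch)
        end.record()
        torch.cuda.synchronize()                   # wait for all queued GPU work
    return start.elapsed_time(end) / iters / 1000.0  # elapsed_time is in ms
```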

## Use the Difix Model

Please visit the [Difix3D repository](https://github.com/nv-tlabs/Difix3D) to access all relevant files and code needed to use Difix.
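
As a rough picture of what single-image usage looks like, here is a hedged end-to-end sketch. The `DifixModel` class, its `from_pretrained` loader, and the checkpoint path are all hypothetical placeholders; the repository above defines the real entry points.

```python
import torch
from torchvision.transforms.functional import to_pil_image

# Hypothetical wrapper around the released checkpoint; the actual loading
# and inference API lives in the nv-tlabs/Difix3D repository.
from difix import DifixModel  # hypothetical import

model = DifixModel.from_pretrained("path/to/difix_ref").to("cuda").eval()

batch = batch.to("cuda")      # (1, 3, 576, 1024) tensor; see the preprocessing sketch above
with torch.no_grad():
    fixed = model(batch)      # one diffusion step: artifacts removed, resolution preserved

# Map back from the assumed [-1, 1] range to a savable image.
to_pil_image((fixed[0].clamp(-1, 1) + 1) / 2).save("fixed_view.png")
```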

## Difix Dataset

- Data Collection Method: Human
- Labeling Method by Dataset: Human
- Properties: Difix was trained, tested, and evaluated using the [DL3DV-10K dataset](https://huggingface.co/datasets/DL3DV/DL3DV-10K-Sample), where 80% of the data was used for training, 10% for evaluation, and 10% for testing (an illustrative split is sketched after this list). DL3DV-10K is a large-scale dataset consisting of 10,510 high-resolution (4K) real-world video sequences, totaling approximately 51.2 million frames. The scenes span 65 diverse categories across indoor and outdoor environments. Each video is accompanied by metadata describing environmental conditions such as lighting (natural, artificial, mixed), surface materials (e.g., reflective or transparent), and texture complexity. The dataset is designed to support the development and evaluation of learning-based 3D vision methods.
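
For reference, an 80/10/10 split like the one described above can be reproduced deterministically over scene identifiers; splitting by scene rather than by frame keeps views of the same scene out of multiple splits. The per-scene organization assumed here is an illustration, not a documented detail of the training setup.

```python
import random

def split_scenes(scene_ids, seed=0):
    """Deterministic 80/10/10 train/eval/test split over scene identifiers."""
    ids = sorted(scene_ids)                   # fixed order before shuffling
    random.Random(seed).shuffle(ids)          # local RNG keeps the split reproducible
    n_train = int(0.8 * len(ids))
    n_eval = int(0.1 * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_eval], ids[n_train + n_eval:]
```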

## Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

---

## ModelCard++

### Bias

| Field | Response |
| :---- | :------- |
| Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing: | None |
| Measures taken to mitigate against unwanted bias: | None |

### Explainability

| Field | Response |
| :---- | :------- |
| Intended Domain: | Advanced Driver Assistance Systems |
| Model Type: | Image-to-Image |
| Intended Users: | Autonomous Vehicle developers enhancing and improving neural reconstruction pipelines. |
| Output: | Image |
| Describe how the model works: | The model takes an image as input and outputs a fixed image. |
| Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | None |
| Technical Limitations: | The reconstruction relies on the quality and consistency of input images and camera calibrations; any deficiencies in these areas can negatively impact the final output. |
| Verified to have met prescribed NVIDIA quality standards: | Yes |
| Performance Metrics: | FID (Fréchet Inception Distance), PSNR (Peak Signal-to-Noise Ratio), LPIPS (Learned Perceptual Image Patch Similarity) |
| Potential Known Risks: | The model is not guaranteed to fix 100% of image artifacts. Please verify that the generated outputs are context- and use-appropriate. |
| Licensing: | The use of the model and code is governed by the NVIDIA License. Additional Information: [LICENSE.md · stabilityai/sd-turbo at main](https://huggingface.co/stabilityai/sd-turbo/blob/main/LICENSE.md). |

### Privacy

| Field | Response |
| :---- | :------- |
| Generatable or reverse engineerable personal data? | No |
| Personal data used to create this model? | No |
| How often is the dataset reviewed? | Before release |
| Is there provenance for all datasets used in training? | Yes |
| Does data labeling (annotation, metadata) comply with privacy laws? | Yes |
| Is data compliant with data subject requests for data correction or removal, if such a request was made? | Yes |

### Safety & Security

| Field | Response |
| :---- | :------- |
| Model Application(s): | Image Enhancement |
| List types of specific high-risk AI systems, if any, in which the model can be integrated: | The model can be used to develop Autonomous Vehicle stacks that can be integrated inside vehicles. The Difix model itself should not be deployed in a vehicle. |
| Describe the life critical impact (if present). | N/A - The model should not be deployed in a vehicle and will not perform life-critical tasks. |
| Use Case Restrictions: | Your use of the model and code is governed by the NVIDIA License. Additional Information: [LICENSE.md · stabilityai/sd-turbo at main](https://huggingface.co/stabilityai/sd-turbo/blob/main/LICENSE.md) |
| Model and dataset restrictions: | The Principle of Least Privilege (PoLP) is applied, limiting access for dataset generation and model development. Dataset access is restricted during training, and dataset license constraints are adhered to. |