---
license: apache-2.0
language:
- en
library_name: diffusers
pipeline_tag: image-to-image
tags:
- Image-to-Image
- ControlNet
- Diffusers
- QwenImageControlNetPipeline
- Qwen-Image
base_model: Qwen/Qwen-Image
---

# Qwen-Image-ControlNet-Union

This repository provides a unified ControlNet that supports 4 control types (canny, soft edge, depth, pose) for [Qwen-Image](https://huggingface.co/Qwen/Qwen-Image).

+
# Model Cards
|
21 |
+
- This ControlNet consists of 5 double blocks copied from the pretrained transformer layers.
|
22 |
+
- We train the model from scratch for 50K steps using a dataset of 10M high-quality general and human images.
|
23 |
+
- We train at 1328x1328 resolution in BFloat16, batch size=64, learning rate=4e-5. We set the text drop ratio to 0.10.
|
24 |
+
- This model supports multiple control modes, including canny, soft edge, depth, pose. You can use it just as a normal ControlNet.
|
25 |
+
|
26 |
+
# Showcases
|
27 |
+
<table style="width:100%; table-layout:fixed;">
|
28 |
+
<tr>
|
29 |
+
<td><img src="./conds/canny2.png" alt="canny"></td>
|
30 |
+
<td><img src="./outputs/canny2.png" alt="softedge"></td>
|
31 |
+
</tr>
|
32 |
+
<tr>
|
33 |
+
<td><img src="./conds/soft_edge.png" alt="pose"></td>
|
34 |
+
<td><img src="./outputs/soft_edge.png" alt="depth"></td>
|
35 |
+
</tr>
|
36 |
+
<tr>
|
37 |
+
<td><img src="./conds/depth.png" alt="pose"></td>
|
38 |
+
<td><img src="./outputs/depth.png" alt="depth"></td>
|
39 |
+
</tr>
|
40 |
+
<tr>
|
41 |
+
<td><img src="./conds/pose.png" alt="pose"></td>
|
42 |
+
<td><img src="./outputs/pose.png" alt="depth"></td>
|
43 |
+
</tr>
|
44 |
+
</table>
|
45 |
+
|
# Inference

```python
import torch
from diffusers.utils import load_image

# before this model is merged into diffusers, import the classes from the local files
from controlnet_qwenimage import QwenImageControlNetModel
from transformer_qwenimage import QwenImageTransformer2DModel
from pipeline_qwenimage_controlnet import QwenImageControlNetPipeline

base_model = "Qwen/Qwen-Image"
controlnet_model = "InstantX/Qwen-Image-ControlNet-Union"

controlnet = QwenImageControlNetModel.from_pretrained(controlnet_model, torch_dtype=torch.bfloat16)
transformer = QwenImageTransformer2DModel.from_pretrained(base_model, subfolder="transformer", torch_dtype=torch.bfloat16)

pipe = QwenImageControlNetPipeline.from_pretrained(
    base_model, controlnet=controlnet, transformer=transformer, torch_dtype=torch.bfloat16
)
pipe.to("cuda")

# canny
# if the image contains text elements, it is highly recommended to spell the text out in the prompt
control_image = load_image("conds/canny.png")
prompt = "Aesthetics art, traditional asian pagoda, elaborate golden accents, sky blue and white color palette, swirling cloud pattern, digital illustration, east asian architecture, ornamental rooftop, intricate detailing on building, cultural representation."
controlnet_conditioning_scale = 1.0

# soft edge, recommended scale: 0.8 - 1.0
# control_image = load_image("conds/soft_edge.png")
# prompt = "Photograph of a young man with light brown hair jumping mid-air off a large, reddish-brown rock. He's wearing a navy blue sweater, light blue shirt, gray pants, and brown shoes. His arms are outstretched, and he has a slight smile on his face. The background features a cloudy sky and a distant, leafless tree line. The grass around the rock is patchy."
# controlnet_conditioning_scale = 0.9

# depth
# control_image = load_image("conds/depth.png")
# prompt = "A swanky, minimalist living room with a huge floor-to-ceiling window letting in loads of natural light. A beige couch with white cushions sits on a wooden floor, with a matching coffee table in front. The walls are a soft, warm beige, decorated with two framed botanical prints. A potted plant chills in the corner near the window. Sunlight pours through the leaves outside, casting cool shadows on the floor."
# controlnet_conditioning_scale = 0.9

# pose
# control_image = load_image("conds/pose.png")
# prompt = "Photograph of a young man with light brown hair and a beard, wearing a beige flat cap, black leather jacket, gray shirt, brown pants, and white sneakers. He's sitting on a concrete ledge in front of a large circular window, with a cityscape reflected in the glass. The wall is cream-colored, and the sky is clear blue. His shadow is cast on the wall."
# controlnet_conditioning_scale = 1.0

image = pipe(
    prompt=prompt,
    negative_prompt=" ",
    control_image=control_image,
    controlnet_conditioning_scale=controlnet_conditioning_scale,
    width=control_image.size[0],
    height=control_image.size[1],
    num_inference_steps=30,
    true_cfg_scale=4.0,
    generator=torch.Generator(device="cuda").manual_seed(42),
).images[0]
image.save("qwenimage_cn_union_result.png")
```
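If the full pipeline does not fit on a single GPU, the standard diffusers offloading hook may help. This is a sketch, assuming `QwenImageControlNetPipeline` inherits the usual `DiffusionPipeline` interface; it is not a tested configuration from this repository.

```python
# Sketch: lower peak VRAM by offloading idle sub-models to the CPU.
# Continues the snippet above. enable_model_cpu_offload() is the standard
# diffusers hook; it applies only if this custom pipeline inherits
# DiffusionPipeline, which is an assumption here.
pipe.enable_model_cpu_offload()  # use instead of pipe.to("cuda")
```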

# Recommended Parameters

You can adjust the control strength via `controlnet_conditioning_scale`. Condition images are prepared with standard annotators; a minimal canny example is sketched after this list.
- Canny: use `cv2.Canny`, set `controlnet_conditioning_scale` in [0.8, 1.0]
- Soft Edge: use [AnylineDetector](https://github.com/huggingface/controlnet_aux), set `controlnet_conditioning_scale` in [0.8, 1.0]
- Depth: use [depth-anything](https://github.com/DepthAnything/Depth-Anything-V2), set `controlnet_conditioning_scale` in [0.8, 1.0]
- Pose: use [DWPose](https://github.com/IDEA-Research/DWPose/tree/onnx), set `controlnet_conditioning_scale` in [0.8, 1.0]
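For reference, here is a minimal sketch of producing a canny condition image with OpenCV. The input path and the Canny thresholds are illustrative assumptions, not files or values shipped with this repository.

```python
# Sketch: build a canny condition image for the pipeline above.
import cv2
import numpy as np
from PIL import Image

source = cv2.imread("input.png")                   # hypothetical source image
gray = cv2.cvtColor(source, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 100, 200)                  # thresholds are illustrative; tune per image
control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))  # replicate to 3 channels
control_image.save("canny.png")
```
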
We strongly recommend using detailed prompts, especially when the image includes text elements. For example, use the prompt "A poster with a wilderness scene in the background. In the lower right corner, it says 'InstantX Team. All rights reserved.' The headlines are 'Qwen-Image' and 'ControlNet-Union', and the date is '2025.8'." instead of simply "a poster".

# Limitations

We find that the model is unable to preserve some fine details, such as small-font text.

# Acknowledgements

This model is developed by the InstantX Team. All rights reserved.