---
license: mit
---
# Implementation of FLUX-Text
FLUX-Text: A Simple and Advanced Diffusion Transformer Baseline for Scene Text Editing
<a href='https://amap-ml.github.io/FLUX-text/'><img src='https://img.shields.io/badge/Project-Page-green'></a>
<a href='https://arxiv.org/abs/2505.03329'><img src='https://img.shields.io/badge/Technique-Report-red'></a>
<a href="https://huggingface.co/GD-ML/FLUX-Text/"><img src="https://img.shields.io/badge/π€_HuggingFace-Model-ffbd45.svg" alt="HuggingFace"></a>
> *[Rui Lan](https://scholar.google.com/citations?user=zwVlWXwAAAAJ&hl=zh-CN), [Yancheng Bai](https://scholar.google.com/citations?hl=zh-CN&user=Ilx8WNkAAAAJ&view_op=list_works&sortby=pubdate), [Xu Duan](https://scholar.google.com/citations?hl=zh-CN&user=EEUiFbwAAAAJ), [Mingxing Li](https://scholar.google.com/citations?hl=zh-CN&user=-pfkprkAAAAJ), [Lei Sun](https://allylei.github.io), [Xiangxiang Chu](https://scholar.google.com/citations?hl=zh-CN&user=jn21pUsAAAAJ&view_op=list_works&sortby=pubdate)*
> <br>
> Alibaba Group
<img src='assets/flux-text.png'>
## Overview
* **Motivation:** Scene text editing is a challenging task that aims to modify or add text in images while maintaining the fidelity of the newly generated text and its visual coherence with the background. The main challenge is editing multi-line text with diverse text attributes (e.g., fonts, sizes, and styles), language types (e.g., English and Chinese), and visual scenarios (e.g., posters, advertising, and gaming).
* **Contribution:** We propose FLUX-Text, a novel text editing framework for editing multi-line text in complex visual scenes. By incorporating a lightweight Condition Injection LoRA module, a regional text perceptual loss, and a two-stage training strategy, we achieve significant improvements on both Chinese and English benchmarks.
<img src='assets/method.png'>
## News
- **2025-07-16**: 🔥 ComfyUI node updated. We have decoupled the FLUX-Text node so it can be combined with more basic nodes. Due to differences in node computation in ComfyUI, if you need results more consistent with this repository, set `min_length` to 512 in the [code](https://github.com/comfyanonymous/ComfyUI/blob/master/comfy/text_encoders/flux.py#L12).
<div align="center">
<table>
<tr>
<td><img src="assets/comfyui2.png" alt="workflow/FLUX-Text-Basic-Workflow.json" width="400"/></td>
</tr>
<tr>
<td align="center">workflow/FLUX-Text-Basic-Workflow.json</td>
</tr>
</table>
</div>
- **2025-07-13**: 🔥 The training code has been updated and now supports multi-scale training.
- **2025-07-13**: 🔥 Updated the low-VRAM version of the Gradio demo, which currently requires 25 GB of VRAM to run. We look forward to more efficient, lower-memory solutions from the community.
- **2025-07-08**: 🔥 The ComfyUI node is supported! You can now build a workflow based on FLUX-Text for editing posters. It is well worth setting up a workflow that automatically updates the service information and service scope on product images. Using the first and last frames also enables the creation of video data with text effects. Thanks to [community work](https://github.com/AMAP-ML/FluxText/issues/4), FLUX-Text can run on 8 GB of VRAM.
<div align="center">
<table>
<tr>
<td><img src="assets/comfyui.png" alt="workflow/FLUX-Text-Workflow.json" width="400"/></td>
</tr>
<tr>
<td align="center">workflow/FLUX-Text-Workflow.json</td>
</tr>
</table>
</div>
<div align="center">
<table>
<tr>
<td><img src="assets/ori_img1.png" alt="assets/ori_img1.png" width="200"/></td>
<td><img src="assets/new_img1.png" alt="assets/new_img1.png" width="200"/></td>
<td><img src="assets/ori_img2.png" alt="assets/ori_img2.png" width="200"/></td>
<td><img src="assets/new_img2.png" alt="assets/new_img2.png" width="200"/></td>
</tr>
<tr>
<td align="center">original image</td>
<td align="center">edited image</td>
<td align="center">original image</td>
<td align="center">edited image</td>
</tr>
</table>
</div>
<div align="center">
<table>
<tr>
<td><img src="assets/video_end1.png" alt="assets/video_end1.png" width="400"/></td>
<td><img src="assets/video1.gif" alt="assets/video1.gif" width="400"/></td>
</tr>
<tr>
<td><img src="assets/video_end2.png" alt="assets/video_end2.png" width="400"/></td>
<td><img src="assets/video2.gif" alt="assets/video2.gif" width="400"/></td>
</tr>
<tr>
<td align="center">last frame</td>
<td align="center">video</td>
</tr>
</table>
</div>
- **2025-07-04**: 🔥 We have released the Gradio demo! You can now try out FLUX-Text.
<div align="center">
<table>
<tr>
<td><img src="assets/gradio_1.png" alt="Example 1" width="400"/></td>
<td><img src="assets/gradio_2.png" alt="Example 2" width="400"/></td>
</tr>
<tr>
<td align="center">Example 1</td>
<td align="center">Example 2</td>
</tr>
</table>
</div>
- **2025-07-03**: 🔥 We have released our [pre-trained checkpoints](https://huggingface.co/GD-ML/FLUX-Text/) on Hugging Face! You can now try out FLUX-Text with the official weights.
- **2025-06-26**: Inference and evaluation code are released. Once we have ensured that everything functions correctly, the new model will be merged into this repository.
## Todo List
- [x] Inference code
- [x] Pre-trained weights
- [x] Gradio demo
- [x] ComfyUI
- [x] Training code
## 🛠️ Installation
We recommend using Python 3.10 and PyTorch with CUDA support. To set up the environment:
```bash
# Create a new conda environment
conda create -n flux_text python=3.10
conda activate flux_text
# Install other dependencies
pip install -r requirements.txt
pip install flash_attn --no-build-isolation
pip install Pillow==9.5.0
```
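To verify that the environment is working, here is a quick sanity check (it assumes a CUDA-capable GPU is visible):
```python
# Environment sanity check (illustrative, not part of the repo).
import torch
import flash_attn  # noqa: F401  # raises ImportError if flash_attn failed to build

assert torch.cuda.is_available(), "No CUDA GPU detected"
print(f"PyTorch {torch.__version__} on {torch.cuda.get_device_name(0)}")
```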
## 🤗 Model Introduction
FLUX-Text is an open-source scene text editing model that can be used for editing posters, emotions, and more. The table below lists the text editing models we currently offer, along with their basic information.
<table style="border-collapse: collapse; width: 100%;">
<tr>
<th style="text-align: center;">Model Name</th>
<th style="text-align: center;">Image Resolution</th>
<th style="text-align: center;">Memory Usage</th>
<th style="text-align: center;">English Sen.Acc</th>
<th style="text-align: center;">Chinese Sen.Acc</th>
<th style="text-align: center;">Download Link</th>
</tr>
<tr>
<td style="text-align: center;">FLUX-Text-512</td>
<td style="text-align: center;">512×512</td>
<td style="text-align: center;">34 GB</td>
<td style="text-align: center;">0.8419</td>
<td style="text-align: center;">0.7132</td>
<td style="text-align: center;"><a href="https://huggingface.co/GD-ML/FLUX-Text/tree/main/model_512">🤗 HuggingFace</a></td>
</tr>
<tr>
<td style="text-align: center;">FLUX-Text</td>
<td style="text-align: center;">Multi-resolution</td>
<td style="text-align: center;">34 GB (at 512×512)</td>
<td style="text-align: center;">0.8228</td>
<td style="text-align: center;">0.7161</td>
<td style="text-align: center;"><a href="https://huggingface.co/GD-ML/FLUX-Text/tree/main/model_multisize">🤗 HuggingFace</a></td>
</tr>
</table>
## 🔥 ComfyUI
<details>
<summary> Installing via GitHub </summary>
First, install and set up [ComfyUI](https://github.com/comfyanonymous/ComfyUI), and then follow these steps:
1. **Clone FLUXText Repository**:
```shell
git clone https://github.com/AMAP-ML/FluxText.git
```
2. **Install FluxText**:
```shell
cd FluxText && pip install -r requirements.txt
```
3. **Integrate FluxText Comfy Nodes with ComfyUI**:
- **Symbolic Link (Recommended)**:
```shell
ln -s $(pwd)/ComfyUI-fluxtext path/to/ComfyUI/custom_nodes/
```
- **Copy Directory**:
```shell
cp -r ComfyUI-fluxtext path/to/ComfyUI/custom_nodes/
```
</details>
## 🔥 Quick Start
Here's a basic example of using FLUX-Text:
```python
import numpy as np
from PIL import Image
import torch
import yaml
from safetensors.torch import load_file

from src.flux.condition import Condition
from src.flux.generate_fill import generate_fill
from src.train.model import OminiModelFIll

config_path = ""  # path to the training config YAML
lora_path = ""    # path to the downloaded .safetensors checkpoint

with open(config_path, "r") as f:
    config = yaml.safe_load(f)

model = OminiModelFIll(
    flux_pipe_id=config["flux_path"],
    lora_config=config["train"]["lora_config"],
    device="cuda",
    dtype=getattr(torch, config["dtype"]),
    optimizer_config=config["train"]["optimizer"],
    model_config=config.get("model", {}),
    gradient_checkpointing=True,
    byt5_encoder_config=None,
)

# Remap the checkpoint keys to the PEFT naming used by the transformer, then load.
state_dict = load_file(lora_path)
state_dict_new = {
    k.replace("lora_A", "lora_A.default")
     .replace("lora_B", "lora_B.default")
     .replace("transformer.", ""): v
    for k, v in state_dict.items()
}
model.transformer.load_state_dict(state_dict_new, strict=False)
pipe = model.flux_pipe

prompt = "lepto college of education, the written materials on the picture: LESOTHO , COLLEGE OF , RE BONA LESELI LESEL , EDUCATION ."

# Three conditioning inputs: the region mask (hint), the original image (img),
# and the glyph image (rendered target text), inverted and normalized to [0, 1].
hint = Image.open("assets/hint.png").resize((512, 512)).convert("RGB")
img = Image.open("assets/hint_imgs.jpg").resize((512, 512))
condition_img = Image.open("assets/hint_imgs_word.png").resize((512, 512)).convert("RGB")
hint = np.array(hint) / 255
condition_img = np.array(condition_img)
condition_img = (255 - condition_img) / 255
condition_img = [condition_img, hint, img]
position_delta = [0, 0]
condition = Condition(
    condition_type="word_fill",
    condition=condition_img,
    position_delta=position_delta,
)

generator = torch.Generator(device="cuda")
res = generate_fill(
    pipe,
    prompt=prompt,
    conditions=[condition],
    height=512,
    width=512,
    generator=generator,
    model_config=config.get("model", {}),
    default_lora=True,
)
res.images[0].save("flux_fill.png")
```
## 🤗 Gradio
You can upload a glyph image and a mask image to edit the text region, or use `manual edit` to obtain them interactively; a sketch for preparing the two images yourself follows the launch command below.
First, download the model weights and config from [HuggingFace](https://huggingface.co/GD-ML/FLUX-Text), then run:
```bash
python app.py --model_path xx.safetensors --config_path config.yaml
```
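If you prefer to prepare the glyph and mask images programmatically instead of through `manual edit`, here is a minimal sketch with Pillow, consistent with the preprocessing in the Quick Start example; the font path, text box, and font size are illustrative assumptions:
```python
# Illustrative helper (not part of the repo): render a glyph image (black text
# on a white canvas) and a mask image (white box on a black canvas).
from PIL import Image, ImageDraw, ImageFont

def make_glyph_and_mask(size, box, text, font_path="font.ttf"):
    """size: (w, h); box: (x0, y0, x1, y1) region to edit; text: target string."""
    glyph = Image.new("RGB", size, "white")
    font = ImageFont.truetype(font_path, size=(box[3] - box[1]) // 2)
    ImageDraw.Draw(glyph).text((box[0], box[1]), text, fill="black", font=font)

    mask = Image.new("RGB", size, "black")
    ImageDraw.Draw(mask).rectangle(box, fill="white")
    return glyph, mask

glyph, mask = make_glyph_and_mask((512, 512), (64, 200, 448, 280), "EDUCATION")
glyph.save("glyph.png")
mask.save("mask.png")
```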
## 💪🏻 Training
1. Download the training dataset [**AnyWord-3M**](https://modelscope.cn/datasets/iic/AnyWord-3M/summary) from ModelScope and unzip all \*.zip files in each subfolder. Then open each *\*.json* file and set `data_root` to your own path to the *imgs* folder of the corresponding sub-dataset (a helper sketch follows the training command below).
2. Download the ODM weights from [HuggingFace](https://huggingface.co/GD-ML/FLUX-Text/blob/main/epoch_100.pt).
3. (Optional) Download the pretrained weights from [HuggingFace](https://huggingface.co/GD-ML/FLUX-Text).
4. Run the training script. With 48 GB of VRAM, you can train at 512×512 resolution with a batch size of 2.
```bash
bash train/script/train_word.sh
```
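Editing every annotation file by hand is tedious, so here is a minimal sketch that rewrites `data_root` in each JSON. It assumes the key sits at the top level of each file and that each sub-dataset keeps its images in a sibling *imgs* folder; verify both against your copy of AnyWord-3M before running:
```python
# Hypothetical helper: point each AnyWord-3M annotation JSON at its local imgs folder.
import json
from pathlib import Path

DATASET_ROOT = Path("/path/to/AnyWord-3M")  # adjust to your download location

for json_file in DATASET_ROOT.rglob("*.json"):
    with open(json_file, "r", encoding="utf-8") as f:
        ann = json.load(f)
    if "data_root" in ann:  # only touch files that actually carry the key
        ann["data_root"] = str(json_file.parent / "imgs")
        with open(json_file, "w", encoding="utf-8") as f:
            json.dump(ann, f, ensure_ascii=False, indent=2)
        print(f"updated {json_file}")
```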
## Evaluation
For the [AnyText-benchmark](https://modelscope.cn/datasets/iic/AnyText-benchmark/summary), set **config_path**, **model_path**, **json_path**, and **output_dir** in `eval/gen_imgs_anytext.sh`, then generate the text editing results:
```bash
bash eval/gen_imgs_anytext.sh
```
For Sen.Acc, NED, FID, and LPIPS evaluation, use the scripts in the `eval` folder:
```bash
bash eval/eval_ocr.sh
bash eval/eval_fid.sh
bash eval/eval_lpips.sh
```
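For intuition, Sen.Acc and NED are OCR-based text metrics: a sample counts toward Sen.Acc only if the recognized text exactly matches the target, while NED credits partial matches via normalized edit distance. Below is an illustrative sketch of one common formulation; the scripts in `eval/` are authoritative:
```python
# Illustrative metric sketch (the scripts in `eval/` define the official metrics).
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def sen_acc(preds, targets):
    """Fraction of samples whose OCR output exactly matches the target text."""
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)

def ned(preds, targets):
    """Mean normalized edit-distance similarity (1.0 = perfect match)."""
    return sum(1 - edit_distance(p, t) / max(len(p), len(t), 1)
               for p, t in zip(preds, targets)) / len(targets)

print(sen_acc(["EDUCATION"], ["EDUCATION"]))  # 1.0
print(ned(["EDUCATON"], ["EDUCATION"]))       # ~0.889
```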
## Results
<img src='assets/method_result.png'>
## Acknowledgement
Our work is primarily based on [OminiControl](https://github.com/Yuanshi9815/OminiControl), [AnyText](https://github.com/tyxsspa/AnyText), [Open-Sora](https://github.com/hpcaitech/Open-Sora), and [Phantom](https://github.com/Phantom-video/Phantom). We are sincerely grateful for their excellent work.
## Citation
If you find our paper and code helpful for your research, please consider starring our repository β and citing our work βοΈ.
```bibtex
@misc{lan2025fluxtext,
      title={FLUX-Text: A Simple and Advanced Diffusion Transformer Baseline for Scene Text Editing},
      author={Rui Lan and Yancheng Bai and Xu Duan and Mingxing Li and Lei Sun and Xiangxiang Chu},
      year={2025},
      eprint={2505.03329},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```