How to run a Flux GGUF model in Python
Hello,
I'm trying to use the quantized version of Flux-dev, but all the documentation I can find online uses ComfyUI.
Any help on running the model in Python would be welcome (I'm used to working with llama.cpp, but that's more for text-to-text models, not image generation).
Any help is much appreciated
Thanks
It is not entirely clear how to use a locally downloaded model. If I try to specify ckpt_path as a local directory, I get an error. Can you give me a simple code example?
Probably a bit late for the OP, but here's a basic script for running a GGUF checkpoint with diffusers:
from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig
import torch

prompt = "a moonim dressed as a knight, riding a horse towards a medieval castle"

# ckpt_path = "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q8_0.gguf"
ckpt_path = "/Volumes/SSD2TB/AI/caches/models/flux1-dev-Q8_0.gguf"

# Load the quantized transformer from the GGUF file.
transformer = FluxTransformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)

# Build the full pipeline, swapping in the quantized transformer.
pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
).to("cuda")

height, width = 1024, 1024
images = pipeline(
    prompt=prompt,
    num_inference_steps=15,
    guidance_scale=5.0,
    height=height,
    width=width,
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]
images.save("gguf_image.png")
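(On the earlier question about local files: from_single_file expects the path to the .gguf file itself, not a directory. If you'd rather have the file fetched and cached for you, here is a small sketch using huggingface_hub, with the repo and filename taken from the commented-out URL above:)

from huggingface_hub import hf_hub_download

# Downloads the GGUF file into the local Hugging Face cache and returns its path.
ckpt_path = hf_hub_download(
    repo_id="city96/FLUX.1-dev-gguf",
    filename="flux1-dev-Q8_0.gguf",
)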
For Mac users, a couple of modifications:
from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig
import torch

# 0.0 lifts the default MPS allocation cap, so large models can spill into swap.
torch.mps.set_per_process_memory_fraction(0.0)

prompt = "a moonim dressed as a knight, riding a horse towards a medieval castle"

# ckpt_path = "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q8_0.gguf"
ckpt_path = "/Volumes/SSD2TB/AI/caches/models/flux1-dev-Q8_0.gguf"

transformer = FluxTransformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
).to("mps")

height, width = 1024, 1024
images = pipeline(
    prompt=prompt,
    num_inference_steps=15,
    guidance_scale=5.0,
    height=height,
    width=width,
    generator=torch.Generator("mps").manual_seed(42),
).images[0]
images.save("gguf_image.png")
Or, an alternative for Mac users with model unloading. This runs better on lower-memory configurations, even though macOS does a pretty good job of swapping out the parts of a model it is done with.
from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig
import torch
import gc

# 0.0 lifts the default MPS allocation cap, so large models can spill into swap.
torch.mps.set_per_process_memory_fraction(0.0)

def flush():
    # Free Python garbage and release cached MPS memory between stages.
    gc.collect()
    torch.mps.empty_cache()
    gc.collect()
    torch.mps.empty_cache()

prompt = "a moonim dressed as a knight, riding a horse towards a medieval castle"
ckpt_id = "black-forest-labs/FLUX.1-dev"

# Stage 1: load only the text encoders and encode the prompt.
pipeline = FluxPipeline.from_pretrained(
    ckpt_id,
    transformer=None,
    vae=None,
    torch_dtype=torch.bfloat16,
).to("mps")

with torch.no_grad():
    print("Encoding prompts.")
    prompt_embeds, pooled_prompt_embeds, text_ids = pipeline.encode_prompt(
        prompt=prompt, prompt_2=prompt, max_sequence_length=256
    )

# Unload the text encoders before loading the transformer.
del pipeline
flush()

# ckpt_path = "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q8_0.gguf"
ckpt_path = "/Volumes/SSD2TB/AI/caches/models/flux1-dev-Q8_0.gguf"

transformer = FluxTransformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)

# Stage 2: rebuild the pipeline without the text encoders and run denoising.
pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder=None,
    text_encoder_2=None,
    tokenizer=None,
    tokenizer_2=None,
    transformer=transformer,
    torch_dtype=torch.bfloat16,
).to("mps")

print("Running denoising.")
height, width = 1024, 1024

# No need to wrap this in `torch.no_grad()`; the pipeline call method is
# already wrapped in it.
images = pipeline(
    prompt_embeds=prompt_embeds,
    pooled_prompt_embeds=pooled_prompt_embeds,
    num_inference_steps=15,
    guidance_scale=5.0,
    height=height,
    width=width,
    generator=torch.Generator("mps").manual_seed(42),
).images[0]
images.save("compile_image.png")
Hi, first of all thanks for your code sample.
It is by far the only sample code I can find for using this GGUF model with Python. However, my GTX 1080 fails with it, even with the smallest one: flux1-dev-Q2_K.gguf.
Can you share more information about your runtime setup?
I'm a Mac user, with an M3 iMac with 24 GB of unified memory. Because of the unified memory architecture, most of the usual tricks for reducing memory usage don't work on Macs (they move things between VRAM and normal RAM), so I'm not an expert on them. Note that when people say they are running Flux on 8 GB of VRAM, they really mean 8 GB of VRAM plus X GB of system RAM; I don't know what that X needs to be.
With the Q8 GGUF version and the longer script I have just enough memory left over for a couple of heavy web pages and Thunderbird running for email while running Flux (without ControlNets etc.), without hitting swap. You may also have to look into using one of the quantised T5 text encoders to get Flux running, although that seems not to work on Macs (the int8 version appeared to use the same amount of unified RAM).
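For what it's worth, on a CUDA machine one way to try an int8 T5 is with bitsandbytes. This is only a sketch (I can't test it on my Mac, and it needs bitsandbytes installed):

from transformers import T5EncoderModel, BitsAndBytesConfig
import torch

# Load Flux's T5 text encoder in 8-bit (CUDA + bitsandbytes only).
text_encoder_2 = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    torch_dtype=torch.bfloat16,
)

You would then pass text_encoder_2=text_encoder_2 into FluxPipeline.from_pretrained alongside the GGUF transformer from the scripts above.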
Having said that...
First I'd try adding pipeline.enable_sequential_cpu_offload() and pipeline.enable_vae_tiling(). The first of those will slow things down compared to not using it, but since you can't run it at all right now...
So the main code becomes:
pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipeline.enable_sequential_cpu_offload()
pipeline.enable_vae_tiling()

height, width = 1024, 1024
images = pipeline(...
Note that I've removed the .to("cuda") from the pipeline creation; keeping the pipeline on the CPU is a requirement for enable_sequential_cpu_offload(), which moves each module to the GPU only while it is needed.
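For completeness, here's what that could look like end to end on a CUDA card like the GTX 1080. Treat it as a sketch under assumptions I can't test myself: it needs accelerate installed for the offloading, the Q2_K path is wherever your download lives, and older GPUs may prefer torch.float16 over bfloat16.

from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig
import torch

# Older cards like the GTX 1080 may be happier with torch.float16 than bfloat16.
dtype = torch.bfloat16

ckpt_path = "flux1-dev-Q2_K.gguf"  # path to your downloaded GGUF file

transformer = FluxTransformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=dtype),
    torch_dtype=dtype,
)

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=dtype,
)
# No .to("cuda") here: sequential offload moves each module to the GPU on demand.
pipeline.enable_sequential_cpu_offload()
pipeline.enable_vae_tiling()

image = pipeline(
    prompt="a moonim dressed as a knight, riding a horse towards a medieval castle",
    num_inference_steps=15,
    guidance_scale=5.0,
    height=1024,
    width=1024,
    generator=torch.Generator("cpu").manual_seed(42),
).images[0]
image.save("gguf_offload_image.png")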