How to run a Flux gguf model in Python

#41
by DSpider19 - opened

Hello,

I'm trying to use the quantized version of Flux-dev, but all the documentation I can find online uses ComfyUI.
Any help on running the model in Python would be appreciated (I'm used to working with llama.cpp, but that's more for text-to-text models, not image generation).

Any help is much appreciated

Thanks

Hey. The good news is that diffusers recently added support for GGUF. The documentation for how to use it is live here; it should be enough to get you started.

It is not entirely clear how to use a locally downloaded model. If I specify ckpt_path as a local directory, I get an error. Can you give me a simple code example?
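For what it's worth, the error is likely because from_single_file expects the path of the .gguf file itself, not the directory it was downloaded into. A minimal sketch, with an illustrative local path:

import torch
from diffusers import FluxTransformer2DModel, GGUFQuantizationConfig

# Point at the .gguf file itself, not its parent directory (path is illustrative).
transformer = FluxTransformer2DModel.from_single_file(
    "/path/to/flux1-dev-Q8_0.gguf",
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)

The transformer is then passed into FluxPipeline.from_pretrained, as in the full script below.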

Probably a bit late for the OP, but here's a basic GGUF-with-diffusers script:

from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig
import torch

prompt = "a moonim dressed as a knight, riding a horse towards a medieval castle"

#ckpt_path = "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q8_0.gguf"
ckpt_path = "/Volumes/SSD2TB/AI/caches/models/flux1-dev-Q8_0.gguf"

transformer = FluxTransformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
).to("cuda")

height, width = 1024, 1024

images = pipeline(
    prompt=prompt,
    num_inference_steps=15,
    guidance_scale=5.0,
    height=height,
    width=width,
    generator=torch.Generator("cuda").manual_seed(42)
).images[0]

images.save("gguf_image.png")

For Mac users, a couple of modifications:


from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig
import torch

torch.mps.set_per_process_memory_fraction(0.0)

prompt = "a moonim dressed as a knight, riding a horse towards a medieval castle"

#ckpt_path = "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q8_0.gguf"
ckpt_path = "/Volumes/SSD2TB/AI/caches/models/flux1-dev-Q8_0.gguf"

transformer = FluxTransformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
).to("mps")

height, width = 1024, 1024

images = pipeline(
    prompt=prompt,
    num_inference_steps=15,
    guidance_scale=5.0,
    height=height,
    width=width,
    generator=torch.Generator("mps").manual_seed(42)
).images[0]

images.save("gguf_image.png")

Or, an alternative for Mac users with model unloading. It runs better on lower-memory configurations, even though macOS does a pretty good job of swapping out the parts of a model
it is done with.

from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig
import torch
import gc

torch.mps.set_per_process_memory_fraction(0.0)

def flush():
    gc.collect()
    torch.mps.empty_cache()
    gc.collect()
    torch.mps.empty_cache()

prompt = "a moonim dressed as a knight, riding a horse towards a medieval castle"

ckpt_id = "black-forest-labs/FLUX.1-dev"

pipeline = FluxPipeline.from_pretrained(
    ckpt_id,
    transformer=None,
    vae=None,
    torch_dtype=torch.bfloat16,
).to("mps")

with torch.no_grad():
    print("Encoding prompts.")
    prompt_embeds, pooled_prompt_embeds, text_ids = pipeline.encode_prompt(
        prompt=prompt, prompt_2=prompt, max_sequence_length=256
    )


del pipeline

flush()

ckpt_path = "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q8_0.gguf"
ckpt_path = "/Volumes/SSD2TB/AI/caches/models/flux1-dev-Q8_0.gguf"

transformer = FluxTransformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder=None,
    text_encoder_2=None,
    tokenizer=None,
    tokenizer_2=None,
    transformer=transformer,
    torch_dtype=torch.bfloat16,
).to("mps")

print("Running denoising.")
height, width = 1024, 1024
# No need to wrap it up under `torch.no_grad()` as pipeline call method
# is already wrapped under that.
images = pipeline(
    prompt_embeds=prompt_embeds,
    pooled_prompt_embeds=pooled_prompt_embeds,
    num_inference_steps=15,
    guidance_scale=5.0,
    height=height,
    width=width,
    generator=torch.Generator("mps").manual_seed(42)
).images[0]

images.save("compile_image.png")

Hi, first thanks for your code sample.

It is by far the only sample code I can find for using this GGUF model with Python. However, my GTX 1080 fails with it, even with the smallest one: flux1-dev-Q2_K.gguf.

Can you share more information about your runtime setup?


I'm a Mac user, with an M3 iMac with 24 GB of unified memory. Because of the unified memory architecture, most of the best tricks for reducing memory usage don't work on Macs (they rely on moving things between
VRAM and normal RAM), so I'm not an expert on them. Note that when people say they are running Flux on 8 GB of VRAM, they really mean 8 GB of VRAM plus X GB of system RAM;
I don't know what that X needs to be.
With the Q8 GGUF version and the longer script, I have just enough memory left over for a couple of heavy web pages and Thunderbird running for email while running Flux (without ControlNets etc.) and without hitting swap. You may also have to look into using one of the quantised T5 text encoders to get Flux running, something that seems to not work on Macs (the int8 version seemed to use the same amount of unified RAM).
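On an Nvidia card, one route for the T5 side (which I haven't verified myself) is loading text_encoder_2 in 8-bit with bitsandbytes through transformers; bitsandbytes needs CUDA, so this won't help on MPS. A rough sketch, with the GGUF path illustrative:

import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig
from transformers import T5EncoderModel, BitsAndBytesConfig

# The T5 text encoder (text_encoder_2 in Flux) loaded in 8-bit via bitsandbytes (CUDA only).
text_encoder_2 = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    torch_dtype=torch.bfloat16,
)

# GGUF-quantized transformer, same as in the scripts above (path is illustrative).
transformer = FluxTransformer2DModel.from_single_file(
    "/path/to/flux1-dev-Q2_K.gguf",
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder_2=text_encoder_2,
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
# Device placement / offloading is then the same choice as in the scripts above.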

Having said that...

First, I'd try adding
pipeline.enable_sequential_cpu_offload() and pipeline.enable_vae_tiling(). The first of those will have an impact on how fast it runs compared to not using it, but as you can't run it at all at the moment anyway...

So the main code becomes

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)

pipeline.enable_sequential_cpu_offload()
pipeline.enable_vae_tiling()

height, width = 1024, 1024

images = pipeline(...  # same call arguments as in the earlier script

Note that I've removed the .to('cuda') from the pipeline creation; enable_sequential_cpu_offload() requires that the pipeline has not already been moved to the GPU, since it manages device placement itself.
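If the model does end up fitting in VRAM with a little headroom, pipeline.enable_model_cpu_offload() is usually a faster middle ground than sequential offload, since it moves whole components on and off the GPU rather than individual layers; it has the same requirement of not calling .to('cuda') first. A sketch, reusing the pipeline from the snippet above:

pipeline.enable_model_cpu_offload()  # faster than sequential offload when whole components fit in VRAM
pipeline.enable_vae_tiling()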
