Running FLUX.1-dev GGUF Q8 in Python

Speed up FLUX.1-dev with GGUF quantization


FLUX.1-dev is a powerful text-to-image model that produces stunning results, but its 24GB+ memory requirement makes it challenging to run on many systems. GGUF quantization of FLUX.1-dev offers a solution, reducing memory usage by approximately 50% while maintaining excellent image quality.

Cover image: an ASIC chip generated with the Q8 quantized FLUX.1-dev model in GGUF format, demonstrating that quality is preserved even with the reduced memory footprint.

What is GGUF Quantization?

GGUF (GPT-Generated Unified Format) is a quantization format originally developed for language models but now supported for diffusion models like FLUX. Quantization reduces model size by storing weights in lower precision formats (8-bit, 6-bit, or 4-bit) instead of full 16-bit or 32-bit precision.
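To make the idea concrete, here is a minimal, simplified sketch of 8-bit block quantization in the spirit of Q8_0, where each block of 32 weights is stored as int8 values plus one float scale. This is an illustration only, not the exact GGUF implementation:

import numpy as np

# One block of 32 float32 weights
weights = np.random.randn(32).astype(np.float32)

# Q8_0-style idea: one scale per block, weights stored as int8
scale = np.abs(weights).max() / 127.0
q = np.round(weights / scale).astype(np.int8)

# Dequantize and check the reconstruction error
dequantized = q.astype(np.float32) * scale
print("max abs error:", np.abs(weights - dequantized).max())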

For FLUX.1-dev, the transformer component (the largest part of the model) can be quantized, reducing its memory footprint from approximately 24GB to 12GB with Q8_0 quantization, or even lower with more aggressive quantization levels.
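For a rough sense of where these numbers come from, here is a back-of-envelope estimate of the transformer weight storage. The ~12B parameter count and the bits-per-weight figures are approximations, and this ignores the text encoders, VAE and activations:

# Approximate size of the FLUX.1-dev transformer weights at different precisions
PARAMS = 12e9  # the transformer has roughly 12 billion parameters

bits_per_weight = {
    "bf16 (unquantized)": 16,
    "Q8_0": 8.5,  # 8-bit values plus per-block scales
    "Q6_K": 6.6,
    "Q4_K": 4.5,
}

for name, bits in bits_per_weight.items():
    gb = PARAMS * bits / 8 / 1e9
    print(f"{name:20s} ~{gb:.1f} GB")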

Benefits of GGUF Quantization

The primary advantages of using GGUF quantized FLUX models include:

  • Reduced Memory Usage: Cut VRAM requirements in half, making FLUX.1-dev accessible on more hardware
  • Maintained Quality: Q8_0 quantization preserves image quality with minimal visible differences
  • Faster Loading: Quantized models load faster due to smaller file sizes
  • Lower Power Consumption: Reduced memory usage translates to lower power draw during inference

In our testing, the quantized model uses approximately 12-15GB of VRAM compared to 24GB+ for the full model. Per-step generation speed is comparable, and on memory-constrained systems the quantized model can be much faster overall because it needs less CPU offloading (see the performance comparison below).

Installation and Setup

To use GGUF quantized FLUX.1-dev, you’ll need the gguf package in addition to the standard diffusers dependencies. If you’re already using FLUX for text-to-image generation, you’re familiar with the base setup.

If you’re using uv as your Python package manager, you can install the required packages with:

uv pip install -U diffusers torch transformers gguf

Or with standard pip:

pip install -U diffusers torch transformers gguf
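To confirm that your diffusers build actually exposes GGUF support (it is only available in recent releases), a quick import check:

import diffusers
import gguf  # just confirming the package is importable

# Raises ImportError on diffusers versions without GGUF support
from diffusers import GGUFQuantizationConfig

print(f"diffusers {diffusers.__version__} - GGUF support available")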

Implementation

The key difference when using GGUF quantized models is that you load the transformer separately using FluxTransformer2DModel.from_single_file() with GGUFQuantizationConfig, then pass it to the pipeline. If you need a quick reference for Python syntax, check the Python Cheatsheet. Here’s a complete working example:

import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig

# Paths
gguf_model_path = "/path/to/flux1-dev-Q8_0.gguf"
base_model_path = "/path/to/FLUX.1-dev-config"  # Config files only

# Load GGUF quantized transformer
print(f"Loading GGUF quantized transformer from: {gguf_model_path}")
transformer = FluxTransformer2DModel.from_single_file(
    gguf_model_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    config=base_model_path,
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)

# Create pipeline with quantized transformer
print(f"Creating pipeline with base model: {base_model_path}")
pipe = FluxPipeline.from_pretrained(
    base_model_path,
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)

# Enable CPU offloading (required for GGUF)
pipe.enable_model_cpu_offload()
# Note: enable_sequential_cpu_offload() is NOT compatible with GGUF

# Generate image
prompt = "A futuristic cityscape at sunset with neon lights"
image = pipe(
    prompt,
    height=496,
    width=680,
    guidance_scale=3.5,
    num_inference_steps=60,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(42)
).images[0]

image.save("output.jpg")
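If you want to verify the memory savings on your own system, you can log peak GPU memory right after generation. A minimal addition to the script above, assuming a CUDA device:

# Report peak GPU memory allocated during generation (CUDA only)
if torch.cuda.is_available():
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    print(f"Peak VRAM allocated: {peak_gib:.1f} GiB")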

Important Considerations

Model Configuration Files

When using GGUF quantization, you still need the model configuration files from the original FLUX.1-dev model. These include:

  • model_index.json - Pipeline structure
  • Component configs (transformer, text_encoder, text_encoder_2, vae, scheduler)
  • Tokenizer files
  • Text encoder and VAE weights (these are not quantized)

The transformer weights come from the GGUF file, but all other components require the original model files.
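One convenient way to get just those files is to download the repository while skipping the transformer weights. A sketch using huggingface_hub's snapshot_download (FLUX.1-dev is a gated repository, so you need to accept the license and be logged in; the ignore pattern is an assumption about the repo layout):

from huggingface_hub import snapshot_download

# Download configs, tokenizers, text encoders and VAE, but skip the large
# transformer weights - those come from the GGUF file instead
base_model_path = snapshot_download(
    "black-forest-labs/FLUX.1-dev",
    ignore_patterns=["transformer/*.safetensors"],
)
print(f"Base model files in: {base_model_path}")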

CPU Offloading Compatibility

Important: enable_sequential_cpu_offload() is not compatible with GGUF quantized models and will fail with a KeyError: None. Use enable_model_cpu_offload() instead when working with quantized transformers.

Quantization Levels

Available quantization levels for FLUX.1-dev include:

  • Q8_0: Best quality, ~14-15GB memory (recommended)
  • Q6_K: Good balance, ~12GB memory
  • Q4_K: Maximum compression, ~8GB memory (I expect a noticeable drop in quality)

For most use cases, Q8_0 provides the best balance between memory savings and image quality.

Performance Comparison

In our testing with identical prompts and settings:

| Model | VRAM Usage | Generation Time | Quality |
|---|---|---|---|
| Full FLUX.1-dev | 24GB+ (I don't have a GPU that big) | not measured | Excellent (I think) |
| Full FLUX.1-dev | ~3GB with sequential_cpu_offload() | ~329s | Excellent |
| GGUF Q8_0 | ~14-15GB | ~98s (!) | Excellent |
| GGUF Q6_K | ~10-12GB | ~116s | Very Good |

Because it needs far less CPU offloading, the quantized model generates images more than 3 times faster than the fully offloaded full-precision model, while using roughly half the memory of the unquantized model run entirely on the GPU. That makes it practical for systems with limited VRAM.

I tested both models with the following prompt:

A futuristic close-up of a transformer inference ASIC chip with intricate circuitry, glowing blue light emitting from dense matrix multiply units and low-precision ALUs, surrounded by on-chip SRAM buffers and quantization pipelines, rendered in hyper-detailed photorealistic style with a cold, clinical lighting scheme.

The sample output of FLUX.1-dev Q8 is the cover image of this post (see above).

The sample output of non-quantized FLUX.1-dev is below:

Example image produced by the non-quantized FLUX.1-dev model (same ASIC prompt).

I don't see much difference in quality between the two.

Conclusion

GGUF quantization makes FLUX.1-dev accessible to a broader range of hardware while maintaining the high-quality image generation the model is known for. By reducing memory requirements by approximately 50%, you can run state-of-the-art text-to-image generation on more affordable hardware without significant quality loss.

The implementation is straightforward with the diffusers library, requiring only minor changes to the standard FLUX pipeline setup. For most users, Q8_0 quantization provides the optimal balance between memory efficiency and image quality.

If you’re working with FLUX.1-Kontext-dev for image augmentation, similar quantization techniques may become available in the future.
