FLUX.2 launch

2025-11-25 07:25:25 -08:00
commit e80b84ed9f
24 changed files with 3238 additions and 0 deletions
--- a/docs/flux2_dev_hf.md
+++ b/docs/flux2_dev_hf.md
@@ -0,0 +1,191 @@
+# 🧨 Running the model with diffusers
+
+## Getting started 
+
+Install diffusers from `main`:
+
+```sh
+pip install git+https://github.com/huggingface/diffusers.git
+```
+
+After accepting the gating on this repository, log in to Hugging Face from your terminal:
+```sh
+hf auth login
+```
+
+See below for inference instructions on different GPUs.
+
+---
+
+## 💾 Lower VRAM (~24-32G) - RTX 4090 and 5090
+
+With 24-32GB of VRAM, you can run the model with **4-bit quantization**.
+
+### 4-bit transformer and remote text-encoder (~18G of VRAM)
+
+The diffusers team is introducing a remote text-encoder for this release.
+The text embeddings are computed in bf16 in the cloud, so you only load the transformer into VRAM. This setup can get as low as ~18G of VRAM.
+
+```py
+import torch
+from diffusers import Flux2Pipeline, Flux2Transformer2DModel
+from diffusers.utils import load_image
+from huggingface_hub import get_token
+import requests
+import io
+
+repo_id = "diffusers/FLUX.2-dev-bnb-4bit"
+device = "cuda:0"
+torch_dtype = torch.bfloat16
+
+def remote_text_encoder(prompts):
+    response = requests.post(
+        "https://remote-text-encoder-flux-2.huggingface.co/predict",
+        json={"prompt": prompts},
+        headers={
+            "Authorization": f"Bearer {get_token()}",
+            "Content-Type": "application/json"
+        }
+    )
+    prompt_embeds = torch.load(io.BytesIO(response.content))
+
+    return prompt_embeds.to(device)
+
+pipe = Flux2Pipeline.from_pretrained(
+    repo_id, text_encoder=None, torch_dtype=torch_dtype  # the 4-bit transformer is loaded from repo_id
+).to(device)
+
+prompt = "Realistic macro photograph of a hermit crab using a soda can as its shell, partially emerging from the can, captured with sharp detail and natural colors, on a sunlit beach with soft shadows and a shallow depth of field, with blurred ocean waves in the background. The can has the text `BFL Diffusers` on it and it has a color gradient that start with #FF5733 at the top and transitions to #33FF57 at the bottom."
+
+image = pipe(
+    prompt_embeds=remote_text_encoder(prompt),
+    # image=load_image("https://huggingface.co/spaces/zerogpu-aoti/FLUX.1-Kontext-Dev-fp8-dynamic/resolve/main/cat.png"),  # optional image input
+    generator=torch.Generator(device=device).manual_seed(42),
+    num_inference_steps=50, #28 steps can be a good trade-off
+    guidance_scale=4,
+).images[0]
+
+image.save("flux2_output.png")
+```
+
+### 4-bit transformer and 4-bit text-encoder (~20G of VRAM)
+
+Load both the text-encoder and the transformer in 4-bit.
+`pipe.enable_model_cpu_offload()` keeps only the component that is currently running on the GPU, so the text-encoder is moved out of VRAM before the transformer runs and both fit.
+
+```py
+import torch
+from transformers import Mistral3ForConditionalGeneration
+from diffusers import Flux2Pipeline, Flux2Transformer2DModel
+
+repo_id = "diffusers/FLUX.2-dev-bnb-4bit"
+device = "cuda:0"
+torch_dtype = torch.bfloat16
+
+transformer = Flux2Transformer2DModel.from_pretrained(
+    repo_id, subfolder="transformer", torch_dtype=torch_dtype, device_map="cpu"
+)
+text_encoder = Mistral3ForConditionalGeneration.from_pretrained(
+    repo_id, subfolder="text_encoder", dtype=torch_dtype, device_map="cpu"
+)
+
+pipe = Flux2Pipeline.from_pretrained(
+    repo_id, transformer=transformer, text_encoder=text_encoder, torch_dtype=torch_dtype
+)
+pipe.enable_model_cpu_offload()
+
+prompt = "Realistic macro photograph of a hermit crab using a soda can as its shell, partially emerging from the can, captured with sharp detail and natural colors, on a sunlit beach with soft shadows and a shallow depth of field, with blurred ocean waves in the background. The can has the text `BFL + Diffusers` on it and it has a color gradient that start with #FF5733 at the top and transitions to #33FF57 at the bottom."
+
+image = pipe(
+    prompt=prompt,
+    # image=[load_image("https://huggingface.co/spaces/zerogpu-aoti/FLUX.1-Kontext-Dev-fp8-dynamic/resolve/main/cat.png")],  # multi-image input; needs `from diffusers.utils import load_image`
+    generator=torch.Generator(device=device).manual_seed(42),
+    num_inference_steps=50,
+    guidance_scale=4,
+).images[0]
+
+image.save("flux2_output.png")
+``` 
+
+To understand how different quantizations affect the model's abilities and quality, see the [FLUX.2 on diffusers](https://huggingface.co/blog/flux2) blog post.
+
+---
+
+## 💿 More VRAM (80G+)
+
+Even an H100 can't hold the text-encoder, transformer and VAE at the same time. On that card, call `pipe.enable_model_cpu_offload()` so that the components take turns on the GPU.
+On an H200, B200 or larger card, everything fits.
+
+```py
+import torch
+from diffusers import Flux2Pipeline
+
+repo_id = "black-forest-labs/FLUX.2-dev"
+device = "cuda:0"
+torch_dtype = torch.bfloat16
+
+pipe = Flux2Pipeline.from_pretrained(
+    repo_id, torch_dtype=torch_dtype
+)
+pipe.enable_model_cpu_offload()  # for cards with >80G of VRAM (H200, B200, etc.), remove this line and call `pipe.to(device)` instead
+
+prompt = "Realistic macro photograph of a hermit crab using a soda can as its shell, partially emerging from the can, captured with sharp detail and natural colors, on a sunlit beach with soft shadows and a shallow depth of field, with blurred ocean waves in the background. The can has the text `BFL Diffusers` on it and it has a color gradient that start with #FF5733 at the top and transitions to #33FF57 at the bottom."
+
+image = pipe(
+    prompt=prompt,
+    # image=[load_image("https://huggingface.co/spaces/zerogpu-aoti/FLUX.1-Kontext-Dev-fp8-dynamic/resolve/main/cat.png")],  # multi-image input; needs `from diffusers.utils import load_image`
+    generator=torch.Generator(device=device).manual_seed(42),
+    num_inference_steps=50,
+    guidance_scale=4,
+).images[0]
+
+image.save("flux2_output.png")
+```
+
+### Remote text-encoder + H100
+`pipe.enable_model_cpu_offload()` slows inference down a bit. To run as fast as possible on an H100, use the remote text-encoder and keep the rest of the pipeline on the GPU:
+```py
+import torch
+from diffusers import Flux2Pipeline, Flux2Transformer2DModel
+from huggingface_hub import get_token
+import requests
+import io
+
+repo_id = "black-forest-labs/FLUX.2-dev"
+device = "cuda:0"
+torch_dtype = torch.bfloat16
+
+def remote_text_encoder(prompts):
+    response = requests.post(
+        "https://remote-text-encoder-flux-2.huggingface.co/predict",
+        json={"prompt": prompts},
+        headers={
+            "Authorization": f"Bearer {get_token()}",
+            "Content-Type": "application/json"
+        }
+    )
+    assert response.status_code == 200, f"{response.status_code=}"
+    prompt_embeds = torch.load(io.BytesIO(response.content))
+
+    return prompt_embeds.to(device)
+
+pipe = Flux2Pipeline.from_pretrained(
+    repo_id, text_encoder=None, torch_dtype=torch_dtype
+).to(device)
+
+prompt = "Realistic macro photograph of a hermit crab using a soda can as its shell, partially emerging from the can, captured with sharp detail and natural colors, on a sunlit beach with soft shadows and a shallow depth of field, with blurred ocean waves in the background. The can has the text `BFL + Diffusers` on it and it has a color gradient that start with #FF5733 at the top and transitions to #33FF57 at the bottom."
+
+image = pipe(
+    prompt_embeds=remote_text_encoder(prompt),
+    # image=[load_image("https://huggingface.co/spaces/zerogpu-aoti/FLUX.1-Kontext-Dev-fp8-dynamic/resolve/main/cat.png")],  # optional multi-image input; needs `from diffusers.utils import load_image`
+    generator=torch.Generator(device=device).manual_seed(42),
+    num_inference_steps=50,
+    guidance_scale=4,
+).images[0]
+
+image.save("flux2_output.png")
+```
+
+## 🧮 Other VRAM sizes
+
+If your GPU has a different amount of VRAM, you can experiment with other quantizations. For example, on 40-48G GPUs, 8-bit quantization instead of 4-bit can be a good trade-off. You can learn more in the [diffusers FLUX.2 release blog](https://huggingface.co/blog/flux2).