FLUX.2 launch
New file: `docs/flux2_dev_hf.md` (191 lines)

# 🧨 Running the model with diffusers

## Getting started

Install diffusers from `main`:

```sh
pip install git+https://github.com/huggingface/diffusers.git
```

After accepting the gating on this repository, log in with Hugging Face on your terminal:

```sh
hf auth login
```

See below for inference instructions on different GPUs.

---

## 💾 Lower VRAM (~24-32G) - RTX 4090 and 5090

Those with 24-32GB of VRAM can use the model with **4-bit quantization**.

### 4-bit transformer and remote text-encoder (~18G of VRAM)

The diffusers team is introducing a remote text-encoder for this release.
The text embeddings are computed in bf16 in the cloud, so you only load the transformer into VRAM (this setup can get as low as ~18G of VRAM).

```py
import torch
from diffusers import Flux2Pipeline, Flux2Transformer2DModel
from diffusers.utils import load_image
from huggingface_hub import get_token
import requests
import io

repo_id = "diffusers/FLUX.2-dev-bnb-4bit"
device = "cuda:0"
torch_dtype = torch.bfloat16

def remote_text_encoder(prompts):
    # Request bf16 prompt embeddings from the hosted text-encoder endpoint
    response = requests.post(
        "https://remote-text-encoder-flux-2.huggingface.co/predict",
        json={"prompt": prompts},
        headers={
            "Authorization": f"Bearer {get_token()}",
            "Content-Type": "application/json"
        }
    )
    assert response.status_code == 200, f"{response.status_code=}"
    prompt_embeds = torch.load(io.BytesIO(response.content))

    return prompt_embeds.to(device)

# Load only the 4-bit transformer locally; the text-encoder stays remote
transformer = Flux2Transformer2DModel.from_pretrained(
    repo_id, subfolder="transformer", torch_dtype=torch_dtype
)

pipe = Flux2Pipeline.from_pretrained(
    repo_id, transformer=transformer, text_encoder=None, torch_dtype=torch_dtype
).to(device)

prompt = "Realistic macro photograph of a hermit crab using a soda can as its shell, partially emerging from the can, captured with sharp detail and natural colors, on a sunlit beach with soft shadows and a shallow depth of field, with blurred ocean waves in the background. The can has the text `BFL Diffusers` on it and it has a color gradient that starts with #FF5733 at the top and transitions to #33FF57 at the bottom."

image = pipe(
    prompt_embeds=remote_text_encoder(prompt),
    #image=load_image("https://huggingface.co/spaces/zerogpu-aoti/FLUX.1-Kontext-Dev-fp8-dynamic/resolve/main/cat.png"), #optional image input
    generator=torch.Generator(device=device).manual_seed(42),
    num_inference_steps=50, #28 steps can be a good trade-off
    guidance_scale=4,
).images[0]

image.save("flux2_output.png")
```

### 4-bit transformer and 4-bit text-encoder (~20G of VRAM)

Load both the text-encoder and the transformer in 4-bit.
With `pipe.enable_model_cpu_offload()`, the text-encoder is offloaded from VRAM before the transformer runs, making sure both fit.

```py
import torch
from transformers import Mistral3ForConditionalGeneration
from diffusers import Flux2Pipeline, Flux2Transformer2DModel
from diffusers.utils import load_image

repo_id = "diffusers/FLUX.2-dev-bnb-4bit"
device = "cuda:0"
torch_dtype = torch.bfloat16

transformer = Flux2Transformer2DModel.from_pretrained(
    repo_id, subfolder="transformer", torch_dtype=torch_dtype, device_map="cpu"
)
text_encoder = Mistral3ForConditionalGeneration.from_pretrained(
    repo_id, subfolder="text_encoder", dtype=torch_dtype, device_map="cpu"
)

pipe = Flux2Pipeline.from_pretrained(
    repo_id, transformer=transformer, text_encoder=text_encoder, torch_dtype=torch_dtype
)
# Move each component to the GPU only while it runs so both fit in VRAM
pipe.enable_model_cpu_offload()

prompt = "Realistic macro photograph of a hermit crab using a soda can as its shell, partially emerging from the can, captured with sharp detail and natural colors, on a sunlit beach with soft shadows and a shallow depth of field, with blurred ocean waves in the background. The can has the text `BFL + Diffusers` on it and it has a color gradient that starts with #FF5733 at the top and transitions to #33FF57 at the bottom."

image = pipe(
    prompt=prompt,
    #image=[load_image("https://huggingface.co/spaces/zerogpu-aoti/FLUX.1-Kontext-Dev-fp8-dynamic/resolve/main/cat.png")], #multi-image input
    generator=torch.Generator(device=device).manual_seed(42),
    num_inference_steps=50,
    guidance_scale=4,
).images[0]

image.save("flux2_output.png")
```

To understand how different quantizations affect the model's abilities and quality, see the [FLUX.2 on diffusers](https://huggingface.co/blog/flux2) blog.

---

## 💿 More VRAM (80G+)

Even an H100 can't hold the text-encoder, transformer, and VAE at the same time. However, here it is just a matter of activating `pipe.enable_model_cpu_offload()`.
On H200, B200, or larger cards, everything fits.

```py
import torch
from diffusers import Flux2Pipeline
from diffusers.utils import load_image

repo_id = "black-forest-labs/FLUX.2-dev"
device = "cuda:0"
torch_dtype = torch.bfloat16

pipe = Flux2Pipeline.from_pretrained(
    repo_id, torch_dtype=torch_dtype
)
pipe.enable_model_cpu_offload() #deactivate for >80G VRAM cards like H200, B200, etc. and do a `pipe.to(device)` instead

prompt = "Realistic macro photograph of a hermit crab using a soda can as its shell, partially emerging from the can, captured with sharp detail and natural colors, on a sunlit beach with soft shadows and a shallow depth of field, with blurred ocean waves in the background. The can has the text `BFL Diffusers` on it and it has a color gradient that starts with #FF5733 at the top and transitions to #33FF57 at the bottom."

image = pipe(
    prompt=prompt,
    #image=[load_image("https://huggingface.co/spaces/zerogpu-aoti/FLUX.1-Kontext-Dev-fp8-dynamic/resolve/main/cat.png")], #multi-image input
    generator=torch.Generator(device=device).manual_seed(42),
    num_inference_steps=50,
    guidance_scale=4,
).images[0]

image.save("flux2_output.png")
```

### Remote text-encoder + H100

`pipe.enable_model_cpu_offload()` slows you down a bit. You can move as fast as possible on the H100 with the remote text-encoder.

```py
import torch
from diffusers import Flux2Pipeline
from diffusers.utils import load_image
from huggingface_hub import get_token
import requests
import io

repo_id = "black-forest-labs/FLUX.2-dev"
device = "cuda:0"
torch_dtype = torch.bfloat16

def remote_text_encoder(prompts):
    # Request bf16 prompt embeddings from the hosted text-encoder endpoint
    response = requests.post(
        "https://remote-text-encoder-flux-2.huggingface.co/predict",
        json={"prompt": prompts},
        headers={
            "Authorization": f"Bearer {get_token()}",
            "Content-Type": "application/json"
        }
    )
    assert response.status_code == 200, f"{response.status_code=}"
    prompt_embeds = torch.load(io.BytesIO(response.content))

    return prompt_embeds.to(device)

pipe = Flux2Pipeline.from_pretrained(
    repo_id, text_encoder=None, torch_dtype=torch_dtype
).to(device)

prompt = "Realistic macro photograph of a hermit crab using a soda can as its shell, partially emerging from the can, captured with sharp detail and natural colors, on a sunlit beach with soft shadows and a shallow depth of field, with blurred ocean waves in the background. The can has the text `BFL + Diffusers` on it and it has a color gradient that starts with #FF5733 at the top and transitions to #33FF57 at the bottom."

image = pipe(
    prompt_embeds=remote_text_encoder(prompt),
    #image=[load_image("https://huggingface.co/spaces/zerogpu-aoti/FLUX.1-Kontext-Dev-fp8-dynamic/resolve/main/cat.png")], #optional multi-image input
    generator=torch.Generator(device=device).manual_seed(42),
    num_inference_steps=50,
    guidance_scale=4,
).images[0]

image.save("flux2_output.png")
```

## 🧮 Other VRAM sizes

If you have a different GPU size, you can experiment with other quantizations. For example, on 40-48G VRAM GPUs, 8-bit quantization instead of 4-bit can be a good trade-off, as sketched below. You can learn more in the [diffusers FLUX.2 release blog](https://huggingface.co/blog/flux2).
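
As a rough sketch of what that can look like, the snippet below quantizes the transformer and the text-encoder to 8-bit on the fly with bitsandbytes while loading them from the base repository. The exact memory footprint and the best settings for your card will vary, so treat this as a starting point rather than a reference implementation:

```py
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from diffusers import Flux2Pipeline, Flux2Transformer2DModel
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
from transformers import Mistral3ForConditionalGeneration

repo_id = "black-forest-labs/FLUX.2-dev"
torch_dtype = torch.bfloat16

# Quantize the transformer to 8-bit with bitsandbytes while loading
transformer = Flux2Transformer2DModel.from_pretrained(
    repo_id,
    subfolder="transformer",
    quantization_config=DiffusersBitsAndBytesConfig(load_in_8bit=True),
    torch_dtype=torch_dtype,
)

# Quantize the Mistral text-encoder to 8-bit as well
text_encoder = Mistral3ForConditionalGeneration.from_pretrained(
    repo_id,
    subfolder="text_encoder",
    quantization_config=TransformersBitsAndBytesConfig(load_in_8bit=True),
    dtype=torch_dtype,
)

pipe = Flux2Pipeline.from_pretrained(
    repo_id, transformer=transformer, text_encoder=text_encoder, torch_dtype=torch_dtype
)
# Offload whichever component is idle to keep peak VRAM low
pipe.enable_model_cpu_offload()

image = pipe(
    prompt="a red car",  # any prompt
    generator=torch.Generator().manual_seed(42),
    num_inference_steps=50,
    guidance_scale=4,
).images[0]
image.save("flux2_8bit_output.png")
```

Compared to the prequantized 4-bit checkpoint, 8-bit typically stays closer to the bf16 model's quality at the cost of extra VRAM.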

New file: `docs/flux2_with_prompt_upsampling.md` (73 lines)

# Prompt upsampling with FLUX.2

Prompt upsampling uses a large vision language model to expand and enrich your prompts before generation, which can significantly improve results for reasoning-heavy and complex generation tasks.

## When to use prompt upsampling

Prompt upsampling is particularly effective for prompts requiring reasoning or complex interpretation:

- **Text generation in images**: Creating memes, posters, or images where the model needs to generate creative or contextually appropriate text
- **Image-based instructions**: Prompts where the input image contains overlaid text, arrows, or annotations that need to be interpreted (e.g., "follow the instructions in the image", "read the diagram and generate the result")
- **Code and math reasoning**: Generating visualizations of algorithms, mathematical concepts, or code flow diagrams where logical structure is important

For simple, direct prompts (e.g., "a red car"), prompt upsampling may not provide significant benefits.

## Methods

We provide two methods for prompt upsampling:

### 1. API-based prompt upsampling (recommended)

API-based prompt upsampling via [OpenRouter](https://openrouter.ai/) generally produces better results by leveraging more capable models.

Set your API key as an environment variable:

```bash
export OPENROUTER_API_KEY="<api_key>"
```

Then run the CLI with upsampling enabled:

```bash
export PYTHONPATH=src
python scripts/cli.py --upsample_prompt_mode=openrouter
```

You can switch between different models using `--openrouter_model=<model_name>`.

Alternatively, you can just start the CLI via

```bash
export PYTHONPATH=src
python scripts/cli.py
```

and choose your prompt upsampling model interactively.
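
Under the hood, API-based upsampling boils down to a single chat-completion request that asks the upsampling model to rewrite your prompt. The sketch below is a minimal illustration against OpenRouter's OpenAI-compatible endpoint; the helper name, system prompt, and model placeholder are illustrative assumptions, not the ones used by `scripts/cli.py`:

```py
import os
import requests

def upsample_prompt(prompt: str, model: str) -> str:
    # Ask a chat model on OpenRouter to expand the prompt before image generation
    response = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": model,  # same id you would pass to --openrouter_model
            "messages": [
                {
                    "role": "system",
                    "content": "Expand the user's prompt into a rich, detailed image generation prompt. Reply with the expanded prompt only.",
                },
                {"role": "user", "content": prompt},
            ],
        },
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

expanded = upsample_prompt("Make a meme about generating memes with this model", model="<model_name>")
print(expanded)
```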

**Example output:**

| Prompt: "Make a meme about generating memes with this model" |
|:---:|
| <img src="../assets/t2i_upsample_example.png" alt="Output" width="512"> |

### 2. Local prompt upsampling

Local prompt upsampling uses [`Mistral-Small-3.2-24B-Instruct-2506`](https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506), which is the model we use for text encoding in `FLUX.2 [dev]`. This option requires no API keys but may produce less detailed expansions.

To enable local prompt upsampling, use `--upsample_prompt_mode=local`.
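
For a sense of what the local mode does, here is a minimal sketch that expands a prompt with the same Mistral model through `transformers`. It assumes the repository ships a Transformers-format tokenizer with a chat template; the system prompt and generation settings are illustrative, not the ones shipped with the CLI:

```py
import torch
from transformers import AutoTokenizer, Mistral3ForConditionalGeneration

model_id = "mistralai/Mistral-Small-3.2-24B-Instruct-2506"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = Mistral3ForConditionalGeneration.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto"
)

messages = [
    # Illustrative upsampling instruction; the CLI uses its own system prompt.
    {"role": "system", "content": "Expand the user's prompt into a rich, detailed image generation prompt. Reply with the expanded prompt only."},
    {"role": "user", "content": "Make a meme about generating memes with this model"},
]

input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=512)

# Keep only the newly generated tokens: the expanded prompt
expanded_prompt = tokenizer.decode(
    output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True
)
print(expanded_prompt)
```

The expanded prompt is then passed to the FLUX.2 pipeline in place of the original one.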

**Example:**

<table>
  <tr>
    <th colspan="2" style="text-align: center;">Prompt: "Describe what the red arrow is seeing"</th>
  </tr>
  <tr>
    <th>Input</th>
    <th>Output</th>
  </tr>
  <tr>
    <td align="center"><img src="../assets/i2i_upsample_input.png" alt="Input image"></td>
    <td align="center"><img src="../assets/i2i_upsample_example.png" alt="Output image"></td>
  </tr>
</table>