6fa6c5041b
Major unified memory optimization changes:
1. model_management.py: HIGH_VRAM → NORMAL_VRAM
   - GB10 unified memory: offloading to CPU doesn't save physical RAM
     (same pool), but NORMAL_VRAM allows per-layer partial loading when
     memory is tight instead of all-or-nothing OOM
   - text_encoder_offload_device() and vae_offload_device() now return
     CPU (allows ComfyUI to offload unused models)
   - intermediate_device() still returns GPU (VAE outputs must stay in
     CUDA allocator for honest memory tracking)
   - User can force HIGH_VRAM with --highvram if models fit
2. utils.py: copy=True → copy=False for tensor.to(device)
   - On GB10 unified memory, copy=True creates a full duplicate in both
     CPU and CUDA allocators simultaneously (ComfyUI issue #10896)
   - copy=False makes .to(device) a zero-copy device label change since
     both allocators draw from the same physical LPDDR5X
   - Halves model loading memory usage when --disable-mmap is set
3. Removed --disable-dynamic-vram from ComfyUI flags
   - Was preventing AIMDO (comfy_aimdo) from initializing
   - AIMDO now activates: VBAR-based page-level VRAM management at 32MB
     granularity instead of blunt .to(cpu) copies
   - Falls back to NORMAL_VRAM per-layer loading if AIMDO has issues
4. Added CUDA_CACHE_MAXSIZE=4294967296 (4GB kernel cache)
   - PTX→SASS kernel caching for sm_121 (GB10 Blackwell)
   - 3x speedup on subsequent runs reported by DGX Spark community
5. System: vm.swappiness reduced from 60 to 1
   - Swap thrashing on unified memory causes silent system freezes
   - Near-zero swappiness ensures clean OOM kills instead
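
Items 4 and 5 above are plain numeric settings, so they can be checked from a short script. The following is a minimal sketch, not part of the patch set: the helper names cache_limit_gib and read_swappiness are hypothetical, and it assumes a Linux host where vm.swappiness is exposed at /proc/sys/vm/swappiness.

```python
import os
from pathlib import Path

GIB = 1024 ** 3  # bytes per GiB


def cache_limit_gib(env=os.environ):
    """Return CUDA_CACHE_MAXSIZE converted to GiB, or None if it is unset."""
    raw = env.get("CUDA_CACHE_MAXSIZE")
    return None if raw is None else int(raw) / GIB


def read_swappiness(path="/proc/sys/vm/swappiness"):
    """Return vm.swappiness as an int, or None if the procfs file is absent."""
    p = Path(path)
    return int(p.read_text().strip()) if p.exists() else None


if __name__ == "__main__":
    # 4294967296 bytes is exactly 4 * 1024**3, i.e. 4 GiB.
    print("CUDA_CACHE_MAXSIZE (GiB):", cache_limit_gib())
    print("vm.swappiness:", read_swappiness())
```

Run inside the container, the first value reflects the compose environment. The second reflects the host kernel setting, which the compose file itself does not change.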
136 lines
4.5 KiB
YAML
services:
  comfyui:
    build:
      context: .
      dockerfile: Dockerfile
      args:
        # Pin ComfyUI to a known-good commit/tag if desired
        COMFYUI_REF: "${COMFYUI_REF:-master}"
        # SageAttention ref (e.g., "main", "v2.2.0", or specific commit)
        SAGEATTN_REF: "${SAGEATTN_REF:-main}"

    image: sparkyui:cu130
    container_name: comfyui

    # GPU enablement
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

    # LAN exposure
    ports:
      - "${COMFYUI_PORT:-8188}:8188"

    environment:
      COMFYUI_PORT: "${COMFYUI_PORT:-8188}"
      # Optimized for Grace-Blackwell unified memory architecture
      # Key insight: DON'T use --gpu-only - let the unified memory fabric work naturally
      COMFYUI_FLAGS: "${COMFYUI_FLAGS:---listen 0.0.0.0 --port 8188 --disable-pinned-memory --dont-upcast-attention}"
      NVIDIA_VISIBLE_DEVICES: "all"
      NVIDIA_DRIVER_CAPABILITIES: "compute,utility"

      # Disable torch.compile/inductor - Triton doesn't support Blackwell sm_121a yet
      TORCH_COMPILE_DISABLE: "1"
      TORCHDYNAMO_DISABLE: "1"

      # Grace-Blackwell unified memory - removed aggressive CUDA tuning (5/21):
      # CUDA_CACHE_DISABLE, CUDA_DEVICE_MAX_CONNECTIONS, CUDA_DEVICE_MAX_COPY_CONNECTIONS,
      # CUDA_MODULE_LOADING=EAGER, CUDA_MANAGED_FORCE_DEVICE_ALLOC, OMP_NUM_THREADS
      # These were over-tuning. The ComfyUI flags + Sparky patch handle the architecture.
      # Keeping only CUBLAS_WORKSPACE_CONFIG for determinism.
      CUBLAS_WORKSPACE_CONFIG: ":0:0"

      # CUDA kernel caching - PTX→SASS compilation cache for GB10 (sm_121)
      # First run compiles kernels, subsequent runs reuse from disk. 3x speedup reported.
      # 4GB cache covers all typical ComfyUI kernel variants.
      CUDA_CACHE_MAXSIZE: "4294967296"

    volumes:
      # Models from existing ComfyUI install (read-only)
      - ${COMFYUI_HOST_PATH}/models:/opt/ComfyUI/models:ro

      # Custom nodes - comment out to use container-only (fresh) custom_nodes
      # If mounted, ComfyUI-Manager installs persist across container restarts
      - ${SPARKYUI_DATA_PATH}/custom_nodes:/opt/ComfyUI/custom_nodes

      # Outputs/inputs/workflows - persistent across restarts
      - ${SPARKYUI_DATA_PATH}/output:/opt/ComfyUI/output
      - ${SPARKYUI_DATA_PATH}/input:/opt/ComfyUI/input
      - ${SPARKYUI_DATA_PATH}/workflows:/opt/ComfyUI/workflows

      # Wheel cache (optional - for prebuilt wheels)
      - ${SPARKYUI_DATA_PATH}/wheels:/opt/wheels

      # Sparky patches - Grace-Blackwell unified memory optimizations
      # model_management.py: HIGH_VRAM→NORMAL_VRAM, intermediate_device()→cuda, soft_empty_cache skip,
      # 95% vram_for_weights, UNIFIED_MEMORY detection, offload devices → cuda
      # utils.py: copy=False on tensor.to(device) - avoids double-allocation on unified memory
      # where CPU and GPU share the same physical RAM (ComfyUI issue #10896)
      - ./patches/model_management.py:/opt/ComfyUI/comfy/model_management.py:ro
      - ./patches/utils.py:/opt/ComfyUI/comfy/utils.py:ro

    networks:
      - sparky_net

    # Health check - ComfyUI takes time to load, so generous start period
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8188/"]
      interval: 30s
      timeout: 10s
      start_period: 120s
      retries: 3

    restart: unless-stopped

  # ComfyUIMini - Mobile-friendly UI
  # Access at http://<host>:3000
  comfyuimini:
    build:
      context: ./comfyuimini
      dockerfile: Dockerfile
      args:
        COMFYUIMINI_REF: "${COMFYUIMINI_REF:-main}"

    image: comfyuimini:latest
    container_name: comfyuimini

    ports:
      - "${COMFYUIMINI_PORT:-3000}:3000"

    environment:
      # node-config override - connects to comfyui container via docker network
      NODE_CONFIG: >-
        {
          "app_port": 3000,
          "comfyui_url": "http://comfyui:8188",
          "comfyui_ws_url": "ws://comfyui:8188",
          "output_dir": "/shared/output",
          "reject_unauthorised_cert": false
        }

    volumes:
      # Share output directory with ComfyUI for gallery feature (read-only)
      - ${SPARKYUI_DATA_PATH}/output:/shared/output:ro
      # Persist server-side workflows
      - comfyuimini_workflows:/app/workflows

    networks:
      - sparky_net

    depends_on:
      comfyui:
        condition: service_healthy

    restart: unless-stopped

networks:
  sparky_net:
    driver: bridge

volumes:
  comfyuimini_workflows:
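
The ${VAR:-default} expressions used throughout this file fall back to the default when the variable is unset or empty. Note that the COMFYUI_FLAGS default itself begins with "--", which is why the expression reads ":---listen". The following is a minimal Python sketch of that fallback rule, useful for previewing what a given environment would produce. The helper name interpolate is hypothetical, and it covers only the two forms that appear in this file (${VAR} and ${VAR:-default}), not Compose's full interpolation syntax.

```python
import re

# Matches ${NAME} and ${NAME:-default}; the default may not contain "}".
_PATTERN = re.compile(r"\$\{(\w+)(?::-([^}]*))?\}")


def interpolate(text, env):
    """Substitute ${NAME} / ${NAME:-default} in text using the mapping env."""

    def repl(match):
        name, default = match.group(1), match.group(2)
        value = env.get(name, "")
        if value:  # set and non-empty: use the value
            return value
        # unset or empty: use the default if one was given, else empty string
        return default if default is not None else ""

    return _PATTERN.sub(repl, text)


if __name__ == "__main__":
    print(interpolate("${COMFYUI_PORT:-8188}:8188", {}))
    print(interpolate("${COMFYUI_PORT:-8188}:8188", {"COMFYUI_PORT": "9000"}))
```

Variables without a default, such as COMFYUI_HOST_PATH and SPARKYUI_DATA_PATH, resolve to an empty string in this sketch when unset, so they need to be defined before the volume paths are meaningful.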