feat: NORMAL_VRAM + AIMDO + copy=False patch + kernel caching

Major unified memory optimization changes:
1. model_management.py: HIGH_VRAM → NORMAL_VRAM
   - GB10 unified memory: offloading to CPU doesn't save physical RAM
     (same pool), but NORMAL_VRAM allows per-layer partial loading when
     memory is tight instead of all-or-nothing OOM
   - text_encoder_offload_device() and vae_offload_device() now return
     CPU (allows ComfyUI to offload unused models)
   - intermediate_device() still returns GPU (VAE outputs must stay in
     CUDA allocator for honest memory tracking)
   - User can force HIGH_VRAM with --highvram if models fit
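The device policy in item 1 can be sketched in a few lines of plain Python. This is an illustration only, not the patched ComfyUI code: `VRAMState` and `pick_devices` are hypothetical names, devices are plain strings, and the sketch assumes that HIGH_VRAM keeps the offload targets on the GPU.

```python
from enum import Enum


class VRAMState(Enum):
    """Hypothetical stand-in for the two modes named in the commit."""
    NORMAL_VRAM = "normal"
    HIGH_VRAM = "high"


def pick_devices(vram_state: VRAMState, gpu: str = "cuda", cpu: str = "cpu") -> dict:
    """Return the device each role would use under the policy described above."""
    # Assumption: HIGH_VRAM keeps offload targets on the GPU, while
    # NORMAL_VRAM lets idle text encoders and VAEs move to the CPU device.
    offload = gpu if vram_state is VRAMState.HIGH_VRAM else cpu
    return {
        "text_encoder_offload": offload,
        "vae_offload": offload,
        # Intermediates stay on the GPU in both modes so the CUDA
        # allocator keeps accounting for them.
        "intermediate": gpu,
    }


policy = pick_devices(VRAMState.NORMAL_VRAM)
print(policy)
```

Under NORMAL_VRAM only the two offload roles change; the intermediate device is the same in both modes.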
2. utils.py: copy=True → copy=False for tensor.to(device)
   - On GB10 unified memory, copy=True creates a full duplicate in both
     CPU and CUDA allocators simultaneously (ComfyUI issue #10896)
   - copy=False makes .to(device) a zero-copy device label change since
     both allocators draw from the same physical LPDDR5X
   - Halves model loading memory usage when --disable-mmap is set
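The "halves model loading memory usage" claim is simple arithmetic, sketched below. The sketch takes the commit's description at face value and uses a deliberately simplified model: it ignores activations, allocator overhead, and fragmentation. `peak_load_bytes` and the 20 GiB model size are hypothetical.

```python
GIB = 1024 ** 3


def peak_load_bytes(model_bytes: int, copy: bool) -> int:
    """Peak footprint in the shared pool while weights are moved to the device."""
    # With copy=True the source tensor and its duplicate coexist until the
    # source is released, so the shared pool briefly holds both.
    return model_bytes * 2 if copy else model_bytes


model = 20 * GIB  # hypothetical 20 GiB checkpoint
print(peak_load_bytes(model, copy=True) // GIB)   # 40
print(peak_load_bytes(model, copy=False) // GIB)  # 20
```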
3. Removed --disable-dynamic-vram from ComfyUI flags
   - Was preventing AIMDO (comfy_aimdo) from initializing
   - AIMDO now activates: VBAR-based page-level VRAM management at 32MB
     granularity instead of blunt .to(cpu) copies
   - Falls back to NORMAL_VRAM per-layer loading if AIMDO has issues
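To give a feel for what 32MB granularity means, the sketch below counts how many pages a tensor of a given size would span. It does not use or describe the comfy_aimdo API. `PAGE_BYTES` and `pages_needed` are hypothetical names, and "32MB" is read here as 32 MiB.

```python
MIB = 1024 ** 2
PAGE_BYTES = 32 * MIB  # assumption: "32MB" means 32 MiB


def pages_needed(tensor_bytes: int) -> int:
    """Number of fixed-size pages required to back a tensor (ceiling division)."""
    return -(-tensor_bytes // PAGE_BYTES)


# A hypothetical 100 MiB layer spans 4 pages, so memory pressure can be
# relieved one 32 MiB page at a time instead of moving the whole layer.
print(pages_needed(100 * MIB))  # 4
```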
4. Added CUDA_CACHE_MAXSIZE=4294967296 (4GB kernel cache)
   - PTX→SASS kernel caching for sm_121 (GB10 Blackwell)
   - 3x speedup on subsequent runs reported by DGX Spark community
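For launches outside docker-compose, the same setting can be applied from Python by building the child process environment before any CUDA library is loaded. This is a minimal sketch: `kernel_cache_env` is a hypothetical helper, and only the two environment variable names come from the commit.

```python
import os

FOUR_GIB = 4 * 1024 ** 3
assert FOUR_GIB == 4294967296  # the value used in the commit


def kernel_cache_env(max_bytes: int = FOUR_GIB) -> dict:
    """Copy the current environment with the CUDA JIT kernel cache enabled."""
    env = dict(os.environ)
    # The earlier configuration set CUDA_CACHE_DISABLE; drop it so the
    # compiled kernels are actually written to disk.
    env.pop("CUDA_CACHE_DISABLE", None)
    env["CUDA_CACHE_MAXSIZE"] = str(max_bytes)
    return env


env = kernel_cache_env()
print(env["CUDA_CACHE_MAXSIZE"])  # 4294967296
```

The resulting dictionary can be passed as `env=` to `subprocess.run` when starting the server process.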
5. System: vm.swappiness reduced from 60 to 1
   - Swap thrashing on unified memory causes silent system freezes
   - Near-zero swappiness ensures clean OOM kills instead
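A startup check can confirm that the host really runs with the reduced swappiness. The sketch below assumes a Linux host, where the current value is exposed at `/proc/sys/vm/swappiness`. The helper names and the limit of 1 are illustrative.

```python
from pathlib import Path

SWAPPINESS_PATH = "/proc/sys/vm/swappiness"  # standard Linux procfs location


def parse_swappiness(text: str) -> int:
    """Parse the contents of the swappiness file (a single integer)."""
    return int(text.strip())


def read_swappiness(path: str = SWAPPINESS_PATH) -> int:
    """Read the current vm.swappiness value from procfs."""
    return parse_swappiness(Path(path).read_text())


def swappiness_ok(value: int, limit: int = 1) -> bool:
    """True when swappiness is at or below the limit chosen in this commit."""
    return value <= limit
```

On the target machine, `swappiness_ok(read_swappiness())` should return `True` after `sysctl vm.swappiness=1` has been applied.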
+15
-11
@@ -37,18 +37,18 @@ services:
 TORCH_COMPILE_DISABLE: "1"
 TORCHDYNAMO_DISABLE: "1"

-# Grace-Blackwell unified memory optimizations
-CUDA_CACHE_DISABLE: "1"
-# PYTORCH_NO_CUDA_MEMORY_CACHING removed — our model_management patch handles
-# caching properly by skipping empty_cache() on unified memory instead of disabling
-# PyTorch's allocator entirely. Keeping caching ON reduces allocation overhead.
-CUDA_DEVICE_MAX_CONNECTIONS: "1"
-CUDA_DEVICE_MAX_COPY_CONNECTIONS: "4"
-CUDA_MODULE_LOADING: "EAGER"
-CUDA_MANAGED_FORCE_DEVICE_ALLOC: "1"
-OMP_NUM_THREADS: "20"
+# Grace-Blackwell unified memory — removed aggressive CUDA tuning (5/21):
+# CUDA_CACHE_DISABLE, CUDA_DEVICE_MAX_CONNECTIONS, CUDA_DEVICE_MAX_COPY_CONNECTIONS,
+# CUDA_MODULE_LOADING=EAGER, CUDA_MANAGED_FORCE_DEVICE_ALLOC, OMP_NUM_THREADS
+# These were over-tuning. The ComfyUI flags + Sparky patch handle the architecture.
+# Keeping only CUBLAS_WORKSPACE_CONFIG for determinism.
 CUBLAS_WORKSPACE_CONFIG: ":0:0"

+# CUDA kernel caching — PTX→SASS compilation cache for GB10 (sm_121)
+# First run compiles kernels, subsequent runs reuse from disk. 3x speedup reported.
+# 4GB cache covers all typical ComfyUI kernel variants.
+CUDA_CACHE_MAXSIZE: "4294967296"
+
 volumes:
 # Models from existing ComfyUI install (read-only)
 - ${COMFYUI_HOST_PATH}/models:/opt/ComfyUI/models:ro
@@ -66,8 +66,12 @@ services:
 - ${SPARKYUI_DATA_PATH}/wheels:/opt/wheels

 # Sparky patches - Grace-Blackwell unified memory optimizations
-# This overrides ComfyUI's model_management.py with our patched version
+# model_management.py: HIGH_VRAM→NORMAL_VRAM, intermediate_device()→cuda, soft_empty_cache skip,
+# 95% vram_for_weights, UNIFIED_MEMORY detection, offload devices → cuda
+# utils.py: copy=False on tensor.to(device) — avoids double-allocation on unified memory
+# where CPU and GPU share the same physical RAM (ComfyUI issue #10896)
 - ./patches/model_management.py:/opt/ComfyUI/comfy/model_management.py:ro
+- ./patches/utils.py:/opt/ComfyUI/comfy/utils.py:ro

 networks:
 - sparky_net