feat: NORMAL_VRAM + AIMDO + copy=False patch + kernel caching

Major unified memory optimization changes:

1. model_management.py: HIGH_VRAM → NORMAL_VRAM
   - GB10 unified memory: offloading to CPU doesn't save physical RAM
     (same pool), but NORMAL_VRAM allows per-layer partial loading when
     memory is tight instead of all-or-nothing OOM
   - text_encoder_offload_device() and vae_offload_device() now return
     CPU (allows ComfyUI to offload unused models)
   - intermediate_device() still returns GPU (VAE outputs must stay in
     CUDA allocator for honest memory tracking)
   - User can force HIGH_VRAM with --highvram if models fit

2. utils.py: copy=True → copy=False for tensor.to(device)
   - On GB10 unified memory, copy=True creates a full duplicate in both
     CPU and CUDA allocators simultaneously (ComfyUI issue #10896)
   - copy=False makes .to(device) a zero-copy device label change since
     both allocators draw from the same physical LPDDR5X
   - Halves model loading memory usage when --disable-mmap is set

3. Removed --disable-dynamic-vram from ComfyUI flags
   - Was preventing AIMDO (comfy_aimdo) from initializing
   - AIMDO now activates: VBAR-based page-level VRAM management at 32MB
     granularity instead of blunt .to(cpu) copies
   - Falls back to NORMAL_VRAM per-layer loading if AIMDO has issues

4. Added CUDA_CACHE_MAXSIZE=4294967296 (4GB kernel cache)
   - PTX→SASS kernel caching for sm_121 (GB10 Blackwell)
   - 3x speedup on subsequent runs reported by DGX Spark community

5. System: vm.swappiness reduced from 60 to 1
   - Swap thrashing on unified memory causes silent system freezes
   - Near-zero swappiness ensures clean OOM kills instead
This commit is contained in:
Evan Carmen
2026-05-21 19:04:25 -05:00
parent c803ea6146
commit 6fa6c5041b
3 changed files with 1485 additions and 18 deletions
+15 -11
View File
@@ -37,18 +37,18 @@ services:
TORCH_COMPILE_DISABLE: "1" TORCH_COMPILE_DISABLE: "1"
TORCHDYNAMO_DISABLE: "1" TORCHDYNAMO_DISABLE: "1"
# Grace-Blackwell unified memory optimizations # Grace-Blackwell unified memory — removed aggressive CUDA tuning (5/21):
CUDA_CACHE_DISABLE: "1" # CUDA_CACHE_DISABLE, CUDA_DEVICE_MAX_CONNECTIONS, CUDA_DEVICE_MAX_COPY_CONNECTIONS,
# PYTORCH_NO_CUDA_MEMORY_CACHING removed — our model_management patch handles # CUDA_MODULE_LOADING=EAGER, CUDA_MANAGED_FORCE_DEVICE_ALLOC, OMP_NUM_THREADS
# caching properly by skipping empty_cache() on unified memory instead of disabling # These were over-tuning. The ComfyUI flags + Sparky patch handle the architecture.
# PyTorch's allocator entirely. Keeping caching ON reduces allocation overhead. # Keeping only CUBLAS_WORKSPACE_CONFIG for determinism.
CUDA_DEVICE_MAX_CONNECTIONS: "1"
CUDA_DEVICE_MAX_COPY_CONNECTIONS: "4"
CUDA_MODULE_LOADING: "EAGER"
CUDA_MANAGED_FORCE_DEVICE_ALLOC: "1"
OMP_NUM_THREADS: "20"
CUBLAS_WORKSPACE_CONFIG: ":0:0" CUBLAS_WORKSPACE_CONFIG: ":0:0"
# CUDA kernel caching — PTX→SASS compilation cache for GB10 (sm_121)
# First run compiles kernels, subsequent runs reuse from disk. 3x speedup reported.
# 4GB cache covers all typical ComfyUI kernel variants.
CUDA_CACHE_MAXSIZE: "4294967296"
volumes: volumes:
# Models from existing ComfyUI install (read-only) # Models from existing ComfyUI install (read-only)
- ${COMFYUI_HOST_PATH}/models:/opt/ComfyUI/models:ro - ${COMFYUI_HOST_PATH}/models:/opt/ComfyUI/models:ro
@@ -66,8 +66,12 @@ services:
- ${SPARKYUI_DATA_PATH}/wheels:/opt/wheels - ${SPARKYUI_DATA_PATH}/wheels:/opt/wheels
# Sparky patches - Grace-Blackwell unified memory optimizations # Sparky patches - Grace-Blackwell unified memory optimizations
# This overrides ComfyUI's model_management.py with our patched version # model_management.py: HIGH_VRAM→NORMAL_VRAM, intermediate_device()→cuda, soft_empty_cache skip,
# 95% vram_for_weights, UNIFIED_MEMORY detection, offload devices → cuda
# utils.py: copy=False on tensor.to(device) — avoids double-allocation on unified memory
# where CPU and GPU share the same physical RAM (ComfyUI issue #10896)
- ./patches/model_management.py:/opt/ComfyUI/comfy/model_management.py:ro - ./patches/model_management.py:/opt/ComfyUI/comfy/model_management.py:ro
- ./patches/utils.py:/opt/ComfyUI/comfy/utils.py:ro
networks: networks:
- sparky_net - sparky_net
+16 -7
View File
@@ -507,13 +507,22 @@ def _is_unified_memory():
UNIFIED_MEMORY = _is_unified_memory() UNIFIED_MEMORY = _is_unified_memory()
if UNIFIED_MEMORY: if UNIFIED_MEMORY:
# On unified memory, offloading to CPU is pointless (same physical chips) # On unified memory, NORMAL_VRAM allows ComfyUI to offload unused model
# HIGH_VRAM keeps everything on GPU and skips the offload/onload cycle # layers to CPU when memory is tight. Since CPU and GPU share the same
# physical RAM on GB10, offloaded layers stay in the same physical pool
# but through a different allocator. Per-layer partial loading (LowVramPatch)
# means only individual layers are copied on-demand, not whole models,
# keeping peak memory manageable.
# HIGH_VRAM is available via --highvram if everything fits in VRAM.
if not (args.highvram or args.gpu_only): if not (args.highvram or args.gpu_only):
logging.info("[Sparky] Grace-Blackwell unified memory detected — " logging.info("[Sparky] Grace-Blackwell unified memory detected — "
"setting HIGH_VRAM mode (no CPU offloading)") "keeping NORMAL_VRAM mode (allows layer offloading)")
vram_state = VRAMState.HIGH_VRAM else:
logging.info(f"[Sparky] Set vram state to: {vram_state.name} (unified memory override)") logging.info("[Sparky] Grace-Blackwell unified memory detected — "
"HIGH_VRAM requested via --highvram")
# Don't override vram_state — let ComfyUI's default NORMAL_VRAM handle
# offloading. User can force HIGH_VRAM with --highvram if models fit.
logging.info(f"[Sparky] Set vram state to: {vram_state.name} (unified memory)")
else: else:
logging.info(f"Set vram state to: {vram_state.name}") logging.info(f"Set vram state to: {vram_state.name}")
@@ -1054,7 +1063,7 @@ def unet_manual_cast(weight_dtype, inference_device, supported_dtypes=[torch.flo
return torch.float32 return torch.float32
def text_encoder_offload_device(): def text_encoder_offload_device():
if args.gpu_only or UNIFIED_MEMORY: if args.gpu_only:
return get_torch_device() return get_torch_device()
else: else:
return torch.device("cpu") return torch.device("cpu")
@@ -1123,7 +1132,7 @@ def vae_device():
return get_torch_device() return get_torch_device()
def vae_offload_device(): def vae_offload_device():
if args.gpu_only or UNIFIED_MEMORY: if args.gpu_only:
return get_torch_device() return get_torch_device()
else: else:
return torch.device("cpu") return torch.device("cpu")
+1454
View File
File diff suppressed because it is too large Load Diff