feat: NORMAL_VRAM + AIMDO + copy=False patch + kernel caching

Major unified memory optimization changes:
1. model_management.py: HIGH_VRAM → NORMAL_VRAM
- GB10 unified memory: offloading to CPU doesn't save physical RAM
(same pool), but NORMAL_VRAM allows per-layer partial loading when
memory is tight instead of all-or-nothing OOM
- text_encoder_offload_device() and vae_offload_device() now return
CPU (allows ComfyUI to offload unused models)
- intermediate_device() still returns GPU (VAE outputs must stay in
the CUDA allocator so memory tracking remains accurate)
- User can force HIGH_VRAM with --highvram if models fit
2. utils.py: copy=True → copy=False for tensor.to(device)
- On GB10 unified memory, copy=True leaves a full copy of each tensor in
both the CPU and CUDA allocators at the same time (ComfyUI issue #10896)
- With copy=False, .to(device) returns the tensor as-is when it already
has the target device and dtype, so no duplicate is allocated from the
shared physical LPDDR5X pool
- Halves model loading memory usage when --disable-mmap is set
3. Removed --disable-dynamic-vram from ComfyUI flags
- Was preventing AIMDO (comfy_aimdo) from initializing
- AIMDO now activates: VBAR-based page-level VRAM management at 32MB
granularity instead of blunt .to(cpu) copies
- Falls back to NORMAL_VRAM per-layer loading if AIMDO has issues
4. Added CUDA_CACHE_MAXSIZE=4294967296 (4GB kernel cache)
- PTX→SASS kernel caching for sm_121 (GB10 Blackwell)
- 3x speedup on subsequent runs reported by DGX Spark community
5. System: vm.swappiness reduced from 60 to 1
- Swap thrashing on unified memory causes silent system freezes
- Near-zero swappiness ensures clean OOM kills instead
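
Items 4 and 5 can be checked from Python. The sketch below is an illustration only and is not part of this commit: the helper names are hypothetical, the /proc path assumes Linux, and the cache variable should be set before any CUDA library is loaded (for example, before importing torch).

```python
import os
from pathlib import Path
from typing import Optional

# 4 GiB expressed in bytes; this is the value used in item 4.
KERNEL_CACHE_BYTES = 4 * 1024**3


def configure_kernel_cache(env=os.environ) -> str:
    """Set CUDA_CACHE_MAXSIZE unless a value is already present.

    Returns the value that ends up in the environment mapping.
    """
    return env.setdefault("CUDA_CACHE_MAXSIZE", str(KERNEL_CACHE_BYTES))


def read_swappiness(path: str = "/proc/sys/vm/swappiness") -> Optional[int]:
    """Return the current vm.swappiness, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None
```

Calling configure_kernel_cache() at the top of a launcher script and logging read_swappiness() at startup makes it easy to confirm that both settings are in effect on a given machine.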
@@ -507,13 +507,22 @@ def _is_unified_memory():
 UNIFIED_MEMORY = _is_unified_memory()

 if UNIFIED_MEMORY:
-    # On unified memory, offloading to CPU is pointless (same physical chips)
-    # HIGH_VRAM keeps everything on GPU and skips the offload/onload cycle
+    # On unified memory, NORMAL_VRAM allows ComfyUI to offload unused model
+    # layers to CPU when memory is tight. Since CPU and GPU share the same
+    # physical RAM on GB10, offloaded layers stay in the same physical pool
+    # but through a different allocator. Per-layer partial loading (LowVramPatch)
+    # means only individual layers are copied on-demand, not whole models,
+    # keeping peak memory manageable.
+    # HIGH_VRAM is available via --highvram if everything fits in VRAM.
     if not (args.highvram or args.gpu_only):
         logging.info("[Sparky] Grace-Blackwell unified memory detected — "
-                     "setting HIGH_VRAM mode (no CPU offloading)")
-        vram_state = VRAMState.HIGH_VRAM
-    logging.info(f"[Sparky] Set vram state to: {vram_state.name} (unified memory override)")
+                     "keeping NORMAL_VRAM mode (allows layer offloading)")
+    else:
+        logging.info("[Sparky] Grace-Blackwell unified memory detected — "
+                     "HIGH_VRAM requested via --highvram")
+    # Don't override vram_state — let ComfyUI's default NORMAL_VRAM handle
+    # offloading. User can force HIGH_VRAM with --highvram if models fit.
+    logging.info(f"[Sparky] Set vram state to: {vram_state.name} (unified memory)")
 else:
     logging.info(f"Set vram state to: {vram_state.name}")

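
The hunk above changes what happens to vram_state on unified memory. As a rough sketch, the decision can be written as a pure function. The enum and function below are simplified stand-ins written for illustration, not ComfyUI's actual definitions, and they assume that --highvram or --gpu-only has already selected HIGH_VRAM earlier in the module.

```python
from enum import Enum


class VRAMState(Enum):
    # Simplified stand-in for ComfyUI's VRAMState; only two members shown.
    NORMAL_VRAM = "normal"
    HIGH_VRAM = "high"


def resolve_vram_state(
    unified_memory: bool,
    highvram: bool,
    gpu_only: bool,
    legacy_override: bool = False,
) -> VRAMState:
    """Hypothetical helper comparing pre- and post-commit behaviour.

    legacy_override=True reproduces the old rule, where unified memory
    forced HIGH_VRAM even without an explicit flag.
    """
    # Assumption: explicit flags select HIGH_VRAM regardless of hardware.
    if highvram or gpu_only:
        return VRAMState.HIGH_VRAM
    # Old behaviour: unified memory alone was enough to force HIGH_VRAM.
    if unified_memory and legacy_override:
        return VRAMState.HIGH_VRAM
    # New behaviour: keep the default so individual layers can be offloaded.
    return VRAMState.NORMAL_VRAM
```

With the default legacy_override=False, a unified-memory machine without flags resolves to NORMAL_VRAM, which is the state the new log message reports.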
@@ -1054,7 +1063,7 @@ def unet_manual_cast(weight_dtype, inference_device, supported_dtypes=[torch.flo
     return torch.float32

 def text_encoder_offload_device():
-    if args.gpu_only or UNIFIED_MEMORY:
+    if args.gpu_only:
         return get_torch_device()
     else:
         return torch.device("cpu")
@@ -1123,7 +1132,7 @@ def vae_device():
     return get_torch_device()

 def vae_offload_device():
-    if args.gpu_only or UNIFIED_MEMORY:
+    if args.gpu_only:
         return get_torch_device()
     else:
         return torch.device("cpu")
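
After these two hunks, both offload helpers follow the same rule: only --gpu-only keeps idle models on the GPU. A minimal sketch of that rule follows. It uses plain strings for device names so it runs without PyTorch; the function name and its string return values are illustrative assumptions, whereas the real helpers return torch.device objects.

```python
def offload_device(gpu_only: bool, unified_memory: bool, *, pre_commit: bool = False) -> str:
    """Hypothetical stand-in for text_encoder_offload_device() and vae_offload_device()."""
    # Old condition: args.gpu_only or UNIFIED_MEMORY. New condition: args.gpu_only.
    keep_on_gpu = gpu_only or (pre_commit and unified_memory)
    return "cuda" if keep_on_gpu else "cpu"


# Print the old and new result for every flag combination.
for gpu_only in (False, True):
    for unified in (False, True):
        old = offload_device(gpu_only, unified, pre_commit=True)
        new = offload_device(gpu_only, unified)
        print(f"gpu_only={gpu_only!s:5} unified={unified!s:5} old={old:4} new={new}")
```

The only row where old and new differ is gpu_only=False with unified=True, which is exactly the case this commit targets.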
+1454
File diff suppressed because it is too large