The mounted patches/model_management.py and patches/utils.py were authored
against an older ComfyUI, but COMFYUI_REF=master clones the latest. Upstream
added the DynamicVRAM/AIMDO system, and main.py now calls
model_management.get_all_torch_devices() (13 functions were missing in total),
causing comfyui to crash-loop on startup with AttributeError.
Regenerated both patches from the current master files and re-applied the
documented Sparky edits on top so they stay API-compatible:
- model_management.py: unified-memory detection, NORMAL_VRAM retention,
95% weight ratio, intermediate_device()->cuda, soft_empty_cache skip
- utils.py: copy=False tensor load on unified memory
comfyui now starts cleanly with DynamicVRAM enabled and the Sparky
unified-memory path active.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Major unified memory optimization changes:
1. model_management.py: HIGH_VRAM → NORMAL_VRAM
- GB10 unified memory: offloading to CPU doesn't save physical RAM
(same pool), but NORMAL_VRAM allows per-layer partial loading when
memory is tight instead of all-or-nothing OOM
- text_encoder_offload_device() and vae_offload_device() now return
CPU (allows ComfyUI to offload unused models)
- intermediate_device() still returns GPU (VAE outputs must stay in
CUDA allocator for honest memory tracking)
- User can force HIGH_VRAM with --highvram if models fit
2. utils.py: copy=True → copy=False for tensor.to(device)
- On GB10 unified memory, copy=True creates a full duplicate in both
CPU and CUDA allocators simultaneously (ComfyUI issue #10896)
- copy=False makes .to(device) a zero-copy device label change since
both allocators draw from the same physical LPDDR5X
- Halves model loading memory usage when --disable-mmap is set
3. Removed --disable-dynamic-vram from ComfyUI flags
- Was preventing AIMDO (comfy_aimdo) from initializing
- AIMDO now activates: VBAR-based page-level VRAM management at 32MB
granularity instead of blunt .to(cpu) copies
- Falls back to NORMAL_VRAM per-layer loading if AIMDO has issues
4. Added CUDA_CACHE_MAXSIZE=4294967296 (4GB kernel cache)
- PTX→SASS kernel caching for sm_121 (GB10 Blackwell)
- 3x speedup on subsequent runs reported by DGX Spark community
5. System: vm.swappiness reduced from 60 to 1
- Swap thrashing on unified memory causes silent system freezes
- Near-zero swappiness ensures clean OOM kills instead
On Grace-Blackwell (GB10), CPU and GPU share the same physical RAM.
intermediate_device() was returning 'cpu', which means ComfyUI allocates
output buffers (like VAE decode) through the CPU allocator on the same
physical memory pool it thinks is free VRAM. This causes:
1. Memory accounting mismatch — ComfyUI thinks intermediates are 'over
there' on CPU and overestimates available VRAM
2. Unnecessary .to(device) copies through separate allocator heaps
3. Heap fragmentation across the unified memory pool
Now matches text_encoder_offload_device() and vae_offload_device() which
already return get_torch_device() on UNIFIED_MEMORY.
intermediate_device() controls where large output tensors (decoded video
frames) are accumulated. On unified memory, cpu and cuda:0 share the same
physical RAM, but the CUDA allocator has different fragmentation behavior.
With intermediate_device=cuda:0, LTX video VAE decode hung because
tiled_scale_multidim allocates the full output tensor on cuda:0 upfront,
and the CUDA allocator can't efficiently reclaim space during tiled
decode. Reverting to cpu fixes the hang.
vae_offload_device() and text_encoder_offload_device() remain cuda:0
since those model-loading paths benefit from GPU allocation.