Files

T

Evan Carmen 7e4d22e41c feat: Grace-Blackwell unified memory optimization for ComfyUI

- Add model_management.py patch: detects GB10 unified memory (VRAM ≈ RAM > 0.95)
- Set HIGH_VRAM mode: no pointless CPU offloading (same physical memory pool)
- Increase maximum_vram_for_weights from 88% to 95% (8.4GB headroom on 128GB)
- Skip torch.cuda.empty_cache() on unified memory (avoids page faults)
- Return GPU for text_encoder/vae/intermediate offload devices on unified memory
- MPS excluded from unified detection (has its own SHARED state)
- Remove PYTORCH_NO_CUDA_MEMORY_CACHING env var (patch handles caching properly)
- Mount patched file as read-only volume override in docker-compose.yml
- DeepSeek review: safe and correct for DGX Spark target

Co-authored-by: DeepSeek (code review)

2026-05-20 16:01:51 -05:00

14 KiB

Raw Permalink Blame History

Review: Grace-Blackwell Unified Memory Optimization Patch

Date: 2026-05-20 Reviewed files:

patches/model_management.py (1903 lines, patched)
docker-compose.yml (environment variables, volume mount)

Scope: The patch modifies 3 locations in ComfyUI's model_management.py:

Unified memory detection + HIGH_VRAM override (new, lines 462–515)
maximum_vram_for_weights() — 95% instead of 88% (lines 968–974)
soft_empty_cache() — skip empty_cache() on unified memory (lines 1847–1854)

The patched file is a direct descendant of upstream ComfyUI (identical line count: 1903), with only these 3 changes applied. The docker-compose mounts it as a read-only volume override.

CRITICAL

C1. MPS `vram_state` override — SHARED silently replaced by HIGH_VRAM

File: patches/model_management.py, lines 460 vs 506–512

Problem: The execution order is:

line 460:  if cpu_state == CPUState.MPS: vram_state = VRAMState.SHARED
line 475:  _is_unified_memory() returns True for MPS
line 504:  UNIFIED_MEMORY = True
line 512:  vram_state = VRAMState.HIGH_VRAM          ← SHARED overwritten!

MPS (Apple Silicon) already has its own unified memory path (VRAMState.SHARED, documented as "No dedicated vram: memory shared between CPU and GPU but models still need to be moved between both"). The patch overrides this with HIGH_VRAM, which has different semantics.

Impact in this project: None. MPS is unreachable in the Sparky Docker container (ARM64 Linux + NVIDIA CUDA).

Impact if upstreamed: This would break Apple Silicon behavior. SHARED and HIGH_VRAM are distinct states — for example, unet_offload_device() returns GPU for HIGH_VRAM but CPU for SHARED (line 942).

Fix (2 options):

Option A (preferred): Remove MPS from _is_unified_memory():

if cpu_state == CPUState.MPS:
    return False  # MPS already handles unified memory via VRAMState.SHARED

Option B: Guard the override:

if UNIFIED_MEMORY and cpu_state != CPUState.MPS:
    vram_state = VRAMState.HIGH_VRAM

C2. MPS `soft_empty_cache()` — `torch.mps.empty_cache()` silently skipped

File: patches/model_management.py, lines 1845–1858

Problem: On MPS, the execution flow is:

line 1846:  cpu_mode() → False (MPS ≠ CPU)
line 1851:  UNIFIED_MEMORY and not force → True (MPS detected as unified)
line 1854:  return          ← EXITS HERE
line 1857:  cpu_state == CPUState.MPS → NEVER REACHED
line 1858:  torch.mps.empty_cache() → NEVER CALLED

The unified memory early return at line 1854 sits before the MPS branch at line 1857, so torch.mps.empty_cache() is dead code when _is_unified_memory() returns True for MPS.

Impact in this project: None. MPS unreachable in CUDA container.

Fix: Same as C1. Removing MPS from _is_unified_memory() fixes both issues. Alternatively, move the MPS check before the unified memory check in soft_empty_cache():

if cpu_state == CPUState.MPS:
    torch.mps.empty_cache()
    return
if UNIFIED_MEMORY and not force:
    ...

WARNING

W1. Ratio-based detection false positive: VRAM > RAM on high-end discrete GPUs

Scenario: A system with GPU VRAM exceeding system RAM, e.g.:

A100 80GB + 64GB system RAM → ratio = 1.25 > 0.95
H100 80GB + 64GB system RAM → ratio = 1.25 > 0.95

The patch would classify these as unified memory and set HIGH_VRAM. However, in this case HIGH_VRAM is actually desirable — with more VRAM than RAM, you never want CPU offloading because models won't fit in system RAM. So this false positive has benign consequences.

False positive that would be harmful: A system where VRAM ≈ RAM but they are physically separate pools. Example: 48GB GPU + 48GB system RAM = ratio 1.0. With 48GB discrete VRAM, if a model is 46GB, you'd want CPU offloading for other models. The patch would prevent that. However, such configurations are essentially nonexistent in practice — discrete GPU VRAM is almost always substantially smaller than system RAM.

Mitigation: The device name string check ('grace', 'gb10', 'gb200') provides an explicit detection path that would catch true Grace-Blackwell even if the ratio check missed.

Risk: Low. True false positives (harmful ones) require VRAM ≈ RAM on a non-unified system, which is rare.

W2. `maximum_vram_for_weights()` reserves only 2GB — borderline for large batch inference

File: patches/model_management.py, lines 968–974

if UNIFIED_MEMORY:
    return (get_total_memory(device) * 0.95 - 2 * 1024 * 1024 * 1024)

On the 128GB Spark, this yields ~119.6GB available for weights. The remaining ~8.4GB (128 × 0.05 + 2GB) covers:

OS memory (kernel, system services)
Docker container overhead
PyTorch runtime (autocast buffers, CUDA contexts)
Intermediate tensors during inference (activations, attention maps)
VAE decode buffers

For typical ComfyUI workflows (SDXL, Flux, LTX-2), 8.4GB of headroom is generous. However, for extreme batch sizes or high-resolution video models that allocate large intermediate tensors, this could be tight.

Comparison to upstream: The upstream formula is total * 0.88 - minimum_inference_memory() where minimum_inference_memory() = 0.8GB + extra_reserved_memory() (0.4GB default) = ~1.2GB. On a discrete 24GB card: 240.88 - 1.2 = ~19.9GB. The patched formula on 128GB: 1280.95 - 2 = ~119.6GB. The patch uses a larger fraction (95% vs 88%) which is appropriate for unified memory — there's no separate "VRAM pool" to protect.

Recommendation: Consider making the 2GB reserve configurable, or basing it on a fraction of total memory rather than a fixed value. 2GB is 1.5% of 128GB, which is reasonable, but if someone ran this on a hypothetical 32GB Grace-Blackwell system, 2GB would be 6.25%.

W3. `'grace'` substring in device name is broad

File: patches/model_management.py, line 493

is_gb = 'gb10' in device_name or 'gb200' in device_name or 'grace' in device_name

The substring 'grace' could theoretically match future unrelated NVIDIA hardware with "grace" in the marketing name. This is unlikely to cause problems because:

If it matches erroneously, it would set HIGH_VRAM, which is generally fine
The detection only matters for unified memory systems
If NVIDIA releases a non-unified "Grace-something" GPU, it would have discrete VRAM < system RAM, so it would also need explicit detection

Risk: Very low. The ratio check (> 0.95) acts as a second factor for any name-based match.

W4. Some offload functions still return CPU on unified memory

Functions affected: text_encoder_offload_device() (line 1053), vae_offload_device() (line 1122), intermediate_device() (line 1105)

These functions check args.gpu_only rather than vram_state, so they still return CPU on unified memory unless --gpu-only is passed. The docker-compose explicitly does NOT pass --gpu-only (line 32 comment: "DON'T use --gpu-only - let the unified memory fabric work naturally").

In practice, this is harmless for correctness because:

text_encoder_device() (line 1059) correctly returns GPU under HIGH_VRAM
The offload device only matters when models are unloaded, and on unified memory "CPU offload" is just a pointer/address space change, not a physical memory copy

But it means models could ping-pong between CPU and GPU address spaces unnecessarily for offload/reload cycles.

Recommendation: Consider making these functions aware of UNIFIED_MEMORY so they return GPU when appropriate.

INFO

I1. Detection heuristics are sound for the target platform

On the DGX Spark (GB10, 128GB unified memory):

torch.cuda.get_device_properties(0).total_memory reports ~128GB
psutil.virtual_memory().total reports ~128GB
ratio = 128/128 ≈ 1.0 > 0.95 → unified memory detected
Device name contains "GB10" → name-based detection also triggers

Both detection paths independently confirm unified memory. The ratio check acts as a generic fallback; the name check provides explicit targeting.

I2. HIGH_VRAM mode behavior chains are correct for unified memory

With vram_state = VRAMState.HIGH_VRAM:

Function	Behavior	Correct for Unified Memory?
`unet_offload_device()`	Returns GPU	Yes
`unet_inital_load_device()`	Returns GPU (HIGH_VRAM in check)	Yes
`text_encoder_device()`	Returns GPU (HIGH_VRAM in check)	Yes
`free_memory()` soft_empty_cache guard	Skips cache flush (HIGH_VRAM guard)	Yes
`load_models_gpu()` lowvram skip	Skips lowvram logic (HIGH_VRAM guard)	Yes

All the primary memory management paths correctly keep models on GPU when HIGH_VRAM is set. The unet_inital_load_device() function (line 953) already treats HIGH_VRAM and SHARED identically, confirming the design intent.

I3. `soft_empty_cache()` change is well-motivated but has subtle callers

File: patches/model_management.py, lines 1845–1854

The patch skips torch.cuda.empty_cache() + torch.cuda.ipc_collect() on unified memory because releasing PyTorch's cached allocator blocks back to the OS causes page faults when PyTorch re-allocates from the same physical pool.

The 5 call sites of soft_empty_cache():

Call site	Line	Guarded by HIGH_VRAM?	OK to skip empty_cache?
`free_memory()` after unloading	758	Yes (line 760)	Yes — won't reach this path
`free_memory()` torch free > 25%	763	Yes (line 760)	Yes — won't reach this path
`cleanup_models_gc()` leak detect	902	No	Yes — rare, diagnostic path
`get_cast_buffer()` >50MB buffer	1266	No	Ok — allocator reuses memory
`reset_cast_buffers()` streams	1296	No	Ok — allocator reuses memory

The two unguarded callers in get_cast_buffer() and reset_cast_buffers() are fine — they delete old buffers and want PyTorch to reuse the memory, which the caching allocator will do internally without releasing to the OS.

I4. Docker compose environment variables are consistent with the patch

File: docker-compose.yml, lines 40–50

Variable	Value	Rationale
`CUDA_CACHE_DISABLE`	`1`	Disable CUDA kernel cache (unified memory doesn't benefit)
`PYTORCH_NO_CUDA_MEMORY_CACHING`	Removed	Patch handles caching properly; keeping PyTorch allocator ON is more efficient
`CUDA_DEVICE_MAX_CONNECTIONS`	`1`	Single GPU, no multi-device contention
`CUDA_MANAGED_FORCE_DEVICE_ALLOC`	`1`	Force CUDA managed memory to device allocation
`OMP_NUM_THREADS`	`20`	Matches 20-core ARM CPU
`CUBLAS_WORKSPACE_CONFIG`	`:0:0`	Minimal workspace for cuBLAS

The removal of PYTORCH_NO_CUDA_MEMORY_CACHING=1 is correct — that env var disables PyTorch's caching allocator entirely, which would cause massive allocation overhead. The patched soft_empty_cache() achieves the same goal (not releasing to OS) without the cost of disabling the allocator.

The --disable-pinned-memory flag (line 32) is correct for unified memory — pinning is unnecessary when GPU and CPU share physical memory.

I5. No `is_wsl()` unified memory check — probably fine

WSL2 with GPU-PV (GPU Paravirtualization) can report VRAM and RAM in ways that might look like unified memory, but the ratio would still be VRAM < RAM (WSL2 reports the physical GPU's VRAM, not unified). No action needed.

I6. Volume mount strategy

The read-only volume mount at line 70 (./patches/model_management.py:/opt/ComfyUI/comfy/model_management.py:ro) is clean — it overlays the patch without modifying the container image. This means:

docker compose down && docker compose up picks up patch changes
docker compose build is not needed for patch iteration
The patch is source-controlled in the Sparky repo

Other Functions That COULD Be Patched (Not Required)

These are opportunities for further optimization, not bugs:

`text_encoder_offload_device()` (line 1053)

Returns CPU unless --gpu-only. On unified memory, offloading to CPU is pointless. Could check UNIFIED_MEMORY and return GPU.

`vae_offload_device()` (line 1122)

Same issue — returns CPU unless --gpu-only. VAE offload to CPU is a no-op on unified memory.

`intermediate_device()` (line 1105)

Returns CPU unless --gpu-only. Intermediate tensors should stay on GPU in unified memory.

`PINNED_MEMORY` / `MAX_PINNED_MEMORY` (lines 1395–1404)

Pinned memory is pointless on unified memory — CPU and GPU share physical pages. The --disable-pinned-memory flag handles this at the CLI level, but the code still allocates MAX_PINNED_MEMORY. Could add a guard.

Summary

Severity	Count	Actionable in this project?
CRITICAL	2	No — MPS unreachable in CUDA container
WARNING	4	W2 (2GB reserve) worth monitoring
INFO	6	I4 env vars well-configured

Verdict: The patch is safe and correct for the DGX Spark target. The CRITICAL items are MPS regressions that are unreachable in this Docker container but would need fixing before upstreaming. The ratio-based detection heuristic is robust — the > 0.95 threshold effectively filters all real-world discrete GPU configurations. The soft_empty_cache() change and HIGH_VRAM override are the right approach for unified memory.

Recommended fixes before upstreaming:

Remove MPS from _is_unified_memory() (fixes both C1 and C2)
Consider making the 2GB reserve in maximum_vram_for_weights() a fraction of total memory
Optionally add UNIFIED_MEMORY awareness to text_encoder_offload_device(), vae_offload_device(), and intermediate_device()

14 KiB Raw Permalink Blame History Unescape Escape

Review: Grace-Blackwell Unified Memory Optimization Patch

CRITICAL

C1. MPS vram_state override — SHARED silently replaced by HIGH_VRAM

C2. MPS soft_empty_cache() — torch.mps.empty_cache() silently skipped