SparkyUI

TBNilles/SparkyUI

Commit Graph

Author	SHA1	Message	Date
Evan Carmen

c803ea6146

fix: intermediate_device() returns cuda on unified memory

On Grace-Blackwell (GB10), CPU and GPU share the same physical RAM.
intermediate_device() was returning 'cpu', which means ComfyUI allocates
output buffers (like VAE decode) through the CPU allocator on the same
physical memory pool it thinks is free VRAM. This causes:

1. Memory accounting mismatch — ComfyUI thinks intermediates are 'over
   there' on CPU and overestimates available VRAM
2. Unnecessary .to(device) copies through separate allocator heaps
3. Heap fragmentation across the unified memory pool

Now matches text_encoder_offload_device() and vae_offload_device() which
already return get_torch_device() on UNIFIED_MEMORY.
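The device choice described in this commit can be sketched roughly as follows. This is a minimal illustration, not the patched code: `MemoryLayout` and the function signature are hypothetical, and plain strings stand in for the torch device objects that the real function would return.

```python
from enum import Enum


class MemoryLayout(Enum):
    """Hypothetical stand-in for the memory state the patch detects."""
    DISCRETE = "discrete"
    UNIFIED = "unified"


def intermediate_device(layout, torch_device="cuda:0"):
    """Pick where intermediate output buffers are allocated.

    On unified memory, return the GPU device so that intermediates are
    counted by the same allocator that reports free VRAM. Otherwise
    keep them on the CPU.
    """
    if layout is MemoryLayout.UNIFIED:
        return torch_device
    return "cpu"


print(intermediate_device(MemoryLayout.UNIFIED))
print(intermediate_device(MemoryLayout.DISCRETE))
```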

2026-05-21 11:02:06 -05:00

Evan Carmen

31939a9710

fix: revert intermediate_device to cpu for unified memory

intermediate_device() controls where large output tensors (decoded video
frames) are accumulated. On unified memory, cpu and cuda:0 share the same
physical RAM, but the CUDA allocator has different fragmentation behavior.

With intermediate_device=cuda:0, LTX video VAE decode hung because
tiled_scale_multidim allocates the full output tensor on cuda:0 upfront,
and the CUDA allocator can't efficiently reclaim space during tiled
decode. Reverting to cpu fixes the hang.

vae_offload_device() and text_encoder_offload_device() remain cuda:0
since those model-loading paths benefit from GPU allocation.
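To get a feel for how large an upfront output allocation can be, here is a rough size estimate. The tensor shape and the 4-byte element size are assumptions chosen for illustration; they are not taken from the LTX workflow that hung.

```python
from math import prod


def tensor_bytes(shape, bytes_per_element=4):
    """Bytes needed for a dense tensor of the given shape."""
    return prod(shape) * bytes_per_element


# Hypothetical decoded video: batch, channels, frames, height, width.
shape = (1, 3, 121, 704, 1216)
size_gib = tensor_bytes(shape) / 2**30
print(f"full output tensor: {size_gib:.2f} GiB allocated before the first tile is decoded")
```

A tiled decode that allocates this buffer in one piece needs a single contiguous block of that size from whichever allocator backs the intermediate device.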

2026-05-20 19:30:53 -05:00

Evan Carmen

7e4d22e41c

feat: Grace-Blackwell unified memory optimization for ComfyUI

- Add model_management.py patch: detects GB10 unified memory (VRAM-to-RAM ratio > 0.95)
- Set HIGH_VRAM mode: no pointless CPU offloading (same physical memory pool)
- Increase maximum_vram_for_weights from 88% to 95% (8.4GB headroom on 128GB)
- Skip torch.cuda.empty_cache() on unified memory (avoids page faults)
- Return GPU for text_encoder/vae/intermediate offload devices on unified memory
- MPS excluded from unified detection (has its own SHARED state)
- Remove PYTORCH_NO_CUDA_MEMORY_CACHING env var (patch handles caching properly)
- Mount patched file as read-only volume override in docker-compose.yml
- DeepSeek review: safe and correct for DGX Spark target
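
The detection and budget changes listed above might look roughly like this. It is a sketch under stated assumptions: the function names are hypothetical, and reading the 0.95 threshold as a ratio of the smaller to the larger memory total is an interpretation of the commit message, not confirmed from the patch.

```python
GIB = 2**30


def is_unified_memory(total_vram, total_ram, threshold=0.95, is_mps=False):
    """Guess whether CPU and GPU share one physical memory pool.

    Assumption: memory is treated as unified when the reported VRAM and
    RAM totals are within 5% of each other. MPS is excluded because it
    has its own shared-memory state.
    """
    if is_mps or total_vram <= 0 or total_ram <= 0:
        return False
    ratio = min(total_vram, total_ram) / max(total_vram, total_ram)
    return ratio > threshold


def weights_budget(total_vram, unified):
    """Bytes of VRAM allowed for model weights: 95% if unified, else 88%."""
    percent = 95 if unified else 88
    return total_vram * percent // 100


unified = is_unified_memory(120 * GIB, 121 * GIB)
print(unified, weights_budget(120 * GIB, unified) // GIB)
```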

Co-authored-by: DeepSeek (code review)

2026-05-20 16:01:51 -05:00

3 Commits