On Grace-Blackwell (GB10), CPU and GPU share the same physical RAM.
intermediate_device() was returning 'cpu', which means ComfyUI allocates
output buffers (like VAE decode) through the CPU allocator on the same
physical memory pool it thinks is free VRAM. This causes:
1. Memory accounting mismatch — ComfyUI thinks intermediates are 'over
there' on CPU and overestimates available VRAM
2. Unnecessary .to(device) copies through separate allocator heaps
3. Heap fragmentation across the unified memory pool
Now matches text_encoder_offload_device() and vae_offload_device() which
already return get_torch_device() on UNIFIED_MEMORY.
intermediate_device() controls where large output tensors (decoded video
frames) are accumulated. On unified memory, cpu and cuda:0 share the same
physical RAM, but the CUDA allocator has different fragmentation behavior.
With intermediate_device=cuda:0, LTX video VAE decode hung because
tiled_scale_multidim allocates the full output tensor on cuda:0 upfront,
and the CUDA allocator can't efficiently reclaim space during tiled
decode. Reverting to cpu fixes the hang.
vae_offload_device() and text_encoder_offload_device() remain cuda:0
since those model-loading paths benefit from GPU allocation.
- Add "What's Included" section listing ComfyUI, ComfyUI-Manager,
SageAttention, and PyTorch versions
- Update clone URL to actual GitHub repo (ecarmen16/SparkyUI)
- ComfyUI-Manager is auto-installed on first container run
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Docker-based ComfyUI setup for NVIDIA DGX Spark ARM64 + sm_121:
- CUDA 13.0.2 base (required for compute_121 support)
- PyTorch 2.9.1+cu130 ARM64 wheels
- SageAttention compiled with TORCH_CUDA_ARCH_LIST="12.1"
- Triton/torch.compile disabled (no sm_121 support yet)
- ComfyUI-Manager auto-installed at runtime
- Configurable model/data paths via .env
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>