intermediate_device() controls where large output tensors (decoded video
frames) are accumulated. On unified memory, cpu and cuda:0 share the same
physical RAM, but the CUDA allocator has different fragmentation behavior.
With intermediate_device=cuda:0, LTX video VAE decode hung because
tiled_scale_multidim allocates the full output tensor on cuda:0 upfront,
and the CUDA allocator can't efficiently reclaim space during tiled
decode. Reverting to cpu fixes the hang.
vae_offload_device() and text_encoder_offload_device() remain cuda:0
since those model-loading paths benefit from GPU allocation.
- Add "What's Included" section listing ComfyUI, ComfyUI-Manager,
SageAttention, and PyTorch versions
- Update clone URL to actual GitHub repo (ecarmen16/SparkyUI)
- ComfyUI-Manager is auto-installed on first container run
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Docker-based ComfyUI setup for NVIDIA DGX Spark ARM64 + sm_121:
- CUDA 13.0.2 base (required for compute_121 support)
- PyTorch 2.9.1+cu130 ARM64 wheels
- SageAttention compiled with TORCH_CUDA_ARCH_LIST="12.1"
- Triton/torch.compile disabled (no sm_121 support yet)
- ComfyUI-Manager auto-installed at runtime
- Configurable model/data paths via .env
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>