fix: revert intermediate_device to cpu for unified memory

intermediate_device() controls where large output tensors (decoded video
frames) are accumulated. On unified memory, cpu and cuda:0 share the same
physical RAM, but tensors on cuda:0 go through the CUDA allocator, which
fragments differently from the CPU allocator.

With intermediate_device=cuda:0, LTX video VAE decode hung because
tiled_scale_multidim allocates the full output tensor on cuda:0 upfront,
and the CUDA allocator can't efficiently reclaim space during tiled
decode. Reverting to cpu fixes the hang.

vae_offload_device() and text_encoder_offload_device() still return
cuda:0 on unified memory, since those model-loading paths benefit from
GPU allocation.
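
The change in selection logic can be sketched as below. This is an
illustration only, not the project code: devices are plain strings, the
two flags are passed as arguments instead of being read from
args.gpu_only and UNIFIED_MEMORY, and the names
intermediate_device_before / intermediate_device_after are hypothetical.

```python
# Illustrative sketch only: devices are plain strings and the flags are
# passed in explicitly; the real code reads args.gpu_only and UNIFIED_MEMORY.
GPU = "cuda:0"  # stand-in for get_torch_device()
CPU = "cpu"     # stand-in for torch.device("cpu")


def intermediate_device_before(gpu_only: bool, unified_memory: bool) -> str:
    """Behavior being reverted: unified memory also selected the GPU."""
    if gpu_only or unified_memory:
        return GPU
    return CPU


def intermediate_device_after(gpu_only: bool, unified_memory: bool) -> str:
    """Behavior after this commit: only gpu_only selects the GPU."""
    if gpu_only:
        return GPU
    return CPU


if __name__ == "__main__":
    # Print the full truth table so the one differing case is visible.
    for gpu_only in (False, True):
        for unified in (False, True):
            print(gpu_only, unified,
                  intermediate_device_before(gpu_only, unified),
                  intermediate_device_after(gpu_only, unified))
```

The two versions differ only when gpu_only is off and unified memory is
on, which is the case where the LTX VAE decode hang was seen.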
Evan Carmen
2026-05-20 19:30:53 -05:00
parent 7e4d22e41c
commit 31939a9710
+1 -1
@@ -1106,7 +1106,7 @@ def text_encoder_dtype(device=None):
 def intermediate_device():
-    if args.gpu_only or UNIFIED_MEMORY:
+    if args.gpu_only:
         return get_torch_device()
     else:
         return torch.device("cpu")