fix: revert intermediate_device to cpu for unified memory
intermediate_device() controls where large output tensors, such as decoded video frames, are accumulated. On unified memory, cpu and cuda:0 share the same physical RAM, but the CUDA allocator has different fragmentation behavior.

With intermediate_device=cuda:0, LTX video VAE decode hung: tiled_scale_multidim allocates the full output tensor on cuda:0 upfront, and the CUDA allocator cannot efficiently reclaim space during the tiled decode. Reverting to cpu fixes the hang.

vae_offload_device() and text_encoder_offload_device() remain on cuda:0, since those model-loading paths benefit from GPU allocation.
@@ -1106,7 +1106,7 @@ def text_encoder_dtype(device=None):
 
 
 def intermediate_device():
-    if args.gpu_only or UNIFIED_MEMORY:
+    if args.gpu_only:
         return get_torch_device()
     else:
         return torch.device("cpu")
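The effect of the hunk above can be checked with a small standalone sketch. It is not the project's code: `intermediate_device_before` and `intermediate_device_after` are hypothetical names, plain strings stand in for `torch.device` objects, and the two boolean parameters stand in for `args.gpu_only` and `UNIFIED_MEMORY`. It enumerates all flag combinations and reports which ones select a different device after the revert.

```python
from itertools import product

GPU = "cuda:0"  # stand-in for get_torch_device(); an assumption for this sketch
CPU = "cpu"     # stand-in for torch.device("cpu")


def intermediate_device_before(gpu_only: bool, unified_memory: bool) -> str:
    # Condition as it was before this commit: unified memory forced the GPU device.
    return GPU if (gpu_only or unified_memory) else CPU


def intermediate_device_after(gpu_only: bool, unified_memory: bool) -> str:
    # Condition after this commit: only gpu_only selects the GPU device.
    return GPU if gpu_only else CPU


# Collect every (gpu_only, unified_memory) pair whose result changed.
changed = [
    (gpu_only, unified)
    for gpu_only, unified in product((False, True), repeat=2)
    if intermediate_device_before(gpu_only, unified)
    != intermediate_device_after(gpu_only, unified)
]

for gpu_only, unified in changed:
    print(
        f"gpu_only={gpu_only}, unified_memory={unified}: "
        f"{intermediate_device_before(gpu_only, unified)} -> "
        f"{intermediate_device_after(gpu_only, unified)}"
    )
```

Under these assumptions the only combination that changes is unified memory without gpu_only, which now accumulates on cpu; a run with gpu_only set still accumulates on the GPU device.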