Evan Carmen c803ea6146 fix: intermediate_device() returns cuda on unified memory
On Grace-Blackwell (GB10), CPU and GPU share the same physical RAM.
intermediate_device() was returning 'cpu', which means ComfyUI allocates
output buffers (like VAE decode) through the CPU allocator on the same
physical memory pool it thinks is free VRAM. This causes:

1. Memory accounting mismatch — ComfyUI thinks intermediates are 'over
   there' on CPU and overestimates available VRAM
2. Unnecessary .to(device) copies through separate allocator heaps
3. Heap fragmentation across the unified memory pool

Now matches text_encoder_offload_device() and vae_offload_device() which
already return get_torch_device() on UNIFIED_MEMORY.
2026-05-21 11:02:06 -05:00
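
The device choice described in this commit can be pictured with a small stand-alone sketch. It does not reproduce ComfyUI's actual code: the MemoryMode enum and the plain string device names are hypothetical stand-ins, and the discrete-GPU branch assumes the earlier 'cpu' behavior is kept. Only the function name intermediate_device() and the UNIFIED_MEMORY state come from the commit message.

```python
from enum import Enum, auto


class MemoryMode(Enum):
    """Hypothetical stand-in for the memory states ComfyUI tracks."""
    DISCRETE = auto()        # separate VRAM and system RAM
    UNIFIED_MEMORY = auto()  # CPU and GPU share one physical pool (GB10)


def intermediate_device(mode, gpu="cuda", cpu="cpu"):
    """Pick the device that holds intermediate buffers such as VAE decode output."""
    if mode is MemoryMode.UNIFIED_MEMORY:
        # Both allocators draw from the same RAM, so keeping intermediates
        # on the GPU allocator keeps the memory accounting in one place.
        return gpu
    # Assumption: on discrete GPUs the previous behavior ('cpu') is unchanged.
    return cpu


print(intermediate_device(MemoryMode.UNIFIED_MEMORY))
```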

SparkyUI

ComfyUI + SageAttention for NVIDIA DGX Spark (Blackwell GB10)

A Docker-based ComfyUI setup specifically engineered for the DGX Spark's unique ARM64 + Blackwell architecture.

Why This Exists

The NVIDIA DGX Spark uses the GB10 GPU with compute capability 12.1 (sm_121) - Blackwell architecture. This creates challenges:

CUDA Version    Max Compute Capability    Can compile for GB10?
CUDA 12.8       sm_120                    No
CUDA 13.0+      sm_121                    Yes

Standard ComfyUI containers and PyTorch wheels don't support sm_121. SparkyUI solves this by:

  1. Using CUDA 13.0.2 base image (supports sm_121)
  2. Installing PyTorch cu130 ARM64 wheels
  3. Compiling SageAttention with TORCH_CUDA_ARCH_LIST="12.1"
  4. Disabling Triton/torch.compile (they don't support sm_121 yet)
  5. Tuning launch flags and environment variables for the Grace-Blackwell unified memory architecture
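
The toolkit constraint behind step 1 can be written down as a small lookup. This is only a sketch that encodes the two rows of the table in this section; the names MAX_SM_BY_CUDA, sm_name and can_compile_for are made up for illustration and are not part of SparkyUI.

```python
# Highest compute capability each toolkit can target, copied from the table
# in this README. Other toolkit versions are deliberately not guessed.
MAX_SM_BY_CUDA = {
    (12, 8): (12, 0),
    (13, 0): (12, 1),
}


def sm_name(capability):
    """Format a (major, minor) capability as an sm_XY name."""
    major, minor = capability
    return f"sm_{major}{minor}"


def can_compile_for(cuda_version, capability):
    """Return True if the listed toolkit can target the given capability."""
    if cuda_version not in MAX_SM_BY_CUDA:
        raise KeyError(f"toolkit {cuda_version} is not in the table")
    # Tuples compare element by element, so (12, 1) <= (12, 0) is False.
    return capability <= MAX_SM_BY_CUDA[cuda_version]


gb10 = (12, 1)
for toolkit in sorted(MAX_SM_BY_CUDA):
    print(toolkit, sm_name(gb10), can_compile_for(toolkit, gb10))
```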

What's Included

  • ComfyUI (latest master branch)
  • ComfyUI-Manager - auto-installed on first run for easy custom node management
  • ComfyUIMini - mobile-friendly web UI for phones/tablets (separate container)
  • SageAttention - compiled natively for sm_121 (Blackwell tensor cores)
  • PyTorch 2.9.1+cu130 - ARM64 wheels with CUDA 13.0 support

Unified Memory Architecture

The DGX Spark's Grace-Blackwell architecture uses unified memory - a coherent memory fabric shared between CPU and GPU. This is fundamentally different from discrete GPUs and requires different optimization strategies.

Key insight: Don't fight the fabric. Forcing everything GPU-side (--gpu-only, --cache-none) actually hurts performance.

Optimized flags (default in SparkyUI):

--disable-pinned-memory   # Reduces overhead on unified fabric
--force-fp16              # Enables SageAttention optimization
--fp16-unet --fp16-vae --fp16-text-enc  # FP16 precision throughout
--dont-upcast-attention   # Keeps attention in FP16 for speed

What NOT to use:

  • --gpu-only - fights the unified memory fabric, hurts performance
  • --cache-none - disables natural caching, slows model loading
  • --disable-mmap - prevents memory-mapped model loading
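
If you launch ComfyUI with your own arguments, a quick check like the following can catch the flags listed above. It is a minimal sketch, not something SparkyUI ships: the flag names are taken from this section, while the helper name discouraged_in is hypothetical.

```python
# Flags SparkyUI passes by default (from the list above).
RECOMMENDED_FLAGS = [
    "--disable-pinned-memory",
    "--force-fp16",
    "--fp16-unet", "--fp16-vae", "--fp16-text-enc",
    "--dont-upcast-attention",
]

# Flags this README advises against on unified memory, with the stated reason.
DISCOURAGED_FLAGS = {
    "--gpu-only": "fights the unified memory fabric",
    "--cache-none": "disables natural caching",
    "--disable-mmap": "prevents memory-mapped model loading",
}


def discouraged_in(argv):
    """Return the discouraged flags found in an argument list, in order."""
    found = []
    for arg in argv:
        flag = arg.split("=", 1)[0]  # tolerate a --flag=value spelling
        if flag in DISCOURAGED_FLAGS:
            found.append(flag)
    return found


for flag in discouraged_in(["--force-fp16", "--gpu-only"]):
    print(f"warning: {flag} {DISCOURAGED_FLAGS[flag]}")
```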

CUDA environment variables are also tuned for unified memory:

  • CUDA_MANAGED_FORCE_DEVICE_ALLOC=1 - prefer GPU allocation
  • PYTORCH_NO_CUDA_MEMORY_CACHING=1 - let fabric manage memory
  • OMP_NUM_THREADS=20 - utilize all 20 ARM cores
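
For experiments outside the container, the same variables could be applied from Python before anything initializes CUDA. The sketch below is an assumption about how one might do that; unified_memory_env and apply_env are hypothetical helpers, and values already present in the environment are left untouched.

```python
import os


def unified_memory_env(cores=None):
    """Build the environment variables listed above as a plain dict."""
    if cores is None:
        cores = os.cpu_count() or 1  # 20 on the DGX Spark per this README
    return {
        "CUDA_MANAGED_FORCE_DEVICE_ALLOC": "1",
        "PYTORCH_NO_CUDA_MEMORY_CACHING": "1",
        "OMP_NUM_THREADS": str(cores),
    }


def apply_env(env, target=None):
    """Set each variable unless the target mapping already defines it."""
    if target is None:
        target = os.environ
    for key, value in env.items():
        target.setdefault(key, value)
    return target


# Demonstrate on a scratch dict so the real environment is not modified.
scratch = apply_env(unified_memory_env(cores=20), target={})
print(scratch)
```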

Quick Start

# Clone
git clone https://github.com/ecarmen16/SparkyUI.git
cd SparkyUI

# Configure paths
cp .env.example .env
# Edit .env with your paths

# Build (compiles SageAttention for sm_121 - takes ~10 min)
docker compose build

# Start
docker compose up -d

# View logs
docker compose logs -f

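Access:

  • Desktop UI (ComfyUI): http://<your-dgx-ip>:8188
  • Mobile UI (ComfyUIMini): http://<your-dgx-ip>:3000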

Requirements

  • NVIDIA DGX Spark (or other GB10-based system)
  • Docker with NVIDIA Container Toolkit
  • NVIDIA Driver 560+ (tested with 580.95)
  • ~15GB disk for Docker image
  • Models from existing ComfyUI install (mounted read-only)

Configuration

Copy .env.example to .env and edit:

# Path to your existing ComfyUI models (mounted read-only)
COMFYUI_HOST_PATH=/path/to/your/ComfyUI

# Path for SparkyUI data (custom_nodes, outputs, inputs)
SPARKYUI_DATA_PATH=/path/to/SparkyUI

# Optional: pin to specific versions
COMFYUI_REF=master
SAGEATTN_REF=main
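
A small pre-flight check can confirm the file is filled in before you build. This is a sketch under stated assumptions: it treats the two path variables as required and falls back to the example values for the two refs, which matches the sample above but is not confirmed against SparkyUI's compose file. The helper names are hypothetical.

```python
REQUIRED_KEYS = ("COMFYUI_HOST_PATH", "SPARKYUI_DATA_PATH")
DEFAULTS = {"COMFYUI_REF": "master", "SAGEATTN_REF": "main"}


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = dict(DEFAULTS)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values):
    """Return required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


sample = """
# Path to your existing ComfyUI models (mounted read-only)
COMFYUI_HOST_PATH=/path/to/your/ComfyUI
"""
print(missing_keys(parse_env(sample)))
```

To check a real file, pass its contents, for example parse_env(open(".env").read()).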

Architecture

┌──────────────────────────────────────────────────────────────────┐
│                        DGX Spark Host                             │
│  Ubuntu 24.04 (DGX OS 7) / Driver 580.x                          │
│                                                                   │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                    Docker Network (sparky_net)              │  │
│  │                                                             │  │
│  │  ┌─────────────────────────┐  ┌──────────────────────────┐ │  │
│  │  │  comfyui (sparkyui:cu130)│  │  comfyuimini (node:20)   │ │  │
│  │  │                         │  │                          │ │  │
│  │  │  CUDA 13.0.2 + PyTorch  │◄─┤  Mobile-friendly UI      │ │  │
│  │  │  SageAttention (sm_121) │  │  REST + WebSocket proxy  │ │  │
│  │  │  ComfyUI + Manager      │  │                          │ │  │
│  │  │                         │  │  Shares /output volume   │ │  │
│  │  └───────────┬─────────────┘  └────────────┬─────────────┘ │  │
│  │              │                             │                │  │
│  └──────────────┼─────────────────────────────┼────────────────┘  │
│                 │                             │                    │
│          Port 8188 (Desktop)           Port 3000 (Mobile)         │
└──────────────────────────────────────────────────────────────────┘

Version Compatibility

Tested combinations:

Component        Version         Notes
CUDA Base        13.0.2          Required for sm_121
PyTorch          2.9.1+cu130     ARM64 wheel from PyTorch index
torchvision      0.24.1+cu130    ARM64 wheel
SageAttention    2.2.0           Compiled with sm_121
ComfyUI          0.7.0           master branch
Driver           580.95          DGX OS 7 default

Known Limitations

  1. PyTorch Warning: You'll see a warning about compute capability 12.1 being "outside supported range (8.0-12.0)". This is harmless - PyTorch works, and SageAttention's custom kernels are compiled natively.

  2. torch.compile Disabled: Triton doesn't support sm_121 yet. torch.compile() is disabled via environment variables. Some nodes may run slower than on supported architectures.

  3. No GitHub Actions CI: GitHub's hosted runners can't build for ARM64 + sm_121, so the image must be built locally on the DGX Spark.

Troubleshooting

"no kernel image is available for execution on the device"

Your SageAttention wasn't compiled for sm_121. Rebuild:

docker compose build --no-cache
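
Before rebuilding, it can help to confirm that the architecture list used for the build really names sm_121. The sketch below parses a TORCH_CUDA_ARCH_LIST-style string; it handles only numeric entries such as "12.1" or "9.0+PTX", ignores any PTX fallback, and uses hypothetical helper names.

```python
import re


def parse_arch_list(value):
    """Turn a string such as "8.0;12.1+PTX" into a set of sm_XY names."""
    names = set()
    for token in re.split(r"[;\s]+", value.strip()):
        if not token:
            continue
        version = token.split("+", 1)[0]  # drop a trailing "+PTX"
        major, _, minor = version.partition(".")
        names.add(f"sm_{major}{minor}")
    return names


def covers(value, target="sm_121"):
    """Return True if the list contains an exact entry for the target."""
    return target in parse_arch_list(value)


print(covers("12.1"))         # the value SparkyUI builds with
print(covers("8.0;9.0+PTX"))  # a list that omits sm_121
```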

PyTorch can't find CUDA

Ensure NVIDIA Container Toolkit is installed:

nvidia-ctk --version
docker run --rm --gpus all nvidia/cuda:13.0.2-base-ubuntu24.04 nvidia-smi

ComfyUI-Manager missing

The entrypoint auto-clones it. Check logs:

docker compose logs | grep -i manager

Host-Level GPU Optimizations (Optional)

For maximum performance, apply these optimizations on the host (not in Docker):

# Lock GPU clocks to maximum (3003 MHz) - prevents throttling
sudo nvidia-smi -lgc 3003,3003

# Enable core clock boost (GPU core > memory clock for compute)
sudo nvidia-smi boost-slider --vboost 1

# Enable persistence mode (reduces driver load latency)
sudo nvidia-smi -pm 1

# Verify settings
nvidia-smi --query-gpu=clocks.sm,clocks.max.sm,persistence_mode --format=csv

Note: GPU clock settings don't persist across reboots due to GB10 firmware behavior. Re-apply after each boot.
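
Since the settings have to be re-applied after every boot, one option is a small script run as root at startup. The following is only a sketch: the commands are the ones listed above, but the script itself, its dry-run default, and whatever boot mechanism you hook it into are assumptions, not part of SparkyUI.

```python
import subprocess

# The three host commands from this section, without sudo (run as root).
HOST_COMMANDS = [
    ["nvidia-smi", "-lgc", "3003,3003"],
    ["nvidia-smi", "boost-slider", "--vboost", "1"],
    ["nvidia-smi", "-pm", "1"],
]


def apply_host_settings(dry_run=True):
    """Run each command in order; with dry_run=True only report them."""
    for command in HOST_COMMANDS:
        if not dry_run:
            # check=True stops at the first command that fails.
            subprocess.run(command, check=True)
    return [" ".join(command) for command in HOST_COMMANDS]


for line in apply_host_settings(dry_run=True):
    print(line)
```

Call apply_host_settings(dry_run=False) only on the host, never inside the container.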

ComfyUIMini (Mobile UI)

SparkyUI includes ComfyUIMini - a lightweight, mobile-friendly web UI that runs in a separate container.

Features:

  • Responsive design optimized for phones and tablets
  • Simplified workflow execution interface
  • Built-in image gallery (reads from shared output directory)
  • Import workflows from ComfyUI in "API Format"
  • Multiple themes (dark, light, aurora, nord, etc.)

How it works:

  • Runs as a Node.js Express server in its own container (~150MB)
  • Connects to ComfyUI via internal Docker network (http://comfyui:8188)
  • Proxies REST API calls and WebSocket connections
  • Shares the output directory for gallery viewing
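
To confirm that the ComfyUI backend is reachable the way ComfyUIMini expects, you can query it from another container on the same Docker network. This is a minimal sketch using only the standard library; it assumes ComfyUI's /system_stats endpoint is available, and the helper names are hypothetical.

```python
import json
import urllib.error
import urllib.request


def system_stats_url(host="comfyui", port=8188):
    """Build the URL for ComfyUI's system stats endpoint."""
    return f"http://{host}:{port}/system_stats"


def fetch_system_stats(url, timeout=5.0):
    """Return the decoded JSON response, or None if the call fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.load(response)
    except (urllib.error.URLError, OSError, ValueError):
        # Covers refused connections, DNS failures, timeouts and bad JSON.
        return None


if __name__ == "__main__":
    stats = fetch_system_stats(system_stats_url())
    print("reachable" if stats is not None else "not reachable")
```

From the host, replace the comfyui hostname with your DGX Spark's address.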

Access: http://<your-dgx-ip>:3000

Build only ComfyUIMini (if ComfyUI is already built):

docker compose build comfyuimini
docker compose up -d comfyuimini

SageAttention Notes

SageAttention PR #297 added sm_121 support but was merged then reverted due to stability issues. Our approach:

  • Build SageAttention from main branch with TORCH_CUDA_ARCH_LIST="12.1"
  • Disable torch.compile via TORCHDYNAMO_DISABLE=1 so Triton is never invoked (Triton doesn't support sm_121a)
  • This gives working SageAttention without the unstable PR #297 changes

For full Triton support (more complex), see HurbaLurba's DGX-SPARK-COMFYUI-DOCKER which builds custom Triton from source.

Future

When these land, SparkyUI can be simplified:

  • PyTorch native sm_121 support → remove explicit TORCH_CUDA_ARCH_LIST
  • Triton sm_121 support → remove TORCHDYNAMO_DISABLE
  • SageAttention prebuilt ARM64 wheels → remove source build

Credits

License

MIT
