Add Grace-Blackwell unified memory optimizations

Key changes based on HurbaLurba's DGX Spark research:

- Remove --gpu-only flag (fights unified memory fabric)
- Add --disable-pinned-memory, --force-fp16, --dont-upcast-attention
- Add CUDA env vars for unified memory: CUDA_MANAGED_FORCE_DEVICE_ALLOC, PYTORCH_NO_CUDA_MEMORY_CACHING, OMP_NUM_THREADS=20
- Document unified memory architecture best practices
- Add host-level GPU optimization instructions (clock locking, vboost)
- Document SageAttention PR #297 status (merged then reverted)
- Add credits section

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
+7
-1
@@ -9,7 +9,13 @@ SPARKYUI_DATA_PATH=/path/to/SparkyUI
 
 # ComfyUI settings
 COMFYUI_PORT=8188
-COMFYUI_FLAGS=--listen 0.0.0.0 --port 8188 --gpu-only
+
+# Optimized flags for Grace-Blackwell unified memory architecture
+# Key: DON'T use --gpu-only - it fights the unified memory fabric
+# --disable-pinned-memory: reduces overhead on unified fabric
+# --force-fp16 + --fp16-*: enables SageAttention optimization
+# --dont-upcast-attention: keeps attention in FP16 for speed
+COMFYUI_FLAGS=--listen 0.0.0.0 --port 8188 --disable-pinned-memory --force-fp16 --fp16-unet --fp16-vae --fp16-text-enc --dont-upcast-attention
 
 # Build refs (pin to specific commits/tags for reproducibility)
 COMFYUI_REF=master
@@ -19,6 +19,31 @@ Standard ComfyUI containers and PyTorch wheels don't support sm_121. SparkyUI so
 2. Installing **PyTorch cu130** ARM64 wheels
 3. Compiling **SageAttention** with `TORCH_CUDA_ARCH_LIST="12.1"`
 4. Disabling **Triton/torch.compile** (doesn't support sm_121 yet)
+5. **Optimized for Grace-Blackwell unified memory architecture**
+
+## Unified Memory Architecture
+
+The DGX Spark's Grace-Blackwell architecture uses **unified memory** - a coherent memory fabric shared between CPU and GPU. This is fundamentally different from discrete GPUs and requires different optimization strategies.
+
+**Key insight: Don't fight the fabric.** Forcing everything GPU-side (`--gpu-only`, `--cache-none`) actually hurts performance.
+
+**Optimized flags (default in SparkyUI):**
+```bash
+--disable-pinned-memory    # Reduces overhead on unified fabric
+--force-fp16               # Enables SageAttention optimization
+--fp16-unet --fp16-vae --fp16-text-enc  # FP16 precision throughout
+--dont-upcast-attention    # Keeps attention in FP16 for speed
+```
+
+**What NOT to use:**
+- `--gpu-only` - fights the unified memory fabric, hurts performance
+- `--cache-none` - disables natural caching, slows model loading
+- `--disable-mmap` - prevents memory-mapped model loading
+
+**CUDA environment variables** are also tuned for unified memory:
+- `CUDA_MANAGED_FORCE_DEVICE_ALLOC=1` - prefer GPU allocation
+- `PYTORCH_NO_CUDA_MEMORY_CACHING=1` - let fabric manage memory
+- `OMP_NUM_THREADS=20` - utilize all 20 ARM cores
 
 ## Quick Start
 
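The "What NOT to use" list in the README hunk above lends itself to a quick pre-flight check before the container starts. The following is a minimal sketch, not part of this commit: `check_comfyui_flags` is a hypothetical helper name, and the only facts it relies on are the three flags the README advises against.

```bash
#!/bin/sh
# Hypothetical helper (not part of SparkyUI or ComfyUI): warn when a flag
# string contains options the README discourages on unified memory.
check_comfyui_flags() {
    rc=0
    # Intentionally unquoted so the flag string splits on whitespace.
    for flag in $1; do
        case "$flag" in
            --gpu-only|--cache-none|--disable-mmap)
                echo "warning: $flag is discouraged on unified memory"
                rc=1
                ;;
        esac
    done
    return "$rc"
}

# Inspect whatever COMFYUI_FLAGS is currently set to (empty if unset).
check_comfyui_flags "${COMFYUI_FLAGS:-}" || true
```

The function returns non-zero when it finds a discouraged flag, so it could gate a startup script, or be used purely for its warnings as in the last line.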
@@ -132,6 +157,36 @@ The entrypoint auto-clones it. Check logs:
 docker compose logs | grep -i manager
 ```
+
+## Host-Level GPU Optimizations (Optional)
+
+For maximum performance, apply these optimizations on the **host** (not in Docker):
+
+```bash
+# Lock GPU clocks to maximum (3003 MHz) - prevents throttling
+sudo nvidia-smi -lgc 3003,3003
+
+# Enable core clock boost (GPU core > memory clock for compute)
+sudo nvidia-smi boost-slider --vboost 1
+
+# Enable persistence mode (reduces driver load latency)
+sudo nvidia-smi -pm 1
+
+# Verify settings
+nvidia-smi --query-gpu=clocks.sm,clocks.max.sm,persistence_mode --format=csv
+```
+
+**Note:** GPU clock settings don't persist across reboots due to GB10 firmware behavior. Re-apply after each boot.
+
+## SageAttention Notes
+
+SageAttention PR #297 added sm_121 support but was merged then reverted due to stability issues. Our approach:
+
+- Build SageAttention from main branch with `TORCH_CUDA_ARCH_LIST="12.1"`
+- Disable Triton via `TORCHDYNAMO_DISABLE=1` (Triton doesn't support sm_121a)
+- This gives working SageAttention without the unstable PR #297 changes
+
+For full Triton support (more complex), see [HurbaLurba's DGX-SPARK-COMFYUI-DOCKER](https://github.com/HurbaLurba/DGX-SPARK-COMFYUI-DOCKER) which builds custom Triton from source.
 
 ## Future
 
 When these land, SparkyUI can be simplified:
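Because the README notes that clock settings do not survive a reboot, the host commands could be wrapped in a small script that is re-run at boot. This is a sketch under stated assumptions: the function name `apply_gpu_tuning` and its `--dry-run` switch are invented for illustration, while the three `nvidia-smi` invocations are copied unchanged from the README hunk above.

```bash
#!/bin/sh
# Hypothetical wrapper around the host-level commands from the README.
# With --dry-run it only prints the commands; without it they are executed.
apply_gpu_tuning() {
    prefix=""
    if [ "${1:-}" = "--dry-run" ]; then
        prefix="echo"
    fi
    # Stop at the first failing command so a partial setup is noticed.
    $prefix sudo nvidia-smi -lgc 3003,3003 &&
    $prefix sudo nvidia-smi boost-slider --vboost 1 &&
    $prefix sudo nvidia-smi -pm 1
}

# Preview only; call apply_gpu_tuning without arguments on the host to apply.
apply_gpu_tuning --dry-run
```

How the script is triggered at boot (for example a systemd unit or a cron `@reboot` entry) is left open here, since the commit does not prescribe one.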
@@ -139,6 +194,12 @@ When these land, SparkyUI can be simplified:
 - [ ] Triton sm_121 support → remove `TORCHDYNAMO_DISABLE`
 - [ ] SageAttention prebuilt ARM64 wheels → remove source build
+
+## Credits
+
+- Unified memory architecture insights from [HurbaLurba's DGX-SPARK-COMFYUI-DOCKER](https://github.com/HurbaLurba/DGX-SPARK-COMFYUI-DOCKER)
+- SageAttention by [thu-ml](https://github.com/thu-ml/SageAttention)
+- ComfyUI by [comfyanonymous](https://github.com/comfyanonymous/ComfyUI)
 
 ## License
 
 MIT
+14
-1
@@ -27,13 +27,26 @@ services:
 
     environment:
       COMFYUI_PORT: "${COMFYUI_PORT:-8188}"
-      COMFYUI_FLAGS: "${COMFYUI_FLAGS:---listen 0.0.0.0 --port 8188 --gpu-only}"
+      # Optimized for Grace-Blackwell unified memory architecture
+      # Key insight: DON'T use --gpu-only - let the unified memory fabric work naturally
+      COMFYUI_FLAGS: "${COMFYUI_FLAGS:---listen 0.0.0.0 --port 8188 --disable-pinned-memory --force-fp16 --fp16-unet --fp16-vae --fp16-text-enc --dont-upcast-attention}"
       NVIDIA_VISIBLE_DEVICES: "all"
       NVIDIA_DRIVER_CAPABILITIES: "compute,utility"
 
       # Disable torch.compile/inductor - Triton doesn't support Blackwell sm_121a yet
       TORCH_COMPILE_DISABLE: "1"
       TORCHDYNAMO_DISABLE: "1"
+
+      # Grace-Blackwell unified memory optimizations
+      CUDA_CACHE_DISABLE: "1"
+      PYTORCH_NO_CUDA_MEMORY_CACHING: "1"
+      CUDA_DEVICE_MAX_CONNECTIONS: "1"
+      CUDA_DEVICE_MAX_COPY_CONNECTIONS: "4"
+      CUDA_MODULE_LOADING: "EAGER"
+      CUDA_MANAGED_FORCE_DEVICE_ALLOC: "1"
+      OMP_NUM_THREADS: "20"
+      CUBLAS_WORKSPACE_CONFIG: ":0:0"
+
     volumes:
       # Models from existing ComfyUI install (read-only)
       - ${COMFYUI_HOST_PATH}/models:/opt/ComfyUI/models:ro
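To confirm that the eight variables added in this hunk actually reach the process, one could print them from a shell inside the running container. The sketch below assumes nothing beyond the variable names in the hunk; `report_unified_memory_env` is a hypothetical helper name, and how you open a shell in the container depends on your service name, which this diff does not show.

```bash
#!/bin/sh
# Hypothetical check: print each variable set by the compose hunk above,
# or "(not set)" when it is missing or empty in the current environment.
report_unified_memory_env() {
    for name in CUDA_CACHE_DISABLE PYTORCH_NO_CUDA_MEMORY_CACHING \
                CUDA_DEVICE_MAX_CONNECTIONS CUDA_DEVICE_MAX_COPY_CONNECTIONS \
                CUDA_MODULE_LOADING CUDA_MANAGED_FORCE_DEVICE_ALLOC \
                OMP_NUM_THREADS CUBLAS_WORKSPACE_CONFIG; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name-}"
        [ -n "$value" ] || value="(not set)"
        echo "$name=$value"
    done
}

report_unified_memory_env
```

Note that the README hunk documents only three of these variables, so a report like this is also a quick way to see the full set the container runs with.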