first commit

2026-03-22 17:26:26 -04:00
commit c05cb71816
15 changed files with 2644 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,312 @@
+# vLLM Setup for NVIDIA DGX Spark (Blackwell GB10)
+
+**One-command installation** of vLLM for NVIDIA DGX Spark systems with GB10 GPUs (Blackwell architecture, sm_121).
+
+This repository provides a dgx-spark tested, ready setup script that handles all the complexities of building vLLM on the DGX Spark platform, including:
+- CUDA 13.0 support with Blackwell-specific optimizations
+- Critical fixes for SM100/SM120 MOE kernel compilation
+- Triton 3.5.0 from main branch (required for sm_121a support)
+- PyTorch 2.9.0 with CUDA 13.0 bindings
+- All necessary build fixes and workarounds
+
+## Quick Start
+
+**One-command installation** - installs to `./vllm-install` in your current directory:
+
+```bash
+curl -fsSL https://raw.githubusercontent.com/eelbaz/dgx-spark-vllm-setup/main/install.sh | bash
+```
+
+Or specify a custom directory:
+
+```bash
+curl -fsSL https://raw.githubusercontent.com/eelbaz/dgx-spark-vllm-setup/main/install.sh | bash -s -- --install-dir ~/my/custom/path
+```
+
+**Installation time:** ~20-30 minutes (mostly compilation)
+
+### Alternative: Clone and Install
+
+```bash
+git clone https://github.com/eelbaz/dgx-spark-vllm-setup.git
+cd dgx-spark-vllm-setup
+./install.sh
+```
+
+### Installation Options
+
+```bash
+./install.sh [OPTIONS]
+
+Options:
+  --install-dir DIR    Installation directory (default: ./vllm-install)
+  --vllm-version TAG   vLLM git tag/branch (default: v0.11.1rc3)
+  --python-version VER Python version (default: 3.12)
+  --skip-tests         Skip post-installation tests
+  --help               Show help message
+```
+
+## System Requirements
+
+- **Hardware:** NVIDIA DGX Spark with GB10 GPU (Blackwell sm_121)
+- **OS:** Ubuntu 22.04+ (tested on Linux 6.11.0 ARM64)
+- **CUDA:** 13.0 or later (driver 580.95.05+)
+- **Disk Space:** ~50GB free
+- **RAM:** 8GB+ recommended during build
+
+## What Gets Installed
+
+Installed to `./vllm-install` (or your custom directory):
+
+- **Python 3.12** virtual environment at `.vllm/`
+- **PyTorch 2.9.0+cu130** with full CUDA 13.0 support
+- **Triton 3.5.0+git** from main branch (pre-release with Blackwell support)
+- **vLLM 0.11.1rc3+** with all Blackwell-specific patches
+- **Helper scripts** for managing vLLM server
+- **Environment activation** script (`vllm_env.sh`)
+
+## Usage
+
+All examples assume you're in the installation directory (default: `./vllm-install`).
+
+### Activate Environment
+
+```bash
+cd vllm-install
+source vllm_env.sh
+```
+
+### Start vLLM Server
+
+```bash
+./vllm-serve.sh                                    # Default: Qwen2.5-0.5B on port 8000
+./vllm-serve.sh "facebook/opt-125m" 8001          # Custom model and port
+```
+
+### Check Server Status
+
+```bash
+./vllm-status.sh
+```
+
+### Stop Server
+
+```bash
+./vllm-stop.sh
+```
+
+### Test API
+
+```bash
+# List models
+curl http://localhost:8000/v1/models
+
+# Generate completion
+curl http://localhost:8000/v1/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "Qwen/Qwen2.5-0.5B-Instruct",
+    "prompt": "Hello, how are you?",
+    "max_tokens": 50
+  }'
+```
+
+### Python API
+
+```python
+from vllm import LLM, SamplingParams
+
+llm = LLM(
+    model="Qwen/Qwen2.5-0.5B-Instruct",
+    trust_remote_code=True,
+    gpu_memory_utilization=0.9
+)
+
+prompts = ["Tell me about DGX Spark"]
+sampling_params = SamplingParams(temperature=0.7, max_tokens=100)
+outputs = llm.generate(prompts, sampling_params)
+
+print(outputs[0].outputs[0].text)
+```
+
+## Critical Fixes Applied
+
+This installer automatically applies the following critical fixes:
+
+### 1. CMakeLists.txt SM100/SM120 MOE Kernel Fix
+
+**Issue:** vLLM's MOE kernels for SM100/SM120 Blackwell architectures were incomplete
+**Fix:** Added `12.0f` and `12.1a` to SCALED_MM_ARCHS in CMakeLists.txt
+
+```cmake
+# CUDA 13.0+ path (line ~671)
+# Before
+cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f" "${CUDA_ARCHS}")
+# After
+cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f;12.0f" "${CUDA_ARCHS}")
+
+# Older CUDA path (line ~673)
+# Before
+cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a" "${CUDA_ARCHS}")
+# After
+cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a;12.1a" "${CUDA_ARCHS}")
+```
+
+### 2. pyproject.toml License Field Format
+
+**Issue:** Newer setuptools requires structured license format
+**Fix:** Convert license string to dict format in both vLLM and flashinfer-python
+
+```toml
+# Before
+license = "Apache-2.0"
+license-files = ["LICENSE"]
+
+# After
+license = {text = "Apache-2.0"}
+```
+
+**Applied to:**
+- vLLM's pyproject.toml
+- flashinfer-python's pyproject.toml (patched during build)
+
+### 3. GPT-OSS Triton MOE Kernels for Qwen3/gpt-oss Support
+
+**Issue:** vLLM's GPT-OSS MOE kernel implementation uses deprecated Triton routing API
+**Fix:** Update to new Triton kernel API (topk and SparseMatrix)
+
+**Changes:**
+- Replace deprecated `routing()` with `triton_topk()`
+- Replace deprecated `routing_from_bitmatrix()` with `SparseMatrix()`
+- Add support for `GatherIndx`, `ScatterIndx`, and new ragged tensor metadata
+
+**Enables support for:**
+- Qwen3 models with MOE architecture
+- gpt-oss models using Triton kernels
+- Latest Triton kernel optimizations for Blackwell
+
+### 4. Triton Main Branch Requirement
+
+**Issue:** Official Triton 3.5.0 release has bugs with sm_121a
+**Fix:** Build Triton from main branch with latest Blackwell fixes
+
+## Architecture-Specific Configuration
+
+The installer sets these critical environment variables:
+
+```bash
+TORCH_CUDA_ARCH_LIST=12.1a                      # Blackwell sm_121
+VLLM_USE_FLASHINFER_MXFP4_MOE=1                 # Enable FlashInfer MOE optimization
+TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas     # CUDA PTX assembler
+TIKTOKEN_CACHE_DIR=$INSTALL_DIR/.tiktoken_cache # Cache tiktoken encodings locally
+```
+
+## Cluster Mode Setup
+
+To set up multi-node vLLM cluster:
+
+1. Run this installer on all nodes
+2. Follow [CLUSTER.md](./CLUSTER.md) for configuration
+
+## Troubleshooting
+
+### Build Fails with "TypeError: can only concatenate str (not 'NoneType') to str"
+
+This is a known Triton editable-mode build issue. The installer works around this by:
+- Building Triton in non-editable mode
+- Or copying pre-built Triton from another node
+
+### Symbol Error: cutlass_moe_mm_sm100
+
+**Symptom:** `ImportError: undefined symbol: _Z20cutlass_moe_mm_sm100`
+**Solution:** Ensure CMakeLists.txt fix is applied (done automatically by installer)
+
+### PyTorch CUDA Capability Warning
+
+**Symptom:** Warning about GPU capability 12.1 vs PyTorch max 12.0
+**Status:** Harmless warning - PyTorch 2.9.0+cu130 works correctly with GB10
+
+### ImportError: No module named 'vllm'
+
+**Solution:**
+```bash
+source vllm-install/vllm_env.sh
+python -c "import vllm; print(vllm.__version__)"
+```
+
+## File Structure
+
+```
+vllm-install/
+├── .vllm/                  # Python virtual environment
+├── vllm/                   # vLLM source (editable install)
+├── triton/                 # Triton source
+├── vllm_env.sh            # Environment activation script
+├── vllm-serve.sh          # Start server
+├── vllm-stop.sh           # Stop server
+├── vllm-status.sh         # Check status
+└── vllm-server.log        # Server logs
+```
+
+## Manual Installation
+
+If you prefer to understand each step:
+
+```bash
+# 1. Install uv package manager
+curl -LsSf https://astral.sh/uv/install.sh | sh
+export PATH="$HOME/.local/bin:$PATH"
+
+# 2. Create installation directory and Python virtual environment
+mkdir -p vllm-install && cd vllm-install
+uv venv .vllm --python 3.12
+source .vllm/bin/activate
+
+# 3. Install PyTorch with CUDA 13.0
+uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
+
+# 4. Clone and build Triton from main
+git clone https://github.com/triton-lang/triton.git
+cd triton
+uv pip install pip cmake ninja pybind11
+TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas python -m pip install --no-build-isolation .
+
+# 5. Install additional dependencies
+uv pip install xgrammar setuptools-scm apache-tvm-ffi==0.1.0b15 --prerelease=allow
+
+# 6. Clone vLLM
+cd ..
+git clone --recursive https://github.com/vllm-project/vllm.git
+cd vllm
+git checkout v0.11.1rc3
+
+# 7. Apply fixes (see scripts/apply-fixes.sh)
+# 8. Build vLLM (see install.sh for full process)
+```
+
+## Version Information
+
+- **vLLM:** 0.11.1rc4.dev6+g66a168a19.d20251026
+- **PyTorch:** 2.9.0+cu130
+- **Triton:** 3.5.0+git4caa0328
+- **CUDA:** 13.0
+- **Python:** 3.12.3
+- **Target Architecture:** sm_121 (Blackwell GB10)
+
+## Contributing
+
+Issues and pull requests welcome! This installer is maintained by the DGX Spark community.
+
+## References
+
+- [NVIDIA Forum Discussion](https://forums.developer.nvidia.com/t/run-vllm-in-spark/348862)
+- [vLLM GitHub](https://github.com/vllm-project/vllm)
+- [Triton GitHub](https://github.com/triton-lang/triton)
+
+## License
+
+MIT License - See [LICENSE](./LICENSE)
+
+## Acknowledgments
+
+Developed and tested on NVIDIA DGX Spark systems. Special thanks to the vLLM and Triton communities.