# Repository Summary

## Overview

This repository provides a **production-ready, one-command installation** of vLLM for NVIDIA DGX Spark systems with Blackwell GB10 GPUs (sm_121 architecture).

## What's Included

### Core Files

1. **install.sh** (500+ lines)
   - Fully automated installation script
   - Pre-flight system checks
   - 8-step installation pipeline
   - Post-installation testing
   - Command-line argument support

2. **README.md** (300+ lines)
   - Quick start guide
   - System requirements
   - Usage examples
   - Critical fixes documentation
   - Troubleshooting guide

3. **CLUSTER.md** (400+ lines)
   - Multi-node setup instructions
   - Ray cluster configuration
   - Tensor/pipeline parallelism
   - Performance tuning
   - Load balancing examples

4. **requirements.txt**
   - Complete dependency list
   - PyTorch 2.9.0+cu130
   - All required packages

### Helper Scripts (scripts/)

- **vllm-serve.sh** - Start the vLLM server with a configurable model/port
- **vllm-stop.sh** - Gracefully stop the server
- **vllm-status.sh** - Check server status and logs
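
The helper scripts themselves are short wrappers. As a rough sketch (not the repository's exact script), vllm-serve.sh might look like the following; the default model, port, and PID-file name here are assumptions:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a vllm-serve.sh wrapper; the defaults below are
# assumptions, not the repository's actual values.
MODEL="${1:-Qwen/Qwen2.5-0.5B-Instruct}"
PORT="${2:-8000}"
LOG_FILE="vllm-server.log"
PID_FILE="vllm-server.pid"

# Launch the OpenAI-compatible server in the background and remember its PID
# so a companion stop script can terminate it later.
nohup vllm serve "$MODEL" --port "$PORT" > "$LOG_FILE" 2>&1 &
echo $! > "$PID_FILE"
echo "vLLM serving $MODEL on port $PORT (PID $(cat "$PID_FILE"))"
```

A matching vllm-stop.sh would then only need to `kill "$(cat vllm-server.pid)"` and remove the PID file.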

### Examples (examples/)

- **basic_inference.py** - Simple Python API usage
- **api_client.py** - OpenAI-compatible REST API client
- **README.md** - Usage instructions and API examples
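
api_client.py is not reproduced here, but the shape of an OpenAI-compatible client is straightforward with the standard library alone. The endpoint path, model name, and helper names below are illustrative assumptions, not necessarily what the repository's example uses:

```python
"""Minimal sketch of an OpenAI-compatible REST client (stdlib only)."""
import json
import urllib.request


def build_chat_request(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Build the JSON payload for a /v1/chat/completions call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(base_url: str, model: str, prompt: str) -> str:
    """POST a chat completion and return the first choice's text."""
    payload = json.dumps(build_chat_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example, with a server started by vllm-serve.sh listening on localhost:8000:
#   print(chat("http://localhost:8000", "Qwen/Qwen2.5-0.5B-Instruct", "Hello!"))
```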

### Configuration

- **.gitignore** - Excludes build artifacts, venvs, and logs
- **LICENSE** - MIT license

## Technical Specifications

### Target Platform

- **Hardware:** NVIDIA DGX Spark with GB10 GPU
- **Architecture:** Blackwell sm_121 (compute capability 12.1)
- **OS:** Ubuntu 22.04+ ARM64
- **CUDA:** 13.0+ (driver 580.95.05+)

### Software Stack

- **Python:** 3.12.3
- **PyTorch:** 2.9.0+cu130
- **Triton:** 3.5.0+git (built from the main branch)
- **vLLM:** 0.11.1rc4+
- **Package Manager:** uv (fast Python package installer)

### Critical Fixes Applied

1. **CMakeLists.txt (line 671)**
   - Added `12.0f` to SCALED_MM_ARCHS for SM100 MOE kernels
   - Enables Blackwell GPU compilation

2. **pyproject.toml**
   - Changed `license = "Apache-2.0"` to `license = {text = "Apache-2.0"}`
   - Removed the deprecated `license-files` field
   - Compatible with setuptools 77.0+
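
The pyproject.toml fix is mechanical enough to apply with `sed`. Here is a sketch of the idea on a cut-down stand-in file; install.sh's actual patching mechanism may differ:

```shell
# Demonstrate the pyproject.toml fix on a minimal stand-in file;
# install.sh's actual patch mechanism may differ.
cat > /tmp/pyproject-demo.toml <<'EOF'
[project]
name = "vllm"
license = "Apache-2.0"
license-files = ["LICENSE"]
EOF

# Apply the fix described above: rewrite the license string into table form
# and drop the deprecated license-files field.
sed -i 's/^license = "Apache-2.0"$/license = {text = "Apache-2.0"}/' /tmp/pyproject-demo.toml
sed -i '/^license-files/d' /tmp/pyproject-demo.toml

cat /tmp/pyproject-demo.toml
```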

3. **Triton Build**
   - Must be built from the main branch (not the 3.5.0 release)
   - Non-editable install to avoid a setuptools bug
   - Custom PTXAS path for CUDA integration

### Environment Variables

```bash
TORCH_CUDA_ARCH_LIST=12.1a                   # Blackwell architecture
VLLM_USE_FLASHINFER_MXFP4_MOE=1              # Enable FlashInfer optimization
TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas  # CUDA PTX assembler
```
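
install.sh bakes these variables into the generated activation script. As a sketch of what vllm_env.sh plausibly contains (the venv path is an assumption, so it is left commented out):

```shell
#!/usr/bin/env bash
# Sketch of a generated vllm_env.sh; the venv location below is an assumption.
export TORCH_CUDA_ARCH_LIST=12.1a                   # Blackwell architecture
export VLLM_USE_FLASHINFER_MXFP4_MOE=1              # FlashInfer optimization
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas  # CUDA PTX assembler
# source "$PWD/.venv/bin/activate"                  # hypothetical venv path
```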

## Installation Overview

The `install.sh` script performs these steps:

1. **Pre-flight Checks**
   - Verify ARM64 architecture
   - Check NVIDIA GPU (GB10)
   - Validate CUDA 13.0+
   - Ensure 50GB+ disk space
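
   The pre-flight stage can be pictured as a handful of small shell checks; the probes and thresholds below are illustrative, not install.sh's exact logic:

   ```shell
   # Illustrative pre-flight checks; probes and thresholds are assumptions,
   # not install.sh's exact logic.
   check_arch() { [ "$(uname -m)" = "aarch64" ]; }

   check_gpu() {
       nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | grep -q "GB10"
   }

   # Free space in GiB on the filesystem holding the current directory.
   free_gib() { df -Pk . | awk 'NR==2 { print int($4 / 1048576) }'; }

   check_disk() { [ "$(free_gib)" -ge 50 ]; }

   check_arch && echo "arch: aarch64" || echo "arch: not ARM64"
   check_disk && echo "disk: >= 50 GiB free" || echo "disk: insufficient space"
   ```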

2. **Install uv Package Manager**
   - Fast Python package installer
   - Required for efficient dependency resolution

3. **Create Virtual Environment**
   - Python 3.12 virtual environment
   - Isolated from system packages

4. **Install PyTorch**
   - PyTorch 2.9.0 with CUDA 13.0 bindings
   - Verify CUDA availability

5. **Build Triton**
   - Clone from the GitHub main branch
   - Build with Blackwell support
   - Non-editable install

6. **Install Dependencies**
   - xgrammar, setuptools-scm
   - apache-tvm-ffi (prerelease)
   - Build tools

7. **Clone and Fix vLLM**
   - Clone v0.11.1rc3
   - Apply the CMakeLists.txt fix
   - Apply the pyproject.toml fix
   - Configure use_existing_torch

8. **Build vLLM**
   - 15-20 minute compilation
   - All CUDA kernels for Blackwell
   - Editable install for development

9. **Create Helper Scripts**
   - Environment activation script
   - Server management scripts
   - Logging configuration

10. **Post-Installation Tests**
    - Import vLLM
    - Check CUDA availability
    - Verify GPU detection
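
The post-installation tests can be sketched as a tiny import probe; the check below is illustrative rather than install.sh's exact test code:

```python
"""Illustrative post-installation probe; not install.sh's exact test code."""
import importlib.util


def module_present(name: str) -> bool:
    """Report whether a top-level module is importable, without importing it."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    for mod in ("vllm", "torch"):
        status = "found" if module_present(mod) else "MISSING"
        print(f"{mod}: {status}")
    # A deeper check would then import torch and call torch.cuda.is_available().
```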

## Quick Start

```bash
# One-command installation
curl -fsSL https://raw.githubusercontent.com/eelbaz/dgx-spark-vllm-setup/main/install.sh | bash

# Or clone and run
git clone https://github.com/eelbaz/dgx-spark-vllm-setup.git
cd dgx-spark-vllm-setup
./install.sh

# Activate the environment (assuming installation in the current directory)
cd vllm-install
source vllm_env.sh

# Start the server
./vllm-serve.sh

# Test the API
curl http://localhost:8000/v1/models
```

## Repository Structure

```
dgx-spark-vllm-setup/
├── README.md              # Main documentation
├── CLUSTER.md             # Multi-node setup guide
├── SUMMARY.md             # This file
├── LICENSE                # MIT license
├── .gitignore             # Git ignore rules
├── install.sh             # Main installation script
├── requirements.txt       # Python dependencies
├── scripts/
│   ├── vllm-serve.sh      # Start vLLM server
│   ├── vllm-stop.sh       # Stop server
│   └── vllm-status.sh     # Check status
└── examples/
    ├── README.md          # Examples documentation
    ├── basic_inference.py # Python API example
    └── api_client.py      # REST API example
```

## Known Issues & Workarounds

### Triton Editable Build Fails

**Error:** `TypeError: can only concatenate str (not 'NoneType') to str`

**Workaround:** Use a non-editable install (`uv pip install --no-build-isolation .`)

### PyTorch CUDA Capability Warning

**Warning:** The GB10 reports compute capability 12.1, above PyTorch's supported maximum of 12.0

**Status:** Harmless - PyTorch 2.9.0+cu130 works correctly with the GB10

### apache-tvm-ffi Prerelease

**Error:** `No solution found when resolving dependencies`

**Fix:** Use the `--prerelease=allow` flag with `uv pip install`

## Testing Status

- [OK] Single-node installation on spark-alpha.local
- [OK] Single-node installation on spark-omega.local
- [OK] vLLM server startup and API functionality
- [OK] Model inference (Qwen/Qwen2.5-0.5B-Instruct)
- [IN PROGRESS] Multi-node cluster mode (documented, not yet tested)

## Future Enhancements

- [ ] Add cluster mode testing results
- [ ] Include performance benchmarks
- [ ] Add a Dockerfile for containerized deployment
- [ ] Create an Ansible playbook for multi-node automation
- [ ] Add monitoring and logging setup (Prometheus/Grafana)
- [ ] Include model quantization examples (AWQ, GPTQ)

## Contributing

Contributions are welcome! Please open issues or pull requests on GitHub.

## Community & Support

- **GitHub Issues:** Report bugs and feature requests
- **NVIDIA Forum:** [DGX Spark vLLM Discussion](https://forums.developer.nvidia.com/t/run-vllm-in-spark/348862)
- **vLLM Docs:** [Official Documentation](https://docs.vllm.ai/)

## License

MIT License - see the LICENSE file for details.

## Acknowledgments

Developed and tested on NVIDIA DGX Spark systems. Special thanks to:

- The vLLM project team
- The Triton compiler team
- The NVIDIA DGX Spark community
- Claude Code (AI assistant) for documentation automation

---

**Version:** 1.0.0

**Last Updated:** 2025-10-26

**Tested On:** DGX Spark with GB10, CUDA 13.0, Ubuntu 22.04 ARM64