# Repository Summary

## Overview

This repository provides a **production-ready, one-command installation** of vLLM for NVIDIA DGX Spark systems with Blackwell GB10 GPUs (sm_121 architecture).

## What's Included

### Core Files

1. **install.sh** (500+ lines)
   - Fully automated installation script
   - Pre-flight system checks
   - 8-step installation pipeline
   - Post-installation testing
   - Command-line argument support

2. **README.md** (300+ lines)
   - Quick start guide
   - System requirements
   - Usage examples
   - Critical fixes documentation
   - Troubleshooting guide

3. **CLUSTER.md** (400+ lines)
   - Multi-node setup instructions
   - Ray cluster configuration
   - Tensor/pipeline parallelism
   - Performance tuning
   - Load balancing examples

4. **requirements.txt**
   - Complete dependency list
   - PyTorch 2.9.0+cu130
   - All required packages

### Helper Scripts (scripts/)

- **vllm-serve.sh** - Start vLLM server with configurable model/port
- **vllm-stop.sh** - Gracefully stop the server
- **vllm-status.sh** - Check server status and logs

### Examples (examples/)

- **basic_inference.py** - Simple Python API usage
- **api_client.py** - OpenAI-compatible REST API client
- **README.md** - Usage instructions and API examples

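
Because the server speaks the OpenAI-compatible API, the flow in `api_client.py` can also be exercised straight from the shell. A minimal sketch, assuming a server is already running on localhost:8000 with the Qwen model from the testing section loaded:

```bash
# Minimal OpenAI-compatible chat request (assumes a running server on port
# 8000 serving Qwen/Qwen2.5-0.5B-Instruct).
payload='{"model": "Qwen/Qwen2.5-0.5B-Instruct",
          "messages": [{"role": "user", "content": "Say hello"}],
          "max_tokens": 32}'
curl -s http://localhost:8000/v1/chat/completions \
     -H 'Content-Type: application/json' \
     -d "$payload"
```

The response is a standard chat-completion JSON object, so any OpenAI-compatible client library should work the same way.
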
### Configuration

- **.gitignore** - Excludes build artifacts, venvs, logs
- **LICENSE** - MIT license

## Technical Specifications

### Target Platform

- **Hardware:** NVIDIA DGX Spark with GB10 GPU
- **Architecture:** Blackwell sm_121 (compute capability 12.1)
- **OS:** Ubuntu 22.04+ ARM64
- **CUDA:** 13.0+ (driver 580.95.05+)

### Software Stack

- **Python:** 3.12.3
- **PyTorch:** 2.9.0+cu130
- **Triton:** 3.5.0+git (from main branch)
- **vLLM:** 0.11.1rc4+
- **Package Manager:** uv (fast Python package installer)

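
A quick check that the active interpreter matches the stack above, run from inside the created virtual environment:

```bash
# Confirm the interpreter matches the Python 3.12 stack listed above.
python3 --version                                   # expect Python 3.12.x in the venv
python3 -c 'import sys; print(sys.version_info[:2])'
```
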
### Critical Fixes Applied

1. **CMakeLists.txt (line 671)**
   - Added `12.0f` to SCALED_MM_ARCHS for SM100 MOE kernels
   - Enables Blackwell GPU compilation

2. **pyproject.toml**
   - Changed `license = "Apache-2.0"` to `license = {text = "Apache-2.0"}`
   - Removed deprecated `license-files` field
   - Compatible with setuptools 77.0+

3. **Triton Build**
   - Must use main branch (not release 3.5.0)
   - Non-editable install to avoid setuptools bug
   - Custom PTXAS path for CUDA integration

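
The pyproject.toml edit above can be scripted with sed. Here is a sketch demonstrated on a minimal stand-in file (install.sh performs the real edit inside the vLLM checkout, where the surrounding content may differ between releases):

```bash
# Demonstrate the pyproject.toml license fix on a minimal sample file.
printf 'license = "Apache-2.0"\nlicense-files = ["LICENSE"]\n' > /tmp/pyproject.toml
sed -i 's/^license = .*/license = {text = "Apache-2.0"}/' /tmp/pyproject.toml
sed -i '/^license-files/d' /tmp/pyproject.toml       # drop the deprecated field
cat /tmp/pyproject.toml                              # license = {text = "Apache-2.0"}
```
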
### Environment Variables

```bash
TORCH_CUDA_ARCH_LIST=12.1a                    # Blackwell architecture
VLLM_USE_FLASHINFER_MXFP4_MOE=1               # Enable FlashInfer optimization
TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas   # CUDA PTX assembler
```

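
Note that these are plain assignments as written; for cmake and the Triton build to see them, they must be exported into the environment first, for example:

```bash
# Export the build-time variables above so child processes (cmake, the
# Triton build) inherit them.
export TORCH_CUDA_ARCH_LIST=12.1a
export VLLM_USE_FLASHINFER_MXFP4_MOE=1
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
env | grep TORCH_CUDA_ARCH_LIST               # TORCH_CUDA_ARCH_LIST=12.1a
```
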
## Installation Overview

The `install.sh` script performs these steps:

1. **Pre-flight Checks**
   - Verify ARM64 architecture
   - Check NVIDIA GPU (GB10)
   - Validate CUDA 13.0+
   - Ensure 50GB+ disk space

2. **Install uv Package Manager**
   - Fast Python package installer
   - Required for efficient dependency resolution

3. **Create Virtual Environment**
   - Python 3.12 virtual environment
   - Isolated from system packages

4. **Install PyTorch**
   - PyTorch 2.9.0 with CUDA 13.0 bindings
   - Verify CUDA availability

5. **Build Triton**
   - Clone from GitHub main branch
   - Build with Blackwell support
   - Non-editable install

6. **Install Dependencies**
   - xgrammar, setuptools-scm
   - apache-tvm-ffi (prerelease)
   - Build tools

7. **Clone and Fix vLLM**
   - Clone v0.11.1rc3
   - Apply CMakeLists.txt fix
   - Apply pyproject.toml fix
   - Configure use_existing_torch

8. **Build vLLM**
   - 15-20 minute compilation
   - All CUDA kernels for Blackwell
   - Editable install for development

9. **Create Helper Scripts**
   - Environment activation script
   - Server management scripts
   - Logging configuration

10. **Post-Installation Tests**
    - Import vLLM
    - Check CUDA availability
    - Verify GPU detection

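
The pre-flight stage (step 1) can be approximated in a few lines of shell. A sketch of the architecture and disk-space checks, assuming GNU coreutils (install.sh's actual checks are more thorough):

```bash
# Sketch of two pre-flight checks: ARM64 architecture and 50GB+ free disk.
arch=$(uname -m)
[ "$arch" = "aarch64" ] || echo "warning: expected ARM64 (aarch64), got $arch"
free_gb=$(df --output=avail -BG . | tail -n 1 | tr -dc '0-9')
[ "$free_gb" -ge 50 ] || echo "warning: need 50GB+ free, found ${free_gb}GB"
```
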
## Quick Start

```bash
# One-command installation
curl -fsSL https://raw.githubusercontent.com/eelbaz/dgx-spark-vllm-setup/main/install.sh | bash

# Or clone and run
git clone https://github.com/eelbaz/dgx-spark-vllm-setup.git
cd dgx-spark-vllm-setup
./install.sh

# Activate environment (assuming installation in current directory)
cd vllm-install
source vllm_env.sh

# Start server
./vllm-serve.sh

# Test API
curl http://localhost:8000/v1/models
```

## Repository Structure

```
dgx-spark-vllm-setup/
├── README.md              # Main documentation
├── CLUSTER.md             # Multi-node setup guide
├── SUMMARY.md             # This file
├── LICENSE                # MIT license
├── .gitignore             # Git ignore rules
├── install.sh             # Main installation script
├── requirements.txt       # Python dependencies
├── scripts/
│   ├── vllm-serve.sh      # Start vLLM server
│   ├── vllm-stop.sh       # Stop server
│   └── vllm-status.sh     # Check status
└── examples/
    ├── README.md          # Examples documentation
    ├── basic_inference.py # Python API example
    └── api_client.py      # REST API example
```

## Known Issues & Workarounds

### Triton Editable Build Fails

**Error:** `TypeError: can only concatenate str (not 'NoneType') to str`

**Workaround:** Use a non-editable install (`uv pip install --no-build-isolation .`)

### PyTorch CUDA Capability Warning

**Warning:** PyTorch warns that the GPU's compute capability (12.1) exceeds its supported maximum (12.0)

**Status:** Harmless - PyTorch 2.9.0+cu130 works correctly with the GB10

### apache-tvm-ffi Prerelease

**Error:** `No solution found when resolving dependencies`

**Fix:** Pass the `--prerelease=allow` flag to `uv pip install`

## Testing Status

- [OK] Single-node installation on spark-alpha.local
- [OK] Single-node installation on spark-omega.local
- [OK] vLLM server startup and API functionality
- [OK] Model inference (Qwen/Qwen2.5-0.5B-Instruct)
- [IN PROGRESS] Multi-node cluster mode (documented, not yet tested)

## Future Enhancements

- [ ] Add cluster mode testing results
- [ ] Include performance benchmarks
- [ ] Add Dockerfile for containerized deployment
- [ ] Create Ansible playbook for multi-node automation
- [ ] Add monitoring and logging setup (Prometheus/Grafana)
- [ ] Include model quantization examples (AWQ, GPTQ)

## Contributing

Contributions welcome! Please open issues or pull requests on GitHub.

## Community & Support

- **GitHub Issues:** Report bugs and feature requests
- **NVIDIA Forum:** [DGX Spark vLLM Discussion](https://forums.developer.nvidia.com/t/run-vllm-in-spark/348862)
- **vLLM Docs:** [Official Documentation](https://docs.vllm.ai/)

## License

MIT License - See the LICENSE file for details.

## Acknowledgments

Developed and tested on NVIDIA DGX Spark systems. Special thanks to:
- vLLM project team
- Triton compiler team
- NVIDIA DGX Spark community
- Claude Code (AI assistant) for documentation automation

---

**Version:** 1.0.0
**Last Updated:** 2025-10-26
**Tested On:** DGX Spark with GB10, CUDA 13.0, Ubuntu 22.04 ARM64