first commit
CRITICAL_FIX_ANALYSIS.md
# Critical Blackwell GB10 Fixes for vLLM

## Overview

Three critical fixes are required to run vLLM on Blackwell GB10 (sm_121a) GPUs with CUDA 13.0+:

1. **CMakeLists.txt SM120 Support** - Add the missing architecture
2. **vLLM Commit Version** - Use a commit with the Blackwell/Triton fixes
3. **Triton Version Pinning** - Use a tested, working commit

## Fix 1: CMakeLists.txt SM120 Support

### Root Cause

In vLLM v0.11.1rc3, `CMakeLists.txt` has **incomplete architecture support** for the MOE kernels on Blackwell GB10 (sm_121a) when building with CUDA 13.0+.

### The Problem

For CUDA 13.0+, the build takes these two branches:
- **Line 490**: Regular MOE kernels
- **Line 671**: Grouped MM MOE kernels

Original v0.11.1rc3:
```cmake
# Line 490
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f" "${CUDA_ARCHS}")

# Line 671
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f" "${CUDA_ARCHS}")
```

**Both lines are missing `12.0f` (SM120) support**, so a GB10 build (compute capability 12.1) matches neither entry.

### The Fix

Both lines need `12.0f` added:
```cmake
# Line 490
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f;12.0f" "${CUDA_ARCHS}")

# Line 671
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f;12.0f" "${CUDA_ARCHS}")
```

### Error Symptoms

Without this fix, importing vLLM fails:
```
ImportError: undefined symbol: _Z20cutlass_moe_mm_sm100RN2at6TensorERKS0_S3_S3_S3_S3_S3_S3_S3_S3_bb
```

The mangled name refers to `cutlass_moe_mm_sm100`: the MOE kernels for SM100/SM120 were never compiled, so the symbol is missing when the extension is loaded.

### Why install.sh Works

The sed command on line 323 of `install.sh`:
```bash
sed -i 's/cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f"/cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f;12.0f"/' CMakeLists.txt
```

sed applies the substitution to every matching line, so this one command fixes **both** line 490 and line 671. Re-running it is harmless: a patched line no longer matches the pattern, because `11.0f` is no longer followed by the closing quote.
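
To confirm that the substitution reached both lines, the patched and unpatched arch lists can be counted after running sed. The following is a minimal sketch, not part of `install.sh`; the helper names `count_patched` and `count_unpatched` are hypothetical, and the script assumes it is run from the vLLM source tree (or given the path to `CMakeLists.txt` as its first argument).

```bash
#!/usr/bin/env bash
# Sketch only: count_patched / count_unpatched are hypothetical helpers.

# Lines that already carry 12.0f in the Blackwell arch list.
count_patched() {
    grep -cF 'SCALED_MM_ARCHS "10.0f;11.0f;12.0f"' "$1"
}

# Lines still on the original list. The closing quote right after 11.0f
# keeps already-patched lines from matching.
count_unpatched() {
    grep -cF 'SCALED_MM_ARCHS "10.0f;11.0f"' "$1"
}

cmake_file="${1:-CMakeLists.txt}"
if [ -f "$cmake_file" ]; then
    echo "patched:   $(count_patched "$cmake_file")"
    echo "unpatched: $(count_unpatched "$cmake_file")"
else
    echo "no $cmake_file in $(pwd); run this from the vLLM source tree"
fi
```

Going by the two locations listed in this document, a patched v0.11.1rc3 tree should report two patched lines and zero unpatched ones. Line numbers may shift in other commits, which is why the check matches on content instead.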

### Verified Solution

Tested on NVIDIA DGX Spark with Blackwell GB10, CUDA 13.0:
- [OK] Line 490 fixed: `"10.0f;11.0f;12.0f"`
- [OK] Line 671 fixed: `"10.0f;11.0f;12.0f"`
- [OK] vLLM imports successfully
- [OK] No cutlass_moe_mm_sm100 symbol errors
- [OK] Build time: ~19 minutes

## Fix 2: vLLM Commit Version

### Issue

vLLM tag `v0.11.1rc3` lacks critical Triton/PyTorch Inductor fixes for Blackwell.

### Solution

Use commit `66a168a197ba214a5b70a74fa2e713c9eeb3251a` (6 commits ahead of v0.11.1rc3):
- Contains Triton JIT compilation fixes
- Includes PyTorch Inductor optimizations for Blackwell
- Adds proper backend registration handling

### Installation

```bash
cd vllm
git checkout 66a168a197ba214a5b70a74fa2e713c9eeb3251a
git submodule update --init --recursive
```
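
Because the build depends on an exact commit, it can help to confirm the checkout before starting a long compile. The sketch below compares `git rev-parse HEAD` against an expected hash; `verify_pin` is a hypothetical helper, and the directory names `vllm` and `triton` are assumptions about the local layout. The same check applies to any other pinned checkout.

```bash
#!/usr/bin/env bash
# Sketch only: verify_pin is a hypothetical helper, not part of vLLM or Triton.

verify_pin() {
    local repo="$1" expected="$2" actual
    # rev-parse fails if the directory is not a git checkout with a commit.
    actual="$(git -C "$repo" rev-parse HEAD 2>/dev/null)" || {
        echo "$repo: not a git checkout" >&2
        return 2
    }
    if [ "$actual" = "$expected" ]; then
        echo "$repo: OK ($actual)"
    else
        echo "$repo: expected $expected, found $actual" >&2
        return 1
    fi
}

# Pins taken from this document; the directory names are assumptions.
if [ -d vllm ]; then
    verify_pin vllm 66a168a197ba214a5b70a74fa2e713c9eeb3251a
fi
if [ -d triton ]; then
    verify_pin triton 4caa0328bf8df64896dd5f6fb9df41b0eb2e750a
fi
```

The function returns 0 on a match, 1 on a mismatch, and 2 when the directory is not a usable checkout, so it can gate a build script with a plain `verify_pin ... || exit 1`.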

## Fix 3: Triton Version Pinning

### Issue

The latest Triton `main` branch (as of late October 2025) has intermittent `JITFunction` compilation issues with PyTorch Inductor on Blackwell.

### Solution

Pin to the tested, working commit `4caa0328bf8df64896dd5f6fb9df41b0eb2e750a` (October 25, 2025):
- Verified stable with Blackwell GB10
- Passes all compilation tests
- No `JITFunction.constexprs` errors

### Installation

```bash
cd triton
git checkout 4caa0328bf8df64896dd5f6fb9df41b0eb2e750a
git submodule update --init --recursive
python -m pip install --no-build-isolation -v .
```

## Complete Verified Configuration

| Component | Version/Commit | Notes |
|-----------|---------------|-------|
| **vLLM** | `66a168a197ba214a5b70a74fa2e713c9eeb3251a` | 6 commits ahead of v0.11.1rc3 |
| **Triton** | `4caa0328bf8df64896dd5f6fb9df41b0eb2e750a` | October 25, 2025 |
| **PyTorch** | `2.9.0+cu130` | From vLLM requirements |
| **CUDA** | `13.0` (V13.0.88) | System CUDA |
| **Python** | `3.12.3` | |
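
To compare a local environment against the table, the installed versions can be printed in one pass. This is a rough sketch under a few assumptions: `have` and `report_versions` are hypothetical helpers, the interpreter is reachable as `python` (override with the `PYTHON` variable), and `nvcc` is on `PATH`. Source builds of vLLM and Triton may report a development version string instead of the commit hash, so the commit pins are better checked with git.

```bash
#!/usr/bin/env bash
# Sketch only: have / report_versions are hypothetical helpers.

have() { command -v "$1" >/dev/null 2>&1; }

report_versions() {
    local py="${PYTHON:-python}"
    if have "$py"; then
        "$py" --version
        # Each import is probed separately so one missing package does not hide the rest.
        "$py" -c 'import torch; print("PyTorch", torch.__version__)' 2>/dev/null \
            || echo "PyTorch not importable"
        "$py" -c 'import triton; print("Triton", triton.__version__)' 2>/dev/null \
            || echo "Triton not importable"
        "$py" -c 'import vllm; print("vLLM", vllm.__version__)' 2>/dev/null \
            || echo "vLLM not importable"
    else
        echo "$py not found"
    fi
    if have nvcc; then
        # The release line carries the toolkit version, e.g. "release 13.0".
        nvcc --version | grep release || echo "nvcc: release line not found"
    else
        echo "nvcc not found"
    fi
}

report_versions
```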

## Testing

Verified working with:
```bash
python -c "from vllm import LLM, SamplingParams; \
llm = LLM(model='Qwen/Qwen2.5-0.5B-Instruct', max_model_len=512); \
print(llm.generate(['Hello'], SamplingParams(max_tokens=20)))"
```

**All tests pass**: Import, compilation, CUDA graphs, and text generation all work correctly.