Critical Blackwell GB10 Fixes for vLLM
Overview
Three critical fixes are required for vLLM on Blackwell GB10 (sm_121a) GPUs with CUDA 13.0+:
- CMakeLists.txt SM120 Support - Add missing architecture
- vLLM Commit Version - Use commit with Blackwell/Triton fixes
- Triton Version Pinning - Use tested working commit
Fix 1: CMakeLists.txt SM120 Support
Root Cause
vLLM v0.11.1rc3 CMakeLists.txt has incomplete architecture support for Blackwell GB10 (sm_121a) MOE kernels when using CUDA 13.0+.
The Problem
For CUDA 13.0+, CMakeLists.txt selects the MOE kernel architectures in two places:
- Line 490: Regular MOE kernels
- Line 671: Grouped MM MOE kernels
Original v0.11.1rc3:
```cmake
# Line 490
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f" "${CUDA_ARCHS}")
# Line 671
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f" "${CUDA_ARCHS}")
```
BOTH lines are missing 12.0f (SM120) support!
The Fix
Both lines need 12.0f added:
```cmake
# Line 490
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f;12.0f" "${CUDA_ARCHS}")
# Line 671
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f;12.0f" "${CUDA_ARCHS}")
```
Error Symptoms
Without this fix:
```text
ImportError: undefined symbol: _Z20cutlass_moe_mm_sm100RN2at6TensorERKS0_S3_S3_S3_S3_S3_S3_S3_S3_bb
```
The MOE kernels for SM100/SM120 aren't compiled, causing import failures.
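The undefined symbol is an Itanium-ABI mangled C++ name, so the missing function can be read straight out of the error message. A minimal sketch (the `_Z` prefix is followed by a decimal length and then the identifier itself):

```python
import re

# The undefined symbol from the ImportError above (Itanium C++ name mangling:
# "_Z" + decimal identifier length + identifier + encoded parameter types).
sym = "_Z20cutlass_moe_mm_sm100RN2at6TensorERKS0_S3_S3_S3_S3_S3_S3_S3_S3_bb"

m = re.match(r"_Z(\d+)", sym)
length = int(m.group(1))                  # 20 characters of identifier follow
name = sym[m.end():m.end() + length]
print(name)  # → cutlass_moe_mm_sm100
```

`c++filt` would give the full demangled signature; the point here is only that the unresolved function is the SM100/SM120 MOE GEMM entry point whose compilation the architecture list controls.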
Why install.sh Works
The sed command on line 323 of install.sh:

```bash
sed -i 's/cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f"/cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f;12.0f"/' CMakeLists.txt
```
This replaces ALL occurrences, fixing both lines 490 and 671 in one command.
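The effect of that sed call can be sketched in a few lines of Python (the CMake text is the two v0.11.1rc3 lines quoted above; `str.replace`, like sed's `s///` applied per line, rewrites every occurrence):

```python
# Sketch of the substitution install.sh performs with sed on CMakeLists.txt.
old = 'cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f"'
new = 'cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f;12.0f"'

# The two affected lines from vLLM v0.11.1rc3 (490 and 671).
cmake = (
    'cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f" "${CUDA_ARCHS}")\n'
    'cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f" "${CUDA_ARCHS}")\n'
)

patched = cmake.replace(old, new)  # replaces ALL occurrences, hence both lines
print(patched.count("12.0f"))  # → 2
```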
Verified Solution
Tested on NVIDIA DGX Spark with Blackwell GB10, CUDA 13.0:
- [OK] Line 490 fixed: "10.0f;11.0f;12.0f"
- [OK] Line 671 fixed: "10.0f;11.0f;12.0f"
- [OK] vLLM imports successfully
- [OK] No cutlass_moe_mm_sm100 symbol errors
- [OK] Build time: ~19 minutes
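Before committing to a ~19 minute build, it is worth confirming the checkout is actually patched. A hedged sketch (the CMakeLists.txt path in the comment is an assumption):

```python
# Pre-build check: confirm both MOE branches carry 12.0f and no unpatched
# occurrence remains. v0.11.1rc3 has exactly two affected branches
# (lines 490 and 671).
FIXED = 'cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f;12.0f"'
UNFIXED = 'cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f" '

def sm120_ready(cmake_text: str) -> bool:
    return cmake_text.count(FIXED) == 2 and UNFIXED not in cmake_text

# On a real tree (path assumed):
#   sm120_ready(open("vllm/CMakeLists.txt").read())
sample = FIXED + ' "${CUDA_ARCHS}")\n' + FIXED + ' "${CUDA_ARCHS}")\n'
print(sm120_ready(sample))  # → True
```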
Fix 2: vLLM Commit Version
Issue
vLLM tag v0.11.1rc3 lacks critical Triton/PyTorch Inductor fixes for Blackwell.
Solution
Use commit 66a168a197ba214a5b70a74fa2e713c9eeb3251a (6 commits ahead of v0.11.1rc3):
- Contains Triton JIT compilation fixes
- Includes PyTorch Inductor optimizations for Blackwell
- Adds proper backend registration handling
Installation
```bash
cd vllm
git checkout 66a168a197ba214a5b70a74fa2e713c9eeb3251a
git submodule update --init --recursive
```
Fix 3: Triton Version Pinning
Issue
Latest Triton main branch (as of late October 2025) has intermittent JITFunction compilation issues with PyTorch Inductor on Blackwell.
Solution
Pin Triton to the tested working commit 4caa0328bf8df64896dd5f6fb9df41b0eb2e750a (October 25, 2025):
- Verified stable with Blackwell GB10
- Passes all compilation tests
- No JITFunction.constexprs errors
Installation
```bash
cd triton
git checkout 4caa0328bf8df64896dd5f6fb9df41b0eb2e750a
git submodule update --init --recursive
python -m pip install --no-build-isolation -v .
```
Complete Verified Configuration
| Component | Version/Commit | Notes |
|---|---|---|
| vLLM | 66a168a197ba214a5b70a74fa2e713c9eeb3251a | 6 commits ahead of v0.11.1rc3 |
| Triton | 4caa0328bf8df64896dd5f6fb9df41b0eb2e750a | October 25, 2025 |
| PyTorch | 2.9.0+cu130 | From vLLM requirements |
| CUDA | 13.0 (V13.0.88) | System CUDA |
| Python | 3.12.3 | |
Testing
Verified working with:
```bash
python -c "from vllm import LLM, SamplingParams; \
  llm = LLM(model='Qwen/Qwen2.5-0.5B-Instruct', max_model_len=512); \
  print(llm.generate(['Hello'], SamplingParams(max_tokens=20)))"
```
All tests pass: import, compilation, CUDA graphs, and text generation work correctly.