
Critical Blackwell GB10 Fixes for vLLM

Overview

Three critical fixes are required for vLLM on Blackwell GB10 (sm_121a) GPUs with CUDA 13.0+:

  1. CMakeLists.txt SM120 Support - Add missing architecture
  2. vLLM Commit Version - Use commit with Blackwell/Triton fixes
  3. Triton Version Pinning - Use tested working commit

Fix 1: CMakeLists.txt SM120 Support

Root Cause

The CMakeLists.txt in vLLM v0.11.1rc3 has incomplete architecture support for the Blackwell GB10 (sm_121a) MOE kernels when building with CUDA 13.0+.

The Problem

For CUDA 13.0+, two CMake branches select the MOE kernel architectures:

  • Line 490: Regular MOE kernels
  • Line 671: Grouped MM MOE kernels

Original v0.11.1rc3:

# Line 490
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f" "${CUDA_ARCHS}")

# Line 671
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f" "${CUDA_ARCHS}")

BOTH lines are missing 12.0f (SM120) support!
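Why a missing entry silently drops kernels can be seen with a toy Python model of the intersection step (an illustrative sketch only, not vLLM's actual CMake implementation; the real cuda_archs_loose_intersection also handles family/accelerator suffix matching more subtly):

```python
# Toy model: keep only the requested archs that the build's CUDA_ARCHS
# list also contains (suffixes like "f" stripped for comparison).
# Illustrative only -- not the real CMake function.
def loose_intersection(requested, available):
    avail = {a.rstrip("af") for a in available}
    return [r for r in requested if r.rstrip("af") in avail]

# With "12.0f" absent from the requested list, a 12.0-only build gets an
# empty intersection, so no SCALED_MM kernels are compiled for that arch:
print(loose_intersection(["10.0f", "11.0f"], ["12.0f"]))           # []
print(loose_intersection(["10.0f", "11.0f", "12.0f"], ["12.0f"]))  # ['12.0f']
```

An empty intersection is not an error at configure time, which is why the problem only surfaces later as a missing symbol at import.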

The Fix

Both lines need 12.0f added:

# Line 490
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f;12.0f" "${CUDA_ARCHS}")

# Line 671
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f;12.0f" "${CUDA_ARCHS}")

Error Symptoms

Without this fix:

ImportError: undefined symbol: _Z20cutlass_moe_mm_sm100RN2at6TensorERKS0_S3_S3_S3_S3_S3_S3_S3_S3_bb

The MOE kernels for SM100/SM120 aren't compiled, causing import failures.
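The mangled name in the error can be decoded to confirm which kernel failed to link; c++filt (shipped with binutils) demangles it:

```shell
# Demangle the unresolved symbol from the ImportError; the output starts
# with cutlass_moe_mm_sm100(at::Tensor&, ... -- the SM100 MOE GEMM entry
# point that was never compiled
c++filt _Z20cutlass_moe_mm_sm100RN2at6TensorERKS0_S3_S3_S3_S3_S3_S3_S3_S3_bb
```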

Why install.sh Works

The sed command on line 323:

sed -i 's/cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f"/cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f;12.0f"/' CMakeLists.txt

sed applies the substitution to every matching line, so both line 490 and line 671 are fixed in one command.
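The behavior can be reproduced on a scratch file (a minimal stand-in for the two affected CMakeLists.txt lines, not the real file):

```shell
# Minimal reproduction: a scratch file containing both unpatched lines
# (stand-in for CMakeLists.txt lines 490 and 671)
cat > /tmp/cmake_demo.txt <<'EOF'
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f" "${CUDA_ARCHS}")
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f" "${CUDA_ARCHS}")
EOF

# The same substitution install.sh runs; sed applies it per matching line
sed -i 's/cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f"/cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f;12.0f"/' /tmp/cmake_demo.txt

grep -c '12.0f' /tmp/cmake_demo.txt   # prints 2: both lines patched
```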

Verified Solution

Tested on NVIDIA DGX Spark with Blackwell GB10, CUDA 13.0:

  • [OK] Line 490 fixed: "10.0f;11.0f;12.0f"
  • [OK] Line 671 fixed: "10.0f;11.0f;12.0f"
  • [OK] vLLM imports successfully
  • [OK] No cutlass_moe_mm_sm100 symbol errors
  • [OK] Build time: ~19 minutes

Fix 2: vLLM Commit Version

Issue

vLLM tag v0.11.1rc3 lacks critical Triton/PyTorch Inductor fixes for Blackwell.

Solution

Use commit 66a168a197ba214a5b70a74fa2e713c9eeb3251a (6 commits ahead of v0.11.1rc3):

  • Contains Triton JIT compilation fixes
  • Includes PyTorch Inductor optimizations for Blackwell
  • Adds proper backend registration handling

Installation

cd vllm
git checkout 66a168a197ba214a5b70a74fa2e713c9eeb3251a
git submodule update --init --recursive

Fix 3: Triton Version Pinning

Issue

Latest Triton main branch (as of late October 2025) has intermittent JITFunction compilation issues with PyTorch Inductor on Blackwell.

Solution

Pin to tested working commit: 4caa0328bf8df64896dd5f6fb9df41b0eb2e750a (October 25, 2025)

  • Verified stable with Blackwell GB10
  • Passes all compilation tests
  • No JITFunction.constexprs errors

Installation

cd triton
git checkout 4caa0328bf8df64896dd5f6fb9df41b0eb2e750a
git submodule update --init --recursive
python -m pip install --no-build-isolation -v .

Complete Verified Configuration

Component   Version/Commit                             Notes
vLLM        66a168a197ba214a5b70a74fa2e713c9eeb3251a   6 commits ahead of v0.11.1rc3
Triton      4caa0328bf8df64896dd5f6fb9df41b0eb2e750a   October 25, 2025
PyTorch     2.9.0+cu130                                From vLLM requirements
CUDA        13.0 (V13.0.88)                            System CUDA
Python      3.12.3

Testing

Verified working with:

python -c "from vllm import LLM, SamplingParams; \
llm = LLM(model='Qwen/Qwen2.5-0.5B-Instruct', max_model_len=512); \
print(llm.generate(['Hello'], SamplingParams(max_tokens=20)))"

All tests pass: Import, compilation, CUDA graphs, and text generation all work correctly.