Critical Blackwell GB10 Fixes for vLLM
Overview
Three critical fixes are required for vLLM on Blackwell GB10 (sm_121a) GPUs with CUDA 13.0+:
- CMakeLists.txt SM120 Support - Add missing architecture
- vLLM Commit Version - Use commit with Blackwell/Triton fixes
- Triton Version Pinning - Use tested working commit
Fix 1: CMakeLists.txt SM120 Support
Root Cause
vLLM v0.11.1rc3 CMakeLists.txt has incomplete architecture support for Blackwell GB10 (sm_121a) MOE kernels when using CUDA 13.0+.
The Problem
For CUDA 13.0+, CMakeLists.txt selects the MOE kernel architectures in two places:
- Line 490: Regular MOE kernels
- Line 671: Grouped MM MOE kernels
Original v0.11.1rc3:
```cmake
# Line 490
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f" "${CUDA_ARCHS}")
# Line 671
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f" "${CUDA_ARCHS}")
```
BOTH lines are missing 12.0f (SM120) support!
The Fix
Both lines need 12.0f added:
```cmake
# Line 490
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f;12.0f" "${CUDA_ARCHS}")
# Line 671
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f;12.0f" "${CUDA_ARCHS}")
```
Error Symptoms
Without this fix:
```text
ImportError: undefined symbol: _Z20cutlass_moe_mm_sm100RN2at6TensorERKS0_S3_S3_S3_S3_S3_S3_S3_S3_bb
```
The MOE kernels for SM100/SM120 aren't compiled, causing import failures.
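The undefined symbol is an Itanium-ABI mangled C++ name, so the missing function can be read straight out of the error message. A minimal sketch (the `_Z` prefix is followed by a decimal length and then the identifier itself):

```python
import re

# The undefined symbol from the ImportError above (Itanium C++ name mangling:
# "_Z" + decimal identifier length + identifier + encoded parameter types).
sym = "_Z20cutlass_moe_mm_sm100RN2at6TensorERKS0_S3_S3_S3_S3_S3_S3_S3_S3_bb"

m = re.match(r"_Z(\d+)", sym)
length = int(m.group(1))                  # 20 characters of identifier follow
name = sym[m.end():m.end() + length]
print(name)  # → cutlass_moe_mm_sm100
```

`c++filt` would give the full demangled signature; the point here is only that the unresolved function is the SM100/SM120 MOE GEMM entry point whose compilation the architecture list controls.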
Why install.sh Works
The sed command on line 323 of install.sh:

```bash
sed -i 's/cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f"/cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f;12.0f"/' CMakeLists.txt
```
This replaces ALL occurrences, fixing both lines 490 and 671 in one command.
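The effect of that sed call can be sketched in a few lines of Python (the CMake text is the two v0.11.1rc3 lines quoted above; `str.replace`, like sed's `s///` applied per line, rewrites every occurrence):

```python
# Sketch of the substitution install.sh performs with sed on CMakeLists.txt.
old = 'cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f"'
new = 'cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f;12.0f"'

# The two affected lines from vLLM v0.11.1rc3 (490 and 671).
cmake = (
    'cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f" "${CUDA_ARCHS}")\n'
    'cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f" "${CUDA_ARCHS}")\n'
)

patched = cmake.replace(old, new)  # replaces ALL occurrences, hence both lines
print(patched.count("12.0f"))  # → 2
```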
Verified Solution
Tested on NVIDIA DGX Spark with Blackwell GB10, CUDA 13.0:
- [OK] Line 490 fixed: "10.0f;11.0f;12.0f"
- [OK] Line 671 fixed: "10.0f;11.0f;12.0f"
- [OK] vLLM imports successfully
- [OK] No cutlass_moe_mm_sm100 symbol errors
- [OK] Build time: ~19 minutes
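Before committing to a ~19 minute build, it is worth confirming the checkout is actually patched. A hedged sketch (the CMakeLists.txt path in the comment is an assumption):

```python
# Pre-build check: confirm both MOE branches carry 12.0f and no unpatched
# occurrence remains. v0.11.1rc3 has exactly two affected branches
# (lines 490 and 671).
FIXED = 'cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f;12.0f"'
UNFIXED = 'cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f" '

def sm120_ready(cmake_text: str) -> bool:
    return cmake_text.count(FIXED) == 2 and UNFIXED not in cmake_text

# On a real tree (path assumed):
#   sm120_ready(open("vllm/CMakeLists.txt").read())
sample = FIXED + ' "${CUDA_ARCHS}")\n' + FIXED + ' "${CUDA_ARCHS}")\n'
print(sm120_ready(sample))  # → True
```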
Fix 2: vLLM Commit Version
Issue
vLLM tag v0.11.1rc3 lacks critical Triton/PyTorch Inductor fixes for Blackwell.
Solution
Use commit 66a168a197ba214a5b70a74fa2e713c9eeb3251a (6 commits ahead of v0.11.1rc3):
- Contains Triton JIT compilation fixes
- Includes PyTorch Inductor optimizations for Blackwell
- Adds proper backend registration handling
Installation
```bash
cd vllm
git checkout 66a168a197ba214a5b70a74fa2e713c9eeb3251a
git submodule update --init --recursive
```
Fix 3: Triton Version Pinning
Issue
Latest Triton main branch (as of late October 2025) has intermittent JITFunction compilation issues with PyTorch Inductor on Blackwell.
Solution
Pin Triton to the tested working commit 4caa0328bf8df64896dd5f6fb9df41b0eb2e750a (October 25, 2025):
- Verified stable with Blackwell GB10
- Passes all compilation tests
- No JITFunction.constexprs errors
Installation
```bash
cd triton
git checkout 4caa0328bf8df64896dd5f6fb9df41b0eb2e750a
git submodule update --init --recursive
python -m pip install --no-build-isolation -v .
```
Complete Verified Configuration
| Component | Version/Commit | Notes |
|---|---|---|
| vLLM | 66a168a197ba214a5b70a74fa2e713c9eeb3251a | 6 commits ahead of v0.11.1rc3 |
| Triton | 4caa0328bf8df64896dd5f6fb9df41b0eb2e750a | October 25, 2025 |
| PyTorch | 2.9.0+cu130 | From vLLM requirements |
| CUDA | 13.0 (V13.0.88) | System CUDA |
| Python | 3.12.3 | |
Testing
Verified working with:
```bash
python -c "from vllm import LLM, SamplingParams; \
  llm = LLM(model='Qwen/Qwen2.5-0.5B-Instruct', max_model_len=512); \
  print(llm.generate(['Hello'], SamplingParams(max_tokens=20)))"
```
All tests pass: import, compilation, CUDA graphs, and text generation work correctly.