vLLM Examples for DGX Spark
This directory contains example scripts demonstrating various ways to use vLLM on DGX Spark systems.
Prerequisites
Ensure vLLM is installed and the environment is activated:
# Assuming vllm-install is in your home directory
source ~/vllm-install/vllm_env.sh
Examples
1. Basic Inference (basic_inference.py)
Simple text generation using the vLLM Python API.
Usage:
python basic_inference.py
What it demonstrates:
- Loading a model with vLLM
- Configuring sampling parameters
- Generating multiple completions
- Batch processing
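The core pattern the script follows looks roughly like this (a minimal sketch; the exact prompts and parameters in basic_inference.py may differ):

from vllm import LLM, SamplingParams

# Load a small tested model (see "Tested Models" below)
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")

# One set of sampling parameters shared by the whole batch
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=100)

# Passing a list of prompts gives batched generation in a single call
prompts = [
    "The capital of France is",
    "In one sentence, explain what vLLM does:",
]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt:    {output.prompt}")
    print(f"Generated: {output.outputs[0].text}\n")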
2. API Client (api_client.py)
Using vLLM's OpenAI-compatible REST API.
Prerequisites: Start the vLLM server first:
cd ~/vllm-install
./vllm-serve.sh
Usage:
python api_client.py
What it demonstrates:
- Listing available models
- Simple text completion
- Chat completion
- Streaming responses
- HTTP API interaction
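The same requests can be made from Python with nothing more than the requests library (a minimal sketch; api_client.py may structure this differently or use the official openai client):

import requests

BASE_URL = "http://localhost:8000/v1"

# List the models the server is currently serving
models = requests.get(f"{BASE_URL}/models").json()
print([m["id"] for m in models["data"]])

# Chat completion against the OpenAI-compatible endpoint
payload = {
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is DGX Spark?"},
    ],
    "max_tokens": 100,
    "temperature": 0.7,
}
response = requests.post(f"{BASE_URL}/chat/completions", json=payload).json()
print(response["choices"][0]["message"]["content"])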
3. Batch Processing (batch_processing.py)
Efficient processing of large batches of prompts.
Usage:
python batch_processing.py
What it demonstrates:
- High-throughput batch inference
- Dynamic batching
- Memory-efficient processing
- Performance monitoring
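At its core, high-throughput batching with vLLM means handing generate() a large list of prompts and letting the engine schedule them. A minimal sketch (batch_processing.py adds its own batching and monitoring logic; the prompts here are illustrative):

import time
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", gpu_memory_utilization=0.9)
sampling_params = SamplingParams(temperature=0.7, max_tokens=64)

# A large batch of prompts; vLLM handles scheduling and dynamic batching internally
prompts = [f"Write a one-line summary of topic #{i}." for i in range(256)]

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

# token_ids holds the generated token IDs for each completion
generated_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{len(prompts)} prompts in {elapsed:.1f}s "
      f"({generated_tokens / elapsed:.1f} generated tokens/s)")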
Customization
Change Model
Edit the model name in any example:
llm = LLM(
model="meta-llama/Llama-3.1-8B-Instruct", # Change this
trust_remote_code=True,
gpu_memory_utilization=0.9
)
Adjust Sampling Parameters
Modify SamplingParams for different generation behavior:
sampling_params = SamplingParams(
    temperature=0.7,          # Lower = more deterministic (0.0 = greedy)
    top_p=0.95,               # Nucleus sampling threshold
    max_tokens=100,           # Maximum tokens to generate
    top_k=50,                 # Top-k sampling
    repetition_penalty=1.1    # Penalize repetition
)
GPU Memory Management
Adjust memory utilization:
llm = LLM(
model="...",
gpu_memory_utilization=0.9, # Use 90% of GPU memory (0.0-1.0)
max_model_len=2048 # Maximum sequence length
)
API Server Examples
cURL Examples
List models:
curl http://localhost:8000/v1/models
Simple completion:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-0.5B-Instruct",
"prompt": "The meaning of life is",
"max_tokens": 50,
"temperature": 0.7
}'
Chat completion:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-0.5B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is DGX Spark?"}
],
"max_tokens": 100,
"temperature": 0.7
}'
Streaming completion:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-0.5B-Instruct",
"prompt": "Write a story about",
"max_tokens": 100,
"stream": true
}'
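With "stream": true the server returns server-sent events, one "data: {...}" line per chunk, terminated by "data: [DONE]". A small Python consumer might look like this (a sketch assuming the standard OpenAI-style SSE format):

import json
import requests

payload = {
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "prompt": "Write a story about",
    "max_tokens": 100,
    "stream": True,
}
with requests.post("http://localhost:8000/v1/completions", json=payload, stream=True) as r:
    for line in r.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        print(chunk["choices"][0]["text"], end="", flush=True)
print()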
Tested Models
These models work well on DGX Spark GB10:
- Qwen/Qwen2.5-0.5B-Instruct (small, fast)
- Qwen/Qwen2.5-7B-Instruct (balanced)
- meta-llama/Llama-3.1-8B-Instruct (high quality)
- meta-llama/Llama-3.1-70B-Instruct (requires tensor parallelism)
Performance Tips
- Use GPU memory efficiently:
  - Set gpu_memory_utilization=0.95 for maximum throughput
  - Lower it for models close to the GPU memory limit
- Batch processing:
  - Process multiple prompts together
  - vLLM automatically optimizes batch sizes
- Quantization:
  - For larger models, use quantization:
    llm = LLM(model="...", quantization="awq")
- Tensor parallelism:
  - For models > 20GB, use multiple GPUs:
    llm = LLM(model="...", tensor_parallel_size=2)
Troubleshooting
Out of Memory
Reduce max_model_len or gpu_memory_utilization:
llm = LLM(
model="...",
gpu_memory_utilization=0.8,
max_model_len=2048
)
Slow Generation
Verify the vLLM installation and check GPU utilization:
python -c "import vllm; print(vllm.__version__)"
nvidia-smi # Check GPU utilization
Connection Refused (API)
Ensure server is running:
cd ~/vllm-install
./vllm-status.sh