vLLM Examples for DGX Spark

This directory contains example scripts demonstrating various ways to use vLLM on DGX Spark systems.

Prerequisites

Ensure vLLM is installed and the environment is activated:

# Assuming vllm-install is in your home directory
source ~/vllm-install/vllm_env.sh

Examples

1. Basic Inference (basic_inference.py)

Simple text generation using the vLLM Python API.

Usage:

python basic_inference.py

What it demonstrates:

  • Loading a model with vLLM
  • Configuring sampling parameters
  • Generating multiple completions
  • Batch processing
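
The core pattern the script follows looks roughly like this (a minimal sketch; the actual prompts and parameters in basic_inference.py may differ):

from vllm import LLM, SamplingParams

# Load the model once; vLLM handles weight placement and KV-cache allocation
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")

# One SamplingParams object applies to every prompt passed to generate()
sampling_params = SamplingParams(temperature=0.7, max_tokens=50)

# Passing a list of prompts runs them as a single batched call
prompts = ["Hello, my name is", "The capital of France is"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)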

2. API Client (api_client.py)

Using vLLM's OpenAI-compatible REST API.

Prerequisites: Start the vLLM server first:

cd ~/vllm-install
./vllm-serve.sh

Usage:

python api_client.py

What it demonstrates:

  • Listing available models
  • Simple text completion
  • Chat completion
  • Streaming responses
  • HTTP API interaction
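
In essence, the client just issues HTTP requests against the OpenAI-compatible endpoints. A minimal sketch using requests (the URL assumes the default port from vllm-serve.sh; the exact calls in api_client.py may differ):

import requests

BASE_URL = "http://localhost:8000/v1"

# List the models the server is currently serving
models = requests.get(f"{BASE_URL}/models").json()
print([m["id"] for m in models["data"]])

# Chat completion; the model name must match one of the served models
response = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "messages": [{"role": "user", "content": "What is DGX Spark?"}],
        "max_tokens": 100,
    },
)
print(response.json()["choices"][0]["message"]["content"])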

3. Batch Processing (batch_processing.py)

Efficient processing of large batches of prompts.

Usage:

python batch_processing.py

What it demonstrates:

  • High-throughput batch inference
  • Dynamic batching
  • Memory-efficient processing
  • Performance monitoring
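
Conceptually, high throughput comes from handing vLLM the whole workload at once and letting its scheduler do the batching. A rough sketch with a simple tokens-per-second measurement (the prompt set and model here are placeholders, not what batch_processing.py necessarily uses):

import time
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", gpu_memory_utilization=0.9)
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

# Submit the full prompt list in one call; continuous batching schedules the work
prompts = [f"Write a one-line fact about the number {i}." for i in range(256)]

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{len(prompts)} prompts, {generated} tokens in {elapsed:.1f}s "
      f"({generated / elapsed:.0f} tok/s)")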

Customization

Change Model

Edit the model name in any example:

from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # Change this
    trust_remote_code=True,
    gpu_memory_utilization=0.9
)

Adjust Sampling Parameters

Modify SamplingParams for different generation behavior:

from vllm import SamplingParams

sampling_params = SamplingParams(
    temperature=0.7,        # Lower = more deterministic (0.0 = greedy)
    top_p=0.95,             # Nucleus sampling threshold
    max_tokens=100,         # Maximum tokens to generate
    top_k=50,               # Top-k sampling
    repetition_penalty=1.1  # Penalize repetition
)

GPU Memory Management

Adjust memory utilization:

llm = LLM(
    model="...",
    gpu_memory_utilization=0.9,  # Use 90% of GPU memory (0.0-1.0)
    max_model_len=2048           # Maximum sequence length
)

API Server Examples

cURL Examples

List models:

curl http://localhost:8000/v1/models

Simple completion:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "prompt": "The meaning of life is",
    "max_tokens": 50,
    "temperature": 0.7
  }'

Chat completion:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is DGX Spark?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'

Streaming completion:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "prompt": "Write a story about",
    "max_tokens": 100,
    "stream": true
  }'
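
To consume the stream from Python instead of cURL, read the response line by line; this sketch assumes the standard Server-Sent Events framing ("data: {json}" lines terminated by "data: [DONE]") that the OpenAI-compatible server emits:

import json
import requests

with requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "prompt": "Write a story about",
        "max_tokens": 100,
        "stream": True,
    },
    stream=True,
) as response:
    for line in response.iter_lines():
        if not line:
            continue
        payload = line.decode("utf-8").removeprefix("data: ")
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        # Each chunk carries an incremental piece of the generated text
        print(chunk["choices"][0]["text"], end="", flush=True)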

Tested Models

These models work well on the DGX Spark (GB10):

  • Qwen/Qwen2.5-0.5B-Instruct (small, fast)
  • Qwen/Qwen2.5-7B-Instruct (balanced)
  • meta-llama/Llama-3.1-8B-Instruct (high quality)
  • meta-llama/Llama-3.1-70B-Instruct (requires tensor parallelism)

Performance Tips

  1. Use GPU memory efficiently:

    • Set gpu_memory_utilization=0.95 for maximum throughput
    • Lower it for models that sit close to the GPU memory limit
  2. Batch processing:

    • Process multiple prompts together in a single generate() call
    • vLLM automatically optimizes batch sizes
  3. Quantization:

    • For larger models, load a quantized checkpoint:
      llm = LLM(model="...", quantization="awq")
  4. Tensor parallelism:

    • For models > 20GB, use multiple GPUs (see the combined sketch after this list):
      llm = LLM(model="...", tensor_parallel_size=2)

Troubleshooting

Out of Memory

Reduce max_model_len or gpu_memory_utilization:

llm = LLM(
    model="...",
    gpu_memory_utilization=0.8,
    max_model_len=2048
)

Slow Generation

Verify the vLLM installation and check GPU utilization:

python -c "import vllm; print(vllm.__version__)"
nvidia-smi  # Check GPU utilization

Connection Refused (API)

Ensure server is running:

cd ~/vllm-install
./vllm-status.sh

More Resources