
vLLM Examples for DGX Spark

This directory contains example scripts demonstrating various ways to use vLLM on DGX Spark systems.

Prerequisites

Ensure vLLM is installed and the environment is activated:

# Assuming vllm-install is in your home directory
source ~/vllm-install/vllm_env.sh
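
To confirm the environment is active, check that vLLM imports (the same check appears under Troubleshooting below):

python -c "import vllm; print(vllm.__version__)"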

Examples

1. Basic Inference (basic_inference.py)

Simple text generation using the vLLM Python API.

Usage:

python basic_inference.py

What it demonstrates:

  • Loading a model with vLLM
  • Configuring sampling parameters
  • Generating multiple completions
  • Batch processing
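
For orientation, a minimal script along these lines might look like the following. This is an illustrative sketch, not the shipped basic_inference.py; the model name is one of the tested models listed further down.

from vllm import LLM, SamplingParams

# Prompts submitted together are processed as one batch
prompts = [
    "The capital of France is",
    "The largest planet in the solar system is",
]

# n=2 requests two completions per prompt
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=50, n=2)

# Load the model once; vLLM manages KV-cache memory internally
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", trust_remote_code=True)

# generate() returns one result per prompt, each holding n completions
for output in llm.generate(prompts, sampling_params):
    for completion in output.outputs:
        print(output.prompt, "->", completion.text)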

2. API Client (api_client.py)

Using vLLM's OpenAI-compatible REST API.

Prerequisites: Start the vLLM server first:

cd ~/vllm-install
./vllm-serve.sh

Usage:

python api_client.py

What it demonstrates:

  • Listing available models
  • Simple text completion
  • Chat completion
  • Streaming responses
  • HTTP API interaction
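
The same interaction can be sketched with plain requests calls against the server's OpenAI-compatible endpoints (a sketch, not the shipped api_client.py, assuming the default port 8000 used throughout this README):

import requests

BASE_URL = "http://localhost:8000/v1"

# Ask the server which model it is hosting
model_id = requests.get(f"{BASE_URL}/models").json()["data"][0]["id"]
print("Serving:", model_id)

# Chat completion through the OpenAI-compatible endpoint
resp = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": model_id,
        "messages": [{"role": "user", "content": "What is DGX Spark?"}],
        "max_tokens": 100,
        "temperature": 0.7,
    },
)
print(resp.json()["choices"][0]["message"]["content"])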

3. Batch Processing (batch_processing.py)

Efficient processing of large batches of prompts.

Usage:

python batch_processing.py

What it demonstrates:

  • High-throughput batch inference
  • Dynamic batching
  • Memory-efficient processing
  • Performance monitoring
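
As a rough sketch of the idea (illustrative, not the shipped batch_processing.py), submitting one large prompt list and timing the run is enough to see vLLM's batching at work:

import time

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", gpu_memory_utilization=0.9)
sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

# One large list; vLLM's continuous batching schedules these efficiently
prompts = [f"Write one sentence about topic {i}." for i in range(256)]

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{total_tokens} tokens in {elapsed:.1f}s ({total_tokens / elapsed:.0f} tok/s)")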

Customization

Change Model

Edit the model name in any example:

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # Change this
    trust_remote_code=True,
    gpu_memory_utilization=0.9
)

Adjust Sampling Parameters

Modify SamplingParams for different generation behavior:

sampling_params = SamplingParams(
    temperature=0.7,      # Lower = more deterministic; 0.0 is effectively greedy
    top_p=0.95,          # Nucleus sampling threshold
    max_tokens=100,      # Maximum tokens to generate
    top_k=50,            # Top-k sampling
    repetition_penalty=1.1  # Penalize repetition
)

GPU Memory Management

Adjust memory utilization:

llm = LLM(
    model="...",
    gpu_memory_utilization=0.9,  # Use 90% of GPU memory (0.0-1.0)
    max_model_len=2048           # Maximum sequence length
)

API Server Examples

cURL Examples

List models:

curl http://localhost:8000/v1/models

Simple completion:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "prompt": "The meaning of life is",
    "max_tokens": 50,
    "temperature": 0.7
  }'

Chat completion:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is DGX Spark?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'

Streaming completion:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "prompt": "Write a story about",
    "max_tokens": 100,
    "stream": true
  }'

Tested Models

These models work well on DGX Spark GB10:

  • Qwen/Qwen2.5-0.5B-Instruct (small, fast)
  • Qwen/Qwen2.5-7B-Instruct (balanced)
  • meta-llama/Llama-3.1-8B-Instruct (high quality)
  • meta-llama/Llama-3.1-70B-Instruct (requires tensor parallelism)

Performance Tips

  1. Use GPU memory efficiently:
    • Set gpu_memory_utilization=0.95 for maximum throughput
    • Lower it for models close to the GPU memory limit
  2. Batch processing:
    • Process multiple prompts together
    • vLLM automatically optimizes batch sizes
  3. Quantization:
    • For larger models, use quantization:
      llm = LLM(model="...", quantization="awq")
  4. Tensor parallelism:
    • For models larger than about 20 GB, use multiple GPUs:
      llm = LLM(model="...", tensor_parallel_size=2)

Troubleshooting

Out of Memory

Reduce max_model_len or gpu_memory_utilization:

llm = LLM(
    model="...",
    gpu_memory_utilization=0.8,
    max_model_len=2048
)

Slow Generation

Verify the vLLM installation and check GPU utilization:

python -c "import vllm; print(vllm.__version__)"
nvidia-smi  # Check GPU utilization

Connection Refused (API)

Ensure server is running:

cd ~/vllm-install
./vllm-status.sh
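
If the status script reports the server as running but connections still fail, probing the health endpoint exposed by vLLM's OpenAI-compatible server can help narrow it down:

curl http://localhost:8000/health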

More Resources