vLLM Examples for DGX Spark
This directory contains example scripts demonstrating various ways to use vLLM on DGX Spark systems.
Prerequisites
Ensure vLLM is installed and the environment is activated:
# Assuming vllm-install is in your home directory
source ~/vllm-install/vllm_env.sh
Examples
1. Basic Inference (basic_inference.py)
Simple text generation using the vLLM Python API.
Usage:
python basic_inference.py
What it demonstrates:
- Loading a model with vLLM
- Configuring sampling parameters
- Generating multiple completions
- Batch processing
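The core pattern the script follows looks roughly like this (a minimal sketch; the exact prompts and parameters in basic_inference.py may differ):

from vllm import LLM, SamplingParams

# Load a small tested model (see "Tested Models" below)
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")

# One set of sampling parameters shared by the whole batch
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=100)

# Passing a list of prompts gives batched generation in a single call
prompts = [
    "The capital of France is",
    "In one sentence, explain what vLLM does:",
]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt:    {output.prompt}")
    print(f"Generated: {output.outputs[0].text}\n")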
2. API Client (api_client.py)
Using vLLM's OpenAI-compatible REST API.
Prerequisites: Start the vLLM server first:
cd ~/vllm-install
./vllm-serve.sh
Usage:
python api_client.py
What it demonstrates:
- Listing available models
- Simple text completion
- Chat completion
- Streaming responses
- HTTP API interaction
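The same requests can be made from Python with nothing more than the requests library (a minimal sketch; api_client.py may structure this differently or use the official openai client):

import requests

BASE_URL = "http://localhost:8000/v1"

# List the models the server is currently serving
models = requests.get(f"{BASE_URL}/models").json()
print([m["id"] for m in models["data"]])

# Chat completion against the OpenAI-compatible endpoint
payload = {
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is DGX Spark?"},
    ],
    "max_tokens": 100,
    "temperature": 0.7,
}
response = requests.post(f"{BASE_URL}/chat/completions", json=payload).json()
print(response["choices"][0]["message"]["content"])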
3. Batch Processing (batch_processing.py)
Efficient processing of large batches of prompts.
Usage:
python batch_processing.py
What it demonstrates:
- High-throughput batch inference
- Dynamic batching
- Memory-efficient processing
- Performance monitoring
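At its core, high-throughput batching with vLLM means handing generate() a large list of prompts and letting the engine schedule them. A minimal sketch (batch_processing.py adds its own batching and monitoring logic; the prompts here are illustrative):

import time
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", gpu_memory_utilization=0.9)
sampling_params = SamplingParams(temperature=0.7, max_tokens=64)

# A large batch of prompts; vLLM handles scheduling and dynamic batching internally
prompts = [f"Write a one-line summary of topic #{i}." for i in range(256)]

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

# token_ids holds the generated token IDs for each completion
generated_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{len(prompts)} prompts in {elapsed:.1f}s "
      f"({generated_tokens / elapsed:.1f} generated tokens/s)")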
Customization
Change Model
Edit the model name in any example:
llm = LLM(
model="meta-llama/Llama-3.1-8B-Instruct", # Change this
trust_remote_code=True,
gpu_memory_utilization=0.9
)
Adjust Sampling Parameters
Modify SamplingParams for different generation behavior:
sampling_params = SamplingParams(
    temperature=0.7,          # Lower = more deterministic (0.0 = greedy)
    top_p=0.95,               # Nucleus sampling threshold
    max_tokens=100,           # Maximum tokens to generate
    top_k=50,                 # Top-k sampling
    repetition_penalty=1.1    # Penalize repetition
)
GPU Memory Management
Adjust memory utilization:
llm = LLM(
model="...",
gpu_memory_utilization=0.9, # Use 90% of GPU memory (0.0-1.0)
max_model_len=2048 # Maximum sequence length
)
API Server Examples
cURL Examples
List models:
curl http://localhost:8000/v1/models
Simple completion:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-0.5B-Instruct",
"prompt": "The meaning of life is",
"max_tokens": 50,
"temperature": 0.7
}'
Chat completion:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-0.5B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is DGX Spark?"}
],
"max_tokens": 100,
"temperature": 0.7
}'
Streaming completion:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-0.5B-Instruct",
"prompt": "Write a story about",
"max_tokens": 100,
"stream": true
}'
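With "stream": true the server returns server-sent events, one "data: {...}" line per chunk, terminated by "data: [DONE]". A small Python consumer might look like this (a sketch assuming the standard OpenAI-style SSE format):

import json
import requests

payload = {
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "prompt": "Write a story about",
    "max_tokens": 100,
    "stream": True,
}
with requests.post("http://localhost:8000/v1/completions", json=payload, stream=True) as r:
    for line in r.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        print(chunk["choices"][0]["text"], end="", flush=True)
print()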
Tested Models
These models work well on DGX Spark GB10:
- Qwen/Qwen2.5-0.5B-Instruct (small, fast)
- Qwen/Qwen2.5-7B-Instruct (balanced)
- meta-llama/Llama-3.1-8B-Instruct (high quality)
- meta-llama/Llama-3.1-70B-Instruct (requires tensor parallelism)
Performance Tips
- Use GPU memory efficiently:
  - Set gpu_memory_utilization=0.95 for maximum throughput
  - Lower it for models close to the GPU memory limit
- Batch processing:
  - Process multiple prompts together
  - vLLM automatically optimizes batch sizes
- Quantization:
  - For larger models, use quantization:
    llm = LLM(model="...", quantization="awq")
- Tensor parallelism:
  - For models > 20GB, use multiple GPUs:
    llm = LLM(model="...", tensor_parallel_size=2)
Troubleshooting
Out of Memory
Reduce max_model_len or gpu_memory_utilization:
llm = LLM(
model="...",
gpu_memory_utilization=0.8,
max_model_len=2048
)
Slow Generation
Verify the vLLM installation and check GPU utilization:
python -c "import vllm; print(vllm.__version__)"
nvidia-smi # Check GPU utilization
Connection Refused (API)
Ensure server is running:
cd ~/vllm-install
./vllm-status.sh