first commit

commit c05cb71816
2026-03-22 17:26:26 -04:00
15 changed files with 2644 additions and 0 deletions

examples/README.md (new file, 225 lines)

@@ -0,0 +1,225 @@
# vLLM Examples for DGX Spark
This directory contains example scripts demonstrating various ways to use vLLM on DGX Spark systems.
## Prerequisites
Ensure vLLM is installed and the environment is activated:
```bash
# Assuming vllm-install is in your home directory
source ~/vllm-install/vllm_env.sh
```
## Examples
### 1. Basic Inference (`basic_inference.py`)
Simple text generation using the vLLM Python API.
**Usage:**
```bash
python basic_inference.py
```
**What it demonstrates:**
- Loading a model with vLLM
- Configuring sampling parameters
- Generating multiple completions
- Batch processing
### 2. API Client (`api_client.py`)
Using vLLM's OpenAI-compatible REST API.
**Prerequisites:**
Start the vLLM server first:
```bash
cd ~/vllm-install
./vllm-serve.sh
```
**Usage:**
```bash
python api_client.py
```
**What it demonstrates:**
- Listing available models
- Simple text completion
- Chat completion
- Streaming responses
- HTTP API interaction
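
Because the server implements the OpenAI protocol, the official `openai` Python package also works against it. A minimal sketch, assuming `pip install openai` and the default server address used above (the `api_key` value is a placeholder; vLLM does not check it unless the server was started with an API key):
```python
from openai import OpenAI

# Point the client at the local vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[{"role": "user", "content": "What is DGX Spark?"}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```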
### 3. Batch Processing (`batch_processing.py`)
Efficient processing of large batches of prompts.
**Usage:**
```bash
python batch_processing.py
```
**What it demonstrates:**
- High-throughput batch inference
- Dynamic batching
- Memory-efficient processing
- Performance monitoring
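
The full script covers the points above; as a minimal sketch of the core idea (the model name and batch size here are illustrative), vLLM accepts the whole prompt list in a single `generate()` call and batches internally:
```python
import time

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", gpu_memory_utilization=0.9)
sampling_params = SamplingParams(temperature=0.7, max_tokens=64)

# Pass the whole batch at once; vLLM schedules and batches internally
prompts = [f"Write one sentence about topic #{i}:" for i in range(100)]

start = time.time()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.time() - start

total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{len(prompts)} prompts in {elapsed:.1f}s "
      f"({total_tokens / elapsed:.0f} tokens/s)")
```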
## Customization
### Change Model
Edit the model name in any example:
```python
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # Change this
    trust_remote_code=True,
    gpu_memory_utilization=0.9
)
```
### Adjust Sampling Parameters
Modify `SamplingParams` for different generation behavior:
```python
sampling_params = SamplingParams(
    temperature=0.7,         # Lower = more deterministic; 0.0 = greedy
    top_p=0.95,              # Nucleus sampling threshold
    max_tokens=100,          # Maximum tokens to generate
    top_k=50,                # Top-k sampling
    repetition_penalty=1.1   # Penalize repetition
)
```
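For reproducible output, set `temperature=0.0`, which makes decoding greedy (the most likely token is always chosen), for example:
```python
greedy_params = SamplingParams(temperature=0.0, max_tokens=100)
```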
### GPU Memory Management
Adjust memory utilization:
```python
llm = LLM(
    model="...",
    gpu_memory_utilization=0.9,  # Use 90% of GPU memory (0.0-1.0)
    max_model_len=2048           # Maximum sequence length
)
```
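To pick a safe value, you can check free GPU memory first. A small sketch using PyTorch (installed as a vLLM dependency); `torch.cuda.mem_get_info()` returns free and total bytes, and the suggested fraction below is only a rough heuristic:
```python
import torch

free, total = torch.cuda.mem_get_info()
print(f"Free: {free / 1e9:.1f} GB of {total / 1e9:.1f} GB")

# Rough heuristic: don't claim more than the currently free fraction
suggested = min(0.95, round(free / total, 2))
print(f"Suggested gpu_memory_utilization: {suggested}")
```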
## API Server Examples
### cURL Examples
**List models:**
```bash
curl http://localhost:8000/v1/models
```
**Simple completion:**
```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "prompt": "The meaning of life is",
    "max_tokens": 50,
    "temperature": 0.7
  }'
```
**Chat completion:**
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is DGX Spark?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'
```
**Streaming completion:**
```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "prompt": "Write a story about",
    "max_tokens": 100,
    "stream": true
  }'
```
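Each streamed line is a server-sent event of the form `data: {...}`, and the stream ends with `data: [DONE]`. A minimal Python consumer mirroring the cURL request above (same endpoint and payload; requires Python 3.9+ for `removeprefix`):
```python
import json

import requests

payload = {
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "prompt": "Write a story about",
    "max_tokens": 100,
    "stream": True,
}
with requests.post("http://localhost:8000/v1/completions",
                   json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = line.decode("utf-8").removeprefix("data: ")
        if chunk == "[DONE]":
            break
        print(json.loads(chunk)["choices"][0]["text"], end="", flush=True)
```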
## Tested Models
These models work well on DGX Spark GB10:
- `Qwen/Qwen2.5-0.5B-Instruct` (small, fast)
- `Qwen/Qwen2.5-7B-Instruct` (balanced)
- `meta-llama/Llama-3.1-8B-Instruct` (high quality)
- `meta-llama/Llama-3.1-70B-Instruct` (requires tensor parallelism)
## Performance Tips
1. **Use GPU memory efficiently:**
   - Set `gpu_memory_utilization=0.95` for maximum throughput
   - Lower it for models close to the GPU memory limit
2. **Batch processing:**
   - Process multiple prompts together
   - vLLM automatically optimizes batch sizes
3. **Quantization:**
   - For larger models, use quantization:
     ```python
     llm = LLM(model="...", quantization="awq")
     ```
4. **Tensor parallelism:**
   - For models > 20 GB, use multiple GPUs:
     ```python
     llm = LLM(model="...", tensor_parallel_size=2)
     ```
## Troubleshooting
### Out of Memory
Reduce `max_model_len` or `gpu_memory_utilization`:
```python
llm = LLM(
    model="...",
    gpu_memory_utilization=0.8,
    max_model_len=2048
)
```
### Slow Generation
Verify that vLLM is installed correctly and check GPU utilization:
```bash
python -c "import vllm; print(vllm.__version__)"
nvidia-smi  # Check GPU utilization
```
### Connection Refused (API)
Ensure the server is running:
```bash
cd ~/vllm-install
./vllm-status.sh
```
## More Resources
- [vLLM Documentation](https://docs.vllm.ai/)
- [OpenAI API Compatibility](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html)
- [Main README](../README.md)
- [Cluster Setup](../CLUSTER.md)

examples/api_client.py (new file, 160 lines)

@@ -0,0 +1,160 @@
#!/usr/bin/env python3
"""
vLLM OpenAI-Compatible API Client Example

Demonstrates using vLLM's OpenAI-compatible API endpoints.
"""
import json
from typing import Dict, List, Optional

import requests
class VLLMClient:
    """Simple client for the vLLM OpenAI-compatible API."""

    def __init__(self, base_url: str = "http://localhost:8000"):
        self.base_url = base_url.rstrip('/')

    def list_models(self) -> Dict:
        """List available models."""
        response = requests.get(f"{self.base_url}/v1/models")
        response.raise_for_status()
        return response.json()

    def complete(
        self,
        prompt: str,
        model: Optional[str] = None,
        max_tokens: int = 100,
        temperature: float = 0.7,
        stream: bool = False
    ):
        """Generate a completion.

        Returns parsed JSON, or an iterator over raw SSE lines if stream=True.
        """
        # Fall back to the first served model if none is specified
        if model is None:
            models = self.list_models()
            model = models['data'][0]['id']
        payload = {
            "model": model,
            "prompt": prompt,
            "max_tokens": max_tokens,
            "temperature": temperature,
            "stream": stream
        }
        response = requests.post(
            f"{self.base_url}/v1/completions",
            json=payload,
            headers={"Content-Type": "application/json"},
            stream=stream
        )
        response.raise_for_status()
        if stream:
            return response.iter_lines()
        return response.json()

    def chat(
        self,
        messages: List[Dict[str, str]],
        model: Optional[str] = None,
        max_tokens: int = 100,
        temperature: float = 0.7,
        stream: bool = False
    ):
        """Generate a chat completion.

        Returns parsed JSON, or an iterator over raw SSE lines if stream=True.
        """
        # Fall back to the first served model if none is specified
        if model is None:
            models = self.list_models()
            model = models['data'][0]['id']
        payload = {
            "model": model,
            "messages": messages,
            "max_tokens": max_tokens,
            "temperature": temperature,
            "stream": stream
        }
        response = requests.post(
            f"{self.base_url}/v1/chat/completions",
            json=payload,
            headers={"Content-Type": "application/json"},
            stream=stream
        )
        response.raise_for_status()
        if stream:
            return response.iter_lines()
        return response.json()
def main():
    # Initialize client
    client = VLLMClient("http://localhost:8000")

    print("=" * 60)
    print("vLLM API Client Examples")
    print("=" * 60)

    # Example 1: List models
    print("\n1. Listing available models...")
    models = client.list_models()
    for model in models['data']:
        print(f"   - {model['id']}")

    # Example 2: Simple completion
    print("\n2. Simple completion...")
    result = client.complete(
        prompt="The capital of France is",
        max_tokens=10,
        temperature=0.0
    )
    print("   Prompt: The capital of France is")
    print(f"   Response: {result['choices'][0]['text']}")

    # Example 3: Chat completion
    print("\n3. Chat completion...")
    messages = [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "What is the Blackwell GPU architecture?"}
    ]
    result = client.chat(
        messages=messages,
        max_tokens=100,
        temperature=0.7
    )
    print(f"   User: {messages[1]['content']}")
    print(f"   Assistant: {result['choices'][0]['message']['content']}")

    # Example 4: Streaming completion
    print("\n4. Streaming completion...")
    print("   Prompt: Write a short poem about AI")
    print("   Response: ", end="", flush=True)
    stream = client.complete(
        prompt="Write a short poem about AI",
        max_tokens=50,
        temperature=0.8,
        stream=True
    )
    for line in stream:
        if not line:
            continue
        # SSE lines look like "data: {...}"; the stream ends with "data: [DONE]"
        chunk = line.decode('utf-8').removeprefix('data: ')
        if chunk == '[DONE]':
            break
        try:
            data = json.loads(chunk)
            if data.get('choices'):
                print(data['choices'][0].get('text', ''), end="", flush=True)
        except json.JSONDecodeError:
            pass
    print("\n")
    print("=" * 60)


if __name__ == "__main__":
    main()

examples/basic_inference.py (new file, 48 lines)

@@ -0,0 +1,48 @@
#!/usr/bin/env python3
"""
Basic vLLM Inference Example for DGX Spark

Demonstrates simple text generation using the vLLM Python API.
"""
from vllm import LLM, SamplingParams


def main():
    # Initialize the model.
    # A small model is used for testing; replace with your preferred model.
    print("Loading model...")
    llm = LLM(
        model="Qwen/Qwen2.5-0.5B-Instruct",
        trust_remote_code=True,
        gpu_memory_utilization=0.9,
        max_model_len=2048
    )

    # Define prompts
    prompts = [
        "What is the NVIDIA DGX Spark?",
        "Explain the Blackwell GPU architecture in simple terms.",
        "Write a haiku about artificial intelligence."
    ]

    # Configure sampling parameters
    sampling_params = SamplingParams(
        temperature=0.7,
        top_p=0.95,
        max_tokens=100,
        stop=["</s>", "\n\n\n"]
    )

    # Generate responses for the whole batch in one call
    print("\nGenerating responses...\n")
    outputs = llm.generate(prompts, sampling_params)

    # Print results
    for i, output in enumerate(outputs):
        print("=" * 60)
        print(f"Prompt {i + 1}: {prompts[i]}")
        print("-" * 60)
        print(f"Response: {output.outputs[0].text}")
        print("=" * 60 + "\n")


if __name__ == "__main__":
    main()