first commit

examples/README.md (new file, 225 lines)

# vLLM Examples for DGX Spark

This directory contains example scripts demonstrating various ways to use vLLM on DGX Spark systems.

## Prerequisites

Ensure vLLM is installed and the environment is activated:

```bash
# Assuming vllm-install is in your home directory
source ~/vllm-install/vllm_env.sh
```

## Examples

### 1. Basic Inference (`basic_inference.py`)

Simple text generation using the vLLM Python API.

**Usage:**
```bash
python basic_inference.py
```

**What it demonstrates:**
- Loading a model with vLLM
- Configuring sampling parameters
- Generating multiple completions
- Batch processing

### 2. API Client (`api_client.py`)

Using vLLM's OpenAI-compatible REST API.

**Prerequisites:**
Start the vLLM server first:
```bash
cd ~/vllm-install
./vllm-serve.sh
```

**Usage:**
```bash
python api_client.py
```

**What it demonstrates:**
- Listing available models
- Simple text completion
- Chat completion
- Streaming responses
- HTTP API interaction

### 3. Batch Processing (`batch_processing.py`)

Efficient processing of large batches of prompts.

**Usage:**
```bash
python batch_processing.py
```

**What it demonstrates:**
- High-throughput batch inference
- Dynamic batching
- Memory-efficient processing
- Performance monitoring

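`batch_processing.py` itself is not reproduced in this README. As a rough orientation, the core pattern it builds on looks like the sketch below, reusing the same small Qwen model as the other examples (the prompt text, batch size, and throughput printout are illustrative placeholders, not the script's actual contents):

```python
# Minimal sketch of batched generation with throughput measurement
import time
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", gpu_memory_utilization=0.9)
sampling_params = SamplingParams(temperature=0.7, max_tokens=64)

# Hand vLLM the whole batch at once; it schedules and batches requests internally.
prompts = [f"Summarize the number {i} in one sentence." for i in range(256)]

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

generated_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{len(prompts)} prompts in {elapsed:.1f}s "
      f"({generated_tokens / elapsed:.0f} generated tokens/s)")
```
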
## Customization

### Change Model

Edit the model name in any example:

```python
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # Change this
    trust_remote_code=True,
    gpu_memory_utilization=0.9
)
```

### Adjust Sampling Parameters

Modify `SamplingParams` for different generation behavior:

```python
sampling_params = SamplingParams(
    temperature=0.7,         # Lower = more deterministic (0.0-1.0)
    top_p=0.95,              # Nucleus sampling threshold
    max_tokens=100,          # Maximum tokens to generate
    top_k=50,                # Top-k sampling
    repetition_penalty=1.1   # Penalize repetition
)
```

### GPU Memory Management

Adjust memory utilization:

```python
llm = LLM(
    model="...",
    gpu_memory_utilization=0.9,  # Use 90% of GPU memory (0.0-1.0)
    max_model_len=2048           # Maximum sequence length
)
```

## API Server Examples

### cURL Examples

**List models:**
```bash
curl http://localhost:8000/v1/models
```

**Simple completion:**
```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "prompt": "The meaning of life is",
    "max_tokens": 50,
    "temperature": 0.7
  }'
```

**Chat completion:**
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is DGX Spark?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'
```

**Streaming completion:**
```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "prompt": "Write a story about",
    "max_tokens": 100,
    "stream": true
  }'
```

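### Python Client (OpenAI SDK)

Because the server speaks the OpenAI API, the official `openai` Python package can be used instead of raw HTTP or cURL. A minimal sketch (assumes `pip install openai`; the API key is a placeholder, since the server only checks one if it was started with an API key):

```python
from openai import OpenAI

# Point the client at the local vLLM server instead of api.openai.com
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is DGX Spark?"},
    ],
    max_tokens=100,
    temperature=0.7,
)
print(response.choices[0].message.content)
```
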
## Tested Models

These models work well on DGX Spark GB10:

- `Qwen/Qwen2.5-0.5B-Instruct` (small, fast)
- `Qwen/Qwen2.5-7B-Instruct` (balanced)
- `meta-llama/Llama-3.1-8B-Instruct` (high quality)
- `meta-llama/Llama-3.1-70B-Instruct` (requires tensor parallelism; see the serving sketch below)

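For the 70B model, the README only shows `tensor_parallel_size` through the Python API (Performance Tips below). A hedged sketch of the equivalent server-side setup, assuming the `vllm serve` CLI entrypoint is available in your install; the flag values are illustrative and not taken from `vllm-serve.sh`:

```bash
# Illustrative only: serve the 70B model split across 2 GPUs
# (adapt the flags to your vllm-serve.sh / cluster setup)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 4096
```
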
## Performance Tips

1. **Use GPU memory efficiently:**
   - Set `gpu_memory_utilization=0.95` for maximum throughput
   - Lower it for models close to the GPU memory limit

2. **Batch processing:**
   - Process multiple prompts together
   - vLLM automatically optimizes batch sizes

3. **Quantization:**
   - For larger models, use quantization:
   ```python
   llm = LLM(model="...", quantization="awq")
   ```

4. **Tensor parallelism:**
   - For models > 20GB, use multiple GPUs:
   ```python
   llm = LLM(model="...", tensor_parallel_size=2)
   ```

## Troubleshooting

### Out of Memory

Reduce `max_model_len` or `gpu_memory_utilization`:

```python
llm = LLM(
    model="...",
    gpu_memory_utilization=0.8,
    max_model_len=2048
)
```

### Slow Generation

Check that vLLM is installed and that the GPU is actually being used:

```bash
python -c "import vllm; print(vllm.__version__)"
nvidia-smi  # Check GPU utilization
```

### Connection Refused (API)

Ensure the server is running:

```bash
cd ~/vllm-install
./vllm-status.sh
```

## More Resources

- [vLLM Documentation](https://docs.vllm.ai/)
- [OpenAI API Compatibility](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html)
- [Main README](../README.md)
- [Cluster Setup](../CLUSTER.md)

examples/api_client.py (new file, 160 lines)

#!/usr/bin/env python3
"""
vLLM OpenAI-Compatible API Client Example
Demonstrates using vLLM's OpenAI-compatible API endpoints
"""

import json
from typing import Dict, List, Optional

import requests


class VLLMClient:
    """Simple client for vLLM's OpenAI-compatible API"""

    def __init__(self, base_url: str = "http://localhost:8000"):
        self.base_url = base_url.rstrip('/')

    def list_models(self) -> Dict:
        """Return the raw /v1/models response (models are listed under the 'data' key)"""
        response = requests.get(f"{self.base_url}/v1/models")
        response.raise_for_status()
        return response.json()

    def complete(
        self,
        prompt: str,
        model: Optional[str] = None,
        max_tokens: int = 100,
        temperature: float = 0.7,
        stream: bool = False
    ) -> Dict:
        """Generate a completion (returns an iterator of SSE lines when stream=True)"""

        # Use the first served model if none was specified
        if model is None:
            models = self.list_models()
            model = models['data'][0]['id']

        payload = {
            "model": model,
            "prompt": prompt,
            "max_tokens": max_tokens,
            "temperature": temperature,
            "stream": stream
        }

        response = requests.post(
            f"{self.base_url}/v1/completions",
            json=payload,
            headers={"Content-Type": "application/json"},
            stream=stream
        )
        response.raise_for_status()

        if stream:
            return response.iter_lines()
        else:
            return response.json()

    def chat(
        self,
        messages: List[Dict[str, str]],
        model: Optional[str] = None,
        max_tokens: int = 100,
        temperature: float = 0.7,
        stream: bool = False
    ) -> Dict:
        """Generate a chat completion (returns an iterator of SSE lines when stream=True)"""

        # Use the first served model if none was specified
        if model is None:
            models = self.list_models()
            model = models['data'][0]['id']

        payload = {
            "model": model,
            "messages": messages,
            "max_tokens": max_tokens,
            "temperature": temperature,
            "stream": stream
        }

        response = requests.post(
            f"{self.base_url}/v1/chat/completions",
            json=payload,
            headers={"Content-Type": "application/json"},
            stream=stream
        )
        response.raise_for_status()

        if stream:
            return response.iter_lines()
        else:
            return response.json()


def main():
    # Initialize client
    client = VLLMClient("http://localhost:8000")

    print("=" * 60)
    print("vLLM API Client Examples")
    print("=" * 60)

    # Example 1: List models
    print("\n1. Listing available models...")
    models = client.list_models()
    for model in models['data']:
        print(f"  - {model['id']}")

    # Example 2: Simple completion
    print("\n2. Simple completion...")
    result = client.complete(
        prompt="The capital of France is",
        max_tokens=10,
        temperature=0.0
    )
    print("  Prompt: The capital of France is")
    print(f"  Response: {result['choices'][0]['text']}")

    # Example 3: Chat completion
    print("\n3. Chat completion...")
    messages = [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "What is the Blackwell GPU architecture?"}
    ]
    result = client.chat(
        messages=messages,
        max_tokens=100,
        temperature=0.7
    )
    print(f"  User: {messages[1]['content']}")
    print(f"  Assistant: {result['choices'][0]['message']['content']}")

    # Example 4: Streaming completion
    print("\n4. Streaming completion...")
    print("  Prompt: Write a short poem about AI")
    print("  Response: ", end="", flush=True)

    stream = client.complete(
        prompt="Write a short poem about AI",
        max_tokens=50,
        temperature=0.8,
        stream=True
    )

    # The streaming endpoint sends Server-Sent Events: each chunk arrives as a
    # "data: {...}" line; the final "data: [DONE]" sentinel fails JSON parsing
    # and is silently skipped below.
    for line in stream:
        if line:
            try:
                data = json.loads(line.decode('utf-8').removeprefix('data: '))
                if 'choices' in data and len(data['choices']) > 0:
                    token = data['choices'][0].get('text', '')
                    print(token, end="", flush=True)
            except (json.JSONDecodeError, AttributeError):
                pass

    print("\n")
    print("=" * 60)


if __name__ == "__main__":
    main()

examples/basic_inference.py (new file, 48 lines)

#!/usr/bin/env python3
"""
Basic vLLM Inference Example for DGX Spark
Demonstrates simple text generation using the vLLM Python API
"""

from vllm import LLM, SamplingParams


def main():
    # Initialize the model
    # Use a smaller model for testing, replace with your preferred model
    print("Loading model...")
    llm = LLM(
        model="Qwen/Qwen2.5-0.5B-Instruct",
        trust_remote_code=True,
        gpu_memory_utilization=0.9,
        max_model_len=2048
    )

    # Define prompts
    prompts = [
        "What is the NVIDIA DGX Spark?",
        "Explain the Blackwell GPU architecture in simple terms.",
        "Write a haiku about artificial intelligence."
    ]

    # Configure sampling parameters
    sampling_params = SamplingParams(
        temperature=0.7,
        top_p=0.95,
        max_tokens=100,
        stop=["</s>", "\n\n\n"]
    )

    # Generate responses
    print("\nGenerating responses...\n")
    outputs = llm.generate(prompts, sampling_params)

    # Print results
    for i, output in enumerate(outputs):
        print(f"{'=' * 60}")
        print(f"Prompt {i+1}: {prompts[i]}")
        print(f"{'-' * 60}")
        print(f"Response: {output.outputs[0].text}")
        print(f"{'=' * 60}\n")


if __name__ == "__main__":
    main()