# vLLM Examples for DGX Spark

This directory contains example scripts demonstrating various ways to use vLLM on DGX Spark systems.

## Prerequisites

Ensure vLLM is installed and the environment is activated:

```bash
# Assuming vllm-install is in your home directory
source ~/vllm-install/vllm_env.sh
```
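
To confirm the environment is active, you can check that vLLM imports cleanly (the same check appears under Troubleshooting below):

```bash
python -c "import vllm; print(vllm.__version__)"
```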

## Examples

### 1. Basic Inference (`basic_inference.py`)

Simple text generation using the vLLM Python API.

**Usage:**

```bash
python basic_inference.py
```

**What it demonstrates:**

- Loading a model with vLLM
- Configuring sampling parameters
- Generating multiple completions
- Batch processing
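
The core pattern the script follows looks roughly like this; the model name and prompts here are illustrative, not taken from the script itself:

```python
from vllm import LLM, SamplingParams

# Load a small model (illustrative choice; any tested model works).
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")

# Sampling parameters control generation behavior.
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=64)

# Passing a list of prompts gives you batched generation automatically.
prompts = [
    "Explain KV caching in one sentence:",
    "Write a haiku about GPUs:",
]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```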

### 2. API Client (`api_client.py`)

Using vLLM's OpenAI-compatible REST API.

**Prerequisites:**

Start the vLLM server first:

```bash
cd ~/vllm-install
./vllm-serve.sh
```

**Usage:**

```bash
python api_client.py
```

**What it demonstrates:**

- Listing available models
- Simple text completion
- Chat completion
- Streaming responses
- HTTP API interaction
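
Because the server speaks the OpenAI protocol, the official `openai` Python package works against it directly. A minimal sketch, assuming the package is installed and the server is on the default port; the API key is a placeholder, since vLLM only enforces one if the server was started with a key:

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # must match the model the server is serving
    messages=[{"role": "user", "content": "What is DGX Spark?"}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```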

### 3. Batch Processing (`batch_processing.py`)

Efficient processing of large batches of prompts.

**Usage:**

```bash
python batch_processing.py
```

**What it demonstrates:**

- High-throughput batch inference
- Dynamic batching
- Memory-efficient processing
- Performance monitoring
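
A rough sketch of the idea with a simple tokens-per-second measurement; the model name, batch size, and prompts are illustrative:

```python
import time

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
sampling_params = SamplingParams(max_tokens=64)

# Submit the whole batch at once; vLLM's scheduler batches it dynamically.
prompts = [f"Summarize item {i} in one line:" for i in range(256)]

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

# Count generated tokens to estimate throughput.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.1f}s ({generated / elapsed:.0f} tok/s)")
```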

## Customization

### Change Model

Edit the model name in any example:

```python
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # Change this
    trust_remote_code=True,
    gpu_memory_utilization=0.9
)
```

### Adjust Sampling Parameters

Modify `SamplingParams` for different generation behavior:

```python
sampling_params = SamplingParams(
    temperature=0.7,         # Lower = more deterministic (0.0 = greedy)
    top_p=0.95,              # Nucleus sampling threshold
    max_tokens=100,          # Maximum tokens to generate
    top_k=50,                # Top-k sampling
    repetition_penalty=1.1   # Penalize repetition
)
```

### GPU Memory Management

Adjust memory utilization:

```python
llm = LLM(
    model="...",
    gpu_memory_utilization=0.9,  # Use 90% of GPU memory (0.0-1.0)
    max_model_len=2048           # Maximum sequence length
)
```

## API Server Examples

### cURL Examples

**List models:**

```bash
curl http://localhost:8000/v1/models
```

**Simple completion:**

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "prompt": "The meaning of life is",
    "max_tokens": 50,
    "temperature": 0.7
  }'
```

**Chat completion:**

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is DGX Spark?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'
```

**Streaming completion:**

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "prompt": "Write a story about",
    "max_tokens": 100,
    "stream": true
  }'
```
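
The same streaming request, sketched in Python with the `openai` package so you don't have to parse the server-sent events yourself (package availability and placeholder key are assumptions, as above):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# stream=True yields chunks as tokens are generated.
stream = client.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    prompt="Write a story about",
    max_tokens=100,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].text, end="", flush=True)
print()
```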

## Tested Models

These models work well on DGX Spark GB10:

- `Qwen/Qwen2.5-0.5B-Instruct` (small, fast)
- `Qwen/Qwen2.5-7B-Instruct` (balanced)
- `meta-llama/Llama-3.1-8B-Instruct` (high quality)
- `meta-llama/Llama-3.1-70B-Instruct` (requires tensor parallelism)

## Performance Tips

1. **Use GPU memory efficiently:**
   - Set `gpu_memory_utilization=0.95` for maximum throughput
   - Lower it for models close to the GPU memory limit

2. **Batch processing:**
   - Process multiple prompts together
   - vLLM automatically optimizes batch sizes

3. **Quantization:**
   - For larger models, use quantization:

   ```python
   llm = LLM(model="...", quantization="awq")
   ```

4. **Tensor parallelism:**
   - For models > 20 GB, use multiple GPUs:

   ```python
   llm = LLM(model="...", tensor_parallel_size=2)
   ```

## Troubleshooting

### Out of Memory

Reduce `max_model_len` or `gpu_memory_utilization`:

```python
llm = LLM(
    model="...",
    gpu_memory_utilization=0.8,
    max_model_len=2048
)
```

### Slow Generation

Verify that vLLM is installed correctly and that the GPU is being used:

```bash
python -c "import vllm; print(vllm.__version__)"
nvidia-smi  # Check GPU utilization
```

### Connection Refused (API)

Ensure the server is running:

```bash
cd ~/vllm-install
./vllm-status.sh
```

## More Resources

- [vLLM Documentation](https://docs.vllm.ai/)
- [OpenAI API Compatibility](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html)
- [Main README](../README.md)
- [Cluster Setup](../CLUSTER.md)