# vLLM Examples for DGX Spark

This directory contains example scripts demonstrating various ways to use vLLM on DGX Spark systems.

## Prerequisites

Ensure vLLM is installed and the environment is activated:

```bash
# Assuming vllm-install is in your home directory
source ~/vllm-install/vllm_env.sh
```

## Examples

### 1. Basic Inference (`basic_inference.py`)

Simple text generation using the vLLM Python API.

**Usage:**

```bash
python basic_inference.py
```

**What it demonstrates:**

- Loading a model with vLLM
- Configuring sampling parameters
- Generating multiple completions
- Batch processing

### 2. API Client (`api_client.py`)

Using vLLM's OpenAI-compatible REST API.

**Prerequisites:** Start the vLLM server first:

```bash
cd ~/vllm-install
./vllm-serve.sh
```

**Usage:**

```bash
python api_client.py
```

**What it demonstrates:**

- Listing available models
- Simple text completion
- Chat completion
- Streaming responses
- HTTP API interaction

### 3. Batch Processing (`batch_processing.py`)

Efficient processing of large batches of prompts.

**Usage:**

```bash
python batch_processing.py
```

**What it demonstrates:**

- High-throughput batch inference
- Dynamic batching
- Memory-efficient processing
- Performance monitoring
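For reference, the core batch-inference pattern behind these scripts looks roughly like the sketch below. This is a minimal, hypothetical example, not the contents of `batch_processing.py`; the model name and prompts are placeholders:

```python
from vllm import LLM, SamplingParams

# Placeholder prompts; a real workload might read these from a file.
prompts = [f"Summarize topic {i} in one sentence." for i in range(32)]

# Small model from the tested list below; any model that fits in memory works.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", gpu_memory_utilization=0.9)

sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=64)

# A single generate() call over the whole list is the high-throughput path:
# vLLM schedules and batches the prompts internally (continuous batching).
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"{output.prompt!r} -> {output.outputs[0].text.strip()!r}")
```

Because all prompts go through one `generate()` call, throughput scales without any manual batching logic on your side.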
## Customization

### Change Model

Edit the model name in any example:

```python
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # Change this
    trust_remote_code=True,
    gpu_memory_utilization=0.9
)
```

### Adjust Sampling Parameters

Modify `SamplingParams` for different generation behavior:

```python
sampling_params = SamplingParams(
    temperature=0.7,         # Lower = more deterministic (0.0-1.0)
    top_p=0.95,              # Nucleus sampling threshold
    max_tokens=100,          # Maximum tokens to generate
    top_k=50,                # Top-k sampling
    repetition_penalty=1.1   # Penalize repetition
)
```

### GPU Memory Management

Adjust memory utilization:

```python
llm = LLM(
    model="...",
    gpu_memory_utilization=0.9,  # Use 90% of GPU memory (0.0-1.0)
    max_model_len=2048           # Maximum sequence length
)
```

## API Server Examples

### cURL Examples

**List models:**

```bash
curl http://localhost:8000/v1/models
```

**Simple completion:**

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "prompt": "The meaning of life is",
    "max_tokens": 50,
    "temperature": 0.7
  }'
```

**Chat completion:**

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is DGX Spark?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'
```

**Streaming completion:**

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "prompt": "Write a story about",
    "max_tokens": 100,
    "stream": true
  }'
```

## Tested Models

These models work well on DGX Spark GB10:

- `Qwen/Qwen2.5-0.5B-Instruct` (small, fast)
- `Qwen/Qwen2.5-7B-Instruct` (balanced)
- `meta-llama/Llama-3.1-8B-Instruct` (high quality)
- `meta-llama/Llama-3.1-70B-Instruct` (requires tensor parallelism)

## Performance Tips

1. **Use GPU memory efficiently:**
   - Set `gpu_memory_utilization=0.95` for maximum throughput
   - Lower it for models close to the GPU memory limit

2. **Batch processing:**
   - Process multiple prompts together
   - vLLM automatically optimizes batch sizes

3. **Quantization:**
   - For larger models, use quantization:

   ```python
   llm = LLM(model="...", quantization="awq")
   ```

4. **Tensor parallelism:**
   - For models > 20 GB, use multiple GPUs:

   ```python
   llm = LLM(model="...", tensor_parallel_size=2)
   ```

## Troubleshooting

### Out of Memory

Reduce `max_model_len` or `gpu_memory_utilization`:

```python
llm = LLM(
    model="...",
    gpu_memory_utilization=0.8,
    max_model_len=2048
)
```

### Slow Generation

Verify the vLLM installation and check GPU utilization:

```bash
python -c "import vllm; print(vllm.__version__)"
nvidia-smi  # Check GPU utilization
```

### Connection Refused (API)

Ensure the server is running:

```bash
cd ~/vllm-install
./vllm-status.sh
```

## More Resources

- [vLLM Documentation](https://docs.vllm.ai/)
- [OpenAI API Compatibility](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html)
- [Main README](../README.md)
- [Cluster Setup](../CLUSTER.md)