first commit

2026-03-22 17:26:26 -04:00
commit c05cb71816
15 changed files with 2644 additions and 0 deletions
--- a/examples/README.md
+++ b/examples/README.md
@@ -0,0 +1,225 @@
+# vLLM Examples for DGX Spark
+
+This directory contains example scripts demonstrating various ways to use vLLM on DGX Spark systems.
+
+## Prerequisites
+
+Ensure vLLM is installed and the environment is activated:
+
+```bash
+# Assuming vllm-install is in your home directory
+source ~/vllm-install/vllm_env.sh
+```
+
+## Examples
+
+### 1. Basic Inference (`basic_inference.py`)
+
+Simple text generation using the vLLM Python API.
+
+**Usage:**
+```bash
+python basic_inference.py
+```
+
+**What it demonstrates:**
+- Loading a model with vLLM
+- Configuring sampling parameters
+- Generating multiple completions
+- Batch processing
+
+### 2. API Client (`api_client.py`)
+
+Using vLLM's OpenAI-compatible REST API.
+
+**Prerequisites:**
+Start the vLLM server first:
+```bash
+cd ~/vllm-install
+./vllm-serve.sh
+```
+
+**Usage:**
+```bash
+python api_client.py
+```
+
+**What it demonstrates:**
+- Listing available models
+- Simple text completion
+- Chat completion
+- Streaming responses
+- HTTP API interaction
+
+### 3. Batch Processing (`batch_processing.py`)
+
+Efficient processing of large batches of prompts.
+
+**Usage:**
+```bash
+python batch_processing.py
+```
+
+**What it demonstrates:**
+- High-throughput batch inference
+- Dynamic batching
+- Memory-efficient processing
+- Performance monitoring
+
+## Customization
+
+### Change Model
+
+Edit the model name in any example:
+
+```python
+llm = LLM(
+    model="meta-llama/Llama-3.1-8B-Instruct",  # Change this
+    trust_remote_code=True,
+    gpu_memory_utilization=0.9
+)
+```
+
+### Adjust Sampling Parameters
+
+Modify `SamplingParams` for different generation behavior:
+
+```python
+sampling_params = SamplingParams(
+    temperature=0.7,      # Lower = more deterministic (0.0-1.0)
+    top_p=0.95,          # Nucleus sampling threshold
+    max_tokens=100,      # Maximum tokens to generate
+    top_k=50,            # Top-k sampling
+    repetition_penalty=1.1  # Penalize repetition
+)
+```
+
+### GPU Memory Management
+
+Adjust memory utilization:
+
+```python
+llm = LLM(
+    model="...",
+    gpu_memory_utilization=0.9,  # Use 90% of GPU memory (0.0-1.0)
+    max_model_len=2048           # Maximum sequence length
+)
+```
+
+## API Server Examples
+
+### cURL Examples
+
+**List models:**
+```bash
+curl http://localhost:8000/v1/models
+```
+
+**Simple completion:**
+```bash
+curl http://localhost:8000/v1/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "Qwen/Qwen2.5-0.5B-Instruct",
+    "prompt": "The meaning of life is",
+    "max_tokens": 50,
+    "temperature": 0.7
+  }'
+```
+
+**Chat completion:**
+```bash
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "Qwen/Qwen2.5-0.5B-Instruct",
+    "messages": [
+      {"role": "system", "content": "You are a helpful assistant."},
+      {"role": "user", "content": "What is DGX Spark?"}
+    ],
+    "max_tokens": 100,
+    "temperature": 0.7
+  }'
+```
+
+**Streaming completion:**
+```bash
+curl http://localhost:8000/v1/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "Qwen/Qwen2.5-0.5B-Instruct",
+    "prompt": "Write a story about",
+    "max_tokens": 100,
+    "stream": true
+  }'
+```
+
+## Tested Models
+
+These models work well on DGX Spark GB10:
+
+- `Qwen/Qwen2.5-0.5B-Instruct` (small, fast)
+- `Qwen/Qwen2.5-7B-Instruct` (balanced)
+- `meta-llama/Llama-3.1-8B-Instruct` (high quality)
+- `meta-llama/Llama-3.1-70B-Instruct` (requires tensor parallelism)
+
+## Performance Tips
+
+1. **Use GPU memory efficiently:**
+   - Set `gpu_memory_utilization=0.95` for maximum throughput
+   - Lower for models close to GPU memory limit
+
+2. **Batch processing:**
+   - Process multiple prompts together
+   - vLLM automatically optimizes batch sizes
+
+3. **Quantization:**
+   - For larger models, use quantization:
+   ```python
+   llm = LLM(model="...", quantization="awq")
+   ```
+
+4. **Tensor parallelism:**
+   - For models > 20GB, use multiple GPUs:
+   ```python
+   llm = LLM(model="...", tensor_parallel_size=2)
+   ```
+
+## Troubleshooting
+
+### Out of Memory
+
+Reduce `max_model_len` or `gpu_memory_utilization`:
+
+```python
+llm = LLM(
+    model="...",
+    gpu_memory_utilization=0.8,
+    max_model_len=2048
+)
+```
+
+### Slow Generation
+
+Check if model is loaded correctly:
+
+```python
+python -c "import vllm; print(vllm.__version__)"
+nvidia-smi  # Check GPU utilization
+```
+
+### Connection Refused (API)
+
+Ensure server is running:
+
+```bash
+cd ~/vllm-install
+./vllm-status.sh
+```
+
+## More Resources
+
+- [vLLM Documentation](https://docs.vllm.ai/)
+- [OpenAI API Compatibility](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html)
+- [Main README](../README.md)
+- [Cluster Setup](../CLUSTER.md)