first commit

commit c05cb71816
2026-03-22 17:26:26 -04:00
15 changed files with 2644 additions and 0 deletions

examples/README.md (new file, 225 lines)

@@ -0,0 +1,225 @@
# vLLM Examples for DGX Spark
This directory contains example scripts demonstrating various ways to use vLLM on DGX Spark systems.
## Prerequisites
Ensure vLLM is installed and the environment is activated:
```bash
# Assuming vllm-install is in your home directory
source ~/vllm-install/vllm_env.sh
```
## Examples
### 1. Basic Inference (`basic_inference.py`)
Simple text generation using the vLLM Python API.
**Usage:**
```bash
python basic_inference.py
```
**What it demonstrates:**
- Loading a model with vLLM
- Configuring sampling parameters
- Generating multiple completions
- Batch processing
### 2. API Client (`api_client.py`)
Using vLLM's OpenAI-compatible REST API.
**Prerequisites:**
Start the vLLM server first:
```bash
cd ~/vllm-install
./vllm-serve.sh
```
**Usage:**
```bash
python api_client.py
```
**What it demonstrates:**
- Listing available models
- Simple text completion
- Chat completion
- Streaming responses
- HTTP API interaction
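
Because the server implements the OpenAI protocol, the official `openai` Python package also works against it. A minimal sketch, assuming `pip install openai` and the default server address used above (the `api_key` value is a placeholder; vLLM does not check it unless the server was started with an API key):
```python
from openai import OpenAI

# Point the client at the local vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[{"role": "user", "content": "What is DGX Spark?"}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```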
### 3. Batch Processing (`batch_processing.py`)
Efficient processing of large batches of prompts.
**Usage:**
```bash
python batch_processing.py
```
**What it demonstrates:**
- High-throughput batch inference
- Dynamic batching
- Memory-efficient processing
- Performance monitoring
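
The full script covers the points above; as a minimal sketch of the core idea (the model name and batch size here are illustrative), vLLM accepts the whole prompt list in a single `generate()` call and batches internally:
```python
import time

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", gpu_memory_utilization=0.9)
sampling_params = SamplingParams(temperature=0.7, max_tokens=64)

# Pass the whole batch at once; vLLM schedules and batches internally
prompts = [f"Write one sentence about topic #{i}:" for i in range(100)]

start = time.time()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.time() - start

total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{len(prompts)} prompts in {elapsed:.1f}s "
      f"({total_tokens / elapsed:.0f} tokens/s)")
```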
## Customization
### Change Model
Edit the model name in any example:
```python
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # Change this
    trust_remote_code=True,
    gpu_memory_utilization=0.9
)
```
### Adjust Sampling Parameters
Modify `SamplingParams` for different generation behavior:
```python
sampling_params = SamplingParams(
    temperature=0.7,         # Lower = more deterministic; 0.0 = greedy
    top_p=0.95,              # Nucleus sampling threshold
    max_tokens=100,          # Maximum tokens to generate
    top_k=50,                # Top-k sampling
    repetition_penalty=1.1   # Penalize repetition
)
```
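For reproducible output, set `temperature=0.0`, which makes decoding greedy (the most likely token is always chosen), for example:
```python
greedy_params = SamplingParams(temperature=0.0, max_tokens=100)
```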
### GPU Memory Management
Adjust memory utilization:
```python
llm = LLM(
    model="...",
    gpu_memory_utilization=0.9,  # Use 90% of GPU memory (0.0-1.0)
    max_model_len=2048           # Maximum sequence length
)
```
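To pick a safe value, you can check free GPU memory first. A small sketch using PyTorch (installed as a vLLM dependency); `torch.cuda.mem_get_info()` returns free and total bytes, and the suggested fraction below is only a rough heuristic:
```python
import torch

free, total = torch.cuda.mem_get_info()
print(f"Free: {free / 1e9:.1f} GB of {total / 1e9:.1f} GB")

# Rough heuristic: don't claim more than the currently free fraction
suggested = min(0.95, round(free / total, 2))
print(f"Suggested gpu_memory_utilization: {suggested}")
```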
## API Server Examples
### cURL Examples
**List models:**
```bash
curl http://localhost:8000/v1/models
```
**Simple completion:**
```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "prompt": "The meaning of life is",
    "max_tokens": 50,
    "temperature": 0.7
  }'
```
**Chat completion:**
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is DGX Spark?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'
```
**Streaming completion:**
```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "prompt": "Write a story about",
    "max_tokens": 100,
    "stream": true
  }'
```
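Each streamed line is a server-sent event of the form `data: {...}`, and the stream ends with `data: [DONE]`. A minimal Python consumer mirroring the cURL request above (same endpoint and payload; requires Python 3.9+ for `removeprefix`):
```python
import json

import requests

payload = {
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "prompt": "Write a story about",
    "max_tokens": 100,
    "stream": True,
}
with requests.post("http://localhost:8000/v1/completions",
                   json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = line.decode("utf-8").removeprefix("data: ")
        if chunk == "[DONE]":
            break
        print(json.loads(chunk)["choices"][0]["text"], end="", flush=True)
```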
## Tested Models
These models work well on DGX Spark GB10:
- `Qwen/Qwen2.5-0.5B-Instruct` (small, fast)
- `Qwen/Qwen2.5-7B-Instruct` (balanced)
- `meta-llama/Llama-3.1-8B-Instruct` (high quality)
- `meta-llama/Llama-3.1-70B-Instruct` (requires tensor parallelism)
## Performance Tips
1. **Use GPU memory efficiently:**
   - Set `gpu_memory_utilization=0.95` for maximum throughput
   - Lower it for models close to the GPU memory limit
2. **Batch processing:**
   - Process multiple prompts together
   - vLLM automatically optimizes batch sizes
3. **Quantization:**
   - For larger models, use quantization:
     ```python
     llm = LLM(model="...", quantization="awq")
     ```
4. **Tensor parallelism:**
   - For models > 20 GB, use multiple GPUs:
     ```python
     llm = LLM(model="...", tensor_parallel_size=2)
     ```
## Troubleshooting
### Out of Memory
Reduce `max_model_len` or `gpu_memory_utilization`:
```python
llm = LLM(
    model="...",
    gpu_memory_utilization=0.8,
    max_model_len=2048
)
```
### Slow Generation
Verify that vLLM is installed correctly and check GPU utilization:
```bash
python -c "import vllm; print(vllm.__version__)"
nvidia-smi  # Check GPU utilization
```
### Connection Refused (API)
Ensure the server is running:
```bash
cd ~/vllm-install
./vllm-status.sh
```
## More Resources
- [vLLM Documentation](https://docs.vllm.ai/)
- [OpenAI API Compatibility](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html)
- [Main README](../README.md)
- [Cluster Setup](../CLUSTER.md)

examples/api_client.py (new file, 160 lines)

@@ -0,0 +1,160 @@
#!/usr/bin/env python3
"""
vLLM OpenAI-Compatible API Client Example

Demonstrates using vLLM's OpenAI-compatible API endpoints.
"""
import json
from typing import Dict, List, Optional

import requests
class VLLMClient:
    """Simple client for the vLLM OpenAI-compatible API."""

    def __init__(self, base_url: str = "http://localhost:8000"):
        self.base_url = base_url.rstrip('/')

    def list_models(self) -> Dict:
        """List available models."""
        response = requests.get(f"{self.base_url}/v1/models")
        response.raise_for_status()
        return response.json()

    def complete(
        self,
        prompt: str,
        model: Optional[str] = None,
        max_tokens: int = 100,
        temperature: float = 0.7,
        stream: bool = False
    ):
        """Generate a completion.

        Returns parsed JSON, or an iterator over raw SSE lines if stream=True.
        """
        # Fall back to the first served model if none is specified
        if model is None:
            models = self.list_models()
            model = models['data'][0]['id']
        payload = {
            "model": model,
            "prompt": prompt,
            "max_tokens": max_tokens,
            "temperature": temperature,
            "stream": stream
        }
        response = requests.post(
            f"{self.base_url}/v1/completions",
            json=payload,
            headers={"Content-Type": "application/json"},
            stream=stream
        )
        response.raise_for_status()
        if stream:
            return response.iter_lines()
        return response.json()

    def chat(
        self,
        messages: List[Dict[str, str]],
        model: Optional[str] = None,
        max_tokens: int = 100,
        temperature: float = 0.7,
        stream: bool = False
    ):
        """Generate a chat completion.

        Returns parsed JSON, or an iterator over raw SSE lines if stream=True.
        """
        # Fall back to the first served model if none is specified
        if model is None:
            models = self.list_models()
            model = models['data'][0]['id']
        payload = {
            "model": model,
            "messages": messages,
            "max_tokens": max_tokens,
            "temperature": temperature,
            "stream": stream
        }
        response = requests.post(
            f"{self.base_url}/v1/chat/completions",
            json=payload,
            headers={"Content-Type": "application/json"},
            stream=stream
        )
        response.raise_for_status()
        if stream:
            return response.iter_lines()
        return response.json()
def main():
    # Initialize client
    client = VLLMClient("http://localhost:8000")

    print("=" * 60)
    print("vLLM API Client Examples")
    print("=" * 60)

    # Example 1: List models
    print("\n1. Listing available models...")
    models = client.list_models()
    for model in models['data']:
        print(f"   - {model['id']}")

    # Example 2: Simple completion
    print("\n2. Simple completion...")
    result = client.complete(
        prompt="The capital of France is",
        max_tokens=10,
        temperature=0.0
    )
    print("   Prompt: The capital of France is")
    print(f"   Response: {result['choices'][0]['text']}")

    # Example 3: Chat completion
    print("\n3. Chat completion...")
    messages = [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "What is the Blackwell GPU architecture?"}
    ]
    result = client.chat(
        messages=messages,
        max_tokens=100,
        temperature=0.7
    )
    print(f"   User: {messages[1]['content']}")
    print(f"   Assistant: {result['choices'][0]['message']['content']}")

    # Example 4: Streaming completion
    print("\n4. Streaming completion...")
    print("   Prompt: Write a short poem about AI")
    print("   Response: ", end="", flush=True)
    stream = client.complete(
        prompt="Write a short poem about AI",
        max_tokens=50,
        temperature=0.8,
        stream=True
    )
    for line in stream:
        if not line:
            continue
        # SSE lines look like "data: {...}"; the stream ends with "data: [DONE]"
        chunk = line.decode('utf-8').removeprefix('data: ')
        if chunk == '[DONE]':
            break
        try:
            data = json.loads(chunk)
            if data.get('choices'):
                print(data['choices'][0].get('text', ''), end="", flush=True)
        except json.JSONDecodeError:
            pass
    print("\n")
    print("=" * 60)


if __name__ == "__main__":
    main()

examples/basic_inference.py (new file, 48 lines)

@@ -0,0 +1,48 @@
#!/usr/bin/env python3
"""
Basic vLLM Inference Example for DGX Spark

Demonstrates simple text generation using the vLLM Python API.
"""
from vllm import LLM, SamplingParams


def main():
    # Initialize the model.
    # A small model is used for testing; replace with your preferred model.
    print("Loading model...")
    llm = LLM(
        model="Qwen/Qwen2.5-0.5B-Instruct",
        trust_remote_code=True,
        gpu_memory_utilization=0.9,
        max_model_len=2048
    )

    # Define prompts
    prompts = [
        "What is the NVIDIA DGX Spark?",
        "Explain the Blackwell GPU architecture in simple terms.",
        "Write a haiku about artificial intelligence."
    ]

    # Configure sampling parameters
    sampling_params = SamplingParams(
        temperature=0.7,
        top_p=0.95,
        max_tokens=100,
        stop=["</s>", "\n\n\n"]
    )

    # Generate responses for the whole batch in one call
    print("\nGenerating responses...\n")
    outputs = llm.generate(prompts, sampling_params)

    # Print results
    for i, output in enumerate(outputs):
        print("=" * 60)
        print(f"Prompt {i + 1}: {prompts[i]}")
        print("-" * 60)
        print(f"Response: {output.outputs[0].text}")
        print("=" * 60 + "\n")


if __name__ == "__main__":
    main()