# vLLM Cluster Mode Setup for DGX Spark

This guide covers setting up multi-node vLLM deployment on DGX Spark systems using distributed inference.

## Prerequisites

- Multiple DGX Spark systems with vLLM installed (use `install.sh` on each node)
- All nodes on the same network with direct connectivity
- SSH access between nodes (passwordless SSH recommended)
- Same CUDA and vLLM versions across all nodes

## Architecture

```
            ┌─────────────────────┐
            │     spark-alpha     │
            │    (Master/Head)    │
            │  - API Server       │
            │  - Request Router   │
            │  - Model Weights    │
            └──────────┬──────────┘
                       │
                       ├──────────────────────────┐
                       │                          │
            ┌──────────▼──────────┐    ┌──────────▼──────────┐
            │     spark-omega     │    │     spark-gamma     │
            │     (Worker 1)      │    │     (Worker 2)      │
            │  - Inference        │    │  - Inference        │
            │  - GPU Compute      │    │  - GPU Compute      │
            └─────────────────────┘    └─────────────────────┘
```

## Step 1: Install vLLM on All Nodes

Run the installer on each node:

```bash
# On spark-alpha (master)
curl -fsSL https://raw.githubusercontent.com/eelbaz/dgx-spark-vllm-setup/main/install.sh | bash

# On spark-omega (worker 1)
ssh spark-omega.local
curl -fsSL https://raw.githubusercontent.com/eelbaz/dgx-spark-vllm-setup/main/install.sh | bash

# On spark-gamma (worker 2)
ssh spark-gamma.local
curl -fsSL https://raw.githubusercontent.com/eelbaz/dgx-spark-vllm-setup/main/install.sh | bash
```

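After installation, it is worth confirming that every node ended up with matching vLLM and CUDA versions (a prerequisite above). A minimal sketch, assuming the `vllm_env.sh` environment script that `install.sh` creates:

```bash
# Compare vLLM and CUDA versions across all nodes; they should match
for node in spark-alpha.local spark-omega.local spark-gamma.local; do
  echo "=== $node ==="
  ssh "$node" "source ~/vllm-install/vllm_env.sh && \
    python -c 'import vllm, torch; print(vllm.__version__, torch.version.cuda)'"
done
```
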
## Step 2: Configure Network Settings

Ensure all nodes can communicate on the required ports:

- **8000**: vLLM API server (master only)
- **29500**: PyTorch distributed backend (all nodes)
- **6379**: Ray GCS (head node)
- **8265**: Ray dashboard (head node)
- **Ephemeral ports**: Ray worker communication

Open the firewall if needed:

```bash
# On all nodes
sudo ufw allow 8000/tcp
sudo ufw allow 29500/tcp
sudo ufw allow 6379/tcp  # Ray GCS
sudo ufw allow 8265/tcp  # Ray Dashboard
```

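Before starting Ray, you can confirm the ports are actually reachable from each worker. A quick sketch, assuming `nc` (netcat) is installed:

```bash
# Run from each worker; every port should report a successful connection
for port in 6379 8265 29500; do
  nc -zv spark-alpha.local "$port"
done
```
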
## Step 3: Set Up Passwordless SSH (Optional but Recommended)

```bash
# On master node
ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa -N ""

# Copy to worker nodes
ssh-copy-id spark-omega.local
ssh-copy-id spark-gamma.local

# Verify
ssh spark-omega.local "echo 'Connection successful'"
ssh spark-gamma.local "echo 'Connection successful'"
```

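To verify every node in one pass without risking an interactive password prompt, a small convenience loop (not required by the setup) can help:

```bash
# BatchMode makes ssh fail fast instead of prompting for a password
for node in spark-omega.local spark-gamma.local; do
  ssh -o BatchMode=yes "$node" true && echo "$node: OK" || echo "$node: FAILED"
done
```
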
## Step 4: Start Ray Cluster

### On Master Node (spark-alpha)

```bash
# Assuming vllm-install is in your home directory
source ~/vllm-install/vllm_env.sh

# Start Ray head node
ray start --head \
  --port=6379 \
  --dashboard-host=0.0.0.0 \
  --dashboard-port=8265 \
  --num-gpus=1

# Note the output: "To connect to this Ray cluster, use: ray start --address='MASTER_IP:6379'"
```

### On Worker Nodes (spark-omega, spark-gamma)

```bash
source ~/vllm-install/vllm_env.sh

# Replace MASTER_IP with spark-alpha's IP address
ray start --address='MASTER_IP:6379' --num-gpus=1
```

Verify cluster status:

```bash
ray status
```

You should see all nodes listed.

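The same check can be done programmatically through Ray's Python API. A minimal sketch, run from any node that has joined the cluster:

```bash
python - <<'EOF'
import ray

# Attach to the already-running cluster rather than starting a new one
ray.init(address="auto")
for node in ray.nodes():
    print(node["NodeManagerHostname"],
          "alive:", node["Alive"],
          "GPUs:", node["Resources"].get("GPU", 0))
EOF
```
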
## Step 5: Start vLLM with Tensor Parallelism

### Method 1: Tensor Parallelism (Recommended for Large Models)

Tensor parallelism splits each layer's weights across multiple GPUs, so every GPU holds a shard of the model and takes part in every forward pass.

```bash
# On master node
source ~/vllm-install/vllm_env.sh

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000
```

With the Ray cluster running, vLLM automatically distributes the model across 2 GPUs, even when they sit on different nodes.

### Method 2: Pipeline Parallelism

Pipeline parallelism splits the model into sequential stages of layers, with each GPU executing one stage.

```bash
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --pipeline-parallel-size 2 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000
```

### Method 3: Combined Parallelism

For very large models, combine tensor and pipeline parallelism. The product of the two sizes is the total GPU count required (4 × 2 = 8 GPUs here, i.e. eight single-GPU nodes in this topology):

```bash
vllm serve meta-llama/Llama-3.1-405B-Instruct \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 2 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000
```

## Step 6: Test Cluster Inference

```bash
# Test from master node
curl http://localhost:8000/v1/models

# Test from external machine
curl http://spark-alpha.local:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "prompt": "Explain distributed inference in 3 sentences.",
    "max_tokens": 100,
    "temperature": 0.7
  }'
```

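vLLM's OpenAI-compatible server also exposes the chat endpoint, which exercises the model's chat template path:

```bash
# Same server, chat-style request
curl http://spark-alpha.local:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "What is tensor parallelism?"}],
    "max_tokens": 100
  }'
```
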
## Step 7: Monitor Cluster

### Ray Dashboard

Access at: http://spark-alpha.local:8265

Shows:

- Node status and resources
- Task execution
- GPU utilization
- Memory usage

### vLLM Metrics

```bash
# On master node
tail -f ~/vllm-install/vllm-server.log

# Check GPU usage across the cluster (ray exec needs a cluster config
# file, so plain ssh is simpler here)
for node in spark-alpha.local spark-omega.local spark-gamma.local; do
  echo "=== $node ==="
  ssh "$node" nvidia-smi
done
```

### System Monitoring

```bash
# Check Ray cluster status
ray status

# Monitor GPU usage on a specific node
ssh spark-omega.local nvidia-smi -l 1
```

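The vLLM server additionally publishes Prometheus-format metrics (request throughput, queue depth, KV-cache usage) on its API port:

```bash
# Scrape once to inspect; point Prometheus here for continuous collection
curl http://spark-alpha.local:8000/metrics
```
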
## Troubleshooting

### Workers Not Connecting

**Problem**: Workers can't connect to the Ray head node

**Solutions**:

1. Check firewall: `sudo ufw status`
2. Verify head node IP: `ray status` on master
3. Check network connectivity: `ping spark-alpha.local`
4. Ensure same Ray version on all nodes: `ray --version`

### OOM Errors with Large Models

**Problem**: Out of memory when loading large models

**Solutions** (combined in the sketch after this list):

1. Increase tensor parallelism: `--tensor-parallel-size 4`
2. Reduce memory utilization: `--gpu-memory-utilization 0.8`
3. Enable CPU offloading: `--cpu-offload-gb 8`
4. Use quantization: `--quantization awq` or `--quantization gptq`

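For example, a memory-conscious launch might combine several of these options (quantization additionally requires a checkpoint quantized in that format):

```bash
# Shard across 2 GPUs, leave GPU headroom, spill 8 GB of weights to CPU RAM
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.8 \
  --cpu-offload-gb 8
```
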
### Model Loading Hangs

**Problem**: Model download/loading takes forever

**Solutions**:

1. Pre-download the model on all nodes:

   ```bash
   # On each node; fetches to the Hugging Face cache without
   # loading the weights into memory
   huggingface-cli download meta-llama/Llama-3.1-70B-Instruct
   ```

2. Use shared storage (NFS) for model cache
3. Check network bandwidth between nodes (see the iperf3 sketch below)

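A quick way to measure node-to-node bandwidth is iperf3, assuming it is installed on both ends:

```bash
# On the head node, start a throughput server
iperf3 -s

# On a worker, measure throughput to the head node
iperf3 -c spark-alpha.local
```
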
### Uneven GPU Utilization

**Problem**: Some GPUs idle while others maxed out

**Solutions**:

1. Verify tensor parallel configuration
2. Check Ray resource allocation: `ray status`
3. Ensure balanced request distribution
4. Monitor each node with `nvidia-smi` (e.g., via the ssh loop in Step 7)

## Advanced Configuration

### Custom Ray Resources

Assign custom resources to nodes for fine-grained control:

```bash
# On worker with high memory
ray start --address='MASTER_IP:6379' \
  --num-gpus=1 \
  --resources='{"highmem": 1}'
```

vLLM places its own workers through Ray automatically; the custom resource shows up in `ray status` and can be targeted by your own Ray tasks and actors, as sketched below.

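A minimal sketch of pinning a Ray task to the labeled node (illustrative only; the `highmem` name matches the `ray start` flag above):

```bash
python - <<'EOF'
import ray

ray.init(address="auto")

# Only schedulable on nodes started with --resources='{"highmem": 1}'
@ray.remote(resources={"highmem": 1})
def where_am_i():
    import socket
    return socket.gethostname()

print("highmem task ran on:", ray.get(where_am_i.remote()))
EOF
```
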
### Distributed Model Cache

Share model weights via NFS to avoid redundant downloads:

```bash
# On NFS server (e.g., master)
sudo apt install nfs-kernel-server
echo "$HOME/.cache/huggingface *(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports
sudo exportfs -a

# On workers
sudo apt install nfs-common
mkdir -p $HOME/.cache/huggingface
sudo mount spark-alpha.local:$HOME/.cache/huggingface $HOME/.cache/huggingface
```

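To confirm the mount took effect and the workers see the shared cache (the `hub` subdirectory is where huggingface_hub stores downloaded repos):

```bash
# On a worker: the filesystem should show the NFS export, and
# previously downloaded models should be listed
df -h "$HOME/.cache/huggingface"
ls "$HOME/.cache/huggingface/hub" | head
```
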
### Load Balancing with nginx

For production deployments, use nginx to load balance across multiple vLLM instances. Note this assumes each node runs its own vLLM server on port 8000 (data parallelism), rather than the single distributed instance from Step 5:

```nginx
upstream vllm_cluster {
    least_conn;
    server spark-alpha.local:8000;
    server spark-omega.local:8000;
    server spark-gamma.local:8000;
}

server {
    listen 80;
    location / {
        proxy_pass http://vllm_cluster;
        proxy_set_header Host $host;
    }
}
```

## Cluster Management Scripts

### Start Cluster

Create `start-cluster.sh`:

```bash
#!/bin/bash
# Start Ray cluster on all nodes

ssh spark-alpha.local "source ~/vllm-install/vllm_env.sh && ray start --head --port=6379"
sleep 5

MASTER_IP=$(ssh spark-alpha.local "hostname -I | awk '{print \$1}'")

ssh spark-omega.local "source ~/vllm-install/vllm_env.sh && ray start --address='${MASTER_IP}:6379'"
ssh spark-gamma.local "source ~/vllm-install/vllm_env.sh && ray start --address='${MASTER_IP}:6379'"

echo "Cluster started. Check status with: ray status"
```

### Stop Cluster

Create `stop-cluster.sh`:

```bash
#!/bin/bash
# Stop Ray cluster on all nodes

for node in spark-alpha.local spark-omega.local spark-gamma.local; do
  echo "Stopping Ray on $node..."
  ssh $node "ray stop --force"
done

echo "Cluster stopped."
```

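Mark both scripts executable, then drive the cluster from the master:

```bash
chmod +x start-cluster.sh stop-cluster.sh
./start-cluster.sh   # bring the Ray cluster up
./stop-cluster.sh    # tear it down
```
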
## Performance Tuning

### For Maximum Throughput

```bash
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --max-num-seqs 256 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.95
```

### For Low Latency

```bash
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --max-num-seqs 32 \
  --disable-log-requests
```

## References

- [vLLM Distributed Inference](https://docs.vllm.ai/en/latest/serving/distributed_serving.html)
- [Ray Cluster Setup](https://docs.ray.io/en/latest/cluster/getting-started.html)
- [PyTorch Distributed](https://pytorch.org/tutorials/beginner/dist_overview.html)

## Support

For issues specific to DGX Spark cluster setup, please open an issue on GitHub.