Files
dgx-spark-vllm-setup/CLUSTER.md
2026-03-22 17:26:26 -04:00

381 lines
9.0 KiB
Markdown

# vLLM Cluster Mode Setup for DGX Spark
This guide covers setting up multi-node vLLM deployment on DGX Spark systems using distributed inference.
## Prerequisites
- Multiple DGX Spark systems with vLLM installed (use `install.sh` on each node)
- All nodes on the same network with direct connectivity
- SSH access between nodes (passwordless SSH recommended)
- Same CUDA and vLLM versions across all nodes
## Architecture
```
┌─────────────────────┐
│ spark-alpha │
│ (Master/Head) │
│ - API Server │
│ - Request Router │
│ - Model Weights │
└──────────┬──────────┘
├─────────────────────┐
│ │
┌──────────▼──────────┐ ┌──────▼──────────┐
│ spark-omega │ │ spark-gamma │
│ (Worker 1) │ │ (Worker 2) │
│ - Inference │ │ - Inference │
│ - GPU Compute │ │ - GPU Compute │
└─────────────────────┘ └─────────────────┘
```
## Step 1: Install vLLM on All Nodes
Run the installer on each node:
```bash
# On spark-alpha (master)
curl -fsSL https://raw.githubusercontent.com/eelbaz/dgx-spark-vllm-setup/main/install.sh | bash
# On spark-omega (worker 1)
ssh spark-omega.local
curl -fsSL https://raw.githubusercontent.com/eelbaz/dgx-spark-vllm-setup/main/install.sh | bash
# On spark-gamma (worker 2)
ssh spark-gamma.local
curl -fsSL https://raw.githubusercontent.com/eelbaz/dgx-spark-vllm-setup/main/install.sh | bash
```
## Step 2: Configure Network Settings
Ensure all nodes can communicate on the required ports:
- **8000**: vLLM API server (master only)
- **29500**: PyTorch distributed backend (all nodes)
- **Random ports**: Ray cluster communication
Open firewall if needed:
```bash
# On all nodes
sudo ufw allow 8000/tcp
sudo ufw allow 29500/tcp
sudo ufw allow 6379/tcp # Ray GCS
sudo ufw allow 8265/tcp # Ray Dashboard
```
## Step 3: Set Up Passwordless SSH (Optional but Recommended)
```bash
# On master node
ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa -N ""
# Copy to worker nodes
ssh-copy-id spark-omega.local
ssh-copy-id spark-gamma.local
# Verify
ssh spark-omega.local "echo 'Connection successful'"
ssh spark-gamma.local "echo 'Connection successful'"
```
## Step 4: Start Ray Cluster
### On Master Node (spark-alpha)
```bash
# Assuming vllm-install is in your home directory
source ~/vllm-install/vllm_env.sh
# Start Ray head node
ray start --head \
--port=6379 \
--dashboard-host=0.0.0.0 \
--dashboard-port=8265 \
--num-gpus=1
# Note the output: "To connect to this Ray cluster, use: ray start --address='MASTER_IP:6379'"
```
### On Worker Nodes (spark-omega, spark-gamma)
```bash
source ~/vllm-install/vllm_env.sh
# Replace MASTER_IP with spark-alpha's IP address
ray start --address='MASTER_IP:6379' --num-gpus=1
```
Verify cluster status:
```bash
ray status
```
You should see all nodes listed.
## Step 5: Start vLLM with Tensor Parallelism
### Method 1: Tensor Parallelism (Recommended for Large Models)
Tensor parallelism splits model layers across multiple GPUs.
```bash
# On master node
source ~/vllm-install/vllm_env.sh
vllm serve \
--model "meta-llama/Llama-3.1-70B-Instruct" \
--tensor-parallel-size 2 \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000
```
This will automatically distribute the model across 2 GPUs in the Ray cluster.
### Method 2: Pipeline Parallelism
Pipeline parallelism splits model stages across GPUs.
```bash
vllm serve \
--model "meta-llama/Llama-3.1-70B-Instruct" \
--pipeline-parallel-size 2 \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000
```
### Method 3: Combined Parallelism
For very large models, combine tensor and pipeline parallelism:
```bash
vllm serve \
--model "meta-llama/Llama-3.1-405B-Instruct" \
--tensor-parallel-size 4 \
--pipeline-parallel-size 2 \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000
```
## Step 6: Test Cluster Inference
```bash
# Test from master node
curl http://localhost:8000/v1/models
# Test from external machine
curl http://spark-alpha.local:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-70B-Instruct",
"prompt": "Explain distributed inference in 3 sentences.",
"max_tokens": 100,
"temperature": 0.7
}'
```
## Step 7: Monitor Cluster
### Ray Dashboard
Access at: http://spark-alpha.local:8265
Shows:
- Node status and resources
- Task execution
- GPU utilization
- Memory usage
### vLLM Metrics
```bash
# On master node
tail -f ~/vllm-install/vllm-server.log
# Check GPU usage across cluster
ray exec 'nvidia-smi'
```
### System Monitoring
```bash
# Check Ray cluster status
ray status
# Monitor GPU usage on specific node
ssh spark-omega.local nvidia-smi -l 1
```
## Troubleshooting
### Workers Not Connecting
**Problem**: Workers can't connect to Ray head node
**Solutions**:
1. Check firewall: `sudo ufw status`
2. Verify head node IP: `ray status` on master
3. Check network connectivity: `ping spark-alpha.local`
4. Ensure same Ray version on all nodes: `ray --version`
### OOM Errors with Large Models
**Problem**: Out of memory when loading large models
**Solutions**:
1. Increase tensor parallelism: `--tensor-parallel-size 4`
2. Reduce memory utilization: `--gpu-memory-utilization 0.8`
3. Enable CPU offloading: `--cpu-offload-gb 8`
4. Use quantization: `--quantization awq` or `--quantization gptq`
### Model Loading Hangs
**Problem**: Model download/loading takes forever
**Solutions**:
1. Pre-download model on all nodes:
```bash
# On each node
python -c "from transformers import AutoModel; AutoModel.from_pretrained('meta-llama/Llama-3.1-70B-Instruct')"
```
2. Use shared storage (NFS) for model cache
3. Check network bandwidth between nodes
### Uneven GPU Utilization
**Problem**: Some GPUs idle while others maxed out
**Solutions**:
1. Verify tensor parallel configuration
2. Check Ray resource allocation: `ray status`
3. Ensure balanced request distribution
4. Monitor with: `ray exec 'nvidia-smi'`
## Advanced Configuration
### Custom Ray Resources
Assign custom resources to nodes for fine-grained control:
```bash
# On worker with high memory
ray start --address='MASTER_IP:6379' \
--num-gpus=1 \
--resources='{"highmem": 1}'
# Use in vLLM
vllm serve --model "..." --placement-group-resources='{"highmem": 1}'
```
### Distributed Model Cache
Share model weights via NFS to avoid redundant downloads:
```bash
# On NFS server (e.g., master)
sudo apt install nfs-kernel-server
echo "$HOME/.cache/huggingface *(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports
sudo exportfs -a
# On workers
sudo apt install nfs-common
sudo mkdir -p $HOME/.cache/huggingface
sudo mount spark-alpha.local:$HOME/.cache/huggingface $HOME/.cache/huggingface
```
### Load Balancing with nginx
For production deployments, use nginx to load balance across multiple vLLM instances:
```nginx
upstream vllm_cluster {
least_conn;
server spark-alpha.local:8000;
server spark-omega.local:8000;
server spark-gamma.local:8000;
}
server {
listen 80;
location / {
proxy_pass http://vllm_cluster;
proxy_set_header Host $host;
}
}
```
## Cluster Management Scripts
### Start Cluster
Create `start-cluster.sh`:
```bash
#!/bin/bash
# Start Ray cluster on all nodes
ssh spark-alpha.local "source ~/vllm-install/vllm_env.sh && ray start --head --port=6379"
sleep 5
MASTER_IP=$(ssh spark-alpha.local "hostname -I | awk '{print \$1}'")
ssh spark-omega.local "source ~/vllm-install/vllm_env.sh && ray start --address='${MASTER_IP}:6379'"
ssh spark-gamma.local "source ~/vllm-install/vllm_env.sh && ray start --address='${MASTER_IP}:6379'"
echo "Cluster started. Check status with: ray status"
```
### Stop Cluster
Create `stop-cluster.sh`:
```bash
#!/bin/bash
# Stop Ray cluster on all nodes
for node in spark-alpha.local spark-omega.local spark-gamma.local; do
echo "Stopping Ray on $node..."
ssh $node "ray stop --force"
done
echo "Cluster stopped."
```
## Performance Tuning
### For Maximum Throughput
```bash
vllm serve \
--model "meta-llama/Llama-3.1-70B-Instruct" \
--tensor-parallel-size 2 \
--max-num-seqs 256 \
--max-num-batched-tokens 8192 \
--gpu-memory-utilization 0.95
```
### For Low Latency
```bash
vllm serve \
--model "meta-llama/Llama-3.1-70B-Instruct" \
--tensor-parallel-size 2 \
--max-num-seqs 32 \
--disable-log-requests
```
## References
- [vLLM Distributed Inference](https://docs.vllm.ai/en/latest/serving/distributed_serving.html)
- [Ray Cluster Setup](https://docs.ray.io/en/latest/cluster/getting-started.html)
- [PyTorch Distributed](https://pytorch.org/tutorials/beginner/dist_overview.html)
## Support
For issues specific to DGX Spark cluster setup, please open an issue on GitHub.