vLLM Cluster Mode Setup for DGX Spark
This guide covers setting up a multi-node vLLM deployment for distributed inference across DGX Spark systems.
Prerequisites
- Multiple DGX Spark systems with vLLM installed (use install.sh on each node)
- All nodes on the same network with direct connectivity
- SSH access between nodes (passwordless SSH recommended)
- Same CUDA and vLLM versions across all nodes
Architecture
┌─────────────────────┐
│ spark-alpha │
│ (Master/Head) │
│ - API Server │
│ - Request Router │
│ - Model Weights │
└──────────┬──────────┘
│
├─────────────────────┐
│ │
┌──────────▼──────────┐ ┌──────▼──────────┐
│ spark-omega │ │ spark-gamma │
│ (Worker 1) │ │ (Worker 2) │
│ - Inference │ │ - Inference │
│ - GPU Compute │ │ - GPU Compute │
└─────────────────────┘ └─────────────────┘
Step 1: Install vLLM on All Nodes
Run the installer on each node:
# On spark-alpha (master)
curl -fsSL https://raw.githubusercontent.com/eelbaz/dgx-spark-vllm-setup/main/install.sh | bash
# On spark-omega (worker 1)
ssh spark-omega.local
curl -fsSL https://raw.githubusercontent.com/eelbaz/dgx-spark-vllm-setup/main/install.sh | bash
# On spark-gamma (worker 2)
ssh spark-gamma.local
curl -fsSL https://raw.githubusercontent.com/eelbaz/dgx-spark-vllm-setup/main/install.sh | bash
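Before clustering, it is worth confirming the installs match the "same vLLM version" prerequisite. A minimal sketch, assuming the env script path created by the installer and SSH access from the prerequisites:
# Compare vLLM versions across nodes
for node in spark-alpha.local spark-omega.local spark-gamma.local; do
  echo -n "$node: "
  ssh $node "source ~/vllm-install/vllm_env.sh && python -c 'import vllm; print(vllm.__version__)'"
done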
Step 2: Configure Network Settings
Ensure all nodes can communicate on the required ports:
- 8000: vLLM API server (master only)
- 6379: Ray GCS (master only)
- 8265: Ray dashboard (master only)
- 29500: PyTorch distributed backend (all nodes)
- Ephemeral ports: Ray worker and object manager communication (all nodes)
Open firewall if needed:
# On all nodes
sudo ufw allow 8000/tcp
sudo ufw allow 29500/tcp
sudo ufw allow 6379/tcp # Ray GCS
sudo ufw allow 8265/tcp # Ray Dashboard
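A quick reachability check from each worker toward the head node (a sketch; it assumes netcat is installed and that spark-alpha.local resolves to the head node; each check only succeeds once the corresponding service is listening, i.e. Ray from Step 4 and vLLM from Step 5):
# Run on spark-omega and spark-gamma
nc -zv spark-alpha.local 6379    # Ray GCS
nc -zv spark-alpha.local 8000    # vLLM API server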
Step 3: Set Up Passwordless SSH (Optional but Recommended)
# On master node
ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa -N ""
# Copy to worker nodes
ssh-copy-id spark-omega.local
ssh-copy-id spark-gamma.local
# Verify
ssh spark-omega.local "echo 'Connection successful'"
ssh spark-gamma.local "echo 'Connection successful'"
Step 4: Start Ray Cluster
On Master Node (spark-alpha)
# Assuming vllm-install is in your home directory
source ~/vllm-install/vllm_env.sh
# Start Ray head node
ray start --head \
--port=6379 \
--dashboard-host=0.0.0.0 \
--dashboard-port=8265 \
--num-gpus=1
# Note the join command printed in the output, e.g. ray start --address='MASTER_IP:6379'
On Worker Nodes (spark-omega, spark-gamma)
source ~/vllm-install/vllm_env.sh
# Replace MASTER_IP with spark-alpha's IP address
ray start --address='MASTER_IP:6379' --num-gpus=1
Verify cluster status:
ray status
You should see all nodes listed.
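To check the aggregate resources programmatically, a small sketch run from any node that has already joined the cluster (with the vLLM env sourced):
# Should report GPU: 3.0 for the three-node layout above
python -c "import ray; ray.init(address='auto'); print(ray.cluster_resources())"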
Step 5: Start vLLM with Tensor Parallelism
Method 1: Tensor Parallelism (Recommended for Large Models)
Tensor parallelism splits each layer's weights across multiple GPUs, so every GPU participates in every layer.
# On master node
source ~/vllm-install/vllm_env.sh
vllm serve \
--model "meta-llama/Llama-3.1-70B-Instruct" \
--tensor-parallel-size 2 \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000
This will automatically distribute the model across 2 GPUs in the Ray cluster.
Method 2: Pipeline Parallelism
Pipeline parallelism splits the model into sequential stages, assigning contiguous groups of layers to different GPUs.
vllm serve \
--model "meta-llama/Llama-3.1-70B-Instruct" \
--pipeline-parallel-size 2 \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000
Method 3: Combined Parallelism
For very large models, combine tensor and pipeline parallelism. The product of the two sizes must equal the total number of GPUs in the cluster (8 in this example):
vllm serve \
--model "meta-llama/Llama-3.1-405B-Instruct" \
--tensor-parallel-size 4 \
--pipeline-parallel-size 2 \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000
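Multi-node parallelism moves activations between nodes via NCCL over the cluster network. If a node has more than one network interface, pinning NCCL and Gloo to the correct one can prevent startup hangs. A sketch, where eth0 is only a placeholder interface name (check yours with ip addr):
# Set on every node before starting Ray / vLLM; eth0 is a placeholder
export NCCL_SOCKET_IFNAME=eth0
export GLOO_SOCKET_IFNAME=eth0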
Step 6: Test Cluster Inference
# Test from master node
curl http://localhost:8000/v1/models
# Test from external machine
curl http://spark-alpha.local:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-70B-Instruct",
"prompt": "Explain distributed inference in 3 sentences.",
"max_tokens": 100,
"temperature": 0.7
}'
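For chat-style clients, the OpenAI-compatible chat endpoint can be exercised the same way (same model name as above):
curl http://spark-alpha.local:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "Explain distributed inference in 3 sentences."}],
    "max_tokens": 100
  }'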
Step 7: Monitor Cluster
Ray Dashboard
Access at: http://spark-alpha.local:8265
Shows:
- Node status and resources
- Task execution
- GPU utilization
- Memory usage
vLLM Metrics
# On master node
tail -f ~/vllm-install/vllm-server.log
# Check GPU usage across the cluster (ray exec needs a cluster config file; with a
# manually started cluster, loop over the nodes via SSH instead)
for node in spark-alpha.local spark-omega.local spark-gamma.local; do ssh $node nvidia-smi; done
System Monitoring
# Check Ray cluster status
ray status
# Monitor GPU usage on specific node
ssh spark-omega.local nvidia-smi -l 1
Troubleshooting
Workers Not Connecting
Problem: Workers can't connect to Ray head node
Solutions:
- Check firewall: sudo ufw status
- Verify the head node IP: ray status on the master
- Check network connectivity: ping spark-alpha.local
- Ensure the same Ray version on all nodes: ray --version (see the loop below)
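One way to compare Ray versions in a single pass (hostnames and the env script path are taken from this guide):
for node in spark-alpha.local spark-omega.local spark-gamma.local; do
  echo -n "$node: "
  ssh $node "source ~/vllm-install/vllm_env.sh && ray --version"
done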
OOM Errors with Large Models
Problem: Out of memory when loading large models
Solutions:
- Increase tensor parallelism: --tensor-parallel-size 4
- Reduce GPU memory utilization: --gpu-memory-utilization 0.8
- Enable CPU offloading: --cpu-offload-gb 8
- Use quantization: --quantization awq or --quantization gptq (some of these flags are combined in the sketch below)
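A memory-conscious launch might look like the following sketch. The flag values are illustrative, and --quantization is omitted because it additionally requires a pre-quantized checkpoint rather than the base model ID used here:
vllm serve \
  --model "meta-llama/Llama-3.1-70B-Instruct" \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.8 \
  --cpu-offload-gb 8 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000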
Model Loading Hangs
Problem: Model download/loading takes forever
Solutions:
- Pre-download the model on all nodes (see the loop after this list). On each node:
python -c "from huggingface_hub import snapshot_download; snapshot_download('meta-llama/Llama-3.1-70B-Instruct')"
- Use shared storage (NFS) for the model cache
- Check network bandwidth between nodes
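A sketch that pre-downloads on the workers over SSH (hostnames and env script path from this guide; gated models also need Hugging Face credentials configured on each node):
for node in spark-omega.local spark-gamma.local; do
  ssh $node "source ~/vllm-install/vllm_env.sh && \
    python -c \"from huggingface_hub import snapshot_download; snapshot_download('meta-llama/Llama-3.1-70B-Instruct')\""
done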
Uneven GPU Utilization
Problem: Some GPUs idle while others maxed out
Solutions:
- Verify tensor parallel configuration
- Check Ray resource allocation: ray status
- Ensure balanced request distribution
- Monitor GPU usage per node, e.g. ssh spark-omega.local nvidia-smi
Advanced Configuration
Custom Ray Resources
Assign custom resources to nodes for fine-grained control:
# On worker with high memory
ray start --address='MASTER_IP:6379' \
--num-gpus=1 \
--resources='{"highmem": 1}'
# Reference the custom resource when launching vLLM (check vllm serve --help for
# the exact flag name supported by your version)
vllm serve --model "..." --placement-group-resources='{"highmem": 1}'
Distributed Model Cache
Share model weights via NFS to avoid redundant downloads:
# On NFS server (e.g., master)
sudo apt install nfs-kernel-server
echo "$HOME/.cache/huggingface *(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports
sudo exportfs -a
# On workers
sudo apt install nfs-common
sudo mkdir -p $HOME/.cache/huggingface
sudo mount spark-alpha.local:$HOME/.cache/huggingface $HOME/.cache/huggingface
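To confirm the cache is actually shared, a quick check on a worker after mounting (the hub subdirectory is where the Hugging Face cache stores downloaded models):
df -h $HOME/.cache/huggingface      # should list spark-alpha.local:... as the filesystem
ls $HOME/.cache/huggingface/hub     # models downloaded on the master should appear here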
Load Balancing with nginx
For production deployments, use nginx to load balance across multiple vLLM instances:
upstream vllm_cluster {
least_conn;
server spark-alpha.local:8000;
server spark-omega.local:8000;
server spark-gamma.local:8000;
}
server {
listen 80;
location / {
proxy_pass http://vllm_cluster;
proxy_set_header Host $host;
}
}
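Note that this pattern assumes each node runs its own complete vLLM server (one model copy per node) rather than the single Ray-backed deployment above. A quick check through the proxy, assuming nginx runs on spark-alpha and listens on port 80 as configured:
curl http://spark-alpha.local/v1/models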
Cluster Management Scripts
Start Cluster
Create start-cluster.sh:
#!/bin/bash
# Start Ray cluster on all nodes
ssh spark-alpha.local "source ~/vllm-install/vllm_env.sh && ray start --head --port=6379"
sleep 5
MASTER_IP=$(ssh spark-alpha.local "hostname -I | awk '{print \$1}'")
ssh spark-omega.local "source ~/vllm-install/vllm_env.sh && ray start --address='${MASTER_IP}:6379'"
ssh spark-gamma.local "source ~/vllm-install/vllm_env.sh && ray start --address='${MASTER_IP}:6379'"
echo "Cluster started. Check status with: ray status"
Stop Cluster
Create stop-cluster.sh:
#!/bin/bash
# Stop Ray cluster on all nodes
for node in spark-alpha.local spark-omega.local spark-gamma.local; do
echo "Stopping Ray on $node..."
ssh $node "ray stop --force"
done
echo "Cluster stopped."
Performance Tuning
For Maximum Throughput
vllm serve \
--model "meta-llama/Llama-3.1-70B-Instruct" \
--tensor-parallel-size 2 \
--max-num-seqs 256 \
--max-num-batched-tokens 8192 \
--gpu-memory-utilization 0.95
For Low Latency
vllm serve \
--model "meta-llama/Llama-3.1-70B-Instruct" \
--tensor-parallel-size 2 \
--max-num-seqs 32 \
--disable-log-requests
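A rough single-request latency check against the running server (model name as above; curl's time_total includes the full generation time):
curl -s -o /dev/null -w "total: %{time_total}s\n" http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-70B-Instruct", "prompt": "Hello", "max_tokens": 16}'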
Support
For issues specific to DGX Spark cluster setup, please open an issue on GitHub.