# vLLM Cluster Mode Setup for DGX Spark

This guide covers setting up a multi-node vLLM deployment on DGX Spark systems using distributed inference.

## Prerequisites

- Multiple DGX Spark systems with vLLM installed (use `install.sh` on each node)
- All nodes on the same network with direct connectivity
- SSH access between nodes (passwordless SSH recommended)
- Same CUDA and vLLM versions across all nodes

## Architecture

```
┌─────────────────────┐
│     spark-alpha     │
│    (Master/Head)    │
│  - API Server       │
│  - Request Router   │
│  - Model Weights    │
└──────────┬──────────┘
           │
           ├─────────────────────┐
           │                     │
┌──────────▼──────────┐  ┌──────▼──────────┐
│     spark-omega     │  │   spark-gamma   │
│     (Worker 1)      │  │   (Worker 2)    │
│  - Inference        │  │  - Inference    │
│  - GPU Compute      │  │  - GPU Compute  │
└─────────────────────┘  └─────────────────┘
```

## Step 1: Install vLLM on All Nodes

Run the installer on each node:

```bash
# On spark-alpha (master)
curl -fsSL https://raw.githubusercontent.com/eelbaz/dgx-spark-vllm-setup/main/install.sh | bash

# On spark-omega (worker 1)
ssh spark-omega.local
curl -fsSL https://raw.githubusercontent.com/eelbaz/dgx-spark-vllm-setup/main/install.sh | bash

# On spark-gamma (worker 2)
ssh spark-gamma.local
curl -fsSL https://raw.githubusercontent.com/eelbaz/dgx-spark-vllm-setup/main/install.sh | bash
```

## Step 2: Configure Network Settings

Ensure all nodes can communicate on the required ports:

- **8000**: vLLM API server (master only)
- **29500**: PyTorch distributed backend (all nodes)
- **6379**: Ray GCS (head node)
- **8265**: Ray dashboard (head node)
- **Random ports**: Ray worker communication

Open the firewall if needed:

```bash
# On all nodes
sudo ufw allow 8000/tcp
sudo ufw allow 29500/tcp
sudo ufw allow 6379/tcp   # Ray GCS
sudo ufw allow 8265/tcp   # Ray Dashboard
```

## Step 3: Set Up Passwordless SSH (Optional but Recommended)

```bash
# On master node
ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa -N ""

# Copy to worker nodes
ssh-copy-id spark-omega.local
ssh-copy-id spark-gamma.local

# Verify
ssh spark-omega.local "echo 'Connection successful'"
ssh spark-gamma.local "echo 'Connection successful'"
```

## Step 4: Start Ray Cluster

### On Master Node (spark-alpha)

```bash
# Assuming vllm-install is in your home directory
source ~/vllm-install/vllm_env.sh

# Start Ray head node
ray start --head \
  --port=6379 \
  --dashboard-host=0.0.0.0 \
  --dashboard-port=8265 \
  --num-gpus=1

# Note the connect command printed in the output, e.g.:
# ray start --address='MASTER_IP:6379'
```

### On Worker Nodes (spark-omega, spark-gamma)

```bash
source ~/vllm-install/vllm_env.sh

# Replace MASTER_IP with spark-alpha's IP address
ray start --address='MASTER_IP:6379' --num-gpus=1
```

Verify cluster status:

```bash
ray status
```

You should see all nodes listed.

## Step 5: Start vLLM with Tensor Parallelism

### Method 1: Tensor Parallelism (Recommended for Large Models)

Tensor parallelism shards the weights of each layer across multiple GPUs, so every GPU participates in every layer.

```bash
# On master node
source ~/vllm-install/vllm_env.sh

vllm serve "meta-llama/Llama-3.1-70B-Instruct" \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000
```

This automatically distributes the model across 2 GPUs in the Ray cluster.
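If you prefer driving the cluster from Python instead of `vllm serve`, vLLM's offline `LLM` API accepts the same parallelism settings. A minimal sketch, assuming the Ray cluster from Step 4 is already running and your vLLM version exposes the `distributed_executor_backend` argument:

```python
from vllm import LLM, SamplingParams

# With the Ray cluster from Step 4 running, vLLM shards the model
# across the cluster's GPUs, mirroring --tensor-parallel-size 2 above.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=2,
    trust_remote_code=True,
    distributed_executor_backend="ray",  # available in recent vLLM releases
)

params = SamplingParams(temperature=0.7, max_tokens=100)
outputs = llm.generate(["Explain distributed inference in 3 sentences."], params)
print(outputs[0].outputs[0].text)
```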
### Method 2: Pipeline Parallelism

Pipeline parallelism assigns contiguous groups of layers (stages) to different GPUs, reducing inter-GPU communication at the cost of pipeline latency.

```bash
vllm serve "meta-llama/Llama-3.1-70B-Instruct" \
  --pipeline-parallel-size 2 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000
```

### Method 3: Combined Parallelism

For very large models, combine tensor and pipeline parallelism (the example below requires 4 × 2 = 8 GPUs in the cluster):

```bash
vllm serve "meta-llama/Llama-3.1-405B-Instruct" \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 2 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000
```

## Step 6: Test Cluster Inference

```bash
# Test from master node
curl http://localhost:8000/v1/models

# Test from external machine
curl http://spark-alpha.local:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "prompt": "Explain distributed inference in 3 sentences.",
    "max_tokens": 100,
    "temperature": 0.7
  }'
```
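Since vLLM serves an OpenAI-compatible API, the official `openai` Python client works as well. A minimal sketch (the `EMPTY` API key is a placeholder; vLLM does not check it unless the server is started with `--api-key`):

```python
from openai import OpenAI

# Point the client at the vLLM API server on the master node.
client = OpenAI(base_url="http://spark-alpha.local:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    prompt="Explain distributed inference in 3 sentences.",
    max_tokens=100,
    temperature=0.7,
)
print(response.choices[0].text)
```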
## Step 7: Monitor Cluster

### Ray Dashboard

Access at: http://spark-alpha.local:8265

Shows:

- Node status and resources
- Task execution
- GPU utilization
- Memory usage

### vLLM Metrics

```bash
# On master node
tail -f ~/vllm-install/vllm-server.log

# Check GPU usage across the cluster
for node in spark-alpha.local spark-omega.local spark-gamma.local; do
  echo "=== $node ==="
  ssh $node nvidia-smi
done
```

### System Monitoring

```bash
# Check Ray cluster status
ray status

# Monitor GPU usage on a specific node
ssh spark-omega.local nvidia-smi -l 1
```

## Troubleshooting

### Workers Not Connecting

**Problem**: Workers can't connect to the Ray head node

**Solutions**:

1. Check the firewall: `sudo ufw status`
2. Verify the head node IP: `ray status` on the master
3. Check network connectivity: `ping spark-alpha.local`
4. Ensure the same Ray version on all nodes: `ray --version`

### OOM Errors with Large Models

**Problem**: Out of memory when loading large models

**Solutions**:

1. Increase tensor parallelism: `--tensor-parallel-size 4`
2. Reduce memory utilization: `--gpu-memory-utilization 0.8`
3. Enable CPU offloading: `--cpu-offload-gb 8`
4. Use quantization: `--quantization awq` or `--quantization gptq`

### Model Loading Hangs

**Problem**: Model download/loading takes forever

**Solutions**:

1. Pre-download the model on all nodes:

   ```bash
   # On each node: fetch the weights into the HF cache
   # without loading the model into memory
   python -c "from huggingface_hub import snapshot_download; snapshot_download('meta-llama/Llama-3.1-70B-Instruct')"
   ```

2. Use shared storage (NFS) for the model cache
3. Check network bandwidth between nodes

### Uneven GPU Utilization

**Problem**: Some GPUs idle while others are maxed out

**Solutions**:

1. Verify the tensor parallel configuration
2. Check Ray resource allocation: `ray status`
3. Ensure balanced request distribution
4. Monitor with per-node `nvidia-smi` (see Step 7)

## Advanced Configuration

### Custom Ray Resources

Assign custom resources to nodes for fine-grained scheduling control:

```bash
# On a worker with extra memory, tag the node with a custom resource
ray start --address='MASTER_IP:6379' \
  --num-gpus=1 \
  --resources='{"highmem": 1}'
```

Ray tasks and actors can then request `{"highmem": 1}` to be scheduled onto that node. Note that vLLM places its own workers through Ray placement groups automatically, so custom resources are mainly useful for co-locating auxiliary Ray workloads with specific nodes.

### Distributed Model Cache

Share model weights via NFS to avoid redundant downloads (assumes the same username and home directory on all nodes):

```bash
# On NFS server (e.g., master)
sudo apt install nfs-kernel-server
echo "$HOME/.cache/huggingface *(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports
sudo exportfs -a

# On workers
sudo apt install nfs-common
mkdir -p $HOME/.cache/huggingface
sudo mount spark-alpha.local:$HOME/.cache/huggingface $HOME/.cache/huggingface
```

### Load Balancing with nginx

For production deployments, use nginx to load balance across multiple vLLM instances:

```nginx
upstream vllm_cluster {
    least_conn;
    server spark-alpha.local:8000;
    server spark-omega.local:8000;
    server spark-gamma.local:8000;
}

server {
    listen 80;

    location / {
        proxy_pass http://vllm_cluster;
        proxy_set_header Host $host;
    }
}
```

## Cluster Management Scripts

### Start Cluster

Create `start-cluster.sh`:

```bash
#!/bin/bash
# Start Ray cluster on all nodes

ssh spark-alpha.local "source ~/vllm-install/vllm_env.sh && ray start --head --port=6379"
sleep 5

MASTER_IP=$(ssh spark-alpha.local "hostname -I | awk '{print \$1}'")

ssh spark-omega.local "source ~/vllm-install/vllm_env.sh && ray start --address='${MASTER_IP}:6379'"
ssh spark-gamma.local "source ~/vllm-install/vllm_env.sh && ray start --address='${MASTER_IP}:6379'"

echo "Cluster started. Check status with: ray status"
```

### Stop Cluster

Create `stop-cluster.sh`:

```bash
#!/bin/bash
# Stop Ray cluster on all nodes

for node in spark-alpha.local spark-omega.local spark-gamma.local; do
  echo "Stopping Ray on $node..."
  ssh $node "ray stop --force"
done

echo "Cluster stopped."
```

## Performance Tuning

### For Maximum Throughput

```bash
vllm serve "meta-llama/Llama-3.1-70B-Instruct" \
  --tensor-parallel-size 2 \
  --max-num-seqs 256 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.95
```

### For Low Latency

```bash
vllm serve "meta-llama/Llama-3.1-70B-Instruct" \
  --tensor-parallel-size 2 \
  --max-num-seqs 32 \
  --disable-log-requests
```

## References

- [vLLM Distributed Inference](https://docs.vllm.ai/en/latest/serving/distributed_serving.html)
- [Ray Cluster Setup](https://docs.ray.io/en/latest/cluster/getting-started.html)
- [PyTorch Distributed](https://pytorch.org/tutorials/beginner/dist_overview.html)

## Support

For issues specific to DGX Spark cluster setup, please open an issue on GitHub.