vLLM Cluster Mode Setup for DGX Spark

This guide covers setting up multi-node vLLM deployment on DGX Spark systems using distributed inference.

Prerequisites

  • Multiple DGX Spark systems with vLLM installed (use install.sh on each node)
  • All nodes on the same network with direct connectivity
  • SSH access between nodes (passwordless SSH recommended)
  • Same CUDA and vLLM versions across all nodes (see the quick check below)
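
A quick way to confirm matching versions across the cluster (a sketch; assumes the installer's vllm_env.sh activates the environment on each node):

# Run from any machine with SSH access to all nodes
for node in spark-alpha.local spark-omega.local spark-gamma.local; do
  echo "== $node =="
  ssh $node "source ~/vllm-install/vllm_env.sh && python -c 'import vllm; print(vllm.__version__)' && nvcc --version | tail -n 1"
done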

Architecture

┌─────────────────────┐
│   spark-alpha       │
│   (Master/Head)     │
│   - API Server      │
│   - Request Router  │
│   - Model Weights   │
└──────────┬──────────┘
           │
           ├─────────────────────┐
           │                     │
┌──────────▼──────────┐  ┌──────▼──────────┐
│   spark-omega       │  │   spark-gamma   │
│   (Worker 1)        │  │   (Worker 2)    │
│   - Inference       │  │   - Inference   │
│   - GPU Compute     │  │   - GPU Compute │
└─────────────────────┘  └─────────────────┘

Step 1: Install vLLM on All Nodes

Run the installer on each node:

# On spark-alpha (master)
curl -fsSL https://raw.githubusercontent.com/eelbaz/dgx-spark-vllm-setup/main/install.sh | bash

# On spark-omega (worker 1)
ssh spark-omega.local 'curl -fsSL https://raw.githubusercontent.com/eelbaz/dgx-spark-vllm-setup/main/install.sh | bash'

# On spark-gamma (worker 2)
ssh spark-gamma.local 'curl -fsSL https://raw.githubusercontent.com/eelbaz/dgx-spark-vllm-setup/main/install.sh | bash'

Step 2: Configure Network Settings

Ensure all nodes can communicate on the required ports:

  • 8000: vLLM API server (master only)
  • 6379: Ray GCS (master only)
  • 8265: Ray dashboard (master only)
  • 29500: PyTorch distributed backend (all nodes)
  • Ephemeral ports: Ray worker communication (chosen at random unless pinned)

Open firewall if needed:

# On all nodes
sudo ufw allow 8000/tcp
sudo ufw allow 29500/tcp
sudo ufw allow 6379/tcp   # Ray GCS
sudo ufw allow 8265/tcp   # Ray Dashboard
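
You can confirm that a worker can reach the head node's ports with netcat (assumes nc is installed):

# From a worker node
nc -zv spark-alpha.local 6379
nc -zv spark-alpha.local 8265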

Step 3: Set Up Passwordless SSH

Generate a key on the master and copy it to each worker:

# On master node
ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa -N ""

# Copy to worker nodes
ssh-copy-id spark-omega.local
ssh-copy-id spark-gamma.local

# Verify
ssh spark-omega.local "echo 'Connection successful'"
ssh spark-gamma.local "echo 'Connection successful'"

Step 4: Start Ray Cluster

On Master Node (spark-alpha)

# Assuming vllm-install is in your home directory
source ~/vllm-install/vllm_env.sh

# Start Ray head node
ray start --head \
  --port=6379 \
  --dashboard-host=0.0.0.0 \
  --dashboard-port=8265 \
  --num-gpus=1

# Note the output: "To connect to this Ray cluster, use: ray start --address='MASTER_IP:6379'"

On Worker Nodes (spark-omega, spark-gamma)

source ~/vllm-install/vllm_env.sh

# Replace MASTER_IP with spark-alpha's IP address
ray start --address='MASTER_IP:6379' --num-gpus=1

Verify cluster status:

ray status

You should see all nodes listed.
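
For a programmatic check, the Ray Python API reports how many nodes have joined (a minimal sketch, run from any node in the cluster):

python -c "import ray; ray.init(address='auto'); print(sum(n['Alive'] for n in ray.nodes()), 'alive nodes')"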

Step 5: Start vLLM with Distributed Inference

Method 1: Tensor Parallelism

Tensor parallelism shards each layer's weight matrices across multiple GPUs.

# On master node
source ~/vllm-install/vllm_env.sh

vllm serve "meta-llama/Llama-3.1-70B-Instruct" \
  --distributed-executor-backend ray \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000

vLLM launches its workers through Ray and automatically distributes the model across 2 GPUs in the cluster.

Method 2: Pipeline Parallelism

Pipeline parallelism splits the model into sequential stages, placing one stage per GPU.

vllm serve "meta-llama/Llama-3.1-70B-Instruct" \
  --pipeline-parallel-size 2 \
  --distributed-executor-backend ray \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000

Method 3: Combined Parallelism

For very large models, combine tensor and pipeline parallelism; the product of the two sizes must match the cluster's total GPU count (see the check after the command below):

vllm serve "meta-llama/Llama-3.1-405B-Instruct" \
  --distributed-executor-backend ray \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 2 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000
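
With --tensor-parallel-size 4 and --pipeline-parallel-size 2 the deployment needs 4 × 2 = 8 GPUs, i.e. eight single-GPU DGX Spark nodes. To see how many GPUs Ray currently exposes (a minimal sketch using the Ray Python API):

python -c "import ray; ray.init(address='auto'); print('cluster GPUs:', ray.cluster_resources().get('GPU', 0))"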

Step 6: Test Cluster Inference

# Test from master node
curl http://localhost:8000/v1/models

# Test from external machine
curl http://spark-alpha.local:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "prompt": "Explain distributed inference in 3 sentences.",
    "max_tokens": 100,
    "temperature": 0.7
  }'
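
The server also exposes the OpenAI-compatible chat endpoint; a minimal chat request:

curl http://spark-alpha.local:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "Explain distributed inference in 3 sentences."}],
    "max_tokens": 100
  }'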

Step 7: Monitor Cluster

Ray Dashboard

Access at: http://spark-alpha.local:8265

Shows:

  • Node status and resources
  • Task execution
  • GPU utilization
  • Memory usage

vLLM Metrics

# On master node
tail -f ~/vllm-install/vllm-server.log

# Check GPU usage across the cluster (ray exec requires a cluster YAML,
# so loop over the nodes with ssh instead)
for node in spark-alpha.local spark-omega.local spark-gamma.local; do
  echo "== $node =="
  ssh $node nvidia-smi
done

System Monitoring

# Check Ray cluster status
ray status

# Monitor GPU usage on specific node
ssh spark-omega.local nvidia-smi -l 1

Troubleshooting

Workers Not Connecting

Problem: Workers can't connect to Ray head node

Solutions:

  1. Check firewall: sudo ufw status
  2. Verify head node IP: ray status on master
  3. Check network connectivity: ping spark-alpha.local
  4. Ensure same Ray version on all nodes: ray --version

OOM Errors with Large Models

Problem: Out of memory when loading large models

Solutions:

  1. Increase tensor parallelism: --tensor-parallel-size 4
  2. Reduce memory utilization: --gpu-memory-utilization 0.8
  3. Enable CPU offloading: --cpu-offload-gb 8
  4. Use quantization: --quantization awq or --quantization gptq

Model Loading Hangs

Problem: Model download/loading takes forever

Solutions:

  1. Pre-download the model on all nodes (snapshot_download fetches the weights without loading them into memory; gated models require huggingface-cli login first):
    # On each node
    python -c "from huggingface_hub import snapshot_download; snapshot_download('meta-llama/Llama-3.1-70B-Instruct')"
    
  2. Use shared storage (NFS) for model cache
  3. Check network bandwidth between nodes
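
Alternatively, if the weights are already cached on the master, copy them straight to the workers (a sketch using rsync; the directory name follows Hugging Face's models--ORG--NAME cache convention):

for node in spark-omega.local spark-gamma.local; do
  ssh $node 'mkdir -p ~/.cache/huggingface/hub'
  rsync -a ~/.cache/huggingface/hub/models--meta-llama--Llama-3.1-70B-Instruct \
    $node:~/.cache/huggingface/hub/
done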

Uneven GPU Utilization

Problem: Some GPUs idle while others maxed out

Solutions:

  1. Verify tensor parallel configuration
  2. Check Ray resource allocation: ray status
  3. Ensure balanced request distribution
  4. Monitor per-node GPU usage with the ssh loop from Step 7

Advanced Configuration

Custom Ray Resources

Assign custom resources to nodes for fine-grained control:

# On worker with high memory
ray start --address='MASTER_IP:6379' \
  --num-gpus=1 \
  --resources='{"highmem": 1}'

# The resource now shows up under ray status and is available to Ray's scheduler
ray status | grep highmem
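
Custom resources are consumed by Ray's scheduler; for example, a Ray task can be pinned to the node advertising highmem (a minimal sketch using the Ray Python API):

python - <<'EOF'
import ray
ray.init(address="auto")

@ray.remote(resources={"highmem": 1})
def hostname():
    import socket
    return socket.gethostname()

# Runs on the worker that registered the "highmem" resource
print(ray.get(hostname.remote()))
EOF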

Distributed Model Cache

Share model weights via NFS to avoid redundant downloads:

# On NFS server (e.g., master)
sudo apt install nfs-kernel-server
echo "$HOME/.cache/huggingface *(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports
sudo exportfs -a

# On workers (assumes the same home directory path on every node)
sudo apt install nfs-common
mkdir -p $HOME/.cache/huggingface
sudo mount spark-alpha.local:$HOME/.cache/huggingface $HOME/.cache/huggingface
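
To make the mount persist across reboots, an /etc/fstab entry along these lines can be added (replace USER with your username):

# /etc/fstab on each worker
spark-alpha.local:/home/USER/.cache/huggingface  /home/USER/.cache/huggingface  nfs  defaults,_netdev  0  0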

Load Balancing with nginx

For production deployments, use nginx to load-balance across multiple independent vLLM instances. Each server in the upstream below must run its own vLLM replica; a single distributed deployment like the one above is served only through the master:

upstream vllm_cluster {
    least_conn;
    server spark-alpha.local:8000;
    server spark-omega.local:8000;
    server spark-gamma.local:8000;
}

server {
    listen 80;
    location / {
        proxy_pass http://vllm_cluster;
        proxy_set_header Host $host;
    }
}
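
After installing the config (for example under /etc/nginx/conf.d/), validate and reload nginx, then smoke-test the balancer:

sudo nginx -t && sudo systemctl reload nginx
curl http://localhost/v1/models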

Cluster Management Scripts

Start Cluster

Create start-cluster.sh:

#!/bin/bash
# Start Ray cluster on all nodes

ssh spark-alpha.local "source ~/vllm-install/vllm_env.sh && ray start --head --port=6379"
sleep 5

MASTER_IP=$(ssh spark-alpha.local "hostname -I | awk '{print \$1}'")

ssh spark-omega.local "source ~/vllm-install/vllm_env.sh && ray start --address='${MASTER_IP}:6379'"
ssh spark-gamma.local "source ~/vllm-install/vllm_env.sh && ray start --address='${MASTER_IP}:6379'"

echo "Cluster started. Check status with: ray status"

Stop Cluster

Create stop-cluster.sh:

#!/bin/bash
# Stop Ray cluster on all nodes

for node in spark-alpha.local spark-omega.local spark-gamma.local; do
    echo "Stopping Ray on $node..."
    ssh $node "ray stop --force"
done

echo "Cluster stopped."

Performance Tuning

For Maximum Throughput

vllm serve "meta-llama/Llama-3.1-70B-Instruct" \
  --tensor-parallel-size 2 \
  --max-num-seqs 256 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.95

For Low Latency

vllm serve "meta-llama/Llama-3.1-70B-Instruct" \
  --tensor-parallel-size 2 \
  --max-num-seqs 32 \
  --disable-log-requests
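
A crude way to compare the two profiles is to time a burst of concurrent requests (a sketch using xargs; adjust concurrency and token counts to taste):

seq 1 8 | xargs -P 8 -I{} curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-70B-Instruct", "prompt": "Hello", "max_tokens": 32}' \
  -o /dev/null -w "request {}: %{time_total}s\n"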

Support

For issues specific to DGX Spark cluster setup, please open an issue on GitHub.