vLLM Cluster Mode Setup for DGX Spark
This guide covers setting up a multi-node vLLM deployment for distributed inference across DGX Spark systems.
Prerequisites
- Multiple DGX Spark systems with vLLM installed (use install.sh on each node)
- All nodes on the same network with direct connectivity
- SSH access between nodes (passwordless SSH recommended)
- Same CUDA and vLLM versions across all nodes
Architecture
┌─────────────────────┐
│ spark-alpha │
│ (Master/Head) │
│ - API Server │
│ - Request Router │
│ - Model Weights │
└──────────┬──────────┘
│
├─────────────────────┐
│ │
┌──────────▼──────────┐ ┌──────▼──────────┐
│ spark-omega │ │ spark-gamma │
│ (Worker 1) │ │ (Worker 2) │
│ - Inference │ │ - Inference │
│ - GPU Compute │ │ - GPU Compute │
└─────────────────────┘ └─────────────────┘
Step 1: Install vLLM on All Nodes
Run the installer on each node:
# On spark-alpha (master)
curl -fsSL https://raw.githubusercontent.com/eelbaz/dgx-spark-vllm-setup/main/install.sh | bash
# On spark-omega (worker 1)
ssh spark-omega.local
curl -fsSL https://raw.githubusercontent.com/eelbaz/dgx-spark-vllm-setup/main/install.sh | bash
# On spark-gamma (worker 2)
ssh spark-gamma.local
curl -fsSL https://raw.githubusercontent.com/eelbaz/dgx-spark-vllm-setup/main/install.sh | bash
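Before clustering, it is worth confirming the installs match the "same vLLM version" prerequisite. A minimal sketch, assuming the env script path created by the installer and SSH access from the prerequisites:
# Compare vLLM versions across nodes
for node in spark-alpha.local spark-omega.local spark-gamma.local; do
  echo -n "$node: "
  ssh $node "source ~/vllm-install/vllm_env.sh && python -c 'import vllm; print(vllm.__version__)'"
done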
Step 2: Configure Network Settings
Ensure all nodes can communicate on the required ports:
- 8000: vLLM API server (master only)
- 6379: Ray GCS (master only)
- 8265: Ray dashboard (master only)
- 29500: PyTorch distributed backend (all nodes)
- Ephemeral ports: Ray worker and object manager communication (all nodes)
Open firewall if needed:
# On all nodes
sudo ufw allow 8000/tcp
sudo ufw allow 29500/tcp
sudo ufw allow 6379/tcp # Ray GCS
sudo ufw allow 8265/tcp # Ray Dashboard
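A quick reachability check from each worker toward the head node (a sketch; it assumes netcat is installed and that spark-alpha.local resolves to the head node; each check only succeeds once the corresponding service is listening, i.e. Ray from Step 4 and vLLM from Step 5):
# Run on spark-omega and spark-gamma
nc -zv spark-alpha.local 6379    # Ray GCS
nc -zv spark-alpha.local 8000    # vLLM API server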
Step 3: Set Up Passwordless SSH (Optional but Recommended)
# On master node
ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa -N ""
# Copy to worker nodes
ssh-copy-id spark-omega.local
ssh-copy-id spark-gamma.local
# Verify
ssh spark-omega.local "echo 'Connection successful'"
ssh spark-gamma.local "echo 'Connection successful'"
Step 4: Start Ray Cluster
On Master Node (spark-alpha)
# Assuming vllm-install is in your home directory
source ~/vllm-install/vllm_env.sh
# Start Ray head node
ray start --head \
--port=6379 \
--dashboard-host=0.0.0.0 \
--dashboard-port=8265 \
--num-gpus=1
# Note the join command printed in the output, e.g. ray start --address='MASTER_IP:6379'
On Worker Nodes (spark-omega, spark-gamma)
source ~/vllm-install/vllm_env.sh
# Replace MASTER_IP with spark-alpha's IP address
ray start --address='MASTER_IP:6379' --num-gpus=1
Verify cluster status:
ray status
You should see all nodes listed.
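To check the aggregate resources programmatically, a small sketch run from any node that has already joined the cluster (with the vLLM env sourced):
# Should report GPU: 3.0 for the three-node layout above
python -c "import ray; ray.init(address='auto'); print(ray.cluster_resources())"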
Step 5: Start vLLM with Tensor Parallelism
Method 1: Tensor Parallelism (Recommended for Large Models)
Tensor parallelism splits each layer's weights across multiple GPUs, so every GPU participates in every layer.
# On master node
source ~/vllm-install/vllm_env.sh
vllm serve \
--model "meta-llama/Llama-3.1-70B-Instruct" \
--tensor-parallel-size 2 \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000
This will automatically distribute the model across 2 GPUs in the Ray cluster.
Method 2: Pipeline Parallelism
Pipeline parallelism splits the model into sequential stages, assigning contiguous groups of layers to different GPUs.
vllm serve \
--model "meta-llama/Llama-3.1-70B-Instruct" \
--pipeline-parallel-size 2 \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000
Method 3: Combined Parallelism
For very large models, combine tensor and pipeline parallelism. The product of the two sizes must equal the total number of GPUs in the cluster (8 in this example):
vllm serve \
--model "meta-llama/Llama-3.1-405B-Instruct" \
--tensor-parallel-size 4 \
--pipeline-parallel-size 2 \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000
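Multi-node parallelism moves activations between nodes via NCCL over the cluster network. If a node has more than one network interface, pinning NCCL and Gloo to the correct one can prevent startup hangs. A sketch, where eth0 is only a placeholder interface name (check yours with ip addr):
# Set on every node before starting Ray / vLLM; eth0 is a placeholder
export NCCL_SOCKET_IFNAME=eth0
export GLOO_SOCKET_IFNAME=eth0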
Step 6: Test Cluster Inference
# Test from master node
curl http://localhost:8000/v1/models
# Test from external machine
curl http://spark-alpha.local:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-70B-Instruct",
"prompt": "Explain distributed inference in 3 sentences.",
"max_tokens": 100,
"temperature": 0.7
}'
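For chat-style clients, the OpenAI-compatible chat endpoint can be exercised the same way (same model name as above):
curl http://spark-alpha.local:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "Explain distributed inference in 3 sentences."}],
    "max_tokens": 100
  }'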
Step 7: Monitor Cluster
Ray Dashboard
Access at: http://spark-alpha.local:8265
Shows:
- Node status and resources
- Task execution
- GPU utilization
- Memory usage
vLLM Metrics
# On master node
tail -f ~/vllm-install/vllm-server.log
# Check GPU usage across the cluster (ray exec needs a cluster config file; with a
# manually started cluster, loop over the nodes via SSH instead)
for node in spark-alpha.local spark-omega.local spark-gamma.local; do ssh $node nvidia-smi; done
System Monitoring
# Check Ray cluster status
ray status
# Monitor GPU usage on specific node
ssh spark-omega.local nvidia-smi -l 1
Troubleshooting
Workers Not Connecting
Problem: Workers can't connect to Ray head node
Solutions:
- Check firewall: sudo ufw status
- Verify the head node IP: ray status on the master
- Check network connectivity: ping spark-alpha.local
- Ensure the same Ray version on all nodes: ray --version (see the loop below)
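One way to compare Ray versions in a single pass (hostnames and the env script path are taken from this guide):
for node in spark-alpha.local spark-omega.local spark-gamma.local; do
  echo -n "$node: "
  ssh $node "source ~/vllm-install/vllm_env.sh && ray --version"
done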
OOM Errors with Large Models
Problem: Out of memory when loading large models
Solutions:
- Increase tensor parallelism: --tensor-parallel-size 4
- Reduce GPU memory utilization: --gpu-memory-utilization 0.8
- Enable CPU offloading: --cpu-offload-gb 8
- Use quantization: --quantization awq or --quantization gptq (some of these flags are combined in the sketch below)
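A memory-conscious launch might look like the following sketch. The flag values are illustrative, and --quantization is omitted because it additionally requires a pre-quantized checkpoint rather than the base model ID used here:
vllm serve \
  --model "meta-llama/Llama-3.1-70B-Instruct" \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.8 \
  --cpu-offload-gb 8 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000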
Model Loading Hangs
Problem: Model download/loading takes forever
Solutions:
- Pre-download the model on all nodes (see the loop after this list). On each node:
python -c "from huggingface_hub import snapshot_download; snapshot_download('meta-llama/Llama-3.1-70B-Instruct')"
- Use shared storage (NFS) for the model cache
- Check network bandwidth between nodes
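A sketch that pre-downloads on the workers over SSH (hostnames and env script path from this guide; gated models also need Hugging Face credentials configured on each node):
for node in spark-omega.local spark-gamma.local; do
  ssh $node "source ~/vllm-install/vllm_env.sh && \
    python -c \"from huggingface_hub import snapshot_download; snapshot_download('meta-llama/Llama-3.1-70B-Instruct')\""
done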
Uneven GPU Utilization
Problem: Some GPUs idle while others maxed out
Solutions:
- Verify tensor parallel configuration
- Check Ray resource allocation: ray status
- Ensure balanced request distribution
- Monitor GPU usage per node, e.g. ssh spark-omega.local nvidia-smi
Advanced Configuration
Custom Ray Resources
Assign custom resources to nodes for fine-grained control:
# On worker with high memory
ray start --address='MASTER_IP:6379' \
--num-gpus=1 \
--resources='{"highmem": 1}'
# Reference the custom resource when launching vLLM (check vllm serve --help for
# the exact flag name supported by your version)
vllm serve --model "..." --placement-group-resources='{"highmem": 1}'
Distributed Model Cache
Share model weights via NFS to avoid redundant downloads:
# On NFS server (e.g., master)
sudo apt install nfs-kernel-server
echo "$HOME/.cache/huggingface *(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports
sudo exportfs -a
# On workers
sudo apt install nfs-common
sudo mkdir -p $HOME/.cache/huggingface
sudo mount spark-alpha.local:$HOME/.cache/huggingface $HOME/.cache/huggingface
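To confirm the cache is actually shared, a quick check on a worker after mounting (the hub subdirectory is where the Hugging Face cache stores downloaded models):
df -h $HOME/.cache/huggingface      # should list spark-alpha.local:... as the filesystem
ls $HOME/.cache/huggingface/hub     # models downloaded on the master should appear here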
Load Balancing with nginx
For production deployments, use nginx to load balance across multiple vLLM instances:
upstream vllm_cluster {
least_conn;
server spark-alpha.local:8000;
server spark-omega.local:8000;
server spark-gamma.local:8000;
}
server {
listen 80;
location / {
proxy_pass http://vllm_cluster;
proxy_set_header Host $host;
}
}
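Note that this pattern assumes each node runs its own complete vLLM server (one model copy per node) rather than the single Ray-backed deployment above. A quick check through the proxy, assuming nginx runs on spark-alpha and listens on port 80 as configured:
curl http://spark-alpha.local/v1/models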
Cluster Management Scripts
Start Cluster
Create start-cluster.sh:
#!/bin/bash
# Start Ray cluster on all nodes
ssh spark-alpha.local "source ~/vllm-install/vllm_env.sh && ray start --head --port=6379"
sleep 5
MASTER_IP=$(ssh spark-alpha.local "hostname -I | awk '{print \$1}'")
ssh spark-omega.local "source ~/vllm-install/vllm_env.sh && ray start --address='${MASTER_IP}:6379'"
ssh spark-gamma.local "source ~/vllm-install/vllm_env.sh && ray start --address='${MASTER_IP}:6379'"
echo "Cluster started. Check status with: ray status"
Stop Cluster
Create stop-cluster.sh:
#!/bin/bash
# Stop Ray cluster on all nodes
for node in spark-alpha.local spark-omega.local spark-gamma.local; do
echo "Stopping Ray on $node..."
ssh $node "ray stop --force"
done
echo "Cluster stopped."
Performance Tuning
For Maximum Throughput
vllm serve \
--model "meta-llama/Llama-3.1-70B-Instruct" \
--tensor-parallel-size 2 \
--max-num-seqs 256 \
--max-num-batched-tokens 8192 \
--gpu-memory-utilization 0.95
For Low Latency
vllm serve \
--model "meta-llama/Llama-3.1-70B-Instruct" \
--tensor-parallel-size 2 \
--max-num-seqs 32 \
--disable-log-requests
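A rough single-request latency check against the running server (model name as above; curl's time_total includes the full generation time):
curl -s -o /dev/null -w "total: %{time_total}s\n" http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-70B-Instruct", "prompt": "Hello", "max_tokens": 16}'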
Support
For issues specific to DGX Spark cluster setup, please open an issue on GitHub.