iDatam

Configuring a Multi-Node GPU Cluster for Distributed LLM Training using Ray

Outgrown a single GPU? Learn how to link multiple bare-metal servers into a unified supercomputer using Ray for distributed LLM training. Stop bottlenecking your AI models and start scaling horizontally.

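Training a Large Language Model (LLM), or fine-tuning one on a massive dataset, eventually hits a hard wall: the physical limits of a single server. Even a top-tier machine packed with eight 80 GB NVIDIA H100s (640 GB of VRAM in total) runs out of memory when fully training models near the 100-billion-parameter mark, because weights, gradients, and optimizer state together need far more space than the weights alone.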
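
To make the VRAM wall concrete, here is a back-of-the-envelope estimate. It is only a sketch under stated assumptions: mixed-precision training with the Adam optimizer, a common rule of thumb of about 16 bytes per parameter, and no allowance for activations or framework overhead. The function names are illustrative and not part of any library.

```python
# Rough training-memory estimate; ignores activations and framework overhead.
# Assumed rule of thumb for mixed-precision Adam training:
#   2 bytes fp16 weights + 2 bytes fp16 gradients
#   + 12 bytes fp32 master weights and two Adam moment buffers
BYTES_PER_PARAM = 2 + 2 + 12


def training_memory_gb(num_params: float) -> float:
    """Approximate memory needed for weights, gradients, and optimizer state."""
    return num_params * BYTES_PER_PARAM / 1e9


def fits_on_server(num_params: float, gpus: int = 8, vram_gb_per_gpu: int = 80) -> bool:
    """True if the estimate fits in the combined VRAM of one server."""
    return training_memory_gb(num_params) <= gpus * vram_gb_per_gpu


needed = training_memory_gb(100e9)
print(f"100B parameters need about {needed:,.0f} GB; one 8x80 GB server has {8 * 80} GB")
print("Fits on a single server:", fits_on_server(100e9))
```

Under these assumptions the model state alone is several times larger than what one eight-GPU server can hold, which is why the state has to be spread across machines.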

When vertical scaling (buying a bigger server) is no longer an option, you must scale horizontally. You need to take 3 or 4 separate GPU servers and link them together so they act as one unified supercomputer.

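A widely used framework for orchestrating this kind of distributed training is Ray. Originally developed at UC Berkeley's RISELab and now maintained by Anyscale, Ray handles scheduling, worker placement, and coordination between nodes, while libraries such as PyTorch DistributedDataParallel do the actual work of splitting training across GPUs. However, setting up a Ray cluster on raw hardware is notoriously difficult.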

In this definitive guide, we will show you exactly how to configure a multi-node GPU cluster using Ray on Ubuntu.

The Infrastructure Reality: Distributed training generates massive "east-west" network traffic as GPUs constantly synchronize their gradients. On a standard 10 Gbps cloud network, gradient synchronization can take longer than the computation itself, leaving your GPUs idle for most of each training step (GPU starvation). To get close to linear scaling, run this setup on an iDatam GPU Dedicated Server cluster connected via our unmetered 100 Gbps backend network.
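
The effect of link speed can be estimated with simple arithmetic. The sketch below assumes a ring all-reduce, in which each node sends and receives roughly 2 × (N − 1) / N times the gradient size per synchronization, and it ignores latency, protocol overhead, and any overlap with computation. The 7-billion-parameter model with fp16 gradients is a hypothetical example, not a figure from this guide.

```python
def ring_allreduce_seconds(grad_bytes: float, nodes: int, link_gbps: float) -> float:
    """Idealized time for one ring all-reduce of grad_bytes across `nodes` machines."""
    # Each node transfers about 2 * (N - 1) / N times the gradient size.
    per_node_traffic = 2 * (nodes - 1) / nodes * grad_bytes
    bytes_per_second = link_gbps * 1e9 / 8  # convert gigabits/s to bytes/s
    return per_node_traffic / bytes_per_second


# Hypothetical workload: 7B parameters, 2 bytes per fp16 gradient.
gradient_bytes = 7e9 * 2

for gbps in (10, 100):
    seconds = ring_allreduce_seconds(gradient_bytes, nodes=3, link_gbps=gbps)
    print(f"{gbps:>3} Gbps link: about {seconds:.1f} s per gradient synchronization")
```

In this idealized model the synchronization time scales inversely with bandwidth, so a link ten times faster cuts the wait by a factor of ten. Real training overlaps communication with the backward pass, so treat the numbers as an upper-bound illustration rather than a benchmark.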

The Cluster Architecture

For this tutorial, we will use three bare-metal Ubuntu 24.04 LTS servers, each equipped with NVIDIA GPUs.

  • Node 1 (10.0.0.11): The Ray "Head Node" (Orchestrates the cluster and runs the dashboard).

  • Node 2 (10.0.0.12): Ray "Worker Node".

  • Node 3 (10.0.0.13): Ray "Worker Node".

(We assume you have already installed the NVIDIA proprietary drivers and the CUDA toolkit on all nodes. If not, see our PyTorch setup guide first.)

Step 1: Network Configuration and SSH Keys

Execute this step on all three nodes.

Ray requires the nodes to communicate openly over the internal network. First, map the hostnames in the /etc/hosts file:

bash

sudo nano /etc/hosts
                                

Add the private IP addresses of your cluster:

plaintext

10.0.0.11 ray-head
10.0.0.12 ray-worker1
10.0.0.13 ray-worker2
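
A quick way to confirm the mapping took effect is to resolve each name and compare it with the expected address. The following is a minimal sketch; the function name and the injectable `resolve` parameter are illustrative choices, and the expected addresses are the ones used in this guide.

```python
import socket

# Hostnames and private IPs from this guide's /etc/hosts entries.
EXPECTED = {
    "ray-head": "10.0.0.11",
    "ray-worker1": "10.0.0.12",
    "ray-worker2": "10.0.0.13",
}


def check_hosts(expected, resolve=socket.gethostbyname):
    """Return {hostname: problem} for names that fail to resolve or resolve to the wrong IP."""
    problems = {}
    for name, ip in expected.items():
        try:
            actual = resolve(name)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            problems[name] = f"does not resolve ({exc})"
            continue
        if actual != ip:
            problems[name] = f"resolves to {actual}, expected {ip}"
    return problems
```

On each node, running `print(check_hosts(EXPECTED) or "all hostnames resolve correctly")` should report no problems before you continue.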
                                

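Generate SSH Keys on the Head Node: Passwordless SSH from the head node to the workers lets you administer the cluster from one place, and Ray's cluster launcher (ray up) requires it to start and stop services automatically. The manual ray start commands used in this guide do not depend on it. Log into Node 1 (ray-head) and generate an SSH key: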

bash

ssh-keygen -t rsa -b 4096
                                

(Press Enter to accept the defaults and do not set a passphrase).

Copy this key to the worker nodes:

bash

ssh-copy-id root@10.0.0.12
ssh-copy-id root@10.0.0.13
                                

Step 2: Install Ray and PyTorch

Execute this step on all three nodes.

To ensure all nodes have the exact same software environment, we will use a Python virtual environment. Install the prerequisites:

bash

sudo apt update
sudo apt install python3-pip python3-venv -y
                                

Create and activate a virtual environment:

bash

python3 -m venv ~/ray_env
source ~/ray_env/bin/activate
                                

Install Ray (with the dashboard, Ray Train, and Ray Tune extras) and PyTorch. On x86-64 Linux, the default PyTorch wheels from PyPI ship with CUDA support:

bash

pip install "ray[default,train,tune]" torch torchvision torchaudio
                                

Verify the Ray installation:

bash

ray --version
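
Because worker nodes are expected to run the same Python and Ray versions as the head node, it helps to print a small fingerprint on every machine and compare the results. This is a sketch using only the standard library; the function name is illustrative.

```python
import platform
from importlib import metadata


def environment_fingerprint(packages=("ray", "torch")):
    """Collect the Python version and package versions that should match across nodes."""
    info = {"python": platform.python_version()}
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = None  # not installed in this environment
    return info


print(environment_fingerprint())
```

Run it inside the activated virtual environment on all three nodes. If the dictionaries differ, reinstall the mismatched packages before starting the cluster.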
                                

Step 3: Initialize the Ray Head Node

Now we transform Node 1 into the orchestrator of the cluster. Log into Node 1 (ray-head), ensure your virtual environment is active, and run the following command to start the Ray Head process:

bash

ray start --head --port=6379 --dashboard-host=0.0.0.0
                                
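  • --head: Designates this node as the head (master) of the cluster.

  • --port=6379: The port of the head node's Global Control Service (GCS), which worker nodes connect to. 6379 is the default; older Ray releases ran Redis on it.

  • --dashboard-host=0.0.0.0: Binds the Ray Dashboard to all network interfaces so you can view it from your web browser. The dashboard is not authenticated by default, so restrict port 8265 to trusted IP addresses with a firewall.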

The terminal will output a success message containing a specific command that looks like this: ray start --address='10.0.0.11:6379'. Copy this exact command! You will need it for the worker nodes.
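
Before attaching workers, it can save time to confirm that the head node's port is reachable from each worker, since a firewall rule is a common cause of join failures. The helper below is a minimal sketch using a plain TCP connection; its name is illustrative.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False
```

From a worker node, `port_open("10.0.0.11", 6379)` should return True once the head process is running. Ray also opens additional ports for its node and object managers and for worker processes, so the simplest setup is to allow all traffic between the three private IP addresses.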

Step 4: Attach the Worker Nodes

Log into Node 2 (ray-worker1) and Node 3 (ray-worker2). Activate the virtual environment on both machines:

bash

source ~/ray_env/bin/activate
                                

Paste the command you copied from the Head Node:

bash

ray start --address='10.0.0.11:6379'
                                

If successful, the terminal will say Ray runtime started.

Step 5: Verify the Cluster and View the Dashboard

To prove the cluster is unified, go back to Node 1 (ray-head) and run a quick Python script to count the available GPUs. Open the Python shell:

bash

python
                                

Enter this code:

python

import ray
ray.init(address='auto')

# Print total cluster resources
print(ray.cluster_resources())
exit()
                                

The output will show the combined CPU cores and the total number of GPUs across all three machines.
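
For a per-node view rather than cluster-wide totals, the records returned by ray.nodes() can be summarized. This is a sketch: summarize_nodes and live_summary are illustrative names, and the code relies on the "Alive", "NodeManagerAddress", and "Resources" keys that ray.nodes() reports for each node.

```python
def summarize_nodes(nodes):
    """Return (alive_node_count, total_gpus, gpus_by_ip) from ray.nodes()-style records."""
    gpus_by_ip = {}
    for node in nodes:
        if not node.get("Alive"):
            continue  # skip nodes that have left the cluster
        gpus_by_ip[node["NodeManagerAddress"]] = node.get("Resources", {}).get("GPU", 0)
    return len(gpus_by_ip), sum(gpus_by_ip.values()), gpus_by_ip


def live_summary():
    """Connect to the running cluster and summarize it (call this on the head node)."""
    import ray  # imported here so summarize_nodes works without Ray installed

    ray.init(address="auto")
    return summarize_nodes(ray.nodes())
```

Calling `live_summary()` on the head node should report three alive nodes. A node that is missing or shows zero GPUs usually points to a worker that failed to join or a driver problem on that machine.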

The Ray Dashboard: Open your web browser and navigate to the public IP of your Head Node on port 8265: http://<HEAD_NODE_PUBLIC_IP>:8265. Here, you can visually monitor the CPU utilization, GPU memory, and network throughput of every node in your new supercomputer.

Step 6: Submitting a Distributed Training Job

With the cluster online, you can now submit distributed PyTorch jobs using Ray Train. Instead of running a script directly, you use the ray job submit command from the Head Node. Ray uploads the working directory so that every node runs the same code. Provided train_llm.py uses Ray Train (for example, its TorchTrainer), each worker processes its own shard of the data and PyTorch synchronizes the gradients during training.

bash

ray job submit --working-dir ./my_ai_project -- python train_llm.py
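
This guide does not list the contents of train_llm.py, so the following is only a minimal sketch of what such a script could look like with Ray Train's TorchTrainer. The toy linear model, the synthetic data, the hyperparameters, and the per-node GPU counts are all placeholder assumptions; swap in your own model and dataset. It requires the ray[train] extra.

```python
"""train_llm.py: minimal Ray Train + PyTorch DDP skeleton (toy model, synthetic data)."""


def gpu_workers(gpus_per_node):
    """One Ray Train worker per GPU: total workers across all nodes."""
    return sum(gpus_per_node.values())


def train_loop_per_worker(config):
    # Heavy imports live inside the function, which Ray runs on every worker.
    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset
    import ray.train
    import ray.train.torch

    # Placeholder model: prepare_model wraps it in DDP and moves it to this worker's GPU.
    model = ray.train.torch.prepare_model(nn.Linear(32, 1))

    # Placeholder data: prepare_data_loader adds a DistributedSampler so each worker gets a shard.
    data = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
    loader = ray.train.torch.prepare_data_loader(
        DataLoader(data, batch_size=config["batch_size"], shuffle=True)
    )

    optimizer = torch.optim.AdamW(model.parameters(), lr=config["lr"])
    loss_fn = nn.MSELoss()
    for epoch in range(config["epochs"]):
        if ray.train.get_context().get_world_size() > 1:
            loader.sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for features, targets in loader:
            loss = loss_fn(model(features), targets)
            optimizer.zero_grad()
            loss.backward()  # DDP all-reduces gradients across workers here
            optimizer.step()
        ray.train.report({"epoch": epoch, "loss": loss.item()})


def main(gpus_per_node):
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    trainer = TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"lr": 1e-3, "batch_size": 64, "epochs": 2},
        scaling_config=ScalingConfig(num_workers=gpu_workers(gpus_per_node), use_gpu=True),
    )
    return trainer.fit()


# When submitting as a job, end the file with a call such as:
# main({"ray-head": 8, "ray-worker1": 8, "ray-worker2": 8})  # GPU counts are assumptions
```

The design keeps one worker per GPU, which is the usual layout for data-parallel training. A model too large for a single GPU would additionally need a sharding strategy such as FSDP or DeepSpeed inside the training loop, which this sketch does not cover.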
                                

Conclusion: The 100Gbps Necessity

You have successfully built a distributed GPU cluster. As you begin training massive models, watch the network throughput figures on your Ray Dashboard carefully.

During the backward pass of a neural network, the worker nodes must exchange gigabytes of gradient data on every training step. If you built this cluster on standard 10 Gbps hardware, GPU utilization can drop sharply while the GPUs wait for the network to catch up.

To prevent GPU starvation and ensure your AI training scales linearly, deploy your Ray cluster on iDatam’s 100Gbps Dedicated Servers. By connecting your NVIDIA nodes with unmetered, non-blocking 100Gbps fabrics, your distributed supercomputer will perform exactly as it was designed to.
