GitHub - 0xmoei/gensyn-ai: Detailed Guide on How to Contribute to Gensyn RL-Swarm
Run RL Swarm (Testnet) Node

RL Swarm is a fully open-source framework developed by GensynAI for building reinforcement learning (RL) training swarms over the internet. This guide walks you through setting up an RL Swarm node and a web UI dashboard to monitor swarm activity.

Hardware Requirements

There are currently multiple swarms running on the Testnet, each training on a different dataset. The current list of available models and swarms includes:

  • Models: Qwen 2.5 0.5B, Qwen 2.5 1.5B, Qwen 2.5 7B, Qwen 2.5 32B (4 bit) & Qwen 2.5 72B (4 bit)
  • Swarms: Math (GSM8K dataset) & Math Hard (DAPO-Math 17K dataset)

Your hardware requirements will vary depending on which swarm and model you choose. Users with less powerful hardware should select a smaller model (e.g. Qwen 0.5B or 1.5B) and the smaller dataset (GSM8K), i.e. swarm A. Users with more powerful hardware can select a larger model (e.g. Qwen 7B, 32B or 72B) and the larger dataset (DAPO-Math 17K), i.e. swarm B. The requirements for each are listed below:

Small model (0.5B or 1.5B) + Math (GSM8K dataset)

  • CPU-only: arm64 or x86 CPU with a minimum of 16 GB RAM (note that training may crash if you run other applications while it is in progress).

OR

  • GPU:
    • RTX 3090
    • RTX 4090
    • A100
    • H100
    • A GPU with ≥24 GB VRAM is recommended, but Gensyn now supports GPUs with less than 24 GB VRAM too.
    • CUDA driver version ≥12.4

Big model (7B, 32B or 72B) + Math Hard (DAPO-Math 17K dataset)

  • GPU: A100 (80GB) or H100 (80GB)
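To sanity-check which swarm your GPU can handle, you can compare its VRAM against the requirements above. A minimal sketch (the `suggest_swarm` helper and its thresholds are mine, derived from the figures listed here, not part of the repo):

```shell
# Hypothetical helper: suggest a swarm given total GPU VRAM in MiB.
# ~80 GB is needed for the big-model swarm per the requirements above.
suggest_swarm() {
  local vram_mib="$1"
  if [ "$vram_mib" -ge 79000 ]; then
    echo "Math Hard (big model: 7B/32B/72B)"
  else
    echo "Math (small model: 0.5B or 1.5B)"
  fi
}

# On a machine with an NVIDIA GPU, feed it the real value:
# suggest_swarm "$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n1)"
```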

Method 1 - Windows Users (Home PC):

If you are a Windows user, you may need to install Ubuntu (via WSL) on your Windows machine.

  • Install Ubuntu on Windows: Guide
  • After you have installed Ubuntu on Windows, verify that the NVIDIA driver and CUDA Toolkit are ready:
# Install NVIDIA Toolkit
sudo apt-get update
sudo apt-get install -y nvidia-cuda-toolkit

# Verify NVIDIA Driver
nvidia-smi

# Verify CUDA Toolkit:
nvcc --version

Method 2 - Rent Cloud GPU:

You can use Rent and Config GPU Guide to get a fully detailed guide or follow the steps below.

1- Rent Vast.ai GPUs

  • 1- Register in Vast.ai
  • 2- Create an SSH key on your local system (if you don't already have one) with this Guide: step 1
  • 3- Create an SSH key in Vast.ai by going to three-lines > Keys > SSH Keys here
  • 4- Paste the SSH public key you created on your local PC in step 2.
  • 5- Select Pytorch(Vast) template here
  • 6- Choose a supported GPU (≥24 GB VRAM is recommended, but Gensyn now supports GPUs with as little as 8 GB VRAM)
  • 7- Increase Disk Space slidebar to 50GB
  • 8- Top-up credits with crypto and rent it.
  • 9- Go to instances, refresh the page, click on key button.
  • 10- If you don't see an SSH command in the Direct ssh connect: section, press Add SSH Key.
  • 11- Copy the SSH command and append -L 3000:localhost:3000 to it.
  • 12- Enter the command in Windows Powershell and run it.
  • 13- It prompts you for your ssh public key password (if you set before), then your GPU terminal appears.
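The port-forward step above just means appending one flag to the SSH command Vast.ai gives you. A sketch with placeholder host and port (copy the real values from your own instance page):

```shell
# Placeholder command: Vast.ai shows you the real host and port.
BASE_CMD="ssh -p 41234 root@ssh5.vast.ai"
# Append the flag so http://localhost:3000 on your PC reaches the instance:
FULL_CMD="$BASE_CMD -L 3000:localhost:3000"
echo "$FULL_CMD"
```

Run the printed command in your local PowerShell, not on the instance itself.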

2- Rent Hyperbolic GPUs

  • 1- Register in the Hyperbolic Dashboard
  • 2- Then, visit Settings
  • 3- Create a new public SSH key entry, paste your pubkey into it, and save it!
  • 4- Choose a GPU (e.g. RTX 4090) here by going to Home > GPU List and clicking Rent
  • 5- Make sure you select 1 as GPU Count.
  • 6- Select pytorch as Template.
  • 7- Rent it.
  • 8- Clicking on your GPU instance gives you an SSH command to connect to your GPU terminal.
  • 9- Append the flag -L 3000:localhost:3000 to Hyperbolic's SSH command; this lets you reach port 3000 of the instance from your local system.
  • 10- Paste and run the command in Windows PowerShell to access your server.
  • 11- It prompts you for your ssh public key password (if you set before), then your GPU terminal appears.

1) Install Dependencies

1. Update System Packages

sudo apt-get update && sudo apt-get upgrade -y

2. Install General Utilities and Tools

sudo apt install screen curl iptables build-essential git wget lz4 jq make gcc nano automake autoconf tmux htop nvme-cli libgbm1 pkg-config libssl-dev libleveldb-dev tar clang bsdmainutils ncdu unzip -y

3. Install Python

sudo apt-get install python3 python3-pip python3-venv python3-dev -y

4. Install Node

sudo apt-get update
curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -
sudo apt-get install -y nodejs
node -v
sudo npm install -g yarn
yarn -v

5. Install Yarn

curl -o- -L https://yarnpkg.com/install.sh | bash
export PATH="$HOME/.yarn/bin:$HOME/.config/yarn/global/node_modules/.bin:$PATH"
source ~/.bashrc
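After steps 1-5, you can confirm that everything landed on your PATH. A small sketch (the `check_tools` helper is mine, not part of the repo):

```shell
# Report any of the required tools that are not on PATH.
check_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

check_tools python3 node yarn git || echo "install the missing tools before continuing"
```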

2) Get HuggingFace Access token

1- Create account in HuggingFace

2- Create an Access Token with Write permissions here and save it


3) Clone the Repository

git clone https://github.com/gensyn-ai/rl-swarm/

4) Run the swarm

Open a screen session to run it in the background

screen -S swarm

Get into the rl-swarm directory

cd rl-swarm

Install swarm

python3 -m venv .venv
source .venv/bin/activate
./run_rl_swarm.sh
  • Would you like to connect to the Testnet? [Y/n] >>> Press Y to join the testnet
  • Which swarm would you like to join (Math (A) or Math Hard (B))? [A/b] >>> There are two types of swarms:
    • A: Math (GSM8K dataset) -- lower-end systems (>8GB) -- use a small model (0.5B or 1.5B) for it.
    • B: Math Hard (DAPO-Math 17K dataset) -- higher-end systems -- use a big model (7B, 32B or 72B) for it.
  • How many parameters (in billions)? [0.5, 1.5, 7, 32, 72] >>> 0.5 is the smallest model and 72 is a very big one. Choose based on your system.
  • Check the Hardware Requirements section above for more guidance.

5) Login

1- Wait until you see Waiting for userData.json to be created... in the logs

image

2- Open login page in browser

  • Local PC: http://localhost:3000/
  • GPU Cloud Users: Before connecting to your GPU, add the flag -L 3000:localhost:3000 to your GPU's SSH command, so you can open http://localhost:3000/ in your local PC's browser.
  • VPS users: You have to forward port by entering a command in the Windows Powershell of your local PC:
    • In windows start menu, Search Windows Powershell and open its terminal in your local PC.
    • Enter the command below, replacing Server_IP with your VPS IP and SSH_PORT with your VPS port (e.g. 22)
    ssh -L 3000:localhost:3000 root@Server_IP -p SSH_PORT
    
    • ⚠️ Make sure you enter the command in your own local PC's Windows Powershell and NOT your VPS terminal.
    • This prompts you for your VPS password; once you enter it, you are connected and tunneled to your VPS
    • Now go to browser and open http://localhost:3000/ and login

3- Login with your preferred method

image

  • After login, your terminal starts installation.

4- Optional: Push models to huggingface

  • Enter your HuggingFace access token you've created when it prompted.
  • This needs about 2 GB of upload bandwidth for each model you train; you can skip it by entering N.

image


Node Name

  • Your node is now running. Find your node name after the word Hello; mine is whistling hulking armadillo, as in the image below (you can use CTRL+SHIFT+F to search for Hello in the terminal)

image


Screen commands

  • Minimize: CTRL + A + D
  • Return: screen -r swarm
  • Stop and Kill: screen -XS swarm quit

Backup

You need to backup swarm.pem.
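Whatever setup you run, a timestamped copy of the key is cheap insurance. A minimal sketch (`backup_pem` is my helper name; pass the path where your clone actually lives):

```shell
# Copy swarm.pem next to itself with a timestamp suffix.
backup_pem() {
  local src="$1"
  local dst="${src}.$(date +%Y%m%d-%H%M%S).bak"
  cp "$src" "$dst" && echo "$dst"
}

# Usage, assuming the default clone location under root:
# backup_pem /root/rl-swarm/swarm.pem
```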

VPS:

Connect to your VPS using the Mobaxterm client so you can move files to your local system. Back up this file:

  • /root/rl-swarm/swarm.pem

WSL:

Search \\wsl.localhost in your Windows Explorer to see your Ubuntu directory. Your main directories are as follows:

  • If installed via a username: \\wsl.localhost\Ubuntu\home\<your_username>
  • If installed via root: \\wsl.localhost\Ubuntu\root
  • Look for rl-swarm/swarm.pem

GPU servers (e.g., Hyperbolic):

1- Connect to your GPU server by entering this command in Windows PowerShell terminal

sftp -P PORT ubuntu@xxxx.hyperbolic.xyz
  • Replace ubuntu@xxxx.hyperbolic.xyz with your given GPU hostname
  • Replace PORT with your server's SSH port (shown in your server's SSH connection command)
  • ubuntu is the user of my Hyperbolic GPU; yours may differ (it is typically root on a VPS)

Once connected, you’ll see the SFTP prompt:

sftp>

2- Navigate to the Directory Containing the Files

  • After connecting, you’ll start in your home directory on the server. Use the cd command to move to the directory of your files:
cd /home/ubuntu/rl-swarm

3- Download Files

  • Use the get command to download the files to your local system. They’ll save to your current local directory unless you specify otherwise:
get swarm.pem
  • The downloaded file lands in the directory where you started the SFTP session (your Powershell or WSL working directory).
    • If you entered the sftp command in Powershell, the swarm.pem file is likely in C:\Users\<pc-username>.
  • You can now type exit to close the connection.
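If you prefer a one-shot copy over the interactive sftp session, scp does the same job. A sketch using the same placeholders as above (substitute your real host and port):

```shell
HOST="ubuntu@xxxx.hyperbolic.xyz"   # placeholder: your instance's user@host
PORT=31234                          # placeholder: your instance's SSH port
SCP_CMD="scp -P $PORT $HOST:/home/ubuntu/rl-swarm/swarm.pem ."
echo "$SCP_CMD"
```

Run the printed command from your local terminal; the file is saved into your current directory.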

Recovering Backup file (upload)

If you need to upload files from your local machine to the server:

  • WSL & VPS: use the drag & drop option.

GPU servers (e.g., Hyperbolic):

1- Connect to your GPU server using sFTP

2- Upload Files Using the put Command:

In SFTP, the put command uploads files from your local machine to the server.

put swarm.pem /home/ubuntu/rl-swarm/swarm.pem

Node Health

Official Dashboards

image

Telegram Bot

Search your Node ID with the /check command here: https://t.me/gensyntrackbot

  • Node-ID is near your Node name

image

  • ⚠️ If you receive EVM Wallet: 0x0000000000000000000000000000000000000000, your onchain participation is not being tracked; you have to reinstall with a new email and delete the old swarm.pem

image


Update Node

1- Stop Node

# list screens
screen -ls

# kill swarm screens (replace screen-id)
screen -XS screen-id quit

# You can kill by name
screen -XS swarm quit

2- Update Node Repository

Method 1 (test this first): If you cloned official repo with no local changes:

cd rl-swarm
git pull

Method 2: If you cloned official repo with local Changes:

cd rl-swarm

# Reset local changes:
git reset --hard
# Pull updates:
git pull

# Alternatively:
git fetch
git reset --hard origin/main
  • You will have to re-apply your local changes.
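If you're unsure what git reset --hard will do, you can watch it in a throwaway repo first; this sketch never touches your rl-swarm checkout:

```shell
# Demonstrate that `git reset --hard` discards uncommitted local edits,
# using a temporary repo (safe to run anywhere).
demo_reset() {
  local d; d=$(mktemp -d)
  git -C "$d" init -q
  echo "original" > "$d/file.txt"
  git -C "$d" add file.txt
  git -C "$d" -c user.email=demo@example.com -c user.name=demo commit -qm init
  echo "local edit" > "$d/file.txt"   # an uncommitted change...
  git -C "$d" reset --hard -q         # ...discarded here
  cat "$d/file.txt"                   # prints "original"
}
demo_reset
```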

Method 3: Cloned unofficial repo or Try from scratch (Recommended):

cd rl-swarm

# backup .pem
cp ./swarm.pem ~/swarm.pem

cd ..

# delete rl-swarm dir
rm -rf rl-swarm

# clone new repo
git clone https://github.com/gensyn-ai/rl-swarm

cd rl-swarm

# Recover .pem
cp ~/swarm.pem ./swarm.pem
  • If you had any local changes, you will have to re-apply them.

3- Re-run Node

Head back to 4) Run the swarm and re-run Node.


Troubleshooting:

CPU Configuration

Fix 1:

cd rl-swarm

nano hivemind_exp/configs/mac/grpo-qwen-2.5-0.5b-deepseek-r1.yaml
  • Change bf16 value to false
  • Reduce max_steps to 5
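The same two edits can be applied non-interactively with sed. A sketch (`apply_cpu_fix` is my wrapper; it assumes the file uses plain `bf16:` and `max_steps:` keys as described in the fix above, so double-check against your copy):

```shell
# Set bf16 to false and max_steps to 5 in the given yaml file.
apply_cpu_fix() {
  sed -i 's/^\([[:space:]]*bf16:[[:space:]]*\).*/\1false/' "$1"
  sed -i 's/^\([[:space:]]*max_steps:[[:space:]]*\).*/\15/' "$1"
}

# apply_cpu_fix hivemind_exp/configs/mac/grpo-qwen-2.5-0.5b-deepseek-r1.yaml
```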

Fix 2: Use this as a run command instead:

python3 -m venv .venv
source .venv/bin/activate
export PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 && ./run_rl_swarm.sh

⚠️ Stuck at loading localhost page

cd rl-swarm

sed -i '/^  return (/i\  useEffect(() => {\n    if (!user && !signerStatus.isInitializing) {\n      openAuthModal();\n    }\n  }, [user, signerStatus.isInitializing]);\n\n' modal-login/app/page.tsx

⚠️ Error: PS1 unbound variable

sed -i '1i # ~/.bashrc: executed by bash(1) for non-login shells.\n\n# If not running interactively, don'\''t do anything\ncase $- in\n    *i*) ;;\n    *) return;;\nesac\n' ~/.bashrc

⚠️ Daemon failed to start in 15.0 seconds

  • Enter rl-swarm directory:
cd rl-swarm
  • Activate python venv:
python3 -m venv .venv
source .venv/bin/activate
  • Open Daemon config file:
nano $(python3 -c "import hivemind.p2p.p2p_daemon as m; print(m.__file__)")
  • Search for the line startup_timeout: float = 15, then change 15 to 120 to increase the daemon's timeout. The line should look like this: startup_timeout: float = 120
  • To save the file: Press Ctrl + X, Y & Enter
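The same edit can be done without opening nano. A sketch (`patch_daemon_timeout` is my wrapper around the exact search-and-replace described above):

```shell
# Raise hivemind's daemon startup timeout from 15 to 120 seconds.
patch_daemon_timeout() {
  sed -i 's/startup_timeout: float = 15/startup_timeout: float = 120/' "$1"
}

# With the venv active, locate and patch the file in one go:
# patch_daemon_timeout "$(python3 -c 'import hivemind.p2p.p2p_daemon as m; print(m.__file__)')"
```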
