Most of us don’t have access to a compute cluster, nor even a dedicated server to run projects on, but a lot of us have access to Linux-based lab computers provided by our school.

So as part of a class on parallel computing, shouldn’t everybody be able to test their code across the school computers? Of course! However, I haven’t seen good Hydra-specific documentation from any of these classes, so here is a complete ‘zero-to-hero’ post on running OpenMPI programs across our lab machines.

Requirements

Access to the machines

First off you’ll need remote SSH access to the lab machines in question. At UTK each Hydra machine has its own public IPv4 address with SSH access enabled, so no problems there.

As part of getting access, you should set up SSH key-based auth. I recommend you only keep your key on your personal machine, since I personally would not trust keeping a private key on a shared system.

To ensure that you can still use this key to connect between the lab systems you’ll want to enable Agent Forwarding, which allows you to use your local key on remote servers without storing your private key there.

Here’s a recommended entry for the ~/.ssh/config file on your local machine:

Host hydra*
  HostName %h.eecs.utk.edu
  User YOUR_NETID
  ForwardAgent yes
  ServerAliveInterval 120
  TCPKeepAlive yes
  IdentityFile ~/path/to/your/private_key

Now you should be able to ssh hydra0 without typing in your netID password. Then while logged into a hydra machine, you should be able to ssh into any other hydra machine once again without needing your password.

If it still requires your password to ssh between two hydra nodes, then make sure you’ve correctly enabled SSH agent forwarding from your local machine.
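When ssh between two Hydra nodes still asks for a password, a quick check on the remote side can tell whether the forwarded agent actually arrived. The following is a minimal sketch for a POSIX shell; check_agent is a hypothetical helper name, not something OpenSSH provides.

```shell
# check_agent is a hypothetical helper name, not part of OpenSSH.
check_agent() {
  # A forwarded agent shows up as a socket path in SSH_AUTH_SOCK.
  if [ -z "${SSH_AUTH_SOCK:-}" ]; then
    echo "no agent socket: SSH_AUTH_SOCK is unset"
    return 1
  fi
  if ! command -v ssh-add >/dev/null 2>&1; then
    echo "ssh-add not found on PATH"
    return 1
  fi
  # ssh-add -l lists the identities the agent is holding.
  if ssh-add -l >/dev/null 2>&1; then
    echo "agent reachable with at least one key loaded"
  else
    echo "agent socket is set, but ssh-add -l listed no keys"
    return 1
  fi
}

check_agent || true
```

If the first message appears on a Hydra machine, the ForwardAgent setting is probably not being applied to the connection you opened from your local machine.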

Installing OpenMPI

We obviously can’t install programs system-wide, so instead we’ll just stick to the standard ~/.local prefix.

Newer releases are listed on Open MPI’s download page; substitute the URL and version number below as needed.

mkdir -p ~/tmp
cd ~/tmp
wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.5.tar.bz2
tar xvf openmpi-4.0.5.tar.bz2
cd openmpi-4.0.5
./configure --prefix=$HOME/.local --enable-picky --disable-debug --with-platform=optimized --enable-visibility --enable-contrib-no-build=vt --enable-mpirun-prefix-by-default --with-cma --without-memkind
make -j$(nproc)
make install
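Installing into ~/.local only helps if your shell looks there. The sketch below assumes ~/.local/bin is not already on your PATH and that your shell reads a startup file such as ~/.bashrc (an assumption; use whichever file your shell actually reads).

```shell
# Put the user-local install ahead of any system-wide MPI.
# Add these two export lines to your shell startup file to make them permanent.
export PATH="$HOME/.local/bin:$PATH"
export LD_LIBRARY_PATH="$HOME/.local/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Confirm which mpirun (if any) the shell will pick up.
if command -v mpirun >/dev/null 2>&1; then
  mpirun --version
else
  echo "mpirun not found on PATH"
fi
```

The version printed should match the release you just built; if it does not, another MPI installation is probably earlier on your PATH.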

Build a hostfile

Now we need to choose a set of hosts to run across. For Hydra it’s easy since they’re all numerically identified.

With zsh I can generate a hostfile with a single command:

z=(hydra{1..30}); print ${(j.\n.)z} > hosts
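If you are not using zsh, a plain POSIX sh loop should produce the same file. This is a sketch that assumes the same hydra1 through hydra30 naming as the command above.

```shell
# Write hydra1 .. hydra30, one hostname per line, into ./hosts.
: > hosts            # create or truncate the file
i=1
while [ "$i" -le 30 ]; do
  echo "hydra$i" >> hosts
  i=$((i + 1))
done
```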

Now if you need to narrow down the list (e.g. to use a square number of processes), just delete some lines from that hosts file.

For programs requiring a square number of hosts (like this CS462 homework) I’d recommend selecting 4, 9, or 16 machines. Keep in mind which systems other people are using, since you’re sharing CPU time with everyone.
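Trimming the hostfile by hand works, but the arithmetic is easy to script. The sketch below keeps the first n*n lines of a hostfile, where n*n is the largest perfect square that fits. square_hostfile is a hypothetical helper name, and it takes hosts in file order, so put the quieter machines first.

```shell
# square_hostfile IN OUT (hypothetical helper): copy the first n*n lines of IN
# to OUT, where n*n is the largest perfect square <= the number of lines in IN.
square_hostfile() {
  in_file=$1
  out_file=$2
  total=$(grep -c '' "$in_file")   # number of lines in the hostfile
  if [ "$total" -lt 1 ]; then
    echo "empty hostfile: $in_file" >&2
    return 1
  fi
  n=1
  while [ $(( (n + 1) * (n + 1) )) -le "$total" ]; do
    n=$((n + 1))
  done
  head -n $((n * n)) "$in_file" > "$out_file"
  echo "kept $((n * n)) of $total hosts"
}
```

For example, square_hostfile hosts hosts_square writes the trimmed list to hosts_square, which can then be passed to mpirun with --hostfile hosts_square.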

In order to discover who’s using what machines (and to select lesser-used nodes) I wrote a script to summarize who’s logged in where.

Running your programs

For ease of copy-pasting by CS462 students, the example program is hw_tester -vt large3.dat; replace it with your own program and arguments.

If you’re lucky, just mpirun --hostfile hosts ./hw_tester -vt large3.dat will work. Hydra users, however, are not so lucky.

Hydra limitations

So instead I had to dig around a bit to find a few key pieces of info:

We need to keep Open MPI from selecting the wrong network interface: I ended up adding --mca btl_tcp_if_exclude virbr0,lo,virbr0-nic to exclude unwanted interfaces, and --mca btl_base_verbose 100 to increase the verbosity of the network setup so I could debug further.

Then I found that the default port range couldn’t be bound to on the Hydra machines, so we need to raise the starting port (--mca btl_tcp_port_min_v4 MIN_PORT_NUM) and also reduce the size of the port range it selects from (--mca btl_tcp_port_range_v4 PORT_RANGE_SIZE).

Keep in mind that we don’t want to overlap with the port range of anybody else who might also be following this tutorial, so I’d recommend choosing a starting port that is a multiple of 100 somewhere between 18000 and 28000, with a range of 100 ports (--mca btl_tcp_port_range_v4 100).
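One way to pick a starting port without coordinating with classmates is to derive it from your numeric user ID. This is only a sketch of that idea: pick_port_min is a hypothetical helper, and the mapping can still collide for two users whose IDs differ by a multiple of 101.

```shell
# pick_port_min [UID] (hypothetical helper): map a numeric user ID onto one of
# the 101 starting ports 18000, 18100, ..., 28000.
pick_port_min() {
  uid=${1:-$(id -u)}
  echo $((18000 + (uid % 101) * 100))
}

port_min=$(pick_port_min)
echo "--mca btl_tcp_port_min_v4 $port_min --mca btl_tcp_port_range_v4 100"
```

The echoed flags can be pasted straight into an mpirun command line.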

So in the end I was able to successfully run my program with:

mpirun --mca btl_tcp_if_exclude virbr0,lo,virbr0-nic --mca btl_tcp_port_min_v4 21200 --mca btl_tcp_port_range_v4 100 --hostfile hosts -npernode 1 ./hw_tester -vt large3.dat
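Since that command is long, it may be convenient to wrap it in a small shell function. The sketch below is one way to do so; hydra_mpirun, PORT_MIN, HOSTFILE and DRY_RUN are hypothetical names chosen for illustration, and the defaults simply mirror the command above.

```shell
# hydra_mpirun PROGRAM [ARGS...] (hypothetical wrapper). PORT_MIN, HOSTFILE and
# DRY_RUN are illustrative variable names, not Open MPI settings.
hydra_mpirun() {
  # Rebuild the argument list with the Hydra-specific flags in front.
  set -- mpirun \
    --mca btl_tcp_if_exclude virbr0,lo,virbr0-nic \
    --mca btl_tcp_port_min_v4 "${PORT_MIN:-21200}" \
    --mca btl_tcp_port_range_v4 100 \
    --hostfile "${HOSTFILE:-hosts}" \
    -npernode 1 "$@"
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"    # print the command instead of running it
  else
    "$@"
  fi
}

# Print the full command line without launching anything.
DRY_RUN=1 hydra_mpirun ./hw_tester -vt large3.dat
```

Leave DRY_RUN unset to actually launch the job, and set PORT_MIN to your own starting port.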
