Leveraging full nodes using Taskfarmer
Thomas Alexander <t.c.alexander@bath.ac.uk> and Gaël Donval <g.donval@bath.ac.uk>
When you submit a job to Balena, the submission script contains instructions requesting a set of computing resources for your calculation. These include the maximum execution time and the number of machines (known as nodes) required. In general, HPC (high performance computing) clusters such as Balena are tuned for submissions that take advantage of many processors. Such submissions are referred to as parallel jobs and, in our case, nodes are not shared between users. In other words, the minimum allocation on Balena is a full node comprising 16 separate CPU cores. This is problematic for programs such as MuSiC and Raspa, which can’t natively make use of many CPU cores at once. In this section we will see how to work around that limitation with a program called Taskfarmer.
Serial vs. parallel programs
In the realm of HPC, there are two main categories of software: serial programs (designed to run on a single CPU core) and parallel programs (designed to run on multiple CPU cores, possibly on multiple nodes, at once). Running a program in parallel rather than in serial generally reduces execution time by splitting a fixed number of calculations across different CPU cores: without this, some calculations would take weeks or months on a single CPU core.
There is a catch though: which category any particular program belongs to is not for you to choose. Parallel programs can be run in parallel because of the way they were programmed; some problems cannot be parallelised at all, no matter how they are solved! MuSiC and Raspa, for instance, are serial programs: no matter how many nodes or CPU cores you allocate, no matter whether or not you prepend them with mpiexec (more on that later), no matter how many such instructions you put in your submission script, they are going to use one and only one CPU core out of all the CPU cores you allocated, wasting all the others.
Running serial programs on the cluster
The second catch is that, by design, you cannot allocate fewer than 16 CPU cores at once on Balena. There is absolutely no way around this. There is also no safeguard: nothing prevents you from submitting a plain serial job on Balena, but doing so wastes 15 CPU cores. Such behaviour will however automatically be penalised later on by the scheduler according to fair-use rules: at best, you’ll need to wait for days to get your jobs running and you won’t get much done.
The way around this is to resort to what is often called task farming: a list of serial tasks is given to a special program, which then assigns one serial task to each of the CPU cores on a node. Taskfarmer is such a program: if you provide it with at least 16 serial tasks (e.g. 16 different MuSiC calculations), it will execute 16 of those tasks at the same time, keeping all the available CPU cores busy. This requires a little bit of planning but it is worth it.
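To make the idea concrete, here is a toy illustration of task farming that runs on any machine using xargs. It is not how Taskfarmer works internally and it is not meant for use on Balena; the task list, the temporary directory and the slot count of 4 are all made up for the example.

```shell
#!/usr/bin/env bash
# Illustration only: mimics task farming locally with xargs.
# This is NOT Taskfarmer; names and numbers below are hypothetical.
workdir="$(mktemp -d)"
tasks="$workdir/tasks.txt"

# Build a toy task list: each line is one independent serial command.
for i in 1 2 3 4 5 6 7 8; do
    echo "echo task $i done > $workdir/out_$i.txt" >> "$tasks"
done

# Run up to 4 lines at a time; a new task starts as soon as a slot frees up.
xargs -P 4 -I {} sh -c '{}' < "$tasks"

# List the results: one output file per task.
ls "$workdir"
```

With 8 tasks and 4 slots, no slot sits idle while tasks remain, which is the behaviour you want from the 16 cores of a node.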
Making the program available
As usual on Balena, you must make programs available by loading them.
If it is not already done, load the ce-molsim repository:
$ module purge
$ module load group ce-molsim stack
In any case, load Taskfarmer:
$ module load taskfarmer
Now Taskfarmer should be available (watch out for error messages). You can confirm this by typing the following:
$ which taskfarmer
/apps/group/ce-molsim/taskfarmer/bin/taskfarmer
If something went wrong, the previous command fails loudly instead, listing the directories in which it looked for taskfarmer.
Hint
Every time you log on to Balena, and for every job you want to submit, you need to go through the module loading process all over again. There is a way to make these changes permanent though: you can edit your ~/.bashrc file (which contains your command prompt settings) and simply add the module purging/loading commands at the end of it:
$ nano ~/.bashrc
The next time you log on to Balena, all the specified modules should be readily available once and for all.
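The lines appended to ~/.bashrc could look like the sketch below. The module commands are the ones used in this section; the surrounding guard is an added precaution (an assumption, not a Balena requirement) so that shells where the module command is unavailable do not print errors.

```shell
# Hypothetical addition to the end of ~/.bashrc
# Only run the module commands if "module" exists in this shell.
if command -v module >/dev/null 2>&1; then
    module purge
    module load group ce-molsim stack
    module load taskfarmer
fi
```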
Usage
As with most command line programs, you can get the usage (or synopsis) of the command by passing it the --help flag:
$ taskfarmer --help
TaskFarmer - a simple task farmer for running serial tasks with mpiexec.
Usage: mpiexec -n CORES taskfarmer [-h] -f FILE [-v] [-w] [-r] [-s SLEEP_TIME] [-m MAX_RETRIES]
Available options:
-h/--help : Print this help information
-f/--file <string> : Location of task file (required)
-v/--verbose : Print status updates to stdout
-w/--wait-on-idle : Wait for more tasks when idle
-r/--retry : Retry failed tasks
-s/--sleep-time <int> : Sleep duration when idle (seconds)
-m/--max-retries <int> : Maximum number of retries for failed tasks
In our case, this means that when we write our submission script for the scheduler, Taskfarmer should be invoked this way:
$ mpiexec -n 16 taskfarmer -f <task file>
Again, square brackets in the usage line denote optional arguments. Here mpiexec is another program whose purpose is to provide the number of CPU cores Taskfarmer is allowed to use. The <task file> is a simple text file containing a list of serial calculations that Taskfarmer is supposed to run concurrently on all 16 CPU cores.
Writing a task file
Your task file should look something like this:
bash ~/scratch/Simulations/01/run
bash ~/scratch/Simulations/02/run
bash ~/scratch/Simulations/03/run
bash ~/scratch/Simulations/04/run
bash ~/scratch/Simulations/05/run
bash ~/scratch/Simulations/06/run
bash ~/scratch/Simulations/07/run
bash ~/scratch/Simulations/08/run
bash ~/scratch/Simulations/09/run
bash ~/scratch/Simulations/10/run
bash ~/scratch/Simulations/11/run
bash ~/scratch/Simulations/12/run
bash ~/scratch/Simulations/13/run
bash ~/scratch/Simulations/14/run
bash ~/scratch/Simulations/15/run
bash ~/scratch/Simulations/16/run
Here we are supposing that you have fully set up your different MuSiC or Raspa simulations in each ~/scratch/Simulations/<XX> directory and that any specific simulation can be run by calling:
$ bash ~/scratch/Simulations/<XX>/run
Note
We merely give that pattern as an example: you don’t have to conform to it. You just need to understand what it represents and translate it to your own layout.
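Typing 16 nearly identical lines by hand is error-prone, so the task file can also be generated. The sketch below assumes the example layout shown above (directories 01 to 16, each with a run script); the sim_root and task_file variables and the tasks.txt name are hypothetical and should be adapted to your own pattern.

```shell
#!/usr/bin/env bash
# Hypothetical helper: writes one task line per simulation directory.
# Assumes the example layout <sim_root>/01 ... <sim_root>/16, each containing "run".
sim_root="${SIM_ROOT:-$HOME/scratch/Simulations}"
task_file="${TASK_FILE:-tasks.txt}"   # made-up file name

# Zero-pad the directory number to two digits (01, 02, ..., 16)
# and overwrite the task file with the 16 resulting lines.
for i in {1..16}; do
    printf 'bash %s/%02d/run\n' "$sim_root" "$i"
done > "$task_file"
```

The generated file uses full paths instead of ~, which avoids depending on how each task line is expanded when it is run.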
Submitting the job
You need to write a submission script for Balena’s scheduler (again, use nano to edit the file):
#!/usr/bin/env bash
#SBATCH --job-name=test
#SBATCH --partition=batch
#SBATCH --account=free
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --time=06:00:00
#SBATCH --export=ALL
module purge
module load group ce-molsim stack
module load taskfarmer
mpiexec -n 16 taskfarmer -f <task file>
Here, <task file> is the full path to your task file. Only one node is requested (there’s no point in requesting more) and the job will run for 6 hours at most.
To actually submit the job, type:
$ sbatch <submission script>
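Before submitting, it can be worth checking that the task file actually fills the node. The function below is a hypothetical pre-flight check, not part of Taskfarmer or the scheduler. It assumes one task per non-empty line, as in the example task file, and defaults to the 16 cores of a Balena node.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check; not part of Taskfarmer.
# Usage: check_task_file <task file> [cores]
check_task_file() {
    local task_file="$1" cores="${2:-16}"
    if [ ! -r "$task_file" ]; then
        echo "cannot read task file: $task_file" >&2
        return 1
    fi
    # Count non-empty lines (assumption: one task per non-empty line).
    local n
    n=$(grep -c -v '^[[:space:]]*$' "$task_file")
    echo "$n task(s) for $cores core(s)"
    if [ "$n" -lt "$cores" ]; then
        echo "warning: fewer tasks than cores, some cores will stay idle" >&2
    fi
}
```

For instance, calling check_task_file with the path to your task file before running sbatch reports the task count and warns when fewer than 16 tasks are listed.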