How do you submit a job?

Thomas Alexander <t.c.alexander@bath.ac.uk> and Gaël Donval <g.donval@bath.ac.uk>

You should currently be on one of Balena’s two front (or login) nodes, named balena-01 and balena-02. The name appears in your prompt, and you can also print it by typing:

$ hostname

These login nodes are only intended as entry points to the cluster: running simulations or other long-running/demanding programs on them is strictly forbidden. The correct way to leverage Balena’s power is by submitting Bash scripts to the job scheduler.
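As a small safeguard, a script can check where it is running before starting anything heavy. The following is only a sketch: the helper name is_login_node is hypothetical, the two node names come from the text above, and accepting a fully qualified name (balena-01.something) is an assumption.

```shell
#!/usr/bin/env bash
# Succeeds (exit status 0) if the given host name looks like a Balena login node.
# The trailing ".*" patterns also accept fully qualified names (an assumption).
is_login_node() {
    case "$1" in
        balena-01|balena-02|balena-01.*|balena-02.*) return 0 ;;
        *) return 1 ;;
    esac
}

if is_login_node "$(hostname)"; then
    echo "On a login node: submit work with sbatch instead of running it here." >&2
fi
```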

Writing a submission script

These scripts are simple text files containing information about resource allocation and program execution. A typical script looks like this:

#!/usr/bin/env bash
#SBATCH --job-name=test
#SBATCH --partition=batch
#SBATCH --account=free
#SBATCH --time=06:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16

module purge
module load group ce-molsim stack
module load taskfarmer

cd ~/scratch
cd Simulation

mpirun -np 16 taskfarmer -f task_list -v -r

The first line, starting with #!, tells the job scheduler what kind of script is being provided (here a Bash script). The lines starting with #SBATCH are resource requests, which the scheduler grants at its sole discretion: the job name is set to test, the scheduler can choose computing nodes from a partition called batch, we are asking for free nodes rather than paid-for priority nodes, the maximum execution time is 6 hours, and we want a single node with all 16 of its CPU cores.

The lines starting with module are used to make programs available on the computing nodes, as described in “How do you access specific programs?”. In this case we load taskfarmer, but it could be any program you need in your simulations, such as gnuplot, music, raspa or gromacs/2019.2.

The rest of the lines constitute the script itself: instructions executed in order, one after the other, given the requested resources. Such a script can be submitted to the scheduler using commands described later in this document.
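In the example script, the number 16 appears twice: once in the resource request and once in the mpirun line. One way to keep the two in sync is to derive the task count from the environment variables Slurm sets inside a job. This is a sketch rather than the recommended Balena practice: task_count is a hypothetical helper, and the fallback values (1 node, 16 tasks) simply mirror the example and only apply when the variables are not set.

```shell
#!/usr/bin/env bash
# Total number of MPI tasks = number of nodes x tasks per node.
# SLURM_JOB_NUM_NODES and SLURM_NTASKS_PER_NODE are set by Slurm inside a job
# when --nodes and --ntasks-per-node are requested; the defaults below are
# taken from the example script and are used outside of a job.
task_count() {
    local nodes="${SLURM_JOB_NUM_NODES:-1}"
    local per_node="${SLURM_NTASKS_PER_NODE:-16}"
    echo $(( nodes * per_node ))
}

# Print the command instead of running it, so the sketch is safe anywhere.
echo "Would run: mpirun -np $(task_count) taskfarmer -f task_list -v -r"
```

Inside a real submission script, the last line would become the mpirun call itself, with -np "$(task_count)".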

Submitting a job

Suppose you have a working submission script called test_job.sub in the directory ~/scratch/simulations/. You can submit it to the scheduler by invoking:

$ sbatch ~/scratch/simulations/test_job.sub

The scheduler should reply with a line such as Submitted batch job <job-id>, confirming that the submission was accepted and giving you the job number for future reference. You don’t really need to store that number anywhere as you can get it back at any time.
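If you do want to keep the job number, for example to use it later in the same shell session, it can be extracted from the confirmation line. The sketch below assumes the standard Slurm reply format ("Submitted batch job" followed by the number); extract_job_id is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Extract the numeric job ID from sbatch's confirmation line,
# e.g. "Submitted batch job 123456" gives 123456.
extract_job_id() {
    awk '/^Submitted batch job/ { print $4 }' <<< "$1"
}

# Only call sbatch where it exists (that is, on the cluster).
if command -v sbatch > /dev/null; then
    reply=$(sbatch ~/scratch/simulations/test_job.sub)
    job_id=$(extract_job_id "$reply")
    echo "Job ID: $job_id"
fi
```

Slurm’s sbatch also accepts a --parsable option, which prints the job ID without the surrounding text and may be simpler where it suits your workflow.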

Probing the state of your jobs

To get the status of your jobs, you need to type:

$ squeue -u <username>

The columns ST (state) and START_TIME are the most interesting ones. State R means that the job is currently running and PD means that it is pending, in which case you may want to look at the estimated start time to know when your calculation is scheduled to begin. For more details, have a look at squeue’s manual page:

$ man squeue

You can search for the job state codes, for instance, by typing /JOB STATE CODES and pressing Enter, then pressing n or Shift+n to jump to the next or previous match, respectively, until you reach the right section.
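For quick reference, the most common compact state codes can be translated in a small helper. This is a sketch covering only a handful of codes from the squeue manual page; describe_state is a hypothetical name, and the output format requested from squeue (%i job ID, %j job name, %t compact state, %S start time) is one choice among many.

```shell
#!/usr/bin/env bash
# Translate some common compact state codes from squeue's ST column.
describe_state() {
    case "$1" in
        PD) echo "pending" ;;
        R)  echo "running" ;;
        CG) echo "completing" ;;
        CD) echo "completed" ;;
        CA) echo "cancelled" ;;
        F)  echo "failed" ;;
        TO) echo "timed out" ;;
        *)  echo "unknown (see 'man squeue')" ;;
    esac
}

# Only query the scheduler where squeue exists (that is, on the cluster).
if command -v squeue > /dev/null; then
    # -h drops the header line; -o selects the columns to print.
    squeue -u "$USER" -h -o "%i %j %t %S" |
    while read -r id name state start; do
        echo "$id ($name): $(describe_state "$state"), start: $start"
    done
fi
```

To see only the estimated start times of your pending jobs, squeue -u <username> --start is a shorter alternative.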

Getting the status of your calculation

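Most programs write output at regular intervals so that you can follow the progress of the calculation. You can find it in the files named <job-name>.out and <job-name>.err: for the example script provided at the beginning of this section, look for test.out and test.err. If those files do not appear, bear in mind that a stock Slurm configuration writes both streams to slurm-<job-id>.out in the directory the job was submitted from, unless the script sets --output and --error.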
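The file names depend on how the job’s output is directed. As a sketch, the following directives request the names used above explicitly. It assumes a Slurm version recent enough to support the %x filename pattern (which expands to the job name); if yours does not, the names can be written out literally, for example --output=test.out.

```shell
#!/usr/bin/env bash
#SBATCH --job-name=test
# Standard output goes to <job-name>.out, here test.out.
#SBATCH --output=%x.out
# Standard error goes to <job-name>.err, here test.err.
#SBATCH --error=%x.err
```

These lines belong with the other #SBATCH requests at the top of the submission script, before the first command.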

There are two usual ways to follow the output of a calculation as it is written to a file. The first is to open the file in less and press Shift-f (use Ctrl-c to get back to normal mode and q to quit). The second is tail, which prints the last lines of a file (10 by default) and can be made to follow the file with the -f flag: it displays the file in its current state and then keeps showing new lines as they are added. This is helpful for following outputs during the course of a simulation.
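The two approaches can be wrapped in small shell functions, for instance in your ~/.bashrc. The function names below are hypothetical; the commands they wrap (tail -n, tail -f and less +F) are standard.

```shell
#!/usr/bin/env bash
# Show the last N lines of a file (10 by default) and return.
peek_output() {
    tail -n "${2:-10}" "$1"
}

# Same, but keep printing new lines as they are appended (stop with Ctrl-c).
follow_output() {
    tail -n "${2:-10}" -f "$1"
}

# Open the file in less, already in follow mode (like pressing Shift-f).
follow_in_less() {
    less +F "$1"
}
```

For the example job, follow_output test.out would show the last 10 lines of test.out and then keep up with new ones.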

Cancelling a job

If you want to cancel a running or pending calculation, you can get its ID by using squeue as described above and then cancel it by typing:

$ scancel <job-id>

Once the job is cancelled, you can move or delete any job-related files or directories you want. Never change files in a folder where a calculation is still running.
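Because a mistyped argument is easy to pass to scancel, a thin wrapper can check that it looks like a job number first. This is a sketch: cancel_job is a hypothetical helper, and it only accepts plain numeric IDs as printed by squeue.

```shell
#!/usr/bin/env bash
# Cancel one job, after checking that the argument is a plain numeric job ID.
cancel_job() {
    if [[ ! "$1" =~ ^[0-9]+$ ]]; then
        echo "Not a job ID: '$1'" >&2
        return 1
    fi
    scancel "$1"
}
```

To cancel every job you own in one go, scancel -u <username> is available, and adding -t PENDING restricts it to jobs that have not started yet.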