How do you submit a job?
Thomas Alexander <t.c.alexander@bath.ac.uk> and Gaël Donval <g.donval@bath.ac.uk>
You should currently be on one of the two front (or login) nodes on Balena, named either balena-01 or balena-02, as you can see on your prompt or by typing:
$ hostname
These login nodes are only intended as entry points to the cluster: running simulations or other long-running/demanding programs on them is strictly forbidden. The correct way to leverage Balena’s power is by submitting Bash scripts to the job scheduler.
Writing a submission script
These scripts are simple text files containing information about resource allocation and program execution. A typical script looks like this:
#!/usr/bin/env bash
#SBATCH --job-name=test
#SBATCH --partition=batch
#SBATCH --account=free
#SBATCH --time=06:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
module purge
module load group ce-molsim stack
module load taskfarmer
cd ~/scratch
cd Simulation
mpirun -np 16 taskfarmer -f task_list -v -r
The first line, starting with #!, tells the job scheduler what kind of script is being provided (here a Bash script). The lines starting with #SBATCH are resource requests, which the scheduler grants at its sole discretion: the job name is set to test; the scheduler can choose computing nodes out of a partition called batch; we are asking for free nodes rather than paid-for priority nodes; the maximum execution time is 6 hours; we only want one node; and we request all 16 CPU cores on it.
The lines starting with module are used to make programs available on the computing nodes, as described in "How do you access specific programs?". In this case we load taskfarmer, but it could be any program you need in your simulations, such as gnuplot, music, raspa or gromacs/2019.2.
The remaining lines constitute the script itself: instructions executed in order, one after the other, using the requested resources. Such a script is submitted to the scheduler with the commands described later in this document.
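The number of tasks requested through the #SBATCH lines (nodes multiplied by tasks per node) should match the value passed to mpirun -np, here 1 × 16 = 16. The following is a minimal sketch of that check. The helper name sbatch_option and the temporary file standing in for the submission script are illustrative assumptions, not part of Balena's tooling, and the sketch only handles directives written as --option=value.

```shell
# Stand-in for the header of the example submission script (temporary file).
script="$(mktemp)"
cat > "$script" <<'EOF'
#!/usr/bin/env bash
#SBATCH --job-name=test
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
EOF

# Print the value of one "#SBATCH --<option>=<value>" directive.
# $1: option name without the leading dashes, $2: script file.
sbatch_option() {
    sed -n "s/^#SBATCH --$1=//p" "$2"
}

nodes="$(sbatch_option nodes "$script")"
tasks_per_node="$(sbatch_option ntasks-per-node "$script")"

# Total number of tasks the job will be given.
total_tasks=$(( nodes * tasks_per_node ))
echo "total tasks: $total_tasks"   # compare with the value given to mpirun -np

rm -f "$script"
```

If the two numbers differ, either the resource request or the mpirun line should be adjusted before submitting.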
Submitting a job
Given a working submission script called test_job.sub, available in the directory ~/scratch/simulations/ for instance, you can submit it to the scheduler by invoking:
$ sbatch ~/scratch/simulations/test_job.sub
The scheduler should reply that the submission succeeded and give you the job number for future reference. You don’t really need to store that number anywhere, as you can retrieve it at any time.
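If you do want to keep the job number, for instance to reuse it in a later command, it can be extracted from the confirmation message, which sbatch prints in the form "Submitted batch job <id>". The sketch below assumes that message format. The function names are illustrative and the job id used in the demonstration is made up so that the snippet runs without a scheduler.

```shell
# Print the job id contained in sbatch's confirmation message.
job_id_from_message() {
    # The id is the last whitespace-separated field of the message.
    printf '%s\n' "$1" | awk '{ print $NF }'
}

# Submit a script and print only the id of the new job.
submit_job() {
    job_id_from_message "$(sbatch "$1")"
}

# Illustration with a made-up job id (no scheduler needed).
message="Submitted batch job 123456"
job_id="$(job_id_from_message "$message")"
echo "job id: $job_id"

# On a login node, one would instead write:
#   job_id="$(submit_job ~/scratch/simulations/test_job.sub)"
```

Alternatively, sbatch accepts a --parsable flag that prints only the job id (followed by the cluster name when there is one), which avoids parsing the message altogether.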
Probing the state of your jobs
To get the status of your jobs, you need to type:
$ squeue -u <username>
The columns ST (state) and START_TIME are the most interesting ones. State R means that the job is currently running and PD means pending, in which case you may want to look at the estimated start time to know when your calculation is scheduled to start. For more details, have a look at squeue’s manual page:
$ man squeue
You can search for the job state codes, for instance, by typing /JOB STATE CODES and pressing Enter, then pressing n or Shift+n to jump to the next or previous match, respectively, until you reach the right section.
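The compact codes shown in the ST column can also be translated automatically. The sketch below maps a few common codes listed in the squeue manual page to plain words and prints one line per job. The function names describe_state and list_my_jobs are illustrative, and only a subset of the codes is covered.

```shell
# Translate a compact squeue state code into a plain-English word.
describe_state() {
    case "$1" in
        PD) echo "pending" ;;
        R)  echo "running" ;;
        CG) echo "completing" ;;
        CD) echo "completed" ;;
        CA) echo "cancelled" ;;
        F)  echo "failed" ;;
        TO) echo "timed out" ;;
        *)  echo "other (see man squeue)" ;;
    esac
}

# List the current user's jobs as "<job id>: <state>".
list_my_jobs() {
    # -h drops the header; -o "%i %t" prints the job id and the compact state.
    squeue -u "$(id -un)" -h -o "%i %t" |
        while read -r id state; do
            echo "$id: $(describe_state "$state")"
        done
}

if command -v squeue > /dev/null 2>&1; then
    list_my_jobs
else
    echo "squeue not found: run this on a login node"
fi
```

For pending jobs, squeue -u <username> --start reports the scheduler's current estimate of the start time.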
Getting the status of your calculation
Most programs output data every so often so that you can follow the status of the calculation. You can access that output by looking at the files named <job-name>.out and <job-name>.err. Given the example script provided at the beginning of this section, you should look for test.out and test.err.
The two usual ways to follow the output of a calculation in a file are either to open it with less and press Shift-f (use Ctrl-c to get back to normal mode and q to quit), or to use tail, which prints the last lines of a file (10 by default) and can be made to follow the file with the -f flag. That way, you display the file in its current state and then keep showing new lines as they are added. This is helpful for following output during the course of a simulation.
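The behaviour of tail can be tried out safely on a throwaway file. The sketch below uses a temporary file as a stand-in for test.out, with made-up "step" lines, and relies on the timeout command from GNU coreutils (an assumption about the system) to stop following after two seconds instead of pressing Ctrl-c.

```shell
# Temporary stand-in for test.out, filled with 25 numbered lines.
log="$(mktemp)"
seq 1 25 | sed 's/^/step /' > "$log"

# Show only the last 3 lines instead of the default 10.
last_lines="$(tail -n 3 "$log")"
echo "$last_lines"

# Simulate a running program appending a new line after one second.
( sleep 1; echo "step 26" >> "$log" ) &

# Follow the file, printing only lines added from now on (-n 0),
# and stop automatically after 2 seconds.
followed="$(timeout 2 tail -n 0 -f "$log" || true)"
wait
echo "newly added: $followed"

rm -f "$log"
```

On a real job you would simply run tail -f test.out and interrupt it with Ctrl-c when done.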
Cancelling a job
If you want to cancel a running or pending calculation, you can get its id by using squeue as described above and then cancel it by typing:
$ scancel <job-id>
Once the job is cancelled, you can move or delete any job-related file or directory you want. Never change files in a folder where a calculation is still running.
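The two steps (looking up the id with squeue, then calling scancel) can be combined. The sketch below cancels every job of the current user that carries a given job name. The function name cancel_jobs_named is illustrative, and the call itself is left commented out so that nothing is cancelled by accident.

```shell
# Cancel every job of the current user that carries the given job name.
cancel_jobs_named() {
    # -h drops the header, -n filters by job name, -o "%i" prints the id only.
    for id in $(squeue -u "$(id -un)" -h -n "$1" -o "%i"); do
        echo "cancelling job $id"
        scancel "$id"
    done
}

# Usage on a login node, for the example job named "test":
#   cancel_jobs_named test
```

Check the list printed by squeue first if several of your jobs share the same name, since all of them would be cancelled.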