Awk
Overview
Teaching: 15 min
Exercises: 0 min
Questions
What is awk?
How can I use awk to modify files?
How can I use awk to prototype simple analyses?
Objectives
Understand that awk is a standalone programming language.
Understand that it is a data driven scripting tool.
Be able to use awk to modify files and perform arithmetic.
Be able to use awk to prototype simple analyses.
Know that awk has a range of special variables to help you develop scripts.
While awk is powerful, scripts can quickly become unwieldy and alternative approaches may provide clearer solutions.
Wikipedia tells us that awk is a programming language in its own right.
It is a data driven language, which means that like sed it operates on each line of the input file in turn.
This means that the main script is executed on every line in the input file.
The benefit of awk is that you do not have to worry about reading in files. However (do not accept the word of dinosaurs!), scripts can quickly become complex and obscure, and you are encouraged to use alternative approaches for more complex functionality.
As with other unix/linux programs it has a detailed manual available online. Other useful resources for awk include: 1 and 2, but a huge number are available, including several books.
Awk as grep
The easiest way to learn how awk works is to run a few example cases.
We will start by showing how awk can be used to search for specific strings.
Make sure you are in the ‘molecules’ directory using pwd, and change to it if necessary with cd ~/data-shell/molecules.
Compare the output of:
$ grep ATOM methane.pdb
and
$ awk '/ATOM/' methane.pdb
Used in this way awk processes each line of the file in turn and prints out every line that matches in full, exactly as grep does.
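The comparison can be checked on a small file. The following is a minimal sketch, assuming a hypothetical stand-in file, sample.pdb, written to a temporary directory; its contents are made up for illustration and are not the lesson's methane.pdb.

```shell
# Build a tiny stand-in file (hypothetical contents, not the real methane.pdb).
workdir=$(mktemp -d)
cat > "$workdir/sample.pdb" <<'EOF'
COMPND      SAMPLE
ATOM      1  C           1       0.257  -0.363   0.000
ATOM      2  H           1       0.257   0.727   0.000
TER       3              1
END
EOF

# Both commands print every line that contains the string ATOM.
grep_out=$(grep ATOM "$workdir/sample.pdb")
awk_out=$(awk '/ATOM/' "$workdir/sample.pdb")

echo "$awk_out"

# Report whether the two outputs are identical.
[ "$grep_out" = "$awk_out" ] && echo "outputs match"
```

A pattern with no script attached makes awk fall back to its default action, which is to print the whole matching line.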
Awk as sed
In the previous episode we used sed to replace the atom label ‘H’ with ‘D’.
In order to ensure that sed did not change all instances of ‘H’ we had to include spaces or whitespace characters either side of ‘H’.
Awk allows us to reference specific fields in a file.
As it reads in each line, by default awk splits the line into ‘fields’ by using white space as a field separator, or delimiter.
Awk references each field, or column, using ‘$’ and its number, so to select all lines whose third field contains ‘H’ we could use:
$ awk '$3 ~ /H/' methane.pdb
This has identified the lines we want to change, but we want to alter these lines rather than just print them.
To do this we need to start using awk’s scripting capability.
In awk the script is placed within {}, for instance to perform the same functionality as above we can use:
$ awk '{if($3=="H"){print $0;};}' methane.pdb
This line introduces two new forms of functionality.
if statements in awk have the form if ( condition ) { execute some code }: the condition is in parentheses and the code in curly braces.
Notice also that we have put H in quotes, "H", to specify that it is a string.
We have also introduced the special variable $0, which you can see refers to the whole line, or all fields in the line.
Finally we note that each statement is terminated with a ;
Already we can see that this command is beginning to look a little unwieldy so as before we will develop it further in a script.
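The two ways of selecting lines seen so far do not always pick the same lines. The sketch below uses a hypothetical file, sample.pdb, invented for illustration, in which one header line happens to have a third field that contains an H. It compares the if statement, the same equality test written as a bare pattern, and the regular-expression match.

```shell
# Hypothetical stand-in data: the first line has HILL as its third field.
workdir=$(mktemp -d)
cat > "$workdir/sample.pdb" <<'EOF'
AUTHOR      JANE        HILL
ATOM      1  C           1
ATOM      2  H           1
EOF

# Explicit if statement inside the script.
if_out=$(awk '{if($3=="H"){print $0;};}' "$workdir/sample.pdb")

# The same equality test written as a pattern; with no action awk prints the line.
pattern_out=$(awk '$3=="H"' "$workdir/sample.pdb")

# The regular-expression match also selects the line whose third field is HILL.
regex_out=$(awk '$3 ~ /H/' "$workdir/sample.pdb")

echo "equality test:"
echo "$if_out"
echo "regular expression:"
echo "$regex_out"
```

The equality test is the stricter of the two, which is usually what is wanted when a field holds a label such as an element symbol.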
Create a new script called deuterate.awk and enter the following:
#!/bin/bash
awk '{
if($3=="H"){
print $0;
};
}' methane.pdb
Don’t forget to make the script executable with chmod +x ..., and run it with ./my_script.awk to confirm that it works.
This is still not what we want, however, as we still haven’t changed the hydrogens and the other lines aren’t being printed.
Awk allows us to modify the fields, however, so we will edit the script again:
#!/bin/bash
awk '{
if($3=="H"){
$3="D";
};
print $0;
}' methane.pdb
Notice the difference between the conditional ==, is this equal to that, and =, set this equal to that.
Also we have moved print $0; outside of the conditional, so that it is executed on every line in the input file.
What’s wrong with this script?
There are at least three aspects that are wrong with this script, or not as good as they could be. See if you can identify each of the issues and edit the script to fix them.
Solution
- There are no comments!
- This script is not general!
- The formatting has been changed so that whitespace has been replaced with a single space character.
#!/bin/bash
# A script to replace hydrogens with deuterium in a pdb file
# Identifies 'H' atoms and replaces them with 'D'
# Usage ./deuterate.awk filename.pdb
# James Grant, RSE, University of Bath, r.j.grant@bath.ac.uk
awk 'OFS="\t" {
if($3=="H"){
$3="D";
};
print $0;
}' $1
This solution introduces several new features:
Problem 3. OFS is a special variable, the output field separator. By setting this to \t, the tab character, we make the output format more similar to the input, though not perfect. Note that as with the search options we saw earlier, the OFS is set outside the main awk script. In addition to OFS there is also a special variable FS for the input field separator. By default awk uses whitespace (space(s) and tabs), but e.g. for csv files you can set FS="," to process your data.
Problem 2. In order to make the script more general, we have replaced the specific methane.pdb with $1. Outside of the awk script $1 refers to the first argument passed to the script. It is only inside the awk script that it refers to the field.
We can try to improve the output format further by ‘touching’ a line without changing any of its fields. Doing this means that awk treats the line as modified and so outputs it with the new formatting. Also, can we think how to make the script more robust?
#!/bin/bash
# A script to replace hydrogens with deuterium in a pdb file
# Identifies 'H' atoms and replaces them with 'D'
# Usage ./deuterate.awk filename.pdb
# James Grant, RSE, University of Bath, r.j.grant@bath.ac.uk
awk 'OFS="\t" {
if($3=="H"&&$1=="ATOM"){
$3="D";
} else {
$1=$1;
};
print $0;
}' $1
By testing if the first field is “ATOM” we are being more specific in that the field we change must be an “ATOM” labelled “H”.
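The FS and OFS variables and the ‘touching’ trick can be combined to convert between formats. The following is a minimal sketch, assuming a hypothetical comma-separated file, counts.csv, invented for illustration. It sets both separators in a BEGIN block, which awk runs once before any input is read; this is an alternative to the placement used in the solution above, not the lesson's own method.

```shell
# Hypothetical comma-separated data.
workdir=$(mktemp -d)
printf 'deer,5\nrabbit,22\n' > "$workdir/counts.csv"

# BEGIN runs before any line is read, so FS and OFS apply from the first line.
# $1=$1 "touches" the record so that awk rebuilds it using OFS.
tabbed=$(awk 'BEGIN { FS=","; OFS="\t" } { $1=$1; print $0; }' "$workdir/counts.csv")

echo "$tabbed"
```

Setting FS in BEGIN matters because a change to FS only affects lines read after the assignment, whereas OFS is consulted at the moment the line is rebuilt.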
Awk as wc
We have already seen a number of the special variables in awk, now we will cover some more.
We have used wc -l to count the number of lines in a file.
Let’s run the following command:
$ awk '{print NR;}' methane.pdb
awk has a special variable NR, which is the line number, or record number (hence NR not ‘NL’).
However if we want the number of lines in the file we just want the last line number to be printed, i.e. only at the end of the script.
In addition to the main script executed on every line awk has BEGIN and END blocks:
$ awk 'BEGIN {print NR;} {print NR;} END {print NR;}' methane.pdb
Modify this so that it just outputs the number of lines.
As well as NR there is the variable NF, which is the number of fields.
NF as variable
Run the following command and make sure you understand what is happening:
$ awk '{print NR, NF, $NF;}' methane.pdb
Solution
For each line this prints the record number, the number of fields, and the content of the final field.
NF can be used as a variable to reference a field, indeed any (integer) variable can be used in this way.
Can you write an awk script to count the number of ‘words’ in a file?
Hint: assume that number of words is the number of fields.
Solution
#!/bin/bash
# An awk script to output number of lines and words in a file
# Usage: ./wc.awk filename
# Output: #lines #words
# James Grant, RSE, University of Bath, r.j.grant@bath.ac.uk
awk '{
words+=NF;
}
END {
print NR, words;
}' $1
Compare the output with wc.
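One way to make that comparison is sketched below, assuming a hypothetical file, text.txt, invented for illustration, which includes an empty line and some repeated spaces. The wc output is passed through awk only to strip the padding that some systems add around the numbers.

```shell
# Hypothetical text with an empty line and irregular spacing.
workdir=$(mktemp -d)
printf 'one two three\nfour  five\n\nsix\n' > "$workdir/text.txt"

# Lines and words as counted by the awk one-liner from the solution.
awk_counts=$(awk '{ words+=NF; } END { print NR, words; }' "$workdir/text.txt")

# Lines and words as counted by wc, with the spacing normalised.
wc_counts=$(wc -l -w < "$workdir/text.txt" | awk '{ print $1, $2; }')

echo "awk: $awk_counts"
echo "wc:  $wc_counts"
```

The empty line contributes one record but no fields, so it adds to the line count and leaves the word count unchanged.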
What’s awk for?
As you would expect from a complete programming language, awk also has functionality for for loops:
$ awk '{for(field=1;field<=NF;field++){print $field;}}' methane.pdb
Unlike in bash, awk requires us to specify all aspects of the loop we are performing.
We must specify a variable to index the loop, in this case ‘field’, and we start with field=1 and continue while field<=NF, that is until it is greater than NF, the number of fields in the line.
Each time through the loop we execute field++ which increments the field variable, i.e. field=field+1.
Finally every time we go through the loop we print the contents of the ‘fieldth’ field.
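The same loop structure can do arithmetic across a line. The sketch below assumes a hypothetical file of numbers, rows.txt, invented for illustration, and sums the fields of each line; the variable name sum is also just an illustrative choice.

```shell
# Hypothetical numeric data, with a different number of fields on each line.
workdir=$(mktemp -d)
printf '1 2 3\n10 20\n' > "$workdir/rows.txt"

# For each line, loop over fields 1..NF and accumulate their sum,
# then print the record number, the field count and the total.
sums=$(awk '{
    sum = 0;
    for (field = 1; field <= NF; field++) {
        sum += $field;
    }
    print NR, NF, sum;
}' "$workdir/rows.txt")

echo "$sums"
```

Resetting sum at the start of the script matters because awk variables keep their values from one line to the next.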
Some example problems
Change to the data folder, cd ~/data-shell/data.
This contains a number of files in different formats.
animal-counts/animals.txt: Can you count the total number of animals?
planets.txt: Can you write a script which calculates the total mass of the planets? How about the average period?
sunspot.txt: Can you write a script which counts up the number of sunspots in the year 2000 and outputs just the total?
sunspot.txt: Can you generalise this to any year? Hint: You can pass variables into awk with awk -v year=$argument '{...}'
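As an illustration of the -v option on an unrelated task, the sketch below passes a shell variable into awk and uses it as a cut-off. The file values.txt, the shell variable threshold and the awk variable limit are all hypothetical names chosen for this example.

```shell
# Hypothetical data: a label and a value on each line.
workdir=$(mktemp -d)
printf 'a 3\nb 7\nc 10\n' > "$workdir/values.txt"

threshold=5

# -v copies the shell value into the awk variable "limit" before the script runs.
# Adding 0 forces both sides to be compared as numbers rather than as strings.
big=$(awk -v limit="$threshold" '$2+0 > limit+0 { print $1; }' "$workdir/values.txt")

echo "$big"
```

Note that there are no spaces around the = in the -v assignment; the shell would otherwise split it into separate arguments.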
Solutions
#!/bin/bash
# Count the total number of animals in animals.txt
# Number of each animal is given in the last field
awk '{
count+=$NF;
}
END {
print "The total number of animals is",count;
}'
#!/bin/bash
# Calculate total mass and average period of planets in planets.txt
# Output: #total mass #average period
# Since the file encloses all fields in quotes we need awk to remove these during the read.
# We can match the string with a regular expression
# We need to match "," and ",
# In regex the quote is \" and we can use the ? to specify that we expect 0, or 1 of the preceding character
# Hence we match with \",\"?
awk 'FS="\",\"?" {
if(NR>1){
totalmass+=$2;
totalperiod+=$4;
planets++;
}
}
END{
print totalmass,totalperiod/planets;
}' planets.txt
#!/bin/bash
# Count the number of sunspots in given year from sunspot.txt
# Data starts after record 3, year is field 3, sunspots per month is in the last field
# Usage: ./sunspots.awk #year
awk -v year=$1 '{
if( NR > 3 && $3==year){
total+=$NF;
}
}
END {
print "Total sunspots in",year,"=",total;
}' sunspot.txt
Key Points
Awk is a powerful tool that can be used to modify files and perform analysis.
Awk operates on each line in turn.
As a data processor, awk can allow you to quickly prototype solutions.
As soon as you have awk scripts running over several lines, you should consider other approaches.