Awk

Overview

Teaching: 15 min
Exercises: 0 min
Questions
  • What is awk?

  • How can I use awk to modify files?

  • How can I use awk to prototype simple analyses?

Objectives
  • Understand that awk is a standalone programming language.

  • Understand that it is a data-driven scripting tool.

  • Be able to use awk to modify files and perform arithmetic.

  • Be able to use awk to prototype simple analyses.

  • Know that awk has a range of special variables to help you develop scripts.

  • Understand that, while awk is powerful, scripts can quickly become unwieldy and alternative approaches may provide clearer solutions.

Wikipedia tells us that awk is a programming language in its own right. It is a data-driven language, which means that, like sed, it operates on each line of the input file in turn: the main script is executed once for every line in the input file. The benefit of awk is that you do not have to worry about reading in files, but (do not accept the word of dinosaurs!) scripts can quickly become complex and obscure, and you are encouraged to use alternative approaches for more complex functionality.

As with other Unix/Linux programs, it has a detailed manual available online. Other useful resources for awk include 1 and 2, but a huge number are available, including several books.

Awk as grep

The easiest way to learn how awk works is to run a few example cases. We will start by showing how awk can be used to search for specific strings. Make sure you are in the ‘molecules’ directory using pwd, and change to it if necessary with cd ~/data-shell/molecules.

Compare the output of:

$ grep ATOM methane.pdb

and

$ awk '/ATOM/' methane.pdb

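Used in this way, awk processes each line of the file in turn and prints every matching line in full, exactly as grep does. When a pattern is given with no action, awk's default action is to print the whole line.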
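The equivalence with grep can be checked on a small file. The sketch below uses a few made-up lines standing in for methane.pdb (the file contents and the variable names are assumptions for illustration, not the real data) and also spells out the default action explicitly.

```shell
#!/bin/bash
# Hypothetical sample data standing in for methane.pdb (not the real file).
sample=$(mktemp)
printf '%s\n' \
    'COMPND      METHANE' \
    'ATOM      1  C           1       0.257  -0.363   0.000' \
    'ATOM      2  H           1       0.257   0.727   0.000' \
    'TER       3              1' > "$sample"

# A pattern with no action prints every matching line, like grep.
grep_out=$(grep ATOM "$sample")
awk_out=$(awk '/ATOM/' "$sample")

# The same rule with the default action written out explicitly.
awk_explicit=$(awk '/ATOM/ {print $0}' "$sample")

if [ "$grep_out" = "$awk_out" ]; then
    echo "grep and awk print the same lines"
fi
rm -f "$sample"
```

Both awk forms should select the two ATOM lines and nothing else.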

Awk as sed

In the previous episode we used sed to replace the atom label ‘H’ with ‘D’. To ensure that sed did not change every instance of ‘H’, we had to include spaces or whitespace characters either side of ‘H’. Awk allows us to reference specific fields in a file. As it reads in each line, awk by default splits the line into ‘fields’, using whitespace as the field separator, or delimiter. Awk references each field, or column, using ‘$’ and its number, so to select all lines whose third field contains ‘H’ we could use:

$ awk '$3 ~ /H/' methane.pdb 

This has identified the lines we want to change, but we still need to alter them. To do this we need to start using awk’s scripting capability. In awk the script is placed within {}; for instance, to select the same lines as above we can use:

$ awk '{if($3=="H"){print $0;};}' methane.pdb

This line introduces two new pieces of functionality. if statements in awk have the form if ( condition ) { execute some code }: the condition is in parentheses and the code in curly braces. Notice also that we have put H in quotes, "H", to specify that it is a string. We have also introduced the special variable $0, which refers to the whole line, that is, all fields in the line. Finally, note that each statement is terminated with a ; character. Already we can see that this command is beginning to look a little unwieldy, so as before we will develop it further in a script. Create a new script called deuterate.awk and enter the following:

#!/bin/bash

awk '{
    if($3=="H"){
        print $0;
    };
}' methane.pdb

Don’t forget to make the script executable with chmod +x ..., and run it with ./deuterate.awk to confirm that it works. This is still not what we want, however, as we still haven’t changed the hydrogens and the other lines aren’t being printed. Awk allows us to modify the fields, however, so we will edit the script again:

#!/bin/bash

awk '{
    if($3=="H"){
        $3="D";
    };
    print $0;
}' methane.pdb

Notice the difference between the comparison ==, is this equal to that, and the assignment =, set this equal to that. Also, we have moved print $0; outside of the conditional, so that it is executed on every line in the input file.
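The difference between matching, comparing and assigning is easy to get wrong, so here is a minimal sketch that counts lines using each operator. The three input lines, including the ‘HE’ label, are invented for illustration and are not taken from methane.pdb.

```shell
#!/bin/bash
# Hypothetical three-line input; field 3 is the atom label.
lines=$(printf '%s\n' 'ATOM 1 C' 'ATOM 2 H' 'ATOM 3 HE')

# ~ matches a regular expression: both "H" and "HE" contain an H.
matched=$(printf '%s\n' "$lines" | awk '$3 ~ /H/ { n++ } END { print n }')

# == compares: only the line whose third field is exactly "H" counts.
equal=$(printf '%s\n' "$lines" | awk '$3 == "H" { n++ } END { print n }')

# = assigns: field 3 becomes "H" on every line, and the non-empty
# string counts as true, so every line is counted.
assigned=$(printf '%s\n' "$lines" | awk '{ if ($3 = "H") n++ } END { print n }')

echo "matched=$matched equal=$equal assigned=$assigned"
```

Writing = where == was intended is not a syntax error, which is why this mistake can silently change every line of a file.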

What’s wrong with this script?

There are at least three aspects that are wrong with this script, or not as good as they could be. See if you can identify each of the issues and edit the script to fix them.

Solution

  1. There are no comments!
  2. This script is not general!
  3. The formatting has been changed so that whitespace has been replaced with a single space character.

#!/bin/bash

# A script to replace hydrogens with deuterium in a pdb file
# Identifies 'H' atoms and replaces them with 'D'
# Usage ./deuterate.awk filename.pdb
# James Grant, RSE, University of Bath, r.j.grant@bath.ac.uk

awk 'OFS="\t" {
    if($3=="H"){
        $3="D";
    };
    print $0;
}' "$1"

This solution introduces several new features:

Problem 3. OFS is a special variable, the output field separator. Setting this to \t, the tab character, makes the output format more similar to the input, though not perfect. Note that, as with the search patterns we saw earlier, OFS is set outside the main awk script. Strictly, awk treats the assignment as a pattern that is always true, so it is evaluated for every line. In addition to OFS there is also a special variable FS, the input field separator. By default awk splits on whitespace (space(s) and tabs), but for csv files, for example, you can set FS="," to process your data.

Problem 2. In order to make the script more general, we have replaced the specific filename, methane.pdb, with $1. Outside of the awk script, $1 refers to the first argument passed to the shell script. It is only inside the awk script that it refers to the first field.
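The two meanings of $1, and the FS and OFS variables, can be seen together in a short sketch. The csv contents and the function name to_tabs are hypothetical. Unlike the script above, this version sets the separators in a BEGIN block (covered later in this episode) so that they are in place before the first line is split.

```shell
#!/bin/bash
# Hypothetical two-line csv file, invented for this example.
csv=$(mktemp)
printf '%s\n' 'deer,5' 'rabbit,22' > "$csv"

to_tabs() {
    # Inside the single quotes, $1 and $2 are awk fields.
    # Outside them, "$1" is the first argument to this shell function.
    awk 'BEGIN { FS=","; OFS="\t" } { print $1, $2 }' "$1"
}

tabbed=$(to_tabs "$csv")
printf '%s\n' "$tabbed"
rm -f "$csv"
```

The comma in print $1, $2 is what inserts OFS between the fields; writing print $1 $2 would join them with nothing in between.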

We can try to improve the output format further by ‘touching’ a line without changing any of its fields, for example by assigning a field to itself with $1=$1. Awk then treats the line as modified and rebuilds it using the new output formatting. Can we also think how to make the script more robust?

#!/bin/bash

# A script to replace hydrogens with deuterium in a pdb file
# Identifies 'H' atoms and replaces them with 'D'
# Usage ./deuterate.awk filename.pdb
# James Grant, RSE, University of Bath, r.j.grant@bath.ac.uk

awk 'OFS="\t" {
    if($3=="H" && $1=="ATOM"){
        $3="D";
    } else {
        $1=$1;
    };
    print $0;
}' $1

By testing if the first field is “ATOM” we are being more specific in that the field we change must be an “ATOM” labelled “H”.
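To see what ‘touching’ a line with $1=$1 does, the sketch below prints one line twice, once untouched and once touched. The input line is a shortened, made-up ATOM record, and OFS is set in the same way as in the script above.

```shell
#!/bin/bash
# Hypothetical, shortened ATOM record with irregular spacing.
line='ATOM      2  H           1       0.257'

# Without any field assignment, $0 is printed exactly as it was read.
untouched=$(printf '%s\n' "$line" | awk 'OFS="\t" { print $0 }')

# Assigning a field, even to itself, makes awk rebuild $0 using OFS.
touched=$(printf '%s\n' "$line" | awk 'OFS="\t" { $1 = $1; print $0 }')

printf '%s\n' "$untouched" "$touched"
```

The first output line should keep its original spacing, while the second should have a single tab between each pair of fields.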

Awk as wc

We have already seen a number of the special variables in awk; now we will cover some more. We have used wc -l to count the number of lines in a file. Let’s run the following command:

$ awk '{print NR;}' methane.pdb 

awk has a special variable NR, which is the line number, or record number (hence NR not ‘NL’). However, if we want the number of lines in the file we only want the last line number to be printed, i.e. only at the end of the script. In addition to the main script executed on every line, awk has BEGIN and END blocks, which run once before the first line is read and once after the last:

$ awk 'BEGIN {print NR;} {print NR;} END {print NR;}' methane.pdb

Modify this so that it just outputs the number of lines. As well as NR there is the variable NF which is the number of fields.

NF as variable

Run the following command and make sure you understand what is happening:

$ awk '{print NR, NF, $NF;}' methane.pdb

Solution

For each line this prints the record number, the number of fields, and the content of the final field. NF can be used as a variable to reference a field, indeed any (integer) variable can be used in this way.
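As a small worked case, the sketch below runs the same command on three invented lines, and then uses an expression, $(NF-1), to pick out the second-to-last field. The input text and variable names are assumptions for illustration.

```shell
#!/bin/bash
# Hypothetical three-line input with 2, 4 and 1 fields.
text=$(printf '%s\n' 'COMPND METHANE' 'ATOM 1 C 0.257' 'END')

# Record number, number of fields, and the content of the last field.
report=$(printf '%s\n' "$text" | awk '{ print NR, NF, $NF }')

# Any integer expression can follow $; here the second-to-last field,
# printed only for lines that have more than one field.
penultimate=$(printf '%s\n' "$text" | awk 'NF > 1 { print $(NF-1) }')

printf '%s\n' "$report" "$penultimate"
```

The parentheses in $(NF-1) matter: $NF-1 would instead subtract one from the value of the last field.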

Can you write an awk script to count the number of ‘words’ in a file?
Hint: assume that number of words is the number of fields.

Solution

#!/bin/bash

# An awk script to output number of lines and words in a file
# Usage: ./wc.awk filename
# Output: #lines #words
# James Grant, RSE, University of Bath, r.j.grant@bath.ac.uk

awk '{
    words+=NF;
}
END {
    print NR, words;
}' $1

Compare the output with wc.
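One way to make that comparison without a real data file is sketched below, using three invented lines. wc pads its output with spaces on some systems, so the sketch strips them with tr before comparing; the variable names are hypothetical.

```shell
#!/bin/bash
# Hypothetical sample file: 3 lines, 7 whitespace-separated words.
sample=$(mktemp)
printf '%s\n' 'COMPND METHANE' 'ATOM 1 C 0.257' 'END' > "$sample"

# Lines and words as counted by awk.
awk_counts=$(awk '{ words += NF } END { print NR, words }' "$sample")

# The same counts from wc, with any padding removed.
wc_lines=$(wc -l < "$sample" | tr -d ' ')
wc_words=$(wc -w < "$sample" | tr -d ' ')

echo "awk: $awk_counts"
echo "wc:  $wc_lines $wc_words"
rm -f "$sample"
```

For plain whitespace-separated text like this the two tools should agree.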

What’s awk for?

As you would expect from a complete programming language, awk also provides for loops:

$ awk '{for(field=1;field<=NF;field++){print $field;}}' methane.pdb

Unlike in bash, awk requires us to specify all aspects of the loop we are performing. We must specify a variable to index the loop, in this case ‘field’; we start with field=1 and continue while field<=NF, that is, until field is greater than NF, the number of fields in the line. Each time through the loop we execute field++, which increments the field variable, i.e. field=field+1. Finally, every time we go through the loop we print the contents of the ‘fieldth’ field.
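The same loop structure can do arithmetic instead of printing. As a minimal sketch, the following adds up the fields on each line; the three rows of numbers are made up for the example.

```shell
#!/bin/bash
# Hypothetical rows with different numbers of fields.
rows=$(printf '%s\n' '1 2 3' '10 20' '5')

sums=$(printf '%s\n' "$rows" | awk '{
    total = 0                          # reset the running total for each line
    for (field = 1; field <= NF; field++) {
        total += $field                # add the value of the current field
    }
    print total
}')

printf '%s\n' "$sums"
```

Because NF is re-evaluated for every line, the loop copes with rows of different lengths without any changes.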

Some example problems

Change to the data folder, cd ~/data-shell/data. This contains a number of files in different formats.

  • animal-counts/animals.txt: Can you count the total number of animals?

  • planets.txt: Can you write a script which calculates the total mass of the planets? How about the average period?

  • sunspot.txt: Can you write a script which counts up the number of sunspots in the year 2000 and outputs just the total?

  • sunspot.txt: Can you generalise this to any year? Hint: You can pass variables into awk with awk -v year=$argument '{...}'
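The -v option from the hint is sketched below on an unrelated toy problem, so as not to give away the solutions. The records, the variable name min and its value are all invented for illustration. Note that there must be no spaces around the = sign.

```shell
#!/bin/bash
# Hypothetical name/count records.
records=$(printf '%s\n' 'deer 5' 'rabbit 22' 'fox 30')

# A shell variable that we want to use inside the awk script.
min=20

# -v copies the shell value into an awk variable of the given name
# before the script starts reading input.
selected=$(printf '%s\n' "$records" | awk -v min="$min" '$2 >= min { print $1 }')

printf '%s\n' "$selected"
```

Inside the single-quoted awk script min is an ordinary awk variable, so it is written without a $ sign.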

Solutions

#!/bin/bash
# Count the total number of animals in animals.txt
# Number of each animal is given in the last field

awk '{
    count+=$NF;
}
END {
    print "The total number of animals is",count;
}' animal-counts/animals.txt

#!/bin/bash

# Calculate total mass and average period of planets in planets.txt
# Output: #total mass #average period
# Since the file encloses all fields in quotes we need awk to remove these during the read.
# We can match the separator with a regular expression.
# We need to match "," and ",.
# Inside an awk string the quote is written \" and we can use ? to specify that we expect 0 or 1 of the preceding character.
# Hence we match with \",\"?

awk 'FS="\",\"?" {
    if(NR>1){
        totalmass+=$2;
        totalperiod+=$4;
        planets++;
    }
}
END{
    print totalmass,totalperiod/planets;
}' planets.txt

#!/bin/bash
# Count the number of sunspots in given year from sunspot.txt
# Data starts after record 3, year is field 3, sunspots per month is in the last field
# Usage: ./sunspots.awk #year

awk -v year=$1 '{
    if( NR > 3 && $3==year){
        total+=$NF;
    }
}
END {
    print "Total sunspots in",year,"=",total;
}' sunspot.txt

Key Points

  • Awk is a powerful tool that can be used to modify files and perform analysis.

  • Awk operates on each line in turn.

  • As a data processor, awk can allow you to quickly prototype solutions.

  • As soon as you have awk scripts running over several lines, you should consider other approaches.