Be sure to do all exercises and run all completed code cells.

If anything goes wrong, restart the kernel (in the menubar, select Kernel\(\rightarrow\)Restart).


Manipulating Data and Files

About files

In this Section we will learn how to write and read data to and from files using Python.

We can then use this data to do useful things like plotting or calculations.

To use a file it has to be opened, and when finished it has to be closed.
While the file is open, it can either be read from or written to.

To open a file, we specify its name and indicate whether we want to read or write.

Files for Examples:

These examples, and the ones later on, need some files to process. These are in the Files folder, so make sure you download its contents if you are working on your own machine.

  • Make sure they are inside correctly named sub-folders on your hard disk.

  • Use the same capitalisation as the original for cross-platform compatibility (many operating systems are case sensitive).

To find out where a Jupyter Notebook is running, run the following command:

import os

here = os.getcwd() #get location of Current Working Directory
print(here)
  • You can then navigate to this folder using the file-browser on your operating system.

Organising Folders

When we create a new file by opening it and writing, the new file goes in the current working folder (also called a “directory”).
When we open a file for reading, Python also looks for it in the same place.

If we want to open a file somewhere else, we have to specify the path to the file, which is the name of the directory (or folder) where the file is located.

Usually we use relative paths, when we have a subfolder in the current working directory e.g.: Files/Work/mynotes.txt.
The directory above is ../ etc.

  • Use the same case (uppercase, lowercase etc.) as the original names. This may not matter on Windows, but it will affect the script on other operating systems and when the work is marked.

A full (“absolute”) Windows path might be "c:/Users/Nick/words.txt" or "c:\\Users\\Nick\\words.txt". Do NOT use absolute paths in shared or submitted files; they are only suitable if the script will only ever be run on a single machine.
Because backslashes are used to escape things like newlines and tabs, we need to write two backslashes in a string to get a real backslash \. To avoid this we can use a forward slash instead (like web addresses).

  • We cannot use / or \ as part of a filename; they are reserved as a delimiter between directory and filenames.
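As a small sketch of how relative paths are put together, the snippet below builds the example path from its parts with os.path.join and asks Python which absolute location it would refer to. The folder and file names are only the illustrative ones from the text and do not need to exist on your disk.

```python
import os

# Build a relative path from its parts; os.path.join inserts the separator for us
relative_path = os.path.join("Files", "Work", "mynotes.txt")
print(relative_path)

# The absolute location this relative path refers to, starting from the working folder
print(os.path.abspath(relative_path))

# ".." means the folder above the current working folder
parent = os.path.abspath("..")
print(parent)
```

The printed locations depend on where the notebook is running, which is exactly why relative paths are preferred in shared work.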

Writing our First File

To open a file we use the open(NAME, MODE) function, which takes two arguments.
The first is the name of the file, and the second is called the mode.
Mode "w" means that we are opening the file for writing.

With mode "w", if there is no file named workfile.txt on the disk, it will be created.
If there already is one, it will be replaced by the file we are writing (be careful to not overwrite important files!).

myfile = open("workfile.txt", "w")
print(myfile)

The variable named myfile acts like a container holding all the contents copied to memory from the file on disk.
You can move information from the container a piece at a time, or all at once until it is empty.

You can then use different methods on the file, using a dot as in: FILE.METHOD, and this makes changes to the myfile container.

The following line writes text into the file, using the write method. The file is written to the current working folder, where the programme script is run from (see the section below on Using the os module for more on working folders).

myfile.write("This is a test...")
  • Try opening the new file from the main JupyterHub file menu or on your disk using the file browser.

    • Notice that it will be empty.

    • This is because the text has not yet been saved to disk (like editing but not pressing save in a word processor).

To send any data to the file before we are finished we can use the flush method.

myfile.flush()
  • Now try reopening the file to see if the contents have appeared.

Using the close method will both save the data and close the file.

To write something other than a string, it needs to be converted to a string first:

value = 42
s = str(value)
myfile.write(s)
myfile.close()
  • Reopen the file and see the new contents.

Note: if we try to write something other than a string, or use a variable that does not exist, we will get an error.
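The short sketch below illustrates the kind of error raised when a number is passed straight to write, and the conversion that avoids it. The file name number_demo.txt is made up for this illustration, and the try/except construct it uses is explained later in this section.

```python
value = 42

demo = open("number_demo.txt", "w")  # hypothetical file name for this demo
try:
    demo.write(value)  # write only accepts strings, so this raises a TypeError
except TypeError as err:
    print("Could not write the number directly:", err)

demo.write(str(value) + "\n")    # convert to a string first
demo.write(f"{3.14159:.2f}\n")   # an f-string also produces a string
demo.close()

# Read the file back to check what was actually written
check = open("number_demo.txt", "r")
written = check.read()
check.close()
print(written)
```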

The with statement

A safe way of writing to files is using the with statement and an indented block of code that does something to the file. This is safe because you do not need to specify close() at the end. The contents are written to the file whether the code exited normally or not. This means you do not lose any work in progress.

The general syntax for opening the file FILENAME and assigning it to a FILEHANDLE (like FILEHANDLE = open(FILENAME, MODE)) is:

with open(FILENAME, MODE) as FILEHANDLE:
    DO FUN STUFF
    FILEHANDLE.write()
    OTHER STUFF
    ...
  • Note in the example below the “triple quotes” allow us to use a multi-line string.

textlines="""A bit of text I want to write to a file.
1. The first point I want to make is this.
2. Next I want to tell you that..."""

fname="outfile.txt"

with open(fname,"w") as f:
    f.write(textlines)
  • Just to prove that the file has been written and closed properly, we read back its contents (reading is covered in the next subsection).

print(open(fname,"r").read())
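One way to convince ourselves that the with statement closes the file for us is to inspect the closed attribute of the file handle inside and after the block. This is only a sketch; the file name with_demo.txt is an assumption made for the demonstration.

```python
fname = "with_demo.txt"  # hypothetical file name for this demo

with open(fname, "w") as f:
    f.write("written inside the with block\n")
    inside = f.closed    # the file is still open at this point

outside = f.closed       # the file was closed automatically on leaving the block

print("closed inside the block?", inside)
print("closed after the block? ", outside)
```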

The write method can be used repeatedly to print more output to the file, as shown in lines 2, 3 and 4 below.
In bigger programs, lines 2-4 will usually be replaced by a loop that writes many more lines into the file.

with open("test.txt", "w") as mynewfile:
    mynewfile.write("# A bit of text I want to write to a file.")
    mynewfile.write("1. The first point I want to make is this.")
    mynewfile.write("2. Next I want to tell you that...")
  • Now look in your working folder and you will be able to open the new file test.txt.

    • You will notice that the text is all on one line. This is because we didn’t tell Python to start a new line.

    • We do this using the special character "\n" (backslash n), which stands for “newline”.

    • This is like pressing enter or return on the keyboard. Other special characters exist, such as "\t" for tab.

  • Exercise: Try again, but putting a \n inside the end of each string

#with opener
# YOUR CODE HERE
    # YOUR CODE HERE
    # YOUR CODE HERE
    # YOUR CODE HERE

Now reload the file in a browser to see it.

It should look like this:

# A bit of text I want to write to a file.
1. The first point I want to make is this.
2. Next I want to tell you that...


  • You will need the correct code run above for the following exercises to work!

Reading from Files

If a file exists on our disk, we can open it for reading.
This time, the mode argument is "r" for read:

thefile = open("test.txt", "r")
print(thefile)
  • This prints a description of the file object (the “container”), not the contents of the file.

However, if we try to open a file for reading that doesn’t exist, we get an error:

anotherfile = open("iamnotafile.txt", "r")

We can write some code to catch the error (a so-called “exception”) and print a friendlier message instead of stopping the programme.

# Try to open the file; if it does not exist, report this instead of crashing
fname = "iamnotafile.txt"

try:
    open(fname, "r")
except FileNotFoundError:
    print(f'File "{fname}" could not be opened.')
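An alternative to catching the error is to check whether the file exists before trying to open it. The sketch below uses os.path.exists for this; it assumes, as in the text, that no file called iamnotafile.txt is present in the working folder.

```python
import os

fname = "iamnotafile.txt"  # the same non-existent file name as in the text

if os.path.exists(fname):
    # only open the file if it is really there
    with open(fname, "r") as f:
        contents = f.read()
else:
    contents = None
    print(f'File "{fname}" does not exist, so nothing was read.')
```

Both approaches are reasonable; catching the exception also covers the rare case where the file disappears between the check and the call to open.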

File Methods

There are a variety of methods for reading data and text from files, depending on the format of the data and how we want to use it. We can either read the contents of the whole file at once, or scan it in line-by-line.

Reading the Whole File

The read method returns the entire contents of the file, emptying the whole “container” into the string called this at the same time:

thefile = open("test.txt", "r")
this = thefile.read()
print(this)

If we try to read it a second time, the file container called thefile is now empty, so read returns an empty string:

that = thefile.read()
print(that)
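If the contents are needed again, the file does not have to be reopened: the seek method moves the reading position back to the start. The following sketch first creates its own small file (rewind_demo.txt, a made-up name) so that it can be run on its own.

```python
# Create a small file to read from (hypothetical name and contents)
with open("rewind_demo.txt", "w") as f:
    f.write("line one\nline two\n")

f = open("rewind_demo.txt", "r")
first = f.read()    # the whole contents
second = f.read()   # nothing left to read, so this is an empty string
f.seek(0)           # move the reading position back to the start of the file
third = f.read()    # the whole contents again
f.close()

print(repr(first))
print(repr(second))
print(repr(third))
```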

Reading Files a Line at a Time

Another file method is readline, which scans the contents line-by-line:

f = open("test.txt", "r")
print(0, f.readline())  # This will read the first line of the file.
print(1, f.readline())  # and next the second line
print(2, f.readline())  # and the last line
print(3, f.readline())  # the handle (container) is now empty...

The end="" argument prevents an extra newline being added:

f = open("test.txt", "r")
print(0, f.readline(), end="")  # This will read the first line of the file.
print(1, f.readline(), end="")  # and next the second line
print(2, f.readline(), end="")  # and the last line
print(3, f.readline(), end="")  # the handle (container) is now empty...

Iterative Methods for Scanning Files

A file can also be iterated over in the same way as a list.

  • Exercise: Put the previous code in a loop using for i in range(3): and replace the fixed numbers with the counter i.

# open the file
# YOUR CODE HERE
#FOR loop condition:
# YOUR CODE HERE
    #single print function
    # YOUR CODE HERE
#end of loop


Alternatively you can just iterate over the file contents directly:

the_file = open("test.txt", "r")

for each_line in the_file:
    print(each_line, end="")

Reading a File in to a List of Lines

It is often useful to fetch data from a disk file and turn it into a list of lines.
We can then perform useful tasks on this list.
The readlines method in line 2 reads all the lines and returns a list of the strings.

f = open("test.txt", "r")
list_of_lines = f.readlines()
print(list_of_lines)
print("The last line is:\n", list_of_lines[-1])
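Each string returned by readlines still ends with its newline character. When this gets in the way, a common approach is to strip it off with a list comprehension, as in this sketch (the file strip_demo.txt and its contents are invented for the example).

```python
# Create a small file to read from (hypothetical name and contents)
with open("strip_demo.txt", "w") as f:
    f.write("alpha\nbeta\ngamma\n")

with open("strip_demo.txt", "r") as f:
    raw_lines = f.readlines()

# remove the trailing newline from every line
clean_lines = [line.rstrip("\n") for line in raw_lines]

print(raw_lines)
print(clean_lines)
```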

Sorting a File

This example sorts the lines of a file alphabetically (with capitalised words first).
We will use the friends.txt file we downloaded, which has a name per line.

  • Take a look at this file using a normal text editor.

First we read everything into a list of lines, then sort the list, and then write the sorted list back to another file:

thefile = open("Files/friends.txt", "r")
contentlist = thefile.readlines()
thefile.close()

print(contentlist)

Now we can sort the list alphanumerically:

contentlist.sort()  # sort is a list method, guess what it does before running this cell
list(contentlist) # spot the difference!
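The default sort puts capitalised words first because uppercase letters come before lowercase letters in the character ordering. If that is not wanted, a key function can be supplied. The sketch below uses a made-up list of names rather than the contents of friends.txt.

```python
# Made-up names, each ending in a newline as readlines would return them
names = ["bob\n", "Alice\n", "dave\n", "Carol\n"]

default_order = sorted(names)               # capitalised names come first
ignore_case = sorted(names, key=str.lower)  # compare the lowercase versions instead

print(default_order)
print(ignore_case)
```

Here sorted returns a new list and leaves names unchanged, whereas the sort method used above changes the list in place.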

Now write back to another file:

with open("sortedfriends.txt", "w") as outfile:
    for next_entry in contentlist:
        outfile.write(next_entry)
  • Open the file using the browser to see its contents.

Exercise: File Reversing

Write a program that reads a file and writes out a new file with the lines in reversed order (i.e. the first line in the old file becomes the last one in the new file.)

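  • Hint: use reverse indexing, where N is the number of lines. The stop value must be -N-1 so that the first line (index -N) is also included:

range(-1, -N-1, -1)

# open the input file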
# YOUR CODE HERE

# read into a list (called contents) using readlines
# YOUR CODE HERE

N = len(contents) # may be needed later...

# open your new file to write to, using a WITH statement:
# YOUR CODE HERE
    ## EVERYTHING INDENTED IN THE WITH BLOCK ##
    # loop backwards FOR the LENgth of the contents list obtained above
    # YOUR CODE HERE
        # write lines back to your file 
        # YOUR CODE HERE
    ## END OF THE WITH BLOCK ##


Filtering a File

Many useful line-processing programs will read a text file line-at-a-time and do some minor processing as they write the lines to an output file. They might number the lines in the output file, or insert extra blank lines after every 60 lines to make it convenient for printing on sheets of paper, or extract some specific columns only from each line in the source file, or only print lines that contain a specific substring.
We call this kind of program a filter.

Here is a filter that copies one file to another, omitting any lines that begin with #.

  • Have a look at the contents of the file Files/intext.txt to see what it looks like before running the next cell.

oldfilename = "Files/intext.txt"
newfilename = "outtext.txt"

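# Open both files in one with statement so that both are closed automatically
with open(oldfilename, "r") as infile, open(newfilename, "w") as outfile:
    for text in infile:
        # copy the line only if it does not start with "#"
        if not text.startswith("#"):
            outfile.write(text)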
  • Look at outtext.txt to see what this did to the data.
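Another of the filters described above, numbering the lines of the output, might be sketched as follows. The file names filter_in.txt and filter_out.txt are assumptions, and the input file is created by the snippet itself so that it can be run on its own.

```python
# Create a small input file (hypothetical name and contents)
with open("filter_in.txt", "w") as f:
    f.write("# a comment\nfirst\nsecond\n")

with open("filter_in.txt", "r") as infile, open("filter_out.txt", "w") as outfile:
    # enumerate supplies a running line number, starting from 1
    for number, text in enumerate(infile, start=1):
        outfile.write(f"{number}: {text}")  # text already ends with a newline

with open("filter_out.txt", "r") as f:
    numbered = f.read()
print(numbered)
```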

Methods such as sorting and filtering are also very useful on numerical data.

Manipulating Numerical Data

Numerical data can be read in, processed and written to files in the same way as text.
Instead of using text methods we simply perform mathematical operations on the numbers.

Import of data using NumPy

Numerical data manipulation is made easier by importing and using the numerical module NumPy.
NumPy also has special methods for saving and loading purely numerical array data to and from files.

The following programme loads a column of numbers from the file "Files/temps.txt" into an array using the ARRAY = np.loadtxt(FILENAME) function.

import numpy as np

data = np.loadtxt("Files/temps.txt")

print(data)

We can then take the mean (average) and standard deviation of the data using the array's .mean() and .std() methods:

help(np.mean)
m = data.mean()
s = data.std()

print(f"mean = {m:.1f}, standard deviation = {s :.1f}")
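To see what these two methods calculate, the sketch below applies them to a small made-up array and compares the standard deviation with a calculation done by hand. Note that, by default, .std() divides by the number of values N (the population standard deviation) rather than by N-1.

```python
import numpy as np

# Made-up values chosen so that the results are round numbers
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

m = values.mean()   # sum of the values divided by how many there are
s = values.std()    # square root of the mean squared deviation from the mean

# the same standard deviation worked out step by step
manual_s = np.sqrt(((values - m) ** 2).mean())

print(f"mean = {m:.1f}, standard deviation = {s:.1f}, by hand = {manual_s:.1f}")
```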

Export of Data using Numpy

Continuing the last example we will now:

  1. Calculate the difference of the individual temperature values from the mean.

  2. Save these to a new file using the np.savetxt(FILENAME, DATA) method.

devs = data-m # a new array with the differences
print(devs) #print the array
np.savetxt("deviations.txt", devs)
  • Look at the contents of the file from the file browser.

Manipulating CSV files

Numerical columns of data are commonly stored as Comma Separated Value files with a .csv extension.

The file Files/weatherdata.csv contains hourly weather data in plain text, with values separated by commas (or other so called “delimiters”).

The file looks like this in a plain text editor:

Dry Bulb Temperature {C}, Dew Point Temperature {C}, Relative Humidity {%}, ...
2.50E+0, 1.2, 91, ...
5.70E+0, 3.5, 86, ...
7.90E+0, 4.7, 80, ...
... , ... , ...,

But when opened in Excel appears as a table like this:

Dry Bulb Temperature {C}   Dew Point Temperature {C}   Relative Humidity {%}
2.5                        1.2                         91
5.7                        3.5                         86
7.9                        4.7                         80
8.7                        5.6                         81
8.9                        5.8                         81

The np.loadtxt() function can read the values into a data array if we tell it how the data is laid out in the .csv file.

  • The keyword argument delimiter=<SOME_STRING> tells NumPy how the values are separated (without this option it assumes whitespace).

  • The keyword argument skiprows=1 is used to ignore the first row, which is the non-numerical header.

import numpy as np

# delimiter=',' tells .loadtxt that the values are separated with commas (rather than spaces)
filedata = np.loadtxt("Files/weatherdata.csv", delimiter=",", skiprows=1)

print(filedata) #will show only the head (top rows) and tail (bottom rows) of a long array

Each row is an hour’s weather data, with the temperatures in the first column.
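The effect of the two keyword arguments can be tried out without the weather file by loading a few lines of CSV text held in memory. In this sketch io.StringIO makes a string behave like an open file; the shortened header and the two data rows are a made-up sample in the same layout.

```python
import io
import numpy as np

# A made-up sample in the same layout: one header row, then comma-separated numbers
sample = io.StringIO(
    "Temperature {C}, Dew Point {C}, Humidity {%}\n"
    "2.50E+0, 1.2, 91\n"
    "5.70E+0, 3.5, 86\n"
)

# skip the text header and split each remaining row at the commas
small = np.loadtxt(sample, delimiter=",", skiprows=1)

print(small)
print(small.shape)  # (number of rows, number of columns)
```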

Slicing data from an array

Numpy arrays can be sliced using ARRAYNAME[<STARTROW>:<ENDROW>, <STARTCOL>:<ENDCOL>] for example:

  • Taking the value in the 1st row (i=0) and third column (j=2):

filedata[0, 2]

91.0

  • Taking the value in the second row (i=1), from the second (j=1) to third (j=2) column:

filedata[1, 1:3] #note the end position is not included

array([ 3.5, 86. ])

  • Values before the fourth column (j=3) in the last row:

filedata[-1, :3]

array([ 4.3,  2. , 85. ])

  • The entire second column (j=1):

filedata[:, 1]

array([1.2, 3.5, 4.7, ..., 3.4, 2.8, 2. ])

  • Note that an empty value in an A:B specifier means the slice runs from the start (if A is empty) or to the end (if B is empty) of the row or column.
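The same slicing patterns can be checked on a small array whose values are easy to follow. The array below is made up for the purpose and simply holds the numbers 0 to 11 in three rows and four columns.

```python
import numpy as np

# A made-up 3 x 4 array:
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]
grid = np.arange(12).reshape(3, 4)

single = grid[0, 2]       # first row, third column
part_row = grid[1, 1:3]   # second row, second to third column (end not included)
last_row = grid[-1, :3]   # last row, everything before the fourth column
column = grid[:, 1]       # every row of the second column

print(single, part_row, last_row, column)
```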


Example: Calculating the average temperature:

import numpy as np

filedata = np.loadtxt("Files/weatherdata.csv", delimiter=",", skiprows=1)
temperatures = filedata[:, 0] # take the first (temperature) column

average = temperatures.mean()
print(f"Average temperature: {average:.2f} degrees C")

Saving Numerical Only Data

We can instead write the output to a file using np.savetxt(FILENAME, DATA, <keyword=options>) with the keyword options:

  • delimiter="," to use commas in the .csv file (if there are multiple columns)

  • fmt="%.1f" to format the numbers as floats with 1 decimal place (similar to the f-string formatting used above).

np.savetxt("temperatures.csv", temperatures, delimiter = ",", fmt="%.1f")

The file contents look like this:

2.5
5.7
7.9
.
.
.
  • Download it and open it in Excel
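The delimiter option only becomes visible when there is more than one column. As a sketch, the snippet below saves a small made-up two-column array and also uses the header and comments keywords of np.savetxt to write a title row; the file name and column titles are assumptions for the illustration.

```python
import numpy as np

# Made-up data: two columns, two rows
table = np.array([[2.5, 1.2],
                  [5.7, 3.5]])

np.savetxt(
    "two_columns_demo.csv",   # hypothetical file name
    table,
    delimiter=",",            # put a comma between the columns
    fmt="%.1f",               # one decimal place for every value
    header="temp,dew",        # text written as the first line
    comments="",              # do not put "# " in front of the header
)

with open("two_columns_demo.csv", "r") as f:
    saved = f.read()
print(saved)
```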

Exercise: Deviations from the average.

  1. Load the weather data from the file.

  2. Slice out the first column of temperatures.

  3. Take the average value of the temperatures.

  4. Subtract the average value from the temperatures array to give the deviations (\(d = T - \mu(T)\))

  5. Save this back to a file called "deviations.csv" with 1dp floating point precision

import numpy as np

# load the weather file
# YOUR CODE HERE

# slice column 0
# YOUR CODE HERE

# obtain the mean
# YOUR CODE HERE

# calculate the deviations
# YOUR CODE HERE

# save back to a CSV text file
# YOUR CODE HERE
  • Opening the file should have the contents:

-7.7
-4.5
-2.3
...


Processing Multiple Files

Scripts can allow you to process many files in one go. You can split up a single file into many, join data from lots of files into one place or plot data to a range of figures in one go.

Using the os module.

A useful module for working with our operating system is the os module.
This allows us to see or change our current working location as well as make new folders.
To view your current working directory use the os.getcwd() function:

import os
myWD = os.getcwd()

print(myWD)

The os.listdir(FOLDER) function allows us to list the contents of a directory (FOLDER).

  • Note that when referring to the current folder we can use the string "." and for the one above we can use "..".

Try the following command:

contents = os.listdir(myWD)  # lists the contents of the working directory
print(contents)
os.listdir("Files") # this will only work if the `Files` folder exists
  • The os.mkdir(<FOLDERNAME>) function tries to create a new directory named with whatever string you replace <FOLDERNAME> with.

  • We use try: and except: to catch and ignore the error raised if the folder already exists.
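As an alternative sketch, os.makedirs with exist_ok=True creates the folder if it is missing and does nothing if it is already there, so no error handling is needed. The folder name DemoFolder is made up for this illustration.

```python
import os

folder = "DemoFolder"  # hypothetical folder name

os.makedirs(folder, exist_ok=True)  # creates the folder the first time
os.makedirs(folder, exist_ok=True)  # already exists now, so nothing happens and no error is raised

folder_exists = os.path.isdir(folder)
print(folder_exists)
```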

Weekly Weather files:

import numpy as np
import os

filedata = np.loadtxt("Files/weatherdata.csv", delimiter=",", skiprows=1)
temperatures = filedata[:, 0]

folder = "Weather/"
try: os.mkdir(folder) # try to make the new folder
except FileExistsError: pass # if the folder already exists move on

hours_per_week = 24*7

# count for 52 weeks:
for i in range(52):
    start_hour = i*hours_per_week #takes values: 0, hours_per_week, 2*hours_per_week, ...
    end_hour = start_hour + hours_per_week
    weekly_temperatures = temperatures[start_hour:end_hour] #slice out the hours for that week
    
    # make a two digit week number 01, 02, 03, ..., 50, 51, 52
    week_number = i+1
    week_string = str(week_number).zfill(2) # fill with leading zeros to make all two digits
    
    # create a new file
    newfilename = f"temp_week{week_string}.txt"
    filepath = folder+newfilename
    np.savetxt(filepath, weekly_temperatures, fmt="%.1f")

The Weather folder should now contain 52 files, each with a week’s worth of hourly temperature data as a single column.

wfiles = os.listdir("Weather")
print(wfiles)
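The leading zeros added by zfill matter when the file names are later sorted, because names are compared character by character. This short sketch compares padded and unpadded versions of three illustrative week numbers.

```python
week_numbers = (1, 2, 10)  # three illustrative week numbers

unpadded = [f"temp_week{n}.txt" for n in week_numbers]
padded = [f"temp_week{str(n).zfill(2)}.txt" for n in week_numbers]

sorted_unpadded = sorted(unpadded)  # week 10 ends up before week 2
sorted_padded = sorted(padded)      # weeks stay in numerical order

print(sorted_unpadded)
print(sorted_padded)
```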

Task 7: File Manipulation (2%)

For this task you will read in a CSV energy file and produce a new file of daily totals.

  • Use the previous examples to guide you in solving the various parts of this problem.

  • Take it step by step. Do the first step and check it works before doing anything else and so on.

  • Use lots of print() functions when developing your code, but remove them in the final version.

Task: Energy Data

The data in the file: Files/houseenergy.csv is in the following format:

month   day   hour   elec   gas
1       1     1      0      0.746
1       1     2      0      0.672
1       1     3      0      0.075

For the task you must do the following:

  1. Load the numerical data into a Numpy array

    • Make sure you load the file from a folder called Files with an uppercase F or it will fail on the grading server.

  2. Slice out the columns for Electricity and Gas

  3. Add them together to work out the total energy each hour

  4. Open a file to write the new data to

  5. Work out the hourly totals for each day (use a similar method as in the weekly weather example)

  6. Sum them to give the total energy for that day

  7. Write this sum to a line of the outfile inside the for loop, using syntax like:

outfile = ??? # open the outfile to write 

for day in range(365):
    start_hour = ???
    end_hour = ???
    daily_energy = ???
    ??? # write the daily_energy to the outfile
    
???  # close the outfile
    

Or alternatively:

???: # use the WITH method to open the outfile
    for day in range(365):
        start_hour = ???
        end_hour = ???
        daily_energy = ???
        ??? # write the daily_energy to the outfile
  • You will need to format the lines properly to have a new line after each value.

The plain text file "energy_totals.txt" should contain simple numerical lines like:

81.121
79.466
108.238
...
  • Note: this is rounded to 3 d.p.

  • There should be nothing but a column of numbers

    • no commas, no text, no units…

  • Hint: you will need to use the newline specifier "\n" when writing lines individually.

import numpy as np

# Import file to data array
# YOUR CODE HERE

# Slice out the columns for Electricity and Gas
# YOUR CODE HERE

# Add them together to work out the total energy each hour
# YOUR CODE HERE

# open an outfile 
    # for each day of the year
    #   work out the hourly totals for each day
    #   sum them to get the daily total
    #   writing each day's single total one per line
# YOUR CODE HERE

The script below checks if your code has produced the desired output file and its contents are as expected.

import sys
sys.path.append(".checks")
import check07
    
try: student_file = np.loadtxt("energy_totals.txt")
except OSError as e: print(e, 
    "\nDid you save the data to the correct file in the current working folder?")
except: pass

check07.test()








Extra Example: Plotting Data from a Set of Files

1. Plotting a figure for each file

  • study the code below and try to understand what is happening on each line.

import os, matplotlib.pyplot as plt, numpy as np

dirname = "Weather/"
file_list = os.listdir(dirname)

try: os.mkdir("WeatherFigs")  # make a folder for the figures
except FileExistsError: pass  # carry on if it already exists

fig,ax = plt.subplots(figsize=(10,5))
for file_name in sorted(file_list):
    data = np.loadtxt(dirname+file_name)
    ax.plot(data)
    ax.set_xlabel("Time (h)")
    ax.set_ylabel(r"Temperature ($^\circ$C)")  # raw string so that \c is not treated as an escape
    figname = file_name.replace(".txt", ".png")
    fig.savefig("WeatherFigs/"+figname)
    ax.cla() # clear the axis content to start a new graph
  • Now look in the new WeatherFigs folder to see the image files.

2. Collecting data from many files into one figure

First we will get the data from all the files created earlier and put each week’s average into a new list (the example continues below):

import os
import numpy as np

dirname = "Weather/"
file_list = os.listdir(dirname)
avdata = []

# sort the file list so the weeks are in order (needed later)
for file_name in sorted(file_list):
    weekdata = np.loadtxt(dirname+file_name)
    av = np.mean(weekdata)
    avdata.append(av)

print(avdata)

Continued… Next we:

  1. Calculate the average over the whole year (the average of all the weekly averages).

  2. Loop through each week, checking whether it is above or below the yearly average, and then

  3. Plot them as red points if hotter and blue points if colder.

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10,5))

yearav = np.mean(avdata)

wnum = 0  # counter for week number
for weekav in avdata:
    wnum = wnum + 1
    if weekav >= yearav:
        #plot in red
        ax.plot(wnum, weekav, "ro")
    else:
        #plot in blue
        ax.plot(wnum, weekav, "bo")
        
#add some formatting and an average line then save figure
textstring = rf"mean = {yearav:.1f}$^\circ$C"  # raw f-string so that \c is not treated as an escape
ax.text(1, yearav+0.2, textstring, size=14)
xpts = [0, 52]  # two points for x values for the average line
ypts = [yearav, yearav]  # two points for y values for the average
ax.plot(xpts, ypts, "g-")  # plot the average line in green
ax.axis([0, 52, 0, 20])
ax.set_xlabel("Week Number")
ax.set_ylabel("Temperatures (Deg C)")
fig.savefig("weekly_temperatures.png")
fig.show()
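The loop above tests each week in turn. With NumPy arrays the same separation can be sketched without a loop by using a boolean mask, which is an alternative technique and not the method used in the example. The four weekly averages below are made-up values standing in for avdata.

```python
import numpy as np

# Made-up weekly averages standing in for avdata
weekly = np.array([4.0, 6.0, 10.0, 12.0])
weeks = np.arange(1, len(weekly) + 1)  # week numbers 1, 2, 3, 4

year_average = weekly.mean()
hotter = weekly >= year_average        # True where the week is at or above the average

hot_weeks = weeks[hotter]              # week numbers that would be plotted in red
cold_weeks = weeks[~hotter]            # week numbers that would be plotted in blue

print(year_average, hot_weeks, cold_weeks)
```

The two selections could then each be passed to a single ax.plot call, one with "ro" and one with "bo".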