Be sure to do all exercises and run all completed code cells.
If anything goes wrong, restart the kernel (in the menubar, select Kernel\(\rightarrow\)Restart).
Manipulating Data and Files¶
About files¶
In this Section we will learn how to write and read data to and from files using Python.
We can then use this data to do useful things like plotting or calculations.
To use a file it has to be opened, and when finished it has to be closed.
While the file is open, it can either be read from or written to.
To open a file, we specify its name and indicate whether we want to read or write.
Files for Examples:¶
These examples, and the later ones, need some files to process. These are in the Files folder, so make sure you download its contents if you are working on your own machine.
Make sure they are inside correctly named sub-folders on your hard disk.
Use the same capitalisation as the original for cross-platform compatibility (many operating systems are case sensitive).
To find out where a Jupyter Notebook is running run the following command:¶
import os
here = os.getcwd() #get location of Current Working Directory
print(here)
You can then navigate to this folder using the file-browser on your operating system.
Organising Folders¶
When we create a new file by opening it and writing, the new file goes
in the current working folder (also called a “directory”).
When we open a file for reading, Python also looks for it in the
same place.
If we want to open a file somewhere else, we have to specify the path to the file, which is the name of the directory (or folder) where the file is located.
Usually we use relative paths, for example when the file is in a subfolder of the current working directory: Files/Work/mynotes.txt.
The directory above is ../, the one above that is ../../, etc.
Use the same case (uppercase, lowercase, etc.) as the original names. This may not matter on Windows but will affect the script on other operating systems and when the work is marked.
A full (“absolute”) Windows path might be "c:/Users/Nick/words.txt" or "c:\\Users\\Nick\\words.txt", but do NOT use these in shared or submitted files; only use them if the script will only ever run on a single machine.
Because backslashes are used to escape things like newlines and tabs, we need to write two backslashes in a string to get one real backslash \. To avoid this we can use a forward slash instead (like web addresses).
We cannot use / or \ as part of a filename; they are reserved as delimiters between directory names and filenames.
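As a minimal sketch of the path rules above, the standard os.path functions can be used to build and inspect paths. The folder and file names are only illustrative, and nothing is opened or created:

```python
import os

# A relative path built from its parts; os.path.join inserts the
# separator that is correct for the operating system in use.
relative = os.path.join("Files", "Work", "mynotes.txt")
print(relative)

# The same relative path written by hand with forward slashes,
# which Python accepts on Windows, macOS and Linux alike.
by_hand = "Files/Work/mynotes.txt"

# os.path.abspath shows the full path that a relative path refers to,
# starting from the current working directory.
print(os.path.abspath(by_hand))

# Two backslashes in an ordinary string give one real backslash each.
windows_style = "Files\\Work\\mynotes.txt"
print(windows_style)     # Files\Work\mynotes.txt
print(len("\\"))         # 1
```

Writing paths with forward slashes, or building them with os.path.join, avoids having to double every backslash.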
Writing our First File¶
To open a file we use the open(NAME, MODE) function, which takes two arguments.
The first is the name of the file, and the second is called the mode.
Mode "w" means that we are opening the file for writing.
With mode "w", if there is no file named workfile.txt on the disk, it will be created.
If there already is one, it will be replaced by the file we are writing (be careful not to overwrite important files!).
myfile = open("workfile.txt", "w")
print(myfile)
The variable named myfile acts like a container holding all the contents copied to memory from the file on disk.
You can move information from the container a piece at a time, or all at once until it is empty.
You can then use different methods on the file, using a dot as in FILE.METHOD, and this makes changes to the myfile container.
The following line writes text into the file using the write method. The file is written to the current working folder, where the programme script is run from (see the section below on Using the os module for more on working folders).
myfile.write("This is a test...")
Try opening the new file from the main JupyterHub file menu or on your disk using the file browser.
Notice that it will be empty.
This is because the text has not yet been saved to disk (like editing but not pressing save in a word processor).
To send any data to the file before we are finished we can use the flush method.
myfile.flush()
Now try reopening the file to see if the contents have appeared.
Using the close method will both save the data and close the file.
To write something other than a string, it needs to be converted to a string first:
value = 42
s = str(value)
myfile.write(s)
myfile.close()
Reopen the file and see the new contents.
Note: if you try to write something other than a string, or a variable that does not exist, you will get an error.
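The behaviour described in this section can be sketched in one self-contained cell. The file name demo.txt and the scratch folder made with tempfile are assumptions for illustration only, chosen so that no existing file is touched:

```python
import os
import tempfile

# Work in a scratch folder so that no existing files are overwritten.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "demo.txt")

# First version of the file.
f = open(path, "w")
f.write("first version")
f.close()

# Opening the same name with mode "w" again starts from an empty file.
f = open(path, "w")
f.write("value = ")
f.write(str(42))          # numbers must be converted to strings first
try:
    f.write(42)           # not a string, so this raises a TypeError
except TypeError:
    print("write() only accepts strings")
f.close()

# Read the file back to see what is really on disk.
f = open(path, "r")
contents = f.read()
f.close()
print(contents)           # value = 42
```

The text "first version" has gone: the second open with mode "w" replaced it.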
The with statement¶
A safe way of writing to files is to use the with statement followed by an indented block of code that does something with the file.
This is safe because you do not need to call close() at the end: the file is closed, and its contents written, whether the block exits normally or not. This means you do not lose any work in progress.
The general syntax for opening the file FILENAME and assigning it to a FILEHANDLE (like FILEHANDLE = open(FILENAME, MODE)) is:
with open(FILENAME, MODE) as FILEHANDLE:
    DO FUN STUFF
    FILEHANDLE.write()
    OTHER STUFF
    ...
Note in the example below the “triple quotes” allow us to use a multi-line string.
textlines="""A bit of text I want to write to a file.
1. The first point I want to make is this.
2. Next I want to tell you that..."""
fname="outfile.txt"
with open(fname,"w") as f:
    f.write(textlines)
Just to prove that the file has been written and closed properly, we read its contents back (reading is covered in the next subsection):
print(open(fname,"r").read())
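To illustrate the claim that the with statement closes the file even when something goes wrong, here is a small sketch that raises an error on purpose inside the block. The file name with_demo.txt and the scratch folder are assumptions for the example:

```python
import os
import tempfile

# A scratch file, so that nothing in the working folder is changed.
path = os.path.join(tempfile.mkdtemp(), "with_demo.txt")

try:
    with open(path, "w") as f:
        f.write("saved before the error\n")
        raise ValueError("something went wrong")   # deliberate error
except ValueError:
    pass    # the error is caught here, after the with block has cleaned up

# The with block closed the file for us, even though it ended in an error.
closed_after_error = f.closed
print(closed_after_error)     # True

# The text written before the error is safely on disk.
with open(path, "r") as f:
    saved = f.read()
print(saved, end="")          # saved before the error
```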
The write
method can be used repeatedly to print more output to the file, as
shown in lines 2, 3 and 4 below.
In bigger programs, lines 2-4 will usually be replaced by a loop that writes many more lines into the file.
with open("test.txt", "w") as mynewfile:
    mynewfile.write("# A bit of text I want to write to a file.")
    mynewfile.write("1. The first point I want to make is this.")
    mynewfile.write("2. Next I want to tell you that...")
Now look in your working folder and you will be able to open the new file test.txt.
You will notice that the text is all on one line. This is because we didn’t tell Python to start a new line.
We do this using the special character "\n" (backslash n), which stands for “newline”.
This is like pressing enter or return on the keyboard. Other special characters exist, such as "\t" for tab.
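A quick way to get a feel for these special characters is to print a string that contains them. The names and numbers in this sketch are made up:

```python
# One string holding three lines, with a tab between the two columns.
text = "name\tscore\nAnna\t12\nBen\t9\n"
print(text)

# "\n" and "\t" each count as a single character.
print(len("\n"), len("\t"))     # 1 1

# repr() shows the special characters instead of acting on them.
print(repr(text))

# splitlines() breaks the text at each newline and drops the "\n".
rows = text.splitlines()
print(len(rows))                # 3
```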
Exercise: Try again, but put a \n at the end of each string (inside the quotes).
#with opener
# YOUR CODE HERE
# YOUR CODE HERE
# YOUR CODE HERE
# YOUR CODE HERE
Now reload the file in a browser to see it.
It should look like this:
# A bit of text I want to write to a file.
1. The first point I want to make is this.
2. Next I want to tell you that...
You will need the correct code run above for the following exercises to work!
Reading from Files¶
If a file exists on our disk, we can open it for reading.
This time, the mode argument is "r"
for read:
thefile = open("test.txt", "r")
print(thefile)
This returns a file object (the “container”), not the text itself, so printing it shows a description of the open file rather than its contents.
However, if we try to open a file for reading that doesn’t exist, we get an error:
anotherfile = open("iamnotafile.txt", "r")
We can write some code to catch such errors (so-called “exceptions”) and stop them from halting the program with an error message.
# This tries to open the file and prints a message if it cannot be found
fname = "iamnotafile.txt"
try: open(fname, "r")
except FileNotFoundError: print(f'File "{fname}" could not be opened.')
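The same idea can be wrapped in a small function that returns a fallback value when the file is missing. This is only a sketch: read_text_or_default is a hypothetical helper name, not part of Python, and FileNotFoundError is the built-in exception raised when a file opened for reading does not exist:

```python
def read_text_or_default(fname, default=""):
    """Return the contents of fname, or default if the file cannot be found."""
    try:
        with open(fname, "r") as f:
            return f.read()
    except FileNotFoundError:
        # Only a missing file is handled here; other errors still show up.
        print(f'File "{fname}" could not be found.')
        return default


# Assumes there is no file called iamnotafile.txt in the working folder.
result = read_text_or_default("iamnotafile.txt", default="(nothing)")
print(result)
```

Naming the exception, rather than catching everything, means that unrelated mistakes such as a misspelt variable name are not silently hidden.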
File Methods¶
There are a variety of methods for reading data and text from files, depending on the format of the data and how we want to use it. We can either read the contents of the whole file at once, or scan it in line-by-line.
Reading the Whole File¶
The read method returns the entire contents of the file, emptying the whole “container” into the string called this at the same time:
thefile = open("test.txt", "r")
this = thefile.read()
print(this)
If we try to read it a second time, the file container called thefile
is now empty, so will return an empty string:
that = thefile.read()
print(that)
Reading Files a Line at a Time¶
Another file method is readline, which scans the contents line-by-line:
f = open("test.txt", "r")
print(0, f.readline()) # This will read the first line of the file.
print(1, f.readline()) # and next the second line
print(2, f.readline()) # and the last line
print(3, f.readline()) # the handle (container) is now empty...
The end=""
argument prevents an extra newline being added:
f = open("test.txt", "r")
print(0, f.readline(), end="") # This will read the first line of the file.
print(1, f.readline(), end="") # and next the second line
print(2, f.readline(), end="") # and the last line
print(3, f.readline(), end="") # the handle (container) is now empty...
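When the number of lines is not known in advance, readline can be called until it returns an empty string. The sketch below assumes a small scratch file (lines.txt, created only for this example):

```python
import os
import tempfile

# Make a small three-line file to read back.
path = os.path.join(tempfile.mkdtemp(), "lines.txt")
with open(path, "w") as f:
    f.write("first\nsecond\nthird\n")

count = 0
with open(path, "r") as f:
    line = f.readline()
    while line != "":                  # an empty string means end of file
        print(count, line, end="")
        count = count + 1
        line = f.readline()

print("lines read:", count)            # lines read: 3
```

A blank line inside the file is read as "\n", not as an empty string, so it does not stop the loop early.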
Iterative Methods for Scanning Files¶
A file can also be iterated over in the same way as a list.
Exercise: Put the previous code in a loop using for i in range(3): and use the counter i in place of the numbers.
# open the file
# YOUR CODE HERE
#FOR loop condition:
# YOUR CODE HERE
#single print function
# YOUR CODE HERE
#end of loop
Alternatively you can just iterate over the file contents directly:
the_file = open("test.txt", "r")
for each_line in the_file:
    print(each_line, end="")
Reading a File in to a List of Lines¶
It is often useful to fetch data from a disk file and turn it into a
list of lines.
We can then perform useful tasks on this list.
The readlines method in line 2 reads all the lines and returns a list of the strings.
f = open("test.txt", "r")
list_of_lines = f.readlines()
print(list_of_lines)
print("The last line is:\n", list_of_lines[-1])
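Each string returned by readlines still ends with its newline character. A minimal sketch of removing them, using a made-up scratch file (fruit.txt) so that it runs on its own:

```python
import os
import tempfile

# Make a small file to read back.
path = os.path.join(tempfile.mkdtemp(), "fruit.txt")
with open(path, "w") as f:
    f.write("apple\nbanana\ncherry\n")

with open(path, "r") as f:
    list_of_lines = f.readlines()
print(list_of_lines)      # ['apple\n', 'banana\n', 'cherry\n']

# rstrip("\n") removes the newline from the right-hand end of each string.
clean = []
for line in list_of_lines:
    clean.append(line.rstrip("\n"))
print(clean)              # ['apple', 'banana', 'cherry']
```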
Sorting a File¶
This example sorts the lines of a file alphabetically (with capitalised words first).
We will use the friends.txt
file we downloaded, which has a name per line.
Take a look at this file using a normal text editor.
First we read everything into a list of lines, then sort the list, and then write the sorted list back to another file:
thefile = open("Files/friends.txt", "r")
contentlist = thefile.readlines()
thefile.close()
print(contentlist)
Now we can sort the list alphanumerically:
contentlist.sort() # sort is a list method, guess what it does before running this cell
list(contentlist) # spot the difference!
Now write back to another file:
with open("sortedfriends.txt", "w") as outfile:
    for next_entry in contentlist:
        outfile.write(next_entry)
Open the file using the browser to see its contents.
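The “capitalised words first” behaviour comes from the way Python compares characters. As a sketch with a made-up list of names (not the contents of friends.txt), the built-in sorted function can also be told to ignore case:

```python
# Lines as they might come back from readlines(), newlines included.
names = ["bob\n", "Alice\n", "carol\n", "Dave\n"]

# Default order: all capital letters sort before all lowercase letters.
default_order = sorted(names)
print(default_order)     # ['Alice\n', 'Dave\n', 'bob\n', 'carol\n']

# key=str.lower compares lowercase copies, so case is ignored.
ignoring_case = sorted(names, key=str.lower)
print(ignoring_case)     # ['Alice\n', 'bob\n', 'carol\n', 'Dave\n']

# sorted() returns a new list; the original list is left unchanged.
print(names)             # ['bob\n', 'Alice\n', 'carol\n', 'Dave\n']
```

The list method sort accepts the same key argument, for example contentlist.sort(key=str.lower), and changes the list in place.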
Exercise: File Reversing¶
Write a program that reads a file and writes out a new file with the lines in reversed order (i.e. the first line in the old file becomes the last one in the new file.)
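Hint: use reverse indexing with:
range(-1, -N-1, -1)
(the end value of a range is not included, so it must be -N-1 for the first line, at index -N, to be reached).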
#open the input file
# YOUR CODE HERE
# read into a list (called contents) using readlines
# YOUR CODE HERE
N = len(contents) # may be needed later...
# open your new file to write to, using a WITH statement:
# YOUR CODE HERE
## EVERYTHING INDENTED IN THE WITH BLOCK ##
# loop backwards FOR the LENgth of the contents list obtained above
# YOUR CODE HERE
# write lines back to your file
# YOUR CODE HERE
## END OF THE WITH BLOCK ##
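As a rough sketch of how negative indices walk backwards through a list (using a made-up three-item list rather than a file), note where the range has to stop for the first item to be included:

```python
contents_demo = ["one\n", "two\n", "three\n"]
N = len(contents_demo)

# The end value of a range is never included, so stopping at -N is one short.
print(list(range(-1, -N, -1)))        # [-1, -2]

# Stopping at -N-1 reaches index -N, which is the first item in the list.
indices = list(range(-1, -N - 1, -1))
print(indices)                        # [-1, -2, -3]

for i in indices:
    print(i, contents_demo[i], end="")
```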
Filtering a File¶
Many useful line-processing programs will read a text file
line-at-a-time and do some minor processing as they write the lines to
an output file. They might number the lines in the output file, or
insert extra blank lines after every 60 lines to make it convenient for
printing on sheets of paper, or extract some specific columns only from
each line in the source file, or only print lines that contain a
specific substring.
We call this kind of program a filter.
Here is a filter that copies one file to another, omitting any lines that begin with #.
Have a look at the contents of the file Files/intext.txt to see what it looks like before running the next cell.
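oldfilename = "Files/intext.txt"
newfilename = "outtext.txt"
# open both files in one with statement so that both are closed automatically
with open(oldfilename, "r") as infile, open(newfilename, "w") as outfile:
    for text in infile:
        if text[0] != "#":  # keep only the lines that do not start with "#"
            outfile.write(text)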
Look at outtext.txt to see what this did to the data.
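Another of the filters mentioned above, one that numbers the lines, might be sketched as follows. The input file is created inside the example (the names in.txt and numbered.txt are assumptions) so that the cell runs on its own:

```python
import os
import tempfile

# Create a small input file in a scratch folder.
folder = tempfile.mkdtemp()
oldfilename = os.path.join(folder, "in.txt")
newfilename = os.path.join(folder, "numbered.txt")
with open(oldfilename, "w") as f:
    f.write("# a comment\nalpha\nbeta\n")

# Copy every line to the output file with its line number in front.
# Both files are opened in one with statement and closed automatically.
with open(oldfilename, "r") as infile, open(newfilename, "w") as outfile:
    number = 0
    for text in infile:
        number = number + 1
        outfile.write(f"{number}: {text}")   # text already ends with "\n"

with open(newfilename, "r") as f:
    numbered = f.read()
print(numbered, end="")
```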
Methods such as sorting and filtering are also very useful on numerical data.
Manipulating Numerical Data¶
Numerical data can be read in, processed and written to files in the same way as text.
Instead of using text methods we simply perform mathematical operations on the numbers.
Import of data using NumPy¶
Numerical data manipulation is made easier by importing and using the numerical module NumPy.
NumPy also has special methods for saving and loading purely numerical array data to and from files.
The following programme loads a column of numbers from the file "temps.txt" into an array using the ARRAY=np.loadtxt(FILENAME) function.
import numpy as np
data = np.loadtxt("Files/temps.txt")
print(data)
We can then take the mean (average) and standard deviation of the data using the .mean() and .std() array methods:
help(np.mean)
m = data.mean()
s = data.std()
print(f"mean = {m:.1f}, standard deviation = {s :.1f}")
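The same steps can be tried without any file on disk. In this sketch io.StringIO makes a string behave like an open file, which np.loadtxt accepts; the four numbers are made up. It also shows that .std() divides by N by default, while ddof=1 gives the sample standard deviation, which divides by N-1:

```python
import io
import numpy as np

# A string standing in for a one-column data file (made-up numbers).
fake_file = io.StringIO("1.0\n2.0\n3.0\n4.0\n")
data = np.loadtxt(fake_file)

m = data.mean()
s_pop = data.std()           # default ddof=0: divides by N
s_samp = data.std(ddof=1)    # ddof=1: divides by N-1

print(f"mean = {m:.2f}")                # mean = 2.50
print(f"std (ddof=0) = {s_pop:.3f}")    # std (ddof=0) = 1.118
print(f"std (ddof=1) = {s_samp:.3f}")   # std (ddof=1) = 1.291
```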
Export of Data using Numpy¶
Continuing the last example we will now:
Calculate the difference of the individual temperature values from the mean.
Save these to a new file using the np.savetxt(FILENAME, DATA) function.
devs = data-m # a new array with the differences
print(devs) #print the array
np.savetxt("deviations.txt", devs)
Look at the contents of the file from the file browser.
Manipulating CSV files¶
Numerical columns of data are commonly stored as Comma Separated Value files with a .csv extension.
The file Files/weatherdata.csv
contains hourly weather data in plain text, with values separated by commas (or other so called “delimiters”).
The file looks like this in a plain text editor:
Dry Bulb Temperature {C}, Dew Point Temperature {C}, Relative Humidity {%}, ...
2.50E+0, 1.2, 91, ...
5.70E+0, 3.5, 86, ...
7.90E+0, 4.7, 80, ...
... , ... , ...,
But when opened in Excel it appears as a table like this:
Dry Bulb Temperature {C} | Dew Point Temperature {C} | Relative Humidity {%} | … |
---|---|---|---|
2.5 | 1.2 | 91 | … |
5.7 | 3.5 | 86 | … |
7.9 | 4.7 | 80 | … |
8.7 | 5.6 | 81 | … |
8.9 | 5.8 | 81 | … |
… | … | … | … |
The np.loadtxt() function can read the values into a data array if we tell it how the data is laid out in the .csv file.
The keyword argument delimiter=<SOME_STRING> tells NumPy how the values are separated (without this option it assumes whitespace).
The keyword argument skiprows=1 is used to ignore the first row, which is the non-numerical header.
import numpy as np
# delimiter=',' tells .loadtxt that the values are separated with commas (rather than spaces)
filedata = np.loadtxt("Files/weatherdata.csv", delimiter=",", skiprows=1)
print(filedata) #will show only the head (top rows) and tail (bottom rows) of a long array
Each row is an hour’s weather data, with the temperatures in the first column.
Slicing data from an array¶
Numpy arrays can be sliced using ARRAYNAME[<STARTROW>:<ENDROW>, <STARTCOL>:<ENDCOL>], for example:
Taking the value in the first row (i=0) and third column (j=2):
filedata[0, 2]
91.0
Taking the values in the second row (i=1), from the second (j=1) to the third (j=2) column:
filedata[1, 1:3] #note the end position is not included
array([ 3.5, 86. ])
Values before the fourth column (j=3) in the last row:
filedata[-1, :3]
array([ 4.3, 2. , 85. ])
The entire second column (j=1):
filedata[:, 1]
array([1.2, 3.5, 4.7, ..., 3.4, 2.8, 2. ])
Note that an empty value in an A:B specifier runs from the start (if A is empty) or to the end (if B is empty) of the row or column.
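The slicing rules can be checked on a small array typed in by hand. The sketch below uses the first three rows and columns of the sample shown earlier, so it does not need the weather file:

```python
import numpy as np

# Three rows (hours) by three columns (temperature, dew point, humidity).
table = np.array([[2.5, 1.2, 91.0],
                  [5.7, 3.5, 86.0],
                  [7.9, 4.7, 80.0]])

print(table.shape)       # (3, 3)
print(table[0, 2])       # 91.0  (first row, third column)

second_row_part = table[1, 1:3]   # second row, columns 1 and 2 (3 is excluded)
last_row_start = table[-1, :2]    # last row, columns before column 2
first_column = table[:, 0]        # every row, first column

print(second_row_part)
print(last_row_start)
print(first_column)
```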
Example: Calculating the average temperature:¶
import numpy as np
filedata = np.loadtxt("Files/weatherdata.csv", delimiter=",", skiprows=1)
temperatures = filedata[:, 0] # take the first (temperature) column
average = temperatures.mean()
print(f"Average temperature: {average:.2f} degrees C")
Saving Numerical Only Data¶
We can instead write the output to a file using np.savetxt(FILENAME, DATA, <keyword=options>) with the keyword options:
delimiter="," to use commas in the .csv file (if there are multiple columns)
fmt="%.2f" to format the numbers as floats with 2 decimal places (the cell below uses "%.1f" for 1 decimal place; the notation is similar to the format specifiers used in the f-strings above).
np.savetxt("temperatures.csv", temperatures, delimiter = ",", fmt="%.1f")
The file contents look like this:
2.5
5.7
7.9
.
.
.
Download it and open it in Excel.
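A small round trip shows what the delimiter and fmt options do when there is more than one column. The values and the file name two_columns.csv are assumptions for this sketch, and the file is written to a scratch folder:

```python
import os
import tempfile
import numpy as np

values = np.array([[2.5, 1.2],
                   [5.74, 3.46],
                   [7.9, 4.7]])

path = os.path.join(tempfile.mkdtemp(), "two_columns.csv")
np.savetxt(path, values, delimiter=",", fmt="%.1f")

# Look at the raw text that was written.
with open(path, "r") as f:
    text = f.read()
print(text, end="")

# Load it back: the numbers now only have the precision that was saved.
reloaded = np.loadtxt(path, delimiter=",")
print(reloaded.shape)     # (3, 2)
print(reloaded[1, 0])     # 5.7
```

Note that 5.74 comes back as 5.7: formatting with fmt rounds the values in the file, so choose enough decimal places for the data.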
Exercise: Deviations from the average.¶
Load the weather data from the file.
Slice out the first column of temperatures.
Take the average value of the temperatures.
Subtract the average value from the temperatures array to give the deviations (\(d = T - \mu(T)\)).
Save this back to a file called "deviations.csv" with 1 d.p. floating point precision.
import numpy as np
# load the weather file
# YOUR CODE HERE
# slice column 0
# YOUR CODE HERE
# obtain the mean
# YOUR CODE HERE
# calculate the deviations
# YOUR CODE HERE
# save back to a CSV text file
# YOUR CODE HERE
Opening the file should have the contents:
-7.7
-4.5
-2.3
...
Processing Multiple Files¶
Scripts can allow you to process many files in one go. You can split up a single file into many, join data from lots of files into one place or plot data to a range of figures in one go.
Using the os module¶
A nice module for working with our operating system is the os module.
This allows us to see or change our current working location as well as make new folders.
To view your current working directory use the os.getcwd() function:
import os
myWD = os.getcwd()
print(myWD)
The function os.listdir(FOLDER) allows us to list the contents of a directory (FOLDER).
Note that when referring to the current folder we can use the string "." and for the one above we can use "..".
Try the following commands:
contents = os.listdir(myWD) # lists the contents of the working directory
print(contents)
os.listdir("Files") # this will only work if the `Files` folder exists
The function os.mkdir(<FOLDERNAME>) tries to create a new directory named whatever string you replace <FOLDERNAME> with.
We use try: and except: to catch and ignore any errors, such as the folder already existing.
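The folder-creation pattern can be sketched as below. It works inside a scratch folder made with tempfile (an assumption for the example) and also shows os.makedirs with exist_ok=True, a standard alternative that does not raise an error if the folder is already there:

```python
import os
import tempfile

base = tempfile.mkdtemp()                 # scratch location for the example
folder = os.path.join(base, "Weather")

# First attempt: the folder does not exist yet, so it is created.
try:
    os.mkdir(folder)
except FileExistsError:
    pass

# Second attempt: the folder exists, so os.mkdir raises FileExistsError.
try:
    os.mkdir(folder)
    created_again = True
except FileExistsError:
    created_again = False
print(created_again)                      # False

# Alternative: makedirs with exist_ok=True is silent if the folder exists.
os.makedirs(folder, exist_ok=True)
print(os.path.isdir(folder))              # True
```

Catching FileExistsError rather than every error means that a different problem, such as not having permission to write, is still reported.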
Weekly Weather files:¶
import numpy as np
import os
filedata = np.loadtxt("Files/weatherdata.csv", delimiter=",", skiprows=1)
temperatures = filedata[:, 0]
folder = "Weather/"
try: os.mkdir(folder) # try to make the new folder if it doesn't exist
except: pass # if the folder already exists move on
hours_per_week = 24*7
# count for 52 weeks:
for i in range(52):
    start_hour = i*hours_per_week #takes values: 0, hours_per_week, 2*hours_per_week, ...
    end_hour = start_hour + hours_per_week
    weekly_temperatures = temperatures[start_hour:end_hour] #slice out the hours for that week
    # make a two digit week number 01, 02, 03, ..., 50, 51, 52
    week_number = i+1
    week_string = str(week_number).zfill(2) # fill with leading zeros to make all two digits
    # create a new file
    newfilename = f"temp_week{week_string}.txt"
    filepath = folder+newfilename
    np.savetxt(filepath, weekly_temperatures, fmt="%.1f")
The Weather folder should now contain 52 files, each with a week’s worth of hourly temperature data as a single column.
wfiles = os.listdir("Weather")
print(wfiles)
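The start and end arithmetic in the loop is easier to follow on a tiny array. This sketch uses twelve made-up values split into blocks of four instead of weeks of 168 hours, and only prints the results rather than saving files:

```python
import numpy as np

values = np.arange(12)       # made-up stand-in for hourly data: 0, 1, ..., 11
hours_per_block = 4          # a short "week" to keep the output readable

names = []
blocks = []
for i in range(3):
    start = i * hours_per_block
    end = start + hours_per_block
    block = values[start:end]                  # the end index is not included
    name = f"block{str(i + 1).zfill(2)}.txt"   # block01.txt, block02.txt, ...
    names.append(name)
    blocks.append(block.tolist())
    print(name, start, end, block)

print(names)     # ['block01.txt', 'block02.txt', 'block03.txt']
print(blocks)    # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
```

Because each slice stops just before its end index, consecutive blocks neither overlap nor leave gaps. The leading zeros from zfill keep the file names in the right order when they are sorted alphabetically.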
Task 7: File Manipulation (2%)¶
For this task you will read in a CSV energy file and produce a new file of daily totals.
Use the previous examples to guide you in solving the various parts of this problem.
Take it step by step. Do the first step and check it works before doing anything else and so on.
Use lots of print() functions when developing your code, but remove them in the final version.
Task: Energy Data¶
The data in the file Files/houseenergy.csv is in the following format:
month | day | hour | elec | gas |
---|---|---|---|---|
1 | 1 | 1 | 0 | 0.746 |
1 | 1 | 2 | 0 | 0.672 |
1 | 1 | 3 | 0 | 0.075 |
For the task you must do the following:
Load the numerical data into a Numpy array
Make sure you load the file from a folder called Files with an uppercase F, or it will fail on the grading server.
Slice out the columns for Electricity and Gas
Add them together to work out the total energy each hour
Open a file to write the new data to
Work out the hourly totals for each day (use a similar method as in the weekly weather example)
Sum them to give the total energy for that day
Write this sum to a line of the outfile inside the for loop, using syntax like:
outfile = ??? # open the outfile to write
for day in range(365):
    start_hour = ???
    end_hour = ???
    daily_energy = ???
    ??? # write the daily_energy to the outfile
??? # close the outfile
Or alternatively:
???: # use the WITH method to open the outfile
    for day in range(365):
        start_hour = ???
        end_hour = ???
        daily_energy = ???
        ??? # write the daily_energy to the outfile
You will need to format the lines properly to have a new line after each value.
The plain text file "energy_totals.txt"
should contain simple numerical lines like:
81.121
79.466
108.238
...
Note: this is rounded to 3 d.p.
There should be nothing but a column of numbers: no commas, no text, no units…
Hint: you will need to use the newline specifier "\n" when writing lines individually.
import numpy as np
# Import file to data array
# YOUR CODE HERE
# Slice out the columns for Electricity and Gas
# YOUR CODE HERE
# Add them together to work out the total energy each hour
# YOUR CODE HERE
# open an outfile
# for each day of the year
# work out the hourly totals for each day
# sum them to get the daily total
# writing each day's single total one per line
# YOUR CODE HERE
The script below checks if your code has produced the desired output file and its contents are as expected.¶
import sys
sys.path.append(".checks")
import check07
try: student_file = np.loadtxt("energy_totals.txt")
except OSError as e: print(e,
"\nDid you save the data to the correct file in the current working folder?")
except: pass
check07.test()
Extra Example: Plotting Data from a Set of Files¶
1. Plotting a figure for each file¶
Study the code below and try to understand what is happening on each line.
import os, matplotlib.pyplot as plt, numpy as np
dirname = "Weather/"
file_list = os.listdir(dirname)
try: os.mkdir("WeatherFigs")
except: pass
fig,ax = plt.subplots(figsize=(10,5))
for file_name in sorted(file_list):
    data = np.loadtxt(dirname+file_name)
    ax.plot(data)
    ax.set_xlabel("Time (h)")
    ax.set_ylabel(r"Temperature ($^\circ$C)") # raw string, so that \c is not read as an escape
    figname = file_name.replace(".txt", ".png")
    fig.savefig("WeatherFigs/"+figname)
    ax.cla() # clear the axis content to start a new graph
Now look in the new WeatherFigs folder to see the image files.
2. Collecting data from many files into one figure¶
First we will get the data from all the files created earlier and put each week’s average into a new list (the example continues below):
import os
import numpy as np
dirname = "Weather/"
file_list = os.listdir(dirname)
avdata = []
# sort the file list so the weeks are in order (needed later)
for file_name in sorted(file_list):
    weekdata = np.loadtxt(dirname+file_name)
    av = np.mean(weekdata)
    avdata.append(av)
print(avdata)
Continued… Next we:
Calculate the average over the whole year (the average of all the weeks’ averages).
Loop through the weeks, checking whether each one is above or below the yearly average, and then
Plot it as a red point if hotter and a blue point if colder.
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10,5))
yearav = np.mean(avdata)
wnum = 0 # counter for week number
for weekav in avdata:
    wnum = wnum + 1
    if weekav >= yearav:
        #plot in red
        ax.plot(wnum, weekav, "ro")
    else:
        #plot in blue
        ax.plot(wnum, weekav, "bo")
#add some formatting and an average line then save figure
textstring = rf"mean = {yearav:.1f}$^\circ$C" # raw f-string, so that \c is not read as an escape
ax.text(1, yearav+0.2, textstring, size=14)
xpts = [0, 52] # two points for x values for the average line
ypts = [yearav, yearav] # two points for y values for the average
ax.plot(xpts, ypts, "g-") # plot the average line in green
ax.axis([0, 52, 0, 20])
ax.set_xlabel("Week Number")
ax.set_ylabel("Temperatures (Deg C)")
fig.savefig("weekly_temperatures.png")
fig.show()