Workshop 3

Your assignment

A healthworker has been accused of deliberately spreading multi-drug resistant Staphylococcus aureus through a ward in a general hospital. Three patients who were infected with MRSA are suing the Healthcare Trust for substantial damages and they have turned to you for help.

Using the skills you have learnt in the earlier workshops, decide whether you think the strain of MRSA found on the healthworker is the same as that isolated from the patients. The marker protein we have used is Clumping factor A (or ClfA) which shows significant variability between strains. The DNA and protein sequences of ClfA from the three cases are given below:

ClfA protein sequence	ClfA DNA sequence
Healthworker	Healthworker
Patient A	Patient A
Patient B	Patient B
Patient C	Patient C

Having constructed your tree, submit it as part of a 1 page report (the Healthcare Trust are too busy to read anything longer), roughly one third of the page should be your introduction&methods, one third the results (the tree and associated legend), and the final third the discussion. As with the previous workshops, its the activity of producing your own tree that is the important part, so the report should not be long.

Hints:

Do not use Internet Explorer or Edge.
Find 12 to 16 more S. aureus clumping factor protein and DNA sequences - this is instead of sampling the community for local S. aureus strains.

Search the databases by protein sequence as you want to find sequences similar to those of the healthworker and the patients, so use BLAST .
You will need to do 4 searches, one for each of the healthworker and the 3 patients, using the ClfA protein sequences above.
You will need to choose 3 to 4 sequences from each search but do not choose just the top 4 for each search, but a selection of those with the lowest E-value (this may be 0.0) and longest Query coverage to 'sample the community'.

Once you have chosen a protein sequence, click on the Accession code on the right-hand-side to take you to the Genpept page for that sequence:

change its format to FASTA via the link at the top of the page, and save it into a Notepad (or Wordpad or TextEdit) plain text file.
Find the corresponding DNA sequence. The simplest way is via a CDS link if you go back to the protein Genbank file (usually just above the protein sequence as shown here), then by clicking on the FASTA link that appears at the bottom of the screen.
If your chosen sequence doesn't have a CDS link, it should have a Nucleotide link on the righthand side of the page as shown below.

The Nucleotide option may link to a number of sequences, if it does, choose the top one unless you have chosen it before.
If your chosen link takes you to a complete genome sequence, but without any sequences visible, use the Customize view option on the right-hand side to Customize the view to show Gene, RNA and CDS features only as shown below.

Once you have some annotation to search, use your web browser to find "clumping factor A" in the annotation and click the CDS link on the left-hand-side (at the top of this image) to extract just the portion that corresponds to ClfA

Either you will be taken to the genbank entry for the sequence and may need to change this to FASTA to save it, or if a pop-up appears on the right-hand-side, (as shown below) click on FASTA just below it to get to the corresponding nucleotide sequence

However you arrive at the nucleotide sequence, save it in FASTA format (with its header line) to a file. Do not worry if the title still says whole genome, you can see by the length of the FASTA DNA that it is only a small part of the genome, (the portion between the bases in the box on the top right), the section that corresponds to Clumping Factor A.

Some sequences may be reverse transcribed. If your protein sequence starts with M, check that your DNA sequence starts with ATG, and if it doesn't, check whether it ends with CAT (like the one below). If so, you probably need to obtain the reverse complement of your sequence using a tool such as this. Put your sequence into the box and it will reverse complement it. Put this sequence into your growing nucleotide file rather than the original, although you should use the original header.

Put all the protein sequences (retaining their header lines too) in FASTA format into one file, and all the DNA sequences (also in FASTA format, together with their header lines) in another, making sure you keep the protein and DNA sequences in the same order and that the files are in plain text format.
Once you have your complete sequence selection, make sure you have no duplicates, (this is easiest to do by looking for duplicates of the header lines). If you do, just delete them (from both files), you only need 6 to 8 well varied ones, together with those of the Healthworker and 3 patients to draw the tree.

Make a multiple protein sequence alignment using T-coffee.

When using Tcoffee, it helps to keep track of the correspondence of protein to nucleotide sequences if you set the output order to INPUT so that the aligned sequences remain in the same order.
Remember to include the Healthworker and Patient protein sequences that are given at the top of this page, or they will not appear on the tree.
If T-coffee claims there are 0 sequences to align, go back and try to align only 2. If that works, add the others back one at a time until you find the one that doesn't work. Remove that from both the protein and DNA files.

Overlay the nucleotide sequences on the protein sequence alignment using RevTrans as in workshop 2.

no warnings should be printed to the screen, any errors indicate that the particular pair of protein and DNA sequences don't correspond to one another.
If you get an error message about duplicates, check that you don't have any duplicate sequences. If some of your sequences are short, it sometimes gets confused, but if you move all the short ones to the end this often helps.
Remember to output your aligned nucleotides in FASTA format.

Make a phylogenetic tree from the nucleotide sequences.

You will have already downloaded SeaView in the previous workshop, so you don't need to do that again, just find the folder and run the executable there.
If seaview complains about (),:; in the sequence names (as below), open the aligned nucleotide file and replace the : by a _ or similar. Then read the file in again.

Only one tree is needed