Workshop 2

Once sequences have been extracted from databases by the methods we used in workshop 1, these need to be related to one another to extract evolutionary, structural, or functional information. Initially we’ll look at the simplest case, how two sequences are aligned, or pairwise alignment. Multiple sequence alignment often starts with pairwise alignment of all the sequences to identify the most similar pair, to which all the others are aligned in turn. We will look at ways of displaying and editing sequence alignments. Finally a multiple sequence alignment will be used in the production of phylogenetic trees.

Sequence alignments and phylogenetic trees.

During this workshop you will be asked to save a number of program outputs. To help you keep track, the output expected for each section is listed below.

Step                              Expected output
1. Sequence retrieval             list of sequences chosen
2. Pairwise sequence alignment    alignment statistics
3. Multiple sequence alignment    multiple alignment output
4. Alignment editing              (none)
5. Alignment presentation         MView and ESPript printouts
6. Nucleotide sequence alignment  (none)
7. Tree building                  two trees

As you work through this script you'll find a number of questions to answer; these are coloured in reverse so you'll notice them. For your submission, collate the output images together with your answers to the questions posed in this script.

1. Sequence retrieval

Last session we searched annotation at the European Bioinformatics Institute (EBI). This time we’ll use the NCBI annotation search engine so you can see how similar they are to use. Click on http://www.ncbi.nlm.nih.gov

When you have arrived, enter ‘d-lactate dehydrogenase’ in the box and press Search. Results are obtained in most databases searched, but we’re interested in Proteins.

Click on the Protein link within this Proteins section, for which there are about 150,000 hits.

On the next page, select five or six sequences using the square buttons on the left. Each should be from a different species (click on the hyperlinks if it isn’t apparent from the name). Avoid those that are precursors, or “hypothetical”, “predicted”, “probable” or only “dehydrogenase-like”, as these functions have been derived from sequence comparison, not experiment. It would make for a more interesting alignment if they were not all the same (or similar) length.

Note down the Accession code, GenBank identifier (which starts gi| ) and organism for each sequence that you choose.

You may need to click on the entry to find the organism name (on the line that starts ORGANISM).

Having selected your sequences, at the top (or bottom) of the page change the Summary pulldown to FASTA.


Then click Send to on the right hand side and in the Choose Destination pulldown select File and save to your network drive as my.fasta. If asked whether you want to open or save the file, choose Save As from the Save pulldown menu. Note that you may need to allow Internet Explorer to show pop-up windows.

This file can be opened in Notepad or Wordpad (do not use Word to open these files if you want to use them in a website later).
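If you would like to check the saved file outside a text editor, the FASTA layout (a header line starting with > followed by one or more sequence lines) is simple enough to read with a few lines of Python. The following is only a minimal sketch: the function name parse_fasta and the two example records are made up for illustration and are not real database entries.

```python
def parse_fasta(text):
    """Split FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical example text, standing in for the contents of my.fasta.
example = """>seq1 example protein [Organism one]
MKIAV
FGSR
>seq2 example protein [Organism two]
MKLAVYSTK
"""

for header, sequence in parse_fasta(example):
    print(header.split()[0], len(sequence))
```

To run it on your own file, replace example with the text read from my.fasta, e.g. open("my.fasta").read().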

2. Pairwise alignment

First we’ll look at aligning two of these sequences. In local pairwise alignment the two sequences are searched for the region of highest similarity, useful for looking for motifs in sequences, whereas global alignment seeks to find the optimal overall alignment. We will look at both of these.
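The difference between the two approaches can be seen in a small dynamic programming sketch. It is not the EMBOSS implementation: it uses a simple match/mismatch score and a single linear gap penalty (all values chosen for illustration) rather than a substitution matrix with separate gap open and extend penalties. Global alignment (Needleman-Wunsch) charges for gaps at the ends and reads the score from the final cell; local alignment (Smith-Waterman) resets negative scores to zero and takes the best cell anywhere.

```python
def alignment_score(a, b, match=2, mismatch=-1, gap=-2, local=False):
    """Best alignment score of a and b by dynamic programming.

    local=False gives a global (Needleman-Wunsch) score,
    local=True gives a local (Smith-Waterman) score.
    The scoring values are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for gaps at the start of either sequence.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            cell = max(diag, up, left)
            if local:
                cell = max(cell, 0)  # a local alignment may start afresh anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


long_seq, short_seq = "TTTTACGTTTTT", "GGACGGG"
print("global:", alignment_score(long_seq, short_seq))
print("local: ", alignment_score(long_seq, short_seq, local=True))
```

In this made-up example the two sequences share only the short motif ACG, so the local score reflects that motif alone while the global score is pulled down by the mismatches and gaps needed to span both sequences.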

Go to the European Bioinformatics Institute http://www.ebi.ac.uk/ and click on the Services header at the top.

The Institute provides an array of programs across Bioinformatics. Type alignment into the Find a data resource or a tool search box to list the various options for Sequence Alignment.

We want to do pairwise local alignment of proteins so choose EMBOSS Water.

Once in the EMBOSS Water window, copy two of the sequences from your Entrez window into the two sequence boxes (preferably two dissimilar in length), then Submit the sequences. (If there is a problem, try removing any gaps in the header line).

The program reports the alignment of highest score between the two sequences. Write down the length of the alignment, % identity, % similarity, the number of gaps in the alignment and the score.
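To see where these figures come from, the sketch below recomputes length, % identity and number of gap positions from two aligned rows. It assumes that percentages are taken over the full alignment length and that a gap is written as "-"; % similarity is left out because it depends on the scoring matrix. The function name and the example rows are hypothetical.

```python
def alignment_stats(row_a, row_b):
    """Length, % identity and gap count for two aligned rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    length = len(row_a)
    columns = list(zip(row_a, row_b))
    # A column is identical when both rows carry the same residue (not a gap).
    identical = sum(1 for x, y in columns if x == y and x != "-")
    # A column counts as a gap position when either row has a gap character.
    gaps = sum(1 for x, y in columns if x == "-" or y == "-")
    return {
        "length": length,
        "identity_pct": 100.0 * identical / length,
        "gaps": gaps,
    }


print(alignment_stats("MK-LV", "MKALI"))
```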

Alignment parameters

When two sequences are compared during the alignment process, it is more favourable for an amino acid to have been mutated to one with similar properties, e.g. one hydrophobic residue to another, than for a hydrophobic residue to have been substituted by a polar residue. This reasoning has produced a number of scoring matrices: tables giving a score, based on the observed probability of substitution, for each amino acid paired with itself or with any of the other 19. The score at each position is then incorporated into the final score for the alignment of the total sequence.
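As a rough illustration of how a scoring matrix is applied, the sketch below scores an ungapped alignment column by column. The six scores are intended to match the corresponding BLOSUM62 entries, but treat them as illustrative and check them against the full matrix; the function names are made up for this example.

```python
# A tiny excerpt of a substitution matrix (assumed to follow BLOSUM62).
# L and I are both hydrophobic; D is acidic and polar.
SCORES = {
    ("L", "L"): 4,
    ("I", "I"): 4,
    ("D", "D"): 6,
    ("L", "I"): 2,   # conservative hydrophobic swap: mildly favourable
    ("L", "D"): -4,  # hydrophobic to polar: penalised
    ("I", "D"): -3,
}


def pair_score(x, y):
    """Look up a symmetric score; raises KeyError for pairs not in the excerpt."""
    if (x, y) in SCORES:
        return SCORES[(x, y)]
    return SCORES[(y, x)]


def ungapped_score(row_a, row_b):
    """Sum the per-column scores of two equal-length, gap-free rows."""
    return sum(pair_score(x, y) for x, y in zip(row_a, row_b))


print(ungapped_score("LID", "LLD"))  # I aligned with L
print(ungapped_score("LID", "LDD"))  # I aligned with D
```

The first alignment scores higher than the second because swapping one hydrophobic residue for another is rewarded, while swapping a hydrophobic residue for a polar one is penalised.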

Return to the EMBOSS Water page. The parameters that can be changed are accessed under the Parameters section by clicking on the More options... link. Here it shows that we are using the BLOSUM62 Matrix.

Introduction of a gap into a sequence during its alignment incurs a penalty as this implies a sequence insertion or deletion, an evolutionary event. Once a gap is opened, extending it just implies a longer insertion or deletion, so the GAP EXTEND penalty is lower than the GAP OPEN penalty.
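The open/extend scheme is usually called an affine gap penalty. A minimal sketch follows, assuming the first gap position costs GAP OPEN and each further position costs GAP EXTEND; check the EMBOSS documentation for the exact convention. The penalty values in the example calls are only illustrative.

```python
def gap_cost(length, gap_open, gap_extend):
    """Affine penalty for one gap of the given length (assumed convention:
    the first position costs gap_open, each later one costs gap_extend)."""
    if length < 1:
        return 0.0
    return gap_open + (length - 1) * gap_extend


# One gap of four positions versus four separate one-position gaps.
print(gap_cost(4, gap_open=10, gap_extend=0.5))
print(4 * gap_cost(1, gap_open=10, gap_extend=0.5))
```

Under this scheme a single long gap is much cheaper than several short ones of the same total length, which matches the idea that one insertion or deletion event is more likely than many.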

Now change the GAP OPEN penalty to 20 and the GAP EXTEND penalty to 5. Re-run the alignment and write down the length of alignment, % identity, % similarity, score and number of gaps.

Describe what happens when the gap penalties are increased.

3. Multiple Sequence Alignment

Usually we have more than two related sequences extracted from databases and we need to align all of them to identify conserved residues etc. However finding the globally optimal alignment is computationally challenging, so many programs do a progressive alignment of sequences starting with the pair that are most alike. Programs are optimised for the size of alignment.
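The first step of a progressive alignment, finding the most similar pair, can be sketched as follows. This is not how T-Coffee or any EBI tool scores pairs: the similarity ratio from Python's difflib module is used here purely as a cheap stand-in for a proper pairwise alignment score, and the sequence names and residues are invented.

```python
from difflib import SequenceMatcher
from itertools import combinations


def most_similar_pair(sequences):
    """Return (name_a, name_b, ratio) for the most similar pair.

    sequences is a dict of name -> sequence string. The difflib ratio is
    a stand-in for a real alignment score, used only for illustration.
    """
    best = None
    for (name_a, seq_a), (name_b, seq_b) in combinations(sequences.items(), 2):
        ratio = SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()
        if best is None or ratio > best[2]:
            best = (name_a, name_b, ratio)
    return best


# Hypothetical toy sequences.
toy = {"a": "MKLAVY", "b": "MKLAVF", "c": "GGGSSS"}
print(most_similar_pair(toy))
```

A progressive method would align this closest pair first and then add the remaining sequences to the growing alignment in order of decreasing similarity.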

You can judge which program to use at the EBI website https://www.ebi.ac.uk/jdispatcher/msa. As we have only a small alignment we will use T-Coffee.

Click on the Launch T-Coffee link on the msa page.

Copy and Paste each sequence either from the saved sequence file or from the NCBI Sequence Viewer window (section 1) into the box in this T-Coffee window, including their header lines which begin something like:

>gi|126036|sp|P26297|LDHD_LACDE ………

Sometimes Internet Explorer or Edge don't work at this point. If this is the case, use Firefox, Chrome or another browser.

Scroll back up the box to check that the title line of each sequence is separated from the previous sequence.

Now look at the parameters being used for the alignment by clicking on the More options... button within the Parameters section. Several parameters used by the program are listed here; the one you might want to change is the output ORDER. The default is to output the sequences grouped by similarity, but if you have arranged them in a particular order, you might want to change this to input.

Press Submit.

When the results appear, you will see the alignment of your sequences coloured by amino acid type. Click on the Tool Output tab where you can see the whole alignment in short blocks. Download the alignment, renaming it to my.aln and, if you are asked, Save as type All Files (not a web file format).

Include the aln file in your submission (use Courier font and a small enough font size that the lines aren’t wrapped).

4. Editing alignments - for information only, not directly available here.

There are many tools available to do this, including an applet version of Jalview available on the EBI website. However, you can't use it without installing the software on your computer, so this section is included for completeness and you can ignore it for the purposes of the workshop.

With Jalview you can:

  • change the order of the sequences within the alignment (click on a sequence, then move it up or down with the cursor keys)
  • add or remove gaps from your sequence alignment, (press F2 to enter “keyboard mode” then X<space> (i.e. a number followed by the space bar) inserts X gaps at the cursor position, X<delete> deletes X gaps).
  • edit the sequence names using a menu brought up with the right mouse button. Choose the lines with the sequence identifying code on it, then the Edit name/description option. More readily understandable names would be useful if you were going to use these alignments in a phylogenetic tree for instance.
  • colour the alignment by a number of different criteria, which might be useful in presenting sequence alignments in the most appropriate manner for a research project.

    5. Alignment Presentation

    You will use two methods for presenting alignments, emphasising sequence similarity according to different criteria.

    The first, MView, colours a previously calculated alignment by sequence identity. The second, ESPript, allows annotation of sequence alignments by secondary structure (if the structure of the protein of interest has been determined) and more user customisation.

    i) MView

    The MView program, as provided by EBI, is able to colour previously-calculated multiple sequence alignments by sequence identity and will allow you to change the layout for better printing.

    If you still have your multiple sequence alignment browser window, click on Send to MView on the Results Viewers tab. Otherwise, in a new browser window click on http://www.ebi.ac.uk/Tools/msa/mview .

    Here copy your complete alignment file into the box, or upload the file that you saved. Leave the Parameters as AUTOMATIC.

    For now just press Submit.


    The result shows the amino acids coloured by their properties, e.g. polar residues are light blue with white text. Below the alignment are the consensus sequences, highlighting areas of sequence similarity in all the sequences, calculated at different percentage cutoffs.
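The idea of a consensus calculated at a percentage cutoff can be sketched in a few lines. This is a simplification and not MView's own algorithm, which is richer than reporting a single residue per column; the function name, the "." placeholder and the toy alignment are assumptions made for this example.

```python
from collections import Counter


def consensus(rows, cutoff):
    """Simplified consensus: report the commonest residue in a column when
    its frequency reaches the cutoff (0 to 1), otherwise write '.'."""
    out = []
    for column in zip(*rows):
        residue, count = Counter(column).most_common(1)[0]
        if residue != "-" and count / len(column) >= cutoff:
            out.append(residue)
        else:
            out.append(".")
    return "".join(out)


# Hypothetical four-sequence alignment.
toy_rows = ["MKLV", "MKIV", "MRLV", "MKL-"]
for cutoff in (1.0, 0.7):
    print(f"{int(cutoff * 100)}%", consensus(toy_rows, cutoff))
```

Lowering the cutoff lets more columns contribute a consensus residue, which is why several consensus lines at different cutoffs are shown.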

    The default layout is for an alignment 80 characters wide, but this doesn't make optimal use of the space. Also the many lines of Consensus are distracting.

    Return to the MView input window and click on More options under Parameters.

    Here you could change the output format of the alignment if you need to for subsequent analysis packages, whilst on the second row you can change how the alignment will look on the page. On the third row, Colormap gives options for highlighting different physical properties of the amino acids, while under Consensus you can toggle the presence or absence of a 'Consensus' sequence underneath the alignment.

    Choose a COLORMAP by which to colour the sequence.

    Change the ALIGNMENT WIDTH to 100 and the CONSENSUS to OFF and re-submit.

    Copy the resulting alignment into your submission file in Word, noting the colormap used. (You may need to change the size of the window or reduce the zoom to capture it all in one print-screen)

    ii) ESPript

    This is a more powerful display utility, but as with many bioinformatics websites, it doesn't work well with Internet Explorer or Edge.

    Click on http://espript.ibcp.fr/

    As the illustration shows, if you know the structure of one of the sequences, you can display secondary structure at the top of your alignment. As well as the shading of similar residues carried out as in Boxshade, areas of similarity are surrounded by a blue box.

    Click on Run ESPRIPT.

    In the Aligned sequences section, click Browse to search for the my.aln file.

    If we knew the structure for one of our proteins, we could annotate the sequence alignment with the secondary structure by uploading the co-ordinate file in the Secondary structure depiction box. In the Sequence similarities box the scoring function could be changed.

    Change the Output layout to Landscape with 8 point font and 100 columns (Col). By default the output will be as postscript and pdf files, but you can specify other formats.

    In the upper bar click on SUBMIT.

    It may be that your browser will block the opening of the Results window, and you will need to allow it to open pop-ups on your screen. This may be as simple as clicking on the notification bar and selecting Always allow Pop-Ups from this site. To do this in advance, in Chrome you might select Tools then Pop-up Blocker then Always allow pop-ups from this site. Alternatively, in Firefox select Tools then Options. Click on Exceptions next to Block popup windows. Type espript.ibcp.fr into the box on the next window and click Allow. For Safari, select Settings from the drop-down menu on the Safari tab at the top of the screen. On the new tab, click on Websites and scroll down to Pop-up Windows under the 'General' sub-heading. From the list of websites you have open, click on espript.ibcp.fr and change the preference from Block and Notify to Allow.

    It is then best to start ESPRIPT afresh with a new tab.

    When successful, a pdf link will appear in the Results window; clicking on it should show you your output. If this looks reasonable, return to the ESPript window and select another output format such as png or tiff, which will enable you to include it in your submission file.

    6. Nucleotide Sequence Alignment

    Rather than protein sequences, it is better to use DNA sequences for phylogenetics because protein sequences haven't got information on silent nucleotide mutations. For successful molecular phylogenetic analysis, the sequences must be well aligned. However in general, it is better to use protein sequences for multiple alignments, because nucleotide sequences have only four bases, making many alignment possibilities and confusing the alignment program. So the strategy for making aligned nucleotide sequences is:

    (i) obtain the nucleotide and corresponding protein sequences.

    (ii) align the protein sequences

    (iii) take the original nucleotide sequences and overlay them onto the protein sequence alignment
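Step (iii) can be sketched as below: walk along one aligned protein row, writing three gap characters for every protein gap and the next codon for every residue. This is only an outline of the idea under simple assumptions (the coding sequence starts in frame at the first residue and has no introns); it is not the RevTrans implementation, and the function name is hypothetical.

```python
def overlay_codons(aligned_protein, nucleotides):
    """Thread a coding sequence onto one row of a protein alignment.

    Each residue consumes the next three nucleotides; each '-' in the
    protein row becomes '---' in the nucleotide row.
    """
    residues = sum(1 for symbol in aligned_protein if symbol != "-")
    if len(nucleotides) < 3 * residues:
        raise ValueError("nucleotide sequence too short for this protein")
    out = []
    position = 0
    for symbol in aligned_protein:
        if symbol == "-":
            out.append("---")
        else:
            out.append(nucleotides[position:position + 3])
            position += 3
    return "".join(out)


# Toy example: protein row M-K with coding sequence ATG AAA.
print(overlay_codons("M-K", "ATGAAA"))
```

Because gaps are only ever inserted in whole codons, the reading frame of every nucleotide sequence is preserved in the resulting alignment.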

    Obtaining all these sequences is too time-consuming for this workshop, so two sample files (unaligned nucleotide, and aligned protein) have been prepared for you.

    (i) The protein sequences were downloaded into one file, in FASTA format, as we did in step 1, then aligned by T-Coffee and saved to myprotaln.fasta.

    (ii) The nucleotide sequences corresponding to each of the protein sequences were downloaded by following the links from the protein pages, and saved into a file mynuc.fasta.

    Aligning DNA sequences to a protein alignment with RevTrans

    We will use RevTrans to do the DNA alignment. Again this does not work in IE or Edge, so try Chrome or Firefox. http://services.healthcare.dtu.dk/services/RevTrans/ .

    Copy the nucleotide and aligned protein sequences from the links above into the appropriate boxes on the RevTrans website. Again, copying from Internet Explorer or Edge often doesn't work, so try Chrome or another browser.

    Click on Submit query.

    When the results page appears, click on Download alignment in FASTA format. Then either right click and choose 'Save Page as', making sure you save it with type All Files, or copy the resulting frame into a Notepad text file and save it as a Text file. In either case, save it to a suitable folder as e.g. mynucalign.fasta

    Download the phylogenetics software SeaView

    To make the phylogenetic trees we will use the molecular phylogeny program SeaView.

    This can be downloaded from http://doua.prabi.fr/software/seaview.

    Click on the appropriate download for your computer, selecting Save when asked. Once it has downloaded, extract the program to a suitable location on your computer or university H: drive (do this even if you are offered the chance to Run it straight away).

    Run the program from that folder.

    Drawing phylogenetic trees

    From the File menu, use Open FASTA to open the aligned nucleotide file you saved earlier.

    The alignment should appear in the window. Take a screenshot of this and consider whether the alignment produced by this process appears sensible.

    There are three methods for Tree Building offered by SeaView.

  • First, Parsimony, which uses PHYLIP to return the consensus of the most parsimonious trees found.
  • Second, Distance methods. Distance methods use attributes of the sequences to calculate the evolutionary distance between all pairs of sequences in the set. This relies on a definition of the mechanism of evolution, i.e. a model for the mutation of nucleotides over time.
  • Third, the method of PhyML (Maximum Likelihood) is computer intensive, but it uses an explicit statistical model of sequence evolution to find the tree under which the observed sequences are most probable, which generally gives a more reliable result. This requires another download, so we won't be using it today.

    Other terms you will encounter include:

    Bootstrapping which involves randomly resampling the data used to create the tree, in order to place confidence limits on the position of each subtree. 86% means that in 86 out of 100 resamples, the subtree fell in the same place.

    Jumble is used in Maximum Parsimony and Maximum Likelihood, because these methods are dependent on the order that the sequences are listed. Jumble causes the calculation to be carried out several times in different orders, and chooses the best resulting tree.
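The resampling step behind bootstrapping can be sketched as follows: columns of the alignment are drawn at random, with replacement, until a new alignment of the same width is built, and a tree would then be calculated from each resampled alignment. The sketch covers only the resampling, not the tree building, and the function name and toy alignment are made up for illustration.

```python
import random


def bootstrap_alignment(rows, rng):
    """Return one bootstrap replicate: alignment columns drawn with
    replacement, keeping the original number of columns."""
    width = len(rows[0])
    picks = [rng.randrange(width) for _ in range(width)]
    return ["".join(row[i] for i in picks) for row in rows]


# Hypothetical three-sequence nucleotide alignment.
toy_alignment = ["ACGT", "ACGA", "TCGT"]
rng = random.Random(1)  # fixed seed so the example is repeatable
for replicate in range(3):
    print(bootstrap_alignment(toy_alignment, rng))
```

Every column in a replicate is a whole column of the original alignment, so the sequences stay aligned with one another; only the mix of columns changes from replicate to replicate.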

    In tree pictures, the distance between the branches means nothing; it is the distance along the branches which describes the evolutionary distance between the sequences. The distance between any two sequences is the sum of the branch lengths connecting them, given in arbitrary units.

    From the Trees menu, select Parsimony. In the new menu select Bootstrap then click on OK.
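To make the pairwise distances used by distance methods more concrete, the sketch below calculates the observed proportion of differing sites (the p-distance) and corrects it with the Jukes-Cantor model, one of the simplest models of nucleotide substitution. Whether this is the model you pick in SeaView is up to you; the function names and toy sequences are assumptions for illustration.

```python
import math


def p_distance(row_a, row_b):
    """Proportion of differing sites, ignoring columns that contain a gap."""
    pairs = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(1 for x, y in pairs if x != y) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance: d = -3/4 * ln(1 - 4p/3)."""
    if p == 0:
        return 0.0
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)


p = p_distance("ACGTACGT", "ACGTACGA")
print(p, round(jukes_cantor(p), 4))
```

The corrected distance is always at least as large as the observed one because the model allows for sites that have changed more than once.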

    In the new tree window, click on Br(anch) lengths and Br(anch) support so they are shown on the tree. Note that you might need to increase the size of the window to reduce overlap of the numbers.

    Save the result (either press <Alt><Print Screen> and then paste into your Word document, or save to a PDF using the File menu).

    Try a different method (under Distance methods) for calculating a new tree and compare results.

    Hand in both the trees and comment on the similarities or differences between them.

    Having practised sequence alignments and the production of phylogenetic trees, we will be using these methods in a problem next week.