Revision session

In this exercise we revisit some of the techniques used in the earlier workshops and extend them to include protein structure modelling.

Steps involved in this workshop

1. Finding homologous protein sequences

2. Aligning protein sequences

3. Editing sequence alignments

4. Displaying alignments with ESPript

5. Finding homologous structures and 3D structure modelling

1. Finding similar sequences

Starting with a protein sequence, you can find similar sequences either by searching by name using Entrez, as we did in workshop 2, or by searching with a sequence using a program such as BLAST, which we did in workshop 1 and will repeat here.

Our example sequence is that of Bacillus licheniformis nitroreductase:

>B.licheniformis nitroreductase
mteqskkqeildafqfrhatkefdpdrkisdedfqfileagrlspssvglepwqfvvvqn
kelreklrqvswgaqgqlptashfvlllgrtakemrrdsgyvadqlkhvkkmpediienm
lkedgvlesfqdgdfhlyesdramfdwvskqtyialanmmtaaaligidscpiegfnydk
vhdilekegvledgrfdisvmaafgyrvkeprpktrraldqivkwve
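
If you would rather handle the sequence in a script than copy it by hand, the FASTA record above can be read with a few lines of Python. This is only a minimal sketch: the function name parse_fasta and the variable QUERY_FASTA are made up for this example and are not part of any workshop software.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined without breaks
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# The example sequence from this page (variable name is hypothetical).
QUERY_FASTA = """>B.licheniformis nitroreductase
mteqskkqeildafqfrhatkefdpdrkisdedfqfileagrlspssvglepwqfvvvqn
kelreklrqvswgaqgqlptashfvlllgrtakemrrdsgyvadqlkhvkkmpediienm
lkedgvlesfqdgdfhlyesdramfdwvskqtyialanmmtaaaligidscpiegfnydk
vhdilekegvledgrfdisvmaafgyrvkeprpktrraldqivkwve
"""

for header, sequence in parse_fasta(QUERY_FASTA):
    print(header, "-", len(sequence), "residues")
```

The header line is kept separate from the sequence, which is handy later when a site asks for the sequence without its header.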

Go to the BLAST home page: http://blast.ncbi.nlm.nih.gov/Blast.cgi

We are going to search the protein database, because searching with protein sequences (an alphabet of 20 amino acids) is more specific than searching with DNA sequences (4 bases). Choose the protein blast option under Basic BLAST, about halfway down the page.
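
The point about 20 amino acids versus 4 bases can be made concrete with a little arithmetic. The sketch below assumes, as a simplification, that every letter is equally likely, which is not true of real sequences; the function name is made up for this example.

```python
def chance_match_probability(alphabet_size, length):
    """Probability that two random sequences of the given length are identical,
    assuming (unrealistically) that all letters are equally frequent."""
    return (1.0 / alphabet_size) ** length


# Compare DNA (4 letters) with protein (20 letters) for a few word lengths.
for length in (5, 10, 20):
    dna = chance_match_probability(4, length)
    protein = chance_match_probability(20, length)
    print(f"length {length}: DNA {dna:.2e}, protein {protein:.2e}")
```

For any given length the protein figure is far smaller, so a match of that length is much less likely to have arisen by chance.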

Click in the search box below the Enter Query Sequence header and paste your protein sequence into the box.

The box should now look something like this:

Leave the Database choice (under Choose Search Set) as Non-redundant protein sequences (nr) (from all databases).

Click the BLAST! button to search the database for similar sequences.

A new screen appears telling you that your protein search has been queued and giving its “request ID”. A little later it will also display any conserved domains identified through sequence similarity.
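
The request ID is also what you would use if you ran the search from a script rather than the web form. NCBI provides a URL interface to BLAST that takes parameters such as CMD, PROGRAM, DATABASE, QUERY and RID. The sketch below only builds the requests and does not send anything; the helper names are hypothetical, and you should check the NCBI usage guidelines before automating searches.

```python
from urllib.parse import urlencode

BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_submit_params(sequence, program="blastp", database="nr"):
    """Form fields for submitting a search (CMD=Put). Nothing is sent here."""
    return {
        "CMD": "Put",
        "PROGRAM": program,    # protein-protein BLAST, as chosen on the web page
        "DATABASE": database,  # non-redundant protein sequences
        "QUERY": sequence,
    }


def build_result_url(request_id, format_type="Text"):
    """URL for fetching the results of a queued search by its request ID."""
    query = urlencode({"CMD": "Get", "RID": request_id, "FORMAT_TYPE": format_type})
    return BLAST_URL + "?" + query


# "EXAMPLERID" is a placeholder, not a real request ID.
print(build_submit_params("mteqskkqeildafqfrhat"))
print(build_result_url("EXAMPLERID"))
```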

When, at last, the results arrive, at the top of the page is the "conserved domains" chart originally shown on the previous page. Clicking on this chart gives more detailed information in a new window:

BLAST Conserved Domains

The uppermost grey bar represents the sequence you submitted, underneath which the triangles represent important binding residues. Below these, the large red bar shows that the sequence has a high level of similarity with the NfsB-like Nitroreductase family, as we already knew. Hovering the mouse over any of the triangles or bars underneath gives more information. Clicking on the bars gives another screen with sequence alignments and trees of related sequences.

Back on the main results page the next section shows a graphical representation of the 100 most similar sequences from the database overlaid on our query sequence.

Blast bar chart

Again the top bar represents the sequence submitted, with the bars below representing matched sequences, coloured by the alignment score, with red being the highest. The length of the bar indicates the region over which there is similarity with the sequence you submitted. In the example above I have hovered my mouse over the top bar to bring its description up in the box at the top.

Below this chart come the names of the 100 similar database sequences (called "hits"), listed in descending order of similarity score.

blast table

The most important score to look at for similarity is the E-value (expect value): the number of hits with a score at least as good as this one that you would expect to find by chance in a database of this size. The lower the E-value, the more significant the match, so the top result, with the lowest E-value, is the sequence we submitted. A more detailed explanation of E-values is given in the BLAST help pages.
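
To get a feel for how the E-value behaves, it helps to know that for a hit with bit score S the expected number of chance hits is approximately m × n × 2^(-S), where m is the query length and n the total length of the database. The sketch below uses that relationship; BLAST itself uses corrected "effective" lengths, so the numbers will not match the results page exactly, and the function name and the database size are made up for this example.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E-value from a bit score: E = m * n * 2**(-S)."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Each extra bit of score halves the E-value; a longer query or a bigger
# database raises it. The database length here is an arbitrary illustration.
for score in (30, 50, 100):
    print(score, expect_value(score, query_length=227, database_length=1e9))
```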

Further down the results page are given the alignments of the database hits against the query sequence.

Choose five or six of the proteins from this bottom part of the results page by clicking in the boxes to the left of the sequence identifiers.

Choose sequences with a variety of E-values for a more interesting sequence alignment. Once you have chosen your sequences, go to the bottom of the page and click on Get Selected sequences.

A new page appears listing the sequence ID and a short description of each sequence. Change the format to FASTA in the Display box at the top of the page to show the sequences. It is easier to copy them to another website if you then change the Send to box to Text, which shows the sequences without coloured headers.
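
Picking hits "with a variety of E-values" can be done by eye, but the idea is simply to sample evenly down the ranked list rather than taking the top few. A minimal sketch of that idea follows; the function name and the hit identifiers are invented, and in practice you would also look at what the proteins are before choosing them.

```python
def select_spread(hits, count):
    """Pick `count` hits spread evenly through the list ranked by E-value.

    `hits` is a list of (identifier, e_value) tuples.
    """
    ranked = sorted(hits, key=lambda hit: hit[1])  # best (lowest E-value) first
    if count >= len(ranked):
        return ranked
    if count <= 1:
        return ranked[:max(count, 0)]
    step = (len(ranked) - 1) / (count - 1)
    return [ranked[round(i * step)] for i in range(count)]


# Hypothetical hits, for illustration only.
example_hits = [("hit%d" % i, 10.0 ** (-100 + 10 * i)) for i in range(11)]
for identifier, e_value in select_spread(example_hits, 6):
    print(identifier, e_value)
```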

2. Protein Sequence Alignment

To align these sequences we’re going to use T-Coffee, one of several multiple sequence alignment programs available from the European Bioinformatics Institute (www.ebi.ac.uk), a site that offers many useful utilities. (T-Coffee can be found under Proteins on the Services pulldown menu.)

Clicking on the T-Coffee listing (on the EBI site above) takes us to a submission form. First put the B.licheniformis sequence at the top (so it is included in the alignment), then copy all the sequences from the NCBI BLAST page into the box at the bottom of the T-Coffee page and hit Run.
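
If you collect the sequences in a script rather than by copy and paste, putting the query first is a one-line job. The sketch below also skips any hit whose sequence is identical to the query, on the assumption that you do not want the same protein in the alignment twice; that is a design choice of this example, not something T-Coffee requires. The function names are hypothetical.

```python
def format_fasta(header, sequence, width=60):
    """Format one record as FASTA text with lines of at most `width` residues."""
    lines = [">" + header]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines)


def build_alignment_input(query, hits):
    """Return FASTA text with the query first, followed by the chosen hits.

    `query` and each entry of `hits` are (header, sequence) tuples.
    Hits with a sequence already seen (ignoring case) are skipped.
    """
    records = [query]
    seen = {query[1].upper()}
    for header, sequence in hits:
        key = sequence.upper()
        if key not in seen:
            seen.add(key)
            records.append((header, sequence))
    return "\n".join(format_fasta(h, s) for h, s in records) + "\n"


# Toy sequences, for illustration only.
print(build_alignment_input(("query", "MTEQSKK"), [("hitA", "mteqskk"), ("hitB", "MTDQAKK")]))
```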

After a while a Results page appears. You can save the alignment you've created by right-clicking on the hyperlink to the right of Alignment File and then choosing Save Target as. In the next popup window change the name to e.g. nitroreductase.aln, the file type to All Files, and save it to your H: drive.
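
The saved alignment is a plain text file in Clustal format: a header line beginning with CLUSTAL, then blocks in which each line holds a sequence name followed by a chunk of the aligned sequence, with a line of conservation symbols under each block. A minimal sketch for reading such a file into Python follows; it assumes the simple layout just described and the function name is made up.

```python
def parse_clustal(text):
    """Return {name: aligned_sequence} from Clustal-format alignment text."""
    alignment = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines separate the blocks
        if line.startswith("CLUSTAL"):
            continue  # header line
        if line[0].isspace():
            continue  # conservation line (*, : and . symbols)
        fields = line.split()
        if len(fields) < 2:
            continue
        name, chunk = fields[0], fields[1]
        # Chunks for the same name in later blocks are appended in order.
        alignment[name] = alignment.get(name, "") + chunk
    return alignment


# A tiny made-up alignment, for illustration only.
EXAMPLE = """CLUSTAL W (1.83) multiple sequence alignment

seqA    MTEQ-SKK
seqB    MTDQASKK
        ** * ***

seqA    QEIL
seqB    QDIL
        * **
"""

for name, aligned in parse_clustal(EXAMPLE).items():
    print(name, aligned)
```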

3. Editing the sequence alignment

An applet version of JalView is available on the T-Coffee results page. Click on the Start JalView box in the "Results of search" table.

This is a fairly limited editor, but it can be used for minor edits such as changing the sequence names, using a menu brought up with the right mouse button. Select the line with the sequence identifier on it, then choose the Edit name/description option. More readily understandable names would be useful if you were going to use these alignments in a phylogenetic tree, for instance.

4. Displaying alignments with ESPript

The web site for this program is http://espript.ibcp.fr/ESPript/ESPript/. Here, click on Execute.

At the top of the site it warns you to allow it to open pop-ups on your screen. To do this (in Internet Explorer), select Tools, then Pop-Up Blocker, then Pop-Up Blocker Settings. In the Pop-up Blocker Settings window, type the ESPript address (espript.ibcp.fr) into the allowed box, then click Add. Now we can continue...

In workshop 2 you were able to annotate your sequence alignments with secondary structure because one of your sequences corresponded to a known structure. You probably haven't any of these this time, so just put your sequence alignment file in the Main Alignment File box.

Click on

on the top toolbar. A new RESULTS window will appear; clicking on the PDF link in it should show you your output.

By default the output is given as a postscript and a pdf file, but at the bottom of the ESPript page you can specify other formats which are more readily inserted into Word documents.
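
What ESPript highlights is, at its simplest, the set of alignment columns in which every sequence has the same residue. The sketch below finds those strictly conserved columns in a list of aligned sequences. It is not how ESPript itself works (ESPript also scores similar residues), and the function name is hypothetical.

```python
def conserved_columns(aligned_sequences):
    """Return the 0-based indices of columns where all sequences are identical.

    Columns containing a gap ('-') are never counted as conserved.
    """
    if not aligned_sequences:
        return []
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("all aligned sequences must have the same length")
    conserved = []
    for i in range(length):
        column = {seq[i].upper() for seq in aligned_sequences}
        if len(column) == 1 and "-" not in column:
            conserved.append(i)
    return conserved


# Toy alignment, for illustration only.
print(conserved_columns(["MTEQ-SKK", "MTDQASKK", "MTEQ-SRK"]))
```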

5. Finding structural information on the sequence

(i) Search for the most homologous protein for which there is a crystal structure.

This can be done on the BLAST site as we did in workshop 1; here, however, you are asked to use the PDB site itself.

Go to the PDB website. In order to search for structures with similarity to a known sequence you need to click on Sequence above the search box.

Then paste the protein sequence from the top of this page into the box, this time without the header line. Another pulldown menu appears, asking which significance level you would like to use for the search. Choose the most stringent; you can always lower it if you get too few results. Then click the search button.

At the top of the results page are a number of tabs, categorising the matches found by a number of different criteria.
tabs showing results

The first tab corresponds to the 39 structures found; these are shown on the first results page in order of E-value, as on the BLAST website. However, there are also 13 papers in the literature, and 28 different ligands bound to these structures. The final two tabs group the structures together in terms of their fold; by clicking on either you find out that the majority of the results contain both alpha and beta secondary structure and fall within an NADH oxidase family.

Back on the Structure Hits page, a quick visual indication of the quality of the alignment is given for each similar structure by the bar below the thumbnail sketch of the structure: the longer the bar, the better the match, which is also indicated by its colour. The BLAST statistics and sequence alignments are also given.
blast result
Clicking on the thumbnail structure or the title brings up a page with much more detail, from which the structure can be downloaded or viewed.
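
Structures can also be fetched by script once you know the four-character PDB identifier. The sketch below only builds the download address and does not fetch anything. The address pattern on the RCSB file server is an assumption of this example (it is not given in the workshop text), and the function name is hypothetical.

```python
def pdb_download_url(pdb_id):
    """Return the address of the PDB-format file for a four-character PDB ID.

    The URL pattern is an assumption about the RCSB file server.
    """
    pdb_id = pdb_id.strip().upper()
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError("a PDB identifier has four letters or digits: %r" % pdb_id)
    return "https://files.rcsb.org/download/%s.pdb" % pdb_id


# 2HAY is the entry that turns up as the modelling template in part (ii).
print(pdb_download_url("2hay"))
```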

(ii) Modelling the 3D structure based on similar structures

Now we have a sequence alignment indicating conserved residues, and we know that there are structures that share significant sequence similarity with the B. licheniformis nitroreductase. The simplest way of modelling a structure where there is significant similarity to those in the PDB is to use Swiss-Model.

The SwissModel web page is broken into three panels: the main one initially provides an overview of the program, the lower left one gives more detailed help on aspects of the program, and the top left one provides links to the possible modelling programs. We’re going to use the simplest (n.b. in a detailed modelling project this would need to be followed with some of the other options). Click on First Approach Mode in the top left panel.

At the top of the frame put in your e-mail address and the title of your modelling session (something meaningful by which you’ll recognise the e-mail!). Then copy the B. licheniformis nitroreductase sequence from the top of this page into the sequence box.

Then Submit Modelling Request.

Another window appears, which tells you that your job has been submitted, then that it is running, and finally the results appear in the window.

The first stage in the modelling was "Template Selection". If you scroll down to this section of the results, you can see that no hits were found with more than 60% sequence identity, but one (2hayB) was found with 42% identity. This is the B molecule in the PDB entry 2HAY, which was also the top hit in your BLAST search of the PDB, which is reassuring.
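
The percentage identity quoted for a template comes from counting matching residues in the pairwise alignment. Definitions differ in what they divide by; the sketch below divides by the number of columns in which neither sequence has a gap, which may not be exactly what Swiss-Model reports. The function name is made up for this example.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of identical residues over columns without a gap in either sequence."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy alignment: three of the four gap-free columns match.
print(percent_identity("AC-GT", "ACTGA"))
```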

Just above this section of the website is the Modelling log which gives details of the modelling process, which may not be of interest at this stage in your knowledge of protein structure modelling. More useful may be the Evaluation section, which illustrates graphically which sections of the model may be reliable.

swiss model evaluation

By default the Anolea method is shown, assessing the environment of each residue, green being favourable, red unfavourable. (There is more information about all the plots if you click on the help links at each section.) Here you can see that the model isn't perfect in several regions.

Just above this is the sequence alignment. A good sequence alignment is crucial to an accurate model, so to improve this model you could try to optimise the alignment in the unfavourable areas of the plot and use your own alignment in another cycle of modelling.

At the top is the model itself, together with some important information: the amount of the sequence that was modelled, the template on which it was modelled and the BLAST statistics for that model. If you click on the picture, a simple viewer will appear, allowing you to visualise the model coloured from blue at the N-terminus to red at the C-terminus.

The coordinates of the model can also be downloaded, allowing you to view it in a more powerful viewer such as Rasmol or Deep View (also written by the Swiss Institute of Bioinformatics), in order to look at conserved residues identified from your multiple sequence alignment, or putative active site residues in more detail.
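
The downloaded coordinates are a text file in PDB format, in which each ATOM line holds one atom in fixed-width columns. If you want to pick out particular residues by script rather than in a viewer, a minimal sketch for reading the alpha-carbon (CA) atoms follows. It assumes a standard PDB-format file and ignores complications such as alternate locations and insertion codes; the function name is hypothetical.

```python
def read_ca_atoms(pdb_text):
    """Return one dictionary per alpha-carbon atom found in PDB-format text."""
    atoms = []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue  # skip headers, HETATM records, TER lines and so on
        if line[12:16].strip() != "CA":
            continue  # atom name is in columns 13-16
        atoms.append({
            "residue": line[17:20].strip(),  # residue name, columns 18-20
            "chain": line[21],               # chain identifier, column 22
            "number": int(line[22:26]),      # residue number, columns 23-26
            "xyz": (
                float(line[30:38]),          # x, columns 31-38
                float(line[38:46]),          # y, columns 39-46
                float(line[46:54]),          # z, columns 47-54
            ),
        })
    return atoms
```

With the list of CA atoms you can, for example, look up the residue numbers of the conserved positions from your multiple sequence alignment and check where they sit in the model.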

Summary

During these workshops you have used:

BLAST, (PSI-BLAST) and Entrez to obtain sequences of interest
Transeq to translate DNA to protein
Search on the PDB to obtain similar structures by name and by sequence (also possible on the protein BLAST page by changing the Database choice under Choose Search Set)
ClustalW to align sequences
Jalview to edit alignments
Boxshade and ESPript to colour the alignments by e.g. identity
Swiss-Model to model protein structure
Rasmol (and other tools on the PDB site) to view protein structures
aligndna to map a protein multiple sequence alignment onto a set of nucleotide sequences
phylo_win to draw phylogenetic trees
GCG for a sequencing project.