Introduction

 

Over the past five decades the use of computers has had a profound effect on research in the biological sciences. The amount of information available to researchers in databases (whether of sequences, structures, functional information, microarray data, etc., and whether freely available or commercial) increases almost exponentially, and biologists and computer scientists have come together to provide bioinformatics tools that help extract useful information from these databases. The aim of these sessions is to introduce you to some of the information and software resources available in the public domain.

In the first workshop we’ll look at how to search databases for sequences of interest. In the second workshop we’ll align sequences to extract useful information about their phylogenetic relationships. In the final workshop you’ll be given a chance to apply these techniques to a real problem.


Workshop 1: Searching sequence databases

Sequence databases exist for nucleic acids, proteins and complex carbohydrates. For nucleic acids and proteins the chemical structure is represented as a string of characters, such as ACCGTA for nucleic acids or DFGIMCR for proteins. Complex carbohydrates require a more complex representation, due to the alternative chemical linkages and branched structures. In addition, database entries include much more information, or annotation, which contains the biological, bibliographic and administrative context for the sequence.
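To make the idea of a sequence as a string of characters concrete, here is a minimal Python sketch (not part of the workshop tasks). The function name classify_sequence and the two alphabets are assumptions made for illustration; real database entries may also contain ambiguity codes that this sketch does not handle.

```python
# Alphabets assumed for illustration: the four unambiguous DNA bases and the
# twenty standard amino acids. Ambiguity codes (e.g. N or X) are not handled.
DNA_ALPHABET = set("ACGT")
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")


def classify_sequence(seq: str) -> str:
    """Guess whether a string of one-letter codes is nucleic acid or protein."""
    letters = set(seq.upper())
    if not letters:
        return "empty"
    # Every DNA letter is also a valid amino acid code, so test DNA first.
    if letters <= DNA_ALPHABET:
        return "nucleic acid"
    if letters <= PROTEIN_ALPHABET:
        return "protein"
    return "unknown"


print(classify_sequence("ACCGTA"))   # the nucleic acid example from the text
print(classify_sequence("DFGIMCR"))  # the protein example from the text
```

Note that a short peptide made only of A, C, G and T would be reported as nucleic acid by this sketch, which is one reason database entries carry annotation stating the molecule type.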

For nucleic acids, there are three major public domain databases: ENA (European Nucleotide Archive, from EMBL-EBI), GenBank (from NCBI, USA) and DDBJ (DNA DataBank of Japan). All three exchange information daily, so that they are essentially identical.

These databases can be searched for sequence entries of interest in two ways. You can search the annotation of the individual sequences for key words or phrases ('annotation searching'), or you can screen all, or part, of the database using your own nucleic acid or protein sequence ('sequence searching').

Using GABA transporters as an example, this workshop introduces annotation searching in part 1 and sequence searching in part 2. Finally in part 3 we'll see what information can be extracted from a completed genome database.

As you work through this script you'll find a number of questions to answer; these are coloured in reverse so you'll notice them. Submit your answers to the portal before the deadline.

1. Annotation searching

Firstly we’re going to use the annotation search hosted by the European Bioinformatics Institute, an outstation of EMBL based at the Wellcome Genome Campus, Hinxton Hall, near Cambridge. To access the EBI, go to http://www.ebi.ac.uk/.

Many facilities are offered from the Services menu accessed from the Services link at the top of the page. However we are going to use the simple search facility on this page.

Type GABA transporter into the Find a gene, protein or chemical search box and press Search.

This will retrieve all the entries in any of the databases that have the phrase ‘GABA transporter’ in any part of the annotation.

Looking at the result breakdown on the left-hand side (be patient, sometimes this menu takes a few seconds to appear), you can see that results have been retrieved from many different databases.

How many nucleotide sequences are retrieved?

Click on Nucleotide sequences. Obviously this is a rather large number to do anything with, so we will filter them further. The left-hand menu now includes the option to filter the sequences by organism. To look for only human sequences, select the Homo sapiens box. If Homo sapiens is not visible in the list, click on More... below the organism list, type Homo sapiens into the new search window, then click Refine.

How many human sequences are retrieved?

This search has retrieved not only coding sequences, but every entry in the databases from genome assembly projects.

How many Coding Sequences are present in total for this search?

Click the Coding (Standard) filter to take you to the standard coding sequence entries in the database.

Choose a sequence (not CAA38484 which we will look at in more detail below), and click on its link which takes you to the European Nucleotide Archive, or ENA. Download the EMBL version of the chosen sequence from the link on the right-hand side and include it in your submission. If you copy it directly from the window into Word (or equivalent), you may need to change the font to COURIER.

Describe the characteristics of this sequence. These might include its function, length, molecule type, the cell or tissue from which the cDNA library was made, developmental stage or any other relevant features. (You don't need to look further than this page or the file you downloaded.)


Going back to the Nucleotide Sequences results page, find the CAA38484 entry and click on its link in the database.

From the information just below its code at the top of the page:

What is the name of this gene product?

Click on the various links on the right-hand side to view the sorts of information held on each sequence. For instance, under the Publications section, which journal article would you read to find out more?

Under Navigation, what is the code of the protein sequence of the gene product in the UniProtKB database?

From the information under Navigation, click on the link to the protein sequence in the UniProtKB database.

Protein (UniProt) database

Looking at the Function section, what is the function of the gene product, and why do you think it would be of interest? (Check the Miscellaneous sub-section.)

From the later sections of the page, briefly describe the nature of the polypeptide. Where in the cell would you find it? (Subcellular location section)

What topological information (e.g. topological domains, transmembrane regions) can you identify?

What is the length of the protein (PTM/processing section)?

What Post Translational Modification (PTM) is carried out on this protein, and in which structural domains?

Explain why these would be the expected locations.

We are going to use this sequence in the next part (Similarity searches), so you need to save it to your network drive. Scroll down to the Sequence section and click on the Download link above the sequence to bring up a window with just the protein sequence in FASTA format.

Save this to your network drive as e.g. gaba_transporter.seq in plain text format, i.e. not as a webpage. This could be saved directly from the browser or copied into a Notepad or TextEdit page and saved as plain text.
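If you want to check that the file you saved really is plain-text FASTA, the format is simple enough to read with a few lines of Python. The sketch below is an illustration only: the function name parse_fasta and the short record in the example are made up (it is not the GABA transporter sequence), and it assumes the file holds a single record, i.e. one header line starting with '>' followed by the sequence lines.

```python
from typing import Tuple


def parse_fasta(text: str) -> Tuple[str, str]:
    """Split a single-record FASTA string into (header, sequence)."""
    # Drop blank lines and surrounding whitespace.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not a FASTA record: first line must start with '>'")
    header = lines[0][1:]            # header text without the leading '>'
    sequence = "".join(lines[1:])    # sequence lines joined into one string
    return header, sequence


# Invented record for illustration only.
example = ">example_protein made-up record\nMKTAYIAK\nQRQISFVK\n"
header, sequence = parse_fasta(example)
print(header)
print(sequence, len(sequence))
```

To use it on your own file you would read the text first, for example with open("gaba_transporter.seq").read(). A file saved as a webpage would fail the first-line check, which is a quick way to spot that mistake.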

Structural information

Back on the UniProt page for the protein, scroll back up to the Structure section. Here are links to both a structure determined using electron microscopy, and an AlphaFold model (based on structures in the database) for this sequence. Click on the link to the electron microscopy structure in PDBe.

In the Experiments and Validation section on this page there is a heat map showing various metrics for the structure.

How reliable is the geometry of this structure?

What is the resolution of the structure?

What level of confidence in inter-atomic distances (e.g. inhibitor-protein interactions) can we have at this resolution? (hint, if you've forgotten/didn't take BB20020 you can revise this here: https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/resolution).

In the Structure Analysis section, click on the Molecule details > link for Chain A.

At the top of the page is basic information about the structure, with which you are familiar, then a graphical display of structural information related to the sequence. Looking at the Chains section, which multi-amino-acid stretches of the sequence are not present in this model?

Given what you know about the topology of this trans-membrane protein, what is the cellular location of the domains that have not been modelled?

Looking at the location of the inhibitor in the structure (white bonds in the diagram to the right), are these missing regions likely to be important in the function of the transporter?

As the first structure of a human GABA transporter, this is a useful resource for understanding the mechanism of action. What concerns might you raise with someone seeking to use this for detailed inhibitor design?

2. Similarity searches

It is also possible to search the sequence databases with your own nucleotide or polypeptide sequence. A number of programmes exist to do this: the most common are called BLAST and FastA, both of which are accessible from the EBI home page. Different versions of these programmes allow you to search nucleotide sequence databases with DNA or polypeptide sequences, or vice versa. Both of these programmes use heuristic algorithms to find short, exactly matching stretches of sequence quickly, thus reducing the amount of comparison that has to be done to a small fraction of the total possible search space. The time required to search the database is proportional to both the length of the query sequence and the size of the database. One disadvantage of these programmes is that they report only the best single alignment of the query to each database entry, which may mask weaker, but biologically significant, similarities. For example, if you were to search the database with the sequence of a defined protein domain, the programme would report only the best fit of that sequence to a gene product containing multiple copies of the domain.
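The idea of first finding short, exactly matching stretches can be sketched in a few lines of Python. This is a simplified illustration of the seeding step only, not the actual BLAST or FastA implementation: the function names, the word length of 3 and the two short sequences are assumptions chosen for the example, and the real programmes go on to extend and score the seeds they find.

```python
from collections import defaultdict


def build_word_index(subject: str, k: int) -> dict:
    """Map every word of length k in the subject to its start positions."""
    index = defaultdict(list)
    for i in range(len(subject) - k + 1):
        index[subject[i:i + k]].append(i)
    return index


def find_seeds(query: str, subject: str, k: int = 3) -> list:
    """Return (query_pos, subject_pos) pairs where a k-letter word matches exactly."""
    index = build_word_index(subject, k)
    seeds = []
    for q in range(len(query) - k + 1):
        # Only positions sharing an exact word are kept; everything else
        # in the search space is never compared.
        for s in index.get(query[q:q + k], []):
            seeds.append((q, s))
    return seeds


# Made-up sequences for illustration.
print(find_seeds("DFGIMCR", "AAGIMCWW"))  # → [(2, 2), (3, 3)]
```

Because the index is looked up rather than every query position being compared with every database position, most of the possible comparisons are skipped, which is where the speed comes from.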

Similarity searches are available within the Proteins section of the Services section on the EBI website. But to illustrate BLAST on another site, go to

http://www.ncbi.nlm.nih.gov/BLAST

BLAST will work with either a nucleotide or protein sequence, but proteins give better discrimination between matches, there being 20 amino acids and only 4 nucleotides. Select Protein BLAST.
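The effect of alphabet size on discrimination can be seen with some simple arithmetic. The sketch below assumes, purely for illustration, that every letter is equally likely and independent of its neighbours, which is not true of real sequences; the function name is hypothetical.

```python
def chance_match_probability(alphabet_size: int, word_length: int) -> float:
    """Probability that two random words of the given length are identical,
    assuming all letters are equally likely and independent (a simplification)."""
    return (1 / alphabet_size) ** word_length


for k in (3, 6):
    dna = chance_match_probability(4, k)       # 4 nucleotides
    protein = chance_match_probability(20, k)  # 20 amino acids
    print(f"word length {k}: DNA {dna:.2e}, protein {protein:.2e}")
```

Under these assumptions a three-letter word matches by chance 1 time in 64 for nucleotides but only 1 time in 8000 for amino acids, so a protein match of a given length is much less likely to be a coincidence.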

In the Enter Query Sequence box, copy the protein sequence from the file you saved earlier. You won't need to change anything else in this section.

To answer a particular problem, you may want to search only a part of the database, for bacterial sequences for instance. Restricting the search is done in the Choose Search Set box.

For illustration, we'll look for homologues of the human GABA transporter in the model organism C. elegans. Leaving the database as Non-redundant protein sequences, type Caenorhabditis elegans in the Organism box. Note that, as you start to type, suggestions for possible completions are given, which you can select. You could change the Algorithm parameters if more discrimination is needed, but for now, just hit BLAST.


On the BLAST results page the Descriptions table of statistics is the default view of the results. (As a check that you have entered the correct protein sequence, the top result should have an E-value of 0.0. If it doesn't, go back to the UniProt part of this script, save the sequence to a text file, and copy it from there into the BLAST search.)

To help familiarise yourself with BLAST output, start by selecting the Graphic Summary tab where the results are shown in graphical form. At the top are the results from comparing the sequence against a domain database (pale blue bars). Click on this image and from the specific hits found:  

What are the Names and Descriptions of the conserved domains that BLAST has identified within the sequence?

Back on the results window, below the domain result, the list of C. elegans sequences found is shown diagrammatically with the most similar at the top. The statistical scores for the similarity between the hit sequence and the query sequence can be found on the Descriptions tab, or by clicking on the bars. Alignments of the ‘hits’ with the query sequence are on the Alignments tab, although as explained in the lecture, these are not optimal alignments.

Knowing what we found out about this trans-membrane protein in the UniProt database, how many of these sequences are really similar to the query? You should consider both the Expectation (E) value and the length of the sequence alignment. Bear in mind what we know about the topological domains of this protein from the UniProt database: the majority of the protein is composed of trans-membrane helices, and the extra-membrane domains are likely to be important for function.
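One way to think about combining the two criteria is sketched below in Python. Everything specific here is an assumption for illustration: the function name, the E-value cut-off of 1e-5, the 70% coverage cut-off, the query length and the three hits are all invented placeholders, not real BLAST results and not recommended thresholds. You should make your own judgement from the results you obtain.

```python
def is_convincing_hit(evalue: float, aligned_length: int, query_length: int,
                      max_evalue: float = 1e-5, min_coverage: float = 0.7) -> bool:
    """Accept a hit only if it is statistically significant AND the alignment
    covers most of the query. Both thresholds are illustrative assumptions."""
    coverage = aligned_length / query_length
    return evalue <= max_evalue and coverage >= min_coverage


# Invented placeholder values, NOT real BLAST output.
QUERY_LENGTH = 600
hits = [
    ("hit_A", 1e-120, 580),  # significant and nearly full length
    ("hit_B", 1e-8, 90),     # significant but only a short region aligned
    ("hit_C", 0.5, 400),     # long alignment but not significant
]

for name, evalue, aligned_length in hits:
    print(name, is_convincing_hit(evalue, aligned_length, QUERY_LENGTH))
```

A hit like the invented hit_B shows why the E-value alone is not enough: a short, significant match might cover only a few trans-membrane helices rather than the whole transporter.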

Returning to the Descriptions tab, select the significant part of the score list (including at least Accession code, score and E-value) to copy and hand in.

Clicking on the Accession code (right hand column of the table in the Descriptions tab) for the first BLAST result takes you to the entry in the database for the C. elegans protein. What are its gene and Locus_tag codes? (you'll find them in the CDS section).

Click on the link to this sequence in the database 'Wormbase'.

3. Wormbase

Rapid progress in sequencing entire genomes has led to the need for specialised databases for the storage and annotation of the sequence and associated information. One of the most well established of these is Wormbase, which contains sequence, genetic and much other information relevant to C. elegans and other nematodes. (Others include Flybase, which collates information on Drosophila sequences.) These are secondary databases that collate data from a number of primary databases such as nucleotide, protein and structure databases. We will look at some of this information.

These secondary databases hold much more information about the gene and its encoded protein. For instance, using the information in the Location section (which you can reach quickly using the left-hand menu): How many introns does this gene contain? (These are the thin lines between the fatter blue exons.) On which chromosome is this gene? (Hint: chromosomes are given in Roman numerals.)

Where is this Na/Cl-dependent GABA transporter we identified via BLAST expressed? (use the Expression link in the left-hand margin, then scroll past the pictures to the table below and click on the Anatomy term header. Summarise what you find in this column)

Using information in the Sequences section, what is the length of the encoded protein?

Clicking on the hyperlink under the Protein header in the Sequences section leads to a Protein page. From information in the Homology section (another link on the left-hand menu): How many transmembrane domains (TmHmm) does it have? (if the protein schematic picture is not visible, try the Legacy Protein Schematics link below).

3.1 Sequence searching

Such databases can be searched by either of the two methods we've illustrated above, by annotation (using the search box at the top of the page) or sequence. To search with the human GABA transporter sequence, click on Blast/Blat from the Tools pulldown menu at the top of the page.

Open the file where you saved your sequence and copy it into the sequence box on the BLAST page (possibly without the header line).

Again you can change the BLAST options and databases being used, but we will use the default options here, so click on Submit.

A similar BLAST bar chart appears, although in shades of blue, the darker being the more similar.

Identify the three sequences that have similar function to the sequences you've looked at so far. They will have an alphanumeric code such as T03F7.1, where T03F7 is the clone that has been sequenced and .1 represents the individual gene on that clone. One of these should be the same as that identified at the end of Part 2. Note the codes of the other two.
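The structure of these codes can be illustrated with a short Python sketch that splits a code into its clone and gene number. The function name split_gene_code is hypothetical, and the sketch assumes the simple clone.number pattern described above; codes with any extra suffix after the number are rejected rather than interpreted.

```python
def split_gene_code(code: str):
    """Split a code such as 'T03F7.1' into (clone, gene_number).

    Assumes the simple 'clone.number' pattern; anything else raises ValueError.
    """
    clone, _, gene_number = code.rpartition(".")
    if not clone or not gene_number.isdigit():
        raise ValueError(f"unexpected gene code format: {code!r}")
    return clone, int(gene_number)


print(split_gene_code("T03F7.1"))  # → ('T03F7', 1)
```

Two codes that share the same clone part therefore refer to different genes found on the same sequenced clone.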

3.2 Looking at individual genes

Click on the Gene Summary link for either of the novel sequences (i.e. NOT T03F7.1) in the table below the blue bar picture.

Looking at the graphical representation of the individual transporter gene, how many introns does this gene contain?

Where is it expressed?

Identify the link to the protein sequence under the Protein header in the Sequences section.

What is the length of the protein encoded by the gene?

Click on the Protein link.

How many transmembrane domains does this protein have?

(this will entail looking in the Homology section of the protein page again)

3.3 Searching for information on a specific gene

Finally, a new GABA transporter has been identified, and given the name 'unc-47'. To find out about this transporter, use the search at the top of the Gene summary page.

Change the search to ‘for a gene’ and enter ‘unc-47’ into the box.

What does this gene encode?

To what physical clone does this gene map? (in the blue panel on the right)

How many introns are present in this gene? (in the Location section)

Describe the expression pattern of this gene. (in the Expression section, less extensive than the other)

Summarise the phenotype of mutations in this gene. (Phenotype section, but also summarised in the Overview under 'From C. elegans I and II')

How many journal articles discuss this gene? (References section)

Describe the nature (length, number of TM domains…) of the protein as fully as you can. (Click on the link under the Protein header in the table in the Sequences section of the page).

3.4 Compare the proteins identified in this workshop

Discuss the similarities and differences between this gene product and those you have been examining to date. Do they belong to the same family of proteins? These are the human protein from the annotation search section, the most similar C. elegans protein from the NCBI BLAST search, the other identified in the Wormbase BLAST, and that encoded by unc-47.