GCG for a sequencing project

GCG is a suite of programs for analysing nucleotide and protein sequences. It is held on elf, a UNIX computer. See the on-line documentation for help and information (or type genhelp or genmanual at the elf $ prompt).

In this exercise we'll start with some DNA extract that we hope codes for lipoamide dehydrogenase, and which we need to sequence. In order to design primers to PCR amplify the DNA, the databases are searched for genes which may have a similar sequence.

After PCR and sequencing the initial fragment, we'll check the interpretation (A,T,G,C) that the sequencer made from the chromatogram. Only a fragment of the whole gene can be sequenced in each run. The sequence of the first fragment can be used to design primers to sequence the next fragment, and so on. When enough fragments are collected, they are assembled into the whole gene sequence. It can then be analysed for various properties of its protein.

Steps involved in this workshop:

1. Log on to elf

2. Make a first guess at the sequence

3. Design a primer to PCR amplify the gene

4. Examine the sequence chromatogram

5. Assemble the fragments

6. Predict secondary structure and look for structural motifs

7. Get on-line help

1. Log on to elf

i) From the Start button (bottom left) select All Programs, then Xming, then Elf

ii) In the new window type your BUCS username after the login as:, followed by your password when prompted.

iii) A white xterm window will appear with an elf $ prompt.

At the elf $ prompt, type

elf $ gcg

to set up GCG. Then type

elf $ seqlab &

to start up the graphical interface.

2. Guess the sequence

As in the last workshop, we're going to search the most up-to-date sequence databases for sequences which have similar function to ours and therefore possibly a similar sequence, using an array of computers in Washington DC.

Again, the site we need is Entrez Sequence Search. We are looking for the E1a chain of a lipoamide dehydrogenase in an Archaeon, and we know that the complete sequence of the related Thermoplasma acidophilum has been deposited, so this is the sequence we are looking for.

In real life it would be preferable to design primers from the consensus sequence from an alignment of many sequences. (Alignment was dealt with in the Aligning and Phylogenetics workshop). For now, in the Search across databases box type

lipoamide AND Thermoplasma acidophilum

and press GO. This searches for the words lipoamide and Thermoplasma acidophilum within the same database entry. Our results window indicates that these terms have been found in a number of areas of the www site, but to design primers we need the nucleotide sequence so select Nucleotide:.
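For anyone who later wants to script the same search rather than use the web form, here is a minimal sketch that only builds the query URL for NCBI's E-utilities esearch service; nothing is downloaded. The function name build_search_url and the choice of the nuccore database and retmax value are assumptions made for illustration, not part of this workshop.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(term, database="nuccore", max_results=20):
    """Return an esearch URL for the query; this function does not contact NCBI."""
    params = {"db": database, "term": term, "retmax": max_results}
    return ESEARCH_URL + "?" + urlencode(params)


# Same search terms as typed into the web page above.
url = build_search_url("lipoamide AND Thermoplasma acidophilum")
print(url)
```

As on the web page, the AND keeps only entries that contain both terms.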

There is more than one page of results, so from the Display settings at the top of the page, sort by Taxonomy ID. One of these results is a T. acidophilum sequence, so select it (the segment AL445067, NOT the complete genome). Now we have the annotated segment of the genome. Find the lipoamide dehydrogenase alpha chain (use Edit> Find on this page; if the Edit menu isn't visible, press the Alt key and it should appear) and click on the hyperlinked gene.

This sequence is the one we want to save to our h drive. So click on the Send: menu at the top right, then choose Complete Record, then File, change the Format to FASTA then click Create File. In the next window Save the file, then change the Filename to e1a.txt and the type to All Files before you Save.

Now we need to read this sequence into Seqlab. In the Seqlab main window select File>Add Sequences From> Sequence Files, then change the filter box to look on your h drive for the e1a.txt file (something like /u/bsl/p/bspxxx/h/e1a*). Click on the file e1a.txt, then Add. The sequence should appear in the main window. Change the Mode: to Editor to show the sequence.
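If you want to check what is in a FASTA file before loading it, the format is simple enough to read with a few lines of Python. This is only a sketch: the function name parse_fasta and the example record are made up for illustration and are not the real T. acidophilum sequence.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up record for illustration only.
example = """>hypothetical_e1a example record
ATGGCTAAAG
GTCTGTAA
"""

records = parse_fasta(example)
for name, seq in records:
    print(name, len(seq))
```

To use it on the saved file, read e1a.txt into a string and pass that string to parse_fasta.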

3. Design a primer to PCR amplify the DNA

Your lab will probably have favourite programs to use in primer design, but there are two starting points in the GCG suite. Firstly we’ll look at Prime. This program will select oligonucleotide primers for a template DNA sequence, looking through the whole sequence for possible primers and discounting those that don’t match its criteria.

To use prime, click on the sequence in the Main window, then select Functions> Primer Selection> Prime. You can change a number of the criteria, but for now just click Run.

Two outputs appear, one with text, and a graphical one. The plot can help you rapidly review the primer binding sites for the primers selected by the program. The line numbers in the plot correspond to the primer or product numbers in the text output file. Short blue lines extending above the horizontal sequence line indicate the positions of forward primers and short red lines extending below the sequence line indicate the positions of reverse primers. More details e.g. melting temperature statistics and primer-primer annealing likelihoods are given in the output file.

Design specific primers

If you prefer to design primers specifically (e.g. to the beginning and end of a particular open reading frame), and to add on restriction enzyme cleavage sites, then you can design your primers manually, and then have GCG check them for suitability with PrimePair.

The example we’re going to use is to clone our E1a lipoamide dehydrogenase into the commonly used pET19b vector, between the XhoI and BamHI sites.

[Figure: pET19b vector map, from the Novagen web site]

To do this we have to add the appropriate restriction enzyme recognition sites to the ends of the gene sequence. So in the forward direction (XhoI) we need to take the gene sequence from the start codon e.g.

and add the XhoI recognition site (detailed in the Novagen catalogue):

then add extra nucleotides on front (to allow DNA pol. to attach).

In the reverse direction we take the gene sequence up to the stop codon, then add the BamHI recognition site and extra nucleotides:

Change this to the complementary strand:

and finally reverse (so strand runs 5’ to 3’):

This gives us two primers. But how can we tell whether these are a suitable pair of primers?
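The construction steps described above (gene end, recognition site, extra bases, then complement and reverse for the reverse primer) can be sketched in Python. The gene sequence, the three extra bases and the annealing length below are hypothetical placeholders chosen for illustration; only the XhoI (CTCGAG) and BamHI (GGATCC) recognition sequences are real.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

XHOI_SITE = "CTCGAG"   # XhoI recognition sequence
BAMHI_SITE = "GGATCC"  # BamHI recognition sequence


def reverse_complement(seq):
    """Complement every base, then reverse so the result reads 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def forward_primer(gene, site, extra, anneal_len):
    """Extra 5' bases + restriction site + the start of the gene."""
    return extra + site + gene[:anneal_len]


def reverse_primer(gene, site, extra, anneal_len):
    """End of the gene + restriction site + extra bases, reverse-complemented."""
    top_strand = gene[-anneal_len:] + site + extra
    return reverse_complement(top_strand)


# Hypothetical gene, start codon to stop codon (NOT the real E1a sequence).
gene = "ATGGCTAAAGGTCTGGAACCGTTCATCGACTAA"

fwd = forward_primer(gene, XHOI_SITE, "CCG", 12)
rev = reverse_primer(gene, BAMHI_SITE, "CGG", 12)
print("forward:", fwd)
print("reverse:", rev)
```

Both recognition sites are palindromic, so they read the same after the reverse-complement step; only the gene-specific part and the extra bases change.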

We use another option, PrimePair, in the Primer Selection pulldown menu. Selecting Functions > Primer Selection > PrimePair brings up a new Prime Pair window.

First we have to put in our primers. Click on Forward_Primers. This brings up a window from which you can either read in primers from a file or type them in. We’ll do the latter.

Click on Create New… In the Pattern box type our forward primer:

(Hint: copy it from this window and press the mouse wheel in the Pattern box.) Give it a Name, then press Add, then Close. Now the Forward Primer Chooser has a primer listed, so we can close this window. We do the same for the Reverse_Primers…, selecting Create New… and typing our reverse primer into the box:

giving it a name and Adding it. Close this window, then the Reverse Primers window.

Now we’ll look at the options that the program uses to reject primers. Click on Options…. In the lower part of this window we can lessen the restrictions on the match between Primer melting temperature, and the self annealing. Change the Maximum 3’ annealing score to 10 and the Maximum total annealing score to 20. Then click Close. Now we can Run the primer pair analysis.

An output window pops up, in the lower part of which we can see that the primers have about average GC content, a difference in primer Tm of only 1.5 °C and moderately high, but acceptable, primer-primer annealing and self-annealing probabilities.
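To get a feel for the GC content and Tm figures in this output, here is a rough sketch using the simple Wallace rule of thumb (2 °C per A or T, 4 °C per G or C). This is an assumption for illustration: PrimePair may well use a different Tm calculation, so expect its numbers to differ. The two primer sequences are hypothetical.

```python
def gc_percent(primer):
    """Percentage of bases in the primer that are G or C."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Rule-of-thumb melting temperature in degrees C for a short oligo."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


# Hypothetical annealing regions, for illustration only.
forward = "ATGGCTAAAGGTCTGGAA"
reverse = "TTAGTCGATGAACGGTTC"

for name, primer in (("forward", forward), ("reverse", reverse)):
    print(name, round(gc_percent(primer), 1), wallace_tm(primer))

tm_difference = abs(wallace_tm(forward) - wallace_tm(reverse))
print("Tm difference:", tm_difference)
```

A small Tm difference between the two primers is what you are looking for, since both must anneal at the same temperature in the PCR.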

So now we have our primers.

4. From raw sequencing chromatogram to sequence

Having ordered the primers, PCR'ed the gene and sequenced the fragments, we end up with a set of sequence chromatograms. The sequencing machine automatically generates sequences (ATCG) from the chromatograms, but the human eye does it better. Thus it can be helpful to review the chromatograms to resolve any ambiguous (N) positions.

Paul Wilkinson recommends the program FinchTV for this. You can download versions of this program for Mac, PC or Linux from http://www.geospiza.com/finchtv if you’d like to install it in your lab and print direct to your printers, but I can't put it on the BUCS computers! Here is a picture of its main features:

[Figure: main features of FinchTV]

5. Assemble the fragments

a) Initialise a sequencing project - Seqmerge

Once you have all your fragment sequence files, you need to assemble them into a complete sequence. In SeqLab, the menu section that helps you to do this is

Click Run in the next window to bring up the Seqmerge Project Manager window.

Here we want to start a new assembly project, so click on

In the new Create New Project window, type

(or some other short name) at the end of the folder name in the box at the bottom. Then click OK. That window disappears and the Seqmerge project manager window now has the project you specified at the top, but without any contigs. If you'd already started a project, Seqmerge would start with the most recent project already loaded.

b) Enter the fragment sequences

Now we have a folder, we have to add some sequences. In the Seqmerge Project Manager window selecting

brings up the Add Sequences window. Change the Filter box to look in (ab1 ends in one, not the letter l)

You should see 2 files. Select both of these (<shift>click) and click on OK.

The titles of the selected sequences should then appear in the Project Manager window. Once they're there the Add sequences window can be closed by clicking Cancel.

c) Masking vector or primer sequences

One of the important refinements to sequence assembly is to mask out the sequence arising from the vector or primers. With small projects such unwanted parts of the sequence can be removed manually after assembly, but this becomes impractical with more than a few fragments to assemble, and will lead to difficulty in matching the sequence overlapping ends.

To illustrate this masking process, from the Seqmerge Project Manager window select

to bring up a new Vector Manager window.

Here click on Add from Files... and in the new Add Vector Files window change the Filter box to look for

then click Filter.

Here select the POLYLINKERdna.seq (the primer) and the pDONR222.seq (the vector), click OK then Cancel to close the Add Vector Files window.

Next we need to define which vector sequences to use, so return to the Vector Manager window and click Select All.

Finally, to check the fragments for these vector sequences, click

This masks the sequences so they won't affect the assembly. Because it only masks the end of fragments, you may need to scan the sequences twice to mask nested vectors. You do this by clicking Scan Project again. Then Close this window. Now in the Seqmerge project manager window the two sequences have a V in the mask column to show they've had vector sequences masked.
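The reason a second scan can be needed is easier to see in a toy version. The sketch below is an assumption-laden simplification, not Seqmerge's actual method: it uses exact substring matching, whereas a real tool must tolerate sequencing errors, and all the sequences and function names are hypothetical. Each scan trims at most one vector match from each end of the unmasked window.

```python
def longest_end_match(window, vector, min_len, at_start):
    """Length of the longest prefix (or suffix) of window found inside vector."""
    for n in range(len(window), min_len - 1, -1):
        piece = window[:n] if at_start else window[-n:]
        if piece in vector:
            return n
    return 0


def scan_once(fragment, start, end, vectors, min_len=8):
    """Shrink the unmasked window [start, end) by one vector match per end."""
    left = max(longest_end_match(fragment[start:end], v, min_len, True)
               for v in vectors)
    start += left
    right = max(longest_end_match(fragment[start:end], v, min_len, False)
                for v in vectors)
    end -= right
    return start, end


# Hypothetical sequences, for illustration only.
backbone = "AAGCTTGCATGCCTGCAGGTCGAC"
polylinker = "GGTACCGAGCTCGAATTC"
insert = "ATGGCTAAAGGTCTGGAACCGTTCATC"

# The read starts in the backbone, runs through the polylinker, then the insert.
fragment = backbone[-10:] + polylinker[-12:] + insert
vectors = [backbone, polylinker]

first = scan_once(fragment, 0, len(fragment), vectors)
second = scan_once(fragment, first[0], first[1], vectors)
third = scan_once(fragment, second[0], second[1], vectors)
print(first, second, third)
print(fragment[second[0]:second[1]])
```

The first scan removes only the backbone, the second removes the polylinker nested inside it, and a third scan changes nothing.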

d) Assemble the fragments

Next we want to align the fragment sequences together to form one assembly (contig). To do this select

in the Seqmerge Project Manager window. In the new Assemble Project Manager window, make sure that Assemble Current Contigs is selected, then click OK.

The Assemble window disappears and on the main Seqmerge window the number of contigs has decreased; in this case we only have one.

e) Examine the assembly - Contig Editor

Next we’d like to see how the assembly went, so <double click> on the remaining contig. This brings up the Contig Editor window, which is shown below containing a more complicated sequencing project.

[Figure: Seqmerge Contig Editor window]

The Contig Editor has two main parts, the Contig Data at the top where the sequences are shown aligned and coloured by base type, and the Contig Graph at the bottom which gives a global view of the alignment and the positions of individual fragments. A dark grey rectangle in the Contig Graph shows the part of the assembled contig being displayed in the Contig Data box.

At the bottom of the Contig Data window is the consensus sequence, the bases being in upper case only if there is complete agreement between the fragment sequences at that point.
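The consensus rule can be illustrated with a small sketch. It assumes the fragments are already aligned and padded with "-" where they do not cover a column, and it writes a lower case n wherever the fragments disagree; that choice and the two toy fragments are assumptions for illustration, not a description of how Seqmerge computes its consensus.

```python
def consensus(aligned_fragments, gap="-"):
    """Upper-case base where all covering fragments agree, 'n' where they differ."""
    length = max(len(f) for f in aligned_fragments)
    result = []
    for i in range(length):
        # Collect the distinct bases that actually cover this column.
        column = {f[i].upper() for f in aligned_fragments
                  if i < len(f) and f[i] != gap}
        if len(column) == 1:
            result.append(column.pop())
        elif column:
            result.append("n")
        else:
            result.append(gap)
    return "".join(result)


# Two made-up overlapping fragments that disagree at one position (C versus A).
upper = "ATGGCTCAAG----"
lower = "----CTAAAGGTCT"

print(consensus([upper, lower]))
```

Columns covered by only one fragment are taken from that fragment; the single disagreement in the overlap shows up as a lower case n.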

Scroll along the sequence in the upper window of your Contig Editor, (the grey box keeping track in the lower part, you can drag it instead of moving the scroll bar) until the start of the second fragment. The first part of this sequence is presented in orange to show that it has been masked out as part of the vector, which would not match the target sequence. At the other end of the overlapping region, the upper fragment is orange showing that this sequence too continued into vector. This is indicated by both sequences having a V before their names at the left of the Contig Data window.

You can see from the extent of the vector sequence that, if it hadn't been masked, the small true overlap region might not have been found.

f) Editing the contigs

Points that may need editing include those where the type of base has not been defined automatically (denoted by an N in the sequence) and those where the two sequences differ. In this case the Consensus has a lower case n. To jump to the next one of these, move back to the beginning of the first fragment and click on

This shows the top sequence has a C and the lower an A. From only two fragments we can't predict from sequence alone which is likely to be the correct base. So we have to go back to the sequencing data.

This brings up the Trace Viewer window. Here the traces for the two fragments are aligned, with each base represented by a curve of a different colour. In this case we're interested in C (blue) and A (green).

The upper trace has peaks of all four colours within the box defining the position of this base, while the lower trace has only the green (A) peak and appears less noisy than the upper. Even if you change the scale of the lower trace (by moving the left-hand slider) there is little noise. So this is convincingly an A. The green peak in the upper trace is also the tallest, so it could also be an A.

To edit the sequence, move back onto the Contig Editor window

Deleting removes the base to the left of the cursor, so click on the Upper fragment sequence, making sure the cursor is on the T to the right of the C we want to change.

Now that both fragment sequences agree, the Consensus sequence changes to match, putting a capital A in place of the lowercase n. Moving on to the next Mismatch

The Contig editor leaps to the next mismatch, mirrored by the Trace Viewer. These next two errors are less clear to edit, but the lower trace looks clearer. Finally there are a number of bases at the first part of the consensus which have been left as N because the trace is not clear. See if you can edit any of them.

g) Saving your edited consensus sequence

When you've finished editing your sequence, you should save it as a sequence file. In the Contig editor choose

Make sure the Export Consensus Sequence box is selected, and type a filename such as

in the Output File box. Then click OK.

To exit seqmerge, select File > Close on the Contig Editor window. Click on Yes when asked if you'd like to save your changes. Then click on Project > Exit on the Seqmerge window.

6. Analyse the sequence

Now we can read that sequence into SeqLab. On the Seqlab Main window, from the File pulldown menu select

then put *.rsf at the end of the folder in the Filter window. Look for my.rsf (or whatever you called it) and select it (Add, then Close).

Translate it into protein: (explained in workshop 1)

Click on the consensus sequence name on the left of the Editor window, then:

Functions> Translation>Translate:

Check that the List file button is deselected (light grey), then click Run.

The asterisks in the output window signify stop codons. Since this doesn't appear to be a straightforward translation, we will use Map to find the correct reading frame (i.e. the one with a large section with no stop codons). We have done this process before using web-based tools in the sequence analysis tools workshop.

Start by making sure you're using the Editor Mode: in the Seqlab Main window and clicking on the sequence name. Then select Functions>Translation> Map. Make sure the No enzymes box is selected, and that all 6 reading frames and the open frames only box are also selected. When all is ready, click Run.
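What a six-frame translation does can be sketched in a few lines of Python. This is only an illustration of the idea, not the Map program: the frame labels, function names and the short DNA sequence are hypothetical, and the standard genetic code is assumed. The sketch translates both strands at three offsets and reports the frame with the longest stretch free of stop codons.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq, offset=0):
    """Translate from the given offset; '*' is a stop, 'X' an unknown codon."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(offset, len(seq) - 2, 3))


def longest_open_stretch(protein):
    """Longest run of amino acids between stop codons."""
    return max(protein.split("*"), key=len)


def six_frames(dna):
    strands = {"forward": dna.upper(), "reverse": reverse_complement(dna)}
    return {(name, offset): translate(strand, offset)
            for name, strand in strands.items() for offset in range(3)}


def best_frame(dna):
    frames = six_frames(dna)
    key = max(frames, key=lambda k: len(longest_open_stretch(frames[k])))
    return key, longest_open_stretch(frames[key])


# Hypothetical sequence whose longest open frame lies on the reverse strand.
dna = "TTAATCTTTTAAATGATCTATGAGCTTATTCATTCA"
for frame, protein in six_frames(dna).items():
    print(frame, protein)
print(best_frame(dna))
```

When the best frame is on the reverse strand, as here, the sequence has to be reversed and complemented before translating.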

In the output window the DNA sequence is shown above the six possible reading frames. Looking through them, at base 820, a short (41 amino acid) open reading frame appears with no non-coding codons (indicated by ?), but there is a larger one in frame f from 1185 to 858. This means we have to reverse the direction. Do this with

When this has Run, Click on

in the Output Manager window. The reverse, complemented strand appears in the editor window. We know we need the third reading frame, so

and begin at base 3. Click Select and note that all but the first two bases change to black and white. Click Close. Now

and use the Selected region. You shouldn't need to change anything in the Translate window, just click on Run.

Now we have the correct reading frame, the next step is to:

Edit the sequence to remove non-coding regions (regions outside the innermost stop codons). Add the sequence to the SeqLab window by clicking Add to Editor on the Output window.

The protein sequence appears in the Editor window. Move along the sequence with the cursor keys, and find the first M after a series of stop codons. Remove all the characters up to this point (use the delete key to remove characters, or select with the mouse, then click cut). Then find the next stop codon and remove everything after it.
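The same trimming can be expressed as a short Python sketch. It makes a simplifying assumption that may not match every case you meet by eye: it takes the longest stretch between stop codons, then keeps everything from the first M in that stretch, dropping the stop itself. The function name and the example translation are hypothetical.

```python
def trim_to_coding_region(protein, stop="*"):
    """Return the longest stop-free stretch, starting at its first M."""
    longest = max(protein.split(stop), key=len)
    start = longest.find("M")
    # No methionine means no recognisable start in this stretch.
    return longest[start:] if start != -1 else ""


# Made-up translation with stop codons ('*') on either side of the coding region.
translated = "SFK*SISLFI*RQMNKLIDHLKDAG*LIF*"
coding = trim_to_coding_region(translated)
print(coding)
```

It is still worth checking the result against the editor window, since the first M in an open stretch is not always the true start codon.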

When done, change the mode back to Main List and save the sequence when asked.

Look for sequences that are homologous to it...

This is best carried out on Web sites, and again was covered in the "sequence analysis" workshop.

Search for other proteins with similar structural motifs...

Select Functions > Protein Analysis > Motifs. The output suggests possible functions for the protein (but not in this particular case).

Display enzyme cleavage sites on a sequence

Select Functions > Translation > Map. This was the program we used to look at the possible gene translations, but it can also be used for mapping cleavage sites.

Have a look at secondary structure predictions for your protein...

Functions > Protein Analysis > PeptideStructure uses 2 different algorithms to predict secondary structure: Chou & Fasman (CF) and Garnier, Osguthorpe & Robson (GOR). Both are based on the propensity for certain amino acids to occur in helices (H), strands (B) or turns (T).

After Run, a coloured plot will appear showing a variety of physical predictions about the protein.

7. Getting help

For GCG help use the on-line manual or Help on the pulldown menu in Seqlab. Close to exit help.

For UNIX help, see http://www.bath.ac.uk/bucs/tools/unix/