Secondary Structure Prediction methods and links

There are now many web servers for structure prediction; here is a quick summary:

With no homologue of known structure from which to make a 3D model, a logical next step is to predict secondary structure. Although the methods differ, the aim of secondary structure prediction is always to locate the alpha helices and beta strands within a protein or protein family.

Methods for single sequences

Secondary structure prediction has been around for almost a quarter of a century. The early methods suffered from a lack of data: predictions were performed on single sequences rather than families of homologous sequences, and there were relatively few known 3D structures from which to derive parameters. Probably the most famous early methods are those of Chou & Fasman, Garnier, Osguthorpe & Robson (GOR) and Lim. Although the authors originally claimed quite high accuracies (70-80%), under careful examination the methods were shown to be only between 56 and 60% accurate (see Kabsch & Sander, 1984 given below). An early problem in secondary structure prediction was the inclusion of structures used to derive parameters in the set of structures used to assess the accuracy of the method, which inflates the apparent accuracy.
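Accuracy figures like those quoted above are usually per-residue, three-state scores (often called Q3): the percentage of residues whose predicted state matches the state observed in the known structure. The text does not name the measure, so treating it as Q3 is an assumption, and the two strings below are invented for illustration. A minimal sketch in Python:

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Percentage of residues whose predicted state (H = helix, E = strand,
    C = coil) matches the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("prediction and observation must have the same length")
    if not observed:
        raise ValueError("cannot score an empty sequence")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Hypothetical 16-residue example (not taken from any real protein).
observed = "CCHHHHHHCCEEEECC"
predicted = "CHHHHHCCCCEEECCC"
print(q3_accuracy(predicted, observed))  # 12 of 16 residues agree
```

For the score to mean anything, the proteins used for scoring must not have been used to derive the method's parameters, which is the problem described above.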

Some good references on the subject:

Recent improvements

The availability of large families of homologous sequences revolutionised secondary structure prediction. Traditional methods, when applied to a family of proteins rather than a single sequence, proved much more accurate at identifying core secondary structure elements. The combination of sequence data with sophisticated computing techniques such as neural networks has led to accuracies well in excess of 70%. Though this seems a small percentage increase, these predictions are actually much more useful than those for single sequences, since they tend to predict the core accurately. Moreover, the limit of 70-80% may be a function of secondary structure variation within homologous proteins.

Automated methods

There are numerous automated methods for predicting secondary structure from multiply aligned protein sequences. Some good references on the subject include (the acronyms in parentheses given after each reference refer to the associated WWW servers, given below):

Nearly all of these now run via the World Wide Web. For details, see the papers for the individual methods, or click on the underlined acronyms given after most of the references above (note that you can also run the methods by going to the appropriate WWW site).

Manual intervention

It has long been recognised that patterns of residue conservation are indicative of particular secondary structure types. Alpha helices have a periodicity of 3.6 residues per turn, which means that residues at positions i, i+3, i+4 and i+7 (where i is a residue in an alpha helix) lie on one face of the helix. Many alpha helices in proteins are amphipathic, meaning that one face points towards the hydrophobic core and the other towards the solvent. Thus patterns of hydrophobic residue conservation showing the i, i+3, i+4, i+7 pattern are highly indicative of an alpha helix.
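The i, i+3, i+4, i+7 rule can be tried out on a single sequence with a simple window scan. The sketch below is only an illustration: the set of residues counted as hydrophobic, the tolerance for hydrophobic residues at the intervening positions, and the test peptide are all assumptions rather than anything prescribed here.

```python
# Assumption: a deliberately simple set of hydrophobic residues.
HYDROPHOBIC = set("AVLIMFWC")

BURIED_FACE = (0, 3, 4, 7)   # offsets i, i+3, i+4, i+7
OTHER_OFFSETS = (1, 2, 5, 6)  # intervening positions, expected to be mostly polar


def amphipathic_helix_starts(seq, max_hydrophobic_elsewhere=1):
    """Return 1-based start positions of 8-residue windows in which
    i, i+3, i+4 and i+7 are hydrophobic and the remaining positions are
    mostly polar."""
    hits = []
    for i in range(len(seq) - 7):
        face_ok = all(seq[i + k] in HYDROPHOBIC for k in BURIED_FACE)
        elsewhere = sum(seq[i + k] in HYDROPHOBIC for k in OTHER_OFFSETS)
        if face_ok and elsewhere <= max_hydrophobic_elsewhere:
            hits.append(i + 1)
    return hits


# Hypothetical designed peptide, not a real protein sequence.
print(amphipathic_helix_starts("LKELLKKL"))
```

On a real family the same test would be applied to conserved alignment columns rather than to the residues of one sequence.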

For example, this helix in myoglobin has this classic pattern of hydrophobic and polar residue conservation (i = 1):

Similarly, the geometry of beta strands means that adjacent residues have their side chains pointing in opposite directions. Beta strands that are half buried in the protein core will tend to have hydrophobic residues at positions i, i+2, i+4, i+6, etc., and polar residues at positions i+1, i+3, i+5, etc.
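The alternating pattern of a half-buried strand can be searched for in the same spirit. This is a rough sketch under stated assumptions: the hydrophobic residue set, the window length of six and the example sequence are all chosen for illustration only.

```python
# Assumption: a deliberately simple set of hydrophobic residues.
HYDROPHOBIC = set("AVLIMFWC")


def alternating_strand_starts(seq, length=6):
    """Return 1-based start positions of windows in which positions
    i, i+2, i+4, ... are hydrophobic and i+1, i+3, i+5, ... are not."""
    hits = []
    for i in range(len(seq) - length + 1):
        window = seq[i:i + length]
        buried_ok = all(r in HYDROPHOBIC for r in window[0::2])
        exposed_ok = all(r not in HYDROPHOBIC for r in window[1::2])
        if buried_ok and exposed_ok:
            hits.append(i + 1)
    return hits


# Hypothetical sequence, not taken from CD8 or any other real protein.
print(alternating_strand_starts("GVTVKVSG"))
```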

For example, this beta strand in CD8 shows this classic pattern:

Beta strands that are completely buried (as is often the case in proteins containing both alpha helices and beta strands) usually contain a run of hydrophobic residues, since both faces are buried in the protein core.
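A run of consecutive hydrophobic residues is easy to pick out with a regular expression. In this sketch the residue set, the minimum run length of four and the example sequence are assumptions made for illustration.

```python
import re

# Assumption: a deliberately simple set of hydrophobic residues.
HYDROPHOBIC = "AVLIMFWC"


def hydrophobic_runs(seq, min_len=4):
    """Return (start, end, residues) for each run of at least min_len
    consecutive hydrophobic residues; start and end are 1-based, inclusive."""
    pattern = re.compile(f"[{HYDROPHOBIC}]{{{min_len},}}")
    return [(m.start() + 1, m.end(), m.group()) for m in pattern.finditer(seq)]


# Hypothetical sequence, not the CheY strand.
print(hydrophobic_runs("KDGVLIVAEDK"))
```

A short run like this does not by itself distinguish a buried strand from a buried helix, so it is best read alongside the other patterns described above.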

This strand from Chemotaxis protein CheY is a good example:

The principle behind most manual secondary structure predictions is to look for patterns of residue conservation that are indicative of secondary structures like those shown above. It has been shown in numerous successful examples that this strategy often leads to nearly perfect predictions. The work of Barton et al., Niermann & Kirschner, Bazan and Benner & co-workers provides good starting points for doing this sort of work oneself. Some useful references are:

A strategy for secondary structure prediction

In practice, I recommend running as many state-of-the-art prediction methods as possible and combining their results with some human insight to give a consensus prediction for the family. If you then align all of your predictions (including ideas you have based on residue conservation) with your multiple sequence alignment, you can get a consensus picture of the structure. For example, here is part of an alignment of a family of proteins I looked at recently:

In this figure, three automated secondary structure predictions (PHD, SOPMA and SSPRED) appear below the alignment of 12 glutamyl tRNA reductase sequences. Positions within the alignment showing a conservation of hydrophobic side-chain character are shown in yellow, and those showing near total conservation of non-hydrophobic residues (often indicative of active sites) are coloured green.

Predictions of accessibility performed by PHD (PHD Acc. Pred.) are also shown (b = buried, e = exposed), as is a prediction I performed by looking for patterns indicative of the three secondary structure types shown above. For example, positions (within the alignment) 38-45 exhibit the classical amphipathic helix pattern of hydrophobic residue conservation, with positions i, i+3, i+4 and i+7 showing a conservation of hydrophobicity, with intervening positions being mostly polar. Positions 13-16 comprise a short stretch of conserved hydrophobic residues, indicative of a beta-strand, similar to the example from CheY protein shown above.

By looking for these patterns I built up a prediction of the secondary structure for most regions of the protein. Note that most methods - automated and manual - agree for many regions of the alignment.
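Finding alignment positions that conserve hydrophobic character is the first step of this kind of manual analysis, and it is simple to automate. The sketch below is illustrative only: the hydrophobic residue set, the 90% threshold, the gap characters and the toy alignment are all assumptions, and it is not the procedure used to colour the figure.

```python
# Assumption: a deliberately simple set of hydrophobic residues.
HYDROPHOBIC = set("AVLIMFWC")


def hydrophobic_columns(alignment, min_fraction=0.9):
    """Return 1-based alignment positions where at least min_fraction of
    the non-gap residues are hydrophobic."""
    if not alignment or len({len(s) for s in alignment}) != 1:
        raise ValueError("alignment rows must be non-empty and of equal length")
    conserved = []
    for pos, column in enumerate(zip(*alignment), start=1):
        residues = [r for r in column if r not in "-."]
        if not residues:
            continue  # column contains only gaps
        fraction = sum(r in HYDROPHOBIC for r in residues) / len(residues)
        if fraction >= min_fraction:
            conserved.append(pos)
    return conserved


# Hypothetical four-sequence alignment, invented for illustration.
toy_alignment = [
    "LKELLKKL",
    "IRDVLEKA",
    "VKEIAQRL",
    "LEKLV-KM",
]
print(hydrophobic_columns(toy_alignment))
```

In this toy case the conserved positions fall at 1, 4, 5 and 8, which is the i, i+3, i+4, i+7 spacing expected for an amphipathic helix.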

Given the results of several methods of predicting secondary structure, one can build up a consensus picture of the secondary structure, such as that shown at the bottom of the alignment above.
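One simple way to combine several predictions is a per-position majority vote. This is a minimal sketch, assuming that all predictions have already been mapped onto the same alignment positions and use the same three-state alphabet; the strings are invented, and the voting rule is an illustration rather than the scheme used by any particular server.

```python
from collections import Counter


def consensus_prediction(predictions, min_votes=None):
    """Combine equal-length prediction strings (H, E, C) by majority vote.
    Positions without enough agreement are marked '?'."""
    if not predictions or len({len(p) for p in predictions}) != 1:
        raise ValueError("predictions must be non-empty and of equal length")
    if min_votes is None:
        min_votes = len(predictions) // 2 + 1  # strict majority
    consensus = []
    for column in zip(*predictions):
        state, votes = Counter(column).most_common(1)[0]
        consensus.append(state if votes >= min_votes else "?")
    return "".join(consensus)


# Hypothetical output of three methods for the same 12 positions.
methods = [
    "CHHHHHCCEEEC",
    "CCHHHHCCEEEC",
    "CHHHCCCEEEEC",
]
print(consensus_prediction(methods))
```

Positions marked '?' are the ones worth inspecting by eye, using the conservation patterns described earlier.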

Note that you can get predictions like the above (i.e. consensus predictions) from the very useful JPRED server.

