Problems with automated assessments in the CASP New Fold category

Although we agree that CASP assessment should ultimately be fully automated, we, as the New Fold assessors for CASP5, feel it necessary to clarify some of the problems and limitations that currently prevent this from happening. We and previous assessors have found that the numerical measures of structural similarity are not yet sensitive enough to permit automatic assessment of targets adopting new folds.

GDT_TS is a very powerful and effective measure of the similarity between predicted and correct structures. It is particularly useful in assessments of predictions built from templates (i.e. Fold recognition or Comparative modelling), for which it was largely developed. However, we, like previous assessors (e.g. Arthur Lesk, pers. comm.), have noticed that numerical evaluations in the New Folds category are reliable only as a guide, since inspection reveals anomalies where predictions that experts consider to be wrong score better than those considered to be correct. This is perhaps not surprising, as the majority of predictions for the 13 NF+NF/FR targets from CASP5 have GDT_TS values below 30, which Adam Zemla has suggested is a kind of twilight zone:

Target     Length   Predictions with GDT_TS >= 30 / total
T0129      170        2 / 484
T0146_1    107        0 / 391
T0146_2     89       21 / 357
T0146_3     56        5 / 280
T0149_2    116        2 / 369
T0161      162        0 / 409
T0162_3    168        0 / 372
T0170       69      349 / 457 *
T0172_2    104        4 / 381
T0173      287        0 / 454
T0181       92        2 / 416
T0186_3     36      120 / 351 *
T0187_1    185        0 / 406

(* We think there is a length effect here, akin to that for % sequence identity, i.e. higher values are more likely for short proteins. This is clear when one considers that very few groups predicted anything resembling the correct structure for T0186_3, whereas many groups predicted good structures for T0170.)
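The counts in the table above can be converted into fractions, which makes the contrast between targets (and the possible length effect) easier to see. The following is a minimal sketch: the numbers are copied from the table, while the names `COUNTS` and `fraction_above` are ours and purely illustrative.

```python
# target -> (length, predictions with GDT_TS >= 30, total predictions),
# copied from the table above.
COUNTS = {
    "T0129":   (170,   2, 484),
    "T0146_1": (107,   0, 391),
    "T0146_2": ( 89,  21, 357),
    "T0146_3": ( 56,   5, 280),
    "T0149_2": (116,   2, 369),
    "T0161":   (162,   0, 409),
    "T0162_3": (168,   0, 372),
    "T0170":   ( 69, 349, 457),
    "T0172_2": (104,   4, 381),
    "T0173":   (287,   0, 454),
    "T0181":   ( 92,   2, 416),
    "T0186_3": ( 36, 120, 351),
    "T0187_1": (185,   0, 406),
}


def fraction_above(counts):
    """Return, per target, the fraction of predictions with GDT_TS >= 30."""
    return {t: n / total for t, (_, n, total) in counts.items()}


fractions = fraction_above(COUNTS)

# List targets from shortest to longest to make any length effect visible.
for target, (length, _, _) in sorted(COUNTS.items(), key=lambda kv: kv[1][0]):
    print(f"{target:8s} {length:4d} {100 * fractions[target]:5.1f}%")
```

Sorting by length puts the two starred targets (T0186_3 and T0170) among the shortest entries, which is the pattern the footnote refers to.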


The figures show two examples of what we considered to be anomalies. In Figure 1 a prediction of an all-alpha protein when compared to the true alpha+beta structure (T0162_3) gets a GDT_TS rank of 13 owing only to a single long helix present in both structures. At rank 14 is a prediction where five beta-strands of the central beta-sheet have been predicted correctly. We considered only the second prediction worthy of one point (out of two), as we feel it is a more useful prediction of the overall structure (note that 'GOOD' implies 1 point and 'OK' implies an honourable mention; the Rank 13 prediction was considered to be 'WRONG').


Figure 1

In Figure 2, the prediction at rank 2 has fragments lying on top of each other, leading to an ambiguous beta-sheet, and contains numerous close contacts. It scores better than one that has identified what are clearly correct features of the topology (at rank 4). Only the latter got points in the visual assessment (1 out of a possible 2, since the overall fold is still not correct). The different colours show the location of key super-secondary structure features that we tended to look for when evaluating predictions manually. We have investigated the phenomenon of overlapping coordinates further, and found that it is possible to improve GDT_TS by overlaying fragments from arbitrarily selected predictions; we strongly suggest that organisers & assessors of future CASPs look at this problem in more detail and attempt to correct for it.


Figure 2
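One simple way to flag predictions with overlapping fragments, such as the rank 2 model discussed above, is to count pairs of C-alpha atoms that lie implausibly close together. The sketch below is only an illustration and is not a check used in our assessment. The function name, the 3.0 Angstrom threshold and the minimum sequence separation of 2 are all assumptions chosen for the example (consecutive C-alpha atoms are normally about 3.8 Angstroms apart).

```python
import math


def count_close_contacts(ca_coords, min_dist=3.0, min_seq_sep=2):
    """Count pairs of C-alpha atoms closer than min_dist (in Angstroms).

    ca_coords is a list of (x, y, z) tuples in the order given in the model.
    Pairs fewer than min_seq_sep positions apart in that order are skipped.
    Both thresholds are illustrative assumptions, not established values.
    """
    close = 0
    n = len(ca_coords)
    for i in range(n):
        for j in range(i + min_seq_sep, n):
            if math.dist(ca_coords[i], ca_coords[j]) < min_dist:
                close += 1
    return close


# Hypothetical example: an extended 5-residue chain, and the same chain
# with a second copy overlaid on it, shifted by 0.5 Angstroms.
chain = [(3.8 * i, 0.0, 0.0) for i in range(5)]
overlaid = chain + [(x + 0.5, y, z) for (x, y, z) in chain]

print(count_close_contacts(chain))
print(count_close_contacts(overlaid))
```

A model with a high count could then be flagged for visual inspection rather than being penalised automatically.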

Another anomaly is that GDT_TS often awards predictions containing only correct fragments scores similar to those of predictions with the correct overall fold. For certain servers and groups this turns out to be the explanation for the disagreement between the visual and automated assessment: the lack of human intervention or a preliminary fragment assembly algorithm meant that they got good GDT_TS ranks but only partial scores with the visual assessment.
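To see why fragments alone can earn credit, it helps to recall how the score is built: GDT_TS is the average, over distance cutoffs of 1, 2, 4 and 8 Angstroms, of the percentage of residues lying within that cutoff of their correct positions. The sketch below is a deliberately simplified version that assumes the model has already been superimposed on the native structure. The real GDT procedure searches over many superpositions for each cutoff, so this is not the official calculation, and the function name is ours.

```python
import math


def gdt_ts_fixed(model, native, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS for one fixed superposition.

    model and native are equal-length lists of (x, y, z) C-alpha
    coordinates, already superimposed (an assumption of this sketch).
    """
    if len(model) != len(native) or not native:
        raise ValueError("model and native must be non-empty and equal in length")
    dists = [math.dist(m, n) for m, n in zip(model, native)]
    # Percentage of residues within each cutoff, then the plain average.
    percents = [100.0 * sum(d <= c for d in dists) / len(dists) for c in cutoffs]
    return sum(percents) / len(percents)


# Hypothetical 10-residue target: the model places 3 residues exactly
# and the remaining 7 far from their correct positions.
native = [(3.8 * i, 0.0, 0.0) for i in range(10)]
fragment_only = native[:3] + [(x, 20.0, 0.0) for (x, _, _) in native[3:]]

print(gdt_ts_fixed(native, native))
print(gdt_ts_fixed(fragment_only, native))
```

Because the score only counts residues within the cutoffs, it takes no account of whether the well-placed residues form a meaningful part of the topology.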

Note that our visual assessment was ruthless: we gave full points only to predictions approaching the overall fold. We chose this strategy as we knew from CASP3 & 4 that the community was able to predict correct folds for many targets. CASP5 showed that this strategy did actually give rankings in agreement with numerical evaluations, with exceptions owing to the anomalies above.


More specifically, with regard to Michael Levitt's assessment, we agree with David Jones' comment regarding the choice of Z-score cut-off. The total score depends on what Z value one uses. We observed that choosing a comparatively high value of Z=2.5 gives a ranking that is in closer agreement with our visual assessment. This is perhaps not surprising, as a high threshold would be necessary to discern the few predictions that were above the GDT_TS twilight zone.
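The dependence of the total score on the cut-off can be illustrated with a small sketch. We assume here a simple scheme in which GDT_TS values are converted to Z-scores per target and each group's total is the sum of its Z-scores at or above the cut-off; this is an illustrative assumption and not necessarily the exact scheme used in the assessment discussed above. The data in `TOY` are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(scores):
    """Convert one target's {group: GDT_TS} scores to Z-scores."""
    values = list(scores.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return {g: 0.0 for g in scores}
    return {g: (s - mu) / sigma for g, s in scores.items()}


def totals(per_target, z_cut):
    """Sum, per group, the Z-scores that reach the cut-off z_cut."""
    total = {}
    for scores in per_target.values():
        for group, z in z_scores(scores).items():
            total[group] = total.get(group, 0.0) + (z if z >= z_cut else 0.0)
    return total


# Hypothetical GDT_TS values for four groups on two targets.
TOY = {
    "target_1": {"A": 10, "B": 10, "C": 10, "D": 50},
    "target_2": {"A": 30, "B": 20, "C": 10, "D": 20},
}

for z_cut in (0.0, 1.5):
    print(z_cut, totals(TOY, z_cut))
```

In this toy case group A receives credit at the lower cut-off but none at the higher one, whereas group D's single outstanding prediction counts at both.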

We think it is clear that a human assessment is critical in this category for the above reasons. GDT_TS is central to CASP evaluation, but it is not meant to be "official" or necessarily a standard of truth. History has shown that automated evaluations can miss key details that only a human can spot.

Predictors should of course consider alternative schemes for assessment. Such studies can help them to improve their methods by showing which aspects of structures they are able to predict. However, we emphasise that it is critical that the official assessment is done in an objective fashion: by an external expert, who does not participate, and who does assessments blind (i.e. without knowledge of the predictors' identities). It is, as the cliche goes, always possible to find some scheme that ranks your method as the best.

Please bear in mind that CASP assessments are the result of months of very hard work. We have looked at over a thousand predictions, and toiled long and hard to try to produce a ranking that agreed with common sense and was fair to the predictors, and we hope that we have done justice to these aims. We would of course encourage people to explore different ranking schemes, but when doing so, please bear in mind the above problems, and above all: look at the predictions to see if you think they are close to the correct structure. It is important for the community to do this, lest we end up providing predictions that, despite being numerically accurate, disagree with human inspection or, worse still, the laws of chemistry.

Full details of our assessment of the New Fold category will be published in the CASP5 special issue of PROTEINS: Struct, Funct, Genet. Tables and data related to our assessment are available here.

Patrick Aloy, Alex Stark, Caroline Hadley and Rob Russell
Structural Bioinformatics Group
EMBL, Meyerhofstrasse 1, 69117 Heidelberg, Germany