The relationship between sequence and interaction divergence
in proteins


Improved data sets

(related to Note added in proof)

Since the original study, we have worked on several ways of producing cleaner sets of domain-domain interactions. The main aim was to avoid instances of crystal packing, but some other logistical changes also improve the data.

In the first study we used Astral representatives to select non-redundant pairs of interactions. This is a sensible strategy, but we observed that the selection of pairs could often be rather arbitrary. When there was more than one instance of the same interacting domains in a single database (PDB) entry, different selections for pairs of domains in different entries could give misleading results (e.g. artificially high iRMSDs).

We initially thought that the sensible thing to do would be to compare all possible instances. Unfortunately, for very large structures (such as the proteasome), this would require a very large number of comparisons (from which we always chose the lowest iRMSD, as explained in the paper). We therefore first compared interactions of the same type within the same PDB file and removed "duplicates" (i.e. those arising from symmetry, etc.), defined as any pair with an iRMSD smaller than 1.0. This produced a tractable number of interactions, which are available below.
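The duplicate-removal step can be illustrated with a minimal sketch. It assumes a greedy pass over the instances of one interaction type in one PDB file, and it takes the iRMSD calculation as a caller-supplied function; the names `remove_duplicates`, `irmsd` and `cutoff` are hypothetical, and the actual procedure used to build the data sets may differ in its details.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def remove_duplicates(
    instances: Sequence[T],
    irmsd: Callable[[T, T], float],
    cutoff: float = 1.0,
) -> List[T]:
    """Keep one representative per group of near-identical interactions.

    `instances` are interactions of the same type from the same PDB file.
    `irmsd` is a hypothetical function returning the iRMSD between two
    instances. An instance is dropped if its iRMSD to any instance already
    kept is smaller than `cutoff`.
    """
    kept: List[T] = []
    for inst in instances:
        # Keep the instance only if it differs from every representative so far.
        if all(irmsd(inst, rep) >= cutoff for rep in kept):
            kept.append(inst)
    return kept


# Toy usage: instances are plain numbers and the "iRMSD" is their absolute
# difference, purely to show how the filter behaves.
toy = [0.0, 0.4, 2.0, 2.5, 5.0]
representatives = remove_duplicates(toy, lambda a, b: abs(a - b))
```

Because the pass is greedy, the representatives kept can depend on the order of the input; sorting the instances first would make the result reproducible.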

We also noticed that instances of one domain interacting with another of the same fold could be in arbitrary "orders" (i.e. A1-A2 in one structure and A2-A1 in another). To get around this we simply put both orientations in the initial starting set (reversals).
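As a rough illustration of adding reversals, the sketch below represents each interaction as a pair of domain labels and adds the swapped orientation of every pair to the starting set. The tuple representation and the name `with_reversals` are assumptions made for the example, not the format of the files offered here.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def with_reversals(pairs: Sequence[Pair]) -> List[Pair]:
    """Return the pairs plus their reversed orientations, without repeats."""
    seen = set()
    out: List[Pair] = []
    for a, b in pairs:
        # Add the pair as given, then its reversal, skipping any already present.
        for candidate in ((a, b), (b, a)):
            if candidate not in seen:
                seen.add(candidate)
                out.append(candidate)
    return out


starting_set = with_reversals([("A1", "A2"), ("A2", "A1"), ("B1", "B2")])
```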

For the cleanest possible set, we simply removed all interactions that occurred multiple times in the same file. This will remove many true interactions, but should avoid most of the crystal contacts.
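A minimal sketch of this strictest filter follows. It assumes each record is a tuple of PDB identifier, interaction type and instance label, and it reads "removed all interactions that occurred multiple times" as dropping every instance of a repeated interaction rather than keeping one copy; both the record layout and the name `drop_repeated` are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Record = Tuple[str, str, str]  # (pdb_id, interaction_type, instance_label)


def drop_repeated(records: Sequence[Record]) -> List[Record]:
    """Keep only interactions seen exactly once within their PDB file."""
    counts = Counter((pdb_id, itype) for pdb_id, itype, _ in records)
    return [r for r in records if counts[(r[0], r[1])] == 1]


# Toy records; the identifiers are invented for illustration only.
records = [
    ("pdb1", "foldA-foldB", "inst1"),
    ("pdb1", "foldA-foldB", "inst2"),  # repeated in pdb1, so both are dropped
    ("pdb1", "foldC-foldD", "inst1"),
    ("pdb2", "foldA-foldB", "inst1"),  # unique within pdb2, so it is kept
]
cleanest = drop_repeated(records)
```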

So, after all of that, several files are available:

Old data as described in the paper (SCOP 1.61):

New data as described above (SCOP 1.63):

(Note that in the above files, skipped interactions are simply commented out with a "#" in the first column.)
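When reading these files, the commented-out lines can be separated from the active ones with a few lines of code. The sketch below relies only on the "#" convention described above; it makes no assumption about the column layout of the files, and the name `split_skipped` is hypothetical.

```python
from typing import Iterable, List, Tuple


def split_skipped(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split lines into (active, skipped) using "#" in the first column."""
    active: List[str] = []
    skipped: List[str] = []
    for line in lines:
        text = line.rstrip("\n")
        if not text.strip():
            continue  # ignore blank lines
        if text.startswith("#"):
            # Skipped interaction: strip the marker but keep the record text.
            skipped.append(text[1:].strip())
        else:
            active.append(text)
    return active, skipped


# Toy lines; the contents are placeholders, not the real file format.
example = ["record one\n", "#record two\n", "\n", "record three\n"]
active, skipped = split_skipped(example)
```

An open file object can be passed in place of the list, since iterating over it yields lines in the same way.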

In the tool we offer the ability to search the data for 4 & 5 above. We see that the number of spurious points decreases as one moves from 4->5, but note that "real" interactions may be missing from set 5.

The data for plotting:

Please cite Aloy et al, J. Mol. Biol., 332, 989-998, 2003 PubMed 14499603.
