SUPERFAMILY 1.75 HMM library and genome assignments server

Generating the SUPERFAMILY Domain Architectures 'comb' Table

(Christine Vogel, Mary Pacold, and Julian Gough)

This page describes the creation of the 'comb' table in the SQL database. The comb table has an entry for each protein in SUPERFAMILY. The protein is represented as a sequence of SCOP domains rather than amino acids. This condenses the information into a very convenient format for analysis. We call the organisation of domains the 'architecture' of the protein, and the composition and interaction of the domains of the architecture determine the function of the protein.
This table is at the heart of the engine for various tools in SUPERFAMILY which compare protein architectures between whole genomes, and also for individual sequences.

Summary

The individual assignments of SCOP domains to genome sequences were converted into a string of superfamily domains for each protein; the order is defined by the sequence of the centre-points of each domain. The assignments of known structural domains do not always cover the entire sequence. So as well as the ordering of the assigned domains, it was necessary to determine the presence of unknown (not assigned, possibly of unknown structure) domains before, after and in betweeen the assigned domains. It is impossible to determine the number of unknown domains in any given unassigned region, so each region was simply labelled as a gap containing one or more unknown domains. Both the presence of gaps and the process of forcing domains into a simple string (when some domains may actually be interlaced or inserted within each other), means that the string will not always be a complete description of the architecture. However, more importantly, the procedure is unambiguous in that each domain architecture will be assigned a unique string.

The figure below outlines the procedure.

Overview of the algorithm to convert domain assignments into strings.

The algorithm

For the SUPERFAMILY assignments to be converted into a string of domains, first the order must be determined. There are at least two factors which complicate this: the assignments are very good at detecting the correct number of domains in the correct place yet (in the case of very distant homologues) not always good at accurately predicting domain boundaries. Furthermore some domains are not linear in sequence and are interlaced or inserted within each other. See the following example, there are only two domains present:

How is the order decided? It is more important for relating proteins to each other using strings of domains describing domain architectures that the assignment algorithm is unambiguous rather than that each individual protein is fully described. I.e. each string should be unique. In order to do that, the centre-point of each domain is picked, and the order is calculated from the sequence of the domain centre-points. Calculating the positions of the domain centre-points along the sequence is non-trivial since the assignments are sometimes incomplete and only recognise a short conserved region which may lie at the far end of the sequence. (As is the case for the third domain in the overview figure.) The assigned region is aligned to the model which it has hit, and the position of the centre-point of the model is chosen. Should the centre of the model be outside the bounds of the assignment then its position is projected along to a theoretical position on the sequence outside the area of assignment. In that case we assume that the gene prediction is incomplete and the whole domain exists in the protein.

There is a further wrinkle in all this, and one with a less simple solution: Proteins consist of a string of domains joined by usually short (less than 30 amino acids) linker sequences. The structural domain assignments do not cover all of the sequence leaving unassigned regions longer than 30 amino acids (see here for the exact coverage). In those unassigned regions, we expect domains of unknown structure (no SUPERFAMILY assignment) and length in the unassigned regions. We need to decide whether or not an unassigned domain lies between any two domains, or a domain and the N or C terminus of the sequence. For example, in the following sequence it is estimated that there is a missing (unassigned) domain in the region between the domains labelled 1 and 5 and another between 4 and 2, yet there is none between 5 and 4 or between the N terminal and the domain labelled 1:

Before we describe the algorithm to decide whether an unassigned region contains an unassigned domain or is merely a short linker region, we look at some statistics on the domain superfamilies in SCOP. There are roughly 1000 SCOP superfamilies (see here for up to date numbers), and each superfamily has one or more members of solved 3-D structure. The mean and standard deviation of the lengths of every domain were calculated. Some members of a superfamily are very similar to each other but others are very different. To correct for this unevenness in the database the mean and standard deviation were calculated using a weighting scheme much in the same way as pseudo-counts are used in sequence profile methods. The standard deviation describes the variability of the domain length. The higher the variability in length of a domain is, the higher is its standard deviation of the length, and the more likely it is for this domain to be longer than its average length and cover an unassigned region rather than to have a gap in the unassigned region. The smallest domain is 27 amino acids long; most of the domains are about 110 amino acids long (geometric mean), and two thirds of all domains are between 50 and 350 aa long.

Figure A. Length Distribution of SCOP domains. The arithmetic mean is 175 aa +- 119 aa. The geometric mean is 110 aa

The maximum length of an unassigned region found was 11 000 aa, but unassigned regions longer than 250 aa occur very infrequently.

Figure B. Length Distribution of Unassigned Regions. Plotted with 0 and 1 standard deviation added. Most unassigned regions are smaller than 30 aa. Unassigned regions larger than 250 aa only have frequencies close to 1. The maximum length of an unassigned region is 11 000 aa.

We assessed the number of gaps found depending on the number of standard deviations added to the domain length and the threshold length taken for a gap to be present. The number of standard deviations added to the domain length is a measurement for the probability that the two domains are adjacent to each other with no gap in between. The total number of gaps that occur is less sensitive to the number of standard deviations (from 0 to 1) added to each domain length than to the threshold gap length. Hence the threshold length of a gap is the crucial parameter.

Figure C. Number of Gaps dependent on Standard Deviation and Threshold Gap DomainLength. Plotted is the number of gaps in the complete data set with 0 and 1 standard deviation added to each domain length and a range of threshold gap lengths from 0 to 600.

The centres which were calculated earlier for the ordering of assigned domains come in useful now. The centre is used as an anchor for the domain, and from there it is possible to use the mean and standard deviation of the length of the domain to give a distribution to the possible domain boundary (see overview figure). Looking at the distributions between any two domains it is possible to estimate whether or not there may be an unknown domain between the two assigned ones. If the gap between any two assigned domains (or an assigned domain and an N or C terminus) is longer than a certain threshold length when both boundaries are calculated at one standard deviation from the normal, then it is deemed to be unlikely that the domain boundaries meet, and therefore there must be another domain in between them.
The two parameters that influence this estimate are the number of added standard deviations and the threshold length of assignment of a gap. The relationship of those two parameters and their influence on the number of gaps has been described above.

In order to derive a reasonable threshold length for a gap, we estimated the maximum expected number of unknown domains in the whole data set: Our data set comprises assignments of 262 730 domains to 167 843 sequences. All domains taken together cover 70% of the length of all sequences; that leaves 30 % unassigned regions with gaps. Under the assumption that the length distribution of unknown domains is the same as the length distribution of known domains, we expect a maximum of 113 878 domains in the 30 % unassigned regions. Since every unassigned region only counts once as on gap irrespective whether several unknown domains are in this region, the actual number of gaps will be even lower. Knowing that, we can calculate the standard deviation and threshold gap length values needed to obtain fewer than 113 878 gaps. We see, that at, for example, 1 standard deviation, a threshold gap length of at least 30 aa is needed to obtain a maximum of 113 878 gaps. For less added standard deviation or a smaller threshold gap length, the number of gaps would be over-estimated.

Figure D. Standard deviation and Threshold gap length at a fixed number of gaps (110 000 to 113878).

Testing

Parameter Set A B C
Standard Deviation Factor 0 1 2
Gap Threshold Length 60 30 12
Number of Gaps in 71 genomes 112184 112814 110421

Table 1. Parameter sets: A: stv 0 gapl 60, B: stdv 1 gapl 30, C: stv 2 gapl 12.

Three different possible parameter sets (Table 1: set A: stv 0 gapl 60, set B: stdv 1 gapl 30, set C: stv 2 gapl 12) were tested:
I) For all domains, in four genomes (dm, ec, mg, ce) two sample sequences were taken and the number of gaps compared. In cases, where in one set a gap had been predicted but not in the other, the domain assignment was inspected by eye and judged whether a gap could fit in. In each set A and B, six gaps have been correctly predicted but were absent in the other set. For set B, further four gaps were ambiguous, one gap was over-predicted. No clear conclusion from that about whether A or B is better.
II) Thirteen well-characterised Ig proteins from D.melanogaster and C.elegans have been inspected; preferably sequences with many different domains have been chosen. Out of these sequences, nine had exactly the same prediction of gaps in all three parameter sets, and were considered correct. In two sequences, set A had one over-predicted gap. In three sequences, set C had one over-predicted gap. Overall, all three sets performed well in predicting no gap between consecutive (Ig) domains and in predicting gaps in longer unassigned regions. Most of the differences occurred in very short unassigned regions, mainly at the end of the sequences (only one neighbouring domain -> only one stdv'shortening' the region).
III) Two well-characterised Cadherin proteins from C. elegans (W02B9.1 and F15B9.7) have been inspected. In all three parameter sets, the prediction of gaps was identical. All three sets performed well in predicting no gap between Cadherin domains and gaps in unassigned regions.

All three parameter sets did reasonably well. Gaps tend to be 'over-predicted' at the end of sequences. However, this is less crucial if, later on, one queries for a particular combination of domains. Generally, hardly any false-positive predictions of gaps were found. Set C performed worse than A and B.

From these tests, one reasonable set of parameters to estimate the presence of gaps is one standard deviation added to the domain lengths and a threshold gap length of 30 aa.