SUPERFAMILY 1.75 HMM library and genome assignments server

Supra-domain Enzyme Commission (EC) Annotations and Supra-domain Enzyme Commission Ontology


This document explains the details behind the EC annotations of supra-domains. The Structural Classification of Proteins (SCOP) database (Andreeva, et al., 2008) defines and classifies domains as structural units and as the smallest units of evolution. IntEnz (the Integrated relational Enzyme database; Fleischmann, et al., 2004) is a resource focused on enzyme nomenclature, the system for naming enzymes (protein catalysts), and provides cross-references to UniProt entries. Combining these with the genome-wide domain assignments for proteins in the SUPERFAMILY database (Gough, 2006), we have performed statistical inference to detect the relatedness of EC terms to structural domains, using an automated procedure that we developed previously (de Lima Morais, et al., 2011).

Beyond these domain-centric annotations, most domains do not work alone. In multi-domain proteins, they combine to form distinct domain architectures, and the recombination of existing domains is considered one of the major driving forces of structural, functional, enzymatic and phenotypic diversification. In particular, certain pair-wise domain combinations (or triplets, or longer combinations) occur in diverse domain architectures and can thus be viewed as larger evolutionary units, termed supra-domains. Although supra-domains are clearly of evolutionary importance, their structural, functional, enzymatic and phenotypic roles remain uncharacterized. In practice, they are far more difficult to curate than individual domains, because curation requires manually examining the annotations of the multi-domain proteins they reside in. To facilitate understanding of how domain combinations contribute to functional and enzymatic diversification, we here extend the previous framework to capture EC terms suitable for supra-domains in addition to individual superfamilies (Figures 1-4).

The core idea of this framework is that if an EC term tends to annotate proteins containing an individual domain (or a supra-domain), then the term should also carry a functional signal for that domain (or supra-domain).


The pipeline of building supra-domain EC annotations


The implementation of this framework starts from high-coverage domain architectures and protein-level (UniProt) EC annotations, available from SUPERFAMILY (de Lima Morais, et al., 2011) and IntEnz, respectively (Figure 1). Two types of inference between a supra-domain (or individual superfamily) and an EC term are performed: one relative to the root (all annotations) and one relative to the term's direct parental EC terms (Figure 2). These dual constraints ensure that only the most relevant EC terms are retained.
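The two inferences can be sketched as a pair of one-sided hypergeometric tests (the distribution named in Figure 2). This is a minimal illustration, not the SUPERFAMILY implementation: it assumes that the "relative" test uses the proteins annotated to the direct parent term as its background, and the function names and toy protein sets are hypothetical.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n items from N, of which K are 'successes'."""
    upper = min(K, n)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def sp_ec_pvalue(sp_proteins, term_proteins, parent_proteins, all_proteins):
    """Return the larger of the overall and the parent-relative p-value.

    Assumption: the relative test restricts the background to proteins
    annotated to the direct parent of the EC term.
    """
    # Overall over-representation: background is every annotated protein.
    sp_all = sp_proteins & all_proteins
    term_all = term_proteins & all_proteins
    p_overall = hypergeom_sf(
        len(sp_all & term_all), len(all_proteins), len(term_all), len(sp_all)
    )
    # Relative over-representation: background is the parent term's proteins.
    sp_par = sp_proteins & parent_proteins
    term_par = term_proteins & parent_proteins
    p_relative = hypergeom_sf(
        len(sp_par & term_par), len(parent_proteins), len(term_par), len(sp_par)
    )
    return max(p_overall, p_relative)


# Toy data (hypothetical protein identifiers 0..19).
all_proteins = set(range(20))
parent_proteins = set(range(8))   # annotated to the direct parent EC term
term_proteins = {0, 1, 2, 3}      # annotated to the EC term itself
sp_proteins = {0, 1, 2, 10}       # contain the supra-domain of interest

p_value = sp_ec_pvalue(sp_proteins, term_proteins, parent_proteins, all_proteins)
print(p_value)
```

Taking the larger p-value means that an association is kept only if it is significant under both backgrounds, which matches the intent of the dual constraints described above.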

Figure 1. A general framework for inferring EC annotations for evolutionary SCOP domains and supra-domains using domain architectures and EC annotations for UniProt entries (obtained from SUPERFAMILY and IntEnz, respectively).

Figure 2. The statistical significance of inference is assessed using the hypergeometric distribution, yielding overall over-representation relative to all annotations (left panel) and relative over-representation with respect to all direct parents (middle panel). Taking the maximum of the two P-values, the statistical significance of SP-EC term associations is then assessed by the false discovery rate (FDR) method (Benjamini and Hochberg, 1995), which accounts for multiple hypothesis tests (right panel). SP denotes both supra-domains and individual superfamilies.
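As an illustration of the FDR step, the sketch below applies the Benjamini-Hochberg step-up adjustment to a list of p-values. It assumes that the FDR method used is the one in the Benjamini and Hochberg (1995) reference listed at the end of this document; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical maximal p-values for four SP-EC pairs.
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(fdr)
```

An SP-EC association would then be retained when its adjusted value falls below the chosen cutoff (the table description later in this document uses 0.001 for 'all_score').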


Obtaining supra-domain Enzyme Commission ontology (SPEO)


Based on the predicted SP2EC annotations, we have also derived a trimmed-down version of the EC hierarchy that is the most informative for annotating supra-domains (including individual superfamilies) (Figure 3).

Figure 3. Flowchart for creating the supra-domain EC ontology (SPEO) based on information-theoretic analysis of SP2EC annotation profiles.

    First, we apply information theory to define the information content (IC) of an EC term as the negative log10-transformed frequency of SPs annotated to that term. Because of the dependencies among EC terms (the so-called true-path rule), an SP directly annotated to a specific EC term (a direct annotation) is also annotated, by inheritance, to all of its parental terms (inherited annotations). The EC annotations generated above are treated as direct annotations. For any SP, the direct and inherited EC terms together constitute its SP-EC annotation profile, and these complete annotations are used to calculate the IC of all EC terms. Of note, EC terms with similar IC can represent a partition of the DAG in terms of SP2EC.

    Second, given a predefined IC (say, 1) as a seed and its corresponding range (say, [0.75, 1.25]), the algorithm starts with all EC terms unmarked and iteratively identifies the unmarked EC terms closest to the predefined IC until all EC terms are marked (Figure 4). To ensure that one and only one EC term is identified per path in the DAG, two constraints are applied: if multiple EC terms with identical IC are identified in the same path, the parental terms are filtered out; and once an EC term is identified, all terms in the paths through that term are marked and thereby excluded from further search.

    Last, the outputs are the identified EC terms whose IC falls within the range. We run the algorithm with each of four seed ICs (0.5, 1, 1.5 and 2) to create SPEO, corresponding respectively to four levels of EC terms (least informative, moderately informative, informative and highly informative). In summary, we provide a meta-EC as a proxy for annotating both supra-domains and individual superfamilies.
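The IC calculation in the first step can be sketched as follows. This is a minimal illustration under stated assumptions: the frequency of a term is taken as the fraction of all annotated SPs that carry it after true-path propagation, and the function names, SP labels and toy hierarchy are hypothetical.

```python
from math import log10


def propagate(direct, parents):
    """Add every ancestor term to each SP's direct terms (true-path rule)."""
    full = {}
    for sp, terms in direct.items():
        seen = set()
        stack = list(terms)
        while stack:
            term = stack.pop()
            if term not in seen:
                seen.add(term)
                stack.extend(parents.get(term, ()))
        full[sp] = seen
    return full


def information_content(full):
    """IC(term) = -log10(fraction of SPs annotated to the term).

    Assumption: the denominator is the number of annotated SPs.
    """
    total = len(full)
    counts = {}
    for terms in full.values():
        for term in terms:
            counts[term] = counts.get(term, 0) + 1
    return {term: -log10(n / total) for term, n in counts.items()}


# Toy child -> direct parents map and direct SP2EC annotations.
parents = {
    "1.1.1.1": ["1.1.1"], "1.1.1": ["1.1"], "1.1": ["1"], "1": ["root"],
    "2.7": ["2"], "2": ["root"],
}
direct = {"sp_a": {"1.1.1.1"}, "sp_b": {"1.1"}, "sp_c": {"2.7"}, "sp_d": {"2.7"}}

profiles = propagate(direct, parents)
ic = information_content(profiles)
print(sorted(ic.items()))
```

Under this definition the root has an IC of 0 because every SP inherits it, and more specific terms receive higher IC values.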

Figure 4. Illustration of the algorithm for iteratively creating the supra-domain EC ontology (SPEO). I). Initially, all EC terms are unmarked (open circles); II). Identify those unmarked EC terms (filled in pink) with IC closest to a predefined IC (e.g., 1); III). Filter out parental EC terms from the EC terms identified in Step II; IV). Mark the identified EC terms as well as all of their ancestors and descendants; V-VI). Repeat Steps II-IV to iteratively identify unmarked EC terms until all EC terms are marked; VII). Output only those identified EC terms with IC falling in the range (e.g., [0.75, 1.25]) as SPEO.
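Steps I-VII of Figure 4 can be sketched as the loop below. It is a hedged reading of the caption rather than the actual implementation: the function names, the toy IC values and the toy hierarchy are hypothetical, and ties are resolved by exact equality of the distance to the seed.

```python
def ancestors(term, parents):
    """Collect all ancestors of a term from a child -> parents map."""
    seen, stack = set(), list(parents.get(term, ()))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, ()))
    return seen


def build_speo(ic, parents, seed, low, high):
    """Pick at most one term per path closest to the seed IC (Steps I-VII)."""
    anc = {t: ancestors(t, parents) & set(ic) for t in ic}
    desc = {t: set() for t in ic}
    for t, ups in anc.items():
        for u in ups:
            desc[u].add(t)
    unmarked = set(ic)                                   # Step I
    selected = set()
    while unmarked:                                      # Steps V-VI
        best = min(abs(ic[t] - seed) for t in unmarked)
        picked = {t for t in unmarked if abs(ic[t] - seed) == best}  # Step II
        # Step III: drop picked terms that have a picked descendant.
        picked = {t for t in picked if not (desc[t] & picked)}
        selected |= picked
        for t in picked:                                 # Step IV
            unmarked -= {t} | anc[t] | desc[t]
    return sorted(t for t in selected if low <= ic[t] <= high)  # Step VII


# Toy hierarchy and IC values (hypothetical).
parents = {"1": ["root"], "1.1": ["1"], "1.1.1": ["1.1"],
           "2": ["root"], "2.7": ["2"], "3": ["root"]}
ic = {"root": 0.0, "1": 0.4, "1.1": 0.9, "1.1.1": 1.6,
      "2": 1.0, "2.7": 1.0, "3": 0.3}

speo_terms = build_speo(ic, parents, seed=1.0, low=0.75, high=1.25)
print(speo_terms)
```

In the toy data, terms "2" and "2.7" share the same IC on one path, so the parent is filtered out in Step III; term "3" is eventually identified but falls outside the range and is dropped in Step VII.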


Data Availability


In addition to the EC-Hierarchy for browsing, we also provide the SP2EC mapping results in two parsable formats (plain text files and MySQL tables).

SP2EC mapping results

  • Full supra-domain (including individual superfamilies) EC annotations are available in the SP2EC.txt file.

  • EC terms regarded as SPEO (four levels: least informative, moderately informative, informative, and highly informative) can be found in the SPEO.txt file. Unlike the whole EC hierarchy, these EC terms at different granularities are representative and comprehensive in terms of their relevance to supra-domains (including individual superfamilies). Keep in mind that SPEO corresponds only to the SCOP superfamily level.

  • We highly recommend using the EC terms in SPEO.txt together with the supra-domains they annotate, extracted from SP2EC.txt. They are of potential use in comparative enzymatic genomics, particularly in understanding how multi-domain proteins have evolved under enzymatic constraints along the tree of life.

SP2EC MySQL tables
    We use the four tables below (SP2EC.sql.gz) to store the information described above (i.e., the SP2EC mapping results):

    EC_info: containing info about EC terms.
        > DESC EC_info;
        +-------------+----------------------------------+------+-----+---------+-------+
        | Field       | Type                             | Null | Key | Default | Extra |
        +-------------+----------------------------------+------+-----+---------+-------+
        | ec          | varchar(15)                      | NO   | PRI | NULL    |       |
        | namespace   | enum('root','enzyme_commission') | NO   |     | NULL    |       |
        | description | varchar(255)                     | NO   |     | NULL    |       |
        | distance    | tinyint(3) unsigned              | NO   |     | NULL    |       |
        +-------------+----------------------------------+------+-----+---------+-------+
        
    • The ec column is the EC id, see IntEnz - Classification rules. It is browsable via EC-Hierarchy.
    • The namespace column is enzyme_commission for EC terms, otherwise root.
    • The description column shows the full name of EC terms.
    • The distance column shows the distance of EC terms to the root.

    EC_hie: containing info about EC hierarchy.
        > DESC EC_hie;
        +----------+---------------------+------+-----+---------+-------+
        | Field    | Type                | Null | Key | Default | Extra |
        +----------+---------------------+------+-----+---------+-------+
        | parent   | varchar(15)         | NO   | PRI | NULL    |       |
        | child    | varchar(15)         | NO   | PRI | NULL    |       |
        | distance | tinyint(3) unsigned | NO   | PRI | NULL    |       |
        +----------+---------------------+------+-----+---------+-------+
        
    • The parent column is the EC id of the parent (ancestor) term.
    • The child column is the EC id of the child (descendant) term.
    • The distance column shows the distance from the parental EC id to the child EC id: 1 for a direct parent-child relationship; other values indicate the existence of a path between them (reachable but indirect).

    EC_mapping_supradomain: containing info about SP2EC annotations.
        > DESC EC_mapping_supradomain;
        +----------------+---------------------------+------+-----+---------+-------+
        | Field          | Type                      | Null | Key | Default | Extra |
        +----------------+---------------------------+------+-----+---------+-------+
        | supradomain    | text                      | NO   |     | NULL    |       |
        | level          | enum('cl','cf','sf','fa') | NO   |     | NULL    |       |
        | ec             | varchar(15)               | NO   | MUL | NULL    |       |
        | all_score      | double                    | NO   |     | 1       |       |
        | inherited_from | text                      | YES  |     | NULL    |       |
        +----------------+---------------------------+------+-----+---------+-------+
        
    • The supradomain column is a comma-separated list of SCOP unique identifiers (sunid). It is browsable via SCOP-Hierarchy.
    • The level column is the level in the SCOP hierarchy: one of 'cl' for class, 'cf' for fold, 'sf' for superfamily, 'fa' for family.
    • The ec column is the EC id.
    • The all_score column is the FDR supported by all UniProts (including multidomain UniProts).
    • The inherited_from column marks the status of the predicted SP2EC annotation. 1) If it is marked 'directed' (i.e., 'all_score' < 0.001), the SP2EC annotation is significantly supported by all UniProts (including multidomain UniProts). 2) If it is a comma-separated list of EC ids (numeric part; 'all_score' is not less than 0.001), the SP2EC annotation is inherited from the listed descendant EC terms (which are significantly associated) by applying the true-path rule in the DAG. 3) It is empty otherwise. Hence, the SP2EC annotations supported by all UniProts (directly or by inheritance) can be obtained by selecting rows whose 'inherited_from' column is NOT EMPTY.
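    As an illustration of how the columns above might be queried, the sketch below selects the directly supported superfamily-level annotations ('all_score' < 0.001). It uses an in-memory SQLite database from Python's standard library as a stand-in for the MySQL table, so the column types are simplified; the rows, the sunid values and the function name are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the MySQL table described above.
conn.execute(
    """CREATE TABLE EC_mapping_supradomain (
           supradomain TEXT NOT NULL,
           level TEXT NOT NULL,
           ec TEXT NOT NULL,
           all_score REAL NOT NULL DEFAULT 1,
           inherited_from TEXT)"""
)
# Hypothetical rows; the sunids and scores are not real data.
rows = [
    ("11111,22222", "sf", "2.7.11.1", 0.0002, "directed"),
    ("11111,22222", "sf", "2.7.11", 0.4, "2.7.11.1"),
    ("33333", "sf", "1.1.1.1", 0.00005, "directed"),
]
conn.executemany("INSERT INTO EC_mapping_supradomain VALUES (?, ?, ?, ?, ?)", rows)


def direct_annotations(conn, threshold=0.001):
    """Return (supradomain, ec, all_score) rows below the FDR threshold."""
    cur = conn.execute(
        "SELECT supradomain, ec, all_score FROM EC_mapping_supradomain "
        "WHERE level = 'sf' AND all_score < ? ORDER BY all_score",
        (threshold,),
    )
    return cur.fetchall()


direct = direct_annotations(conn)
for supradomain, ec, score in direct:
    print(supradomain, ec, score)
```

    The same SELECT statement should carry over to the MySQL tables in SP2EC.sql.gz with a MySQL client library, provided the placeholder style is adapted to that library.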

    EC_ic_supra: containing info about SPEO.
        > DESC EC_ic_supra;
        +---------+---------------------------+------+-----+---------+-------+
        | Field   | Type                      | Null | Key | Default | Extra |
        +---------+---------------------------+------+-----+---------+-------+
        | level   | enum('cl','cf','sf','fa') | NO   | PRI | NULL    |       |
        | ec      | varchar(15)               | NO   | PRI | NULL    |       |
        | ic      | double                    | YES  |     | NULL    |       |
        | include | tinyint(2)                | YES  | MUL | NULL    |       |
        +---------+---------------------------+------+-----+---------+-------+
        
    • The level column is the level in the SCOP hierarchy: one of 'cl' for class, 'cf' for fold, 'sf' for superfamily, 'fa' for family.
    • The ec column is the EC id.
    • The ic column shows the information content of the EC term.
    • The include column indicates whether or not the EC term belongs to the SPEO. If the column is set to '0' then it is not a member of SPEO. Otherwise, '1' for least informative (i.e., the most general), '2' for moderately informative, '3' for informative, '4' for highly informative (i.e., the most specific).
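    To show how the 'include' column could be combined with the annotation table, the sketch below joins the two tables and lists the supra-domains annotated with SPEO terms of a chosen level. As before, an in-memory SQLite database stands in for MySQL, and all rows, IC values and the function name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-ins for the two MySQL tables; all rows are made up.
conn.executescript(
    """
    CREATE TABLE EC_ic_supra (
        level TEXT NOT NULL, ec TEXT NOT NULL, ic REAL, "include" INTEGER,
        PRIMARY KEY (level, ec));
    CREATE TABLE EC_mapping_supradomain (
        supradomain TEXT NOT NULL, level TEXT NOT NULL, ec TEXT NOT NULL,
        all_score REAL NOT NULL DEFAULT 1, inherited_from TEXT);
    INSERT INTO EC_ic_supra VALUES
        ('sf', '2.7', 0.52, 1),
        ('sf', '2.7.11', 1.1, 2),
        ('sf', '2.7.11.1', 1.9, 4),
        ('sf', '1.1.1.1', 1.4, 0);
    INSERT INTO EC_mapping_supradomain VALUES
        ('11111,22222', 'sf', '2.7.11.1', 0.0002, 'directed'),
        ('11111,22222', 'sf', '2.7.11', 0.4, '2.7.11.1'),
        ('33333', 'sf', '1.1.1.1', 0.00005, 'directed');
    """
)


def speo_annotations(conn, speo_level):
    """Return (supradomain, ec, ic) for SPEO terms of the given level (1-4)."""
    cur = conn.execute(
        'SELECT m.supradomain, m.ec, s.ic '
        'FROM EC_mapping_supradomain AS m '
        'JOIN EC_ic_supra AS s ON s.ec = m.ec AND s.level = m.level '
        'WHERE m.level = ? AND s."include" = ? '
        'ORDER BY m.supradomain, m.ec',
        ("sf", speo_level),
    )
    return cur.fetchall()


moderate = speo_annotations(conn, 2)
print(moderate)
```

    Rows with 'include' set to 0 never match, so terms outside SPEO are excluded whatever their IC.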


References

    Andreeva, A., Howorth, D., Chandonia, J.M., Brenner, S.E., Hubbard, T.J., Chothia, C. and Murzin, A.G. (2008) Data growth and its impact on the SCOP database: new developments, Nucleic Acids Res, 36, D419-425.
    Benjamini, Y. and Hochberg, Y. (1995) Controlling the False Discovery Rate - a Practical and Powerful Approach to Multiple Testing, Journal of the Royal Statistical Society Series B-Methodological, 57, 289-300.
    Chothia, C. and Gough, J. (2009) Genomic and structural aspects of protein evolution, Biochem J, 419, 15-28.
    de Lima Morais, D.A., Fang, H., Rackham, O.J., Wilson, D., Pethica, R., Chothia, C. and Gough, J. (2011) SUPERFAMILY 1.75 including a domain-centric gene ontology method, Nucleic Acids Res, 39, D427-434.
    Fleischmann, A., Darsow, M., Degtyarenko, K., Fleischmann, W., Boyce, S., Axelsen, K.B., Bairoch, A., Schomburg, D., Tipton, K.F. and Apweiler, R. (2004) IntEnz, the integrated relational enzyme database, Nucleic Acids Res, 32, D434-437.