SUPERFAMILY 1.75 HMM library and genome assignments server

Supra-domain Gene Ontology (GO) Annotations and Supra-domain Functional Ontology

This document explains how Gene Ontology (GO) annotations are derived for supra-domains. The Structural Classification of Proteins (SCOP) database (Andreeva, et al., 2008) defines and classifies domains as structural units and as the smallest units of evolution. Function, as described by GO, is nevertheless usually considered at the level of whole proteins, even though the domain is very often the functional unit as well as the structural and evolutionary one. For this reason we previously presented a domain-centric GO method (de Lima Morais, et al., 2011). Although these domain-centric annotations hold great promise for describing functionally independent domains, most domains do not function alone. In multi-domain proteins they are combined into distinct domain architectures, which gives rise to new or more specific functions (Chothia and Gough, 2009). Recombination of existing domains is considered one of the major driving forces for gaining function in multi-domain proteins. In particular, certain combinations of two, three or more domains occur in diverse domain architectures and can therefore be viewed as larger evolutionary units, termed supra-domains.

Although supra-domains are clearly of evolutionary importance, their functions remain uncharacterized. In practice they are far more difficult to curate than individual domains, because curation requires manually examining the functions of the multi-domain proteins in which they reside. To help understand how domain combinations contribute to functional diversification, we extend the previous framework to capture GO terms suitable for supra-domains in addition to individual superfamilies (Figures 1-4). The core idea of the framework is that if a GO term tends to annotate proteins containing a given domain (or supra-domain), then that term also carries a functional signal for the domain (or supra-domain) itself.


The pipeline of building supra-domain GO annotations

The implementation of this framework starts from high-coverage domain architectures and high-quality GO annotations (GOA) for UniProt proteins, available from SUPERFAMILY (de Lima Morais, et al., 2011) and UniProtKB-GOA (Barrell, et al., 2009), respectively (Figure 1). We respect the hierarchical structure of GO, which is organized as a directed acyclic graph (DAG): each term is a node, and its relations to parental terms (multiple parents are allowed) are directed edges. Accordingly, two types of inference between a supra-domain (or individual superfamily) and a GO term are performed: one in terms of the root, against the whole set of annotations, and one in terms of the term's direct parents (Figure 2). These dual constraints ensure that only the most relevant GO terms are retained.

Figure 1. A general framework for inferring GO annotations for evolutionary SCOP domains and supra-domains using domain architectures and GO annotations for UniProt proteins (obtained from SUPERFAMILY and UniProtKB-GOA, respectively).

Figure 2. The statistical significance of inference is assessed using the hypergeometric distribution, which yields overall over-representation in terms of the whole set of annotations (left panel) and relative over-representation in terms of all direct parents (middle panel). Based on the maximal P-values, the statistical significance of SP-GO term associations is then assessed by the false discovery rate (FDR), which accounts for multiple hypothesis testing (right panel). SP denotes both supra-domains and individual superfamilies.
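
The two tests in Figure 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SUPERFAMILY implementation: the function names and argument layout are hypothetical, the relative test is read here as restricting the background to proteins annotated to the term's direct parents, and the Benjamini-Hochberg procedure is assumed for the FDR step because the text does not name a specific method.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) for n draws without replacement from N items, K of them 'successes'."""
    total = comb(N, n)
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def sp_go_pvalue(k, n_sp, K_term, N_all, n_sp_parent, N_parent):
    """Maximal P-value of the overall and the relative over-representation test.

    k            proteins containing the SP and annotated to the term
    n_sp         proteins containing the SP (whole background)
    K_term       proteins annotated to the term
    N_all        all annotated proteins (whole background)
    n_sp_parent  proteins containing the SP among those annotated to direct parents
    N_parent     proteins annotated to any direct parent of the term
    """
    p_overall = hypergeom_sf(k, N_all, K_term, n_sp)
    p_relative = hypergeom_sf(k, N_parent, K_term, n_sp_parent)
    return max(p_overall, p_relative)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order (assumed FDR method)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Taking the maximum of the two P-values means an association is kept only if the term is over-represented both against all annotations and against its direct parents, which is how the dual constraints filter out overly general terms.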


Obtaining supra-domain functional ontology

Based on the predicted SP2GO annotations, we have also built a trimmed-down version of GO containing the terms most informative for annotating supra-domains (including individual superfamilies) (Figure 3).

Figure 3. Flowchart for creating the supra-domain functional ontology (SPFO) based on information-theoretic analysis of SP2GO annotation profiles.

    First, we apply information theory to define the information content (IC) of a GO term as the negative log10 of the frequency with which SPs are annotated to that term. Because of the dependencies among GO terms (the so-called true-path rule), an SP directly annotated to a specific GO term (a direct annotation) is also implicitly annotated to all of its parental terms (inherited annotations). For any SP, the GO terms annotated to it, both direct and inherited, constitute its SP-GO annotation profile in the DAG. The GO annotations generated above are treated as direct annotations, and the complete set (direct and inherited) is used to calculate the IC of every GO term. Of note, GO terms with similar IC can represent a partition of the DAG in terms of SP2GO.

    Second, given a predefined IC as a seed (say, 1) and its corresponding range (say, [0.75 1.25]), the algorithm starts with all GO terms unmarked and iteratively identifies the unmarked GO terms closest to the seed IC until all GO terms are marked (Figure 4). To ensure that one and only one GO term is identified per path in the DAG, two constraints apply: if multiple GO terms with identical IC are identified on the same path, the parental terms are filtered out; and once a GO term is identified, all terms on the paths through that term are marked and excluded from further search.

    Last, the output consists of the identified GO terms whose IC falls within the range. We run the algorithm with each of four seed ICs (0.5, 1, 1.5 and 2) to create SPFO, corresponding respectively to four levels of GO terms (least informative, moderately informative, informative and highly informative). In summary, we provide a meta-GO as a proxy for annotating both supra-domains and individual superfamilies in each of the three sub-ontologies: Biological Process (BP), Molecular Function (MF) and Cellular Component (CC).
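
The IC definition from the first step can be illustrated with a small Python sketch. The toy DAG, the SP names and the function names are hypothetical, and the frequency is assumed here to be the fraction of annotated SPs whose profile (direct plus inherited annotations) contains the term; the text does not spell out the denominator.

```python
from math import log10


def ancestors(term, parents):
    """All ancestors of a term, given a mapping child -> list of direct parents."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(direct, parents):
    """Map each GO term to -log10(frequency of SPs annotated to it).

    direct: SP -> set of directly annotated GO terms.
    Inherited annotations are added by the true-path rule before counting.
    """
    counts = {}
    for terms in direct.values():
        profile = set(terms)
        for term in terms:
            profile |= ancestors(term, parents)
        for term in profile:
            counts[term] = counts.get(term, 0) + 1
    total = len(direct)  # assumption: denominator is the number of annotated SPs
    return {term: -log10(count / total) for term, count in counts.items()}


# Toy example: A is the root; B and C are its children; D is a child of B.
toy_parents = {"B": ["A"], "C": ["A"], "D": ["B"]}
toy_direct = {"sp1": {"D"}, "sp2": {"C"}, "sp3": {"B"}, "sp4": {"C"}}
ic_values = information_content(toy_direct, toy_parents)
```

In this toy case every SP inherits the root A, so A has an IC of zero, while D, reached by a single SP out of four, receives the highest IC.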

Figure 4. Illustration of the iterative algorithm for creating the supra-domain functional ontology (SPFO). I) Initially, all GO terms in the DAG are unmarked (open circles). II) Identify the unmarked GO terms (filled in pink) with IC closest to a predefined IC (e.g., 1). III) Filter out parental GO terms from the terms identified in Step II. IV) Mark the identified GO terms as well as all of their ancestors and descendants. V-VI) Repeat Steps II-IV to iteratively identify unmarked GO terms until all GO terms are marked. VII) Output only those identified GO terms with IC falling in the range (e.g., [0.75 1.25]) as SPFO.
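
Steps I-VII of Figure 4 can be sketched as follows. This is a minimal reading of the description rather than the code used to build SPFO: the function name, the input layout (a term -> IC dictionary and a child -> parents mapping) and the toy values are all hypothetical, and "closest" is taken to mean the smallest absolute difference from the seed IC.

```python
def ancestors(term, parents):
    """All ancestors of a term, given a mapping child -> list of direct parents."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def build_spfo(ic, parents, seed=1.0, low=0.75, high=1.25):
    """Select at most one GO term per path, closest to the seed IC, within [low, high]."""
    terms = set(ic)
    anc = {t: ancestors(t, parents) & terms for t in terms}
    desc = {t: set() for t in terms}
    for t, t_ancestors in anc.items():
        for a in t_ancestors:
            desc[a].add(t)

    unmarked = set(terms)                      # Step I: nothing is marked yet
    identified = set()
    while unmarked:                            # Steps V-VI: repeat until all are marked
        best = min(abs(ic[t] - seed) for t in unmarked)
        hits = {t for t in unmarked if abs(ic[t] - seed) == best}   # Step II
        hits = {t for t in hits if not (desc[t] & hits)}            # Step III: drop parental hits
        identified |= hits
        for t in hits:                         # Step IV: mark hit, ancestors and descendants
            unmarked -= {t} | anc[t] | desc[t]
    return {t for t in identified if low <= ic[t] <= high}          # Step VII


# Toy DAG: R is the root; A, E and F are children of R; B < A, C < B, D < C.
toy_parents = {"A": ["R"], "B": ["A"], "C": ["B"], "D": ["C"], "E": ["R"], "F": ["R"]}
toy_ic = {"R": 0.0, "A": 0.3, "B": 1.0, "C": 1.0, "D": 2.0, "E": 0.9, "F": 3.0}
spfo_terms = build_spfo(toy_ic, toy_parents)
```

In the toy DAG, B and C tie at the seed IC on the same path, so the parental term B is dropped; F is identified in a later round but falls outside the range and is therefore not part of the output.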


Data Availability

In addition to the GO-Hierarchy for browsing, we also provide the SP2GO mapping results in two parsable formats: plain text files and MySQL tables.

SP2GO mapping results

  • Full supra-domain (including individual superfamily) GO annotations are available in the SP2GO.txt file.

  • GO terms regarded as SPFO (four levels: least informative, moderately informative, informative, and highly informative) can be found in the SPFO.txt file. Unlike the whole GO hierarchy, these GO terms at different granularities are representative and comprehensive in terms of their relevance to supra-domains (including individual superfamilies). Keep in mind that SPFO is provided for each of the three GO sub-ontologies (i.e., BP, MF, and CC) only at the SCOP superfamily level.

  • We highly recommend using the GO terms in SPFO.txt together with their annotated supra-domains extracted from SP2GO.txt. They are of potential use in comparative functional genomics, particularly in understanding how multi-domain proteins have evolved under functional constraints along the tree of life.

SP2GO MySQL tables
    We use the four tables below (SP2GO.sql.gz) to store the information described above (i.e., the SP2GO mapping results):

    GO_info: containing info about GO terms.
        > DESC GO_info;
        +------------+-----------------------------------------------------------------------------+------+-----+---------+-------+
        | Field      | Type                                                                        | Null | Key | Default | Extra |
        +------------+-----------------------------------------------------------------------------+------+-----+---------+-------+
        | go         | int(7) unsigned zerofill                                                    | NO   | PRI | NULL    |       | 
        | namespace  | enum('biological_process','molecular_function','cellular_component')        | NO   | MUL | NULL    |       | 
        | name       | varchar(255)                                                                | NO   |     | NULL    |       | 
        | synonym    | text                                                                        | YES  |     | NULL    |       | 
        | definition | text                                                                        | YES  |     | NULL    |       | 
        | distance   | tinyint(3) unsigned                                                         | NO   |     | NULL    |       | 
        +------------+-----------------------------------------------------------------------------+------+-----+---------+-------+
        
    • The go column is the numeric part of GO id. It is browsable via GO-Hierarchy.
    • The namespace column can be one of three GO sub-ontologies.
    • The name column shows the full name of GO terms.
    • The synonym column is the synonym of GO terms.
    • The definition column is the definition of GO terms.
    • The distance column shows the distance of the GO term from the root of the corresponding sub-ontology.

    GO_hie: containing info about GO hierarchy.
        > DESC GO_hie;
        +----------+--------------------------+------+-----+---------+-------+
        | Field    | Type                     | Null | Key | Default | Extra |
        +----------+--------------------------+------+-----+---------+-------+
        | parent   | int(7) unsigned zerofill | NO   | PRI | NULL    |       | 
        | child    | int(7) unsigned zerofill | NO   | PRI | NULL    |       | 
        | distance | tinyint(3) unsigned      | NO   | PRI | NULL    |       | 
        +----------+--------------------------+------+-----+---------+-------+
        
    • The parent column is the numeric part of parental GO id.
    • The child column is the numeric part of child GO id.
    • The distance column shows the distance from the parental GO id to the child GO id: 1 for a direct parent-child relationship, and larger values indicating the existence of a path between them (reachable but indirect). Notably, each edge in the GO DAG can be one of three relationships: 'is_a', 'part_of', and 'regulates'. Here, we only consider the first two (i.e., 'is_a' and 'part_of') and treat them equally.
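
To make the meaning of the distance column concrete, the sketch below derives (parent, child, distance) rows from direct edges only. It is a hypothetical illustration: the function name and the toy ids are invented, and the distance is assumed to be the shortest path length. Since the primary key of GO_hie spans all three columns, the real table may hold more than one distance per pair, which this sketch does not attempt to reproduce.

```python
from collections import deque


def hierarchy_rows(parents):
    """Return sorted (ancestor, descendant, distance) tuples from direct edges.

    parents: child -> list of direct parents ('is_a' and 'part_of' treated equally).
    Distance is the shortest number of edges between the two terms (an assumption).
    """
    rows = set()
    for child in parents:
        dist = {child: 0}
        queue = deque([child])
        while queue:                      # breadth-first walk towards the root
            node = queue.popleft()
            for parent in parents.get(node, ()):
                if parent not in dist:
                    dist[parent] = dist[node] + 1
                    queue.append(parent)
        rows |= {(anc, child, d) for anc, d in dist.items() if d > 0}
    return sorted(rows)


# Toy chain 4 -> 3 -> 2 -> 1 (child -> parent), using made-up numeric ids.
toy_rows = hierarchy_rows({4: [3], 3: [2], 2: [1]})
```

Rows with distance 1 correspond to the direct edges, and every other row records that the child can reach the parent through intermediate terms.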

    GO_mapping_supradomain: containing info about SP2GO annotations.
        > DESC GO_mapping_supradomain;
        +----------------+---------------------------+------+-----+---------+-------+
        | Field          | Type                      | Null | Key | Default | Extra |
        +----------------+---------------------------+------+-----+---------+-------+
        | supradomain    | text                      | NO   | MUL | NULL    |       |
        | level          | enum('cl','cf','sf','fa') | NO   |     | NULL    |       |
        | go             | int(7) unsigned zerofill  | NO   |     | NULL    |       |
        | all_score      | double                    | NO   |     | 1       |       |
        | inherited_from | text                      | YES  |     | NULL    |       |
        +----------------+---------------------------+------+-----+---------+-------+
        
    • The supradomain column is a comma-separated list of SCOP unique identifiers (sunids). It is browsable via SCOP-Hierarchy.
    • The level column is the level in the SCOP hierarchy. It can be one of 'cl' for class, 'cf' for fold, 'sf' for superfamily, 'fa' for family.
    • The go column is the numeric part of GO id.
    • The all_score column is the FDR supported by all UniProt proteins (including multi-domain proteins).
    • The inherited_from column marks the status of the predicted SP2GO annotation. 1) If it is marked 'directed' (i.e., 'all_score' < 0.001), the annotation is significantly supported by all UniProt proteins (including multi-domain proteins). 2) If it is a comma-separated list of GO ids (numeric part only; 'all_score' is then not less than 0.001), the annotation is inherited from the listed descendant GO terms (which are significantly associated) by applying the true-path rule in the DAG. 3) It is empty otherwise. Hence, the list of SP2GO annotations supported by all UniProt proteins can be obtained by selecting the rows whose 'inherited_from' column is not empty.
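
As an illustration of the selection described for inherited_from, the sketch below runs the query against an in-memory SQLite table that mimics the columns of GO_mapping_supradomain. SQLite is used only as a stand-in so that the example is self-contained; the rows, sunids and GO numbers are made up, the function name is hypothetical, and against the real MySQL dump the same WHERE clause would be issued through a MySQL client instead.

```python
import sqlite3

# Stand-in for the MySQL table (column types simplified; zerofill is not available here).
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE GO_mapping_supradomain (
           supradomain TEXT NOT NULL,
           level TEXT NOT NULL,
           go INTEGER NOT NULL,
           all_score REAL NOT NULL DEFAULT 1,
           inherited_from TEXT)"""
)
conn.executemany(
    "INSERT INTO GO_mapping_supradomain VALUES (?, ?, ?, ?, ?)",
    [
        ("11111,22222", "sf", 2, 1e-5, "directed"),   # significant on its own
        ("11111,22222", "sf", 1, 0.2, "0000002"),     # inherited from GO 2
        ("33333", "sf", 3, 1.0, None),                # not supported
    ],
)


def supported_annotations(connection):
    """Rows with a non-empty inherited_from, with ids expanded to readable form."""
    cursor = connection.execute(
        "SELECT supradomain, go, all_score, inherited_from "
        "FROM GO_mapping_supradomain "
        "WHERE inherited_from IS NOT NULL AND inherited_from <> '' "
        "ORDER BY go"
    )
    result = []
    for supradomain, go, fdr, status in cursor:
        # 'directed' means no source terms; otherwise list the descendant GO ids.
        sources = [] if status == "directed" else [
            f"GO:{int(g):07d}" for g in status.split(",")
        ]
        result.append((supradomain.split(","), f"GO:{go:07d}", fdr, sources))
    return result


annotations = supported_annotations(conn)
```

Each returned tuple holds the sunids of the supra-domain, the full GO id rebuilt from its numeric part, the FDR, and the list of descendant terms the annotation was inherited from (empty for 'directed' rows).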

    GO_ic_supra: containing info about SPFO.
        > DESC GO_ic_supra;
        +---------+---------------------------+------+-----+---------+-------+
        | Field   | Type                      | Null | Key | Default | Extra |
        +---------+---------------------------+------+-----+---------+-------+
        | level   | enum('cl','cf','sf','fa') | NO   | PRI | NULL    |       |
        | go      | int(7) unsigned zerofill  | NO   | PRI | NULL    |       |
        | ic      | double                    | YES  |     | NULL    |       |
        | include | tinyint(2)                | YES  | MUL | NULL    |       |
        +---------+---------------------------+------+-----+---------+-------+
        
    • The level column is the level in the SCOP hierarchy. It can be one of 'cl' for class, 'cf' for fold, 'sf' for superfamily, 'fa' for family.
    • The go column is the numeric part of GO id.
    • The ic column shows the information content of the GO term.
    • The include column indicates whether or not the GO term belongs to the SPFO. If the column is set to '0' then it is not a member of SPFO. Otherwise, '1' for least informative (i.e., the most general), '2' for moderately informative, '3' for informative, '4' for highly informative (i.e., the most specific).


References

    Andreeva, A., Howorth, D., Chandonia, J.M., Brenner, S.E., Hubbard, T.J., Chothia, C. and Murzin, A.G. (2008) Data growth and its impact on the SCOP database: new developments, Nucleic Acids Res, 36, D419-425.
    Barrell, D., Dimmer, E., Huntley, R.P., Binns, D., O'Donovan, C. and Apweiler, R. (2009) The GOA database in 2009--an integrated Gene Ontology Annotation resource, Nucleic Acids Res, 37, D396-403.
    Chothia, C. and Gough, J. (2009) Genomic and structural aspects of protein evolution, Biochem J, 419, 15-28.
    de Lima Morais, D.A., Fang, H., Rackham, O.J., Wilson, D., Pethica, R., Chothia, C. and Gough, J. (2011) SUPERFAMILY 1.75 including a domain-centric gene ontology method, Nucleic Acids Res, 39, D427-434.