SUPERFAMILY 1.75 HMM library and genome assignments server

Function Annotation of SCOP Domain Superfamilies

Christine Vogel1,2
1MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, England
2Institute for Cellular and Molecular Biology, University of Texas at Austin, 2500 Speedway, MBB 3.210, Austin, TX 78712, USA

*Correspondence to: cvogel at mail utexas edu

This document describes the function annotation of domain superfamilies. Domains are structural, functional and evolutionary units that form proteins. Domains of common ancestry are grouped into superfamilies. The domains and domain superfamilies are defined and described in the Structural Classification of Proteins database, SCOP [1,2]. This function annotation of domain superfamilies has been published before [3,4], and we kindly ask you to cite us if you use it. The annotation procedure, as used in those papers, is described below. Recent work [5] updated the function scheme, revised the annotation of eukaryotic superfamilies, and extended it to all SCOP classes a to g.

UPDATE
Christine Vogel has extended her functional annotation of superfamilies to SCOP 1.73. The 1.73 annotation is available in the scop.annotation.1.73.txt file. The functional annotation scheme has not changed. The remainder of this document refers to the SCOP 1.69 annotation.



FUNCTION SCHEME

The exact definition of the 'function' of a protein or domain is still a matter of debate and can vary with context. In our work, we annotated domain superfamilies with respect to their usual role in a protein, in a particular pathway or in the cell/organism. Thus, our understanding of 'function' is a mixture of the definitions of 'biological process' and 'molecular function' used in the Gene Ontology [6] annotation.

We prepared a scheme of 50 detailed function categories which map to 7 more general function categories, similar to the scheme used in COGs [7]. The mapping between the detailed and more general function categories is described in Table 1 and in the scop.larger.categories file. The general categories of function are:

i) Information: storage and maintenance of the genetic code; DNA replication/repair; general transcription/translation
ii) Regulation: regulation of gene expression and protein activity; information processing in response to environmental input; signal transduction; general regulatory or receptor activity
iii) Metabolism: anabolic and catabolic processes; cell maintenance/homeostasis; secondary metabolism
iv) Intra-cellular processes: cell motility/division; cell death; intra-cellular transport; secretion
v) Extra-cellular processes: inter- and extra-cellular processes, e.g. cell adhesion; organismal processes, e.g. blood clotting, immune system
vi) General: general and multiple functions; interactions with proteins/ions/lipids/small molecules
vii) Other/Unknown: unknown function, viral proteins/toxins

We are aware that the members of some superfamilies, particularly the large ones, may have a variety of functions. For example, immunoglobulin domains are involved in cell adhesion, muscle structure, the extra-cellular matrix and the immune system. The function categories here aim to describe the dominant and most widespread function for each superfamily, as far as it is known today.



ANNOTATION SCHEME

We annotated each domain superfamily of the SCOP classes a to g manually using the function scheme described above. The annotation was based on information from SCOP [2], InterPro [8,9], Pfam [10], SwissProt [11] and the literature.

As a control, we used the automated annotation of GO process, function and location terms to Pfam domains in InterPro [8]. Pfam domains were mapped onto SCOP domain superfamilies based on sequence similarity. This provided annotation for 647, 667 and 343 domain superfamilies, respectively. The manual domain annotation was largely consistent with the Gene Ontology annotation [6] for Pfam [12] domains and their mappings to the domains described in SUPERFAMILY [13]. The annotation for large superfamilies, i.e. those that occur in more than ~25 proteins in at least one of the commonly used, completely sequenced eukaryotes, was checked several times by different researchers [5]. We also consulted co-workers about the function of well-known superfamilies. In particular, we thank Matthew Bashton [14], Cyrus Chothia and Madan Mohan Babu for their valuable input.

Based on our experience in working with this annotation, we estimate the error rate to be <10% for large superfamilies, and <20% for all superfamilies. If you use the function annotation, please do not hesitate to contact us if you notice an erroneous or inappropriate annotation.

The domain function annotation is available in the scop.annotation.1.69.txt file.
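The layout of scop.annotation.1.69.txt is not described in this document, so the following is only a minimal sketch of how the annotation could be tallied by function code. It assumes, purely for illustration, that each line is tab-separated and that the function code sits in one known column; the function name count_codes, the column position and the sample rows are all hypothetical and should be checked against the actual file.

```python
import csv
from collections import Counter
from io import StringIO


def count_codes(lines, code_column=0, delimiter="\t"):
    """Count how many rows carry each function code.

    ASSUMPTION: one superfamily per row, tab-separated, with the
    one- or two-letter function code in column `code_column`.
    Adjust both parameters after inspecting the real file.
    """
    counts = Counter()
    for row in csv.reader(lines, delimiter=delimiter):
        # Skip blank lines, comment lines and rows that are too short.
        if not row or row[0].startswith("#") or len(row) <= code_column:
            continue
        counts[row[code_column].strip()] += 1
    return counts


# Invented placeholder rows, NOT taken from the real annotation file.
sample = StringIO("RA\tsf_0001\nRA\tsf_0002\n\nT\tsf_0003\n")
counts = count_codes(sample)
print(counts.most_common())
```

To run it on the real file, open the file and pass the file object instead of the in-memory sample, for example count_codes(open("scop.annotation.1.69.txt", newline="")).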

Distribution of domain functions

Figure 1 shows the distribution of functions in terms of domain superfamilies in SCOP. Metabolic domain superfamilies, e.g. enzymes, form the most abundant category. Close to half of all superfamilies (448) have metabolism-related functions, while each of the other categories comprises less than 15% of the domain superfamilies. In human, one third of the superfamilies are metabolic (339/950), mapping to one sixth of all domains (3212/19225) [13]. Some 10% of the superfamilies (122) have unknown functions.
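The human fractions quoted above can be checked with a few lines of arithmetic. This is only a sketch that uses the counts given in the text; the variable names are ours.

```python
# Counts for human as quoted in the text [13].
metabolic_superfamilies, all_superfamilies = 339, 950
metabolic_domains, all_domains = 3212, 19225

superfamily_fraction = metabolic_superfamilies / all_superfamilies
domain_fraction = metabolic_domains / all_domains

# Roughly one third of superfamilies and one sixth of domains.
print(f"metabolic superfamilies: {superfamily_fraction:.1%}")  # 35.7%
print(f"metabolic domains:       {domain_fraction:.1%}")       # 16.7%
```

The gap between the two fractions indicates that metabolic superfamilies have, on average, fewer domain copies in human than superfamilies of the other categories.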

Figure 1. The distribution of domain functions. The distribution of functions of domain superfamilies of classes a to g in SCOP version 1.69 [2].

Table 1. Mapping between detailed and more general function categories.

The table lists 50 detailed function categories which map to 7 more general function categories. The one- or two-letter code is used in the annotation file. m/tr - metabolism and transport.

General function          Detailed function          Code
Metabolism                Energy                     C
Metabolism                Photosynthesis             CB
General                   Small molecule binding     HA
General                   Ion binding                HB
General                   Lipid/membrane binding     HC
General                   Ligand binding             HE
General                   General                    R
General                   Protein interaction        RD
General                   Structural protein         ST
Information               Chromatin structure        B
Information               Translation                J
Information               Transcription              K
Information               DNA replication/repair     L
Information               RNA processing             LB
Information               Nuclear structure          Y
Metabolism                E- transfer                CA
Metabolism                Amino acids m/tr           E
Metabolism                Nitrogen m/tr              EA
Metabolism                Nucleotide m/tr            F
Metabolism                Carbohydrate m/tr          G
Metabolism                Polysaccharide m/tr        GA
Metabolism                Storage                    GB
Metabolism                Coenzyme m/tr              H
Metabolism                Lipid m/tr                 I
Metabolism                Cell envelope m/tr         M
Metabolism                Secondary metabolism       Q
Metabolism                Redox                      RA
Metabolism                Transferases               RB
Metabolism                Other enzymes              RC
Other                     Unknown function           S
Other                     Viral proteins             SA
Extra-cellular processes  Cell adhesion              MA
Extra-cellular processes  Immune response            RE
Extra-cellular processes  Blood clotting             RG
Extra-cellular processes  Toxins/defense             SB
Intra-cellular processes  Cell cycle, Apoptosis      D
Intra-cellular processes  Phospholipid m/tr          IA
Intra-cellular processes  Cell motility              N
Intra-cellular processes  Trafficking/secretion      NA
Intra-cellular processes  Protein modification       O
Intra-cellular processes  Proteases                  OA
Intra-cellular processes  Ion m/tr                   P
Intra-cellular processes  Transport                  RF
Regulation                RNA binding, m/tr          A
Regulation                DNA-binding                LA
Regulation                Kinases/phosphatases       OB
Regulation                Signal transduction        T
Regulation                Other regulatory function  TA
Regulation                Receptor activity          HD
N_A                       not annotated              NONA
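For programmatic use, the mapping in Table 1 can be written as a lookup from the detailed code to the general category. The sketch below transcribes the codes from Table 1; the names CODES_BY_GENERAL, GENERAL_BY_CODE and general_category are our own and do not come from the annotation files.

```python
# General category -> detailed function codes, transcribed from Table 1.
CODES_BY_GENERAL = {
    "Metabolism": ["C", "CB", "CA", "E", "EA", "F", "G", "GA", "GB",
                   "H", "I", "M", "Q", "RA", "RB", "RC"],
    "General": ["HA", "HB", "HC", "HE", "R", "RD", "ST"],
    "Information": ["B", "J", "K", "L", "LB", "Y"],
    "Other": ["S", "SA"],
    "Extra-cellular processes": ["MA", "RE", "RG", "SB"],
    "Intra-cellular processes": ["D", "IA", "N", "NA", "O", "OA", "P", "RF"],
    "Regulation": ["A", "LA", "OB", "T", "TA", "HD"],
    "N_A": ["NONA"],
}

# Invert to a flat lookup: detailed code -> general category.
GENERAL_BY_CODE = {
    code: general
    for general, codes in CODES_BY_GENERAL.items()
    for code in codes
}


def general_category(code):
    """Return the general category for a detailed function code.

    Raises KeyError for codes that are not listed in Table 1.
    """
    return GENERAL_BY_CODE[code.strip()]


print(general_category("OB"))  # Regulation
print(general_category("RE"))  # Extra-cellular processes
```

Keeping the table grouped by general category makes it easy to compare against Table 1 by eye, while the inverted dictionary gives constant-time lookups per annotated superfamily.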

References

1. Murzin AG, Brenner SE, Hubbard T, Chothia C (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247: 536-540.
2. Andreeva A, Howorth D, Brenner SE, Hubbard TJ, Chothia C, et al. (2004) SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res 32: D226-229.
3. Vogel C, Berzuini C, Bashton M, Gough J, Teichmann SA (2004) Supra-domains - evolutionary units larger than single protein domains. J Mol Biol 336: 809-823.
4. Vogel C, Teichmann SA, Pereira-Leal JB (2005) The relationship between domain duplication and recombination. J Mol Biol 346: 355-365.
5. Vogel C, Chothia C (2006) Protein family expansions and biological complexity. PLoS Comput Biol 2: e48.
6. Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, et al. (2004) The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 32: D258-261.
7. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, et al. (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4: 41.
8. Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Barrell D, et al. (2003) The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Res 31: 315-318.
9. Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, et al. (2005) InterPro, progress and status in 2005. Nucleic Acids Res 33: D201-205.
10. Finn RD, Mistry J, Schuster-Bockler B, Griffiths-Jones S, Hollich V, et al. (2006) Pfam: clans, web tools and services. Nucleic Acids Res 34: D247-251.
11. Boeckmann B, Blatter MC, Famiglietti L, Hinz U, Lane L, et al. (2005) Protein variety and functional diversity: Swiss-Prot annotation in its biological context. C R Biol 328: 882-899.
12. Bateman A, Coin L, Durbin R, Finn RD, Hollich V, et al. (2004) The Pfam protein families database. Nucleic Acids Res 32: D138-141.
13. Madera M, Vogel C, Kummerfeld SK, Chothia C, Gough J (2004) The SUPERFAMILY database in 2004: additions and improvements. Nucleic Acids Res 32: D235-239.
14. Bashton M (2004) Functional Analysis of Domain Combinations [PhD]. Cambridge, UK: University of Cambridge.