Introduction
Identifying surface functional residues
Annotating the structures using sequence information
Annotating the structures using structural information
GO terms annotations
Comparing the annotated structures
References


Introduction

Surface comparison [1,2] is a useful tool for understanding biomolecular function, since it may help identify determinants that do not depend on protein sequence or secondary structure. Protein surfaces are critically involved in selective binding, recognition and interaction with molecular partners; methods for surface comparison may therefore give new insights into protein function. Large-scale surface comparison experiments have not yet been attempted because of the calculation time required. Since functional sites are often localized on protein surfaces and the residues involved in the function usually form a limited set, our goal is to create a database of surface patches to be used in an all-versus-all comparison, reducing the calculation time by focusing on surface regions of functional interest. It has been shown [3] that functional sites very often correspond to surface clefts and cavities. Using an automated procedure (SURFNET [4]), we identified such surface cavities for all the proteins of known structure in a non-redundant PDB dataset and calculated the protein residues that define them, obtaining a collection of patches.
Once such surface regions have been calculated, information about their function can be derived from different sources, such as sequence or functional pattern databases (PROSITE, SWISS-PROT, etc.). Moreover, several proteins of known structure have been crystallized together with a ligand (e.g. a substrate or an inhibitor); this information can be used to annotate the binding ability of the patches.
A new surface comparison method allows each patch to be compared with the whole patch database, identifying non-obvious functional site similarities. The results have been integrated into the annotated patch database, allowing annotated and non-annotated patches to be clustered and the not yet annotated patches to be annotated when they share a significant similarity with annotated ones.
The construction of a surface patch database, together with the possibility of comparing the patches, leads to the identification of similarities that are not easily detectable with sequence or structure comparison methods, and focuses attention on putative functional sites. The patch database is accessible on-line: for each patch, a list of similar surface patches is reported, together with the detected ligand interaction ability and sequence-derived information. This application can therefore be a powerful tool for the analysis of protein function, considering the great number of uncharacterized structures being solved in the structural genomics era.
Thus, our approach and the resulting database and comparison tools may allow:
1) identification of new functional sites on the surface of already characterized proteins;
2) identification of new attractors of not yet identified function in the space of protein surfaces;
3) identification of functional sites by similarity in protein structures solved in structural genomics projects;
4) analysis of new problems manageable with a local protein surface comparison method.

Identifying surface functional residues

We use an NCBI non-redundant PDB chain set based on clustering of the chain sequences with a sequence-similarity cutoff corresponding to a BLAST p-value of 10e-7. The list contains 2428 non-redundant chains (1923 X-ray, 504 NMR and one theoretical model). To avoid problems due to the multiple models deposited for each NMR structure, we use only structures solved by X-ray crystallography. Of the 504 non-redundancy groups with an NMR-solved representative member, 407 contain only NMR-solved structures, while 97 also contain X-ray-solved proteins; for each of these 97 groups we added to our list the X-ray structure with the best structural quality and/or the greatest length. We therefore considered 1923 + 97 = 2020 X-ray structures (see the complete list) and applied the SURFNET algorithm to them to identify surface clefts and cavities that may be functional sites. Using a minimum volume cutoff, we identified a total of 10175 cavities in 1715 structures (an average of 5.9 surface cavities per structure), while the other 305 chains do not have sufficiently wide surface clefts (these are usually very small peptides). For each surface cavity we define a surface 'patch' as the set of residues surrounding the cleft, which may therefore be involved in surface functions.
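
The dataset bookkeeping in this section can be checked with a few lines of arithmetic. The sketch below only restates the counts quoted in the text; the variable names are ours and are not part of any pipeline described here.

```python
# Counts quoted in the text (variable names are illustrative only).
xray_representatives = 1923
nmr_representatives = 504
theoretical_models = 1
nmr_only_groups = 407
nmr_groups_with_xray = 97

# Size of the non-redundant chain list.
total_chains = xray_representatives + nmr_representatives + theoretical_models

# X-ray structures actually processed: the X-ray representatives plus one
# X-ray member rescued from each mixed NMR/X-ray group.
xray_structures = xray_representatives + nmr_groups_with_xray

cavities = 10175
structures_with_cavities = 1715
structures_without_cavities = 305

mean_cavities = cavities / structures_with_cavities

print(total_chains)              # 2428
print(xray_structures)           # 2020
print(round(mean_cavities, 1))   # 5.9
```

The structures with and without cavities (1715 + 305) add up to the same 2020 X-ray structures, so the figures are internally consistent.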

Annotating the structures using sequence information

Proteins with similar function often share short regions of similar sequence, called motifs or patterns. The presence of a given pattern in a protein may be used to infer its function. We scan the sequences of the dataset proteins against the motifs in the PROSITE database using the ScanProsite algorithm [5], skipping motifs marked as unspecific. The sequences were determined by comparing the sequence stored in the SEQRES field of each PDB file with the list of residues for which coordinates are present in the file, thereby establishing which part of the SEQRES sequence has been solved.
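
To illustrate what scanning a sequence with a PROSITE pattern involves, here is a minimal sketch that translates a common subset of the PROSITE pattern syntax into a regular expression. It is not ScanProsite and does not reproduce its behaviour in full: the function names are ours, only the basic syntax elements are handled, and overlapping matches are not reported. The pattern used in the example is the PROSITE P-loop motif, and the sequence is a short illustrative fragment.

```python
import re

# One PROSITE element: x (any residue), a single residue, [alternatives]
# or {exclusions}, optionally followed by a repeat such as (4) or (2,4).
_ELEMENT = re.compile(
    r"(?P<core>x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})"
    r"(?:\((?P<lo>\d+)(?:,(?P<hi>\d+))?\))?"
)

def prosite_to_regex(pattern):
    """Translate a PROSITE pattern (common subset only) into a regex string."""
    regex = ""
    for element in pattern.strip().rstrip(".").split("-"):
        at_start = element.startswith("<")   # '<' anchors to the N-terminus
        at_end = element.endswith(">")       # '>' anchors to the C-terminus
        element = element.lstrip("<").rstrip(">")
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported PROSITE element: {element!r}")
        core = m.group("core")
        if core == "x":
            core = "."
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"
        if m.group("lo"):
            hi = m.group("hi")
            core += "{" + m.group("lo") + ("," + hi if hi else "") + "}"
        regex += ("^" if at_start else "") + core + ("$" if at_end else "")
    return regex

def scan(sequence, pattern):
    """Return (start, end, matched_text) with 1-based inclusive positions."""
    rx = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.end(), m.group()) for m in rx.finditer(sequence)]

hits = scan("MTEYKLVVVGAGGVGKSALT", "[AG]-x(4)-G-K-[ST]")
print(hits)  # [(10, 17, 'GAGGVGKS')]
```

Positions are reported relative to the scanned sequence; mapping them back to PDB residue numbers would additionally require the SEQRES-to-coordinates correspondence described above.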

Annotating the structures using structural information

The binding ability of a protein is a clue to its function: very often a protein structure is solved in a bound state, using known natural ligands or ligand analogs. We located the bound ligands in our non-redundant dataset of protein chains and identified the interacting residues by scanning the space around each ligand atom. The residues shared between those interacting with the ligand and those belonging to a surface patch allow the patch to be annotated with a binding ability.
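
The annotation step can be sketched as a distance scan followed by a set intersection. The code below is only an illustration under stated assumptions: the distance cutoff of 4.0 is a placeholder of ours (the text does not give the value used), the function names are hypothetical, and the coordinates are toy data rather than a real structure.

```python
import math

def residues_near_ligand(protein_atoms, ligand_atoms, cutoff=4.0):
    """Collect residues having at least one atom within `cutoff` of any
    ligand atom. `protein_atoms` is a list of (residue_id, (x, y, z));
    `ligand_atoms` is a list of (x, y, z). The cutoff is an assumed value."""
    contacts = set()
    for residue_id, coord in protein_atoms:
        if any(math.dist(coord, lig) <= cutoff for lig in ligand_atoms):
            contacts.add(residue_id)
    return contacts

def binding_residues_in_patch(patch_residues, contact_residues):
    """Residues that both belong to the patch and contact the ligand."""
    return set(patch_residues) & set(contact_residues)

# Toy data: two atoms of SER10, one of LYS11, one of ASP50, one ligand atom.
protein_atoms = [
    ("SER10", (0.0, 0.0, 0.0)),
    ("SER10", (1.5, 0.0, 0.0)),
    ("LYS11", (3.0, 0.0, 0.0)),
    ("ASP50", (20.0, 0.0, 0.0)),
]
ligand_atoms = [(3.0, 3.0, 0.0)]

contacts = residues_near_ligand(protein_atoms, ligand_atoms)
overlap = binding_residues_in_patch({"LYS11", "GLY12", "ASP50"}, contacts)
print(sorted(contacts))  # ['LYS11', 'SER10']
print(sorted(overlap))   # ['LYS11']
```

A patch with a non-empty overlap would then be annotated with the corresponding ligand; how large the overlap must be is a design choice not covered by this sketch.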

GO terms annotations

The Gene Ontology annotation [6] associated with the PDB chains is derived from the flat file (released in July 2003) provided by the Gene Ontology Annotation (GOA) project at EBI [7].

Comparing the annotated structures

We used a newly developed protein structure comparison tool (Ausiello et al., manuscript in preparation) to obtain an all-versus-all structural comparison of the surface patches. The algorithm is suited to this task because of its speed and its ability to explore all the possible combinations of similar amino acids between two structures in a sequence-independent way, looking for the largest subset of superposable points. Two subsets of amino acids are considered to match when their superposition yields a low root mean square deviation (rmsd) and a good similarity according to a chosen substitution matrix. Since calculation time is a limiting factor and the number of structural comparisons is very large, analysing all the atom coordinates would be intractable. Moreover, using atomic coordinates would impair the superposition of proteins or patches with a different number of atoms (i.e. with different residues). The first step of the procedure is therefore a reduction of the spatial information, an approach frequently found in structural comparison tools. Each residue is represented as a pseudo-residue composed of two points: the C-alpha atom and the center of gravity of the side-chain atoms. The algorithm calculates the distance between each pair of amino acids in both the query and the target protein. Then, to find the largest subset of matching amino acids, it compares all the possible pairs, taking into account the distance and the physico-chemical properties of the amino acids in each pair. When a seed match is found, i.e. a pair of residues in the query protein that is similar to, and at a similar distance as, a pair of amino acids in the target protein, the algorithm tries to extend the subset by adding a new residue to the seed match. All the amino acids that lie less than 7.5 Å from the previous subset are scanned.
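
The pseudo-residue reduction and the seed test can be sketched as follows. This is not the comparison tool itself, whose implementation is not given here. Several choices are assumptions of ours: the unweighted centroid as the side-chain "center of gravity", the fallback to the C-alpha for residues without side-chain atoms, the coarse residue grouping used as a stand-in for a substitution matrix, and the distance tolerance.

```python
import math

def centroid(points):
    """Unweighted mean of a list of (x, y, z) points (an assumption; a
    mass-weighted center could be used instead)."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def pseudo_residue(aa, ca, side_chain_atoms):
    """Reduce a residue to (type, C-alpha point, side-chain point).
    With no side-chain atoms (e.g. glycine) we reuse the C-alpha point,
    which is our assumption, not a documented rule."""
    sc = centroid(side_chain_atoms) if side_chain_atoms else ca
    return (aa, ca, sc)

# Toy physico-chemical grouping; a real run would score similarity with
# a substitution matrix instead.
GROUPS = [set("ILVM"), set("FWY"), set("KRH"), set("DE"),
          set("STNQ"), set("AG"), set("C"), set("P")]

def similar(a, b):
    return any(a in g and b in g for g in GROUPS)

def is_seed_match(q1, q2, t1, t2, tolerance=1.0):
    """True if query pair (q1, q2) and target pair (t1, t2) have similar
    residue types and similar C-alpha and side-chain point distances.
    `tolerance` is a hypothetical parameter."""
    if not (similar(q1[0], t1[0]) and similar(q2[0], t2[0])):
        return False
    ca_diff = abs(math.dist(q1[1], q2[1]) - math.dist(t1[1], t2[1]))
    sc_diff = abs(math.dist(q1[2], q2[2]) - math.dist(t1[2], t2[2]))
    return ca_diff <= tolerance and sc_diff <= tolerance

q1 = pseudo_residue("S", (0, 0, 0), [(1, 0, 0), (1, 1, 0)])
q2 = pseudo_residue("K", (5, 0, 0), [(6, 0, 0)])
t1 = pseudo_residue("T", (10, 10, 10), [(11, 10, 10), (11, 11, 10)])
t2 = pseudo_residue("R", (15, 10, 10), [(16, 10, 10)])

print(is_seed_match(q1, q2, t1, t2))  # True
```

Because only inter-residue distances are compared, the test does not depend on how the two structures are oriented, which is what makes it usable before any superposition is computed.
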
At each step, the superposition is evaluated through the rmsd, using a cut-off value of 0.7 Å, and through the Dayhoff substitution matrix [8]. The algorithm stops when all the possible combinations of subsets have been explored. The procedure can be configured so that the algorithm considers only structural similarities involving at least a given percentage of the amino acids belonging to a functional annotation; this value is usually set to 50%. In this way we avoid considering similarities in protein regions other than those containing the annotated residues, and the match found is more likely to be meaningful. The superposition score is the highest number of amino acid pairs that the program was able to match between the two proteins. The significance of this score is evaluated by calculating its Z-score over the whole distribution of scores obtained by comparing the query protein against the entire dataset. Using stringent parameters for rmsd and residue similarity, the algorithm is able to compare two proteins of average size (about 300 residues) in 10^-2 seconds on a 1.0 GHz Pentium III PC with 512 MB RAM.
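
The significance estimate can be illustrated with a standard Z-score computed over all the scores obtained for one query. The sketch below assumes the population standard deviation and uses made-up scores; the text does not specify these details, and the function name is ours.

```python
from statistics import mean, pstdev

def z_scores(scores):
    """Z-score of each superposition score relative to the distribution of
    all scores obtained for the same query. Uses the population standard
    deviation, which is an assumption."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # All scores identical: no match stands out.
        return [0.0] * len(scores)
    return [(s - mu) / sigma for s in scores]

# Made-up scores (number of matched residue pairs) for one query patch
# compared against seven target patches.
scores = [3, 3, 4, 3, 4, 3, 12]
z = z_scores(scores)
print([round(v, 2) for v in z])
```

In this toy distribution only the last comparison stands well above the background, which is the kind of match the Z-score is meant to highlight.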

References

[1] Via, A., Ferrè, F., Brannetti, B., Valencia, A. and Helmer Citterich, M. (2000). 3D view of the surface motif associated to the P-loop structure: cis and trans cases of convergent evolution. J. Mol. Biol., 303, 455-465.
[2] Via, A., Ferrè, F., Brannetti, B. and Helmer Citterich, M. (2000). Protein surface similarities: a survey of methods to describe and compare protein surfaces. Cell. Mol. Life Sci., 57, 1970-1977.
[3] Laskowski, R.A., Luscombe, N.M., Swindells, M.B. and Thornton, J.M. (1996). Protein clefts in molecular recognition and function. Protein Sci., 5(12), 2438-2452.
[4] Laskowski, R.A. (1995). SURFNET: a program for visualizing molecular surfaces, cavities, and intermolecular interactions. J. Mol. Graph., 13(5), 323-330, 307-308.
[5] Gattiker, A., Gasteiger, E. and Bairoch, A. (2002). ScanProsite: a reference implementation of a PROSITE scanning tool. Applied Bioinformatics, 1(2), 107-108.
[6] The Gene Ontology Consortium (2000). Gene Ontology: tool for the unification of biology. Nat. Genet., 25, 25-29.
[7] Camon, E., Magrane, M., Barrell, D., Binns, D., Fleischmann, W., Kersey, P., Mulder, N., Oinn, T., Maslen, J., Cox, A. and Apweiler, R. (2003). The Gene Ontology Annotation (GOA) project: implementation of GO in Swiss-Prot, TrEMBL and InterPro. Genome Research, 13(4), 662-672.
[8] Schwartz, R.M. and Dayhoff, M.O. (1979). Matrices for detecting distant relationships. In Atlas of Protein Sequence and Structure (Dayhoff, M.O., ed.), pp. 353-358, National Biomedical Research Foundation, Washington DC.
