Introduction
Surface comparison [1,2] is a useful tool for understanding biomolecular
function, since it can help identify determinants that do not depend on
protein sequence or secondary structure. Protein surfaces are critically
involved in selective binding, recognition and interaction with molecular
partners; methods for surface comparison may therefore give new insights
into protein function.
Large-scale surface comparison experiments have not yet been attempted
because of the calculation time required. Since functional sites are often
localized on protein surfaces and the residues involved in the function
usually represent a limited set, our goal is to create a database of surface
patches to be used in an all-versus-all comparison, reducing the calculation
time by focusing on surface regions of functional interest. It has been
shown [3] that functional sites very often correspond to surface clefts and
cavities; with an automated procedure (SURFNET [4]) we have identified such
surface cavities for all the proteins of known structure in a non-redundant
PDB database, and we have calculated the protein residues that define them,
obtaining a collection of patches.

Identifying surface functional residues

We use an NCBI non-redundant PDB chain set based on clustering of the chain sequences with a sequence-similarity cutoff corresponding to a BLAST p-value of 10e-7. The list contains 2428 non-redundant chains (1923 X-ray, 504 NMR and one theoretical model). To avoid problems due to the multiple models deposited for each NMR structure, we use only X-ray structures. Of the 504 non-redundancy groups with an NMR-solved representative member, 407 contain only NMR-solved structures, while 97 also contain X-ray-solved proteins; for each of these 97 groups we added to our list the X-ray structure with the best structural quality and/or the greatest length. We therefore considered 1923 + 97 = 2020 X-ray structures (see the complete list) and applied the SURFNET algorithm to them to identify surface clefts and cavities that may be functional sites. Using a minimum volume cutoff, we identified a total of 10175 cavities in 1715 structures (an average of 5.9 surface cavities per structure), while the remaining 305 chains do not have sufficiently wide surface clefts (these are usually very small peptides). For each surface cavity we define a surface 'patch' as the set of residues surrounding the cleft, which may therefore be involved in surface functions.

Annotating the structures using sequence information

Proteins with similar function often share short regions of similar sequence, called motifs or patterns. The presence of a given pattern in a protein may be used to infer its function. We scan the sequences of the dataset proteins with the motifs in the PROSITE database using the ScanProsite algorithm [5], avoiding those motifs marked as unspecific.
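To make the pattern scan described above concrete, the following is a minimal Python sketch of how a PROSITE-style pattern can be translated into a regular expression and matched against a sequence. It is not the ScanProsite implementation: the function names `prosite_to_regex` and `scan` are hypothetical, only the basic pattern syntax is handled (`x`, `[...]`, `{...}`, repetition counts, and the `<` and `>` anchors outside brackets), and no check is made for motifs flagged as unspecific. The P-loop pattern and the short test sequence are used only to illustrate the syntax.

```python
import re

def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regex string."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):          # N-terminal anchor
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # C-terminal anchor
            suffix, element = "$", element[:-1]
        # Separate the residue specification from an optional (n) or (n,m) count.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":                      # any residue
            body = "."
        elif core.startswith("["):           # set of allowed residues
            body = core
        elif core.startswith("{"):           # set of forbidden residues
            body = "[^" + core[1:-1] + "]"
        else:                                # a single fixed residue
            body = re.escape(core)
        if low and high:
            body += "{%s,%s}" % (low, high)
        elif low:
            body += "{%s}" % low
        parts.append(prefix + body + suffix)
    return "".join(parts)

def scan(sequence, pattern):
    """Return (1-based start, matched text) for non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]

if __name__ == "__main__":
    # Illustrative pattern and sequence only.
    print(prosite_to_regex("[AG]-x(4)-G-K-[ST]."))
    print(scan("MTEYKLVVVGAGGVGKSALT", "[AG]-x(4)-G-K-[ST]."))
```

Note that `finditer` reports non-overlapping matches only; a scanner that must report overlapping occurrences would need a different search loop.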
The sequences were determined by comparing the sequence stored in the SEQRES field of each PDB file with the list of residues for which coordinates are present in the file, thereby establishing which part of the SEQRES sequence has been solved.

Annotating the structures using structural information

The binding ability of a protein is a clue to its function: very often a protein structure is solved in a bound state, using known natural ligands or ligand analogs. We located the bound ligands in our non-redundant dataset of protein chains and identified the interacting residues by scanning the space around each ligand atom. The residues shared between those interacting with a ligand and those belonging to a surface patch allow the patch to be annotated with a binding ability.

GO terms annotations

The Gene Ontology annotation [6] associated with the PDB chains is derived from the flat file (released in July 2003) of the Gene Ontology Annotation (GOA) project at EBI [7].

Comparing the annotated structures

We used a newly developed protein structure comparison tool (Ausiello et al., manuscript in preparation) to obtain an all-versus-all structural comparison of the surface patches. The algorithm is suitable for this task given its speed and its ability to explore all the combinations of possible amino acid similarities between two structures in a sequence-independent way, looking for the largest subset of superposable points. Two subsets of amino acids are considered to match when their superposition gives a low root mean square deviation (rmsd) and a good similarity according to a chosen substitution matrix. Because calculation time is a limiting factor and the number of structural comparisons to be performed is very large, analysing all the atomic coordinates would be intractable.
Moreover, using the atomic coordinates would hinder the superposition of proteins or patches with a different number of atoms (i.e. with different residues). The first step of the procedure is therefore a reduction of the spatial information, an approach frequently found in structural comparison tools. Each residue is represented as a pseudo-residue composed of two points: the C-alpha atom and the center of gravity of the side-chain atoms.

The algorithm calculates the distance between each pair of amino acids for both the query and the target protein. Then, in order to find the largest subset of matching amino acids, it compares all the possible pairs, taking into account the distance and the physico-chemical properties of the amino acids composing them. When a seed match is found, i.e. a pair of residues in the query protein is similar to a pair of amino acids in the target protein and lies at a similar distance, the algorithm tries to extend the subset by adding a new residue to the seed match. All the amino acids that lie less than 7.5 Å from the subset matched so far are scanned. At each step, the superposition is evaluated through the rmsd, using a cut-off value of 0.7 Å, and through the Dayhoff substitution matrix [8]. The algorithm stops when all the possible combinations of subsets have been explored.

The procedure parameters can be set to force the algorithm to consider only structural similarities involving at least a given percentage of the amino acids belonging to a functional annotation. This value is usually set to 50%. In this way, we avoid considering similarities in protein regions other than those containing the annotated residues, and the match found is more likely to be meaningful. The superposition score is the highest number of amino acid pairs that the program has been able to match between the two proteins.
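The reduction to pseudo-residues and the search for seed matches described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the comparison tool itself: the center of gravity is taken as an unweighted centroid, glycine falls back to its C-alpha, residue similarity is decided by crude physico-chemical groups standing in for a substitution matrix, and the distance tolerance `tol` and all function names are hypothetical. The extension of seeds and the rmsd evaluation are not shown.

```python
import math
from itertools import combinations

BACKBONE = {"N", "CA", "C", "O"}

# Assumption: coarse similarity classes used instead of a substitution matrix.
GROUPS = [
    {"ALA", "VAL", "LEU", "ILE", "MET"},   # aliphatic
    {"SER", "THR", "ASN", "GLN"},          # polar
    {"LYS", "ARG", "HIS"},                 # positively charged
    {"ASP", "GLU"},                        # negatively charged
    {"PHE", "TYR", "TRP"},                 # aromatic
]

def similar(a, b):
    """True if two residue names are identical or share a group."""
    return a == b or any(a in g and b in g for g in GROUPS)

def pseudo_residue(atoms):
    """Reduce one residue (atom name -> (x, y, z)) to two points:
    the C-alpha and the unweighted centroid of the side-chain atoms."""
    ca = atoms["CA"]
    side = [xyz for name, xyz in atoms.items() if name not in BACKBONE]
    if not side:                 # assumption: glycine reuses its C-alpha
        return ca, ca
    n = len(side)
    centroid = tuple(sum(p[i] for p in side) / n for i in range(3))
    return ca, centroid

def find_seeds(query, target, tol=1.0):
    """query/target: lists of (residue name, C-alpha, side-chain centroid).
    Returns seed matches as ((q1, t1), (q2, t2)) index pairs."""
    seeds = []
    for (i, a), (j, b) in combinations(enumerate(query), 2):
        dq_ca, dq_sc = math.dist(a[1], b[1]), math.dist(a[2], b[2])
        for (k, c), (l, d) in combinations(enumerate(target), 2):
            dt_ca, dt_sc = math.dist(c[1], d[1]), math.dist(c[2], d[2])
            # Both pseudo-residue points must lie at similar distances.
            if abs(dq_ca - dt_ca) > tol or abs(dq_sc - dt_sc) > tol:
                continue
            if similar(a[0], c[0]) and similar(b[0], d[0]):
                seeds.append(((i, k), (j, l)))
            elif similar(a[0], d[0]) and similar(b[0], c[0]):
                seeds.append(((i, l), (j, k)))
    return seeds

if __name__ == "__main__":
    query = [("SER", (0, 0, 0), (1, 0, 0)), ("LYS", (5, 0, 0), (6, 0, 0))]
    target = [("THR", (10, 0, 0), (11, 0, 0)),
              ("ARG", (15, 0, 0), (16, 0, 0)),
              ("GLY", (30, 0, 0), (30, 0, 0))]
    print(find_seeds(query, target))
```

Comparing every query pair with every target pair is quadratic in both inputs, which is one reason the reduction to two points per residue matters for an all-versus-all run.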
The significance of the superposition score is evaluated by calculating its Z-score over the whole distribution of scores obtained by comparing the query protein against the entire dataset. Using stringent parameters for rmsd and residue similarity, the algorithm is able to compare two proteins of average size (about 300 residues) in 10^-2 seconds on a 1.0 GHz Pentium III PC with 512 MB RAM.

References
[1] Via, A., Ferre', F., Brannetti, B., Valencia, A. and Helmer-Citterich, M.
(2000). 3D view of the surface motif associated with the P-loop structure: cis
and trans cases of convergent evolution. J. Mol. Biol., 303, 455-465.