Abstract list 2003
F. Aluffi-Pentini, V. De Fonzo, V. Parisi
A new algorithm for solving differential equations

I. Arisi, Vittorio Rosato, Antonino Cattaneo
Direct and reverse simulations of signal transduction pathways in PC12 cell

P. Ballario, M. Pedicini
An agent based model for the light signal transduction in Neurospora crassa

E. Bartocci, S. Moeller, L. Toldo and E. Merelli
Integration of EnsEMBL with BioAgent

L. Beni, M. Trerotola, L. Antolini and S. Alberti
A protein-DNA recognition code for α-helical transcription factors

B. Berg, G. La Penna, V. Minicozzi, S. Morante, G. Rossi
Multicanonical methods for protein folding

M.F. Blasi, M.Bignami, A.Giuliani, I.Casorelli
A new approach to identify genetic networks using microarrays data

A. Boccia, M. Petrillo, D. Di Bernardo, S. Banfi, A. Guffanti, G. Pesole, G. Paolella
A tool for storage, automated annotation and analysis of Conserved Sequence Tags (CSTs)

B. Brannetti, M. Helmer-Citterich
Comparison of different tools for the prediction of protein interaction specificity

E. Bultrini, E. Pizzi, P. Del Giudice, C. Frontali
Symmetry properties of pentamer usage in non-coding DNA

A. Ceroni, A. Vullo, P. Frasconi
A Combination of Support Vector Machines and Bidirectional Recurrent Neural Networks for Protein Secondary Structure Prediction

A. Cestaro, S. Tosatto, F. Fogolari, S. Toppo, G. Valle
CASPITA @ CASP5

G. Chillemi, A. Bruselles, A. Desideri
Comparative MD simulations of wild type and T718A mutant in the DNA-human topoisomerase I complex

M.L. Chiusano
Protein Structure and genomic composition

M.L. Chiusano, N. Potenza, R. Del Gaudio, G. M. R. Russo, R. Di Giaimo, T. Mondola and G. Geraci
Structure organization of the innexin family: integration of computational methods and molecular data

R. Ciarapica, J. Rosati, G. Cesareni, S. Nasi
Molecular recognition in helix-loop-helix and helix-loop-helix-leucine zipper domains: Design of repertoires and selection of high affinity ligands for natural proteins

F. Ciccarelli, C. von Mering, P. Bork
Inferring Novel Functional Links between Different Metabolic Pathways from Genomic Associations

D. Corà, P. Provero, M. Caselle
Finding regulatory elements in eucaryotes: a statistical approach using Gene Ontology

S. Costantini, G. Colonna, A. M. Facchiano
Coeliac disease: studying the interaction of HLA-DQ2 molecule with gluten peptides by computational methods

A. Davassi, M. Petrillo, G. Paolella
Sequence analysis with CAPRI, a web-desktop application

D. D'Elia, P. Leo, G. Scioscia, P. Lopriore, G. Delle Foglie, F. Licciulli,  M. Millot, F. Weighardt, L. Bonfini, R. Lorberth, P. Heinze, G. Van den Eede, M. Attimonelli, H.-J. Buhk
The GMOs Molecular Register: an Integrated Bioinformatic System to support detection/quantification of GMOs

D. di Bernardo, T.S. Gardner, D. Lorenz, J.J. Collins
Reverse Engineering Genetic Networks: a computational and experimental approach

P. D'Ursi, E. Rovida, P. Arosio, I. Zanella
In silico human genome search and classification of H-ferritin-like genes

The ELM Consortium
Eukaryotic linear motifs in the ELM web tool

A. Emerson, S. Liuni, T.Castrignano',  E. Rossi
Plan for a National Infrastructure in Bioinformatics

A. Facchiano Angelo, A. Facchiano, F. Facchiano
Active Sequences Collection (ASC) and a new strategy to identify protein functions

M. Falconi, A. Desideri
Understanding experimental properties of Cu,Zn SODs through molecular dynamics simulation

L. Ferraro, V. Rosato, G. Giuliano
A probabilistic analysis of peptide distribution in proteomes

F. Ferrè, G. Ausiello, A. Zanzoni, M. Helmer-Citterich
SURFACE a web server for annotation of protein functional sites

F. Fogolari, M. Berrera, H. Molinari
Amino acid empirical contact energy definitions for fold recognition in the space of contact maps

F. Fogolari, S. Tosatto, A. Cestaro, G. Valle, H. Molinari
Native loop conformation recognition by MM/PBSA energy calculation

P. Fontana, C. Segala, S. Toppo, C. Moser, S. Grando, G. Valle and R. Velasco
Bioinformatics within the IASMA grape project: tools for data mining and sequences annotation

A. Giuliani, R. Benigni, M. Colafranceschi, I. Chandrashekar , S.M. Cowsik
Large contact surface interactions between proteins detected by time series analysis methods: a case study on C-phycocyanins

G. Grillo, F. Licciulli, S. Liuni, E. Sbisà, G. Pesole
PatSearch: a program for the detection of patterns and structural motifs in nucleotide sequences

A.Guffanti, L.Lassandro, G.Finocchiaro, H.Muller
The Human - Mouse Promoter Machine at IFOM: a tool for retrieval of orthologous promoter sequences from genome sequence data

D. S. Horner and G. Pesole
The estimation of relative site variability among aligned homologous protein sequences

M. Iacono, F. Mignone  and G. Pesole
Genome-wide analysis of the sequence region surrounding the transcription start site of human mRNAs

C. Lanave, M. Santamaria, C. Saccone
Evolution of gene family in eukaryotes: the BCL-2 gene family

F Lanzarato, G Iazzetti, E Caserta, M. Botta, G Franceschinis, RA Calogero
RRE & ClAW: two new java tools for microarray data mining

M. Lexa, I. Zara, G. Valle
PRIMEX 1.0 and VPCR 2.0: Processing genomic sequence data for efficient and accurate simulation of PCR reactions with genomic DNA as template

P. L. Martelli, P. Fariselli, R. Casadio
An ENSEMBLE machine learning approach for the prediction of all-alpha membrane proteins

E. Medico, L. D'Alessandro, A. Gentile
Handling global expression data from multiple microarray platforms

L. Montecchi-Palazzi, A. Cabibbo, A. Zanzoni, M. Helmer-Citterich, G. Cesareni
MINT a Molecular INTeraction database

A. Mucci, A. Cusmano, M. De Francisci, M.A. Manniello, D. Marra, P. Romano, G. Mauri
Integration of data from different sources: a prototype devoted to p53 mutations

R. Panteri, A. Paiardini, R. Marino, S. Pascarella, G. D’Arcangelo, F. Keller
Reelin is a heparin binding protein: in vitro  testing and in silico analysis

E. Papaleo, G. Santarossa, M. Vai, P. Fantucci, L. De Gioia
Structural model for Gas1p family members by combined  threading and secondary structure prediction methods

G. Pavesi, G. Mauri, G. Pesole
An Algorithm for Finding Common Secondary Structure Motifs in a Set of Unaligned RNA Sequences

E. Pizzi, E. Bultrini, P. Del Giudice, C. Frontali
Computational analysis of non-coding regions in eukaryotic genomes

L. Pugliese
Annotation of EST sequences by a structural bioinformatics approach

A. Romani, M. Trerotola, E. Guerra, A. Emerson, E. Rossi, A. Bronowska and S. Alberti
Detection and analysis of spliced chimeric mRNAs in sequence databanks

S. Saviozzi, M. Lo Iacono, F. Lanzarato, G. Franceschini, G. La Mantia, V. Calabrò and R.A. Calogero
Analysis of p63 isoform-driven gene expression: a cDNA array/bioinformatics integrated approach

R. Specchio, A. Caprera, J. Hatton, L. Milanesi
GeneGrid: a workflow system for sequences analysis

P. Temussi
The Mechanism of Interaction of Sweet Proteins with their Receptor: Modelling the Complexes

A. Via, M. Helmer-Citterich
A structural study for the optimization of functional motifs encoded in protein sequences

N. Vitulo, A. Cestaro, A. Vezzi, M. D'Angelo, F. Simonato, G. Malacrida, S. Campanaro, G. Valle
Annotation of Photobacterium profundum genome

A. Vullo, P. Frasconi
Disulfide Connectivity Prediction using Generalized Recursive Neural Networks and Evolutionary Information

I. Zara, R. Schiavon, G. Valle
Development of new bioinformatic tools to analyze the HLA genetic system

A new algorithm for solving differential equations - (session: Novel Algorithms for Bioinformatics)

F. Aluffi-Pentini, V. De Fonzo, V. Parisi

Sezione INFM "Tor Vergata"
via della Ricerca Scientifica 1 - 00133 Roma


In some previous meetings of GCB (Bioinformatic Cooperation Group) we reported on our progress in developing a new algorithm for the numerical integration of large stiff systems of ordinary differential equations (ODEs).
The algorithm has a wide range of applications, including for example the simulation of signal pathways and large metabolic networks (neglecting spatial aspects such as transport and diffusion) in projects involving the virtual cell.
Our method is based on a new approach to the computation of a matrix exponential, includes an automatic correction of rounding errors, is not too expensive computationally, and lends itself to a short and robust software implementation that can easily be inserted into large simulation packages.
The algorithm is largely insensitive to ill-conditioning and is suitable for any nonlinear problem; moreover, being exact for linear problems, it is especially precise for quasi-linear problems, the most frequent kind in the real world.
The final version of the algorithm is just being published [1]. A preliminary numerical verification has been performed. The paper includes the encouraging results obtained on two sample problems.
The full C listing (including a sample problem) is available as free software, see [1].
Here we describe the main features of the algorithm. The main formulas are reported in fig. 1. Full details can be found in [1].
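The following minimal sketch is not the algorithm of [1], whose details are given only in the paper. It only illustrates the general idea the abstract relies on: for a linear system y' = A y, the step y(t+h) = exp(hA) y(t) is exact whatever the stiffness, whereas an explicit step is not. The matrix exponential here is a plain scaling-and-squaring Taylor series, chosen for brevity; the function names and the test matrix are assumptions made for illustration.

```python
import math

def mat_mul(a, b):
    """Product of two small square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(a, terms=12):
    """exp(a) by scaling and squaring with a truncated Taylor series."""
    n = len(a)
    norm = max(sum(abs(x) for x in row) for row in a)
    # Scale so that the norm of the scaled matrix is at most 0.5.
    squarings = max(0, math.ceil(math.log2(norm)) + 1) if norm > 0 else 0
    scale = 2.0 ** squarings
    scaled = [[x / scale for x in row] for row in a]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def linear_step(a, y, h):
    """Advance y' = A y by one step of size h using exp(hA)."""
    e = mat_exp([[h * x for x in row] for row in a])
    return [sum(e[i][j] * y[j] for j in range(len(y))) for i in range(len(y))]

# Stiff test problem: two decay rates that differ by three orders of magnitude.
A = [[-1.0, 0.0], [0.0, -1000.0]]
y0 = [1.0, 1.0]
h = 0.1
exact_step = linear_step(A, y0, h)
# One explicit Euler step with the same h overshoots to a negative value.
euler_step = [y0[i] + h * sum(A[i][j] * y0[j] for j in range(2)) for i in range(2)]
```

With this step size the exponential step reproduces the decaying solution, while the explicit Euler step drives the fast component to a negative value, which is the usual symptom of stiffness.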

[1] Aluffi-Pentini F, De Fonzo V, Parisi V. A novel algorithm for the numerical integration of systems of ordinary differential equations arising in chemical problems. J. Math. Chem. In press.

Direct and reverse simulations of signal transduction pathways in PC12 cell - (session: Other)

Ivan Arisi(*), Vittorio Rosato($) and Antonino Cattaneo(*§)

(*) LayLine Genomics, via di Castel Romano, 100 - Roma, Italy
($) ENEA, Centro Ricerche Casaccia - Roma, Italy
(§) SISSA Biophysics Department, via Beirut 2-4 - Trieste, Italy

The objective of our work is to formulate a mathematical model of intracellular signal transduction in neuronal cells, in response to extracellular signals, based on protein-protein interaction information from existing databases and literature. The first step has been to describe the propagation of a signal through a model network composed mainly of intracellular protein kinases and phosphatases and activated by the binding of extracellular ligands to the P75, Trk, EGFR and Fas receptors. The network is composed of N single proteins or protein complexes, each represented by one node; every node can exist in two states, either active (usually phosphorylated) or inactive (usually dephosphorylated). The links connecting the nodes are the protein-protein interactions, which can be unary, binary or multiple, mono- or bidirectional. The interactions belong to different types: activation/deactivation, binding/unbinding, chemical processing, activation/deactivation of a second interaction, synthesis/degradation.
The model consists of a set of (2N + Nk) first-order non-linear differential equations in time, in the N variables x(i) (the concentration of the active form of each protein/complex), the N variables n(i) (the total concentration of each protein/complex) and the Nk variables K(i,j), the kinetic interaction constants, which are linear functions of { x(i), i=1...N }. Space is neglected in the model equations:

d[x(i)]/dt = Inter_x{x(j),n(j)} - des(i)*x(i) ,   i=1...N
d[n(i)]/dt = Inter_n{x(j),n(j)} + gen(i) - des(i)*n(i)
d[K(i,j)]/dt = Sum[ Kint(r,i,j)*x(r) ] ,   r=1...N

where (n(i)-x(i)) is the concentration of protein/complex (i) in the inactive state, Ka(j,i) the coefficient for activating connections of protein (j) acting on (i), gen(i) the rate of synthesis, des(i) the rate of degradation.

The form of the expressions Inter_x{x(j),n(j)}  and  Inter_n{x(j),n(j)}  depends upon the types of interaction the node (i) is involved in, for example:

Activation:  Inter_x{x(j),n(j)} = Ka(j,i)*(n(i)-x(i))*x(j)
Deactivation:  Inter_x{x(j),n(j)} = - Kd(j,i)*x(i)*x(j)
Complex aggregation:  Inter_x{x(j),n(j)} = Inter_n{x(j),n(j)} = Kmulplus(i)*[x(k1)*...*x(knp)]
...
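As a minimal sketch of a direct simulation, the equations above can be integrated for a toy network that uses only the activation, deactivation, synthesis and degradation terms. Several things here are assumptions made for illustration and are not taken from the abstract: the kinetic constants are held fixed instead of evolving through the K(i,j) equation, the integrator is a plain forward Euler scheme, and the two-node network and all numerical values are invented.

```python
def derivatives(x, n, Ka, Kd, gen, des):
    """Right-hand sides for x(i) (active form) and n(i) (total amount)."""
    size = len(x)
    dx, dn = [], []
    for i in range(size):
        # Activation of the inactive pool (n(i) - x(i)) by every active node j.
        act = sum(Ka[j][i] * (n[i] - x[i]) * x[j] for j in range(size))
        # Deactivation of the active pool x(i) by every active node j.
        deact = sum(Kd[j][i] * x[i] * x[j] for j in range(size))
        dx.append(act - deact - des[i] * x[i])
        dn.append(gen[i] - des[i] * n[i])
    return dx, dn

def simulate(x, n, Ka, Kd, gen, des, dt=0.01, steps=5000):
    """Forward Euler integration; returns the final (x, n)."""
    x, n = list(x), list(n)
    for _ in range(steps):
        dx, dn = derivatives(x, n, Ka, Kd, gen, des)
        x = [xi + dt * d for xi, d in zip(x, dx)]
        n = [ni + dt * d for ni, d in zip(n, dn)]
    return x, n

# Toy network: node 0 is a permanently active receptor, node 1 is its target.
Ka = [[0.0, 2.0], [0.0, 0.0]]   # Ka[j][i]: node j activates node i
Kd = [[0.0, 0.0], [0.0, 0.0]]   # no deactivating interactions in this toy case
gen = [0.0, 0.5]                # synthesis rates
des = [0.0, 0.5]                # degradation rates
x_final, n_final = simulate([1.0, 0.0], [1.0, 1.0], Ka, Kd, gen, des)
```

For these values the steady state of node 1 follows from Ka*(n - x) = des*x, i.e. x = 0.8 with n = 1, which the simulation approaches. A knock-out or a weakened interaction can be mimicked by zeroing or scaling the corresponding entry of Ka or Kd.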

The model can be used for direct simulations or for reverse engineering of the network. Direct simulations include the usual receptor induced activation of the signalling network, effects of perturbations such as weakening of single interactions or protein knock-out and determination of the output steady state. Effects of a drug on the pathway can be modelled as a new node in the network interacting with its targets.
Most of the kinetic parameters of the model are not available from existing data repositories, so to estimate them we reverse engineered the network by implementing a Genetic Algorithm (GA) on a parallel computational platform. If the set of unknown values, here the kinetic parameters {K(i,j)}, is considered to be the "genome" of the system, the GA exploits the laws of natural selection to generate the "genome" that best fits the given constraints on the system, usually experimental data such as asymptotic concentrations of one or more protein species. Essentially, the GA works by creating a large number Ni of "individuals" (i.e. replicas of the network), each containing a different "genome", and "mating" them in couples once per generation, thus crossing over and mixing the genomes: the best individuals are selected for the next generation cycle, pushing the sets {K(i,j)} towards values that better describe the chosen constraints. Every generation cycle corresponds to performing Ni direct simulations, hence the advantage of parallelization. The implementation is able to reasonably estimate the unknown parameters (Fig. 1).
We plan to expand the network to include other signalling pathways, genetic interactions and the space component.
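The reverse-engineering loop described above can be sketched on a deliberately tiny problem. This is an illustration under stated assumptions, not the authors' implementation: the "network" is a single node obeying dx/dt = ka*(1 - x) - kd*x, the genome is the pair (ka, kd), the constraints are two simulated concentrations (one early, one asymptotic), and the population size, mutation width and parameter bounds are invented. The code runs serially; in a parallel setting the direct simulations of one generation would be distributed.

```python
import random

def simulate(ka, kd, dt=0.02, steps=300, sample_step=25):
    """Euler integration of dx/dt = ka*(1 - x) - kd*x from x = 0.
    Returns the concentration at t = 0.5 and at the end of the run."""
    x, early = 0.0, 0.0
    for step in range(1, steps + 1):
        x += dt * (ka * (1.0 - x) - kd * x)
        if step == sample_step:
            early = x
    return early, x

def error(genome, target):
    """Squared distance between simulated and target concentrations."""
    early, late = simulate(*genome)
    return (early - target[0]) ** 2 + (late - target[1]) ** 2

def evolve(target, pop_size=40, generations=60, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(0.0, 5.0), rng.uniform(0.0, 5.0)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=lambda g: error(g, target))   # one direct simulation each
        history.append(error(pop[0], target))
        parents = pop[: pop_size // 2]             # selection: keep the best half
        children = []
        while len(parents) + len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)      # mating in couples
            child = [rng.choice(pair) for pair in zip(mum, dad)]   # crossover
            child = [min(5.0, max(0.0, g + rng.gauss(0.0, 0.1))) for g in child]
            children.append(child)                 # mutated offspring
        pop = parents + children
    pop.sort(key=lambda g: error(g, target))
    history.append(error(pop[0], target))
    return pop[0], history

# Synthetic "experimental" constraints generated with ka = 2.0, kd = 0.5.
target = simulate(2.0, 0.5)
best, history = evolve(target)
```

Because the best half of each generation is carried over unchanged, the best error recorded in `history` can never increase from one generation to the next.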

An agent based model for the light signal transduction in Neurospora crassa - (session: Novel Algorithms for Bioinformatics)

Paola Ballario° and Marco Pedicini*

° Dip. Genetica e Biologia Molecolare, Università La Sapienza, Roma
* Istituto per le Applicazioni del Calcolo M. Picone, Viale del Policlinico 137, Roma
Paola.ballario@uniroma1.it,  marco@iac.rm.cnr.it


We used the relatively simple multiprotein blue light transduction cascade of a filamentous fungus, Neurospora crassa, as a subject for the testing and development of a formal language able to describe dynamic interactions in molecular biology.

The blue light transduction in N. crassa is controlled by two proteins, White collar-1 and White collar-2 (WC-1 and WC-2), which form a complex (WCC) that functions as a photoreceptor (Ballario et al., Froehlich et al.). WC proteins are involved not only in light perception (through the LOV domain of WC-1 and the associated chromophore) but also in DNA binding and transcriptional activation (through the GATA zinc finger domains of both proteins).

Many light-induced responses are known at the phenotypic and molecular level in N. crassa. One of the most interesting is the circadian rhythm of conidiation, controlled by the protein FRQ, which interacts with WCC. Although the blue light signal transduction of N. crassa is far from being completely elucidated, it seems a suitable model for testing the biological expressivity of the language we are developing.

In particular, we first introduce a map model (Kohn 1999) for the light-regulated signal transduction. We then translate the map into the core molecular language recently introduced by Danos-Laneve, thus providing a process algebra approach to the description of the molecular interactions involved. In this way, we prepare the study of emerging behaviour in the dynamics of the biological system by mathematical and computational tools.

K. Kohn (1999) Molecular Biology of the Cell, 10, 2703-2734

P. Ballario et al. (1996) EMBO J, 15, 1650-1657

A. Froehlich et al. (2002) Science, 297, 815-819


Integration of EnsEMBL with BioAgent - (session: Other)

Ezio Bartocci*, Steffen Möller+, Luca Toldo§, E. Merelli

Dipartimento di Matematica e Informatica, Università di Camerino, +University of Rostock, Proteome Center, Rostock, §Merck KGaA, Pharma Preclinical R&D, Scientific Information Services, Darmstadt


The human, murine and other eukaryotic genomes are presented by the Open Source project EnsEMBL [1,7]. Besides facilitating access through the web or by directly querying the relational database, EnsEMBL may present itself through the BioDAS [2] interface. Moreover, external data sources that conform to BioDAS may be integrated with EnsEMBL seamlessly via the Internet. The interface facilitates the storage and retrieval of arbitrary properties of the genome, referred to as features, which represent the annotation of the genome.
While interacting with EnsEMBL via its web interface, the series of manual interactions with the remote system by the researcher determines the information/experiment flow, and it is by his or her mental capacity that the results of multiple experiments are integrated. The challenges of the post-genome era, i.e. with a huge amount of transcriptomics and proteomics data, demand an increased automation of the data analysis process. However, with the semantics of publications and their interpretation not being transferable to machines, and limited CPU power prohibiting the genome-wide precalculation of algorithms, we seek the directed autonomous execution of in-silico experiments and their communication to and between researchers.
The current bioinformatics approach is to facilitate customised workflows [3,4], which need to be adapted to the respective application, i.e. they are customer and data dependent. Automation starts by employing wrappers and "glue" code that parse the inputs and outputs, exploiting so-called non-mobile agent technology. A consequence of not being mobile is that bioinformatics algorithms are restrained from using long-lasting computations on remote servers. The WWW information-gathering architecture was designed and implemented for interactive browsing of pre-generated data, hence the choice of a connectionless architecture. The WWW paradigm was extended to access computational engines by the "Common Gateway Interface (CGI1.2)" [8]. However, the CGI mechanism can only be exploited if the remote jobs can be performed rapidly; otherwise one must rely on other mechanisms (e.g. SMTP) or on client polling of a predefined URL to return the result. If a bioinformatics algorithm requires a long-lasting job to be executed remotely, it grows considerably in complexity and is hard to fit into a workflow. The data in bioinformatics may be too large to be transferred (e.g. raw images from microarray or proteomics experiments) or unavailable (e.g. the full text of scientific journals); hence immobile code also limits access to data sources. Furthermore, relevant information may have been produced within a collaborative research environment and remain unretrievable through the static web environment.
While EnsEMBL offers a BioDAS interface to the available information sources, it remains difficult to fully exploit them, due to their huge data structure and their dynamic nature. We here describe a first approach to performing the aforementioned directed automated search by delegating it to specialized software, and eventually to improving it further by employing mobile agents. A mobile agent is a computational unit capable of migrating to different places from any location. An agent can behave in an opportunistic and reactive way. Agents do not require the user's presence and can be assigned a task to be carried out over distributed resources [5]. If EnsEMBL is integrated with an agent-based application, one can graphically compare, in the same Contig-View panel, annotations and features extracted from the EnsEMBL database with those from a generic BioDAS source and with those coming from the agent's task. From this integration we obtain two main advantages:
1.    a visual tool to check the result of the agent's task. EnsEMBL puts all BioDAS sources in a graphical format in its contig view, highlighting annotations and features found by the agent in a physical chromosomal position;
2.    a useful way for a bioscientist to compare new results with other BioDAS sources and EnsEMBL annotations.
Furthermore, this integration offers a way to cache computationally costly results.
Current implementations of DAS sources have certain drawbacks:
    No dynamics. Current BioDAS sources are stateless. This is in a way a direct consequence of the demand for instant replies. Primary data is displayed as available in a pre-computed manner, without its interpretation or context-sensitive information, e.g. obtained by investigating prior results from the cache.
    Limited specification of features in the query. The DAS interface facilitates the query for features within a specific chromosomal range or for specific identifiers, and also allows the retrieval of all features of a source. However, EnsEMBL currently does not offer an interface to present results from BioDAS sources on multiple chromosomes. One could think of an application like a BLAST sequence similarity search, the results of which are presented as a DAS source, or of other indirect specifications for a selection of features, which is beyond current implementations.
    Immobility. Moreover, EnsEMBL does not allow the use of remote services within the database access.
In this work we propose the integration of EnsEMBL with the mobile agent system BioAgent [6]. This was performed by wrapping the BioDAS interface around a single agent for visibility within EnsEMBL. Conversely, the agents may contact any DAS server for information, which includes EnsEMBL itself. The combination of EnsEMBL with the agent system BioAgent gives EnsEMBL all the advantages of dynamic data generation and data integration, in particular for computationally costly algorithms that may not be feasible to precompute. We also address an application for in-house knowledge management, as partial results of analyses and their contexts may be communicated between researchers.
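To make the range query mentioned above concrete, the sketch below builds a features request for a chromosomal segment and reads the features back from a response. It is an illustration only: the request layout and the element names follow our reading of the DAS 1 specification [2] and should be checked against it, while the server address, the source name "bioagent" and the sample response are hypothetical and do not describe the actual BioAgent wrapper.

```python
import xml.etree.ElementTree as ET

def features_url(server, source, chromosome, start, stop):
    """Build a DAS-style features request for one chromosomal range."""
    return f"{server}/das/{source}/features?segment={chromosome}:{start},{stop}"

def parse_features(xml_text):
    """Extract id, type, coordinates and strand of every FEATURE element."""
    root = ET.fromstring(xml_text.strip())
    features = []
    for feature in root.iter("FEATURE"):
        features.append({
            "id": feature.get("id"),
            "type": feature.findtext("TYPE"),
            "start": int(feature.findtext("START")),
            "end": int(feature.findtext("END")),
            "strand": feature.findtext("ORIENTATION"),
        })
    return features

# Hypothetical response, as an agent wrapped as a DAS source might return it.
SAMPLE_RESPONSE = """
<DASGFF>
  <GFF version="1.0" href="http://das.example.org/das/bioagent/features">
    <SEGMENT id="7" start="1000" stop="2000">
      <FEATURE id="hit1" label="agent hit 1">
        <TYPE id="similarity">similarity</TYPE>
        <METHOD id="bioagent">bioagent</METHOD>
        <START>1100</START>
        <END>1250</END>
        <ORIENTATION>+</ORIENTATION>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>
"""

url = features_url("http://das.example.org", "bioagent", "7", 1000, 2000)
hits = parse_features(SAMPLE_RESPONSE)
```

Because the request names a single segment, a result set spanning several chromosomes needs one request per segment, which is the limitation noted in the list of drawbacks.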

References
1.    Hubbard, T., D. Barker, et al. (2002). "The Ensembl genome database project." Nucleic Acids Res. 30(1): 38-41.
2.    Stein, L.D., Eddy, S., Dowell, R. (1999-2002) "Distributed Annotation System" http://www.biodas.org/documents/spec.html.
3.    Möller, S., Schroeder, M., Apweiler, R. (2001). "Conflict-resolution for the automated annotation of transmembrane proteins." Comput. Chem. 26(1): 41-46.
4.    Toldo, L., Rippmann, F. (2001) "Method for Determining Nucleic And/Or Amino Acid Sequences" (Pat WO0120024).
5.    Hall, D., Miller, J., Arnold, J., Kochut, K., Sheth, A., and Weise, M. (1999). "Using Workflow to Build an Information Management System for a Geographically Distributed Genome Sequencing Initiative", Genomics of Plants and Fungi, R.A. Prade and H.J. Bohner, Editors.
6.    Merelli, E., Culmone, R. and Mariani, L. (2002) "BioAgent: A Mobile Agent System for Bioscientists", NETTAB02-Agents in Bioinformatics, Bologna.
7.    Clamp M, Andrews D, Barker D, Bevan P, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T, Durbin R, Eyras E, Gilbert J, Hammond M, Hubbard T, Kasprzyk A, Keefe D, Lehvaslaiho H, Iyer V, Melsopp C, Mongin E, Pettett R, Potter S, Rust A, Schmidt E, Searle S, Slater G, Smith J, Spooner W, Stabenau A, Stalker J, Stupka E, Ureta-Vidal A, Vastrik I, Birney E. "Ensembl 2002: accommodating comparative genomics" Nucleic Acids Res. 2003 Jan 1;31(1):38-42.
8.    Coar K, "The Common Gateway Interface" http://cgi-spec.golux.com/.

A protein-DNA recognition code for α-helical transcription factors - (session: Structural Genomics)

Laura Beni, Marco Trerotola, Laura Antolini and Saverio Alberti

Mario Negri Sud


The existence of a correspondence code between DNA bases and transcription factor amino acids is still a matter of debate. A systematic analysis of Zn-finger-DNA 3D interfaces identified DNA-amino acid non-bonded contacts and defined a set of rules for peptide-DNA binding. Key to a clear-cut definition of this 'binding code' have been the selection of specific (bases) vs non-specific (ribose, phosphate) bonds, the structural validation of identified non-bonded contacts and the analysis of a statistically meaningful sample population. Comparison with leucine zippers and helix-turn-helix proteins indicates that these rules extend to other classes of α-helical transcription factors. These rules were validated by a meta-analysis of systematic Zn-finger mutagenesis programs.

Multicanonical methods for protein folding

B. Berg, G. La Penna, V. Minicozzi, S. Morante, G. Rossi

Universita' di Roma Tor Vergata
Via della Ricerca Scientifica 00133 - ROMA


We present a variant of the multicanonical Monte Carlo method that can handle fully flexible chains of bonded monomers, thus opening the way to modelling the presence of a solvent at the fundamental atomic level in the context of folding processes. These results represent an important preliminary step towards simulating folding under realistic conditions.

A new approach to identify genetic networks using microarrays data - (session: Other)

M. F. Blasi, M.Bignami, A.Giuliani, I.Casorelli

Istituto Superiore di Sanità, Roma


BACKGROUND AND AIM: DNA repair mechanisms play a vital role in maintaining genetic integrity, and it is becoming clear that defects in repair pathways are connected to the pathogenesis of secondary AML (s-AML). In particular, defects in mismatch repair (MMR) and in the S-phase checkpoint gene hMRE11 are frequently observed in s-AML (Casorelli et al, DNA Repair 2003, 142:1-13). We are currently studying differential expression levels of a large panel of genes (DNA repair, cell cycle control, cell growth and apoptosis) in de novo APL versus sAPL. The microarray data set will be analysed by both "descriptive" and "simulation" approaches. We present some preliminary data on the simulation approach, based on a neural network model. This analysis might help to distinguish constitutive genetic networks from circuits linked to contingent situations.
MATERIAL AND METHODS: Data were retrieved from published data banks of large-scale microarray studies on different cellular systems (38 human AML and ALL, 76 human primary and metastatic adenocarcinomas, time course of cell cycle analysis in synchronized human cells). The correlation coefficients among a set of 50 genes involved in DNA repair (Rad50, Mre11, Brca1, base excision and mismatch repair genes) and in signaling of DNA damage or cell cycle (ATM, p53, p21, cyclin 1, PML) were fed into a neural network architecture, where the genes represent the nodes and the correlation coefficients the synaptic weights. Through the analysis of the asymptotic behaviour of the network we studied the behaviour of the genes.
RESULTS AND DISCUSSION: In the analysis of the microarray data the DNA repair/signaling genes showed relatively low linear correlations among themselves. However, a very robust behaviour of the networks was observed across different databases. This highlights the role of the nonlinear filter constituted by the neural network in identifying genetic circuits.
Through this analysis it is therefore possible to identify scenarios where the same genes act in a constitutive or in an inducible way. Examples of this behaviour will be presented. The possibility of extracting a sub-network of interacting genes from the unknown universe of the whole genome offers a new tool to quantitatively describe genetic regulatory networks.
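The abstract does not state the update rule of the network, so the following is only a sketch under assumptions: genes are binary nodes (+1 high, -1 low), the symmetric correlation matrix is used directly as the weight matrix with a zero diagonal, and nodes are updated one at a time (a Hopfield-style scheme) until the state stops changing. The four-gene correlation matrix is invented for illustration.

```python
def settle(weights, state, max_sweeps=100):
    """Asynchronous sign updates until a fixed point (asymptotic state)."""
    state = list(state)
    size = len(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(size):
            # Input to gene i: correlation-weighted sum of the other genes.
            field = sum(weights[i][j] * state[j] for j in range(size) if j != i)
            new = state[i] if field == 0 else (1 if field > 0 else -1)
            if new != state[i]:
                state[i] = new
                changed = True
        if not changed:
            break
    return state

# Invented correlations: genes 0 and 1 vary together, as do genes 2 and 3,
# while the two pairs are anti-correlated with each other.
W = [
    [0.0, 0.8, -0.6, -0.5],
    [0.8, 0.0, -0.4, -0.7],
    [-0.6, -0.4, 0.0, 0.9],
    [-0.5, -0.7, 0.9, 0.0],
]
attractor_a = settle(W, [1, -1, -1, -1])
attractor_b = settle(W, [-1, -1, 1, -1])
```

Starting from two different perturbed states, the network settles into two attractors in which each correlated pair of genes ends up in the same state; in this toy setting the attractors play the role of the robust asymptotic behaviour mentioned above.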

A tool for storage, automated annotation and analysis of Conserved Sequence Tags (CSTs) - (session: Comparative Genomics and Molecular Evolution)

A. Boccia*, M. Petrillo^, D. Di Bernardo°, S. Banfi°, A. Guffanti#, G. Pesole+, G. Paolella*^§

*BIOGEM, Ariano Irpino (AV); ^CEINGE, Napoli; °TIGEM, Napoli; #IFOM, Milano; +Università di Milano; §Università del Molise, CB.


Comparative analysis of DNA sequences from multiple species at varying evolutionary distances is a powerful approach to the identification of coding and functional noncoding sequences. Recently TIGEM initiated a large project, which now involves several other Italian institutions including IFOM, CEINGE, BIOGEM and the University of Milano, aimed at the identification, automatic annotation and characterization of conserved sequence tags (CSTs) from about 1000 genes known to be involved in genetically transmitted diseases, through a number of approaches ranging from bioinformatics to laboratory experiments.
Here we report on the development of a database system for the collection, storage and automatic annotation of CSTs, which also includes facilities for interrogation and graphic display in the chromosomal context. DNA regions from orthologous human and mouse genes are identified using the BLASTZ program, and the resulting alignments are processed with the Strong-hits program from the PipMaker package. Conserved regions are stored in the CST database, together with annotation data derived from automatic scanning of ENSEMBL, LocusLink, Gene Ontology and other databases. Information regarding CST mapping on the chromosome sequence, relationships to intron-exon gene structure, conservation related to the taxonomic distribution of genes, similarity to other sequences as found by alignment with various nucleic acid and protein datasets using the blast program, and expression data shown by matches in EST databases are all easily accessed and searched. A simple but effective tool for graphic visualization of CSTs within the gene context is also included, which allows fast browsing along the chromosome. Data from other analysis tools may easily be added; we are currently including information on coding potential, as well as information about general sequence conservation.
Preliminary results of statistical analysis of the CSTs contained in the DB will be reported.

Comparison of different tools for the prediction of protein interaction specificity - (session: Other)

Barbara Brannetti and Manuela Helmer-Citterich

Centre for Molecular Bioinformatics, Dept. of Biology, University of Rome Tor Vergata


We decided to compare a set of procedures for their ability to predict the binding specificity of SH3 protein modules starting from their binding peptide lists. We chose to analyze: regular expressions, position weight matrices,  position specific scoring matrices (PSSMs) or profiles and the SPOT procedure (Brannetti et al., 2000, Brannetti et al. 2001).
We first measured the ability of these methods to recognize the peptides able to bind a given SH3 domain in the whole database of SH3-binding peptides. Then we wondered if the information contained within the available peptide sequences is good enough to represent the recognition specificity of an SH3 domain, and tried to identify the best way of handling this information in order to obtain a better correlation with interaction data (natural partners) derived from the MINT database of protein interactions.
To this aim, we used the above mentioned techniques to search for the natural binding partners within the SWISSPROT database, for each chosen SH3 domain.
We measured the performance of each method for a set of SH3 domains, and the predictive power of each matrix applying the ROC analysis. The results on the SWISSPROT database show that all the methods but the regular expressions perform rather well. Details about the different performances will be discussed in the poster.
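As an illustration of one of the compared procedures, the sketch below builds a log-odds position weight matrix from a list of aligned binding peptides, scores candidate peptides with it, and summarises the separation between binders and non-binders by the area under the ROC curve. It is a minimal example under assumptions: the peptides are invented and are not taken from the SH3-binding peptide database or from MINT, the background is uniform over the 20 amino acids, and a pseudocount of 1 is used.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pwm(peptides, pseudocount=1.0):
    """Log-odds matrix (one dict per position) from equal-length peptides."""
    background = 1.0 / len(AMINO_ACIDS)
    pwm = []
    for pos in range(len(peptides[0])):
        counts = {a: pseudocount for a in AMINO_ACIDS}
        for peptide in peptides:
            counts[peptide[pos]] += 1
        total = sum(counts.values())
        pwm.append({a: math.log((counts[a] / total) / background)
                    for a in AMINO_ACIDS})
    return pwm

def score(pwm, peptide):
    """Sum of the position-specific log-odds of a peptide."""
    return sum(column[residue] for column, residue in zip(pwm, peptide))

def roc_auc(positive_scores, negative_scores):
    """Probability that a random positive outscores a random negative."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in positive_scores for n in negative_scores)
    return wins / (len(positive_scores) * len(negative_scores))

# Invented training binders and an invented test set.
binders = ["RPLPPLP", "RALPPLP", "RPLPALP", "KPLPPLP"]
test_positives = ["RPLPPRP", "RSLPPLP"]
test_negatives = ["GSTAGSA", "DEDEDED", "LLLLLLL"]

pwm = build_pwm(binders)
auc = roc_auc([score(pwm, p) for p in test_positives],
              [score(pwm, p) for p in test_negatives])
```

The same `roc_auc` function can be applied to the scores produced by any of the other procedures, which is what makes the ROC analysis a common yardstick for the comparison.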
Symmetry properties of pentamer usage in non-coding DNA - (session: Other)

Emanuele Bultrini, Elisabetta Pizzi, Paolo Del Giudice, Clara Frontali

Istituto Superiore di Sanità, Roma


A set of pentanucleotides characterising non-coding regions of a specific genome can be extracted from introns using Principal Component Analysis; this set of words discriminates between introns, their randomised counterparts and exons [1]. The procedure was applied to sequences from different species, such as C. elegans and D. melanogaster. A genome-wide analysis revealed that the introns' vocabulary usage is also typical of many intergenic portions, constituting a sort of "background" of each genome, and makes it possible to segment intergenic sequences into intron-like and non-intron-like regions, the latter of which possibly contain functional elements.
The vocabulary is consistently characterised by a symmetry property: it is almost entirely composed of reverse-complementary oligos, and the level of symmetry for intron-like sequences is greater than would be expected from Chargaff's second parity rule. Symmetry of DNA sequences is usually observed on a large scale, mainly for statistical reasons [2], while on a smaller scale very different results are obtained for introns, which show much higher levels of symmetry than exons. It has been shown that randomised sequences are more symmetrical than the real ones when a suitable symmetry measure is adopted [3], as we also checked in our data sets. However, we show in the present work that intron and intron-like sequences have high levels of symmetry that do not arise equally from all pairs of reverse-complementary pentamers, but are mainly due to the small set of vocabulary words. (Poster only)

[1] Bultrini, E., Pizzi, E., Del Giudice, P., Frontali, C., 2003. Pentamer vocabularies characterizing intron and intron-like intergenic tracts from Caenorhabditis elegans and Drosophila melanogaster. Gene 304, 183-192.
[2] Prabhu, V. V., 1993. Symmetry observations in long nucleotide sequences. Nucleic Acids Res. 21, 2797-2800.
[3] Baisnée, P.-F., Hampson, S., Baldi, P., 2002. Why are complementary DNA strands symmetric? Bioinformatics 18, 1021-1033.
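
The notion of reverse-complement symmetry in pentamer usage discussed in this abstract can be sketched as follows. The symmetry index used here is a simple illustrative measure chosen for the example; it is not the measure of reference [3], nor necessarily the one used by the authors.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Return the reverse complement of a DNA word."""
    return word.translate(COMPLEMENT)[::-1]


def kmer_counts(sequence, k=5):
    """Count overlapping k-mers, skipping words with non-ACGT symbols."""
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        word = sequence[i:i + k]
        if set(word) <= set("ACGT"):
            counts[word] += 1
    return counts


def symmetry_index(counts):
    """Illustrative index: 1 when every word is as frequent as its
    reverse complement, 0 when no reverse complement is ever observed."""
    words = set(counts) | {reverse_complement(w) for w in counts}
    if not words:
        raise ValueError("no k-mers to compare")
    difference = sum(abs(counts[w] - counts[reverse_complement(w)]) for w in words)
    total = sum(counts[w] + counts[reverse_complement(w)] for w in words)
    return 1.0 - difference / total
```

For example, `symmetry_index(kmer_counts("AAAAATTTTT"))` is 1.0, because each pentamer occurs as often as its reverse complement, whereas a poly-A tract gives 0.0.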


A Combination of Support Vector Machines and Bidirectional Recurrent Neural Networks for Protein Secondary Structure Prediction - (session: Structural Genomics)

Alessio Ceroni, Alessandro Vullo, Paolo Frasconi

Università di Firenze


We present a number of algorithms for improving the prediction of protein secondary structure. Our baseline predictor is based on support vector machines (SVM). Three separate classifiers are trained to discriminate helices, beta sheets, and coils from the rest, respectively. The margins of the three classifiers are subsequently transformed into conditional probabilities by a multivariate normalized exponential function (softmax) whose parameters are estimated by maximum likelihood. We propose two different approaches to improve prediction accuracy. First, we train a bidirectional recurrent neural network (BRNN) as a structure-structure filter, i.e. we use the probabilistic SVM predictions as inputs and let the recurrent network exploit upstream and downstream contextual information to refine predictions. Second, we use a Viterbi decoder (VD) controlled by a finite automaton that encodes prior knowledge on the minimum length of helices and beta sheets.
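
A minimal sketch of the softmax step is given below. It shows only the transformation of three margins into probabilities; the single `scale` parameter and the per-class `bias` are hypothetical stand-ins for the parameters that the abstract says are estimated by maximum likelihood, and no fitting is performed here.

```python
import math


def softmax_probabilities(margins, scale=1.0, bias=None):
    """Map per-class SVM margins to probabilities that sum to one.

    `scale` and `bias` are illustrative parameters (assumptions);
    in practice they would be fitted on held-out data.
    """
    classes = list(margins)
    if bias is None:
        bias = {c: 0.0 for c in classes}
    logits = {c: scale * margins[c] + bias[c] for c in classes}
    top = max(logits.values())  # subtract the maximum for numerical stability
    exps = {c: math.exp(value - top) for c, value in logits.items()}
    norm = sum(exps.values())
    return {c: e / norm for c, e in exps.items()}


# Hypothetical margins for one residue: helix, strand, coil.
example = softmax_probabilities({"H": 1.2, "E": -0.4, "C": 0.3})
```

The resulting per-residue probability vectors are the kind of input that a downstream filter, such as the BRNN described above, can refine.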

We validated our methods on a set of 979 proteins from PDB Select, defining a random split with 490 sequences for training and 326 for accuracy estimation. Multiple alignment profiles were generated by running PSI-BLAST on a non-redundant set of protein chains. The SVM classifiers used a Gaussian kernel, and kernel hyperparameters were estimated using the remaining 163 sequences as a validation set.

Using VD on top of SVM yields a 1.7% relative error reduction on Q3 and a 13% relative error reduction on segment overlap (SOV). Using the BRNN on top of the SVM yields a 5.8% relative error reduction on Q3 and a 13.4% relative error reduction on SOV. Finally, the VD on top of the combination of SVM and BRNN yields a further 3.4% relative error reduction on SOV, while Q3 remains similar. The advantage of the BRNN in the combination is particularly evident for the prediction of beta sheets, which improves from 60.4% to 65.7% (a 13.3% relative error reduction). Our current best system achieves Q3=78% and SOV=74%.
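
For readers unfamiliar with the metric, relative error reduction can be computed from two accuracy values as sketched below. The function name is ours; the formula is the usual definition (reduction in error divided by the initial error).

```python
def relative_error_reduction(accuracy_before, accuracy_after):
    """Percentage of the initial error removed by an improvement.

    Both accuracies are given in percent.
    """
    error_before = 100.0 - accuracy_before
    error_after = 100.0 - accuracy_after
    return 100.0 * (error_before - error_after) / error_before


# Beta-sheet accuracy quoted in the abstract: 60.4% -> 65.7%.
beta_gain = relative_error_reduction(60.4, 65.7)
```

With the beta-sheet figures quoted above, the error falls from 39.6% to 34.3%, which gives a value of about 13.4%, in line with the 13.3% reported; the small difference presumably reflects rounding of the published accuracies.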

CASPITA @ CASP5 - (session: Structural Genomics)

A. Cestaro, S. Tosatto, F. Fogolari, S. Toppo, G. Valle

CRIBI, Univ. of Padova


We describe our participation in the CASP-5 experiment (2002) under the group name "CaspIta".
The international CASP (Critical Assessment of techniques for protein Structure Prediction) experiments aim at establishing the current state of the art in protein structure prediction, identifying what progress has been made, and highlighting where future effort may be most productively focused [1].
We have developed our own suite of programs ranging from secondary structure prediction [2] to fold recognition [3] and modelling [4].
We have submitted both secondary structure predictions and 3D models for all 65 target proteins. Our main interest was in targets with biological background information going beyond the simple target-to-template alignment, e.g. active-site residues known from the literature.
The results for our approach are as follows. For secondary structure prediction we ranked second by SOV and first by Q3. For comparative modelling we ranked among the top 20% (CM-only targets), whereas for fold-recognition targets we ranked among the top 50%. Over 200 groups participated in both categories.
We present some of our better predictions.

[1] CASP5 web site:  http://predictioncenter.llnl.gov/casp5/
[2] Albrecht M. et al., Protein Engineering, in press (2003).
[3] Bindewald E. et al., Protein Engineering, in press (2003).
[4] Tosatto S.C.E. et al., Protein Engineering, 15(4):279-286 (2002).


Comparative MD simulations of wild type and T718A mutant in the DNA-human topoisomerase I complex - (session: Structural Genomics)

G. Chillemi, A. Bruselles, A. Desideri

CASPUR Consorzio Interuniversitario per le Applicazioni del Supercalcolo per Università e Ricerca


Topoisomerase enzymes control the level of supercoiled DNA in cells by transiently breaking one or two DNA strands. Recently, we carried out a Molecular Dynamics (MD) study of a reconstituted human topoisomerase I comprising the core and C-terminal domains, in covalent complex with a 22-base pair duplex DNA. The study provided useful information on the role of water in the protein-DNA recognition process and on the collective domain motions of the enzyme (Chillemi et al., 2001; Chillemi et al., 2003). Eukaryotic topo I is the cellular target of the anti-tumor drug camptothecin (CPT), which reversibly stabilizes the cleavable complex, an intermediate in the enzyme's catalytic cycle. Mutation of Thr718 to Ala produces a lethal phenotype that resembles the effect of CPT by stabilizing the covalent intermediate between topo I and DNA (Fiorani et al., 1999).

Here we present an MD simulation of human topo I comprising the core, the linker and the C-terminal domains (residues 203-765) in complex with a 22-base pair DNA duplex. Both the wild type enzyme and the human topo I T718A mutant were simulated for 2.2 nanoseconds. The simulations show strong modifications in the dynamics of mutated protein regions far from the mutation site and structural rearrangements of the active site residues that can explain the lethal phenotype of the mutant.

G. Chillemi, P. Fiorani, P. Benedetti, A. Desideri 2003. Protein concerted motions in the DNA-human topoisomerase I complex. Nucl. Acids Res., 31: 1525-1535.

G. Chillemi, T. Castrignanò, A. Desideri 2001. Structure and hydration of the DNA- Human Topoisomerase I covalent complex. Biophys. J., 81: 490-500.

Fiorani, P., J.F. Amatruda, A. Silvestri, R.H. Butler, M-A. Bjornsti, and P. Benedetti. 1999. Domain Interactions Affecting Human DNA Topoisomerase I Catalysis and Camptothecin Sensitivity. Mol. Pharmacol., 56:1105-1115.


Protein Structure and genomic composition - (session: Structural Genomics)

Maria Luisa Chiusano

Dip. Genetica, Biologia Generale e Molecolare, Napoli


Composition is one of the components of the structural organization of genomes. In particular, vertebrate genomes show compositionally homogeneous DNA segments known as isochores (Bernardi, 2000). The evidence relating the GC content of a gene to the properties of the encoded protein, in terms of amino acid content, implies that coding region composition may also affect protein structure properties. The study of the relationships between the compositional features of coding regions and the structures of the encoded proteins, especially in the light of the typical compositional properties of secondary structures (Chiusano et al., 2000), highlights previously unrevealed features. Fluctuations of the synonymous and non-synonymous substitution rates of mammalian genes were found to correlate with the secondary structure (alpha-helix, aperiodic, beta-strand) of the encoded proteins (Chiusano et al., 1999). Moreover, specific nucleotide compositions were observed in the three codon positions corresponding to a given protein secondary structure, with strong implications for the origin of the genetic code organisation. Different data sets were considered in order to analyse the relationships between coding region compositions and the properties of the encoded proteins.

Bernardi G., (2000). The compositional evolution of vertebrate genomes. Gene 259: 31-43.

Chiusano M.L., D'Onofrio G., Alvarez-Valin F., Jabbari K., Colonna G., Bernardi G. (1999). Correlations of nucleotide substitution rates and base composition of mammalian coding sequences with protein structure. Gene 238: 23-31.

Chiusano M.L., Alvarez-Valin F., Di Giulio M., D'Onofrio G., Ammirato G., Colonna G. and Bernardi G. (2000). Second codon positions of genes and the secondary structures of proteins: implications for the origin of the genetic code. Gene 261: 63-69.


Structure organization of the innexin family: integration of computational methods and molecular data - (session: Comparative Genomics and Molecular Evolution)

Chiusano Maria Luisa, Potenza Nicoletta, Del Gaudio Rosanna, Giuseppina Maria Rosaria Russo, Di Giaimo Rossella, Mondola Tiziana and Geraci Giuseppe

Dip. Genetica, Biologia Generale e Molecolare, Napoli


Innexins are a family of membrane proteins involved in the formation of gap junctions in invertebrates. They have been found to participate in several aspects of cell differentiation and in the production of embryonic structures through the formation of specific intercellular channels. These proteins appear to be ubiquitous because, since their discovery in D. melanogaster (Lipshitz and Kankel, 1985) and in C. elegans (Starich et al., 1993), it has been shown that they are present in the mollusc Clione limacina and in the flatworm Girardia tigrina (Panchin et al., 2000), in the annelid Hirudo medicinalis (Alexopoulos et al., 2000) and in the polychaete annelid worm Chaetopterus variopedatus (Potenza et al., 2002). Moreover, several genes encode proteins of this family in each species (Curtin et al., 1999). As an example, the genome of D. melanogaster encodes at least ten different proteins (Stebbings et al., 2002), while C. elegans encodes twenty-five different innexins. Moreover, these proteins are not interchangeable in their function (Curtin et al., 2002).
We present here a computational analysis of this family of proteins. Genes belonging to the innexin family were collected from GenBank and their structures were aligned with the corresponding protein sequences and with the information derived from predictive methods to localize transmembrane regions. A phylogenetic analysis of all known innexins was performed on the protein sequences, resulting in a tree where insect and other invertebrate innexins form clusters distinct from those of the nematode. While the comparative analysis of the proteins shows similarity at the level of the structural organization and in the clustering of data from the same species, there is high heterogeneity at the amino acid level and at the level of the gene structures. These differences are evident even for genes that are in close contiguity on the same chromosome.
The computational methodology reported, consisting in the comparison between different types of structural data, reveals unexpected features also when applied to the study of the organization of multigene families (Chiusano et al., 1999; Chiusano et al., 2000).

1) Lipshitz HD, Kankel DR (1985) Specificity of gene action during central nervous system development in Drosophila melanogaster: analysis of the lethal (1) optic ganglion reduced locus. Dev Biol 108(1): 56-77
2) Starich TA, Herman RK, Shaw JE (1993) Molecular and genetic analysis of unc-7, a Caenorhabditis elegans gene required for coordinated locomotion. Genetics 133: 527-541
3) Alexopoulos H, Dykes IM, Bacon JP, Davies JA (2000) Novel innexins in snails and leeches. Eur J Neurosci 12 (Suppl 11): 15
4) Potenza N, del Gaudio R, Rivieccio L, Russo GMR, Geraci G (2002) Cloning and molecular characterization of the first innexin of the phylum Annelida. Expression of the gene during development. J Mol Evol 54: 312-321
5) Curtin KD, Zhang Z, Wyman RJ (1999). Drosophila has several genes for gap junction proteins. Gene 232(2):191-201.
6) Stebbings LA, Todman MG, Phillips R, Greer CE, Tam J, Phelan P, Jacobs K, Bacon JP, Davies JA (2002) Gap junctions in Drosophila: developmental expression of the entire innexin gene family. Mechanisms of Development 113: 197-205
7) Curtin KD, Zhang Z, Wyman RJ, (2002). Gap junction proteins are not interchangeable in development of neural function in the Drosophila visual system. J Cell Sci 115:3379-88
8) Chiusano M.L., D'Onofrio G., Alvarez-Valin F., Jabbari K., Colonna G., Bernardi G. (1999). Correlations of Nucleotide Substitution Rates and Base Composition of Mammalian Coding Sequences with Protein Structure. Gene 238, 23-31.
9) Chiusano M.L., Alvarez-Valin F., Di Giulio M., D'Onofrio G., Ammirato G., Colonna G. and Bernardi G. (2000). Second codon position of genes and the secondary structures of proteins: relationships and implications for the origin of the genetic code. Gene 261, 63-69


Molecular recognition in helix-loop-helix and helix-loop-helix-leucine zipper domains: Design of repertoires and selection of high affinity ligands for natural proteins - (session: Structural Genomics)

Ciarapica R, Rosati J, Cesareni G, Nasi S.

Istituto di Biologia e Patologia Molecolari CNR, Università La Sapienza, Roma 00185.


Helix-loop-helix (HLH) and helix-loop-helix-leucine zipper (HLHZip) are dimerization domains that mediate selective pairing among members of a large transcription factor family involved in cell fate determination. To investigate the molecular rules underlying recognition specificity and to isolate molecules interfering with cell proliferation and differentiation control, we assembled two molecular repertoires obtained by directed randomization of the binding surface in these two domains. For this strategy we selected the Heb HLH and Max Zip regions as molecular scaffolds for the randomization process and displayed the two resulting molecular repertoires on lambda phage capsids. By affinity selection, many domains were isolated that bound to the proteins Mad, Rox, MyoD and Id2 with different levels of affinity. Although several residues along an extended surface within each domain appeared to contribute to dimerization, some key residues critically involved in molecular recognition could be identified. Furthermore, a number of charged residues appeared to act as switch points that facilitate partner exchange. By successfully selecting ligands for four out of four HLH or HLHZip proteins, we have shown that the repertoires we assembled are rather general and possibly contain elements that bind with sufficient affinity to any natural HLH or HLHZip molecule. Thus they represent a valuable source of ligands that could be used as reagents for the molecular dissection of functional regulatory pathways.

Inferring Novel Functional Links between Different Metabolic Pathways from Genomic Associations - (session: Comparative Genomics and Molecular Evolution)

Francesca Ciccarelli, Christian von Mering, Peer Bork

EMBL, Heidelberg, Germany


Representations of metabolic pathways are usually based on information derived from biochemical and physiological studies, which mainly focus on the direct interactions between metabolites but tend to miss wider-ranging connections. New and intriguing associations between apparently unrelated proteins can now be inferred from comparative analyses of the localization and distribution of the corresponding genes in different genomes. Inferring functional associations between proteins is based on the evidence that, during evolution, functionally associated genes are subjected to similar selection pressure. As a result of this pressure, they tend to have the same species distribution (same phylogenetic profile), to be located in close proximity within the genome, particularly in prokaryotes (gene neighborhood), or to be fused together (gene fusion). STRING is an integrated database that collects all the genomic evidence for functional links between proteins. Additionally, it offers a benchmarked scoring scheme to integrate the different types of association evidence and gives a confidence value for each prediction. Using the STRING collection as the starting database of gene associations, we analyzed all the predicted links between two Clusters of Orthologous Groups (COGs) associated with two unrelated metabolic pathways as mapped in the KEGG database, with a confidence value threshold of 0.700. The dataset thus defined comprises 132 associations, 54 of which are already known in the literature but not reported in the KEGG database, 40 of which can be attributed to a lack of resolution in the orthology detection, and 38 of which are novel and previously undescribed. Interestingly, the 38 binary associations can be organized into 8 main groups, each linking two pathways. From the analysis of these 8 novel connections, new and interesting hypotheses on the interactions and evolution of metabolic pathways can be derived.
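
The selection step described above can be sketched as a simple filter. The data layout, the identifiers and the reading of "unrelated" as "sharing no annotated pathway" are assumptions made for the example; they do not reproduce the STRING or KEGG data formats.

```python
def cross_pathway_links(links, pathways_of, threshold=0.700):
    """Keep links scoring at or above the threshold whose two COGs
    are annotated to pathways with no member in common.

    links: iterable of (cog_a, cog_b, confidence) tuples (hypothetical layout).
    pathways_of: dict mapping a COG identifier to a set of pathway names.
    """
    selected = []
    for cog_a, cog_b, confidence in links:
        paths_a = pathways_of.get(cog_a, set())
        paths_b = pathways_of.get(cog_b, set())
        if (
            confidence >= threshold
            and paths_a
            and paths_b
            and paths_a.isdisjoint(paths_b)
        ):
            selected.append((cog_a, cog_b, confidence))
    return selected


# Placeholder identifiers and scores, invented for illustration only.
toy_pathways = {
    "COG_A": {"pathway 1"},
    "COG_B": {"pathway 2"},
    "COG_C": {"pathway 1"},
}
toy_links = [
    ("COG_A", "COG_B", 0.82),  # different pathways, high confidence: kept
    ("COG_A", "COG_C", 0.95),  # same pathway: dropped
    ("COG_B", "COG_C", 0.40),  # below threshold: dropped
]
toy_selected = cross_pathway_links(toy_links, toy_pathways)
```

Each retained pair would then be checked manually against the literature, as described in the abstract.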

Finding regulatory elements in eucaryotes: a statistical approach using Gene Ontology - (session: Other)

Davide Corà, Paolo Provero, Michele Caselle

Università degli Studi di Torino


The discovery of regulatory elements in eucaryotes is one of the most important challenges of bioinformatics. In this poster we present a method in which sets of genes, generated from the statistical determination of overrepresented motifs, are correlated with the prevalence of Gene Ontology annotation terms.
POSTER ONLY

Coeliac disease: studying the interaction of HLA-DQ2 molecule with gluten peptides by computational methods - (session: Structural Genomics)

Susan Costantini1§, Giovanni Colonna1, Angelo M. Facchiano1,2

1 CRISCEB – Research Center of Computational and Biotechnological Sciences, Second University of Naples
2 Institute of Food Science and Technology, CNR, Avellino
§ PhD fellowship supported by E.U.


Coeliac disease is the most common food-sensitive enteropathy in humans. It is caused by a permanent intolerance to dietary gluten and is considered to be a T cell-mediated multifactorial disease. The large majority of patients express the HLA-DQ2 [DQ(a1*0501,b1*02)] and/or HLA-DQ8 [DQ(a1*03,b1*0302)] molecules. The DQ2 and DQ8 molecules confer susceptibility to coeliac disease by presenting gluten peptides to T cells in the small intestine. The peptide-binding motifs of DQ2 and DQ8 show a preference for negative charges at anchor positions of the bound peptides. The gluten proteins contain few negatively charged residues and are rich in proline and glutamine. Experimentally, it has been found that lesion-derived T cells recognize deamidated gluten peptides and that this deamidation can be mediated in situ by the transglutaminase type 2 enzyme.
In our work, we predicted the three-dimensional structure of the HLA-DQ2 molecule by homology modelling, and that of the complexes of HLA-DQ2 with some gluten peptides by superimposition and energy minimization, on the basis of the experimental structure of the DQ8-insulin B9-23 complex (PDB code: 1JK8). Moreover, we evaluated the interaction energies for each peptide/DQ2 complex. We studied the peptide binding motif for DQ2 and observed that this molecule has a preference for negatively charged residues in the specific anchor positions. In fact, when a single glutamine residue in these anchor positions is replaced with glutamic acid, the peptide-DQ2 complex appears more stable. This finding is in agreement with the experimental results reported in the literature.
We are applying this strategy to evaluate the effects of other peptide modifications as well as the interaction of DQ2 with other peptides.
POSTER ONLY

Sequence analysis with CAPRI, a web-desktop application - (session: Other)

A. Davassi+, M. Petrillo*, G. Paolella+*°

+ DBBM, University of Napoli; * CEINGE, Napoli; ° Dipartimento SAVA, Università del Molise


CAPRI (Common Application Program Remote Interface) is a new interface tool, developed in our laboratory and used at our site to create a single consistent access model to a large number of sequence analysis tools. The tool uses any web browser to accurately reproduce the behaviour of a typical desktop application, where the user sees a sequence or other type of document in a window and chooses from a number of menus the various functions that need to be applied. The menus work exactly as in a standard interactive Mac or Windows application and allow only the options that are relevant to the type of data selected, hiding or disabling the rest. This mechanism makes it possible to present the user with a large number of programs when needed, while keeping the interface clean otherwise. The linked programs are always easily identified, and the original documentation is made readily available.
Document windows may represent DNA pages containing one, two or more sequences. The page automatically recognizes the number of sequences used and changes the menus accordingly, for example activating all pairwise sequence alignment programs only when in two-sequence mode. A protein window behaves similarly, but provides protein analysis tools. A typical dialog box is used to gather additional information from the user when options are available in the underlying program. Standard ‘File’ and ‘Edit’ menus are also implemented but, since CAPRI is a server-based program, files may be loaded or saved either from the local client or from the remote file system. Additional options from the ‘File’ menu allow direct retrieval of data from EMBL or other databases.
The program is based on the concept of a virtual ‘application memory’, kept on the server, which stores all the information relating to the ‘running’ application, including sequence data and user preferences, but is able to access the user directory for data storage and retrieval. All databases and other centrally maintained data are seen by the user as part of the application. Large sequences are only kept on the server, thus allowing analysis of very large ones without delay even from a remote location. The program uses an object model to link external programs as modules. It is able to interface with PISE (C. Letondal, Bioinformatics, 17(1), 2001, pp 73-82), with which it coexists on the same server, sharing the same XML program descriptions and the program definition objects. CAPRI is mainly developed in Perl, and uses client-side JavaScript routines. At the moment CAPRI includes most programs from the EMBOSS package, Blast, FastA and ClustalW.
Current work on the project is aimed at further expanding the number of linked programs, as well as at introducing new page types for dealing with other biological data.

The GMOs Molecular Register: an Integrated Bioinformatic System to support detection/quantification of GMOs - (session: Database: Ontology and Integration)

 D. D'Elia1, P. Leo3, G. Scioscia3, P. Lopriore3, G. Delle Foglie1, F. Licciulli1,  M. Millot4, F. Weighardt4, L. Bonfini4, R. Lorberth5, P. Heinze4, G. Van den Eede4, M. Attimonelli2, H.-J. Buhk5

1) Istituto di Tecnologie Biomediche - CNR, Sezione di Bioinformatica e Genomica, Via Amendola 168/5, 70126 Bari, Italy
2) Dipartimento di Biochimica e Biologia Molecolare, Università di Bari, Via Orabona 4, 70125 Bari, Italy
3) Java Technology Center, IBM SEMEA Sud, via Tridente 42/14, 70125 Bari, Italy
4) European Commission, Joint Research Centre, Institute for Health and Consumer Protection, Biotechnology and GMOs Unit, TP331, I-21020 Ispra (VA), Italy
5) Robert Koch-Institut, Centre for Gene Technology, Nordufer 20, D-13353 Berlin, Germany


In July 2000, the European Network of GMO Laboratories (ENGL) was created; its official inauguration took place in Brussels on December 4, 2003. The development of a Molecular Register (GMOs MOLREG package) has been planned. The Molecular Register will contain data on the molecular characterization of GMOs approved for placing on the market in the EU and the necessary on-line tools to analyse the related sequences. This project was considered of the highest priority, since bioinformatics instruments of this kind are needed to support the ENGL in the detection and characterisation of engineered genetic constructs.
Collaboration agreement contracts for the development of the GMOs MOLREG package have been signed between the Institute for Health and Consumer Protection of the Joint Research Centre and:
-    the Section of Bioinformatics and Genomics of the ITB – Consiglio Nazionale delle Ricerche (CNR), Italy
-    the Robert Koch-Institut (RKI), Zentrum Gentechnologie, Germany
The CNR led the design and development of the GMOs MOLREG package as an easy-to-use web-based system consisting of a database storing GMO data, a GMO data submission component, a query component and a bioinformatic tool for running GMO bio-sequence analysis. The design of the GMOs MOLREG database itself and the development of the web-submission component were carried out by the CNR in collaboration with RKI and the JRC, while to design and develop the GMO query component and the GMO bioinformatics analysis component the CNR established a collaborative agreement with the Life Science Team of IBM Semea Sud, a specialized services unit of IBM in Italy.
The database has been designed and developed to include information on the following aspects: general administrative/legislative information on the registered GMO, specific information on the origin of the GMO including a detailed molecular characterisation, detailed information on available GMO detection methods and certified reference materials, screening, identification, quantification methods and information on reference literature.
The need to run bioinformatics analyses on GMO bio-sequences required the Register to be deeply integrated with bioinformatics analysis tools, using a novel approach based on XML, giving end-users the possibility to analyse molecular data stored in the repository and/or correlate such data with external public biological databases (EMBL, SWISS-PROT, PROSITE, REBASE, TRANSFAC, UNIVEC). To this end, specific bioinformatics algorithms had to be integrated into the GMO repository information infrastructure: some of the analysis programs of the EMBOSS package, as well as BLAST and FASTA.
An important aspect of the integration between the GMOs MOLREG database and the external bioinformatic analysis tools is that a specific XML-based tool description language has been developed in order to generalize the runtime-generated web user interface, both for taking tool input parameters and for showing the tool results. This XML language makes the GMOs MOLREG database independent of the underlying bioinformatics analysis package used. Additionally, a web administrative console that manipulates the XML descriptors was created to manage analysis program updates and to attach new bioinformatics programs as a sort of plug-in.
Strong security features (encryption, profiling authorization and control, certificates, etc.) were also implemented in order to protect the flow of confidential business information during all the steps of the extraction/analysis process.
The system has been developed in Java using the IBM WebSphere Studio Application Developer 4.0.3 environment, and deployed and successfully tested on WebSphere Application Server, Oracle Application Server and Tomcat.
The major benefit of developing a bioinformatic system of this kind is that it significantly reduces the time needed to perform bio-sequence retrieval and analysis, through a user-friendly web query and processing system that implements a workflow logic to help and guide end-users in finding the data and the analyses that best fit their needs. This is of great importance for these time-consuming operations, in which the researcher otherwise has to perform data retrieval and analysis manually, in many steps.
Moreover, the application provides great flexibility and, as the customer requested, requires no programming skills to maintain: the XML interface to the analysis tools, together with the administrative console, makes it possible to follow the frequent updates that affect these tools by automatically changing the XML descriptors, without any additional modification of the Java source.
Acknowledgement: This project has been funded within the grant n.17356-2000-12 F2SC ISP IT signed between CNR (scientific responsible M. Attimonelli) and JRC (scientific responsible G. Van den Eede). Moreover funds from Progetto MURST Cluster C03/2000, CEGB have contributed to the project.

Reverse Engineering Genetic Networks: a computational and experimental approach - (session: Novel Algorithms for Bioinformatics)

D. di Bernardo, T.S. Gardner, D. Lorenz, J.J. Collins

TIGEM, Napoli


Genes, proteins and metabolites are organized into extensive networks that enable a cell to respond, adapt and communicate with its environment. The extent and complexity of such networks can hinder attempts to elucidate their structure and function. To address this problem, we have developed an approach that uses systematic transcriptional perturbations to construct a first-order model of a gene and protein regulatory network. We applied this method to a nine-gene subnetwork of the SOS pathway in Escherichia coli and obtained an accurate model of the regulatory interactions. Using the recovered model, we correctly identified the major regulatory genes and the genes that directly mediate Mitomycin C activity in the subnetwork. This approach, which is experimentally and computationally scalable, provides a novel framework for elucidating the functional properties of genetic networks and identifying the mechanisms of action of pharmacological compounds.
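
To make the idea of a first-order model concrete, the sketch below assumes (the abstract does not give the equations) a linear model dx/dt = A x + u, where x holds expression changes, u the applied perturbations and A the unknown interaction matrix. At steady state A x = -u, so with as many independent perturbation experiments as genes, A can be recovered as A = -U X^(-1). The two-gene example and all numbers are invented; a real application would use regression on noisy data rather than an exact inverse.

```python
def inverse_2x2(matrix):
    """Invert a 2x2 matrix given as a list of two rows."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("response matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(left[i][k] * right[k][j] for k in range(len(right)))
         for j in range(len(right[0]))]
        for i in range(len(left))
    ]


def infer_network(responses, perturbations):
    """Recover A from steady-state data, assuming A X = -U.

    responses (X): column j is the steady-state response to experiment j.
    perturbations (U): column j is the perturbation applied in experiment j.
    """
    product = matmul(perturbations, inverse_2x2(responses))
    return [[-value for value in row] for row in product]


# Invented two-gene example: each gene is perturbed once (U = identity).
U = [[1.0, 0.0], [0.0, 1.0]]
X = [[1.0, 0.25], [0.0, 0.5]]
A = infer_network(X, U)
```

In this toy case the recovered matrix has negative diagonal entries (self-degradation) and a single positive off-diagonal entry, i.e. one gene activating the other.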

In silico human genome search and classification of H-ferritin-like genes - (session: Comparative Genomics and Molecular Evolution)

Pasqualina D'Ursi*, Ermanna Rovida*, Paolo Arosio#, Isabella Zanella#

*Istituto di Tecnologie Biomediche, CNR, Via F.lli Cervi 93, 20090 Segrate (Mi), Italy
#Dipartimento Materno Infantile e Tecnologie Biomediche, University of Brescia, Viale Europa 11, 25123 Brescia, Italy


A similarity search of the EST database with H-ferritin cDNA recently led to the identification of an intronless gene encoding a protein similar to H-ferritin (FTH) but with a long N-terminal extension for mitochondrial export. The mature form of this mitochondrial ferritin is about 80% identical to cytosolic H-ferritin (Levi S, Corsi B, Bosisio M, Invernizzi R, Volz A, Sanford D, Arosio P, Drysdale J. A human mitochondrial ferritin encoded by an intronless gene. J Biol Chem 2001 Jul 6;276(27):24437-40).
This finding raises the possibility that other H-ferritin-like DNA sequences might encode functional genes. To test this, we performed systematic in silico studies on the human genome. BLAST analysis using the full H-ferritin cDNA sequence identified 29 DNA fragments with E-values < 0.0001, used as a threshold. After sequence alignment, they could be separated into three groups. The first includes 14 sequences that overlap more than 70% of the full FTH cDNA and belong to the category of processed pseudogenes. They have > 88% identity to the query and show few gaps or substitutions, which in most cases disable the expression of a functional protein. However, two of them have potential ORFs encoding protein sequences of the same size and highly homologous to FTH (8 and 11 substitutions, respectively), and one has a potential ORF encoding a longer sequence with an N-terminal extension. None of these sequences was represented in the EST database, and all showed polyA stretches at the 3' end and/or repeat flanking regions. This indicates that they represent non-functional pseudogenes. The second group, of 7 sequences, overlaps 20-60% of the FTH cDNA with identity between 80 and 85%. All potential ORFs carried disabling mutations and none of them was represented in the EST database. We concluded that they represent non-functional pseudogenic fragments. The third group was composed of sequences that overlapped about 50% of the FTH cDNA with 70-80% identity. It included the previously described mitochondrial ferritin MtF, on chromosome 5, and five sequences on chromosome X. One of them (FTHL17) was already described and found to be expressed in spermatogonia; it encodes a peptide of 183 amino acids (Wang PJ, McCarrey JR, Yang F, Page DC. An abundance of X-linked genes expressed in spermatogonia. Nat Genet 2001 Apr;27(4):422-6).
The other sequences are located in close proximity; one encodes a peptide of 158 residues and the others are characterized by N-terminal extensions of 30-70 amino acids. Interestingly, in none of them are the residues of the ferroxidase centre fully conserved. They have high similarity to sequences present in the EST database, lack polyA stretches and repeat flanking regions, and might be functional genes. Some of these DNAs have been cloned in expression vectors, and work is in progress to study the structure and expression of the corresponding proteins.
Eukaryotic linear motifs in the ELM web tool - (session: Database: Ontology and integration)

Pål Puntervoll2, Rune Linding1, Christine Gemünd1, Sophie Chabanis-Davidson1, Morten Mattingsdal2, Scott Cameron3, David M. A. Martin3, Gabriele Ausiello4, Barbara Brannetti4, Anna Costantini4, Fabrizio Ferrè4, Vincenza Maselli4, Allegra Via4, Gianni Cesareni4, Francesca Diella5, Giulio Superti-Furga5, Lucjan Wyrwicz6, Chenna Ramu1, Caroline McGuigan1, Rambabu Gudavalli1, Ivica Letunic1, Peer Bork1, Leszek Rychlewski6, Bernhard Kuster5, Manuela Helmer-Citterich4, William N. Hunter3, Rein Aasland2 and Toby J. Gibson1

1European Molecular Biology Laboratory
2Department of Molecular Biology, University of Bergen, Norway
3Division of Biological Chemistry and Molecular Microbiology, University of Dundee, UK
4Centre for Molecular Bioinformatics, Department of Biology, University of Rome Tor Vergata, Rome, Italy
5Cellzome AG, Heidelberg, Germany
6BioInfoBank Institute, Poznan, Poland


Reflecting the modular nature of eukaryotic proteins, several WWW servers (e.g. PFAM, SMART, PROSITE) are dedicated to revealing domains in protein sequences. However, no resource focuses specifically on short functional motifs (targeting peptides, docking modules, glycosylation sites, phosphorylation sites, etc.), yet these modules are just as important for function as the larger protein domains.
Domains are identified by conventional methods, such as patterns (regular expressions), profiles or hidden Markov models (HMMs). But statistically robust methods cannot usually be applied to small motifs, while pattern-based methods over-predict enormously, so that the few true motifs are lost amongst the many false positives.
ELM (Eukaryotic Linear Motifs - http://elm.eu.org) is a new web-based tool for the prediction of these small motifs in eukaryotic protein sequences. At the moment, the ELM database contains manually curated information about more than 80 known linear motifs in the form of regular expressions, profiles or hidden Markov models that identify the motifs in the sequence. ELM addresses the over-prediction of other methods by using context-based rules and logical filters that exclude false positives. Filters work by comparing the information on the motifs stored in the database (taxonomic, structural and cellular context) with the information submitted by the user together with the sequence. Structural filters work by automatically modelling the submitted protein sequence, whenever a good template is found in the PDB, and comparing different parameters (such as the predicted solvent accessibility, temperature factors and secondary structure) with the values associated with the ELM, which are stored in the database.
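The combination of a pattern scan with a context filter can be illustrated with a minimal Python sketch. It is not the ELM implementation: the `Motif` record, its field names and the context labels are hypothetical, and the single pattern used is the classical N-glycosylation sequon (N, any residue except P, then S or T), chosen only because it is a well-known short motif.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Motif:
    """Hypothetical motif record: a regular expression plus its allowed context."""
    name: str
    pattern: str
    taxa: frozenset
    compartments: frozenset


# Illustrative entry only; the context annotations are assumptions.
MOTIFS = [
    Motif(
        name="N-glycosylation sequon",
        pattern=r"N[^P][ST]",
        taxa=frozenset({"Eukaryota"}),
        compartments=frozenset({"extracellular", "endoplasmic reticulum"}),
    ),
]


def scan(sequence, taxon, compartment, motifs=MOTIFS):
    """Return (motif name, 1-based start, matched text) for motifs that pass the filters."""
    hits = []
    for motif in motifs:
        # Context filter: skip motifs that make no sense for the submitted context.
        if taxon not in motif.taxa or compartment not in motif.compartments:
            continue
        # A lookahead is used so that overlapping matches are also reported.
        for match in re.finditer(f"(?=({motif.pattern}))", sequence):
            hits.append((motif.name, match.start() + 1, match.group(1)))
    return hits
```

In this sketch the same sequence gives two candidate sites when submitted as an extracellular eukaryotic protein and none when submitted as a nuclear one, which is the kind of false-positive removal the filters are meant to provide.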

Plan for a National Infrastructure in Bioinformatics - (session: Other)

A. Emerson1, S. Liuni2, T.Castrignano'3,  E. Rossi1

1 - CINECA-Bologna, 2- CNR-Bari, 3- CASPUR- Roma


This presentation describes a new activity recently started within a FIRB project funded by the Italian Ministry for Research.
The main objective is to design and implement a shared structure for collaborative work and resource access, aimed at supporting researchers in the fields of structural and functional genomics and proteomics.
It will be based on tools for biocomputing and for the analysis of nucleic acid and protein sequences and structures.
The infrastructure is intended to be logically distributed, so as to integrate several existing structures located across the country, add new computing and storage facilities, link the collaborating laboratories and remain compatible with other European and international bioinformatics infrastructures.

Active Sequences Collection (ASC) and a new strategy to identify protein functions - (session: Novel Algorithms for Bioinformatics)

Facchiano Angelo (*), Facchiano Antonio (+), Facchiano Francesco (+)

(*) Istituto di Scienze dell'Alimentazione, CNR, via Roma 52A/C, 83100 Avellino, Italy - angelo.facchiano@isa.cnr.it
(+)  IDI, Istituto Dermopatico dell'Immacolata-via Monti di Creta 104, 00167 Roma, Italy.


We have recently published a paper (1) describing the Active Sequences Collection (ASC), a database of short sequences, peptides or protein segments with a demonstrated biological activity. The current version of ASC consists of three sections: DORRS, a collection of active RGD-containing peptides; TRANSIT, a collection of protein regions active as substrates of transglutaminase enzymes (TGase); and BAC, a collection of short peptides with demonstrated biological activity. ASC aims to provide a new strategy for hypothesizing the biological functions of a protein. Biological activity signals may be identified by analysis of protein families, and may consist of conserved sequence segments, sequence patterns, or conserved but non-contiguous amino acids. When the active region is a very short sequence, it may be difficult to find it by searching large protein databases, which return large outputs without specific notes about functional regions. The ASC database collects only biologically active segments, so searching a protein sequence against ASC may offer advantages in the identification of potential biologically active regions.
A public version of the ASC database is available at the web address http://crisceb.unina2.it/ASC/

1) Facchiano AM, Facchiano A, Facchiano F. "Active Sequences Collection (ASC) database: a new tool to assign functions to protein sequences." Nucleic Acids Res 2003 Jan 1;31(1):379-82

A probabilistic analysis of peptide distribution in proteomes - (session: Comparative Genomics and Molecular Evolution)

Luca Ferraro, Vittorio Rosato, Giovanni Giuliano

Centro Ricerche Casaccia, Unita' di Biotecnologie, Roma


We analyzed 28 complete proteomes (12 archaeal, 10 bacterial and 6 eukaryal). We defined the "representation" value Cr of a peptide in a given proteome as the ratio between its measured and expected occurrences (the latter evaluated on the basis of its AA composition). A probability value P(Cr) is then evaluated on the basis of the assumption of a Poissonian distribution for the Cr.
Over-represented peptides with a P < 0.05 in a series of proteomes were selected. BLAST analysis shows that these peptides can be considered "motifs" in the sense that they retrieve, in different proteomes, proteins of similar function.
Data will be shown also on
(a) the dispersion of these peptides in proteomes;
(b) the effects of evolutionary and environmental cues on their distribution;
(c) their coincidence with previously described protein motifs (PROSITE database).
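The representation value and its probability can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the expected count is taken as the product of the residue frequencies of the proteome multiplied by the number of sequence windows, and the probability is computed as the Poisson upper tail on the observed count. All function names are hypothetical.

```python
import math
from collections import Counter


def observed_occurrences(peptide, proteins):
    """Count (possibly overlapping) occurrences of the peptide in all proteins."""
    k = len(peptide)
    return sum(
        1
        for protein in proteins
        for i in range(len(protein) - k + 1)
        if protein[i:i + k] == peptide
    )


def expected_occurrences(peptide, proteins):
    """Expected count from amino acid composition alone (independence assumed)."""
    residues = Counter("".join(proteins))
    total = sum(residues.values())
    probability = math.prod(residues[aa] / total for aa in peptide)
    windows = sum(max(len(p) - len(peptide) + 1, 0) for p in proteins)
    return probability * windows


def poisson_upper_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected)."""
    below = sum(
        math.exp(-expected) * expected ** k / math.factorial(k)
        for k in range(observed)
    )
    return max(0.0, 1.0 - below)


def representation(peptide, proteins):
    """Return (Cr, P) for one peptide in one proteome."""
    obs = observed_occurrences(peptide, proteins)
    exp = expected_occurrences(peptide, proteins)
    cr = obs / exp if exp > 0 else float("inf")
    return cr, poisson_upper_tail(obs, exp)
```

With a toy proteome such as `["AAAA", "CCCC"]`, the peptide `AA` is observed three times against 1.5 expected, so Cr is 2; a peptide would be retained as over-represented only if its tail probability fell below the chosen cut-off (0.05 in the text).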

SURFACE: a web server for annotation of protein functional sites - (session: Structural Genomics)

Fabrizio Ferrè, Gabriele Ausiello, Andreas Zanzoni and Manuela Helmer-Citterich

Centre for Molecular Bioinformatics, Dept. of Biology, University of Rome Tor Vergata, Rome (Italy)


SURFACE (SUrface Residues and Functions Annotated, Compared and Evaluated) is a database of annotated and compared regions of protein surface. SURFACE contains the results of a large-scale protein annotation and local structural comparison project. A non-redundant set of protein chains is used to build a database of protein surface patches, defined as putative surface functional sites. Each patch is annotated with sequence- and structure-derived information about function or interaction abilities. A new procedure for structure comparison is used to perform an all-versus-all comparison of the patches. Selecting the results obtained with stringent parameters yields a similarity score that can be used to associate different patches and may allow reliable annotation by similarity.
Annotation through the comparison of regions of protein surface highlights similarities that cannot be recognized by other methods of sequence or structure comparison.

Amino acid empirical contact energy definitions for fold recognition in the space of contact maps - (session: Structural Genomics)

F. Fogolari, M. Berrera, H. Molinari

Universita' di Verona


Contradictory evidence has been presented in the literature concerning the effectiveness of empirical contact energies for fold recognition. Empirical contact energies are calculated on the basis of information available from selected protein structures, with respect to a defined reference state, according to the quasi-chemical approximation. Protein-solvent interactions are estimated from residue solvent accessibility.

In the approach presented here, contact energies are derived from potential of mean force theory. Several definitions of contact are examined and their performance in fold recognition is evaluated on sets of decoy structures. The best definition of contact is then tested in a more realistic scenario, on all predictions including side chains accepted in the CASP4 experiment. In 30 out of 35 cases the native structure is correctly recognized, and the best predictions are usually found among the 10 lowest-energy predictions.

The definition of contact based on van der Waals radii of alpha carbon and side chain heavy atoms is seen to perform better than other definitions involving only alpha carbons, only beta carbons, all heavy atoms or only backbone atoms. An important prerequisite for the applicability of the approach is that the protein structure under study should not exhibit anomalous solvent accessibility, compared to soluble proteins whose structure is deposited in the Protein Data Bank. The combined evaluation of a solvent accessibility parameter and contact energy allows for an effective gross screening of predictive models.
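As an illustration of how an empirical contact energy can be obtained from observed contacts, the following Python sketch uses the generic log-odds (inverse Boltzmann) form, with energies in units of kT. It is not the definition used in this work: the reference state (random pairing of contact ends) is an assumption, the solvent term is omitted, and the function name is hypothetical.

```python
import math
from collections import Counter


def contact_energies(contacts):
    """Log-odds contact energies from a list of (residue_i, residue_j) contacts.

    Returns a dict mapping an alphabetically sorted residue pair to
    e_ij = -ln(observed_ij / expected_ij), in units of kT.
    """
    pair_counts = Counter(tuple(sorted(pair)) for pair in contacts)
    total = sum(pair_counts.values())

    # Each contact has two ends; count how many ends belong to each residue type.
    ends = Counter()
    for (a, b), n in pair_counts.items():
        ends[a] += n
        ends[b] += n
    fraction = {res: count / (2 * total) for res, count in ends.items()}

    energies = {}
    for (a, b), observed in pair_counts.items():
        # Reference state: ends paired at random; unlike pairs can form in two ways.
        multiplicity = 1 if a == b else 2
        expected = total * fraction[a] * fraction[b] * multiplicity
        energies[(a, b)] = -math.log(observed / expected)
    return energies
```

Pairs seen more often than expected under the reference state receive a negative (favourable) energy, and pairs seen less often a positive one; the choice of contact definition discussed above changes which pairs are counted, not the form of the expression.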
Native loop conformation recognition by MM/PBSA energy calculation - (session: Structural Genomics)

F. Fogolari, S. Tosatto, A. Cestaro, G. Valle, H. Molinari

Universita' di Verona


The fold prediction problem can be divided into two parts: the generation of alternative conformations and the estimation of the stability of every available structure. We show that the latter task can be accomplished by a hybrid molecular mechanics/Poisson-Boltzmann solvent accessibility (MM/PBSA) method.
The free energy corresponding to each alternative conformation (involving in the present study only limited loop regions of the protein) has been estimated using the MM/PBSA (molecular mechanics/Poisson Boltzmann Solvent Accessible surface area) methodology (see e.g. refs. 1-3).
In this approach only solute degrees of freedom are explicitly represented, so that the potential of mean force $W$ is written as the sum of a solute energy term $U(r_1, r_2, \ldots, r_n)$, estimated after minimization using the classical force field CHARMM (v. 27b2), and a solvation free energy term, which can be further split into a polar (electrostatic) and a non-polar (hydrophobic) term:
$$W = U(r_1, r_2, \ldots, r_n) + \Delta G^{\text{polar}} + \Delta G^{\text{non-polar}}$$
$\Delta G^{\text{polar}}$ has been computed according to the Poisson-Boltzmann theoretical framework (4-5) as the difference in free energy for the hypothetical charging process of the solute in vacuo and in ionic solvent. $\Delta G^{\text{non-polar}}$ is taken to be proportional to the solvent accessible surface area $A$, i.e. $\Delta G^{\text{non-polar}} = \gamma A$ (6).
The potential of mean force may be used to estimate the free energy of each conformation, except for the entropic part associated with solute degrees of freedom. This choice is roughly equivalent to assuming that the entropy of each conformation is the same.
In analysing the results obtained on a database of 726 loops of the d1.3 antibody (available from http://dd.stanford.edu/), only low-energy conformations should be taken into account, because many conformations have poor local geometry and therefore a high MM energy.
When all the minima at increasing RMSDs are considered, a very strong correlation between energy and RMSD from the native structure is found, and native-like structures are clearly recognized. Our results indicate that MM/PBSA may play a relevant role in protein structure prediction and refinement.
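The scoring step can be sketched in Python once the three terms are available for each conformation. The sketch only combines precomputed numbers: the minimized molecular mechanics energy, the Poisson-Boltzmann polar term and the solvent accessible surface area are assumed to come from external programs, and the value of the surface tension coefficient is a placeholder, not the one used in this study. The class and function names are hypothetical.

```python
from dataclasses import dataclass

# Placeholder surface tension coefficient in kcal/(mol*A^2); an assumption.
GAMMA = 0.005


@dataclass
class Conformation:
    name: str
    mm_energy: float   # U, minimized molecular mechanics energy (kcal/mol)
    dg_polar: float    # Poisson-Boltzmann polar solvation term (kcal/mol)
    sasa: float        # solvent accessible surface area A (A^2)


def potential_of_mean_force(conf, gamma=GAMMA):
    """W = U + dG_polar + gamma * A."""
    return conf.mm_energy + conf.dg_polar + gamma * conf.sasa


def rank_by_energy(conformations, gamma=GAMMA):
    """Return the conformations sorted from lowest to highest W."""
    return sorted(conformations, key=lambda c: potential_of_mean_force(c, gamma))
```

Under the equal-entropy assumption stated above, the first element of the ranked list is the conformation predicted to be the most stable.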

1. Fogolari, F., A. Brigo and H. Molinari. 2002. The Poisson-Boltzmann equation for biomolecular electrostatics: a tool for structural biology. J. Mol. Rec. 15:377-392.

2. Baginski, M., F. Fogolari and J. M. Briggs. 1997. Electrostatic and non-electrostatic contributions to the binding free energies of anthracycline antibiotics to DNA. J. Mol. Biol. 274:253-67.

3. Kollman, P. A., I. Massova, C. Reyes, B. Kuhn, S. Huo, L. Chong, M. Lee, T. Lee, Y. Duan, W. Wang, O. Donini, P. Cieplak, J. Srinivasan, D. A. Case and T. E. Cheatham 3rd. 2000. Calculating structures and free energies of complex molecules: combining molecular mechanics and continuum models. Acc. Chem. Res. 33:889-897.

4. Sharp, K. A. and B. Honig. 1990. Calculating total electrostatic energies with the non-linear Poisson-Boltzmann equation. J. Phys. Chem. 94:7684-7692.

5. Fogolari, F. and J. M. Briggs. 1997. On the variational approach to the Poisson-Boltzmann free energies. Chem. Phys. Lett. 281:135-139.

6. Nicholls, A., K. A. Sharp and B. Honig. 1991. Protein folding and association: insights from the interfacial and thermodynamic properties of hydrocarbons. Proteins: Struct. Funct. Genet. 11:281-96.

Bioinformatics within the IASMA grape project: tools for data mining and sequences annotation - (session: Comparative Genomics and Molecular Evolution)

Fontana P.+, Segala C.+, Toppo S.**, Moser C.+, Grando S.+, Valle G.* and Velasco R.+

+ Istituto Agrario di S. Michele all'Adige (TN)
* CRIBI Univ. di Padova
**Dip. Chimica Biologica Univ. Padova


Although grapevine (Vitis vinifera L.) is one of the most economically important and widely cultivated crops, grape biology is relatively poorly understood. Our group is involved in a functional genomics project to discover and determine the function of genes expressed in Vitis vinifera, with the aim of providing molecular insights into essential physiological processes such as photosynthesis, plant defense and the biosynthesis of secondary metabolites. We have based our strategy on sequencing Expressed Sequence Tags (ESTs) obtained from cDNA libraries from different Vitis vinifera tissues such as leaf, root, berry, bud, shoot tips and inflorescence.
Vitis vinifera is phylogenetically distant from the other plants available in public databases. New information about the organization of the Vitis vinifera genome can be derived from a wide-ranging comparison of grape sequences with well-characterized model organisms such as Arabidopsis thaliana and rice.
The functional characterization of ESTs starts with a clustering step to reduce sequence redundancy and proceeds through a similarity search against publicly available annotated sequences. To obtain the most accurate annotation, we compared the predicted secondary structure of the translated sequences with that of proteins of known 3D structure.
On the basis of our annotated dataset, some genes of interest were selected to identify single nucleotide polymorphisms (SNPs), which are then used in marker-assisted selection and functional mapping.
In parallel, spotting the amplified ESTs on membrane filters or glass supports will permit true genome-wide sampling of gene expression patterns. Experiments to compare gene expression profiles in leaves at different developmental stages are under way.
All collected data are stored in a relational database based on an SQL engine. A set of visual bioinformatic tools has been developed to follow all the phases of the project, from the production of cDNA clones to the storage of annotated sequences, microarray profiles and SNP data. Information retrieval is supported by web interfaces that give easy access to the data to users unfamiliar with SQL.

Large contact surface interactions between proteins detected by time series analysis methods: a case study on C-phycocyanins

A. Giuliani, R. Benigni, M. Colafranceschi, I. Chandrashekar , S.M. Cowsik

Istituto Superiore di Sanita' - Roma


A purely sequence-dependent approach to the modeling of protein-protein interaction was applied to the study of C-phycocyanin αβ dimers. The interacting pairs (α and β subunits) share an almost complete structural homology, together with a general lack of sequence superposition; thus they constitute a particularly relevant example for protein-protein interaction prediction. The present analysis is based on a description posited at an intermediate level between sequence and structure, i.e. the hydrophobicity patterning along the chains. By describing the sequence hydrophobicity patterns through a battery of nonlinear tools (Recurrence Quantification Analysis and other sequence complexity descriptors), we were able to generate an explicit equation modeling the interaction of the α and β monomers; the model consisted of the canonical correlation between the hydrophobicity autocorrelation structures of the interacting pairs. The general implications of this holistic approach to the modeling of protein-protein interactions, which considers the protein primary structures as a whole, are discussed.

PatSearch: a program for the detection of patterns and structural motifs in nucleotide sequences - (session: Novel Algorithms for Bioinformatics)

Giorgio Grillo, Flavio Licciulli, Sabino Liuni, Elisabetta Sbisà, Graziano Pesole

Sezione di BARI C.N.R.


Regulation of gene expression at the transcriptional and post-transcriptional levels involves the interaction between short DNA or RNA tracts and the corresponding trans-acting protein factors. Detection of such cis-acting elements in genome-wide screenings may significantly contribute to genome annotation and comparative analysis, as well as to the targeting of functional characterization experiments.
We present here PatSearch, a flexible and fast pattern matcher able to search for specific combinations of oligonucleotide consensus sequences, secondary structure elements and position-weight matrices, also allowing for mismatches/mispairings below a user-defined threshold.
We report three different applications of the program in the search for complex patterns such as the Iron Responsive Element hairpin-loop structure, the p53 Responsive Element, and a promoter module containing CAAT-, TATA- and cap-boxes.
PatSearch is available on the web at http://bighost.area.ba.cnr.it/BIG/PatSearch/
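The simplest ingredient of such a pattern matcher, a consensus search that tolerates a bounded number of mismatches, can be sketched in Python as follows. This is not PatSearch code or syntax: the function name is hypothetical, only IUPAC nucleotide codes are supported, and secondary structure elements and position-weight matrices are left out.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def find_with_mismatches(sequence, consensus, max_mismatches):
    """Return (1-based start, matched text, mismatches) for every acceptable window."""
    sequence = sequence.upper()
    allowed = [IUPAC[symbol] for symbol in consensus.upper()]
    width = len(allowed)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        mismatches = 0
        for base, options in zip(window, allowed):
            if base not in options:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many mismatches: give up on this window early
        else:
            # Reached only when the inner loop did not break.
            hits.append((start + 1, window, mismatches))
    return hits
```

For example, scanning a sequence with the TATA-box-like consensus `TATAWAW` and a threshold of one mismatch reports both exact occurrences and windows that differ at a single position, each with its mismatch count.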

Understanding experimental properties of Cu,Zn SODs through molecular dynamics simulation - (session: Structural Genomics)

M. Falconi, A. Desideri

INFM e Universita' di Roma "Tor Vergata", Dipartimento di Biologia


Understanding protein hydration is a crucial, and often underestimated, issue in unravelling protein function. Molecular dynamics computer simulation has been applied to dimeric Photobacterium leiognathi Cu,Zn superoxide dismutase, comparing the water molecule sites determined using a 1.0 ns molecular dynamics simulation with those detected by X-ray crystallography. 20% of the water molecules detected by the two techniques fall at common sites. Water molecules trapped in the inter-subunit cavity of the dimeric protein, as identified in the crystal structure, display a trajectory mainly confined within the cavity, although characterized by relatively short residence times because they continuously exchange from one site to another within the cavity.
Limited proteolysis by trypsin of monomeric Cu,Zn superoxide dismutase from Escherichia coli induces a specific cleavage of the polypeptide chain at Lys60, located in the S-S subloop of loop 6,5 where, compared with the eukaryotic enzyme, a seven-residue insertion, completely exposed to the solvent, is observed. Molecular dynamics simulation indicates that the S-S subloop undergoes large fluctuations and that its high flexibility, coupled to a high solvent accessibility, can explain the specific bond selection of the protease. In fact, of the 14 possible solvent-accessible proteolytic sites, only the flexible Lys60 site is cleaved. These experiments suggest that molecular dynamics simulation can be used to identify proteolytic sites in proteins.

The Human - Mouse Promoter Machine at IFOM: a tool for retrieval of orthologous promoter sequences from genome sequence data

A.Guffanti1, L.Lassandro1, G.Finocchiaro1,2 & H.Muller1,2

1: IFOM – FIRC Institute of Molecular Oncology. Via Adamello, 16 – 20139 Milano, Italy
2: IEO – European Institute of Oncology – Via Ripamonti, 435 – 20141 Milano, Italy


Gene expression in eukaryotes is a highly coordinated process involving regulation at many different levels. The regulation of transcription initiation is an important, and often rate-limiting, step in this process. Although several types of cis-acting DNA sequence elements contribute to this regulation, the simplest elements to locate may be promoters, as they are located just upstream of transcription start sites. Until recently, most functional studies of promoters were conducted on a gene-by-gene basis, but there have also been recent attempts to identify promoters on a large scale with strictly computational methods (Davuluri et al. Computational identification of promoters and first exons in the human genome. Nat.Genet. 29:412-417).

The DNA sequences of entire genomes are being determined at a rapid rate. The extensively annotated human and mouse genome assemblies are available from the joint Sanger Institute – EBI project “Ensembl” (http://www.ensembl.org) at different levels of access, including a dedicated API for direct access of the remote relational database layer.

Starting from these considerations, we have established a project for automated retrieval of human and mouse orthologous genomic regions upstream of the first exon of annotated genes, starting from generic gene identifiers.
To compile the list of human and mouse genes linked by a relation of homology and possible orthology, at least at the sequence similarity level, we used two NCBI resources: HomoloGene (http://www.ncbi.nlm.nih.gov/HomoloGene/) and LocusLink (http://www.ncbi.nlm.nih.gov/LocusLink/). We have written a set of scripts, using the BioPerl EnsEMBL API, to convert these genes to EnsEMBL identifiers and to retrieve from the human and mouse assemblies the DNA sequence corresponding to the first exon and the 1000 bp upstream of the first exon start site.
After assigning a unique identifier to these sequences, we built a query interface that performs an intermediate conversion through the LocusLink database, mirrored and indexed under the SRS system at IFOM. This database can be queried using a range of generic gene identifiers (LocusLink identifiers, Gene Names or Gene Aliases, Accession Numbers, Pfam domains, Gene Ontology terms, RefSeq Accession Numbers, EnsEMBL identifiers).
It is possible to retrieve human genomic sequences, mouse genomic sequences, and mouse orthologous sequences starting from human identifiers and vice versa. The output consists of DNA sequences in FASTA format, where the region corresponding to the first exon is uppercase and the region corresponding to the 1000 bp of upstream genomic DNA is lowercase.
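The output convention can be illustrated with a short Python sketch that builds such a record from two already retrieved sequences. It does not use the EnsEMBL API and is not the code of the Promoter Machine: the function name, the identifier format and the 60-column line width are assumptions made for the example.

```python
def promoter_fasta(identifier, upstream, first_exon, width=60):
    """Format one record: upstream region in lowercase, first exon in uppercase."""
    sequence = upstream.lower() + first_exon.upper()
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{identifier}\n" + "\n".join(lines) + "\n"


def split_by_case(sequence):
    """Recover (upstream, first_exon) from a record body using the case convention."""
    boundary = next(
        (i for i, base in enumerate(sequence) if base.isupper()),
        len(sequence),
    )
    return sequence[:boundary], sequence[boundary:]
```

Encoding the boundary in the letter case keeps the record a plain FASTA entry that any downstream tool can read, while still allowing the two regions to be separated again.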

We aimed to provide an updated data set of putative promoter regions from the human and mouse assemblies that can be easily queried with generic gene identifiers. This dataset should be useful for researchers interested in in silico promoter work. Future development will include the addition of a set of annotations such as known regulatory elements and cross-species conserved regions; the use of orthology information directly from EnsEMBL; the addition of a graphical output for the annotated regions; the addition of other organisms to the computational pipeline.

The Human<>Mouse promoter machine at IFOM is freely available at web address
http://bio.ifom-firc.it/PROM_MACHINE/index.html


The estimation of relative site variability among aligned homologous protein sequences - (session: Comparative Genomics and Molecular Evolution)

D. S. Horner and G. Pesole

Dipartimento di Fisiologia e Biochimica Generale, Universita di Milano


Maximum likelihood-based methods to estimate site-by-site substitution rate variability in aligned homologous protein sequences rely on the formulation of a phylogenetic tree and generally assume that the patterns of relative variability follow a pre-determined distribution. We present a phylogenetic tree-independent method to estimate the relative variability of individual sites within large datasets of homologous protein sequences. It is based upon two simple assumptions: firstly, that substitutions observed between two closely related sequences are likely, in general, to occur at the most variable sites; secondly, that non-conservative amino acid substitutions tend to occur at more variable sites. Our methodology makes no assumptions regarding the underlying pattern of relative variability between sites.

We have compared, using data simulated under a non-gamma-distributed model, the performance of this approach with that of a maximum likelihood method that assumes gamma-distributed rates. At low mean rates of evolution, our method inferred site-by-site relative substitution rates more accurately than the maximum likelihood approach in the absence of prior assumptions about the relationships between sequences. Our method does not directly account for the effects of mutational saturation; however, we have incorporated an ad hoc modification that allows the accurate estimation of relative site variability in fast-evolving and saturated datasets.

Genome-wide analysis of the sequence region surrounding the transcription start site of human mRNAs - (session: Comparative Genomics and Molecular Evolution)

Michele Iacono, Flavio Mignone  and Graziano Pesole

Università degli Studi di Milano


Gene expression is finely regulated at both the transcriptional and post-transcriptional levels. Transcriptional control is mediated by transcription factors, RNA polymerase and a series of cis-acting elements located in the DNA. The most important cis-elements are located within the Core Promoter Region, in close proximity to the transcription start site (TSS), and are classified as upstream or downstream promoter elements according to their position with respect to the TSS.
One of the main problems in studying the regulation of gene expression is the identification of the motifs that are functionally important in transcriptional regulation, and of the genes each motif regulates.
The recent availability of the draft human genome sequence, as well as of a very large number of full-length transcript sequences, now makes it possible to carry out an extensive and systematic comparative study of the genomic context of the TSS.

We present here a comprehensive sequence analysis performed on a Human Core Promoter (HCP) dataset comprising 3140 sequences. HCP sequences were extracted from the human genome assembly (Release n°30) on the basis of TSS mapping determined by comparison with the DBTSS and RefSeq collections of reference transcripts. Putative cis-elements involved in transcription regulation have been identified through the application of pattern discovery algorithms, and their presence in orthologous mouse genes has also been investigated.
Evolution of gene family in eukaryotes: the BCL-2 gene family - (session: Comparative Genomics and Molecular Evolution)

Cecilia Lanave, Monica Santamaria and Cecilia Saccone

Sez. territoriale,Bari, dell'ITB, Milano, CNR


The members of the Bcl-2 family can be subdivided into anti-apoptotic and pro-apoptotic proteins. A delicate balance between these members exists in each cell, and the regulation of these two groups of proteins determines whether the cell survives or undergoes apoptosis. In mammals, 15 Bcl-2 family members have been identified to date, and other similar members have been found in various eukaryotic organisms. All members possess at least one of the four motifs known as Bcl-2 homology domains (BH1 to BH4). Most pro-survival members of the Bcl-2 family, which can inhibit apoptosis in the face of a wide variety of cytotoxic insults, contain at least the BH1 and BH2 domains; those most similar to Bcl-2 have all four BH domains. All the pro-apoptotic family members possess the BH3 domain, which is the central domain.
Our interest in gene family evolution has focused on cladistic analyses of Bcl-2 gene family members. These proteins differ in their composition with regard to the functional domains BH1, BH2, BH3 and BH4. The analyses were performed both on complete sequences (140 sites analysed) and on single domains. We present the results obtained using both approaches.

RRE & ClAW: two new java tools for microarray data mining - (session: Other)

F Lanzarato, G Iazzetti, E Caserta, M. Botta, G Franceschinis, RA Calogero

Università di Napoli Federico II


The availability of high-throughput technologies such as microarrays creates a need among biologists for new computational tools to investigate the functional implications of differential transcriptional expression.
For this reason, last year we developed MedMOLE, a prototype tool to categorize the literature related to coregulated genes and simplify the tedious work of going through it.
This time we present two tools written in Java:
RRE (Regulative Region Extractor) and ClAW (Clustering Analyser Wrapper).
RRE is a tool for extracting all potential regulative regions from genomic data files. In particular, it uses the GBS or GBK files to identify the gene/CDS annotations and extracts gene upstream regions (default 2000 kb), 5'UTRs, introns, 3'UTRs and gene downstream regions (default 1000 bp) from the corresponding FA files.
The tool extracts the regions described above, in FASTA format, from the NCBI human/mouse/Drosophila genome data in 2-4 hours, depending on the available hardware.
RRE can be linked to an automatic data downloader based on CURL, which allows the data set to be rebuilt any time an update is available at NCBI.
We found this tool very useful for generating the data sets needed to perform genome-wide analyses of transcriptional signals present in regulative regions.
Furthermore, in April a web interface to the data generated by RRE (Human/Mouse/Drosophila), based on SPITFIRE and accessible upon registration, will be available at www.bioinformatica.unito.it.
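The extraction of an upstream region from a gene annotation can be sketched as follows. RRE itself is written in Java and works on GenBank and FASTA files; this Python fragment is only an illustration under simplifying assumptions (the chromosome is already loaded as a string, the gene coordinates are 1-based and inclusive, and the function names are hypothetical).

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Reverse-complement a DNA string, preserving letter case."""
    return sequence.translate(COMPLEMENT)[::-1]


def upstream_region(chromosome, start, end, strand, length):
    """Return up to `length` bases upstream of a gene, read 5'->3' on its own strand.

    `start` and `end` are 1-based inclusive coordinates on the forward strand.
    The region is truncated at the chromosome ends.
    """
    if strand == "+":
        low = max(start - 1 - length, 0)
        return chromosome[low:start - 1]
    if strand == "-":
        # For a reverse-strand gene, upstream lies after `end` on the forward strand.
        high = min(end + length, len(chromosome))
        return reverse_complement(chromosome[end:high])
    raise ValueError("strand must be '+' or '-'")
```

Handling the reverse strand explicitly matters because the upstream region of such a gene sits to the right of its annotated coordinates and must be reverse-complemented before promoter signals are searched.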
ClAW, instead, is a graphical interface to the clustering suite CLUTO that allows gene clustering on the basis of GO Biological Process annotation.
The tool uses LocusLink GO annotation to assign GO terms to a list of LocusLink IDs given by the user, simplifying the use of CLUTO via a graphical interface and producing graphical and textual outputs of the clustering results.
Users can also request that the set of GO terms used for the clustering be integrated with those available for the orthologous genes. Including the GO terms of orthologous genes in the clustering helps to fit poorly annotated genes into clusters. We found ClAW particularly useful for functionally associating differentially expressed genes derived from microarray experiments.
PRIMEX 1.0 and VPCR 2.0: Processing genomic sequence data for efficient and accurate simulation of PCR reactions with genomic DNA as template - (session: Novel Algorithms for Bioinformatics)

Matej Lexa, Ivano Zara, Giorgio Valle

CRIBI, University of Padova


The increased availability of genomic sequence data creates room for bioinformatic tools that use these large datasets in novel applications. We have set out to automate the prediction of PCR reaction products using arbitrary primers and genomic DNA as template. While this may seem a trivial task for a well-designed pair of primers, it becomes much more challenging in a wide range of special situations. Firstly, the search for primer annealing sites becomes prohibitively slow on large genomes with currently available tools. Secondly, a mathematical model of the PCR reaction is required to simulate amplification in cases where primers go through a wide range of states besides the desired binding to the target sequence (nonspecific binding, secondary structure formation), or where several amplification products compete for polymerase activity (multiplex PCR). We present a set of programs that address these problems and rapidly predict the outcome of any PCR reaction. PRIMEX is a tool that can find all relevant primer-binding sites in a single genome in a fraction of a second. VPCR is a set of routines that analyze the output provided by PRIMEX and run a dynamic mathematical model of the PCR amplification process. We will show our first results comparing VPCR output with real PCRs performed in the laboratory.
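To make the problem concrete, the sketch below shows a naive way to locate primer-binding sites with a limited number of mismatches and to pair them into candidate amplicons. It is only a baseline under stated assumptions: the linear scan is exactly the slow approach the abstract says PRIMEX improves on, it does not model the amplification dynamics handled by VPCR, and all names and the `max_len` limit are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sites(template, primer, max_mismatches=0):
    """Start positions where `primer` matches `template` with at most
    `max_mismatches` substitutions (naive scan, no indels)."""
    n = len(primer)
    hits = []
    for i in range(len(template) - n + 1):
        window = template[i:i + n]
        mismatches = sum(1 for a, b in zip(window, primer) if a != b)
        if mismatches <= max_mismatches:
            hits.append(i)
    return hits


def predict_amplicons(template, fwd, rev, max_mismatches=0, max_len=3000):
    """Pair forward and reverse sites into candidate products.

    The reverse primer anneals to the top strand, so its reverse
    complement is what appears in the top-strand text.
    Returns (start, end, sequence) tuples with end exclusive.
    """
    fwd_sites = find_sites(template, fwd, max_mismatches)
    rev_sites = find_sites(template, revcomp(rev), max_mismatches)
    products = []
    for f in fwd_sites:
        for r in rev_sites:
            end = r + len(rev)
            # require the reverse site to lie downstream of the forward primer
            if r >= f + len(fwd) and end - f <= max_len:
                products.append((f, end, template[f:end]))
    return products
```

On a genome-sized template this scan costs time proportional to genome length times primer length for every primer, which is why an indexed search is needed in practice.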

An ENSEMBLE machine learning approach for the prediction of all-alpha membrane proteins - (session: Novel Algorithms for Bioinformatics)

Pier Luigi Martelli, Piero Fariselli, and Rita Casadio

Laboratory of Biocomputing, CIRB/Department of Biology, University of Bologna, via Irnerio 42, 40126 Bologna, Italy


ABSTRACT
Motivation: All-alpha membrane proteins constitute a functionally relevant subset of the whole proteome. Their content ranges from about 10 to 30% of the cell proteins, based on sequence comparison and specific predictive methods. Due to the paucity of membrane proteins solved at atomic resolution, the training/testing sets of predictive methods for protein topography and topology routinely include very few well-solved structures mixed with a hundred proteins known at low resolution. Moreover, available predictors fail in predicting recently crystallised membrane proteins (Chen et al., 2002). Presently the set of well-solved membrane proteins comprises some 59 chains of low sequence homology. It is therefore possible to train/test predictors only with the set of proteins known at atomic resolution and to evaluate more thoroughly the performance of different methods.
Results: We implement a cascade neural network (NN), two different hidden Markov models (HMM), and their ensemble (ENSEMBLE) as a new method. We train and test the three methods and ENSEMBLE in cross-validation on the 59 well-resolved membrane proteins. ENSEMBLE achieves a per-protein accuracy of 90% for topography and 71% for topology, outperforming the best single method by 7 and 5 percentage points, respectively. When tested on a low-resolution set of 151 proteins with no homology to the 59 proteins, the per-protein accuracy of ENSEMBLE is 76% for topography and 68% for topology. Our results also indicate that the performance of ENSEMBLE is higher than that of the best predictors presently available on the Web.
Keywords: all-alpha membrane proteins; neural networks; HMM; protein structure prediction; membrane protein topology.
Contact: gigi@biocomp.unibo.it, www.biocomp.unibo.it


Handling global expression data from multiple microarray platforms

E. Medico, L. D'Alessandro, A. Gentile

Istituto per la Ricerca e la Cura del Cancro (IRCC)
Strada Prov. 142 - 10060 Candiolo (TO)


Concordance between genomic expression profiles obtained on different DNA microarray platforms may be impaired by multiple concomitant factors, including non-homogeneity in data analysis procedures. To address this issue, we explored growth factor-induced transcriptome changes using the same series of mRNA samples with two types of DNA microarrays: spotted cDNAs and high-density oligonucleotides (HDO).
UniGene database matching allowed us to estimate a total of over 25,000 genes explored, of which 6,000 were covered by both platforms, thus providing a wide basis for comparison and systematic cross-validation. A common procedure was specifically developed to minimize differences in data analysis between the two platforms. In particular, regulated genes were identified by a 'tunable' statistical test weighing expression change and duplicate variability across the whole experiment.
Data permutations and extensive false discovery analysis allowed accurate, platform-specific optimization of the test. Cross-validation showed a generally good correlation coefficient between the time-course responses observed on the two platforms. Interestingly, a similar extent of correlation was observed for genes redundantly explored within the same microarray, which provides evidence of substantial homogeneity between the microarray platforms examined.

MINT, a Molecular INTeraction database - (session: Database: Ontology and Integration)

Luisa Montecchi-Palazzi, Andrea Cabibbo, Andreas Zanzoni, Manuela Helmer-Citterich, Gianni Cesareni

Universita' Tor Vergata


MINT is a public protein interaction database focused on the collection of experimentally verified data disseminated in the scientific literature. MINT entries are extracted by expert curators assisted by "MINT Assistant", a software tool that targets abstracts containing interaction information and presents them to the curator in a user-friendly format. Furthermore, MINT aims at being exhaustive in the description of the interaction and, whenever available, information about kinetic and binding constants and about the domains participating in the interaction is included in the entry. All information is collected in a computer-readable form and stored in a web-accessible database where interaction data can be easily extracted and viewed graphically through "MINT Viewer". Presently MINT contains 2098 manually curated interactions, 1516 of which are interactions among mammalian proteins. MINT is accessible at http://cbm.bio.uniroma2.it/mint/. To facilitate the inclusion in MINT of unpublished interaction data, we are starting a new online peer-reviewed journal specifically aimed at the publication of rigorously documented molecular interactions not suitable for standalone publication in other journals. The "MINT journal for Molecular Interactions" will also publish focused reviews on selected molecular interaction networks or pathways. A preliminary version of the MINT journal is available online.

Integration of data from different sources: a prototype devoted to p53 mutations - (session: Database: Ontology and Integration)

A. Mucci°, A. Cusmano°, M. De Francisci^, M. A. Manniello^, D. Marra^, P. Romano^, G. Mauri°

°Università di Milano Bicocca, Dipartimento di Informatica, Sistemistica e Comunicazione - DISCO, Milano
^Istituto Nazionale per la Ricerca sul Cancro - IST, Genova


Oncology Over Internet is a project devoted to the integration of data from different sources of oncological interest. The main focus of the project is on the software architecture, but important improvements for in silico biology research and for some clinical investigations are foreseen.

A prototype has been developed in order to test different technical solutions to access and integration issues and to verify the overall feasibility of the system. The prototype is focussed on the database of mutations of the TP53 gene that is maintained by the International Agency for Research on Cancer (IARC). The TP53 gene expresses the p53 protein, which has important and well-known roles in the control of cancer at its very early steps and in the elimination of mutated cells. The database of TP53 mutations has implicit and explicit links with several molecular and cellular biology databases, such as sequence databases, literature databanks and human cell line catalogues.

The prototype includes the user interface, the Java-based search engine, which is in charge of carrying out the queries and gathering the information, a knowledge base, and a database where the data retrieved from the various information sources are stored.
The user can ask for the execution of a query by submitting the proper parameters through an online form to the main server. The search engine queries every database involved in the query, after checking the contents of the knowledge base; the results are finally returned to the user.

The main application is structured in several blocks, each of which has a specific function:
·    Query the knowledge base
·    Preparation of the results' table
·    Selection of the involved databases
·    Preparation of the query
·    Analysis of the sites hosting involved databases
·    Query of the involved databases
·    Gathering and restructuring of the results
·    Displaying of results

The prototype is based on the knowledge base, whose goals are: 1) to select the information sources, 2) to carry out queries, 3) to extract the required information and 4) to integrate data.

The prototype has been developed using open source software: the MySQL database management system, the Apache Tomcat web server and the Java programming language. At present, the prototype is under beta test.

REELIN IS A HEPARIN BINDING PROTEIN: IN VITRO TESTING AND IN SILICO ANALYSIS - (session: Structural Genomics)

Roger Panteri1*, Alessandro Paiardini2*, Ramona Marino1, Stefano Pascarella2,3,4, Gabriella D’Arcangelo5 and Flavio Keller1

1Laboratory of Developmental Neuroscience, Università “Campus Bio-Medico”, Rome, Italy; 2Dipartimento di Scienze Biochimiche “A. Rossi Fanelli” and Centro di Biologia Molecolare del Consiglio Nazionale delle Ricerche, 3Centro Interdipartimentale di Ricerca per la Analisi dei Modelli e dell’Informazione nei Sistemi Biomedici (CISB), 4Centro di Eccellenza di Biologia e Medicina Molecolare (BEMM), Università degli Studi di Roma “La Sapienza”, Rome, Italy; and 5The Cain Foundation Laboratories, Department of Pediatrics, Division of Neuroscience, Baylor College of Medicine, Houston, Texas 77030, USA


Reelin is a large molecule of the extracellular matrix (ECM) which regulates neuronal positioning during the early stages of cortical development in vertebrate species (1,2,3). The localization of Reelin in the ECM, its modular assembly and its role in the regulation of neuronal migration led us to hypothesize a function for its modules in binding to polysaccharides commonly found on proteoglycans of the ECM, similar to that observed for the repeat modules of Laminins and Thrombospondins. We investigated whether Reelin could interact with the polysaccharide heparin using an affinity chromatography approach followed by immunoblot analysis. The results obtained indicate an important specific interaction between Reelin and the heteropolysaccharide heparin; moreover, the data support the involvement of the Reelin subrepeats in the binding. Further bioinformatic analysis and three-dimensional modeling of the Reelin subrepeat regions confirm the presence of structural features common to polysaccharide binding modules, such as an ASP-BNR hairpin loop, large aromatic residues and a series of basic arginine residues, located on the surface cleft of the 3D model of a Reelin subrepeat and potentially involved in the binding to polysaccharides. These findings provide new insights into the structural organisation of Reelin and novel hypotheses concerning the molecular function of this large ECM molecule, which could be tested experimentally. Finally, this work points to new directions in the search for therapeutic compounds that can modulate the activity of Reelin, given the importance of this protein in several human neurodevelopmental disorders.

1) D'Arcangelo, G., Miao, G.G., Chen, S.C., Soares, H.D., Morgan, J.I., and Curran, T. 1995. A protein related to extracellular matrix proteins deleted in the mouse mutant reeler. Nature 374: 719-723.
2) Quattrocchi, C.C., Wannenes, F., Persico, A.M., Ciafrè, S.A., D'Arcangelo, G., Farace, M.G., and Keller, F. 2002. Reelin is a serine protease of the extracellular matrix. J. Biol. Chem. 277: 303-309.
3) Rice, D.S., and Curran, T. 2001. Role of the Reelin signalling pathway in central nervous system development. Annu. Rev. Neurosci. 24: 1005-1039


Structural model for Gas1p family members by combined threading and secondary structure prediction methods - (session: Other)

Elena Papaleo, Gianluca Santarossa, Marina Vai, Piercarlo Fantucci, Luca De Gioia

Università di Milano Bicocca - Dipartimento di  Biotecnologie e Bioscienze


Gas1p is a S. cerevisiae membrane glycoprotein that plays a key role in cell wall assembly [1] and belongs to the Gas1p family (family 72) of beta-1,3-glucanases. Several other family members have been isolated from S. cerevisiae and from Candida species, S. pombe and other fungal organisms.
In particular, five GAS genes are present in S. cerevisiae, coding for different Gas enzymes, each characterized by a different modular organization of domains.
The catalytic domain (C-domain) is the most conserved among all members of the family, and its structural features are particularly relevant for investigating structure-function relationships in this class of enzymes.
The aim of this work was to predict the 3D structure of this domain and to compare the C-domains of different members of the Gas1p family. Since no 3D structure template suited for homology model construction was available, we combined threading methods [2] and secondary structure predictions to derive 3D models of some Gas1p family members.
Based on this analysis we propose that the C-domain assumes a TIM-barrel fold and that the positions of the active site residues in our models are compatible with the catalytic characteristics proposed for GHA clan members [3]. We also conduct a detailed analysis and comparison of the structural features of the C-domains of several members of the Gas1p family. (Poster only)

1.    Popolo L., Vai M., The Gas1 glycoprotein, a putative wall polymer cross-linker, Biochim. Biophys. Acta, 1999, 1426(2):385-400.

2.    Jones D., Thornton J., Protein fold recognition, J. Comput. Aided Mol. Des., 1993, 4:439-456.

3.    Henrissat B., Callebaut I., Fabrega S., Lehn P., Mornon J.P., Davies G., Conserved catalytic machinery and the prediction of a common fold for several families of glycosyl hydrolases, PNAS, 1996, 93(11):5674

An Algorithm for Finding Common Secondary Structure Motifs in a Set of Unaligned RNA Sequences - (session: Novel Algorithms for Bioinformatics)

Giulio Pavesi, Giancarlo Mauri, Graziano Pesole

Università Milano Bicocca


We present an algorithm for finding conserved secondary structure motifs in a set of RNA sequences, that is, secondary structure elements that appear in all or most of the secondary structures formed by the sequences of the set.
Unlike the methods introduced so far for this problem, the approach we present neither computes an alignment of the sequences beforehand nor takes sequence similarity into account in any way, but looks directly for structural similarities. Thus, it can also be applied to cases where the RNA sequences do not show significant similarity in their nucleotide sequence. The algorithm takes as input the secondary structures of the sequences, exhaustively enumerates all patterns representing feasible secondary structure elements up to a maximum size (which can equal the length of the sequences), searches for each one in the structures, and finally reports those patterns that appear in all or most of the sequences of the set.
Occurrences of patterns can be approximate, that is, they can differ in the size of a stem or of an internal loop, in the presence or absence of a bulge, and so on; the type and degree of approximation can be freely chosen by the user.
The input structures can be either determined experimentally, or predicted by one of the existing methods. In the latter case, we show how the algorithm can deal with the uncertainty deriving from predictions, by considering different alternative secondary structures for each sequence.
Experiments have shown that the algorithm, coupled with existing secondary structure prediction methods, is able to discover efficiently known RNA structural motifs, such as histone and IRE stem-loop motifs in RNA untranslated regions, as well as structural motifs shared by the members of different virus families.
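The idea of searching structures directly, with user-chosen tolerance, can be illustrated on the simplest motif class, the hairpin. The sketch below is not the authors' algorithm: it assumes the input structures are given in dot-bracket notation, it considers only plain hairpins (no bulges or internal loops), and the function names and tolerance parameters are hypothetical.

```python
def hairpins(structure):
    """Return sorted (stem_length, loop_size) pairs for every hairpin
    in a dot-bracket string such as '((((....))))'."""
    stack = []
    pairs = {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs[stack.pop()] = i
    if stack:
        raise ValueError("unbalanced structure")

    found = []
    for i, j in pairs.items():
        # a hairpin loop is closed by a pair with only unpaired bases inside
        if all(c == "." for c in structure[i + 1:j]):
            stem = 1
            a, b = i - 1, j + 1
            # extend the stem outward while base pairs stack directly
            while a >= 0 and pairs.get(a) == b:
                stem += 1
                a -= 1
                b += 1
            found.append((stem, j - i - 1))
    return sorted(found)


def shared_hairpins(structures, stem_tol=0, loop_tol=0, min_fraction=1.0):
    """Report hairpin shapes that occur, within the given tolerances,
    in at least `min_fraction` of the input structures."""
    per_structure = [hairpins(s) for s in structures]
    candidates = sorted({h for hs in per_structure for h in hs})
    result = []
    for stem, loop in candidates:
        n = sum(
            any(abs(stem - s) <= stem_tol and abs(loop - l) <= loop_tol
                for s, l in hs)
            for hs in per_structure
        )
        if n / len(structures) >= min_fraction:
            result.append((stem, loop))
    return result
```

Setting both tolerances to zero demands exact occurrences; raising them mimics approximate matching in stem and loop size, and lowering `min_fraction` corresponds to motifs present in most rather than all sequences.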

Computational analysis of non-coding regions in eukaryotic genomes - (session: Comparative Genomics and Molecular Evolution)

E. Pizzi, E. Bultrini, P. Del Giudice, C. Frontali

Istituto Superiore di Sanità, Roma


Genome sequencing projects determine a large amount of sequence data each year. One of the major challenges for computational biologists is to extract relevant biological information from billions of Megabases that have been stored in the databases so far. Whereas, in the last years, many efforts have been devoted to locating genes within genomes, relatively few tools have been developed to identify the regulatory regions required for the correct transcriptional activity of the genome. This task is particularly difficult in the case of eukaryotic organisms, for which regulatory regions represent a small percentage overwhelmed by presumably non-functional DNA. Recently, several computational procedures have emerged to solve this problem, including knowledge-based methods, comparative genomics analysis and methods based on statistical-compositional properties of genomes.
By using recurrence quantification analysis we were able to show that in some eukaryotic genomes, introns and intergenic tracts exhibit highly recurrent patterns with correlated properties distinguishing them from the low-recurrence regime present in exons. This observation was explained by assuming a peculiar oligonucleotide usage in non-coding DNA and a significantly different one in protein-coding regions. In order to characterise this oligonucleotide usage, we applied principal component analysis to the pentamer distribution of experimentally confirmed introns and exons from the C. elegans and D. melanogaster genomes. We found a subset of pentamers that significantly discriminates introns from their randomised counterparts and from exons. A genome-wide analysis of pentamer usage revealed that most introns and intergenic tracts utilize the identified subset of pentamers, whereas exons and a small percentage of the non-coding fraction do not.
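The raw material for such an analysis is a table of pentamer frequencies per sequence. The sketch below shows one way to compute it and to compare two sequence classes; it is an illustration only, the function names and the pseudocount are assumptions, and the principal component analysis step used by the authors is not reproduced here.

```python
import math
from collections import Counter


def pentamer_frequencies(seq, k=5):
    """Relative frequencies of overlapping k-mers (k=5 by default).
    Windows containing characters other than A, C, G, T are skipped."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):
            counts[word] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}


def usage_log_ratio(freq_a, freq_b, pseudo=1e-6):
    """log2 ratio of k-mer usage between two classes (e.g. introns vs
    exons); the pseudocount avoids division by zero for absent words."""
    words = set(freq_a) | set(freq_b)
    return {
        w: math.log2((freq_a.get(w, 0.0) + pseudo) /
                     (freq_b.get(w, 0.0) + pseudo))
        for w in words
    }
```

Each sequence then becomes a vector of up to 1024 frequencies, which is the kind of matrix a principal component analysis can be applied to.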
Our hypothesis is that genome pentamer usage could be viewed as a sort of genome background noise, and hence functional sequences might emerge as regions having different compositional properties. In order to test our hypothesis, we analysed the 5' upstream regions of more than 100 members of a multigene family from the P. falciparum genome. We identified four regions, within 1 kb, with an anomalous oligonucleotide usage; we compared our results with those obtained through a multiple alignment performed on the same sequences.
The overall compositional property could be viewed as a sort of genome background. Regulatory elements might lie within regions that adopt a different oligonucleotide usage.

Annotation of EST sequences by a structural bioinformatics approach - (session: Other)

L. Pugliese

via Don Grioli, 4 -Torino


In the past few years the complete sequences of more than 55 genomes have been published and at least 100 more are known to be near completion. One challenge of the genome era is to predict molecular functions and biological roles for the predicted gene products.

Most approaches for the tentative assignment of functions to predicted proteins are based on pairwise sequence similarity searches against known proteins using sequence comparison programs such as FASTA and BLAST. However, many proteins are multifunctional, multidomain proteins for which the assignment of a single function results in loss of information. Also, with more predicted proteins from genome projects being added to the databases, the best hit in pairwise sequence similarity searches is frequently a poorly annotated protein.

To overcome the limitations of functional annotation based on pairwise sequence similarity searches, knowledge of the three-dimensional structure of domains is gaining more and more importance. In this view, the application of fold recognition methods coupled with homology model building and theoretical structure verification methods represents a way to obtain a lot of information in a short time.

The protocol applied in order to assign a function to an EST sequence involves the following steps:
1. Submission of the sequence to a fold recognition/structure prediction metaserver.
2. 3D alignment of sequences relative to templates receiving good scores from the metaserver.
3. Homology model building using as template the pdb files having the best scores within the metaserver output.
4. Evaluation of the models obtained by the program ProsaII and the server http://atlas.physbio.mssm.edu:8084/servers/pg/
5. Analysis of the literature concerning the template structures in order to extract information on the function of the new sequence.

Detection and analysis of spliced chimeric mRNAs in sequence databanks - (session: Novel Algorithms for Bioinformatics)

Antonello Romani, Marco Trerotola, Emanuela Guerra, Andrew Emerson, Elda Rossi, Agnieszka Bronowska and Saverio Alberti

Mario Negri Sud


A databank screening procedure, the In Silico Trans-splicing Retrieval System (ISTReS), was developed to identify chimeric mRNAs originating from chromosomal translocations, mRNA trans-splicing and multi-locus transcription. A parsing algorithm to screen cDNA-vs-genome Blast outputs was implemented. Key filtering criteria were Blast scores of >= 300, match lengths of >= 95% of the query sequences, junction of the two partners at exon-exon borders and concordant 'sense/sense' reading orientation. ISTReS was validated by the successful identification of bona fide chromosomal translocation-derived fusion transcripts in the HGI and RefSeq databanks. The performance of ISTReS was verified against recently identified chimeric antisense transcripts. Analysis of the UNIGENE database revealed 21742 chimeric sequences overall, corresponding to ~1% of the database transcripts. Novel FOP-Rho GAP and methionyl tRNA synthetase-advillin chimeric mRNAs with the canonical features of trans-spliced transcripts were identified among 246 chimeras from the RefSeq databank. This suggests a frequency of canonically spliced chimeras of approximately 1% of all the hybrid sequences in current databanks. These findings demonstrate the efficiency of ISTReS and the overall feasibility of sequence/structure-based strategies to search for chimeric mRNAs that are candidates to derive from the splicing of heterologous transcripts.
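The filtering criteria listed above can be expressed as a small predicate over a pair of hits. The following is a sketch under assumptions, not the ISTReS parser: the `Hit` record and its fields are hypothetical, and the 95% criterion is interpreted here as the fraction of the query covered by the two partner matches taken together.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One cDNA-vs-genome match (hypothetical, simplified record)."""
    locus: str            # genomic locus or gene the segment maps to
    score: float          # Blast score of the match
    query_start: int      # 1-based, inclusive position on the query
    query_end: int        # 1-based, inclusive position on the query
    sense: bool           # True if the match is in sense orientation
    at_exon_border: bool  # True if the junction end falls on an exon border


def covered_fraction(query_length, a, b):
    """Fraction of the query covered by the union of two matches."""
    len_a = a.query_end - a.query_start + 1
    len_b = b.query_end - b.query_start + 1
    overlap = max(0, min(a.query_end, b.query_end)
                  - max(a.query_start, b.query_start) + 1)
    return (len_a + len_b - overlap) / query_length


def is_candidate_chimera(query_length, a, b, min_score=300, min_cover=0.95):
    """Apply the four criteria to two hits from the same query."""
    if a.locus == b.locus:
        return False                      # both parts from one locus
    if a.score < min_score or b.score < min_score:
        return False                      # Blast score threshold
    if covered_fraction(query_length, a, b) < min_cover:
        return False                      # match length threshold
    if not (a.at_exon_border and b.at_exon_border):
        return False                      # junction at exon-exon border
    return a.sense and b.sense            # concordant sense/sense
```

In a real screen the hit records would be filled in by parsing the Blast output and the exon annotation of each locus.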

Analysis of p63 isoform-driven gene expression: a cDNA array/bioinformatics integrated approach - (session: Structural Genomics)

S. Saviozzi, M. Lo Iacono, F. Lanzarato, G. Franceschini, G. La Mantia, V. Calabrò and R.A. Calogero

Università di Torino


Two homologs of p53 have been identified: p73 and p63 (Kaghad et al. 1997; Yang et al. 1998). The hallmark features of the p53 protein - an acidic amino-terminal transactivation domain (TA), a core domain for DNA binding and a carboxy-terminal oligomerization domain - are shared by both p63 and p73. p73 and p63 are also characterized by the presence at the C-terminus of a sterile alpha-motif (SAM)-like sequence. The p63 gene encodes at least six polypeptides by way of two different promoters/ATG (TA and deltaN isoforms) and three alternatively spliced C-terminal regions.
p63-controlled gene expression profiling was investigated by hybridising total RNA extracted from a p53 -/- cell line (SAOS-2), transiently transfected with the six p63 isoforms, on cDNA arrays (8395 cDNAs, Incyte LifeGrid). We identified 384 differentially expressed genes passing two rounds of statistical validation (Tusher et al. 2001; Baldi and Long 2001). We observed that TAp63beta prevalently induces up-regulation, TAp63gamma mainly induces down-regulation, and deltaN isoforms, although lacking the transactivation domain, are able to induce a transcriptional response. Transcriptional profiles of genes controlled by the six isoforms were grouped using k-way and adaptive quality-based clustering approaches. Both methods show that three classes are the optimal solution for our data set; the three clusters were defined as UP, i.e. genes up-modulated by all isoforms, DOWN, i.e. genes down-modulated by all isoforms, and MIX, i.e. genes up-modulated only by the TAp63beta, dNp63beta and dNp63gamma isoforms.
We used these p63-controlled genes as starting material to evaluate the presence of common regulative elements in the promoters of co-regulated genes. We extracted from NCBI human genomic data the 2 kb upstream of the annotated first transcribed nucleotide of the p63-controlled genes and mapped the presence of known human transcriptional elements, described in TRANSFAC Professional 6.1, using the PATSER program (Hertz and Stormo 1999). Then, we applied a computational data mining technique used in market basket analysis: the Agrawal association rules induction algorithm (Agrawal 1993), which is a powerful method to find regularities in a set of documents/transactions (e.g. a commercial association rule like "If a customer buys wine and bread, he often buys cheese, too." can be rewritten in a gene-oriented way: "If an upstream region contains at least one AP2 and one SP1, it often contains HRE, too.").
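The two quantities behind an association rule, support and confidence, can be shown on a toy set of upstream regions. This is a brute-force sketch rather than the Apriori-style induction algorithm cited above: it simply enumerates small antecedents, and the function names, thresholds and example promoters are invented for illustration.

```python
from itertools import combinations


def support(transactions, items):
    """Fraction of transactions (sets of elements) containing all `items`."""
    items = frozenset(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def rules(transactions, min_support=0.5, min_confidence=0.8,
          max_antecedent=2):
    """Enumerate rules 'antecedent -> consequent' with a single consequent.

    support    = fraction of transactions containing antecedent + consequent
    confidence = support(antecedent + consequent) / support(antecedent)
    Returns (antecedent_tuple, consequent, support, confidence) tuples.
    """
    all_items = sorted(set().union(*transactions))
    found = []
    for size in range(1, max_antecedent + 1):
        for antecedent in combinations(all_items, size):
            sup_a = support(transactions, antecedent)
            if sup_a == 0:
                continue
            for consequent in all_items:
                if consequent in antecedent:
                    continue
                sup_rule = support(transactions, antecedent + (consequent,))
                confidence = sup_rule / sup_a
                if sup_rule >= min_support and confidence >= min_confidence:
                    found.append((antecedent, consequent, sup_rule, confidence))
    return found


# Toy data: each set lists the elements mapped in one upstream region.
promoters = [
    {"AP2", "SP1", "HRE"},
    {"AP2", "SP1", "HRE"},
    {"AP2", "SP1"},
    {"SP1"},
]
```

With these toy promoters, the rule "AP2 and SP1 imply HRE" has support 0.5 and confidence 2/3, so it is reported only if `min_confidence` is set at or below that value.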
The association rules induction algorithm was used to find specific rules (groups of transcriptional elements) associated with genes that are modulated by p63 isoforms and also contain p53 responsive elements (p53RE). Subsequently we identified the rules containing p53RE and evaluated their frequencies in the transcriptional profile clusters (UP, DOWN and MIX). By this approach we identified 12, 19 and 20 rules that are statistically specific (p<0.001) for the DOWN, MIX and UP clusters, respectively. We are currently investigating the presence of these rules in orthologous genes and their relative distances in p63-controlled genes.

GeneGrid: a workflow system for sequences analysis - (session: Database: Ontology and Integration)

Roberto Specchio, Andrea Caprera, John Hatton, Luciano Milanesi

Istituto di Tecnologie Biomediche, CNR, Segrate (Mi)


Here we present a project concerned with the realization of a new infrastructure for bioinformatics computing.
The infrastructure consists of a workflow system, controlled by a job manager, where a pipeline analysis is loaded into a database and executed on a parallel complex. The aim of this system is to perform complex bioinformatics analyses, where the individual analysis tools are concatenated in an automated procedure. The system comprises the following components:
- Server Host, which has the function of controlling the analysis process, and where the pipeline is loaded into a database.
- Master Host, which has the function of controlling the process. The jobs are first submitted to this host, where they are scheduled and sent to the Execution hosts.
- Execution Hosts are the nodes of the complex where the jobs are executed. When a job is completed, the Master host communicates with the Server host, where the status of the job is updated into the table Job of the database corresponding to the pipeline analysis, and the next jobs in the analysis process (if present) are sent to execution.
- Database Repository, where the information is retrieved from the programs in execution.
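The interplay of the components listed above (a pipeline stored as jobs with dependencies, a status table updated as jobs finish, and the release of the next jobs) can be sketched in a few lines. This is a single-process toy, not the GeneGrid job manager: the job names, the status values and the `run_job` callback are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def run_pipeline(dependencies, run_job):
    """Execute a pipeline whose jobs depend on each other.

    `dependencies` maps each job name to the set of jobs it must wait for.
    `run_job` is called once per job when all its prerequisites are done.
    Returns the execution order and the final status table.
    """
    status = {}
    for job, deps in dependencies.items():
        status.setdefault(job, "waiting")
        for dep in deps:
            status.setdefault(dep, "waiting")

    sorter = TopologicalSorter(dependencies)
    sorter.prepare()                      # raises CycleError on circular pipelines
    order = []
    while sorter.is_active():
        for job in sorter.get_ready():    # jobs whose prerequisites are done
            status[job] = "running"
            run_job(job)                  # a real system would dispatch to a node
            status[job] = "done"          # analogous to updating the Job table
            order.append(job)
            sorter.done(job)              # releases the jobs that depend on it
    return order, status


# Toy pipeline: annotation needs both the similarity search and gene prediction.
pipeline = {
    "blast": set(),
    "gene_prediction": {"blast"},
    "annotation": {"blast", "gene_prediction"},
}
```

In a distributed setting the jobs returned together by `get_ready()` are the ones that could be sent to different execution hosts at the same time.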

We included in our system tools for similarity analysis on nucleotide or protein sequence databases, such as BLAST, PSI-BLAST and BLAT, several methods for gene prediction and automatic annotation, and analysis of EST distribution in tissues and organs.
We are further including in our analysis system tools for comparative analysis of both structure and function. We are also implementing tools for protein interaction similarity analysis in order to complement the analysis based on sequence similarity search programs with procedures for the analysis of molecular interaction fields. We are also integrating this infrastructure into a Virtual Organization, in order to share and interface our resources in a collaborative environment. This will be realized by means of the Globus Toolkit, a package providing a set of Grid services.

The Mechanism of Interaction of Sweet Proteins with their Receptor: Modelling the Complexes - (session: Other)

Pierandrea Temussi

Universita' di Napoli Federico II


Most sweet compounds, including all popular sweeteners, are small molecular weight compounds of widely different chemical nature, but a few sweet proteins are also known. Sweet tasting proteins have different molecular lengths (from the 54 residues of brazzein to the 202 residues of thaumatin), virtually no sequence homology and very little structural homology.
Why are sweet proteins sweet?  From a very general point of view it is natural to think that glucophores, i.e., key groups responsible for the biological activity of sweet proteins similar to those of small molecular weight sweeteners, are localized on a protruding structural feature, i.e. a sweet finger that can probe the active site of the receptor.
Sweet molecules elicit their taste, in humans and other mammals, by interacting with the T1R2-T1R3 receptor. The sequence of this protein indicates that it is a metabotropic 7TM GPCR receptor with a high homology to the mGluR subtype 1 (Margolskee, 2002). The structure of the N-terminal part of the mGlu receptor, determined by X-ray diffraction (Kunishima et al., 2000), has been used as a template to build homodimeric (Margolskee, 2002) and heterodimeric (Temussi, 2002) models of the T1R3-T1R3 receptor. It is very likely that small molecular weight sweet molecules occupy a pocket analogous to the glutamate pockets in the mGlu receptor, possibly similar to the active site models predicted by indirect receptor mapping studies. However, the glucophores of sweet proteins have not yet been identified with certainty, even for sweet proteins of known structure. In addition, it is difficult to envisage the same type of interaction for sweet proteins.
Recently, an alternative explanation, based on the spread of key residues on a large surface area, has been suggested (Temussi, 2002).  This new model of interaction hypothesizes that the site of interaction with the T1R2-T1R3 receptor is different for small molecular weight sweeteners and for sweet proteins.  Sweet proteins should bind to the surface of form II of the ligand-free receptor stabilizing it. The complexes of brazzein, monellin and thaumatin with the heterodimeric model of the T1R2-T1R3 receptor show that all three proteins bind as wedges on a cavity of the receptor (Temussi, 2002). The crucial features of this model are that, by analogy to the mGlu receptor, free form II is the active form of the receptor and that a large interaction surface does not require identical glucophores for different sweet proteins.

Comparison of the structures of wild type monellin and  G16A, a structural  mutant, shows that the mutation does not affect the structure of potential glucophores but produces a distortion of the surface owing to the partial relative displacement of elements of secondary structure.  This result shows conclusively that sweet proteins do not possess a sweet finger and strongly supports the hypothesis that the mechanism of interaction of sweet tasting proteins with the sweet receptor is different from that of low molecular weight sweeteners.

References
Kunishima, N., Shimada, Y., Tsuji, Y., Sato, T., Yamamoto, M., Kumasaka, T., Nakanishi, S., Jingami, H. and Morikawa, K. (2000) Nature 407, 971-977.
Margolskee, R. F. (2002) J. Biol. Chem. 277, 1-4.
Temussi, P. A. (2002) FEBS Lett. 526, 1-4.


A structural study for the optimization of functional motifs encoded in protein sequences - (session: Structural Genomics)

A. Via, M. Helmer-Citterich

Centre for Molecular Bioinformatics, Dept. of Biology, University of Rome Tor Vergata


Many PROSITE sequence motifs match all and only the known true positive sequences. However, a great number of PROSITE motifs (referred to in the following as "leaky" patterns) select false positives and/or miss known true positives. It is possible that, at least in some cases, the weak specificity and/or sensitivity of a pattern is due to the lack of key residues in the sequence pattern. Indeed, while localized in space, functional residues may be dispersed along the protein sequence and are therefore not easily detectable in multiple sequence alignments. In the present work, a set of eight PROSITE deterministic patterns was selected among the "leaky" ones and the structures of the corresponding true positives were analyzed by means of the 3D profile procedure (de Rinaldis et al., 1998) and by visual inspection. For each pattern considered, this approach allowed the identification of sets of residues that are structurally well conserved in the protein surface region near the PROSITE residues. In seven out of eight cases only a subset of the conserved amino acids belongs to the original PROSITE pattern, while the remaining part falls outside it. Only in one case (CYTOCHROME_C) do all the structurally conserved residues belong to the original pattern. The structurally conserved residues falling outside the PROSITE patterns were used to build what we called EXTENDED sequence patterns. The CYTOCHROME_C pattern was discarded.
RESULTS: EXTENDED patterns were matched against the SWISS-PROT database to test their sensitivity and specificity. In all the analyzed cases, the addition of information inferred from structural analysis improved pattern specificity and in some cases both specificity and sensitivity.
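As an illustration of how a deterministic pattern can be scored for sensitivity and specificity, the following minimal Python sketch translates PROSITE syntax into a regular expression and evaluates it on labelled sequences. The two patterns and the five sequences are invented toy data, not the patterns or SWISS-PROT entries used in this work, and the function names are our own assumptions.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE pattern such as 'C-x(2)-C-H-{P}.' into a Python regex."""
    pattern = pattern.rstrip(".")                                # drop the trailing period
    parts = []
    for element in pattern.split("-"):
        element = element.replace("{", "[^").replace("}", "]")   # {P} -> [^P]
        element = element.replace("(", "{").replace(")", "}")    # x(2,4) -> x{2,4}
        element = element.replace("x", ".")                      # x -> any residue
        element = element.replace("<", "^").replace(">", "$")    # terminus anchors
        parts.append(element)
    return "".join(parts)

def evaluate(pattern: str, positives: list, negatives: list) -> tuple:
    """Return (sensitivity, specificity) of a pattern on labelled sequences."""
    regex = re.compile(prosite_to_regex(pattern))
    tp = sum(1 for seq in positives if regex.search(seq))
    fp = sum(1 for seq in negatives if regex.search(seq))
    fn = len(positives) - tp
    tn = len(negatives) - fp
    return tp / (tp + fn), tn / (tn + fp)

# Toy data (hypothetical): family members and unrelated sequences.
positives = ["MKCAACHGTTKA", "GGCLLCHAKDRW", "AAGGHHLLKK"]
negatives = ["TTCEECHGGGDL", "MLLKKAAGG"]

original = "C-x(2)-C-H."              # toy "leaky" pattern
extended = "C-x(2)-C-H-x(3)-[KR]."    # toy pattern with one extra conserved residue

print(evaluate(original, positives, negatives))
print(evaluate(extended, positives, negatives))
```

In this toy case the extra residue removes the single false positive without losing any true positive, which is the kind of specificity gain described above.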

Annotation of Photobacterium profundum genome - (session: Database: Ontology and Integration)

Nicola Vitulo, Alessandro Cestaro, Alessandro Vezzi, Michela D'Angelo, Francesca Simonato, Giorgio (Mitch) Malacrida, Stefano Campanaro, Giorgio Valle

CRIBI, Università di Padova


The Genomic Research Group of CRIBI, University of Padova, has recently completed the genomic sequence of Photobacterium profundum, an extremophile bacterium adapted to life at high pressure and low temperature. The sequence, obtained in our laboratory, is more than 6 million base pairs long. We are currently in the finishing phase of the project, aiming to close the remaining 20 gaps and to confirm some genomic regions that are covered by low-quality sequence.
The genome of P. profundum is organized into two main circular chromosomes and an 80 kb plasmid. The gene density is approximately one gene per 1090 bp, so about 6000 genes are present.
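The relation between gene density and gene count can be checked with a short calculation. The sketch below assumes a genome length of exactly 6,000,000 bp, which is only the lower bound stated above, so the result is a rough estimate; the variable names are ours.

```python
# Assumption: 6,000,000 bp is the stated lower bound, not the exact genome length.
genome_length_bp = 6_000_000
bp_per_gene = 1090                      # gene density quoted in the abstract

estimated_genes = genome_length_bp / bp_per_gene
print(round(estimated_genes))           # gene count at the lower-bound length

# Conversely, the genome length implied by 6000 genes at this density:
implied_length_bp = 6000 * bp_per_gene
print(implied_length_bp)
```

The lower bound gives roughly 5500 genes, and 6000 genes would correspond to about 6.5 million base pairs, consistent with a genome somewhat longer than 6 million base pairs.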
In the poster we will describe the current status of the annotation and the procedures that we have applied to analyze this interesting genome.

Disulfide Connectivity Prediction using Generalized Recursive Neural Networks and Evolutionary Information - (session: Structural Genomics)

Alessandro Vullo and Paolo Frasconi

Dipartimento di Sistemi e Informatica, Università di Firenze


We focus on the prediction of disulfide bridges in proteins starting from their amino acid sequence and from the knowledge of the disulfide bonding state of each cysteine. The location of disulfide bridges is a structural feature that conveys important information about the protein main chain conformation and can therefore help towards the solution of the folding problem. Existing approaches based on weighted graph matching algorithms do not take advantage of evolutionary information. Recursive neural networks (RNN), on the other hand, can handle in a natural way complex data structures such as graphs whose vertices are labeled by real vectors, allowing us to incorporate multiple alignment profiles in the graphical representation of disulfide connectivity patterns.
The core of the method is the use of machine learning tools to rank alternative disulfide connectivity patterns. We develop an ad-hoc RNN architecture for scoring labeled undirected graphs that represent connectivity patterns. We report experimental results on a set of cysteine-rich non-homologous sequences. We found that using multiple alignment profiles allows us to obtain a significant improvement of prediction accuracy, clearly demonstrating the important role played by evolutionary information.
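To make the ranking step concrete, the following minimal Python sketch enumerates every alternative connectivity pattern (every way of pairing the bonded cysteines) and sorts them by a score. The scoring function here is a deliberately simple placeholder that favours bridges between cysteines close in sequence; it stands in for the trained RNN, whose architecture is not reproduced. The cysteine positions and all function names are hypothetical.

```python
from typing import Callable, Iterator, List, Tuple

Pattern = Tuple[Tuple[int, int], ...]

def connectivity_patterns(cysteines: List[int]) -> Iterator[Pattern]:
    """Yield every pairing of the bonded cysteines (an even number is assumed)."""
    if not cysteines:
        yield ()
        return
    first, rest = cysteines[0], cysteines[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        # Pair the first cysteine with each possible partner, then recurse.
        for tail in connectivity_patterns(remaining):
            yield ((first, partner),) + tail

def rank_patterns(cysteines: List[int],
                  score: Callable[[Pattern], float]) -> List[Pattern]:
    """Return all candidate patterns, best-scoring first."""
    return sorted(connectivity_patterns(cysteines), key=score, reverse=True)

def toy_score(pattern: Pattern) -> float:
    """Placeholder scorer (NOT the RNN): penalise long sequence separations."""
    return -sum(abs(a - b) for a, b in pattern)

# Hypothetical sequence positions of six bonded cysteines (three bridges).
positions = [10, 25, 40, 62, 80, 95]
ranked = rank_patterns(positions, toy_score)
print(len(ranked))      # number of alternative patterns
print(ranked[0])        # top-ranked pattern under the toy score
```

With B bridges the number of candidates is (2B-1)!!, that is 15 for three bridges and 105 for four, which is why a learned scoring function is useful for choosing among them.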

Development of new bioinformatic tools to analyze the HLA genetic system - (session: Novel Algorithms for Bioinformatics)

Ivano Zara, Riccardo Schiavon, Giorgio Valle

CRIBI Biotechnology Centre; University of Padova


The HLA genetic system has the highest level of complexity and the highest SNP density in the human genome. Our group at CRIBI has developed many bioinformatic tools to build and maintain a database containing most of the official sequences of the WHO Nomenclature Committee for Factors of the HLA System.
Our tools are designed to analyze the public HLA sequences, to make them more informative, and to ease the detection of short oligonucleotides useful as typing PCR primers.
First of all, we have produced a database that represents a good starting point for HLA sequence data analysis. Indeed, one of our first findings was the detection of several inaccuracies in the IMGT/HLA database, which is the best publicly available immunogenetic resource.
The most innovative feature is a new method for aligning oligonucleotide and allelic sequences, based on hybridisation thermodynamic parameters. The program calculates these parameters using the most reliable algorithms and predicts the melting temperature of the hybrid even in the presence of mismatches.
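The thermodynamic parameters and algorithms used by the program are not given here, so the following Python sketch substitutes two much simpler ingredients to show the general idea: an ungapped placement of the oligonucleotide along an allele, scored by mismatch count, and the Wallace rule (2 °C per A/T, 4 °C per G/C) as a rough melting temperature for a short, perfectly matched oligonucleotide. It does not model the effect of mismatches on the melting temperature. The sequences are invented, and the oligonucleotide is assumed to be written on the same strand as the allele.

```python
from typing import Optional, Tuple

def wallace_tm(oligo: str) -> int:
    """Wallace rule estimate in degrees C; only meaningful for short oligonucleotides."""
    oligo = oligo.upper()
    at = oligo.count("A") + oligo.count("T")
    gc = oligo.count("G") + oligo.count("C")
    return 2 * at + 4 * gc

def best_site(oligo: str, allele: str) -> Optional[Tuple[int, int]]:
    """Return (position, mismatches) of the ungapped placement with fewest mismatches."""
    best = None
    for pos in range(len(allele) - len(oligo) + 1):
        window = allele[pos:pos + len(oligo)]
        mismatches = sum(a != b for a, b in zip(oligo, window))
        if best is None or mismatches < best[1]:
            best = (pos, mismatches)
    return best          # None if the allele is shorter than the oligonucleotide

# Hypothetical toy sequences, not real HLA alleles.
oligo = "ACGTTGCA"
allele_a = "GGACGTTGCATT"
allele_b = "GGACGATGCATT"

print(wallace_tm(oligo))
print(best_site(oligo, allele_a))
print(best_site(oligo, allele_b))
```

A thermodynamic scoring scheme would replace the mismatch count with a predicted duplex stability for each placement, while keeping the same sliding-window structure.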
This method allowed us to develop a dedicated program that finds all putative reaction probes or primers, which are then selected on the basis of their ability to identify groups of alleles, making it possible to design full typing kits. Another program simulates, on a thermal gradient, a PCR-based typing reaction against all HLA alleles, also revealing non-specific allele amplification that may lead to incorrect typing interpretation.
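The selection of probes by the groups of alleles they identify can be sketched as follows. This minimal Python example uses exact matching of fixed-length subsequences instead of the thermodynamic criteria described above, and the allele names and sequences are invented, so it illustrates only the grouping logic.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Set

def candidate_probes(alleles: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """Map each k-mer to the set of allele names that contain it exactly."""
    hits: Dict[str, Set[str]] = defaultdict(set)
    for name, seq in alleles.items():
        for i in range(len(seq) - k + 1):
            hits[seq[i:i + k]].add(name)
    return hits

def discriminating_probes(alleles: Dict[str, str],
                          k: int) -> Dict[FrozenSet[str], List[str]]:
    """Group probes by the allele set they identify, dropping probes that hit every allele."""
    groups: Dict[FrozenSet[str], List[str]] = defaultdict(list)
    for probe, names in candidate_probes(alleles, k).items():
        if len(names) < len(alleles):       # a probe matching all alleles cannot type
            groups[frozenset(names)].append(probe)
    return groups

# Hypothetical toy alleles, not real HLA sequences.
alleles = {
    "allele_1": "ACGTACGGA",
    "allele_2": "ACGTTCGGA",
    "allele_3": "ACGTTCGCA",
}

groups = discriminating_probes(alleles, k=4)
for allele_set, probes in groups.items():
    print(sorted(allele_set), sorted(probes))
```

A typing kit would then be assembled by choosing probes whose allele sets, taken together, distinguish every allele from every other.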
Some other utilities are available at our web address http://grup.cribi.unipd.it/projects/HLA