Basilis Gidas
Office Hours: Thurs 3 - 4 and By appointment
Biography
Ph.D., University of Michigan, Ann Arbor, 1970
During the past eight years, the research interests of Professor Basilis Gidas have been in the identification and analysis of transcriptional regulatory networks and signal transduction pathways, and in ab initio protein folding, using Bayesian statistics and hierarchical/syntactic models similar to Chomsky’s grammars. The work emphasizes: the identification of Myc regulatory networks and pathways in cell growth, cell proliferation, and apoptosis, on the basis of microarray expression data, ChIP-chip data, and cross-species comparison; the identification of phosphorylation site motifs on the basis of tandem mass spectrometry, protein-protein interactions, and structural information about kinases and substrates; and ab initio protein folding using compositional/syntactic representations of proteins.
BIOGRAPHY
Basilis Gidas received his B.Sc. from the National Technical University of Athens, Greece, in 1965. He has an M.A. degree in Mathematics, an M.S. degree in Physics, and a Ph.D. degree in Mathematical Physics (1970), all from the University of Michigan. He is an elected Fellow of the Institute of Mathematical Statistics. Before he joined the faculty at Brown in 1984, he held appointments at Rockefeller University and the Institute for Advanced Study (Princeton). In the past, he has made contributions in Mathematical Physics (quantum field theory) and in partial differential equations/differential geometry. Since 1982 he has worked in Computer Vision, Speech Recognition, Nonparametric Statistics, and, for the past eight years, in Computational Molecular Biology. He has served on the National Research Council Advisory Panel for "Spatial Statistics and Image Processing", and is on the editorial board of the International Journal of Imaging Science and Technology.
RESEARCH INTERESTS
Bayesian Statistics/Computer Vision/Speech Recognition
Metropolis-type Monte Carlo simulation algorithms and simulated annealing. Simulation and optimization via the Langevin equation. Markov Random Field (MRF) estimation and consistency of pseudo-likelihood estimators, and of maximum likelihood estimators from complete or incomplete data. A variational method for estimating MRFs. Nonparametric estimation for continuous-time stochastic processes arising in speech recognition. Object identification via classification trees and stochastic grammars. Renormalization group methods for multiscale/multilevel image processing. Texture representation via MRFs with polynomial interactions. Tracking of moving objects via particle filters. Speech signal representation via nonlinear transformations and wavelets. Classification and clustering of stop consonants via nonlinear transformation and nonlinear discriminant analysis.
Computational Molecular Biology
Probabilistic hierarchical/syntactic models (analogous to Chomsky grammars) for identifying, representing, and analyzing transcription regulatory networks and signal transduction pathways. Identification of directly and indirectly regulated genes by combining microarray expression data, ChIP-chip data, and cross-species comparison information; identification of downstream pathways through which Myc functions in cell growth, cell proliferation, and apoptosis. Identification of phosphorylation site motifs on the basis of tandem mass spectrometry data, protein-protein interactions, and structural information about kinases and substrates. Protein representation and ab initio folding via hierarchical/syntactic (also known as compositional) models.
Current Projects in Computational Molecular Biology
Cellular processes such as the cell cycle, cell proliferation, apoptosis, cell growth, cell differentiation, genome instability, cellular communication, and responses to external stimuli are governed by interactions among DNA, proteins, RNAs, and a host of other molecules. Understanding the principles and the regulatory mechanisms underlying these processes is a central goal in biology. Our research addresses two aspects of the problem that have been studied extensively and seem to be within reach: (i) transcription regulatory networks and the downstream pathways through which transcription factors (TFs) function in specific cellular processes, and (ii) signal transduction pathways that transmit, process, and integrate external and internal signals. Our research also addresses structural proteomics, especially the ab initio protein folding problem. Advances on these problems over the past few decades have been made possible by the genome sequencing of several species and the rapid development of experimental technologies (such as microarrays, tile-arrays, ChIP-chip, real-time PCR, the yeast two-hybrid assay, tandem mass spectrometry, NMR, and crystallography), as well as by more recent tools such as RNAi screening and fluorescent proteins.
A complete understanding of the regulatory networks and signaling pathways entails mathematical/probabilistic models that articulate complex biochemical phenomena and integrate multiple sources of biological knowledge and experimental data from more than one technology. The models need to represent phenomena at multiple levels. At the local level, the models must articulate the spatio-temporal cooperation and coherence of complex interactions among DNA, proteins, RNAs, and signal transducers, as well as the spatio-temporal distributions and abundance profiles of the molecules; these dependencies underlie the regulatory controls that determine, for example, gene expression profiles and cellular decisions such as apoptosis and transitions from one cell-cycle phase to the next. At the global level, the models must articulate global regularities or patterns that represent the "syntax" or overall architecture of a network, pathway, or 3-D structure of a protein. The precise nature of the global and local aspects of a model is problem dependent. For example, a gene-finding model at the global level must represent the "syntax" of the concept "gene" as a collection of "motifs" or genomic sequences (e.g. TATA box, 5'UTR region, initial exon, alternating exon/intron, 3'UTR, Poly-A tail, intergenic regions, etc.) concatenated according to precise but "random" rules that allow, for example, the absence of a TATA box, a single exon, or an arbitrary number of exons; at the local level, the model must articulate the local variability of each motif or signal. Similar two-level descriptions are necessary for models predicting the secondary structure of rRNAs or the 3-D structure of a protein. In transcription regulatory networks and pathways, the global representation includes the concatenation of a hierarchy of entities, e.g. small motifs or patterns that concatenate to form a module, modules that concatenate to form larger modules, and these in turn concatenate to form networks.
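The global "syntax" of the gene example can be illustrated with a toy stochastic grammar. The following is a minimal sketch only, not the models used in this research: the segment labels, the function name `sample_gene_structure`, and the probabilities `p_tata` and `p_more_exons` are illustrative assumptions, and the local variability of each motif is not modeled.

```python
import random


def sample_gene_structure(rng, p_tata=0.7, p_more_exons=0.6):
    """Sample one gene "parse" as an ordered list of segment labels.

    Toy production rules (all probabilities are made-up assumptions):
      gene  -> [TATA_box] 5'UTR exons 3'UTR polyA_tail
      exons -> exon | exon intron exons
    """
    segments = []
    # The TATA box is optional, so some sampled genes lack it.
    if rng.random() < p_tata:
        segments.append("TATA_box")
    segments.append("5'UTR")
    # At least one exon; further exon/intron pairs are added at random,
    # which allows a single exon or an arbitrary number of exons.
    segments.append("exon")
    while rng.random() < p_more_exons:
        segments.extend(["intron", "exon"])
    segments.extend(["3'UTR", "polyA_tail"])
    return segments


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(" + ".join(sample_gene_structure(rng)))
```

Every sampled structure obeys the same global rules (exons and introns alternate, the 3'UTR and Poly-A tail close the gene) while the presence of the TATA box and the number of exons vary from sample to sample.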
Bayesian statistics and probability provide a natural framework for designing both the local and global aspects of the models, and for accommodating multiple sources of data. The framework supports powerful computational algorithms such as dynamic programming and Monte Carlo type simulation and optimization algorithms. In many ways, the study of transcription regulation, signaling pathways, and the structure of proteins and RNAs has a great deal in common with the study of computer vision, speech recognition, and other cognition problems. Our research aims at exploring existing, and developing novel, hierarchical/syntactic models similar to Chomsky grammars (which include HMMs and context-free grammars) for articulating the global properties of specific tasks in genomics, proteomics, and structural proteomics. Our current focus is on the following three projects:
1. Myc Network and Pathways: The c-MYC protein (a transcription factor) has been implicated in a number of biological processes including cell growth, apoptosis, cell proliferation, and cancer. It is believed that MYC regulates the expression of about 10-15% of human genes (more than any typical transcription factor). Some of these genes are regulated directly (MYC binds in the promoter or somewhere in the vicinity of a gene), while others are regulated indirectly (MYC regulates, directly or indirectly, the genes of other transcription factors which, in turn, regulate a particular gene directly). Myc functions both as a transcription enhancer and as a transcription repressor; moreover, there are indications that there exist regulated switches between "activation" and "repressor" Myc states, depending on the physiological state of a cell. As an enhancer, Myc typically acts by forming a heterodimer with Max, and the Myc/Max dimer binds to the canonical E-box; but Myc is also known to bind to non-canonical motifs, and it is believed to do so through other partners. The Myc/Max dimer does not bind to all canonical E-boxes of the genome, but prefers canonical E-boxes in CpG island regions; moreover, Myc-bound loci are highly acetylated before binding. Enhancement by Myc/Max is antagonized by Max/Mad and Max/Mnt dimers, which bind to the same E-boxes as Myc/Max and inhibit transcription. As a repressor, Myc acts via the Myc/Max dimer forming a complex with Miz1 (and possibly with other proteins as well), which binds near the INR point.
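The distinction between direct and indirect targets can be pictured as reachability in a directed regulation graph. The sketch below is only an illustration under stated assumptions: the network, the gene names `TF_A`, `gene_1`, `gene_2`, `gene_3`, and the function `classify_targets` are hypothetical, and a real analysis would have to infer the edges from data rather than take them as given.

```python
from collections import deque


def classify_targets(regulates, root="MYC"):
    """Split genes reachable from `root` into direct and indirect targets.

    `regulates` maps a regulator to the genes it regulates directly.
    Direct targets are the root's immediate successors; indirect targets
    are reached only through one or more intermediate regulators.
    """
    direct = set(regulates.get(root, ())) - {root}
    seen = {root}
    queue = deque([root])
    # Breadth-first search over the regulation graph.
    while queue:
        node = queue.popleft()
        for target in regulates.get(node, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    indirect = seen - direct - {root}
    return direct, indirect


if __name__ == "__main__":
    # Hypothetical toy network; none of these edges are experimental facts.
    toy_network = {
        "MYC": ["TF_A", "gene_1"],
        "TF_A": ["gene_1", "gene_2"],
        "gene_3": ["gene_2"],
    }
    direct, indirect = classify_targets(toy_network)
    print("direct:", sorted(direct))
    print("indirect:", sorted(indirect))
```

In this toy network `gene_1` is reached both directly and through `TF_A`; the sketch counts it as a direct target, and `gene_3` is not a target at all because nothing downstream of MYC regulates it.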
Finding the genes targeted by Myc, correlating and quantifying the effect of Myc binding on gene expression levels, identifying the crucial targets of Myc, and assigning target genes to cell-cycle and apoptosis pathways are problems of fundamental interest. In our work we study these problems by exploring hierarchical models and employing Bayesian computational algorithms that integrate three types of information or data: (i) Cross-species DNA sequence comparison (especially human and mouse) to identify genome segments that have been conserved by evolution. Such regions typically have a functional role, and MYC binding sites tend to be conserved by evolution; (ii) Chromatin immunoprecipitation array (ChIP-chip) data; this high-throughput technology localizes MYC (or any specific transcription factor) binding sites to within 1000-2000 DNA base pairs; we combine this information with known MYC motifs (E-box) and cross-species comparison information to find potential binding sites for MYC via a Monte Carlo type procedure; (iii) Gene expression microarray data; these data are employed to cluster genes into Myc target genes and genes that are not affected by MYC, as well as to group genes according to their expression profiles over time.
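As a rough illustration of how a known motif and cross-species conservation can narrow down candidate sites inside a region localized by ChIP-chip, the sketch below scans for the canonical E-box CACGTG and keeps only hits present at the same position in an aligned second sequence. This is a deterministic filter swapped in for illustration, not the Monte Carlo type procedure mentioned above; the function names, the toy sequences, and the assumption of a gap-free, equal-length human/mouse alignment are all hypothetical.

```python
EBOX = "CACGTG"  # canonical E-box; palindromic, so one strand suffices


def find_eboxes(seq, motif=EBOX):
    """Return 0-based start positions of exact motif matches in seq."""
    seq = seq.upper()
    width = len(motif)
    return [i for i in range(len(seq) - width + 1) if seq[i:i + width] == motif]


def conserved_eboxes(human, mouse, motif=EBOX):
    """Keep motif hits found at the same index in both aligned sequences.

    Assumes (for illustration) a gap-free alignment of equal length.
    """
    if len(human) != len(mouse):
        raise ValueError("aligned sequences must have equal length")
    mouse_hits = set(find_eboxes(mouse, motif))
    return [i for i in find_eboxes(human, motif) if i in mouse_hits]


if __name__ == "__main__":
    # Made-up sequences standing in for a ChIP-chip localized region.
    human_region = "AACACGTGTTCACGTGAA"
    mouse_region = "AACACGTGTTCACATGAA"
    print(find_eboxes(human_region))                     # all candidate sites
    print(conserved_eboxes(human_region, mouse_region))  # conserved subset
```

A probabilistic treatment would replace the exact-match test with a motif score and weigh conservation and ChIP-chip signal together rather than applying them as hard filters.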
2. Phosphorylation Site Motifs: [section to be completed]
3. Ab Initio Protein Folding: Our program views the ab initio prediction of a protein's 3-D structure as a "coding" problem (often referred to in the biology literature as the "second code" of biology). We believe that the protein folding problem is analogous to the cortical representation (or "code") of languages, objects, scenes, and actions; these representations are believed to be hierarchical/syntactic, or compositional, in the sense pioneered by Chomsky. Proteins exhibit some natural hierarchies: atoms combine to form backbones and side chains; amino acids combine to form secondary structure elements, which, in turn, combine to form the overall tertiary and quaternary structures; moreover, helices have beginnings (N-caps or N-termini), cores, and ends (C-caps or C-termini). Beyond these hierarchies, proteins contain motifs or patterns that are central to their function; these include protein motifs (such as Helix-Turn-Helix or b/HLH/Zip patterns) involved in the recognition of DNA binding sites, and phosphorylation site motifs (on substrates) recognized by kinases and other substrate-binding proteins or molecules. Identifying the repertoire of protein motifs, understanding the rules by which they are concatenated in a protein, and understanding their interactions with DNA or other proteins and molecules are problems far from being solved, and may require new experimental techniques and the design of suitable libraries of "peptides". Our research focuses on the design of appropriate syntactic rules that articulate contextual constraints, such as: secondary structure elements need to be compatible with the hydrophobic core and the hydrophilic exterior of the overall tertiary structure; edge and interior strands in β-sheets have distinct properties; the expected link length between turns depends on the protein class α/α, β/β, α/β (for example, α-helical segments bounded by turns contain twice as many residues as similar β-strand segments).
The underpinning probabilistic model for incorporating these hierarchies is a compositional/syntactic model that contains Chomsky's context-free grammars but is computationally more feasible than context-sensitive grammars. The computational algorithm involves a coarse-to-fine implementation whereby we start with a simplified representation of proteins and proceed to higher and higher resolution representations in which more and more atomic details of protein and solvent are incorporated.
AFFILIATIONS
American Mathematical Society
American Statistical Association
Institute of Mathematical Statistics
Awards
Elected Fellow of the Institute of Mathematical Statistics
National Research Council, Advisory Panel Member for “Spatial Statistics and Image Processing”, Board on Mathematical Sciences