Research Overview

We are working on the fundamental problem of comparative genomics: the determination of the origins and evolutionary history of the nucleotides in all extant genomes. Our work incorporates various aspects of genomics, including the reconstruction of ancestral genomes (paleogenomics), the modeling of genome dynamics (phylogenomics and systems biology) and the assignment of function to genome elements (functional genomics). More detail on some of our projects is provided below.

In addition to working on methodology and mathematical foundations for comparative genomics, we also work with genome projects and perform large-scale computational analyses. We have been members of the mouse, rat, chicken and fly genome sequencing consortia, and have participated in the ENCODE consortium.


Alignment Methods

We have been working on algorithms for genome alignment since 1997, when we introduced the first global alignment program for long genomic sequences, followed by AVID (the alignment program used by VISTA), and then MAVID for multiple alignment. MAVID is based on progressive alignment and uses constrained ancestral alignment to obtain high accuracy and speed. The program can be accessed through a website, and the source code is also freely available for download.

In a departure from progressive alignment methods, we have recently introduced a new approach to sequence alignment we call sequence annealing. AMAP is a multiple alignment program based on this approach, and is also available for use via a webserver or directly using the source code. There is also a YouTube demonstration (set to Prokofiev music).
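As background for the global alignment programs mentioned above, the following is a minimal sketch of the textbook dynamic program for scoring a global alignment of two sequences (Needleman-Wunsch). It is not the algorithm used by AVID, MAVID or AMAP; the function name and the match, mismatch and gap scores are assumptions chosen only for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of strings a and b.

    Textbook Needleman-Wunsch with a linear gap penalty. The scoring
    values are illustrative assumptions, not taken from any program
    described in the text.
    """
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "AGT"))
```

This quadratic-time recursion is practical only for short sequences, which is one reason alignment of long genomic sequences calls for specialized methods.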

Algebraic Statistics for Computational Biology

The quantitative analysis of biological sequence data is based on methods from statistics coupled with efficient algorithms from computer science. Algebra provides a framework for unifying many of the seemingly disparate techniques used by computational biologists. Together with Bernd Sturmfels, we have written and edited a book that offers an introduction to this mathematical framework and describes tools from computational algebra for designing new algorithms for exact, accurate results. These algorithms can be applied to biological problems such as aligning genomes, finding genes and constructing phylogenies. Since the publication of the book, a number of conjectures we proposed have been solved, and numerous results have been extended.

Genome Dynamics and Regulation

We are interested in the function and evolution of elements involved in the regulation of genomes. These include transcription factors and binding sites, microRNAs and their targets, transposable elements, etc. We also study the role of ultra-conserved elements.

In a paper together with Eddy Rubin, we introduced phylogenetic shadowing and argued that primates are suitable (and desirable) for studying regulatory elements in humans. The related generalized hidden Markov phylogeny provides a graphical model framework for shadowing with distinct functional elements.

Together with Mary-Lee Dequéant, Olivier Pourquie and Bernd Sturmfels, we are investigating approaches for identifying genes with periodic expression that are involved in the somitogenesis clock of vertebrates.

Inspired by numerous recent results indicating that insertions and deletions play a major role in genome evolution, we have been developing methods for studying the entire range of indel events, from micro-indels to transposable elements.

Gene Finding

We have developed two software programs for gene finding. SLAM is a comparative gene finder that performs gene finding and alignment simultaneously. Inference is performed using a generalized pair hidden Markov model. SLAM was used to annotate the human, mouse and rat genomes, and the resulting predictions are being used in the Affymetrix GeneChip Human Exon 1.0 ST Array.

The program GeneMapper is suitable for reference-based gene annotation where a well-annotated finished genome is used to annotate a newly sequenced genome. It has been used to annotate the Drosophila genomes, as well as the chicken and other vertebrate genomes.

Metagenomics

Metagenomics is the application of modern genomics techniques to the study of communities of microbial organisms directly in their natural environments, bypassing the need for isolation and lab cultivation of individual species. We are interested in bioinformatics problems inspired by metagenomics challenges, and have recently been working on viral population estimation using pyrosequencing.

Phylogenetics

Our interest in phylogenetics stems from its connections to genome alignment and comparative genomics. We are interested in the development of evolutionary models for functional elements, reconstruction algorithms for large numbers of taxa, and whole genome phylogenetics.

In a recent paper we have tried to answer the question: why does neighbor joining work? Our main result is (roughly) that if neighbor joining works locally then it succeeds globally. Our analyses provide a theoretical explanation for the empirically observed success (and therefore widespread use) of the neighbor joining algorithm, and open the door for the development of statistically sound quartet methods in phylogenetics. They have also allowed us to settle in the affirmative a conjecture of Atteson (from 1998) on the edge radius of the neighbor joining algorithm.
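For readers unfamiliar with the algorithm under discussion, here is a minimal sketch of standard neighbor joining (the Saitou-Nei procedure with the usual Q-criterion). It illustrates the algorithm itself, not the analysis in the paper; the function name, the input format and the internal node labels are assumptions made for this example.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Build an unrooted tree from pairwise distances by neighbor joining.

    names: list of leaf labels.
    dist:  dict mapping frozenset({x, y}) to the distance between x and y.
    Returns a list of (node, node, branch_length) edges. Internal nodes
    are given hypothetical labels "internal0", "internal1", ...
    """
    d = dict(dist)  # working copy, extended with distances to new nodes
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Row sums of the current distance matrix.
        total = {x: sum(d[frozenset((x, y))] for y in nodes if y != x)
                 for x in nodes}

        def q(pair):
            # Q-criterion: the pair minimizing this value is joined.
            x, y = pair
            return (n - 2) * d[frozenset(pair)] - total[x] - total[y]

        a, b = min(combinations(nodes, 2), key=q)
        dab = d[frozenset((a, b))]
        # Branch lengths from the joined pair to the new internal node.
        len_a = 0.5 * dab + (total[a] - total[b]) / (2 * (n - 2))
        len_b = dab - len_a
        u = f"internal{next_id}"
        next_id += 1
        edges.append((a, u, len_a))
        edges.append((b, u, len_b))
        # Distances from the new node to every remaining node.
        for x in nodes:
            if x != a and x != b:
                d[frozenset((u, x))] = 0.5 * (
                    d[frozenset((a, x))] + d[frozenset((b, x))] - dab
                )
        nodes = [x for x in nodes if x != a and x != b] + [u]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges


# Distances generated by a four-leaf tree with leaf branches
# A=1, B=2, C=3, D=4 and an internal branch of length 5.
pairwise = {("A", "B"): 3, ("A", "C"): 9, ("A", "D"): 10,
            ("B", "C"): 10, ("B", "D"): 11, ("C", "D"): 7}
example_dist = {frozenset(k): v for k, v in pairwise.items()}
for edge in neighbor_joining(["A", "B", "C", "D"], example_dist):
    print(edge)
```

Because the example distances are exactly additive on a tree, the sketch recovers that tree and its branch lengths. The question addressed in the text is how the algorithm behaves when the input distances are only approximately tree-like.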

We have also been interested in the use of phylogenetic diversity estimates, both in terms of their relevance for sequencing strategies, and for phylogenetic reconstruction.
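As a small illustration of the quantity involved, the phylogenetic diversity of a set of taxa is commonly defined as the total branch length of the smallest subtree connecting them. The sketch below computes it on a tree given as an edge list; the function name, the input format and the example tree are assumptions for illustration, not taken from the work described here.

```python
from collections import defaultdict


def phylogenetic_diversity(edges, taxa):
    """Total branch length of the minimal subtree connecting the given taxa.

    edges: list of (node, node, branch_length) tuples describing a tree.
    taxa:  iterable of node labels. Returns 0.0 for fewer than two taxa.
    """
    taxa = set(taxa)
    if len(taxa) < 2:
        return 0.0
    adjacent = defaultdict(list)
    for u, v, length in edges:
        adjacent[u].append((v, length))
        adjacent[v].append((u, length))

    total = 0.0

    def count_below(node, parent):
        """Count selected taxa in the subtree hanging below node."""
        nonlocal total
        count = 1 if node in taxa else 0
        for child, length in adjacent[node]:
            if child == parent:
                continue
            below = count_below(child, node)
            # An edge belongs to the connecting subtree exactly when
            # selected taxa lie on both sides of it.
            if 0 < below < len(taxa):
                total += length
            count += below
        return count

    # Root the traversal at one of the selected taxa.
    count_below(next(iter(taxa)), None)
    return total


# Hypothetical four-leaf tree with internal nodes u and v.
example_tree = [("A", "u", 1.0), ("B", "u", 2.0),
                ("C", "v", 3.0), ("D", "v", 4.0), ("u", "v", 5.0)]
print(phylogenetic_diversity(example_tree, ["A", "C"]))
```

The recursive traversal is adequate for small trees; a tree with many thousands of taxa would call for an iterative traversal instead.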

Population Genetics

As part of the program on Fundamental Laws in Biology, we are working with Niko Beerenwinkel, Bernd Sturmfels and Richard Lenski on the study of fitness landscapes and epistasis. In the paper "Epistasis and the shapes of fitness landscapes," we provide a geometric framework for describing epistatic interactions, and bridge the gap between discrete genotype spaces and continuous fitness landscapes. We have used this approach to analyze a fitness landscape of Escherichia coli (obtained by Elena and Lenski) and find a strong correlation between epistasis and the average fitness loss caused by deleterious mutations.
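The simplest instance of this geometric view can be sketched for two biallelic loci, where the four genotypes form the corners of a square. The standard two-locus epistasis is u = w00 + w11 - w01 - w10, and its sign determines which diagonal of the square lies higher when each corner is lifted to the height of its fitness. The code below is a minimal illustration under these assumptions; the function names and fitness values are hypothetical and are not data from the study.

```python
def two_locus_epistasis(w):
    """Return u = w00 + w11 - w01 - w10 for a two-locus biallelic system.

    w: dict mapping the genotypes "00", "01", "10", "11" to fitness values.
    """
    return w["00"] + w["11"] - w["01"] - w["10"]


def higher_diagonal(w):
    """Return the diagonal of the genotype square that lies higher.

    The midpoint of diagonal 00-11 has height (w00 + w11) / 2 and the
    midpoint of 01-10 has height (w01 + w10) / 2, so the sign of u picks
    the higher one. Returns None when u is zero (no epistasis).
    """
    u = two_locus_epistasis(w)
    if u > 0:
        return ("00", "11")
    if u < 0:
        return ("01", "10")
    return None


# Hypothetical fitness values: the double mutant is less fit than the
# single-mutant effects alone would suggest.
fitness = {"00": 10, "01": 9, "10": 8, "11": 6}
print(two_locus_epistasis(fitness), higher_diagonal(fitness))
```

With more loci there are many more ways to subdivide the genotype space, which is where a systematic geometric framework becomes necessary.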

We have recently been studying low-dimensional projections of the human genotope, with a view towards organizing SNP data (obtained from the HapMap project) for analysis of interaction.

Whole Genome Alignment

We are working together with Colin Dewey on methods for whole genome alignment at the nucleotide level. Dewey's Mercator homology mapping program has served as the basis for numerous whole genome alignments, including vertebrate alignments (where it was used for the ENCODE project sequence freezes), fly alignments and worm alignments (subsequently used to establish a genome-wide map of conserved microRNA targets).

We have also spearheaded the application of parametric alignment methods for whole genome alignment.