Student Algebraic Statistics Seminar

The Student Algebraic Statistics Seminar meets Mondays, 12pm-1pm, in 939 Evans. This is an informal, working seminar, and most speakers present work in progress. It is organized by myself (Jason Morton) and Bernd Sturmfels. If you would like to register for the seminar, please use CCN 54997.



Schedule

May 6, 2007. Ruchira Datta, U.C. Berkeley. Applying High-Dimensional Clustering Methods for Phylogenetic Profiling

The phylogenetic profile of a gene is the set of species in which it, or one of its orthologs, occurs. We would like to cluster together genes using their phylogenetic profiles, in the hope that genes with similar phylogenetic profiles may have similar functions. In this talk we explain a few methods for clustering high-dimensional data. We describe how to use the Jaccard measure to compute a similarity matrix between genes, and then perform hierarchical agglomerative clustering. This gives us a forest from which we can read off clusters at the desired level of granularity. Since this technique is computationally expensive, we also discuss faster methods based on similarity hashing (simhash). Time permitting, we will discuss how to incorporate a phylogenetic tree of the species into the design of the simhash, yielding a computationally inexpensive, biologically informative clustering method. This is work in progress, joint with Jonathan Eisen and Amber Hartman of UC Davis.


10am, Wednesday, Mar. 28, 2007 in 891 Evans. Lior Pachter, U.C. Berkeley. From Drosophila and Transposable Elements to Phylogenetic Networks.

We begin with an overview of the Drosophila genome project, whose goal is the sequencing and comparison of 12 fruit fly genomes. In particular, we describe some of the dynamic behavior of transposable elements. These are self-replicating sequences that play a major role in shaping the structure and function of genomes. Our methods for studying transposable elements lead naturally to the analysis of split systems and their associated phylogenetic networks. We discuss various aspects of the neighbor-net algorithm, including its interpretation as a greedy algorithm for the traveling-salesman problem, how to obtain statistically meaningful parameters, and how to interpret its output. The application of neighbor-net to the split system we obtain from transposable elements in Drosophila reveals interesting insights about a set of species that may have undergone lineage sorting. This is joint work with Anat Caspi and Dan Levy.


Mar. 19, 2007. Serkan Hosten, SFSU. Extended UPGMA for Equidistant Tree Reconstruction.

We study the UPGMA algorithm for equidistant phylogenetic tree reconstruction using the geometry of the space of all equidistant trees with $n$ leaves, also known as the Bergman complex of the graphical matroid for the complete graph $K_n$. We show that this classic algorithm performs an orthogonal projection of the data (empirical distance data among $n$ taxa) onto a distinguished cell of the Bergman complex. We also show that the equidistant tree with the least (Euclidean) distance from the data is obtained from such an orthogonal projection, but not necessarily given by UPGMA algorithm. We give two extensions of this algorithm which use the geometry of the Bergman complex.


Mar. 12, 2007. Dustin Cartwright, U.C. Berkeley. Physical and Genetic Mapping.

I will present work on two problems in computational biology: fingerprint-based physical mapping and genetic mapping. The former consists of determining the arrangement of DNA fragments based on fingerprints which give information about the pairwise overlaps. The latter uses correlations between genetic markers in families to determine their positions on the chromosomes. While these problems have practical motivations, I hope to highlight their connections to algebraic statistics and combinatorics.


Feb. 26, 2007. Bernd Sturmfels, U.C. Berkeley. Open problems in algebraic statistics.

This talk introduces five or six mathematical problems whose solution would likely be a significant contribution to the emerging interactions between algebraic geometry, statistics, and computational biology.


Feb. 12, 2007. Peter Huggins, U.C. Berkeley. Genotopes, archetypes, and clustering.

In this talk I'll give a short tour of a series of recent projects that marry all of the topics in the title. Genotopes are polytopes defined by genotypes, and triangulations of genotopes generalize epistasis to systems of more than two interacting genes. Points in the genotope are related to populations. Archetypes, being the analog of PCA components inside polytopes, are thus well-suited for finding representative populations inside the genotope. Archetypes may also offer advantages for clustering applications in general; I'll conclude the talk with a snapshot of an ongoing project to apply archetypal analysis to clustering applications in breast cancer research.


Feb. 5, 2007. Yuan Yao, Stanford University. Metric Learning for Phylogenetic Invariants.

We report some exploratory studies on a new method for model selection by combining algebraic statistics with machine learning. Algebraic statistics is concerned with the polynomials which vanish on a statistical model. One major hurdle in the practical use of these polynomials is our lack of understanding of their statistical properties. We avoid this problem by generating simulated data and using metric learning to find an optimal combination of these polynomials which distinguishes different models. Such a combination, represented as a metric on a subspace of invariants, can be found using semidefinite programming techniques, which try to minimize the norm of the points corresponding to the correct model while maximizing the norm of the incorrect models. Our running example is constructing a phylogenetic tree on four taxa under the Jukes-Cantor model of DNA evolution. Our method performs much better than any previous invariant-based method and is competitive or better than neighbor-joining.


Jan. 29, 2007. Bernd Sturmfels, U.C. Berkeley. Can Biology Lead to New Theorems?

This lecture argues for an affirmative answer to the question in the title. In future interactions between mathematics and biology, both fields will contribute to each other, and, in particular, research in the life sciences will inspire new theorems in pure mathematics. This opinion is illustrated by a snapshot of four recent contributions from biology to geometry, combinatorics and algebra.


Jan. 22, 2007. Anne Shiu, U.C. Berkeley. Geometry of Rank Tests.

Convex rank tests are coarsenings of the hyperplane arrangement associated to the symmetric group. Semigraphoids are combinatorial structures corresponding to squares and hexagons on the permutohedron. We prove that these objects - convex rank tests and semigraphoids - are equivalent and use this result to answer a question posed by Postnikov, Reiner, and Williams. This work originated in collaboration with the Pourquie lab at the Stowers Institute in Kansas City. We describe the experiments conducted to identify the molecular components of biological clocks, and explain how the convex rank test called the cyclohedron test contributes to time course microarray data analysis.



Past Semesters

Dec. 11, 2006. Ruchira Datta, U.C. Berkeley. Polynomial Graphs with Applications to Game Theory.

In this talk we introduce a set of conditions a system of polynomial equations in several variables may satisfy, which are encoded in an associated graph, the polynomial graph. We explain a theorem describing the number of solutions to the system in this case, and show how this can be applied to graphical games. Time permitting, we will also describe emergent node tree structures (ENT structures) on games, a new model for games in which the players can be hierarchically decomposed into groups, along with a new solution concept for these games which refines Nash equilibria. We describe how the theorem applies to games with ENT structure.


Dec. 4, 2006. Josephine Yu, U.C. Berkeley. QEPCAD: Quantifier Elimination by Partial Cylindrical Algebraic Decomposition.

The quantifier elimination problem (QE) is the problem of transforming a quantified boolean expression of polynomial equations and inequalities (with integer coefficients) into an equivalent expression without quantifiers. The partial cylindrical algebraic decomposition (PCAD) is an algorithm for solving this problem. I will talk about how this algorithm works and how to use the software QEPCAD.


Nov. 27, 2006. Aubrey Clayton, U.C. Berkeley. Mutation/Selection Balance and Chemical Reaction Networks.

Two opposing forces in population dynamics are genetic mutation and natural selection. Under certain assumptions, a model for the interaction of these can be derived, in which the number of copies of a certain mutation present in a randomly selected individual is a Poisson random variable, and the means of these random variables satisfy a system of ODE's. Finding equilibria for this dynamical system involves solving a system of polynomial equations in the first orthant. By coincidence, the same kinds of polynomial equations arise in the study of chemical reaction networks, where tools exist for determining whether a system has a unique equilibrium point. I will briefly describe the model and illustrate how to use these tools.


Nov. 20, 2006. Nicholas Eriksson, MSRI/Stanford. Conjunctive Bayesian Networks.

Conjunctive Bayesian networks (CBNs) are graphical models that describe the accumulation of events which are constrained in the order of their occurrence. A CBN is given by a partial order on a (finite) set of events. CBNs generalize the oncogenetic tree models of Desper et al. by allowing the occurrence of an event to depend on more than one predecessor event. I will present a combinatorial solution to the model selection problem for CBNs, discuss how to deal with noisy data, and analyze two datasets where the events are HIV mutations associated with drug resistance. Finally, I will explain why CBNs are nice from the standpoint of algebraic statistics (the phrase ``a CBN is a toric variety with a quadratic Gr\"obner basis'' will be explained). This is joint work with Niko Beerenwinkel and Bernd Sturmfels.


Nov. 13, 2006. Jason Morton, U.C. Berkeley. The toric geometry of conditional probability.

Consider a probability space with $n+1$ disjoint events. The compactification of the space of possible probability distributions on these events is $\mathbb{P}^{n}$, which is the toric variety associated to the $n$-dimensional simplex (the probability simplex). Similarly, compactification of the space of conditional probability distributions defines a toric variety that has a natural embedding in a product of projective spaces and a nice Gr\"obner basis. This toric variety is a blowup of $\mathbb{P}^n$ and has a permutohedron or generalized permutohedron as its polytope.


Nov. 6, 2006. Debbie Yuster, Columbia University. Classifying disease models using regular polyhedral subdivisions.

Genes play a complicated role in how likely one is to get a certain disease. Biologists would like to model how one's genotype affects their likelihood of illness. I will discuss some classical disease models, as well as our new classification of possible disease models, where each model corresponds to an induced subdivision of a point configuration (basically a picture of connected dots). No background in biology or polyhedral subdivisions required. This work is joint with Inga Hallgrimsdottir.