Research Projects

 

The Neighbor-Net Algorithm

First described by Bryant and Moulton in 2004,  the neighbor-net algorithm constructs tree-like networks called circular split systems.  Essentially, a circular split system is a circular ordering on the elements in your network.  Once the ordering is determined, the only splits allowed are ones that can divide the circular ordering cleanly into two pieces.  Like neighbor joining, neighbor-net is an agglomerate distance based method, using a cherry-picking function that is nearly identical to that of neighbor-joining.  Pictured above  is a circular splits network on Papuan and Austronesian languages built by SplitsTree from a distance matrix of grammatical dissimilarity. 

There is a simple characterization of metrics that come from circular split systems.  First, we require that the distance matrix be a metric and second, that it satisfy the Kalmanson condition.  The Kalmanson condition says that of the three ways to connect four elements with two edges, the one where the edges cross has the longest total length.  We were able to show that if a distance matrix satisfies the Kalmanson condition, even if it is not a metric, an application of neighbor-net will return the correct circular ordering.  That is, neighbor-net is consistent and robust.

Also, it has been shown that neighbor-joining is a greedy minimization of the balanced minimum evolution criteria.  The extension of this idea to circular splits metrics shows that for a particular choice of distance update, neighbor-net will perform a greedy maximization of the traveling salesman problem.  We were also able to identify geometric and combinatorial properties of the space of networks explored by the algorithm. 

 

Neighbor Joining Explained

First described in 1987 by Saitou and Nei, neighbor-joining is a widely popular clustering method for reconstructing phylogenetic trees.  One reason for its popularity is that neighbor joining is very fast.  Another reason is that neighbor joining is surprisingly accurate.

The mostly widely cited reason for the success of neighbor joining is a consistency proof due to Kevin Atteson.   Atteson's theorem  states that if the approximated distance matrix differs from the true distance matrix by less than half the length of the shortest edge of the tree, then neighbor joining will return the correct tree.  What's important about this condition is that if the method for generating distances is consistent, then so is neighbor joining.  If adding more information gives better distances, then given enough information, neighbor-joining will return the correct tree.

The downside is that Atteson's condition is too  restrictive.  Specifically, it excludes deformations that only involve modifying the lengths of edges, deformations that preserve the topology of the tree.  If you double or halve any single edge of the tree, then the resulting distance matrix will violate Atteson's condition, but neighbor-joining will still return the correct topology.

It turns out that it is not distances, but differences of distances that are important for determining the success of neighbor-joining.  We discovered two conditions on distances from four and six leaf subtrees that, if satisfied, assure the success of neighbor joining.  In practical tests, we found that our conditions outperformed Atteson by an order of magnitude in the length of the sequence.

 

Phylogenetic Trees

To reconstruct a phylogenetic tree from information at the leaves, you can employ one of two methods.  You may choose to consider every possible tree, compute the probability of seeing the information at the leaves, and find the tree that maximizes this probability.  These algorithms are the maximum likelihood methods.  The other possibility is to consider each pair of leaves, compute a distance between them, and then use those distances to derive a tree.  There are the distance based methods.

While maximum likelihood methods are very accurate, they are computationally prohibitive in cases where there are a large number of taxa.  Distance based methods, on the other hand, are exceptionally fast; however, their results are less reliable than MLE methods.

Our Generalized Neighbor Joining algorithm bridges the gap between these two approaches.  Instead of looking at pairs of leaves and the distance between them (the orange path between i and j in the figure above), GNJ considers subsets of leaves of size m.  We use a maximum likelihood method to reconstruct the subtree that spans those leaves (in the figure this is the blue subtree where m = 7).  We then compute the dissimilarity of those m leaves by taking the sum of the edge lengths in the subtree.  The GNJ algorithm then compute the original tree from these dissimilarity values in polynomial time.

 

Radiation Cytogenetics

Radiation cytogenetics is the study of chromosomal aberrations caused by radiation exposure.  To study such aberrations, biologists employ a variety of staining techniques, such as mFISH (pictured here to the right).  The Chromosome Aberration Analyzer (CAA) applies a set of formal graph theoretic techniques to systematically analyze mFISH patterns.  For details, you can go to the radiobiology website or, you may run the applet directly and take your chances.  

We are at work on a new, not yet ready for primetime, Java version of the Chromosome Aberration Simulator (CAS).  The purpose of CAS is to simulate radiation events, cellular repair mechanisms, and chromosome painting protocols.  In this way, we can perform an entire radiogenic experiment inside a computer processor.  The new Java version will increase the flexibility and range of possible experiments and provide a user friendly graphical interface for various levels of user.    The original FORTRAN version of CAS is available here.