Neighbor
Joining Explained
First described in 1987
by Saitou and Nei, neighbor-joining is a widely popular clustering
method for reconstructing phylogenetic trees. One reason for its
popularity is that neighbor joining is very fast. Another reason is
that neighbor joining is surprisingly accurate.
The mostly widely cited reason for the success of neighbor joining is
a consistency proof due to Kevin
Atteson. Atteson's theorem states that if the
approximated distance matrix differs from the true distance matrix by
less than half the length of the shortest edge of the tree, then
neighbor joining will return the correct tree. What's important
about this condition is that if the method for generating distances is
consistent, then so is neighbor joining. If adding more
information gives better distances, then given enough information,
neighbor-joining will return the correct tree.
The downside is that Atteson's condition is too
restrictive. Specifically, it excludes deformations that only
involve modifying the lengths of edges, deformations that preserve
the topology of the tree. If you double or halve any single edge
of the tree, then the resulting distance matrix will violate Atteson's
condition, but neighbor-joining will still return the correct topology.
It turns out that it is not distances, but differences of distances
that are important for determining the success of
neighbor-joining. We discovered two conditions on distances from
four and six leaf subtrees that, if satisfied, assure the success of
neighbor joining. In practical tests, we found that our conditions
outperformed Atteson by an order of magnitude in the length of the
sequence.