There was a message on the excellent EvolDir mailing list a few days ago about FastTree. This is a very fast neighbor-joining program for very large scale phylogenetic analyses. It uses profiles rather than a distance matrix and includes local support values instead of bootstraps. The examples in the preprint manuscript talk about datasets of from 10,610 to 167,547 aligned sequences. This 167k sequence alignment has 7,682 columns and the tree took 95 hours. The 10k dataset took 3 minutes! The preprint manuscript downloadable from the website below is well written and informative. I’m looking forward to testing this, hopefully in the next few days. I’ll post with my experiences. Below is the announcement.

———————————————————————

We are pleased to announce the initial release of FastTree, a tool for inferring neighbor joining trees from large alignments. FastTree is capable of computing trees for tens to hundreds of thousands of protein or nucleotide sequences on most desktop computers.

FastTree uses:

*profiles instead of a distance matrix to reduce memory usage

*linear distances with a character dissimilarity matrix

*a new “top hit” heuristic to achieve a sub N-squared run time

*local support instead of bootstrap for node support values

FastTree is faster than computing a distance matrix, and up to 10,000 times faster than neighbor joining with bootstrap. FastTree is about as accurate as BIONJ with log corrected distances for well-supported nodes.

To download source code, binaries, or a preprint, please visit:

http://www.microbesonline.org/fasttree/

Abstract

Background: A fundamental goal of molecular evolution is to infer the evolutionary history the phylogeny of sequences from their alignment. Neighbor joining, which is a standard method for inferring large phylogenies, takes as its input the distances between all pairs of sequences. The distance matrix requires O(N^2 L) time to compute and O(N^2) memory to store, where N is the number of sequences and L is the width of the alignment. As some families already contain over 100,000 sequences, these time and space requirements are prohibitive.

Results: We show that neighbor-joining can be implemented in O(NLa) space, where ‘a’ is the size of the alphabet, by storing profiles of internal nodes in the tree instead of storing a distance matrix. Profile based neighbor joining allows weighted joins, as in BIONJ, but requires that distances be linear. With heuristic search, neighbor joining with profiles takes only O(N*SQRT(N) log(N)La) time. We estimate the confidence of each split (A,B) vs. (C,D) from the profiles of A, B, C, and D, without bootstrapping. Our implementation, FastTree, has similar accuracy as traditional neighbor joining. FastTree constructed trees, including support values, for biological alignments with 39,092 or 167,547 distinct sequences in less time than it takes to compute the distance matrix and in a fraction of the space. Traditional neighbor joining with 100 bootstraps would be 10,000 times slower.

Conclusions: Neighbor joining with profiles makes it possible to construct phylogenies for the largest sequence families and to estimate their reliability.

Morgan N. Price & Paramvir S. Dehal

fasttree@microbesonline.org

Virtual Institute for Microbial Stress and Survival

Arkin Lab

Physical Biosciences Division

Lawrence Berkeley National Lab

psdehal@gmail.com

———————————————————————