Abstract: Divide-and-conquer methods, which divide the species set into overlapping subsets, construct trees on the subsets, and then combine the trees using a supertree method provide a key algorithmic framework for boosting the scalability of phylogeny estimation methods to large datasets. Yet the use of supertree methods, which typically attempt to solve NP-hard optimization problems, limits the scalability of these approaches. In this paper, we present a new divide-and-conquer approach which does not require supertree estimation: we divide the species set into disjoint subsets, construct trees on the subsets, and then combine the trees using an associated distance matrix computed on the full set of species. For this merger step, we present a new method, NJMerge, which is a polynomial time extension of the Neighbor-Joining algorithm. We report on the results of an extensive simulation study evaluating NJMerge’s utility in scaling three popular species tree estimation methods: ASTRAL, SVDquartets, and concatenation analysis using RAxML. We find that NJMerge provides substantial improvements in running time without sacrificing accuracy and sometimes even improving accuracy. Furthermore, although NJMerge can sometimes fail to return a tree, the failure rate in our experiments is less than 1%. Together, these results suggest that NJMerge is a valuable technique for scaling computationally intensive methods on large datasets and even moderate-sized datasets when computational resources are limited.
NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees