Leonid Chindelevitch and Joao Meidanis. A cubic algorithm for the generalized rank median of three genomes

Abstract: The area of genome rearrangements has given rise to a number of interesting biological, mathematical and algorithmic problems. Among these, one of the most intractable has been finding the median of three genomes, a special case of the ancestral reconstruction problem. In this work we examine a recently proposed way of measuring genome rearrangement distance, namely the rank distance between the matrix representations of the corresponding genomes, and show that, with respect to this distance, the median of three genomes can be computed exactly in polynomial time $O(n^\omega)$, where $\omega \leq 3$, when the median is allowed to be an arbitrary orthogonal matrix.

We define the five fundamental subspaces determined by three input genomes, and use their properties to show that a particular action on each of these subspaces produces a median. In the process we introduce the notion of $M$-stable subspaces. We also show that the median found by our algorithm is always orthogonal and symmetric, and conserves any adjacencies or telomeres present in at least 2 out of the 3 input genomes.

We test our method on both simulated and real data. We find that the majority of realistic inputs result in genomic outputs; for those that do not, our two heuristics perform well in terms of reconstructing a genomic matrix attaining a score close to the lower bound, while running in a reasonable amount of time. We conclude that the rank distance is not only theoretically intriguing, but also practically useful for median-finding, and potentially for ancestral genome reconstruction.
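To make the rank distance concrete, here is a minimal Python sketch, assuming genomes are encoded as symmetric permutation (involution) matrices over gene extremities; the helper names and the toy 4-extremity example are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def rank_distance(a, b):
    """Rank distance between two genome matrices: rank(A - B)."""
    return int(np.linalg.matrix_rank(a - b))

def median_score(m, genomes):
    """Total rank distance from a candidate median M to the inputs."""
    return sum(rank_distance(m, g) for g in genomes)

def genome_matrix(adjacencies, n):
    """Symmetric involution matrix of a genome over n gene extremities:
    paired extremities (adjacencies) swap, telomeres stay fixed."""
    m = np.eye(n)
    for i, j in adjacencies:
        m[i, i] = m[j, j] = 0.0
        m[i, j] = m[j, i] = 1.0
    return m

# Three toy genomes on 4 extremities; the adjacency (0,1) appears in
# all three inputs, (2,3) in two of the three.
g1 = genome_matrix([(0, 1), (2, 3)], 4)
g2 = genome_matrix([(0, 1)], 4)          # extremities 2, 3 are telomeres
g3 = genome_matrix([(0, 1), (2, 3)], 4)
candidate = genome_matrix([(0, 1), (2, 3)], 4)   # the majority genome
```

On this toy input the candidate's score of 1 matches the metric lower bound of half the sum of the pairwise distances, consistent with the abstract's guarantee that adjacencies and telomeres present in at least 2 of 3 inputs are conserved.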

return to ‘Accepted Papers’


Dikshant Pradhan and Mohammed El-Kebir. On the Non-uniqueness of Solutions to the Perfect Phylogeny Mixture Problem

Abstract: Tumors exhibit extensive intra-tumor heterogeneity, i.e. the presence of groups of cellular populations with distinct sets of somatic mutations. This heterogeneity is the result of an evolutionary process, which is described by a phylogenetic tree. The problem of reconstructing a phylogenetic tree T given bulk sequencing data from a tumor is more complicated than the classic phylogeny inference problem: rather than observing the leaves of T directly, we are given mutation frequencies that are the result of mixtures of the leaves of T. The majority of current tumor phylogeny inference methods employ the perfect phylogeny evolutionary model.
In this work, we show that the underlying PERFECT PHYLOGENY MIXTURE combinatorial problem typically has multiple solutions.
We provide a polynomial-time computable upper bound on the number of solutions. We use simulations to identify factors that contribute to and counteract non-uniqueness of solutions. In addition, we study the sampling performance of current methods, identifying significant biases.
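For intuition, in the single-sample case the solutions can be viewed as spanning arborescences of an ancestry graph that satisfy a sum condition on child frequencies, and the count of all arborescences (computable in polynomial time via the directed matrix-tree theorem) upper-bounds the number of solutions. The sketch below illustrates this on a toy instance; the graph construction and all names are our simplifying assumptions, not the paper's code:

```python
import numpy as np
from itertools import product

def ancestry_graph(f):
    """Allowed parent-child pairs: u can be the parent of v only if
    u's frequency is at least v's (single-sample case)."""
    n = len(f)
    return {(u, v) for u in range(n) for v in range(n)
            if u != v and f[u] >= f[v]}

def count_arborescences(edges, n, root):
    """Directed matrix-tree theorem: the number of spanning
    arborescences rooted at `root` is the determinant of the
    in-degree Laplacian with the root row and column removed."""
    lap = np.zeros((n, n))
    for u, v in edges:
        lap[v, v] += 1
        lap[u, v] -= 1
    keep = [i for i in range(n) if i != root]
    return round(np.linalg.det(lap[np.ix_(keep, keep)]))

def count_valid_trees(f, root):
    """Brute force over parent assignments: count spanning
    arborescences whose children's frequencies sum to at most the
    parent's frequency (the sum condition)."""
    n, edges = len(f), ancestry_graph(f)
    nodes = [i for i in range(n) if i != root]
    count = 0
    for parents in product(range(n), repeat=n - 1):
        tree = list(zip(parents, nodes))
        if not all(e in edges for e in tree):
            continue
        par = {v: u for u, v in tree}

        def reaches_root(v):
            # every node must reach the root (rules out cycles)
            seen = set()
            while v != root:
                if v in seen:
                    return False
                seen.add(v)
                v = par[v]
            return True

        if not all(reaches_root(v) for v in nodes):
            continue
        kids = {}
        for u, v in tree:
            kids.setdefault(u, []).append(v)
        if all(sum(f[c] for c in cs) <= f[u] + 1e-9
               for u, cs in kids.items()):
            count += 1
    return count

# Frequencies of three mutations in one sample; mutation 0 founds the tumor.
f = [1.0, 0.5, 0.4]
```

Here both arborescences satisfy the sum condition, so this instance already has two solutions, illustrating the non-uniqueness the paper studies.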



Hussein Hejase, Natalie Vandepol, Gregory Bonito and Kevin Liu. Fast and accurate statistical inference of phylogenetic networks using large-scale genomic sequence data

Abstract: An emerging discovery in phylogenomics is that interspecific gene flow has played a major role in the evolution of many different organisms. To what extent is the Tree of Life not truly a tree reflecting strict “vertical” divergence, but rather a more general graph structure known as a phylogenetic network, which also captures “horizontal” gene flow? The answer to this fundamental question depends not only upon densely sampled and divergent genomic sequence data, but also upon computational methods which are capable of accurately and efficiently inferring phylogenetic networks from large-scale genomic sequence datasets. Recent methodological advances have attempted to address this gap. However, in the 2016 performance study of Hejase and Liu, state-of-the-art methods fell well short of the scalability requirements of existing phylogenomic studies.

The methodological gap remains: how can phylogenetic networks be accurately and efficiently inferred using genomic sequence data involving many dozens or hundreds of taxa? In this study, we address this gap by proposing a new phylogenetic divide-and-conquer method which we call FastNet. We conduct a performance study involving a range of evolutionary scenarios, and we demonstrate that FastNet outperforms state-of-the-art methods in terms of computational efficiency and topological accuracy.



Sebastien Roch. On the variance of internode distance under the multispecies coalescent

Abstract: We consider the problem of estimating species trees from unrooted gene tree topologies in the presence of incomplete lineage sorting, a common phenomenon that creates gene tree heterogeneity in multilocus datasets. One popular class of reconstruction methods in this setting is based on internode distances, i.e. the average graph distance between pairs of species across gene trees. While statistical consistency in the limit of large numbers of loci has been established in some cases, little is known about the sample complexity of such methods. Here we make progress on this question by deriving a lower bound on the worst-case variance of internode distance which depends linearly on the corresponding graph distance in the species tree. We also discuss some algorithmic implications.
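A minimal sketch of the internode distance statistic itself, assuming gene trees are given as adjacency lists (the toy quartet topologies and names below are our own, not the paper's derivation):

```python
from collections import deque
from statistics import mean, pvariance

def graph_distance(adj, a, b):
    """Number of edges on the path between leaves a and b in a tree (BFS)."""
    dist, q = {a: 0}, deque([a])
    while q:
        x = q.popleft()
        if x == b:
            return dist[x]
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                q.append(y)
    raise ValueError("leaves are not connected")

def internode(gene_trees, a, b):
    """Internode distance between species a and b: the mean graph
    distance across gene trees, together with its empirical variance."""
    d = [graph_distance(t, a, b) for t in gene_trees]
    return mean(d), pvariance(d)

# Two unrooted quartet topologies on taxa A, B, C, D with internal
# nodes x, y: ((A,B),(C,D)) and ((A,C),(B,D)).
t1 = {"A": ["x"], "B": ["x"], "C": ["y"], "D": ["y"],
      "x": ["A", "B", "y"], "y": ["C", "D", "x"]}
t2 = {"A": ["x"], "C": ["x"], "B": ["y"], "D": ["y"],
      "x": ["A", "C", "y"], "y": ["B", "D", "x"]}
```

Gene tree heterogeneity (here, two discordant quartets) is exactly what makes the variance of this estimator nonzero.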



Mattéo Delabre, Nadia El-Mabrouk, Katharina Huber, Manuel Lafond, Vincent Moulton, Emmanuel Noutahi and Miguel Sautie Castellanos. Reconstructing the history of syntenies through Super-Reconciliation

Abstract: Classical gene and species tree reconciliation, used to infer the history of gene gain and loss explaining the evolution of gene families, assumes an independent evolution for each family. While this assumption is reasonable for genes that are far apart in the genome, it is clearly not suited for genes grouped in syntenic blocks, which are more plausibly the result of a concerted evolution. Here, we introduce the Super-Reconciliation model, which extends the traditional Duplication-Loss model to the reconciliation of a set of trees, accounting for segmental duplications and losses. From a complexity point of view, we show the associated decision problem is NP-hard. We then give an exact exponential-time algorithm for this problem, test its time efficiency on simulated datasets, and give a proof of concept on the opioid receptor genes.
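For context, the classical Duplication-Loss reconciliation that the Super-Reconciliation model extends can be sketched via the standard LCA mapping; this is textbook DL reconciliation, not the paper's exponential-time algorithm, and the tuple-based tree encoding is our own assumption:

```python
def lca_map(gene_tree, species_parent, leaf_species):
    """Classical LCA reconciliation: map each gene-tree node to the
    lowest species-tree node containing all its leaf species; an
    internal gene node is a duplication iff its mapping coincides
    with the mapping of one of its children."""
    depth = {}

    def d(s):
        if s not in depth:
            p = species_parent[s]
            depth[s] = 0 if p is None else d(p) + 1
        return depth[s]

    def lca(a, b):
        while a != b:
            if d(a) >= d(b):
                a = species_parent[a]
            else:
                b = species_parent[b]
        return a

    mapping, events = {}, {}

    def walk(node):
        if isinstance(node, str):            # leaf: a gene name
            mapping[node] = leaf_species[node]
            return mapping[node]
        left, right = node                   # internal: (left, right) tuple
        m = lca(walk(left), walk(right))
        mapping[node] = m
        events[node] = ("duplication" if m in (mapping[left], mapping[right])
                        else "speciation")
        return m

    walk(gene_tree)
    return mapping, events

# Species tree (A,B) rooted at r; gene tree ((a1,b1),a2) with two genes in A.
gt = (("a1", "b1"), "a2")
mapping, events = lca_map(gt,
                          {"A": "r", "B": "r", "r": None},
                          {"a1": "A", "b1": "B", "a2": "A"})
```

Super-Reconciliation generalizes this per-family picture to a set of trees, with duplications and losses acting on whole syntenic segments rather than single genes.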



Xian Fan, Jie Xu and Luay Nakhleh. Detecting Large Indels Using Optical Map Data

Abstract: Optical Maps (OM) provide very long reads that can be used to detect large indels not detectable by the shorter reads of sequence-based technologies such as Illumina and PacBio. Two existing tools for detecting large indels from OM data are BioNano Solve and OMSV; however, both may miss indels with weak signals. We propose a local-assembly based approach, OMIndel, to detect large indels from OM data. The results of applying OMIndel to empirical data demonstrate that it is able to detect indels with weak signals. Furthermore, compared with the other two OM-based methods, OMIndel has a lower false discovery rate. We also investigated the indels that can only be detected by OM but not by Illumina, PacBio or 10X, and found that they mostly fall into two categories: complex events and indels in repetitive regions. This implies that adding OM data to sequence-based technologies can provide significant progress towards a more complete characterization of structural variants (SVs). The algorithm has been implemented in Perl and is publicly available at https://bitbucket.org/xianfan/optmethod.
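A toy sketch of the underlying idea of calling indels from optical maps by comparing reference fragment sizes against a consensus of aligned molecules; the threshold, the data layout, and the one-to-one fragment matching are simplifying assumptions for illustration, not OMIndel's actual pipeline:

```python
from statistics import median

def call_indels(ref_frags, mol_frags, min_size=2000):
    """Compare each reference fragment (bp between consecutive nicking
    sites) to the consensus fragment over locally assembled molecules;
    size differences beyond `min_size` become indel calls."""
    calls = []
    for i, ref in enumerate(ref_frags):
        consensus = median(m[i] for m in mol_frags)
        delta = consensus - ref
        if delta >= min_size:
            calls.append((i, "insertion", delta))
        elif delta <= -min_size:
            calls.append((i, "deletion", -delta))
    return calls

ref = [10000, 25000, 8000, 30000]
molecules = [
    [10200, 25100, 8100, 24800],   # last fragment ~5 kb short
    [ 9900, 24900, 7900, 25100],
    [10100, 25000, 8050, 25000],
]
```

Taking a median over molecules is what lets a weak but consistent signal survive the sizing noise of individual molecules.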



Erin Molloy and Tandy Warnow. NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees

Abstract: Divide-and-conquer methods, which divide the species set into overlapping subsets, construct trees on the subsets, and then combine the trees using a supertree method, provide a key algorithmic framework for boosting the scalability of phylogeny estimation methods to large datasets. Yet the use of supertree methods, which typically attempt to solve NP-hard optimization problems, limits the scalability of these approaches. In this paper, we present a new divide-and-conquer approach that does not require supertree estimation: we divide the species set into disjoint subsets, construct trees on the subsets, and then combine the trees using an associated distance matrix computed on the full set of species. For this merger step, we present a new method, NJMerge, a polynomial-time extension of the Neighbor-Joining algorithm. We report on the results of an extensive simulation study evaluating NJMerge’s utility in scaling three popular species tree estimation methods: ASTRAL, SVDquartets, and concatenation analysis using RAxML. We find that NJMerge provides substantial improvements in running time without sacrificing accuracy, and sometimes even improves accuracy. Furthermore, although NJMerge can sometimes fail to return a tree, the failure rate in our experiments was less than 1%. Together, these results suggest that NJMerge is a valuable technique for scaling computationally intensive methods to large datasets, and even to moderate-sized datasets when computational resources are limited.
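For reference, the classic Neighbor-Joining agglomeration that NJMerge extends can be sketched as follows; this is plain NJ on a distance matrix, without NJMerge's constraint-tree machinery:

```python
import numpy as np

def neighbor_joining(D, names):
    """Plain Neighbor-Joining: repeatedly join the pair minimizing the
    Q-criterion, replace it by a new node, and shrink the distance
    matrix, until three clusters remain."""
    D = np.asarray(D, dtype=float)
    names = list(names)
    joins = []
    while len(names) > 3:
        n = len(names)
        r = D.sum(axis=1)
        Q = (n - 2) * D - r[:, None] - r[None, :]
        np.fill_diagonal(Q, np.inf)
        i, j = divmod(int(np.argmin(Q)), n)
        joins.append((names[i], names[j]))
        # distance from the merged node to every remaining cluster
        new = 0.5 * (D[i] + D[j] - D[i, j])
        keep = [k for k in range(n) if k not in (i, j)]
        D = np.vstack([np.hstack([D[np.ix_(keep, keep)], new[keep][:, None]]),
                       np.hstack([new[keep], [0.0]])])
        names = [names[k] for k in keep] + [f"({names[i]},{names[j]})"]
    return joins, names

# Additive distances for the quartet ((A,B),(C,D)) with unit branch lengths.
D = [[0, 2, 3, 3],
     [2, 0, 3, 3],
     [3, 3, 0, 2],
     [3, 3, 2, 0]]
```

NJMerge's merger step constrains which joins are allowed so that the subset trees are respected; the agglomeration loop above is the polynomial-time core being extended.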



Mathias Weller. Linear-Time Tree Containment in Phylogenetic Networks

Abstract: We consider the NP-hard Tree Containment problem, which has important applications in phylogenetics. The problem asks if a given single-rooted, leaf-labeled network (“phylogenetic network”) N contains a subdivision of a given leaf-labeled tree (“phylogenetic tree”) T. We develop a fast algorithm for the case that N is a phylogenetic tree in which multiple leaves might share a label. Generalizing a previously known decomposition scheme lets us leverage this algorithm, yielding linear-time algorithms for so-called “reticulation visible” networks and “nearly stable” networks. While these are special classes of networks, they rank among the most general of the previously considered cases. We also present a dynamic programming algorithm that solves the general problem in O(3^t·|N|·|T|) time, where the parameter t is the maximum number of tree components with unstable roots in any block of the input network. Notably, t is stronger (that is, smaller on all networks) than the previously considered parameter “number of reticulations” and even the popular parameter “level” of the input network.
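As an executable statement of the problem (deliberately brute-force and exponential in the number of reticulations, unlike the paper's algorithms), a network displays a tree iff keeping one incoming edge per reticulation, pruning dead branches, and suppressing degree-2 nodes yields that tree for some choice; the dict-of-children encoding below is our own assumption:

```python
from itertools import product

def displays(children, root, tree):
    """Brute-force Tree Containment check on a rooted network given as
    {node: [children]}; `tree` is a nested frozenset of leaf labels."""
    parents = {}
    for u, cs in children.items():
        for c in cs:
            parents.setdefault(c, []).append(u)
    retics = [v for v, ps in parents.items() if len(ps) > 1]

    def shape(node, kids):
        sub = [shape(c, kids) for c in kids.get(node, [])]
        sub = [s for s in sub if s is not None]
        if not children.get(node):
            return node            # original leaf: return its label
        if not sub:
            return None            # dead internal branch: prune it
        if len(sub) == 1:
            return sub[0]          # suppress degree-2 node
        return frozenset(sub)      # unordered children

    # try every way of keeping exactly one incoming edge per reticulation
    for choice in product(*(parents[r] for r in retics)):
        kept = dict(zip(retics, choice))
        kids = {u: [c for c in cs if c not in kept or kept[c] == u]
                for u, cs in children.items()}
        if shape(root, kids) == tree:
            return True
    return False

# One-reticulation network: leaf C is reachable through u or through v.
net = {"rho": ["u", "v"], "u": ["A", "r"], "v": ["r", "B"], "r": ["C"]}
```

This network displays ((A,C),B) and ((B,C),A) but not ((A,B),C), which is exactly the kind of query the linear-time algorithms answer without the exponential enumeration.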



Pijus Simonaitis, Annie Chateau and Krister Swenson. A General Framework for Genome Rearrangement with Biological Constraints

Abstract: This paper generalizes previous studies on genome rearrangement under biological constraints, using double cut and join (DCJ). We propose a model for weighted DCJ, along with a family of optimization problems called φ-MCPS (Minimum Cost Parsimonious Scenario), based on edge-labeled graphs. After embedding known results in our framework, we show how to compute solutions to general instances of φ-MCPS, given an algorithm to compute φ-MCPS on a circular genome with exactly one occurrence of each gene. These general instances can have an arbitrary number of circular and linear chromosomes, and arbitrary gene content. The practicality of the framework is demonstrated by generalizing the results of Bulteau, Fertin, and Tannier on the “Sorting by wDCJs and indels in intergenes” problem, and by generalizing previous results on the Minimum Local Parsimonious Scenario problem.
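For background, the unweighted DCJ distance underlying φ-MCPS can be sketched for two circular genomes with equal gene content: it equals n minus the number of cycles in the adjacency graph over gene extremities (a standard result, not the paper's weighted framework; the string encoding of extremities is our own convention):

```python
def dcj_distance(adjA, adjB, genes):
    """DCJ distance between two circular genomes with equal gene
    content: n - c, where c counts cycles in the graph whose edges
    are the adjacencies of A and of B over gene extremities
    (head 'h' and tail 't' of each gene)."""
    nbrs = {}
    for adj in (adjA, adjB):
        for x, y in adj:
            nbrs.setdefault(x, []).append(y)
            nbrs.setdefault(y, []).append(x)
    seen, cycles = set(), 0
    for start in nbrs:                 # every vertex has degree 2,
        if start in seen:              # so components are cycles
            continue
        cycles += 1
        stack = [start]
        while stack:
            x = stack.pop()
            if x in seen:
                continue
            seen.add(x)
            stack.extend(nbrs[x])
    return len(genes) - cycles

# Circular genome (1 2 3) versus (1 -2 3): one inversion apart.
A = [("1h", "2t"), ("2h", "3t"), ("3h", "1t")]
B = [("1h", "2h"), ("2t", "3t"), ("3h", "1t")]
```

φ-MCPS replaces the uniform cost of each DCJ operation here with a cost function φ on labeled operations, which is what makes the weighted problems non-trivial.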



Md Abid Hasan and Stefano Lonardi. mClass: Cancer type classification with somatic point mutation data

Abstract: Cancer is a complex disease associated with abnormal DNA mutations. Not all tumors are cancerous and not all cancers are the same. Correct cancer type diagnosis can indicate the most effective drug therapy and increase survival rates. At the molecular level, it has been shown that cancer type classification can be carried out from the analysis of somatic point mutations. However, the high dimensionality and sparsity of genomic mutation data, coupled with its small sample size, have hindered accurate classification of cancer. We address these problems by introducing a novel classification method called mClass that accounts for the sparsity of the data. mClass is a feature selection method that ranks genes based on their similarity across samples and employs their normalized mutual information to determine the set of genes that provide optimal classification accuracy. Experimental results on TCGA datasets show that mClass significantly improves testing accuracy compared to DeepGene, the state of the art in cancer-type classification based on somatic mutation data. In addition, when compared with other cancer gene prediction tools, the set of genes selected by mClass contains the highest number of genes among the top 100 genes listed in the Cancer Gene Census.
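A simplified stand-in for the normalized-mutual-information ranking idea, using only the standard library; the scoring and toy data below are our own illustration, not mClass's actual feature-selection procedure:

```python
from math import log2
from collections import Counter

def entropy(xs):
    """Shannon entropy (bits) of the empirical distribution of xs."""
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in Counter(xs).values())

def normalized_mi(x, y):
    """Normalized mutual information between a gene's mutation profile
    x and the cancer-type labels y: I(X;Y) / max(H(X), H(Y))."""
    hx, hy = entropy(x), entropy(y)
    mi = hx + hy - entropy(list(zip(x, y)))
    denom = max(hx, hy)
    return mi / denom if denom else 0.0

def rank_genes(mutations, labels):
    """Rank genes by NMI between mutation status and cancer type."""
    return sorted(mutations,
                  key=lambda g: normalized_mi(mutations[g], labels),
                  reverse=True)

# Toy data: 6 samples, 2 cancer types; geneA mutates only in type 1.
labels = [1, 1, 1, 2, 2, 2]
mutations = {
    "geneA": [1, 1, 1, 0, 0, 0],   # perfectly informative
    "geneB": [1, 0, 1, 0, 1, 0],   # uninformative
}
```

On sparse binary mutation matrices this kind of information-based ranking is what lets a handful of genes out of thousands drive the classifier.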
