Tricks for trees: Having reconstructed phylogenies, what can we do with them?
Mike Steel
Allan Wilson Centre for Molecular Ecology and Evolution
Biomathematics Research Centre
University of Canterbury, Christchurch, New Zealand
DIMACS, June 2006

Where are phylogenetic trees used?

Evolutionary biology: species relationships, dating divergences, speciation processes, molecular evolution.
Ecology: classifying new species, biodiversity, co-phylogeny, migration of populations.
Epidemiology: systematics, processes, dynamics.
Extras: linguistics, stemmatology, psychology.

Phylogenetic trees

[Definition] A phylogenetic X-tree is a tree T = (V, E) with a set X of labelled leaves, and all other vertices unlabelled and of degree at least 3. If all non-leaf vertices have degree 3, then T is binary.

Trees and splits
[Figure: a tree with leaves 1 to 6; deleting an edge e splits the leaf set into $A_e \mid B_e$.]
$\Sigma(T) = \{A_e \mid B_e : e \in E\}$
Partial order $(P_X, \le)$: $T \le T' \iff \Sigma(T) \subseteq \Sigma(T')$
Buneman's Theorem

Quartet trees
A quartet tree is a binary phylogenetic tree on 4 leaves (say x, y, w, z), written xy|wz.
[Figure: the quartet tree on leaves x, y, w, z.]
A phylogenetic X-tree T displays xy|wz if there is an edge in T whose deletion separates {x, y} from {w, z}.
[Figure: a tree with leaves x, y, r, w, z, s, u.]

Corresponding notions for rooted trees
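The split and quartet definitions from the two preceding slides can be tried out on a small example. The sketch below is illustrative only: the adjacency-dictionary encoding, the five-leaf tree, and the function names `splits`, `compatible`, and `displays` are assumptions made here, not part of the talk. It computes $\Sigma(T)$ by deleting each edge in turn, tests whether a quartet is displayed, and checks the pairwise compatibility condition that Buneman's theorem uses to characterise the split sets of trees.

```python
from itertools import combinations


def splits(adj, leaves):
    """Return Sigma(T): the split {A_e, B_e} of the leaf set for each edge e."""
    leaves = frozenset(leaves)
    result = set()
    for u in adj:
        for v in adj[u]:
            # Collect everything reachable from u without crossing edge {u, v}.
            seen, stack = {u}, [u]
            while stack:
                node = stack.pop()
                for nxt in adj[node]:
                    if nxt not in seen and (node, nxt) != (u, v):
                        seen.add(nxt)
                        stack.append(nxt)
            side = leaves & seen
            result.add(frozenset((side, leaves - side)))
    return result


def compatible(s1, s2):
    """Two X-splits are compatible if some side of one misses some side of the other."""
    return any(not (a & b) for a in s1 for b in s2)


def displays(sigma, x, y, w, z):
    """True if some split in sigma separates {x, y} from {w, z}."""
    return any(
        {x, y} <= a and {w, z} <= b
        for s in sigma
        for a in s
        for b in s
        if a != b
    )


# Hypothetical example tree: cherries {1, 2} and {4, 5} joined through leaf 3.
adj = {
    "1": ["u"], "2": ["u"], "3": ["v"], "4": ["w"], "5": ["w"],
    "u": ["1", "2", "v"], "v": ["u", "3", "w"], "w": ["v", "4", "5"],
}
sigma = splits(adj, ["1", "2", "3", "4", "5"])

print(len(sigma))                              # one split per edge
print(displays(sigma, "1", "2", "4", "5"))     # is 12|45 displayed?
print(displays(sigma, "1", "3", "2", "4"))     # is 13|24 displayed?
print(all(compatible(s, t) for s, t in combinations(sigma, 2)))
```

Splits are stored as unordered pairs of frozensets, so the two directions of each edge produce the same split and are counted once.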

Clusters (in place of splits)
Triples (in place of quartets)

How are trees useful in epidemiology? Systematics and reconstruction

How are different types/strains of a virus related?
When, where, and how did they arise?
What is their likely future evolution?
What was the ancestral sequence?

How are trees useful in epidemiology? Processes and dynamics (Phylodynamics)

How do viruses change with time in a population? (Population size, etc.)
What is their rate of mutation, recombination, selection?

Within-host dynamics

How do viruses evolve in a single patient?
How is this related to the progression of the disease?
How much compartmental variation exists?

What do the shapes of these trees tell us about the processes governing their evolution?
E.g. population dynamics, selection
Coalescent prediction

Tree shapes (non-metric)

George Yule
[Figure: a tree on leaves a, b, c, d, e.]

Why do trees on the same taxa disagree?
1. Model violation
   1. true model differs from assumed model
   2. true model = assumed model but estimation method not appropriate to model
   3. model true but too parameter rich (non-identifiability)
2. Sampling error (and factors that make it worse!)
3. Alignment error
4. Evolutionary processes
   1. Lineage sorting
   2. Recombination
   3. Horizontal gene transfer; hybrid taxa
   4. Gene duplication and loss

Sampling error that's hard to deal with
[Figure: trees T1, T2, T3, T4; axis label "Time".]

Example: Deep divergence in the Metazoan phylogeny
[Figure: phylogeny of the Metazoa and outgroups, from Huson and Bryant, 2006. Labelled groups include Deuterostomes, Arthropods, Tardigrades, Nematodes, Platyhelminthes, Cnidaria, Ctenophora, Choanoflagellates, and Fungi.]

Models
[Figure: two alternative quartet trees on leaves 1, 2, 3, 4.]
Finite state Markov process
[Figure: labels 1, k, 2.]

Models
[Figure: two alternative quartet trees on leaves 1, 2, 3, 4.]
Site saturation
Subdividing long edges only offers a partial remedy (trade-off).

Gene trees vs species trees
[Figure: two trees on leaves a, b, c.]
Theorem (J. H. Degnan and N. A. Rosenberg, 2006). For n ≥ 5, for any tree, there are branch lengths and population sizes for which the most likely gene tree is different from the species tree.
J. H. Degnan and N. A. Rosenberg. Discordance of species trees with their most likely gene trees. PLoS Genetics, 2(5), e68, May 2006.

Example

[Figure: tree on Orangutan, Gorilla, Chimpanzee, and Human, with a "?" label. Adapted from the Tree of Life Website, University of Arizona.]

Distinguishing between signals
Lineage sorting vs sampling error vs HGT
[Figure: trees on leaves A, B, C.]

Given a tree, what questions might we want to answer?
How reliable is a split?
Where is the root of the tree?
Relative ranking of vertices?
Dating?
How well supported is the resolution of some deep divergence?

What model best describes the evolution of the sequences (molecular clock? dS/dN ratio constant? etc.)

Statistical approaches:
Non-parametric bootstrap
Parametric bootstrap
Likelihood ratio tests
Bayesian posterior probabilities
Tests (KH, SH, SOWH): Goldman, N., J. P. Anderson, and A. G. Rodrigo. 2000. Likelihood-based tests of topologies in phylogenetics. Systematic Biology 49: 652-670.

[Figure: from Steve Thompson, Florida State University.]

Example

Non-parametric bootstrap

Dealing with incompatibility: Consensus trees
Strict
Majority rule
Semistrict consensus
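As a rough illustration of how the bootstrap and the three consensus methods fit together, the sketch below resamples alignment columns with replacement and then summarises a collection of trees by their splits. Everything here is an assumption made for illustration: trees are given directly as sets of non-trivial splits, the tree-building step that would be run on each replicate is left out, and the names `bootstrap_columns`, `strict_consensus`, `majority_rule`, and `semistrict` are hypothetical.

```python
import random
from collections import Counter


def split(a, b):
    """A split A|B as an unordered pair of frozensets."""
    return frozenset((frozenset(a), frozenset(b)))


def bootstrap_columns(alignment, rng):
    """One non-parametric bootstrap replicate: resample sites with replacement."""
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


def compatible(s1, s2):
    """Two splits are compatible if some side of one misses some side of the other."""
    return any(not (a & b) for a in s1 for b in s2)


def split_counts(trees):
    """Number of input trees containing each split."""
    return Counter(s for tree in trees for s in tree)


def strict_consensus(trees):
    """Splits present in every tree."""
    return {s for s, c in split_counts(trees).items() if c == len(trees)}


def majority_rule(trees):
    """Splits present in more than half of the trees."""
    return {s for s, c in split_counts(trees).items() if 2 * c > len(trees)}


def semistrict(trees):
    """Splits from some tree that are compatible with every split of every tree."""
    every = set().union(*trees)
    return {s for s in every if all(compatible(s, t) for t in every)}


# Three hypothetical trees on {a, b, c, d, e}, listed by their non-trivial splits.
t1 = {split("ab", "cde"), split("abc", "de")}
t2 = {split("ab", "cde"), split("abc", "de")}
t3 = {split("ab", "cde"), split("abd", "ce")}
trees = [t1, t2, t3]

# Support of a split = fraction of trees (e.g. bootstrap replicates) containing it.
support = {s: c / len(trees) for s, c in split_counts(trees).items()}

alignment = {"a": "ACGTAC", "b": "ACGTTC", "c": "ACGAAC"}
replicate = bootstrap_columns(alignment, random.Random(1))
```

In a real analysis each replicate alignment would be passed to a tree-building method, and the resulting trees would take the place of `t1`, `t2`, `t3`.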

Consensus networks
Take the splits that are in at least x% of the trees and represent them by a graph.
Splits graph $G(\Sigma)$ (Dress and Huson)
Each split is represented by a class of parallel edges.
Simplest example (n = 4).

[Figure: chloroplast JSA tree, with leaf annotations such as (N), (NS), (SS), (A), (C,S), and R. nivicola (N).]
[Figure: nuclear ITS tree, with leaf annotations such as (N), (NS), (SS), (A), and R. nivicola.]
[Figure: consensus network (ITS tree + JSA tree), with regions I, II, III and R. nivicola marked.]

Maximum agreement subtrees

Concept
Computational complexity

Comparing trees
Splits metric (Robinson-Foulds)
Statistical aspects.
Tree rearrangement operations: the graph of trees (rSPR).
Cophylogeny

Co-phylogeny (M. Charleston)

Supertrees
Compatibility concept
Compatibility of rooted trees (BUILD)
Why do we want to do this?
Extension: higher order taxa, dates
Methods for handling incompatible trees

(MRP; mincut variants; minflip)

Compatibility
A set Q of quartets is compatible if there is a phylogenetic X-tree T that displays each quartet of Q.
Example: Q = {12|34, 13|45, 14|26}
[Figure: a tree on leaves 1 to 6.]
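For the rooted analogue of this question, the BUILD algorithm named on the Supertrees slide (Aho, Sagiv, Szymanski and Ullman, 1981) decides whether a set of rooted triples is compatible and, if so, returns a tree displaying them. The following is a minimal sketch under assumptions made here: a triple `(a, b, c)` stands for ab|c, leaves are plain strings, an internal vertex is encoded as a frozenset of its child subtrees, and the function name `build` is only illustrative.

```python
def build(leaves, triples):
    """Return a rooted tree displaying all triples, or None if none exists.

    A triple (a, b, c) stands for ab|c. Leaves are returned as-is; an
    internal vertex is a frozenset of its child subtrees.
    """
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))

    # Keep only triples whose three leaves all lie in the current leaf set.
    relevant = [t for t in triples if set(t) <= leaves]

    # Union-find: join a and b for every relevant triple ab|c.
    parent = {x: x for x in leaves}

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for a, b, _ in relevant:
        parent[find(a)] = find(b)

    # The connected components become the children of the current vertex.
    blocks = {}
    for x in leaves:
        blocks.setdefault(find(x), set()).add(x)
    if len(blocks) == 1:
        return None  # graph is connected, so the triples cannot all be displayed

    children = []
    for block in blocks.values():
        subtree = build(block, relevant)
        if subtree is None:
            return None
        children.append(subtree)
    return frozenset(children)


# Hypothetical inputs: a compatible pair of triples and a conflicting pair.
tree = build("abcd", [("a", "b", "c"), ("c", "d", "a")])
conflict = build("abc", [("a", "b", "c"), ("b", "c", "a")])
print(tree)
print(conflict)
```

Using frozensets for internal vertices makes the result independent of the order in which components are visited, at the cost of a less readable printout than a Newick string.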

Complexity?

Phylogenetic networks

Consensus setting: consensus networks
Minimizing hybrid/reticulate vertices
Supernetworks: Z-closure, filtering
[Figure: three diagrams on leaves a, b, c, d.]

Networks can represent:
Reticulate evolution (e.g. hybrid species)
Phylogenetic uncertainty (i.e. possible alternative trees)

Z-closure

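Given $T_1, \ldots, T_k$ on overlapping sets of species, let $\Sigma = \Sigma(T_1) \cup \cdots \cup \Sigma(T_k)$, construct $\mathrm{spcl}_2(\Sigma)$, and construct the splits graph of the resulting splits that are full.

Split closure operation (Meacham 1986)
$A_1 \mid B_1,\ A_2 \mid B_2 \;\longmapsto\; A_1 \mid B_1 \cup B_2,\ A_1 \cup A_2 \mid B_2$
[Figure: the sets $B_1$, $A_2$, $A_1$, $B_2$.]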
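The closure operation can be sketched as a small rewriting loop on partial splits. The precondition used below is the usual statement of the Z rule as understood here, and is not visible in the extracted slide text: the rule applies when $A_1 \cap A_2$, $A_2 \cap B_1$, and $B_1 \cap B_2$ are all non-empty and $A_1 \cap B_2$ is empty. The function names `z_rule` and `z_closure` and the two input splits are hypothetical.

```python
from itertools import permutations


def split(a, b):
    """A partial split A|B as an unordered pair of frozensets."""
    return frozenset((frozenset(a), frozenset(b)))


def z_rule(s1, s2):
    """Apply the rule in any orientation; return the enlarged pair or None."""
    for a1, b1 in permutations(s1):
        for a2, b2 in permutations(s2):
            if a1 & a2 and a2 & b1 and b1 & b2 and not (a1 & b2):
                out = (split(a1, b1 | b2), split(a1 | a2, b2))
                if set(out) != {s1, s2}:  # ignore applications that change nothing
                    return out
    return None


def z_closure(splits):
    """Apply the rule to pairs of splits until no pair can be enlarged."""
    current = set(splits)
    changed = True
    while changed:
        changed = False
        for s1 in list(current):
            for s2 in list(current):
                if s1 == s2 or s1 not in current or s2 not in current:
                    continue
                out = z_rule(s1, s2)
                if out is not None:
                    current -= {s1, s2}
                    current |= set(out)
                    changed = True
    return current


def full_splits(splits, taxa):
    """Splits whose two sides together cover the whole taxon set."""
    taxa = frozenset(taxa)
    return {s for s in splits if frozenset().union(*s) == taxa}


# Two hypothetical partial splits from trees on {a, b, c, d} and {b, c, d, e}.
closure = z_closure({split("ab", "cd"), split("bc", "de")})
full = full_splits(closure, "abcde")
```

Each application only enlarges sides or merges splits, so the loop stops after finitely many steps. In this example both output splits are full on {a, b, c, d, e}.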

Reconstructing ancestral sequences
Methods (MP, Likelihood, Bayesian)
Quiz: MP for a balanced tree = majority state?
Information-theoretic considerations

Statistics of parsimony (clustering on a tree)
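The quiz on the ancestral sequences slide can be explored with Fitch's algorithm for maximum parsimony on a rooted binary tree. The sketch below is an illustration under assumptions made here: trees are nested pairs with leaf states 0 or 1, and the helper constructions `ones`, `zero`, and `mixed` are hypothetical names for one particular family of balanced trees. For the 32-leaf tree they produce, the parsimony root state set comes out as {0} even though state 0 is in the minority at the leaves, which suggests the answer to the quiz is no.

```python
def fitch(tree):
    """Return (set of MP root states, parsimony score) for a rooted binary tree.

    A tree is either a leaf state or a pair (left, right).
    """
    if not isinstance(tree, tuple):
        return {tree}, 0
    left_set, left_cost = fitch(tree[0])
    right_set, right_cost = fitch(tree[1])
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # Disjoint child sets: take the union and pay for one extra change.
    return left_set | right_set, left_cost + right_cost + 1


def ones(h):
    """Balanced tree of height h with every leaf in state 1."""
    return 1 if h == 0 else (ones(h - 1), ones(h - 1))


def zero(h):
    """Balanced tree of height h >= 1 whose Fitch root set is {0}."""
    return (0, 0) if h == 1 else (zero(h - 1), mixed(h - 1))


def mixed(h):
    """Balanced tree of height h >= 1 whose Fitch root set is {0, 1}."""
    return (0, 1) if h == 1 else (zero(h - 1), ones(h - 1))


def leaf_states(tree):
    """List the leaf states from left to right."""
    if not isinstance(tree, tuple):
        return [tree]
    return leaf_states(tree[0]) + leaf_states(tree[1])


tree = zero(5)
states = leaf_states(tree)
root_set, score = fitch(tree)
print(len(states), states.count(0), root_set)
```

The number of leaves in state 0 needed to force the root set {0} grows like 2, 3, 5, 8, 13 with the height, while the number of leaves doubles, so the minority effect first appears at height 5.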