Introduction to phylogenetics
Erick Matsen
(With a great exercise from Trevor)
“data”: sequence alignment
![](figures/phylo/bigProtAln.png)
Sequence alignment is hard!
![](figures/phylo/alignment-and-phylogeny-protpal-truth-only.png)
Sometimes it’s more or less impossible
![](figures/phylo/alignment-and-phylogeny-protpal-hard-case.png)
Phylogenetic approach is best
Muscle vs PRANK
Co-estimate tree & alignment using Bayesian methods ![](figures/phylo/bali-phy.png)
Multiple sequence alignment: summary
- Use PRANK, perhaps through Wasabi
- If you need a faster algorithm and are happy with a less good result, try Muscle
- If you really care about alignments (and trees) use BAli-Phy.
Types of phylogenetic inference methods
- Distance-based
- Parsimony
- Likelihood-based
- Maximum likelihood
- Bayesian
Distance-based phylogenetics
![](figures/phylo/what_is_phylo_dist_nofonts.svg)
Parsimony phylogenetics
![](figures/phylo/tree_parsimony_nofonts.svg)
Among competing
hypotheses that predict equally well, the one with the fewest assumptions should be selected.
(The next few slides are from Trevor.)
Parsimony suggests this topology requires 3 mutations at minimum
Parsimony suggests both topologies equally tenable
Exercise: which topology is more parsimonious?
Exercise: which topology is more parsimonious?
Exercise: which topology is more parsimonious?
Exercise: which topology is more parsimonious?
Exercise: which topology is more parsimonious?
Exercise: which topology is more parsimonious?
Exercise: which topology is more parsimonious?
Exercise: which topology is more parsimonious?
Likelihood setup
- Come up with a statistical model of experiment
- Parametrize that model
- Evaluate likelihood under various parameter values
Example: flipping coins
Say that, after flipping a coin 20 times, we get 6 heads.
Model using the binomial distribution. Say \(p\) is the probability of getting a tail, and each draw is independent.
The likelihood of getting the observed result is \[
{ {20} \choose 6} \, p^6 \, (1-p)^{20-6}.
\] Recall: \({ {20} \choose 6}\) is the number of ways of choosing 6 items out of 20.
Exercise: likelihood surface
- Run
Beta_binomial.R
in R
- Change the number of heads and tails.
- The maximum likelihood estimate of the parameter of interest is the parameter value(s) that maximize the likelihood.
Here it's marked with a dot.
- What do you notice about the ML value when the observations are rare?
- How does the shape of the likelihood surface change when we get more observations?
Likelihood recap
- Maximum likelihood is a way of inferring unknown parameters
- To apply likelihood, we need a model of the system under investigation
- In general, the “likelihood” is the likelihood of generating the data under the given parameters, written \(P(D | \theta),\) where \(D\) is the data and \(\theta\) are the parameters.
Setup for likelihood based phylogenetics
The phylogenetic likelihood of a tree is the likelihood of generating the observed data given that tree (under the sequence evolution model).
![](figures/phylo/what_is_phylo_ML.scour_nofonts.svg)
Note that the UW’s own Joe Felsenstein was the first to formalize this and develop efficient algorithms.
Sequence evolution models tell us the probability of seeing a certain mutation in some period of (evolutionary) time
- Nucleotide models are fit “on the fly”
- Protein models are typically pre-made
- e.g. JTT (Jones, Taylor, and Thornton), and WAG (Whelan and Goldman) matrices
- Codon models are a great idea
- Position matters!
- e.g. SRD06 model
Calculating likelihood of a single column
![](figures/phylo/column_likelihood_nofonts.svg)
Likelihood of an alignment
![](figures/phylo/aln_likelihood.scour_nofonts.svg)
Note assumption of independence between sites!
The phylogenetic likelihood of a tree is the likelihood of generating the observed data given that tree (under the sequence evolution model)
- Maximum likelihood gives a point estimate
- Confidence is assessed using the bootstrap
- Lots of flexibility with models
\[
P(\theta \mid D) \propto P(D \mid \theta) P(\theta)
\]
Exercise: the prior and the posterior
- Open
Beta_binomial.R
again.
- Return the number of heads and tails to zero, and click "Bayesian".
- Alpha and Beta are the parameters of the prior. Try changing them around.
- Make them such that the prior is peaked around 0.5.
- Now, what happens when you vary the number of heads and tails?
The posterior probability of a tree is the probability that the observed tree is correct (given the model and priors)
- Bayesians sample from this posterior
- If you can deal with a prior, it’s the statistically right thing to do
- Sometimes we aren’t actually interested in the tree, so we can integrate it out
- But! Short alignment, 100 taxa = hours
Markov chain Monte Carlo
![](figures/sprmix/5-taxa-treespace-mcmc-unrooted_nofonts.svg)
- If you jump to a better tree, accept that move
- If you jump to a worse tree, accept that move with a non-zero probability
- It’s all arranged so that you sample trees in proportion to their posterior probability
Subset to high probability nodes
![](figures/sprmix/5-taxa-treespace-subgraph-unrooted_nofonts.svg)
Real tree spaces have bottlenecks
Whidden & M, Systematic Biology, 2015
Likelihood phylogenetics recap
- In likelihood phylogenetics, explicitly model mutation process
- This allows complex models to be used
- Statistical basis allows us to make formal statements about uncertainty
- But on the other hand our models are over-simple!
Crazy but typical model assumptions
- differences between sequences only appear by point mutation
- evolution happens on each column independently
- sequences are evolving according to reversible models (this excludes selection and directional evolution of base composition)
- the evolutionary process is identical on all branches of the tree
Software
- FastTree – approximate ML
- RAxML and PhyML – somewhat less approximate ML
- BEAST – Bayesian
- MrBayes – Bayesian
- many others, but these are the ones I know …