For Wednesday: Excerpt from Li's book on Molecular Evolution (WebCT assignment #16)
For Friday: try to understand non-parametric bootstrapping
For Monday: Read through chapter 8 p. 158 to 179
Intro to phylogenetic reconstruction
|Compilation of sequence dataset|
|Determination of substitution model|
Why phylogenetic reconstruction of molecular evolution?
e.g.: Who were the first angiosperms? (i.e. where are the first angiosperms located relative
Where in the tree of life is the last common ancestor located?
e.g.: domain shuffling, reassignment of function, gene duplications, horizontal gene transfer, drug targets, detection of genes that drive evolution of a species/population (e.g. influenza virus, see here for more examples)
1) Obtain sequences
Databank Searches -> ncbi a) entrez, b) BLAST, c) blast of pre-release data
2) Determine homology (see notes for earlier classes for practical implementation)
Reminder on Definitions:
Types of homology:
Orthology: bifurcation in molecular tree reflects speciation
3) Align sequences
4) Reconstruct evolutionary history
5) Interpret the result.
Bootstrapping - how to assess reliability of partitions given in a tree.
Baron Karl Friedrich Hieronymus von Münchhausen
Bootstrapping is one of the most popular ways to assess the reliability of branches. The term
bootstrapping goes back to the Baron Münchhausen (who pulled himself out of a
swamp by his shoe laces). Briefly, positions of the aligned sequences are randomly
sampled from the multiple sequence alignment with replacement.
The sampled positions are assembled into new data sets, the so-called bootstrapped
samples. Each position has about a 63% chance of making it into a particular bootstrapped
sample. If a grouping has a lot of support, it will be supported by at least some
positions in each of the bootstrapped samples, and all the bootstrapped samples
will yield this grouping. Bootstrapping can be applied to all methods of phylogenetic
reconstruction.
Bootstrapping has become very popular to assess the reliability of reconstructed phylogenies. Its advantage is that it can be applied to different methods of phylogenetic reconstruction, and that it assigns a probability-like number to every possible partition of the dataset (= branch in the resulting tree). Its disadvantage is that the support for individual groups decreases as you add more sequences to the dataset, and that it only measures how much support for a partition is in your data given a method of analysis. If the method of reconstruction falls victim to a bias or an artifact, this will be reproduced in every one of the bootstrapped samples, and it will result in high bootstrap support values.
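The resampling step described above can be sketched in a few lines of Python. This is only a minimal illustration, not the procedure of any particular phylogenetics package: the toy alignment, the sequence names, and the function names are made up for this example. It also shows where the figure of about 63% comes from: with n columns, a given column is included at least once with probability 1 - (1 - 1/n)^n, which approaches 1 - 1/e for large n.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement.

    alignment: dict mapping a sequence name to its aligned sequence
    (all sequences must have the same length).
    Returns the resampled alignment and the list of sampled column indices.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_sites = lengths.pop()
    # Draw n_sites column indices, each chosen uniformly with replacement.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    resampled = {name: "".join(seq[i] for i in columns)
                 for name, seq in alignment.items()}
    return resampled, columns

def inclusion_probability(n_sites):
    """Chance that a given column appears at least once in one bootstrap sample."""
    return 1.0 - (1.0 - 1.0 / n_sites) ** n_sites

# Hypothetical toy alignment, made up for illustration only.
toy = {
    "seqA": "ACGTACGTAC",
    "seqB": "ACGTTCGTAC",
    "seqC": "ACGAACGTTC",
    "seqD": "TCGAACGTTC",
}
rng = random.Random(1)
sample, cols = bootstrap_alignment(toy, rng)
print("sampled columns:", cols)
for name, seq in sample.items():
    print(name, seq)
print("inclusion probability for 1000 sites:", round(inclusion_probability(1000), 3))
```

In a full analysis each bootstrapped sample would be passed to the tree reconstruction method, and the support for a partition would be the fraction of resulting trees that contain it.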
Create a bootstrap sample on the blackboard
Continue at last year's class here on bootstrap and shortcomings of trees calculated with clustalw.
Bootstrap and non-informative data (here)
Questions and answers on bootstrap
Discussion quiz 5 (here).
Discussion of exercises from last Friday:
Correction for multiple substitutions. Exclusion of sites with gaps. Artifacts.
A) Maximum likelihood ratio test
The reconstruction of phylogenetic trees from molecular sequences necessitates that you assume a model that describes the evolutionary process. Often these assumptions are not clearly spelled out, and some claim that parsimony analysis does not assume a model at all, that it just searches for the tree that explains the data with the least number of substitutions. However, an alternative view is that parsimony corresponds to a model in which all substitutions are equally likely. One of the major problems, especially if one wants to calibrate the data with respect to time, is the correction for multiple substitutions. The situation is complicated by two factors:
1. different sites along a sequence experience substitutions with different frequency
2. if a replacement occurs, the different types of replacements occur with different probabilities
Both of these considerations are valid for amino acid and nucleotide sequences. Taking both of these processes into account greatly improves the validity of the obtained trees. Two approaches have been used to address problem 1: assign different weights to different positions a priori (e.g. first, second, third codon position, or stem versus loop regions in rRNA), or have the program decide which distribution of among-site rate variation is present in the data and make the appropriate corrections for multiple substitutions.
The so-called gamma distribution has become very popular for this purpose. A good and readable overview was published by Z. Yang: The among-site rate variation and its impact on phylogenetic analyses. Trends Ecol. Evol. 11: 367-372. (1996) here
The gamma distribution is useful because a single parameter (the shape parameter a) continuously alters the character of the distribution. With a = infinity all sites change at the same rate; an extreme among-site rate variation (ASRV), where only a few sites vary and the majority of sites are invariant or change only very slowly, is obtained with a approaching 0. In between are cases that resemble exponential (a = 1), Poisson (a = 2), and normal distributions (a > 10). (see here for graphs)
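To get a feeling for how the shape parameter changes the distribution of rates, one can draw random site rates for several values of a. The sketch below uses only the Python standard library. It assumes the common convention of scaling the distribution to a mean rate of 1 (scale = 1/a), which is not stated in the notes above; the function name and the chosen values of a are illustrative only.

```python
import random
import statistics

def sample_site_rates(alpha, n_sites, rng):
    """Draw relative substitution rates from a gamma distribution.

    Assumption: shape = alpha and scale = 1/alpha, so the mean rate is 1
    and the variance is 1/alpha.
    """
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]

rng = random.Random(42)
for alpha in (0.2, 1.0, 2.0, 10.0, 100.0):
    rates = sample_site_rates(alpha, 20000, rng)
    # Fraction of sites that are nearly invariant (rate below 0.1).
    slow = sum(r < 0.1 for r in rates) / len(rates)
    print(f"a={alpha:6.1f}  mean={statistics.fmean(rates):.2f}  "
          f"variance={statistics.pvariance(rates):.2f}  "
          f"fraction with rate < 0.1: {slow:.2f}")
```

With a small a most sites fall into the nearly invariant class while a few sites change very fast; with a large a the rates cluster tightly around 1, which approaches the equal-rates case.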
If you want to compare two models of evolution (this includes the tree) given a certain data set, you can utilize the so-called maximum likelihood ratio test. If L1 and L2 are the likelihoods of the two models, d = 2(logL1 - logL2) approximately follows a Chi-square distribution with n degrees of freedom. Usually n is the difference in the number of model parameters (i.e., how many parameters are used to describe the substitution process). In particular, n can be the difference in the number of branches between two trees (one tree is more resolved than the other). In principle, this test can only be applied if one model is a more refined version of the other. However, if you compare two completely resolved trees with each other that differ only in a single branch, you can, following a suggestion by Joe Felsenstein, use one degree of freedom. In case of a molecular clock assumption, the model that assumes a clock has N-2 fewer parameters, where N is the number of sequences (as all sequences end up in the present at the same level, their branches cannot be freely chosen).
You can either look up the Chi-square distribution in a table, or you can go to Paul Lewis' webpage (http://hydrodictyon.eeb.uconn.edu/people/plewis/downloads/chiscalc.exe).
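As an alternative to a table, the test statistic and its Chi-square tail probability can be computed directly. The following is a minimal sketch using only the Python standard library; the function names and the log-likelihood values are hypothetical and serve only as an illustration. The tail probability is built up from the closed forms for one and two degrees of freedom with a standard recurrence, so it handles integer degrees of freedom only.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) of a Chi-square distribution (integer df >= 1)."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed-form starting points for df = 1 (odd) and df = 2 (even).
    if df % 2 == 1:
        nu, tail = 1, math.erfc(math.sqrt(half))
    else:
        nu, tail = 2, math.exp(-half)
    # Recurrence: Q(x | nu + 2) = Q(x | nu) + (x/2)^(nu/2) * exp(-x/2) / Gamma(nu/2 + 1)
    while nu < df:
        tail += half ** (nu / 2.0) * math.exp(-half) / math.gamma(nu / 2.0 + 1.0)
        nu += 2
    return tail

def likelihood_ratio_test(lnL_general, lnL_restricted, df):
    """Return (d, p) where d = 2 * (lnL_general - lnL_restricted)."""
    d = 2.0 * (lnL_general - lnL_restricted)
    return d, chi2_sf(d, df)

# Hypothetical log-likelihoods: a model with one extra parameter vs. the simpler model.
d, p = likelihood_ratio_test(-5230.4, -5236.1, df=1)
print(f"d = {d:.2f}, p = {p:.5f}")

# Molecular clock test for a hypothetical dataset of 8 sequences:
# degrees of freedom = number of sequences minus two.
n_sequences = 8
d_clock, p_clock = likelihood_ratio_test(-5230.4, -5241.0, df=n_sequences - 2)
print(f"clock test: d = {d_clock:.2f}, p = {p_clock:.5f}")
```

The more general model always goes first, so that d is not negative. A small p suggests that the extra parameters improve the fit more than expected by chance.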
B) Maximum likelihood mapping
An often-encountered problem in inspecting trees is the assessment of support for different groupings. E.g., does Giardia lamblia form the deepest branch within the known eukaryotes? Maximum likelihood mapping offers a graphic approach to this problem.
You can generate a ml-map for a single branch, or for the complete dataset. The principle is that for each possible quartet, the probabilities for the three tree types are plotted in a simplex (Pi = Li/(L1+L2+L3); note that P1+P2+P3 = 1; Pi is the (kind of) posterior probability and Li is the likelihood of tree i). If you generate a plot for the whole tree, you learn how many quartets are resolved with confidence; if you plot only a single branch, you learn how many quartets support each of the possible orientations of the branch.
E.g. if one wants to know if Giardia lamblia is the deepest branch within the eukaryotes, one can choose the "higher" eukaryotes as cluster a, another deep branching eukaryote (one that competes against Giardia) as cluster b, Giardia as cluster c, and the outgroup as cluster d. For an easy sample output see this sample ml-map. A more complicated result from the analysis of carbamoyl phosphate synthetase domains is here.
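The calculation of the Pi values for one quartet, and the placement of the resulting point in the triangle, can be sketched as follows. This is an illustration only and not the code used by TREE-PUZZLE: the function names, the example log-likelihoods, and the orientation of the triangle (tree 1 at the top corner) are assumptions made for this sketch. Because programs report log-likelihoods, the largest value is subtracted before exponentiating to avoid numerical underflow; this does not change the ratios.

```python
import math

def quartet_weights(lnL1, lnL2, lnL3):
    """Convert three log-likelihoods into weights P_i = L_i / (L1 + L2 + L3)."""
    logs = (lnL1, lnL2, lnL3)
    best = max(logs)
    # Subtract the largest log-likelihood before exponentiating to avoid underflow.
    rel = [math.exp(v - best) for v in logs]
    total = sum(rel)
    return tuple(r / total for r in rel)

def simplex_point(p1, p2, p3):
    """Map (P1, P2, P3) to x, y coordinates inside an equilateral triangle.

    Assumed layout: tree 1 at the top, tree 2 at the bottom left,
    tree 3 at the bottom right.
    """
    corners = ((0.5, math.sqrt(3) / 2), (0.0, 0.0), (1.0, 0.0))
    x = sum(p * c[0] for p, c in zip((p1, p2, p3), corners))
    y = sum(p * c[1] for p, c in zip((p1, p2, p3), corners))
    return x, y

# Hypothetical log-likelihoods of the three possible topologies for one quartet.
weights = quartet_weights(-1200.0, -1203.0, -1210.0)
print("P1, P2, P3 =", [round(w, 4) for w in weights])
print("position in the triangle:", simplex_point(*weights))
```

A quartet that clearly favors one topology lands close to the corresponding corner, while a quartet without signal lands near the center of the triangle.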
C) Bayesian Posterior probabilities with TREEPUZZLE and MrBayes
The formula used by Strimmer and von Haeseler (here) to calculate posterior probabilities (i.e. the probability that tree topology Ti is true given an aligned set of four sequences) considers only three trees (i.e. branch lengths and topology), each with the same a priori probability. These three trees are those that have the highest likelihood for the three possible topologies. However, there are infinitely many other trees that differ from the three chosen ones only by differences in branch lengths. What is the effect on the calculated posterior probability to use only the single best tree as a representative of all the trees with the same topology? There is no a priori reason to exclude the other trees that have slightly lower likelihoods.
A different approach that does not make these assumptions is the use of Markov Chain
Monte Carlo methods to explore tree space. The program MrBayes
written by Huelsenbeck and Ronquist performs such a random walk in tree space.
Trees with a higher probability are visited more often than those with a lower
probability. Some slides regarding ml mapping and Bayesian posterior probability
If, after going through the slides in the last link, you are interested in exploring Bayes' theorem [i.e., the posterior probability of an event given the data = (the probability of the data given the event) times (the probability of the event) / (the probability of the data)], go to this illustration by Olga Zhaxybayeva.
You can use MrBayes to calculate the probabilities of trees with different topologies, bipartitions, etc. If you calculate the consensus of all trees visited after the "burn in" phase, the percentage of trees that have a certain partition directly reflects the posterior probability of this partition.
Paul Lewis (EEB - UConn) has written a very readable and thorough description of the Bayesian approach: from MCB/EEB372 class 22
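The idea that the frequency of visits after burn-in estimates the posterior probability can be illustrated with a toy random walk over only three topologies. This is a deliberately simplified sketch and not how MrBayes is implemented: a real analysis also changes branch lengths and model parameters and has to compute likelihoods from the data. The unnormalized posterior weights, the topology labels, and the function name are made up for this example.

```python
import random
from collections import Counter

def metropolis_topologies(weights, n_steps, burn_in, rng):
    """Toy Metropolis sampler over a small set of tree topologies.

    weights: dict mapping a topology label to its unnormalized posterior
    (likelihood times prior). Returns the visit frequencies after burn-in.
    """
    states = list(weights)
    current = states[0]
    visits = Counter()
    for step in range(n_steps):
        # Symmetric proposal: pick one of the other topologies uniformly.
        proposal = rng.choice([s for s in states if s != current])
        # Metropolis rule: always accept uphill moves; accept downhill
        # moves with probability equal to the ratio of the weights.
        if rng.random() < weights[proposal] / weights[current]:
            current = proposal
        if step >= burn_in:
            visits[current] += 1
    kept = n_steps - burn_in
    return {s: visits[s] / kept for s in states}

# Hypothetical unnormalized posteriors; the true posteriors are 0.6, 0.3 and 0.1.
weights = {"((A,B),(C,D))": 6.0, "((A,C),(B,D))": 3.0, "((A,D),(B,C))": 1.0}
posterior = metropolis_topologies(weights, 50000, 5000, random.Random(7))
for topology, freq in posterior.items():
    print(topology, round(freq, 3))
```

The sampler never needs the normalizing constant (the probability of the data); only ratios of posterior weights enter the acceptance step.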
Paul Lewis' MCRobot program that illustrates the MCMC approach to estimate posterior probabilities is here.
For those interested in reading more about the application of probability mapping to comparative genome analyses: An article on the use of ml mapping in comparative genome analyses is here. (See Fig. 1, 2, 3, 4, 7, and Tab. 4); an improved version of probability mapping that solves the problem of poor taxon sampling inherent in quartet analyses is here, and an article that describes the extension to more than 4 genomes is here.