Reading assignments

For Wednesday: Excerpt from Li's book on molecular evolution (WebCT assignment #16)

For Friday: try to understand non-parametric bootstrapping

For Monday: Read chapter 8, pp. 158-179


Intro to phylogenetic reconstruction

Phylogenetic analysis is an inference of evolutionary relationships between organisms.
Those relationships are usually represented by tree-like diagrams. Note: the assumption that evolution is tree-like is controversial.

Steps of the phylogenetic analysis:

Compilation of sequence dataset
Determination of substitution model
Tree building
Tree evaluation


Why phylogenetic reconstruction of molecular evolution?

    1. Systematic classification of organisms

      e.g.: Who were the first angiosperms? (i.e., where are the first angiosperms located relative
      to present-day angiosperms?)

      Where in the tree of life is the last common ancestor located?

    2. Evolution of molecules

      e.g.: domain shuffling, reassignment of function, gene duplications, horizontal gene transfer, drug targets, detection of genes that drive evolution of a species/population (e.g. influenza virus, see here for more examples)


1) Obtain sequences


Databank searches -> NCBI: a) Entrez, b) BLAST, c) BLAST of pre-release data


2) Determine homology (see notes for earlier classes for practical implementation)

Reminder on Definitions:
Homology: Two sequences are homologous if both descend from a common ancestral molecule.

Types of homology:

Orthology: bifurcation in molecular tree reflects speciation
Paralogy: bifurcation in molecular tree reflects gene duplication
Xenology: gene was obtained by organism through horizontal transfer
Synology: genes ended up in one organism through fusion of lineages.

3) Align sequences

Most algorithms used for phylogenetic reconstruction require a global alignment. An exception is statalign,
from Thorne JL and Kishino H (1992), Freeing phylogenies from artifacts of alignment. Mol Biol Evol 9:1148-1162.

Alignment programs (global unless noted otherwise):

      1. clustalw 1.83
      2. MACAW, BLAST (these produce local alignments)
      3. MUSCLE
      4. T-Coffee

Select part of the alignment that is reliable! Modify alignment, if necessary.

4) Reconstruct evolutionary history

    Approaches to phylogenetic reconstruction

      A) Distance analyses

        1. calculate pairwise distances
          (different distance measures, correction for multiple hits, correction for codon bias)
        2. make distance matrix (table of pairwise corrected distances)
        3. calculate tree from distance matrix
          i) using an optimality criterion
          (e.g.: smallest error between distance matrix
          and distances in tree), or
          ii) using algorithmic approaches (UPGMA or neighbor joining)
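
Steps 1 and 2 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four short sequences are made up, gaps are handled by skipping gapped columns pairwise, and the Jukes-Cantor formula stands in for "correction for multiple hits" (it is one of several possible corrections, and it applies to nucleotides only).

```python
import math

def p_distance(seq_a, seq_b):
    """Proportion of differing sites, skipping columns with a gap in either sequence."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)

def jukes_cantor(p):
    """Jukes-Cantor correction for multiple hits: d = -3/4 * ln(1 - 4p/3)."""
    if p == 0:
        return 0.0
    if p >= 0.75:
        return float("inf")  # sequences are saturated; distance is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def distance_matrix(alignment):
    """Table of pairwise corrected distances, as a dict of dicts."""
    names = list(alignment)
    return {a: {b: jukes_cantor(p_distance(alignment[a], alignment[b]))
                for b in names} for a in names}

# Hypothetical toy alignment (not real data).
alignment = {
    "seqA": "ACGTACGTAC",
    "seqB": "ACGTACGTAT",
    "seqC": "ACGAACGTTT",
    "seqD": "TCGAAC-TTT",
}
matrix = distance_matrix(alignment)
for name, row in matrix.items():
    print(name, " ".join(f"{row[other]:.3f}" for other in matrix))
```

The resulting table is what a tree-building step (UPGMA, neighbor joining, or a least-squares fit) would take as input.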

      B) Parsimony analyses

        find the tree that explains the sequence data with the minimum number of substitutions

        (tree includes hypothesis of sequence at each of the nodes)
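
As an illustration of counting substitutions on a given tree, here is a minimal sketch of Fitch's algorithm for rooted binary trees. The nested-tuple tree encoding and the toy alignment are assumptions made for this example; real parsimony programs also search over tree space, which is not shown here.

```python
def fitch_site(tree, states):
    """Return (candidate state set at this node, minimum substitutions below it).

    tree   : a leaf name (str) or a 2-tuple of subtrees
    states : dict mapping leaf name -> character observed at one alignment column
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # Children can share a state: no extra substitution is needed here.
        return common, left_cost + right_cost
    # Children disagree: one substitution must be placed on a child branch.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Sum the per-site minimum number of substitutions over all columns."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )

# Hypothetical four-taxon alignment and the three possible topologies.
alignment = {"A": "AAGC", "B": "AAGT", "C": "CAGT", "D": "CTGT"}
trees = {
    "((A,B),(C,D))": (("A", "B"), ("C", "D")),
    "((A,C),(B,D))": (("A", "C"), ("B", "D")),
    "((A,D),(B,C))": (("A", "D"), ("B", "C")),
}
scores = {label: parsimony_score(tree, alignment) for label, tree in trees.items()}
for label, score in scores.items():
    print(label, score)
```

The state sets computed at the internal nodes are the hypotheses about ancestral sequences mentioned above; the tree with the lowest score is the most parsimonious one.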


      C) Maximum Likelihood analyses

        given a model for sequence evolution, find the tree under which the observed sequence data have the highest probability (i.e., the tree with the highest likelihood).

        This approach can also be used to successively refine the model.
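
The likelihood principle can be shown on the smallest possible case: two aligned sequences and a single branch. The sketch below assumes the Jukes-Cantor (JC69) model and hypothetical site counts (90 identical, 10 different sites); it searches a grid of distances for the value that maximizes the probability of the data. A full ML analysis does the same over all branch lengths and topologies, which is not attempted here.

```python
import math

def jc69_log_likelihood(n_same, n_diff, d):
    """Log-likelihood of a sequence pair at distance d (substitutions per site) under JC69."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e   # probability that a site shows the same nucleotide
    p_diff = 0.25 - 0.25 * e   # probability of one specific different nucleotide
    # The factor 0.25 is the equilibrium frequency of the starting nucleotide.
    return n_same * math.log(0.25 * p_same) + n_diff * math.log(0.25 * p_diff)

# Hypothetical counts, chosen only for illustration.
n_same, n_diff = 90, 10

# Crude grid search over candidate distances 0.001 ... 2.000.
grid = [i / 1000 for i in range(1, 2001)]
best_d = max(grid, key=lambda d: jc69_log_likelihood(n_same, n_diff, d))

# For JC69 the ML distance has a closed form, which the grid search should match.
p = n_diff / (n_same + n_diff)
closed_form = -0.75 * math.log(1.0 - 4.0 * p / 3.0)
print(f"grid search: {best_d:.3f}   closed form: {closed_form:.4f}")
```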

        Bayesian approaches build on the likelihood calculation to obtain posterior probabilities for trees, clades and evolutionary parameters. MCMC approaches in particular have become very popular in recent years, because they allow one to estimate evolutionary parameters (e.g., which site in a virus protein is under positive selection) without assuming that one actually knows the "true" phylogeny.


        D - ...) Else:
        spectral analyses, evolutionary parsimony, i.e., look only at patterns of substitutions

    Another way to categorize methods of phylogenetic reconstruction is to ask if they are using

    • an optimality criterion (e.g.: smallest error between distance matrix and distances in tree, least number of steps), or
    • algorithmic approaches (UPGMA or neighbor joining)

5) Interpret the result.

It is especially important to consider artifacts that might originate in phylogenetic reconstruction, and to assess the reliability of your results.


Bootstrapping - how to assess reliability of partitions given in a tree.

Baron Karl Friedrich Hieronymus von Münchhausen

Bootstrapping is one of the most popular ways to assess the reliability of branches. The term goes back to Baron Münchhausen (who pulled himself out of a swamp by his shoe laces). Briefly, positions of the aligned sequences are randomly sampled from the multiple sequence alignment with replacement. The sampled positions are assembled into new data sets, the so-called bootstrapped samples, each with as many positions as the original alignment. Each original position has about a 63% chance (1 - 1/e for long alignments) of making it into a particular bootstrapped sample. If a grouping has a lot of support, it will be supported by at least some positions in each of the bootstrapped samples, and all the bootstrapped samples will yield this grouping. Bootstrapping can be applied to all methods of phylogenetic reconstruction.
Bootstrapping thus realizes the impossible: the evolution of sequences in real life happened only once, and it is impossible to run the evolution of, let's say, small subunit ribosomal RNAs again. Nevertheless, using the resampling approach, pseudosamples are generated whose variation resembles the variation one would have obtained if it were possible to sample 100 or 1000 parallel worlds in which the evolution of 16S rRNAs occurred over and over again. You end up with a statistical analysis that uses only a single original sample.
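
The resampling step itself is short enough to sketch in Python. The alignment below is a made-up toy example, and the function name `bootstrap_alignment` is introduced here for illustration only. The second part checks the "about 63%" figure empirically on a hypothetical alignment of 1000 positions.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Sample alignment columns with replacement; return the new alignment and the chosen columns."""
    length = len(next(iter(alignment.values())))
    # The same list of column indices is applied to every sequence,
    # so each sampled position stays an intact alignment column.
    columns = [rng.randrange(length) for _ in range(length)]
    resampled = {name: "".join(seq[i] for i in columns)
                 for name, seq in alignment.items()}
    return resampled, columns

rng = random.Random(42)  # fixed seed so the example is reproducible

# Hypothetical toy alignment (not real data).
alignment = {"seqA": "ACGTACGTAC", "seqB": "ACGTACGTAT", "seqC": "ACGAACGTTT"}
sample, columns = bootstrap_alignment(alignment, rng)
print("sampled columns:", columns)
for name, seq in sample.items():
    print(name, seq)

# Fraction of original positions that appear at least once in one bootstrap sample.
length = 1000
included = len({rng.randrange(length) for _ in range(length)}) / length
print(f"fraction of positions included: {included:.3f} (expected about 0.632)")
```

In a full analysis, a tree is reconstructed from each bootstrapped sample, and the support for a partition is the percentage of those trees that contain it.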

Bootstrapping has become very popular for assessing the reliability of reconstructed phylogenies. Its advantage is that it can be applied to different methods of phylogenetic reconstruction, and that it assigns a probability-like number to every possible partition of the dataset (= branch in the resulting tree). Its disadvantage is that the support for individual groups tends to decrease as you add more sequences to the dataset, and that it only measures how much support for a partition is in your data given a method of analysis. If the method of reconstruction falls victim to a bias or an artifact, this will be reproduced in every one of the bootstrapped samples, and it will result in high bootstrap support values.



Create a bootstrap sample on the blackboard

Continue with last year's class (here) on bootstrap and shortcomings of trees calculated with clustalw.

Bootstrap and non-informative data (here)

Questions and answers on bootstrap

Discussion of quiz 5 (here).

Discussion of exercises from last Friday:
Correction for multiple substitutions. Exclusion of sites with gaps. Artifacts.



Other ways to assess strength of support

A) Maximum likelihood ratio test

The reconstruction of phylogenetic trees from molecular sequences requires that you assume a model that describes the evolutionary process. Often these assumptions are not clearly spelled out, and some claim that parsimony analysis does not assume a model at all: it just searches for the tree that explains the data with the least number of substitutions. However, an alternative view is that parsimony corresponds to a model in which all substitutions are equally likely. One of the major problems, especially if one wants to calibrate the data with respect to time, is the correction for multiple substitutions. The situation is complicated by two factors:

1.    different sites along a sequence experience substitutions with different frequency

2.    if a replacement occurs, the different types of replacements occur with different probabilities

Both of these considerations are valid for amino acid and nucleotide sequences. Taking both of these processes into account greatly improves the validity of the obtained trees. Two approaches have been used to address problem 1: either assign different weights to different positions a priori (e.g. first, second, third codon position, or stem versus loop regions in rRNA), or have the program decide which distribution of among-site rate variation is present in the data and make the appropriate corrections for multiple substitutions.

The so-called gamma distribution has become very popular for this purpose. A good and readable overview was published by Z. Yang: The among-site rate variation and its impact on phylogenetic analyses. Trends Ecol. Evol. 11: 367-372 (1996) (here).

The gamma distribution is useful because a single parameter (the shape parameter α) continuously alters the character of the distribution. With α = infinity all sites change at the same rate; an extreme among-site rate variation (ASRV), where only a few sites vary and the majority of sites are invariant or change only very slowly, is obtained with α approaching 0. In between are cases that resemble exponential (α = 1), Poisson (α = 2), and normal distributions (α > 10). (See here for graphs.)
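
The effect of the shape parameter can be explored by sampling. The sketch below draws relative site rates from a gamma distribution with mean 1 (shape alpha, scale 1/alpha, which is the usual convention for among-site rate variation but is stated here as an assumption) and reports the fraction of sites that evolve at less than a tenth of the average rate. The threshold 0.1 and the chosen alpha values are arbitrary illustration choices.

```python
import random

def sample_site_rates(alpha, n_sites, rng):
    """Draw relative substitution rates with mean 1 from a gamma distribution with shape alpha."""
    # random.gammavariate takes (shape, scale); scale = 1/alpha gives mean 1.
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]

rng = random.Random(1)  # fixed seed so the example is reproducible
summary = {}
for alpha in (0.2, 1.0, 2.0, 10.0, 100.0):
    rates = sample_site_rates(alpha, 20000, rng)
    mean_rate = sum(rates) / len(rates)
    slow_fraction = sum(r < 0.1 for r in rates) / len(rates)
    summary[alpha] = (mean_rate, slow_fraction)
    print(f"alpha={alpha:6.1f}  mean rate={mean_rate:.3f}  "
          f"fraction of sites with rate < 0.1: {slow_fraction:.3f}")
```

With small alpha a large share of the sites is nearly invariant, while with large alpha almost all sites have rates close to the mean.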

If you want to compare two models of evolution (this includes the tree) given a certain data set, you can use the so-called maximum likelihood ratio test. If L1 and L2 are the likelihoods of the two models, d = 2(log L1 - log L2) approximately follows a chi-square distribution with n degrees of freedom. Usually n is the difference in the number of model parameters (i.e., how many parameters are used to describe the substitution process). In particular, n can be the difference in the number of branches between two trees (one tree is more resolved than the other). In principle, this test can only be applied if one model is a more refined version of the other (i.e., the models are nested). However, if you compare two completely resolved trees that differ only in a single branch, you can, following a suggestion by Joe Felsenstein, use one degree of freedom. In the case of a molecular clock assumption, the model that assumes a clock has s-2 fewer parameters, where s is the number of sequences (as all sequences end up in the present at the same level, their branches cannot be freely chosen).

You can either look up the chi-square distribution in a table, or you can go to Paul Lewis' webpage.
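
As an alternative to a table, the tail probability can be computed directly. The sketch below uses only the Python standard library and evaluates the chi-square upper tail through a series expansion of the incomplete gamma function; it is adequate for the moderate values of d seen in typical tests, not a replacement for a statistics library. The two log-likelihood values are hypothetical.

```python
import math

def chi2_sf(x, df):
    """Upper tail P(X > x) of a chi-square distribution with df degrees of freedom."""
    if x <= 0:
        return 1.0
    a, half_x = df / 2.0, x / 2.0
    # Series for the lower incomplete gamma function:
    # gamma(a, z) = z^a * e^-z * sum_k z^k / (a * (a+1) * ... * (a+k))
    term = 1.0 / a
    total = term
    k = 0
    while term > total * 1e-15:
        k += 1
        term *= half_x / (a + k)
        total += term
    lower = total * math.exp(-half_x + a * math.log(half_x) - math.lgamma(a))
    return max(0.0, 1.0 - lower)

def likelihood_ratio_test(log_l_complex, log_l_simple, extra_parameters):
    """Return the test statistic d = 2(log L1 - log L2) and its chi-square p-value."""
    d = 2.0 * (log_l_complex - log_l_simple)
    return d, chi2_sf(d, extra_parameters)

# Hypothetical log-likelihoods of two nested models that differ by one parameter.
d, p = likelihood_ratio_test(-2345.6, -2350.1, 1)
print(f"d = {d:.2f}, p = {p:.4f}")
```

A small p-value indicates that the more complex model explains the data significantly better than the simpler one.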

B) Maximum likelihood mapping

An often-encountered problem in inspecting trees is the assessment of support for different groupings. E.g., does Giardia lamblia form the deepest branch within the known eukaryotes? Maximum likelihood mapping offers a graphic approach to this problem.

You can generate an ml-map for a single branch, or for the complete dataset. The principle is that for each possible quartet, the probabilities for the three tree types are plotted in a simplex: Pi = Li/(L1+L2+L3), so that P1+P2+P3 = 1. Here Pi is the (kind of) posterior probability and Li is the likelihood of tree i. If you generate a plot for the whole tree, you learn how many quartets are resolved with confidence; if you plot only a single branch, you learn how many quartets support each of the possible orientations of the branch.
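
The calculation for a single quartet can be sketched as follows. The three log-likelihood values are hypothetical, and the placement of the triangle corners (T1 at the lower left, T2 at the lower right, T3 at the top) is an assumption of this example, not necessarily the layout used by any particular program.

```python
import math

def quartet_posteriors(log_likelihoods):
    """Convert three log-likelihoods into Pi = Li / (L1 + L2 + L3)."""
    best = max(log_likelihoods)
    # Subtract the maximum before exponentiating to avoid numerical underflow.
    weights = [math.exp(value - best) for value in log_likelihoods]
    total = sum(weights)
    return [w / total for w in weights]

def simplex_point(p1, p2, p3):
    """Map (P1, P2, P3) to x/y coordinates inside an equilateral triangle.

    Assumed corners: T1 = (0, 0), T2 = (1, 0), T3 = (0.5, sqrt(3)/2).
    """
    x = p2 + 0.5 * p3
    y = (math.sqrt(3) / 2.0) * p3
    return x, y

# Hypothetical log-likelihoods of the three topologies for one quartet.
posteriors = quartet_posteriors([-1000.0, -1003.0, -1005.0])
point = simplex_point(*posteriors)
print("P1, P2, P3 =", ", ".join(f"{p:.3f}" for p in posteriors))
print(f"position in the simplex: x={point[0]:.3f}, y={point[1]:.3f}")
```

A quartet that strongly favors one topology lands near the corresponding corner; a quartet with three nearly equal likelihoods lands near the center of the triangle.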

E.g., if one wants to know if Giardia lamblia is the deepest branch within the eukaryotes, one can choose the "higher" eukaryotes as cluster a, another deep branching eukaryote (one that competes against Giardia) as cluster b, Giardia as cluster c, and the outgroup as cluster d. For an easy sample output see this sample ml-map. A more complicated result from the analysis of carbamoyl phosphate synthetase domains is here.

One can also use ml-mapping to illustrate the information content in a dataset consisting of many sequences. An example is here (Fig. 4, simulation, and Fig. 5, real data).

C) Bayesian posterior probabilities with TREE-PUZZLE and MrBayes

    The formula used by Strimmer and von Haeseler (here) to calculate posterior probabilities (i.e. the probability that tree topology Ti is true given an aligned set of four sequences) considers only three trees (i.e. branch lengths and topology), each with the same a priori probability. These three trees are those that have the highest likelihood for the three possible topologies. However, there are infinitely many other trees that differ from the three chosen ones only by differences in branch lengths. What is the effect on the calculated posterior probability to use only the single best tree as a representative of all the trees with the same topology? There is no a priori reason to exclude the other trees that have slightly lower likelihoods.

A different approach that does not make these assumptions is the use of Markov Chain Monte Carlo methods to explore tree space. The program MrBayes, written by Huelsenbeck and Ronquist, performs such a random walk in tree space. Trees with a higher probability are visited more often than those with a lower probability. Some slides regarding ml mapping and Bayesian posterior probability are here.
If, after going through the slides in the last link, you are interested in exploring Bayes' theorem [i.e., the posterior probability of an event given the data = (the probability of the data given the event) times (the probability of the event) / (the probability of the data)], go to this illustration by Olga Zhaxybayeva.

You can use MrBayes to calculate the probabilities of trees with different topologies, bipartitions, etc. If you calculate the consensus of all trees visited after the "burn in" phase, the percentage of trees that have a certain partition directly reflects the posterior probability of this partition.
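
The idea that visit frequencies approximate posterior probabilities can be demonstrated with a toy Metropolis sampler. This is not how MrBayes is implemented: the "tree space" below consists of only three topologies with made-up unnormalized posterior values (prior times likelihood), there are no branch lengths or model parameters, and proposals are drawn uniformly at random.

```python
import random
from collections import Counter

def metropolis_topologies(unnormalized, n_steps, burn_in, rng):
    """Random walk over discrete states; return visit frequencies after burn-in."""
    states = list(unnormalized)
    current = rng.choice(states)
    visits = Counter()
    for step in range(n_steps):
        proposal = rng.choice(states)  # symmetric proposal distribution
        # Metropolis rule: always accept uphill moves, sometimes accept downhill ones.
        accept = min(1.0, unnormalized[proposal] / unnormalized[current])
        if rng.random() < accept:
            current = proposal
        if step >= burn_in:
            visits[current] += 1
    kept = n_steps - burn_in
    return {state: visits[state] / kept for state in states}

# Hypothetical values of prior * likelihood; they normalize to 0.6, 0.3 and 0.1.
unnormalized = {"((A,B),(C,D))": 6.0, "((A,C),(B,D))": 3.0, "((A,D),(B,C))": 1.0}
rng = random.Random(7)  # fixed seed so the example is reproducible
frequencies = metropolis_topologies(unnormalized, n_steps=50000, burn_in=5000, rng=rng)
for topology, freq in frequencies.items():
    print(f"{topology}: visited {freq:.3f} of the time")
```

Note that the sampler never needs the normalizing constant (the probability of the data); only ratios of unnormalized values enter the acceptance rule.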


Additional material:

Paul Lewis (EEB - UConn) has written a very readable and thorough description of the Bayesian approach: from MCB/EEB372 class 22

Paul Lewis' MCRobot program that illustrates the MCMC approach to estimate posterior probabilities is here.

For those interested in reading more about the application of probability mapping to comparative genome analyses: an article on the use of ml mapping in comparative genome analyses is here (see Figs. 1, 2, 3, 4, 7, and Tab. 4); an improved version of probability mapping that solves the problem of poor taxon sampling inherent in quartet analyses is here, and an article that describes the extension to more than 4 genomes is here.