Assignments for Today:
Assignments for Wednesday
One problem in maintaining databanks (supervised and unsupervised) is "ownership" of sequences, which in many databanks prevents continuous updating of sequences. Even if errors are detected, they are not easily removed from the databank. E.g., ATP synthase operons in E. coli; see Fig. 1 in http://mic.microbiologyresearch.org/content/journal/micro/10.1099/mic.0.033811-0#tab2
Even species names are often wrongly assigned (slides)
Discuss exponential functions. (see here)
With respect to the coral of life (additional slides here) discuss if the concept of phylogeny is compatible with reticulation events.
Alternatives to evolution by natural selection?
In biological evolution, what processes might go beyond natural selection?
Do these processes conflict with "Darwinian evolution" or only with the modern synthesis?
- Horizontal gene transfer and recombination
- Polyploidization (angiosperm and vertebrate evolution) see here and here
- Fusion and cooperation of organisms (Kefir, lichen, also the eukaryotic cell)
- Evolution of the holobiont (host + symbionts)
- Targeted mutations (?), genetic memory (?) (see Foster's and Hall's reviews on directed/adaptive mutations; see here for a counterpoint)
- Random genetic drift
- Gratuitous complexity
- Selfish genes (who/what is the subject of evolution?)
- Parasitism, altruism, gene transfer agents
- Mutationism, hopeful monsters
If you can demonstrate significant similarity using randomization, your sequences are homologous (i.e. related by common ancestry). Convergent evolution has not been shown to lead to sequence similarities between complex sequences detectable through pairwise comparison.
When are two similar sequences significantly similar/homologous? (The opposite of homology is analogy, due to convergent evolution.)
(Note: we will discuss alignment algorithms later; for now it is sufficient to know that, given a scoring matrix and two sequences, one can calculate an alignment that has an optimal score.)
One way to quantify the similarity between two sequences is to
1. compare the actual sequences and calculate the alignment score
2. randomize (scramble) one (or both) of the sequences and calculate the alignment score for the randomized sequences.
3. repeat step 2 at least 100 times
4. describe distribution of randomized alignment scores
5. do a statistical test to determine if the score obtained for the real sequences is significantly better than the score for the randomized sequences
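The five steps above can be sketched in Python as follows. This is a minimal illustration, not the PRSS implementation: the function names (`local_alignment_score`, `shuffle_test`), the simple match/mismatch/gap scoring, and the example sequences are all assumptions made for this sketch. PRSS fits a distribution to the shuffled scores; here step 5 is replaced by a simpler z-score and an empirical P-value.

```python
import random
import statistics


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Optimal local (Smith-Waterman) alignment score, linear gap penalty.

    Only the score is returned, not the alignment itself. The scoring
    values are illustrative, not a real substitution matrix.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def shuffle_test(seq_a, seq_b, n_shuffles=200, seed=0):
    """Compare the real score with scores against shuffled copies of seq_b."""
    rng = random.Random(seed)
    real = local_alignment_score(seq_a, seq_b)           # step 1
    letters = list(seq_b)
    random_scores = []
    for _ in range(n_shuffles):                          # steps 2 and 3
        rng.shuffle(letters)
        random_scores.append(local_alignment_score(seq_a, "".join(letters)))
    mean = statistics.mean(random_scores)                # step 4
    sd = statistics.pstdev(random_scores)
    z = (real - mean) / sd if sd > 0 else float("nan")
    # Step 5 (simplified): fraction of shuffles scoring at least as well
    # as the real pair, with +1 so the estimate is never exactly zero.
    n_as_good = sum(s >= real for s in random_scores)
    p_empirical = (n_as_good + 1) / (n_shuffles + 1)
    return {"real": real, "mean": mean, "sd": sd, "z": z, "p": p_empirical}


if __name__ == "__main__":
    # Hypothetical example sequences, chosen only to exercise the code.
    s1 = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVV"
    s2 = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
    print(shuffle_test(s1, s2))
```

Shuffling keeps the length and composition of the sequence but destroys the order of residues, so the shuffled scores describe what composition alone can produce. With only 200 shuffles the empirical P-value cannot go below about 0.005, which is one reason to fit a distribution to the shuffled scores instead.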
To force the program to report an alignment, increase the E-value cutoff.
An approach similar to PRSS is used in the FASTA database search. If one chooses to display a histogram of the search, the output includes the histogram of all the alignment scores obtained with the individual sequences contained in the database. Included are the actual sequence scores and the ones that are expected based on a probability distribution. An example is here.
Summary of Terminology:
E-values give the expected number of matches with an alignment score this good or better due to chance alone (no shared ancestry, no convergent evolution)
P-values give the probability of finding a match of this quality or better due to chance alone (no shared ancestry, no convergent evolution). The P-value is the probability of obtaining a score at least this good if the null hypothesis (similarity is due to chance alone) is true; it is not the probability that the null hypothesis itself is true. It is also the smallest significance level at which the null hypothesis can be rejected.
P-values lie in [0,1]; E-values lie in [0,infinity).
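The two quantities can be converted into each other if one assumes that the number of chance matches follows a Poisson distribution (the assumption made in BLAST statistics). Then P = 1 - exp(-E), the probability of at least one chance match. The sketch below uses hypothetical helper names (`p_from_e`, `e_from_p`) and shows that for small values E and P are nearly identical, while for large E the P-value saturates at 1.

```python
import math


def p_from_e(e_value):
    """P = 1 - exp(-E), assuming Poisson-distributed chance matches."""
    return -math.expm1(-e_value)   # expm1 keeps precision for tiny E


def e_from_p(p_value):
    """Inverse of p_from_e: E = -ln(1 - P)."""
    if p_value >= 1.0:
        return math.inf
    return -math.log1p(-p_value)


if __name__ == "__main__":
    for e in (1e-10, 0.01, 1.0, 10.0):
        print(e, p_from_e(e))
```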
Jim Knox (MCB-UConn) has studied many proteins involved in bacterial cell wall biosynthesis and antibiotic binding, synthesis or destruction. Many of these proteins have identical 3-D structures and can therefore be assumed to be homologous; however, the above tests fail to detect these homologies. (For example, enzymes with GRASP nucleotide binding sites are depicted here.)
DNA replication involves many different enzymes. Some of the proteins do the same thing in bacteria, archaea and eukaryotes; they have similar 3-D structures (e.g.: sliding clamp, E. coli dnaN and eukaryotic PCNA, see Edgell and Doolittle, Cell 89, 995-998), but again, the above tests fail to detect homology.
Helicase and F1-ATPase. Both form hexamers with something rotating in the middle (either the gamma subunit or the DNA; D. Crampton, pers. communication). The monomers have the same type of nucleotide binding fold (picture)
Discuss how the P values should be adjusted in case multiple tests are performed.
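As a starting point for this discussion, one common (and conservative) adjustment is the Bonferroni correction: multiply each P-value by the number of tests performed and cap the result at 1. The sketch below is one possible approach, not the only one; the function name `bonferroni` and the example numbers are assumptions for illustration. Comparing one query against every sequence in a databank is such a multiple-test situation, and for small P-values the product (P-value × number of comparisons) approximates the E-value.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted P-values: p * (number of tests), capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


if __name__ == "__main__":
    # Hypothetical P-values from three separate pairwise comparisons.
    raw = [0.01, 0.04, 0.5]
    print(bonferroni(raw))
```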
False positives: The number of false positives is estimated by the E-value. The P-value, or significance value, gives the probability that a positive identification is made in error (same as with drug tests).
False negatives: Homologous sequences in the databank that are not recognized as such. If there are only 12000 different protein families, on average a sequence should have (size of the databank)/12000 matches. In other words, the number of false negatives is probably very large.
Goals class 8: