Assignment for Friday's class:

Assignments for Monday's class:

Assignments for Today


Types of Error in a Databank search

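False positives: The number of false positives is estimated by the E-value, i.e. the number of hits with a score at least this good that are expected by chance in a databank of the given size. The P-value, or significance value, gives the probability that a positive identification is made in error (as with drug tests).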
Danger: avoid fishing expeditions. If you do 100 tests on random data, you expect about one of them to come out positive at the 1% significance level purely by chance.
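
The arithmetic behind the fishing-expedition warning can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the second function assumes the tests are independent, which the notes do not state.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of tests on random data that pass at level alpha."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests, alpha):
    """Chance that at least one of n_tests passes by chance alone.

    Assumes the tests are independent (an assumption of this sketch).
    """
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    # 100 tests at the 1% level, as in the example above
    print(expected_false_positives(100, 0.01))
    print(prob_at_least_one_false_positive(100, 0.01))
```

Under these assumptions, 100 tests at the 1% level give one expected false positive, and the chance of seeing at least one is well above one half.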

You could apply the Bonferroni correction:

The significance level for each individual test is calculated by dividing the desired overall significance level by the number of parallel tests. The null hypothesis of the overall test, which is to be rejected, is that none of the individual tests differs significantly from chance. (The opposite of "none" is "at least one".)
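
A minimal sketch of the correction as described above. The function names, the default overall level of 0.05, and the example P-values are hypothetical and chosen only for illustration.

```python
def bonferroni_threshold(overall_alpha, n_tests):
    """Significance level each individual test must meet."""
    return overall_alpha / n_tests


def significant_after_bonferroni(p_values, overall_alpha=0.05):
    """Flag the individual tests that remain significant after correction."""
    threshold = bonferroni_threshold(overall_alpha, len(p_values))
    return [p <= threshold for p in p_values]


if __name__ == "__main__":
    # Three parallel tests with made-up P-values
    print(bonferroni_threshold(0.05, 3))
    print(significant_after_bonferroni([0.0001, 0.01, 0.04]))
```

With three parallel tests and an overall level of 0.05, each test has to reach roughly 0.0167, so a P-value of 0.04 that would pass on its own no longer counts as significant.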

False negatives: Homologous sequences in the databank that are not recognized as such. If there are only 12000 different protein families, on average a sequence should have (size of the databank)/12000 matches. In other words, the number of false negatives is probably very large.
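
The estimate above can be worked through with a short sketch. The databank size and the number of hits found are invented figures, the function names are hypothetical, and the calculation assumes all families are equally large, which is what "on average" implies here.

```python
def expected_homologs(databank_size, n_families=12000):
    """Average number of homologous entries per sequence.

    Assumes every family is the same size (a simplification).
    """
    return databank_size / n_families


def estimated_false_negatives(databank_size, hits_found, n_families=12000):
    """Homologs expected on average minus those the search actually reported."""
    return max(0.0, expected_homologs(databank_size, n_families) - hits_found)


if __name__ == "__main__":
    # Hypothetical databank of 1,200,000 sequences; search reported 15 hits
    print(expected_homologs(1_200_000))
    print(estimated_false_negatives(1_200_000, 15))
```

For this made-up databank, a sequence would have 100 homologs on average, so a search that reports 15 significant hits would be missing about 85.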


Decay of significance. Can this be corrected?

[ Carry over from class 8:

      Meaning of phylogeny.

      Another example of databank errors: Even species names are often wrongly assigned (slides)

      Discuss exponential functions. (see here)]

Powerpoint slides on blast


If time:

Goals class 9: