Assignment for Friday's class:
- Go through today's blast slides and think about how you will transfer files back and forth from the cluster.
Assignment for Monday:
- Complete take home exam #3 (due on Monday)
E-values and multiple tests
- How can one assess the number of false positives in a BLAST search?
- How can one assess the number of false negatives in a BLAST search?
- Is the E-value of a match independent of the size of the databank?
- If you select two sequences at random and test whether their similarity is significant, what does the E-value signify? Is this the same as the P-value?
- Assume 100 students repeat this exercise. What is the expected number of false positives if each individual test is required to pass the 1% significance level?
- What would you need to do to keep the overall false-positive rate (across all 100 students) at 1%? Which significance level would the individual experiment need to pass?
Types of Error in a Databank search
False positives: The number of false positives is estimated by the E-value. The P-value, or significance value, gives the probability that a positive identification is made in error (as with drug tests).
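To make the difference between the E-value and the P-value concrete, here is a minimal Python sketch. It assumes the usual BLAST statistics, in which the number of chance hits follows a Poisson distribution, so that P = 1 - exp(-E). The function names are illustrative and not part of any BLAST tool, and the rescaling function assumes the E-value grows linearly with databank size (ignoring edge-effect corrections).

```python
import math

def p_value_from_e_value(e_value):
    """Probability of at least one chance hit, given the expected number of chance hits.

    Assumes chance hits are Poisson distributed: P = 1 - exp(-E).
    """
    # expm1 keeps precision for very small E-values
    return -math.expm1(-e_value)

def rescale_e_value(e_value, old_db_size, new_db_size):
    """Approximate E-value of the same match in a databank of a different size.

    Assumption: E is proportional to the size of the search space.
    """
    return e_value * new_db_size / old_db_size

if __name__ == "__main__":
    for e in (0.001, 0.01, 1.0, 10.0):
        print(f"E = {e:<6} P = {p_value_from_e_value(e):.5f}")
    # The same alignment in a databank ten times larger
    print(rescale_e_value(0.01, 1_000_000, 10_000_000))
```

For small values the two numbers are nearly identical (E = 0.01 gives P of about 0.00995), but the E-value can exceed 1 while the P-value cannot.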
Danger: avoid fishing expeditions. If you do 100 tests on random data, you expect one to be positive at the 1% significance level.
You could apply the Bonferroni correction:
The significance level for the individual test is calculated by dividing the overall desired significance level by the number of parallel tests. The null hypothesis of the overall test that is to be rejected is that none of the individual tests is significantly different from chance (the opposite of "none" being "at least one").
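The arithmetic for the 100-student example can be sketched in a few lines of Python. The function names are illustrative, and the family-wise error calculation assumes the individual tests are independent. The Bonferroni correction itself does not require independence.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of tests on random data that pass significance level alpha."""
    return n_tests * alpha

def familywise_error_rate(n_tests, alpha):
    """Probability that at least one of n independent tests on random data passes."""
    return 1.0 - (1.0 - alpha) ** n_tests

def bonferroni_alpha(overall_alpha, n_tests):
    """Significance level each individual test must pass (Bonferroni correction)."""
    return overall_alpha / n_tests

if __name__ == "__main__":
    n = 100
    print(expected_false_positives(n, 0.01))   # expected number of false positives
    print(familywise_error_rate(n, 0.01))      # chance of at least one false positive
    corrected = bonferroni_alpha(0.01, n)
    print(corrected)                           # per-test level after correction
    print(familywise_error_rate(n, corrected)) # overall rate after correction
```

Without correction, the chance that at least one of the 100 tests is a false positive is about 63%. With each test held to the 0.01% level, the overall rate stays just below 1%.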
False negatives: Homologous sequences in the databank that are not recognized as such. If there are only 12000 different protein families, on average a sequence should have (size of the databank)/12000 matches. In other words, the number of false negatives is probably very large.
- Each research group applies significance testing on its own. How can this lead to the decay of significance? How can this be corrected?
- Where to get genomes: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/ (check out the different file types (.faa .fna .ffn .ptt .gff) using A. veronii and A. salmonicida as examples)
- do example on cluster
- discuss the example analysis for pairwise genome comparison, which is here (discuss %identity, alignment length, # of identical aa)
- example of what else one could do with this:
- (perl script to extract top scoring hits is here -- extra credit: how would you modify the script, to print out not only the first reported match for a query, but all hits that have equally good E-values to the first one? Note: in Perl the operator for "logical and" is && and "not equal" for a number is != and "not equal" for a string is ne)
- (perl script to replace gi number with position on the genome is here)
- (perl script to make gnuplot is here, output for Thermotoga maritima vs Th. petrophila is here, plots for comparison between different Aeromonas species are here, here and here). Discuss: Which recombination events could have created these patterns?
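As a language-neutral illustration of the top-hit extraction discussed above, here is a minimal Python sketch (not the Perl script from class). It assumes the BLAST results are in the standard 12-column tab-delimited format, with the query ID in column 1, the subject ID in column 2 and the E-value in column 11, and that the hits for each query are listed best first. The function name top_hits and the sample data are hypothetical.

```python
def top_hits(lines):
    """Return {query: [hit rows]} with every hit whose E-value ties the first reported hit.

    Assumes tab-delimited BLAST output, E-value in column 11 (index 10),
    and hits for each query sorted best first.
    """
    best = {}  # query -> (best E-value, list of rows with that E-value)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, e_value = fields[0], float(fields[10])
        if query not in best:
            best[query] = (e_value, [fields])      # first reported match
        elif e_value == best[query][0]:
            best[query][1].append(fields)          # equally good E-value
    return {query: rows for query, (_, rows) in best.items()}

if __name__ == "__main__":
    # Hypothetical example rows, not real results
    sample = [
        "q1\ts1\t98.0\t100\t2\t0\t1\t100\t1\t100\t1e-50\t200",
        "q1\ts2\t97.0\t100\t3\t0\t1\t100\t1\t100\t1e-50\t198",
        "q1\ts3\t60.0\t100\t40\t0\t1\t100\t1\t100\t1e-10\t60",
        "q2\ts4\t80.0\t90\t18\t0\t1\t90\t1\t90\t1e-5\t45",
    ]
    for query, rows in top_hits(sample).items():
        print(query, [row[1] for row in rows])
```

The E-values are compared as numbers rather than as strings, which mirrors the distinction between != and ne in the Perl note above.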
Goals class 11:
- Appreciate that life may be older than the late heavy bombardment.
- Know how to adjust significance levels of individual experiments to avoid fishing expeditions.
- Know about the problem facing the scientific community through the underreporting of negative results.