Thursday, July 23, 2009

Thursday, October 16, 2008

Are You Ready for Statistics

A famous saying goes: "The world does not lack beauty; it lacks the eyes to discover it." Perhaps both the eyes that find beauty and the mind that senses the random world need to be trained.

Dr. Bradley Efron of Stanford University came to Iowa State three days ago for a biostatistics conference and to give the Lawrence Baker lectures at our university. He gave a talk titled "Learning the experiences from others", in which he presented several examples of empirical Bayes ideas applied to practical problems. His talk motivated me to look for more of his work, including this article.

This article points out that people used to believe science should be exact: how exactly time passes, how certain the speed of light is, and so on. Years ago, however, science began to unveil another side of its face, the science of randomness, which is at play nearly everywhere. Statistics demonstrates its principles all the time, as in the Palo Alto rainfall example given in the first paragraph of the article. Thus, the study of statistics should not be restricted to classrooms or textbook examples. It should be widely noticed and studied, and people who study statistics should also keep an open mind.

Monday, May 26, 2008

False Discovery Rate (FDR) Reliability Evaluation

In an analysis involving multiple tests, we can control the false discovery rate (FDR) with several approaches, such as the Benjamini-Hochberg adjusted p value (BH adj.p value) and the Storey-Tibshirani q value (q value), as well as the BH adj.p value and q value computed from a moderated t-test after empirical Bayes adjustment.

The question of interest is which method gives better control of the FDR. That is, when we want to control the FDR at level alpha, which method is more reliable, in the sense that the realized FDR does not deviate far from alpha?

To study this problem, we simulated the expression of 14,118 genes under two treatments, where the first n genes (n < 14,118) were designed to be truly differentially expressed. We then applied three methods for identifying differentially expressed genes: (1) q value based on the usual t-test, (2) BH adj.p value based on the usual t-test, and (3) q value based on the moderated t-test. The purpose is to examine the precision and accuracy with which these approaches control the FDR. One of the several results obtained is shown here (100 simulations of the expression of 14,118 genes under 2 treatments, with the first 3,000 genes truly differentially expressed and alpha set at 0.05).
Summary statistics are listed as follows:

(1) q value based on usual t statistics

mean.q :0.04565834

var.q :6.964762e-05

(2) BH adj.p value based on usual t statistics

mean.bh.p:0.03763231

var.bh.p:6.955762e-05

(3) q value based on moderated t statistics

mean.ebayes.q:0.04702102

var.ebayes.q:3.80829e-05

The above statistics tell us that, in this simulation, the q value based on the empirical Bayes moderated t-test gives the best precision and accuracy: its mean is the closest to the 0.05 target and its variance is the smallest. Similar simulations can be run many times to compare and evaluate these FDR control approaches.
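As an illustration of what method (2) computes, here is a minimal Python sketch of the Benjamini-Hochberg adjustment, together with a helper that measures the realized false discovery proportion in one simulated data set. This is not the code used for the simulation above; the function names `bh_adjust` and `false_discovery_proportion` are made up for this example, and the p-values are assumed to come from whatever test is being studied.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def false_discovery_proportion(adjusted, is_true_effect, alpha=0.05):
    """Fraction of rejected hypotheses that are truly null (0.0 if none rejected)."""
    rejected = [i for i, a in enumerate(adjusted) if a <= alpha]
    if not rejected:
        return 0.0
    false_hits = sum(1 for i in rejected if not is_true_effect[i])
    return false_hits / len(rejected)
```

Averaging `false_discovery_proportion` over many simulated data sets, and taking its variance, would give summary statistics of the same kind as those listed above.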

Sunday, April 20, 2008

Occurrence Competing Problem


Problem: Suppose we flip a coin repeatedly. Given two patterns of the same length composed of "Head" and "Tail", compute the probability of seeing the first pattern before the second one.

Solution: First-step Analysis.

Suppose the two strings are: HHTH, HHHT.

The Figure shows the graph of states and transition probabilities.

Define PH as the probability of seeing pattern HHTH before HHHT given that we are currently at state H.
Similarly, we can define PT, PHH, PHHH, PHHT, PHHHT, PHHTH.
Define P(H) and P(T) as the probabilities of emitting an H or a T. For a fair coin, we have P(H) = P(T) = 0.5.

Thus, we have the system of equations as follows:
PH = P(H)*PHH + P(T)*PT ---------(1)
(Explanation: when you stand at state H, you go to state HH with probability P(H) and to state T with probability P(T).)
PT = P(H)*PH + P(T)*PT ----------(2)
PHH = P(H)*PHHH + P(T)*PHHT -----------(3)
PHHH = P(H)*PHHH+P(T)*PHHHT -----------(4)
PHHT = P(H)*PHHTH+P(T)*PT ------------(5)
PHHHT = 0 ---------(6)
PHHTH = 1 ---------(7)

After plugging equations (6) and (7) into the first five equations, you get five equations in five unknowns. You can then solve this system of equations for the five unknowns, which are the probabilities of seeing the first pattern before the second one given the current state.

Thus, the probability of seeing pattern HHTH before the pattern HHHT is:
Prob{HHTH occurs before HHHT} = P(H)*PH + P(T)*PT.
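The system (1)-(7) can also be solved mechanically. The following Python sketch encodes equations (1)-(5), with (6) and (7) substituted in, as a 5x5 linear system and solves it exactly with fractions. The function names `solve_linear` and `hhth_before_hhht` are made up for this example, and the code assumes 0 < P(H) < 1.

```python
from fractions import Fraction


def solve_linear(a, b):
    """Solve the square system a x = b exactly by Gauss-Jordan elimination."""
    n = len(b)
    # Augmented matrix with every entry converted to an exact Fraction.
    m = [[Fraction(v) for v in row] + [Fraction(b[i])] for i, row in enumerate(a)]
    for col in range(n):
        # Find a row with a nonzero entry in this column and move it up.
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        scale = m[col][col]
        m[col] = [v / scale for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] for i in range(n)]


def hhth_before_hhht(p_heads=Fraction(1, 2)):
    """Probability that HHTH appears before HHHT, by first-step analysis."""
    p_h = Fraction(p_heads)
    p_t = 1 - p_h
    # Unknowns, in column order: PH, PT, PHH, PHHH, PHHT.
    # PHHHT = 0 and PHHTH = 1 are already substituted.
    a = [
        [1, -p_t, -p_h, 0, 0],      # (1) PH   - P(H)*PHH  - P(T)*PT   = 0
        [-p_h, 1 - p_t, 0, 0, 0],   # (2) PT   - P(H)*PH   - P(T)*PT   = 0
        [0, 0, 1, -p_h, -p_t],      # (3) PHH  - P(H)*PHHH - P(T)*PHHT = 0
        [0, 0, 0, 1 - p_h, 0],      # (4) PHHH - P(H)*PHHH = P(T)*0
        [0, -p_t, 0, 0, 1],         # (5) PHHT - P(T)*PT   = P(H)*1
    ]
    b = [0, 0, 0, 0, p_h]
    ph, pt, phh, phhh, phht = solve_linear(a, b)
    # The first flip lands us in state H or state T.
    return p_h * ph + p_t * pt


print(hhth_before_hhht())  # 1/3 for a fair coin
```

For a fair coin the solution is PH = PT = PHH = 1/3, PHHH = 0 and PHHT = 2/3, so HHTH comes first with probability 1/3.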

Friday, March 28, 2008

A Tutorial for Probability

http://www.math.uah.edu/stat/foundations/index.xhtml

Tuesday, March 18, 2008

Nugget

Interdisciplinary study is not a way to avoid hard work and deep understanding of one specific field. On the contrary, people who want to be experts in interdisciplinary study need to do a lot more work than others.

I am currently a graduate student in the Bioinformatics and Computational Biology program. Every year our program recruits around 10 students from hundreds of applicants from all over the world. After the first two years of training, no matter what their college background is, students are supposed to stand at the graduate level in computer science, biology, and statistics, and be able to talk about any of them with sufficient understanding.

This is hard and challenging, and the need for it will perhaps increase in the future. Bioinformatics serves only as an example here.

When I work with biologists on statistical analysis, even though I have taken several graduate-level biology courses, I still need to sit down and ask questions about the biology. Sometimes I wonder: what if I did not know this much biology and computer science? What if I were only a statistics person? Would I be able to do as much as I can do now? Would I be able to collaborate with people and communicate with them efficiently? Would I be able to read mathematical biology and computational biology papers with ease? I think the answer to all of these is no.

In our program, not many of the professors are well rounded in all three fields. It is also very hard to find qualified TAs to grade homework, since this field has just started.

Saturday, February 23, 2008

Shuffling a Sequence While Preserving Di- and Mono-letter Frequencies


There is a very interesting application of graph theory to a bioinformatics problem raised in our bioinformatics class.

Given a sequence over the alphabet A, T, G, C, how can we find all possible sequences that keep both the di-nucleotide and the mono-nucleotide frequencies?

Suppose we have the sequence CGTGAGC. Our aim is to preserve the counts of all 16 possible di-nucleotides and of the 4 mono-nucleotides, as well as the length of the sequence.

This can be viewed as a graph traversal question if we convert the sequence to a graph in which each vertex is one of the letters A, T, G or C, and each occurrence of a di-nucleotide is a directed edge from its first letter to its second. This idea is displayed in the above figure.

A sequence that keeps the di-letter frequencies corresponds to a walk that starts at one vertex and traverses every edge exactly once. If any vertex in the graph has an odd number of edges, that vertex must be the starting point or the ending point (as in Euler's seven bridges puzzle). Here, however, every vertex has an even number of edges, so starting from any vertex and traversing all edges gives a sequence that keeps the di-nucleotide frequencies. But only a walk that begins with C and ends with C also preserves the mono-letter frequencies. We can prove this by contradiction. Suppose we could find a sequence that keeps the di-letter frequencies but does not begin with C. Then C must appear in the middle of the sequence. In this example, CG must be in the middle, so a di-letter ending with C must come right before it to connect with CG. The only such di-letter is GC, which forms GCG. In the original sequence, the beginning C and the ending C are counted as two separate letters, but in GCG they are merged into a single C, so the mono-letter count of C drops by one. Thus, we keep the mono-nucleotide frequencies if and only if the new sequence begins and ends with the same letters as the original one.
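The argument above can be checked by brute force on small inputs. The sketch below is written in Python; the function name `dinucleotide_preserving_sequences` is made up for this example, and the backtracking search is just one simple way to enumerate the walks, suitable only for short sequences. It builds the directed multigraph of di-nucleotides and lists every walk that starts at the original first letter and uses each edge exactly once.

```python
from collections import Counter


def dinucleotide_preserving_sequences(seq):
    """List all sequences that start with seq[0] and have the same
    di-nucleotide counts as seq, found by backtracking over the edges."""
    # Directed multigraph: (first letter, second letter) -> number of edges.
    edges = Counter(zip(seq, seq[1:]))
    results = set()

    def walk(path, remaining):
        if remaining == 0:
            results.add("".join(path))
            return
        last = path[-1]
        # Snapshot the items so that counts can be updated while looping.
        for (a, b), count in list(edges.items()):
            if a == last and count > 0:
                edges[(a, b)] -= 1      # use this edge
                path.append(b)
                walk(path, remaining - 1)
                path.pop()              # undo and try the next edge
                edges[(a, b)] += 1

    walk([seq[0]], len(seq) - 1)
    return sorted(results)


print(dinucleotide_preserving_sequences("CGTGAGC"))
# ['CGAGTGC', 'CGTGAGC']
```

Both results begin and end with C and have the same mono-nucleotide counts as the original. By contrast, a walk that starts at G, such as GTGAGCG, uses the same six di-nucleotides but contains four G's and only one C.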