
There is a very interesting application of graph in solving a bioinformatics problem raised in our bioinformatics class.
Given a sequence constructed by the alphabet A, T, G, C, how to find all possible sequences which keep both the frequencies of di-nucleotide and mono-nucleotide.
Suppose we have sequence CGTGAGC. Our aim is to remain the occurring number of all possible 16 di-nucleotide frequencies, 4 mono-nucleotide frequencies, as well as the length of the sequence.
This could be viewed as a topology question if we convert it to a graph such that each vertex is a node A, T, G or C, and the occurance of each di-nucleotide is a directed bridge from the first letter to the second one. This idea is displayed in the above figure.
To find a sequence which keeps the di-letter frequencies should be the one starting from one vertex and traversing all edges. If any of the node in the graph has odd number of edges, this node should be the starting point or the ending point (Euler's seven bridge puzzle). While, here, every vertex has even number of edges. Starting from any vertex and traversing all edges would give us a sequence which keeps the di-nucleotide frequencies. However, if and only if one begins with C and ends with C can it give us a sequences which also remains the mono-letter frequencies. We can prove this by contradiction. Suppose we can find a sequence satisfying the di-letter frequencies and not begun with C, then, a C must appear in the middle. With respect to this example, CG must in the middle. Then, another di-letter which ends with C must be right ahead of it in order to connect with CG. So we have GC right ahead of CG, and form GCG. Since in the original sequence, the beginning C and the ending C are counted separately twice, and only once in the new sequence GCG, the mono-letter frequency of C is automatically deducted once. Thus, we could keep the mono-nucleotide frequency if and only if we begin and end the with the same letters of the original one(s).

0 comments:
Post a Comment