For this project, you should work in a two-person team, and each team will submit one project report. We ask that you not work on a project with the same partner twice during the course.
For our purposes, a gene is represented as a string of characters,
each of which is A, C, G,
or T. These represent nucleotides within a strand of
DNA, but you don't need to be concerned with the details: all you need
to know is that if two genes are similar, they will have largely the
same letters in the same order, though not necessarily appearing
consecutively, as the differences between the genes might include
areas where one gene has some extra letters inserted.
One reason why two genes may be similar is because they have an evolutionary relationship. In this case, they are said to be "homologous," or alternatively, that each is a "homolog" of the other. In this lab, you are going to be looking at two different kinds of homologs. An "ortholog" is a homolog in another species. For example, given how closely the human species Homo sapiens is related to the chimpanzee species Pan troglodytes, it shouldn't come as a surprise that biologists sequencing the chimpanzee genome have recently found genes that are orthologs of known human genes; you will see an example of such an ortholog in this lab. A "paralog," on the other hand, is a homolog that is in the same species as the original gene. That is, if a species has a pair of paralogous genes, that means they are near-duplicates. At some point in evolutionary history, the gene was duplicated, and since that point, there have been small changes in each copy. In this lab, you'll also be working with paralogs within the human genome.
Specifically, you'll be working with three genes, two from the human genome and one from the chimpanzee genome (drawn from a chimp named "Clint.") I downloaded all three reference sequences from the Entrez Nucleotide database at the National Center for Biotechnology Information (NCBI). The two human genes are known as HBG1 and HBG2. As their similar names suggest, they are paralogs; in the course of the lab, you'll find out just how similar they are. Both encode proteins used in fetal hemoglobin (the oxygen-carrying constituent of red blood cells); fetal hemoglobin is somewhat different from the adult version. The chimp gene is an ortholog of both human genes, though it is somewhat more similar to one of them than to the other. In this lab, you will be determining which of the human genes is a closer ortholog to the chimp gene. For that reason, I won't tell you the name of the chimp gene, which might give away the answer; instead I'll call it GenX. I've linked to this lab handout a file with Scheme definitions for the three sequences; in shortened form, they are as follows:
(define HBG1 "ACACT...ATCAC") ;full version has 1586 characters (define HBG2 "ACACT...ATCAC") ;full version has 1591 characters (define GenX "AGGGG...TGCAT") ;full version has 1775 characters
There are many different measures of similarity that are used in bioinformatics, each of which would produce somewhat different results. In this lab, you will use the simplest of those measures, namely the length of the longest common subsequence. The same general technique you will use to compute the longest common subsequence length can also be used for the more sophisticated measures of similarity.
The longest common subsequence can best be explained using an illustration. In the following diagram, two sequences are shown with a longest common subsequence of length 6:
You can tell that ACTCCA is a longest common subsequence in this example because the connecting lines follow these rules:
Each line connects two identical characters, one in each sequence; thus, we are identifying characters that the sequences have in common.
No two lines cross; this ensures that the characters appear in the same order in both sequences, forming a subsequence of each.
There are as many lines as possible, subject to the first two rules, so that the common subsequence is the longest that is possible.
In this lab, you will find the longest common subsequence length for HBG1 and HBG2 and express this as a percentage of the genes' lengths, in order to measure how similar these paralogs are. You will then find the longest common subsequence length for GenX with each of HBG1 and HBG2, in order to determine which of the human genes is the closer ortholog of the chimpanzee gene.
In order to compute the longest common subsequence of two strings, you can start by comparing their last characters. For example, in trying to find the longest common subsequence of TCGCGA and GTCCA, you can start by noticing that both have A as their last letter. Because of this, you know that this A can be included in a longest common subsequence. As such, the longest common subsequence length can be found by adding 1 to the longest common subsequence length for TCGCG and GTCC, which are formed by omitting the last character from each of the original strings. Using LCS as a mathematical function representing the length of the longest common subsequence of two strings, we have the following equation:
LCS("TCGCGA", "GTCCA") = LCS("TCGCG", "GTCC") + 1
Be sure you understand the logic behind this equation. It can be used to calculate the value of the left hand side (4) by using the right hand side (3+1).
On the other hand, what if the last characters from the two strings differ? Then at least one of them must be omitted from the longest common subsequence. We can try leaving each of them out and see which gives the larger result. For example,
LCS("TCGCG", "GTCC") = max(LCS("TCGC", "GTCC"), LCS("TCGCG", "GTC"))
In this particular case, the first option, where string 1 is shortened, provides a larger result than the second option, where string 2 is shortened. That is because the first LCS appearing inside the max(...) notation has value 3, whereas the second one has value 2. Again, make sure you understand what is going on here. In general, both options need to be computed and whichever is larger chosen.
Before this project, all we have done with strings is to display them. Now we
need to do more. The procedure string-ref can tell you
what character is in a particular position within the string. The
positions are numbered starting from 0, just like with vector-ref.
For example, (string-ref "CAT" 0) would evaluate to the
character C, and (string-ref "CAT" 2) would
evaluate to the character T. These two characters are written
in Scheme as #\C and #\T respectively.
Characters are a new kind of data, which you have not previously seen.
All you will need to do with them is test them for equality, which you
can do using the predicate equal?.
The following section also makes use of string-length,
which tells how many characters are in a string. For example,
(string-length "CAT") evaluates to 3. However, you will
not need to make any other use of this procedure in your own code.
As described earlier, the LCS of two strings is found in terms of the LCS of shorter strings, where a string is shortened by leaving off its last character. It would be possible to follow this approach literally in the program, at each step making a new string that contains a copy of all but the last character of the old string. However, that would be very inefficient. Instead, we'll take the approach of telling the LCS procedure how many characters from each string to pay attention to. It will simply ignore any additional characters in the strings. Therefore, the strings can be left unchanged throughout the computation. We'll simply pay attention to progressively fewer characters by subtracting 1 from the count of characters to use.
Our main procedure can be defined as follows:
(define lcs-of-prefixes
(lambda (string1 n1 string2 n2)
;; This procedure (once you write it) will
;; compute the the LCS length for the
;; first n1 characters of string1 and
;; the first n2 characters of string2,
;; ignoring any additional characters
;; in string1 and string2. It will do so
;; by recursive computations that focus
;; in on n1 - 1 characters of string1
;; and/or n2 - 1 characters of string2.
...))
This procedure is the only one you really need; you can test the similarity of the genes by using expressions such as
(lcs-of-prefixes HBG1 1586 HBG2 1591)
to compare all 1586 characters of HBG1 with all 1591 characters of HBG2. However, rather than requiring your user to know how long each string is, we can provide them with a simplified user-interface procedure:
(define lcs
(lambda (string1 string2)
(lcs-of-prefixes string1 (string-length string1)
string2 (string-length string2))))
There are only two little problems: (1) You still need to
write lcs-of-prefixes. (2) If it is going to work for
strings containing thousands of characters, like the genes, it had
better scale up reasonably well.
To start with, write a recursive version
of lcs-of-prefixes by translating into Scheme the
earlier English description of how to calculate the length of a
longest common subsequence. Remember, you are only supposed to be
paying attention to the first n1 and n2
characters in the strings, so where the English description talks
about comparing the last characters, you really should be comparing
the last character of the first n1 characters with the
last character of the first n2 characters.
For testing your procedure, you should try some
reasonably short strings; for example, we saw earlier that
(lcs "TCGCGA" "GTCCA") should evaluate to 4.
It would
also be interesting to see how the procedure scales up in the two
simplest cases: totally identical strings and totally different
strings. For this testing, we can define a string containing 1000
copies of A and one containing 1000 copies
of T:
(define As (make-string 1000 #\A)) (define Ts (make-string 1000 #\T))
Now you could try experiments such as measuring the similarity of 10 As with 5 As or measuring the similarity of 10 As with 10 Ts. Make sure you know what the correct answer would be in each case and check that your procedure works correctly. Also, you should see that the time the procedure takes scales up very differently with increasing prefix lengths in these two cases. Looking at the way LCS is calculated, can you explain why? Where does tree-recursion show up in your procedure?
You could try using your procedure to answer the original biological questions which motivated this lab, but I expect you will find it is not fast enough. Given what you discovered about how rapidly the tree-recursive procedure's runtime can grow with increasing problem size, this shouldn't come as a huge surprise. Nor, given the current topic in our course, should the solution.
Save your inefficient version of lcs-of-prefixes so
that you can later compare its performance with your new version.
Write the new version by using either memoization or dynamic
programing.
Test that the new procedure gives the same answers as the old one, for all the same test cases you used before. Also, verify that the new procedure scales up much better in the situation where the old version scaled up poorly. Even without doing quantitative timing tests, you should find that you have the patience for much longer strings.
Now you should be able to answer the biological questions. How similar are HBG1 and HBG2? That is, what percentage of each of their lengths is accounted for by the longest common subsequence? And, which of them is GenX more similar to? (That is, which of them has a longer common subsequence with GenX?)
To get a
more precise sense of how poorly the tree-recursive process scales up,
and how much better your new version is,
you should time how long the following procedure takes for varying
values of n
and graph your results, using each of the two versions of
lcs-of-prefixes:
(define lcs-AsTs
(lambda (n)
(lcs-of-prefixes As n Ts n)))
You should do the timings using the
time procedure in the file ~mc27/public/time.scm;
this is the one you used in the MCS-177 lab concerning perfect
numbers.
You should use semi-log, log-log and regular graph paper found in http://homepages.gac.edu/~wolfe/courses/graph-paper/. Recall that
When plotted on regular graph paper, only linear data of the form an+b should look straight.
When plotted on log-log paper, polynomials of the form anb should appear straight. The slope of the line is affected by b.
When plotted on semi-log paper (with the y-axis a logarithmic scale), exponentials of the form abn should appear straight. The slope of the line is affected by b.
In any case, be sure to choose axes so that the straightest curve has a nice slope heading northeast on the paper.
Try to predict, by looking at your graphs, the running times of your two versions. Solve for b in anb or abn by looking at the graphs. Predict how long the string prefixes could be and still get the computation done in a day's time with the two programs.
If you are able to analytically deduce the running time of your programs (in Theta-notation) by analyzing the programs rather than the data, that's a bonus. Another extra-credit opportunity is if you can make your efficient version use only thousands of table locations rather than millions.
Write up a scientific project report that explains the data you collected and any conclusions you have drawn from your data. In your report, your primary goal is to compare the performance of the two versions of the program; the underlying biological questions about the genes' similarity are secondary. (You can briefly discuss the biological questions, which serve as a motivation for why you need a program that will take a reasonable length of time when presented with thousands of characters.)
To make your quantitative data easy to interpret, present it both graphically and tabularly. You should attempt to compare the performance of the two versions of your program using your experimental results. To facilitate the comparison, the data from both versions should appear on each graph or table.
The gradesheet for the project is available in PostScript or PDF format. (If you print a copy out, you can staple it to the front of your project report to save paper.)
I am grateful to Professor Colleen Jacks of the Biology Department for correcting errors in an earlier draft of this assignment. Those that remain are my own fault. I am also grateful to the many scientists who were involved in making the data available.