DNA Sequence Alignment Checklist
Frequently Asked Questions
What are the main goals of this assignment? You will (i) solve a fundamental problem in computational biology, (ii) learn about the analysis of algorithms, and (iii) learn about a powerful programming paradigm known as dynamic programming.
How do I read in the two input strings from the file? Use readLine() and redirection as usual.
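A minimal sketch of reading the two strings, assuming each sequence sits on its own line of the input. The helper name readTwoSequences is hypothetical and not part of the assignment:

```kotlin
// Hypothetical helper: returns the first two lines of the input as the two
// sequences to align. Assumes one sequence per line; nothing is trimmed, so
// every character on the line is kept.
fun readTwoSequences(lines: Sequence<String>): Pair<String, String> {
    val iter = lines.iterator()
    require(iter.hasNext()) { "expected a first sequence" }
    val x = iter.next()
    require(iter.hasNext()) { "expected a second sequence" }
    val y = iter.next()
    return Pair(x, y)
}
```

With redirection, one way to call it is `val (x, y) = readTwoSequences(generateSequence(::readLine))`, which keeps calling readLine() until the input is exhausted.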
How do I access the length of a string s? The ith character? Use s.length and s[i], respectively. As with arrays, indices start at 0.
Can I assume that the input characters will always be A, C, G or T? NO! Your program should work equally well for any letter, upper case or lower case. In fact, it should work for all characters.
What's a StringIndexOutOfBoundsException? It's just like an ArrayIndexOutOfBoundsException. It results from invoking s[i] with an illegal value of i, that is, one below 0 or at least s.length.
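To make the valid range concrete, here is a small sketch with a hypothetical helper, charAtOrNull, that guards the access instead of letting the exception be thrown:

```kotlin
// Valid indices for a string s run from 0 to s.length - 1.
// Hypothetical helper: returns the character at index i, or null when i is
// outside that range, instead of throwing StringIndexOutOfBoundsException.
fun charAtOrNull(s: String, i: Int): Char? =
    if (i in s.indices) s[i] else null
```

In the dynamic programming loops themselves it is usually better to get the loop bounds right than to guard every access. A helper like this is mainly useful while tracking down an off-by-one error.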
How do I know if my opt array contains the correct values? You may want to define a private function that prints a two-dimensional array. For debugging purposes, call this function whenever you want to print the values of your opt array. NOTE: you probably only want to do this on small examples (e.g., input strings of length less than 10).
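One possible shape for such a debugging function is sketched below. The name formatTable and the column width are assumptions, and the function can be marked private in your own file:

```kotlin
// Hypothetical debugging helper: formats a two-dimensional Int array with one
// row per line and every entry right-aligned in a column of the given width.
fun formatTable(opt: Array<IntArray>, width: Int = 4): String =
    opt.joinToString("\n") { row ->
        row.joinToString("") { it.toString().padStart(width) }
    }
```

Calling `println(formatTable(opt))` after filling the table prints it in a grid that can be compared by hand against a small worked example.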
Which alignment should I print out if there are two or more optimal ones? Output any one you like.
Where can I learn more about dynamic programming? The Longest Common Subsequence (LCS) problem is another example of a dynamic programming problem on strings. However, it is different from the current problem in many ways, so do not simply mimic the code without understanding what it does.
Memory, Timing, and Operating System Issues
What does OutOfMemoryError mean? When java (the Java Virtual Machine) runs, it requests a certain amount of memory from the operating system. The exact amount depends on the version of java and your computer but can vary from 64MB to 1024MB (1GB). After java has started, the total size of all variables in use cannot be larger than what it originally requested. Trying to exceed that amount causes an OutOfMemoryError.
For this assignment, the largest test cases use huge arrays, and Java needs to ask for enough memory from the operating system. To explicitly ask for more (or less) memory, use the -Xmx flag. For example, to request 500 megabytes (500 MB) of memory for a run, use
kotlin -J-Xmx500m DPEditDistanceKt < input.txt
Here 500m means 500 MB. You should adjust this number depending on the amount of memory your computer has and the size of the arrays you will need for the data set you are running. The amount 500MB should get you through ecoli10000.txt. To run ecoli20000.txt you will need to request more memory.
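To pick a value for -Xmx, a back-of-the-envelope estimate helps. The sketch below assumes the table is an (m+1)-by-(n+1) array of Int at 4 bytes per entry and ignores per-array overhead. The function names are hypothetical:

```kotlin
// Rough lower bound on the heap needed for an (m+1)-by-(n+1) Int table,
// assuming 4 bytes per Int and ignoring object headers and other data.
fun optTableBytes(m: Int, n: Int): Long = (m + 1L) * (n + 1L) * 4L

// Converts bytes to megabytes (1 MB = 1024 * 1024 bytes), rounding up.
fun toMegabytes(bytes: Long): Long {
    val mb = 1024L * 1024L
    return (bytes + mb - 1) / mb
}
```

Under these assumptions, two strings of length 10000 need about 382 MB, which fits in the 500 MB suggested above. Two strings of length 20000 need about 1527 MB, roughly four times as much.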
What does "Could not reserve enough space for object heap" mean? This occurs if you use -Xmx with a value that is larger than the amount of available physical memory. Additionally, due to address space limitations, some 32-bit versions of Windows also will give this error if you try to request more than approximately 1.5GB, no matter how much physical memory is installed.
How do I determine how much physical memory is installed on my computer? On Mac, select About this Mac from the Apple menubar. On Windows, press Windows-R (or Run on the Start menu), enter msinfo32 and look for total physical memory.
I'm getting a stack overflow error. What should I do? Ask java for more stack space. For example, to ask for 5MB of stack space, type
kotlin -J-Xss5m MemEditDistanceKt < input.txt
Adjust the amount of stack space requested as needed.
How can I measure how long my program takes on each file? To measure the running time of your program, there are a few techniques.
The simplest is to use
kotlin -J-Xmx500m MemEditDistanceKt < input.txt > output.txt
and use a stopwatch. We redirect the output to a file to prevent printing text from becoming a bottleneck.
A second technique, which we think probably best suits the needs of this application, is to use the -Xprof runtime switch, which asks Java to print out timing data about the run. (This switch was removed in JDK 10, so it works only on older JVMs.) To use this with output redirection, type
kotlin -J-Xprof -J-Xmx500m DPEditDistanceKt < input.txt > output.txt
The timing information will appear after the program’s output in the file output.txt, and you want the "flat profile" for main. The line will look like
Flat profile of 4.5 secs (15 total ticks): main
but these numbers are made up and yours will be different. We don't care about the ticks.
Piping can be useful here. You can skip the output file in the previous step by piping your output to another program that looks for the "Flat profile" line for main and prints it, throwing away all the other program output. On a Mac, run
kotlin -J-Xprof -J-Xmx500m MemEditDistanceKt < input.txt | grep "main"
On Windows, use find instead of grep:
kotlin -J-Xprof -J-Xmx500m DPEditDistanceKt < input.txt | find "main"
This find/grep command searches through all of whatever text it is fed and only prints out the lines containing the text main. Type man grep (in Terminal) or find /? (in Command Prompt) for more information.
As a third technique, you can use the kotlin.system.measureTimeMillis function; see its usage in lcs.zip for an example.
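For readers without lcs.zip at hand, a minimal sketch of the third technique follows. Here busyWork is a hypothetical stand-in for the real computation, and the timing is written to standard error so that redirected output stays clean:

```kotlin
import kotlin.system.measureTimeMillis

// Hypothetical stand-in for the computation being timed (quadratic in n).
fun busyWork(n: Int): Long {
    var sum = 0L
    for (i in 0 until n) for (j in 0 until n) sum += (i xor j)
    return sum
}

// Times only the computation: read the input before this call and print the
// output after it, so that I/O is excluded from the measurement.
fun timeBusyWork(n: Int): Long {
    var result = 0L
    val millis = measureTimeMillis { result = busyWork(n) }
    System.err.println("result = $result, elapsed = $millis ms")
    return millis
}
```

measureTimeMillis reports wall-clock time in whole milliseconds, so very short runs may show up as 0.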
My timing data do not fit a polynomial hypothesis. What could I be doing wrong?
- If you are running your program and accessing the data files from the Windows H: drive (especially if via a wireless network), the bottleneck for medium-sized test cases might be the network latency instead of the dynamic programming algorithm! Do one of the following:
  - Use a Stopwatch or kotlin.system.measureTimeMillis to specifically isolate the time taken after the input is read (after all calls to readLine) and before any output is printed. Remember to remove the time printing statements before submitting the final version of your code.
  - Or, copy all files to a folder on a local hard drive.
  - Or, report your problems and the data you obtained.
- When you run out of physical memory, your operating system may start using your hard drive as another form of storage. Accessing information from the hard drive is substantially slower than main memory, and you may be observing this effect. Avoid running extraneous complicated programs (media players, file sharing clients, word processors, web browsers, etc) while doing the timing tests if this seems to be a problem.
- Make sure you are using output redirection or piping (as in the examples above) to prevent printing text from becoming a bottleneck.
- Very small test cases are hard to use since the Java virtual machine takes a nontrivial amount of time to start, and since the processor "cache" may make small test cases run an order of magnitude faster than expected. If in doubt, use the test cases that take between 0.1 and 10.0 seconds.
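One way to check timing data against a polynomial hypothesis is to estimate the exponent from two measurements. The sketch below assumes the running time follows T(N) = a * N^b. The function name is hypothetical:

```kotlin
import kotlin.math.ln

// Estimates the exponent b in the hypothesis T(N) = a * N^b from two timing
// measurements: t1 seconds at input size n1 and t2 seconds at input size n2.
fun estimateExponent(n1: Int, t1: Double, n2: Int, t2: Double): Double =
    ln(t2 / t1) / ln(n2.toDouble() / n1.toDouble())
```

For example, if doubling the input size from 2500 to 5000 takes the time from 0.5 to 2.0 seconds (made-up numbers), the estimate is 2, which is what a quadratic algorithm should produce. Estimates that drift well away from a constant as N grows suggest one of the problems listed above.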
Testing and Debugging
Testing. To help you check the part of your program that generates the alignment, there are many test files in the sequence directory.
- Many of the small files are designed so that it is easy for you to determine what the correct answer should be by hand. Test your program on these cases to see that it gets these easy cases right.
Here are the optimal edit distances of several of the supplied files.
ecoli2500.txt   118
ecoli5000.txt   160
fli8.txt        6
fli9.txt        4
fli10.txt       2
ftsa1272.txt    758
gene57.txt      8
stx1230.txt     521
stx19.txt       10
stx26.txt       17
stx27.txt       19
- The test case worked through as an example in the assignment description, which is the same as the example10.txt file, has a unique optimal alignment. (Some test inputs like "xx y" have more than one optimal alignment.) So your code should give the exact same output on example10.txt as in the assignment page. Here are two more test cases with unique optimal alignments:
$ kotlin DPEditDistanceKt < sequence/endgaps7.txt
Edit distance = 4
a - 2
t t 0
a a 0
t t 0
t t 0
a a 0
t t 0
- a 2

$ kotlin MemEditDistanceKt < sequence/fli10.txt
Edit distance = 2
T T 0
G G 0
G G 0
C T 1
G G 0
G G 0
A T 1
A A 0
C C 0
T T 0
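A quick sanity check on any printed alignment is that the per-column penalties add up to the reported edit distance. The sketch below assumes the penalties visible in the sample output (2 for a gap, 1 for a mismatch, 0 for a match) and assumes '-' appears only as the gap marker. The function name is hypothetical:

```kotlin
// Sums the per-column penalties of an alignment given as two equal-length
// rows, where '-' marks a gap. Assumed penalties: gap 2, mismatch 1, match 0.
fun alignmentCost(top: String, bottom: String): Int {
    require(top.length == bottom.length) { "rows must have equal length" }
    var cost = 0
    for (i in top.indices) {
        cost += when {
            top[i] == '-' || bottom[i] == '-' -> 2
            top[i] != bottom[i] -> 1
            else -> 0
        }
    }
    return cost
}
```

Reading the two columns of the endgaps7.txt output as the rows "atattat-" and "-tattata" gives a cost of 4, and the fli10.txt rows give 2, matching the reported distances. This checks internal consistency only. It does not prove that the alignment is optimal.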
Enrichment
- The idea of dynamic programming was first advanced by Bellman (1957). Levenshtein (1966) formalized the notion of edit distance. Needleman and Wunsch (1970) were the first to apply edit distance and dynamic programming to aligning biological sequences, and our algorithm is essentially the one proposed in their seminal paper. The widely used Smith-Waterman (1981) algorithm is quite similar, but solves a slightly different problem (local sequence alignment instead of global sequence alignment).
- The same technology is employed in spell checkers and to identify plagiarism in many courses.
- The genetic data are taken from GenBank. The National Center for Biotechnology Information also contains many examples of such databases and alignment software.
- With a little work, you can compute the optimal cost in quadratic time but using only linear space (do we need the whole opt matrix?). With more work, you can also compute the optimal alignment in linear space (and quadratic time). This is known as Hirschberg's algorithm (1975).
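The first observation can be sketched as follows: each row of the table depends only on the row before it, so two rows are enough to obtain the optimal cost. This is not Hirschberg's algorithm, since it recovers the distance but not the alignment. The penalties (2 per gap, 1 per mismatch, 0 per match) and the function name are assumptions:

```kotlin
// Computes the edit distance with two rows instead of the full opt matrix.
// Assumed penalties: gap 2, mismatch 1, match 0.
fun editDistanceLinearSpace(x: String, y: String): Int {
    // prev[j] = cost of aligning the empty prefix of x with the first j chars of y
    var prev = IntArray(y.length + 1) { j -> 2 * j }
    var curr = IntArray(y.length + 1)
    for (i in 1..x.length) {
        curr[0] = 2 * i  // first i chars of x against the empty prefix of y
        for (j in 1..y.length) {
            val pairCost = if (x[i - 1] == y[j - 1]) 0 else 1
            curr[j] = minOf(
                prev[j - 1] + pairCost,  // align x[i-1] with y[j-1]
                prev[j] + 2,             // align x[i-1] with a gap
                curr[j - 1] + 2          // align y[j-1] with a gap
            )
        }
        // Swap the rows so that curr becomes prev for the next iteration.
        val tmp = prev
        prev = curr
        curr = tmp
    }
    return prev[y.length]
}
```

The sketch works on prefixes. A version that works on suffixes, filling the table from the bottom right, yields the same distance. Memory use is proportional to the length of y rather than to the product of the two lengths.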