Project 5: Global Sequence Alignment
Start: Thursday 10/30; Due: Monday 11/17, by the beginning of class
Overview
This assignment is also taken from the projects on the book's web site. In this case, the assignment is the Global Sequence Alignment. Be sure to read comments below for some clarifications and suggestions.
In order to make sense of the questions in the readme.txt file, you will need to read Section 4.1 of the textbook.
You should work on this project individually.
Getting Started
You should get started by reading through Global Sequence Alignment and working through one or two examples by hand. In other words, come up with some short DNA sequences and compute their edit distance using the table method illustrated in the write-up for the sequences AACAGTTACC and TAAGGTCA. You do not have to hand this in, and you may work with a partner on this part. Note that one of the formulas includes 0/1; this does not mean the division of 0 by 1, but rather means either 0 or 1 depending on a condition listed earlier in the write-up.
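The 0/1 term can be read as a small conditional. The sketch below is illustrative only: the class name and the helpers mismatchPenalty and bestOfThree are hypothetical, and the gap cost of 2 is an assumption taken from the sample output later in this assignment, where a row containing a gap is marked 2 and a mismatched row is marked 1.

```java
final class PenaltySketch {
    // Assumed gap cost, read off the sample output (rows with '-' are marked 2).
    static final int GAP = 2;

    // The "0/1" term: 0 if the two characters match, 1 if they differ.
    static int mismatchPenalty(char a, char b) {
        return (a == b) ? 0 : 1;
    }

    // Cheapest of the three ways to fill one table cell:
    // align the two characters, or put a gap in one of the strings.
    static int bestOfThree(int diagonal, int down, int right, char a, char b) {
        int alignBoth = diagonal + mismatchPenalty(a, b);
        int gapInOne = down + GAP;
        int gapInOther = right + GAP;
        return Math.min(alignBoth, Math.min(gapInOne, gapInOther));
    }

    public static void main(String[] args) {
        System.out.println(mismatchPenalty('A', 'A'));
        System.out.println(mismatchPenalty('A', 'T'));
    }
}
```

When you fill in a table by hand, each cell is computed this way from three neighboring cells that you have already filled in.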
Required Structure
Instead of giving you a template Java file to work from, we are going to have you create the code more or less from scratch. However, in order to enforce some uniformity (which will make the grading easier), we request that you do certain things:
- The name of the Java project should be "Project5".
- The name of the main class should be "EditDistance".
- You should create a "documents" folder in Project5 where you will put a "readme.txt" file described in the "What you must do/hand in" section below.
- You should create a "lib" folder in Project5, put stdlib.jar into it, and put stdlib.jar into the build path for the project.
- You should have a "resources" folder in Project5 (just like in Project3) that contains the contents of sequence.zip from the book's web site.
- Your final program should take a command-line argument specifying an input file, and it should print out the edit distance and an optimal alignment, displayed in a vertical format as in the following example.
    Edit distance = 7
    A T 1
    A A 0
    C - 2
    A A 0
    G G 0
    T G 1
    T T 0
    A - 2
    C C 0
    C A 1
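To make the vertical format concrete, here is a minimal sketch that prints two already-aligned strings row by row. It does not compute an alignment. The class AlignmentFormatSketch and its methods are hypothetical names, and the costs (2 for a gap, 1 for a mismatch, 0 for a match) are assumed from the example above. The total it prints is the cost of the alignment it is given, which equals the edit distance only when that alignment is optimal.

```java
final class AlignmentFormatSketch {
    static final char GAP_CHAR = '-';

    // Cost of one column: 2 if either side is a gap, otherwise 0 (match) or 1 (mismatch).
    static int columnCost(char a, char b) {
        if (a == GAP_CHAR || b == GAP_CHAR) {
            return 2;
        }
        return (a == b) ? 0 : 1;
    }

    // Builds the header line followed by one "x y cost" row per column.
    static String format(String alignedX, String alignedY) {
        if (alignedX.length() != alignedY.length()) {
            throw new IllegalArgumentException("aligned strings must have equal length");
        }
        StringBuilder rows = new StringBuilder();
        int total = 0;
        for (int k = 0; k < alignedX.length(); k++) {
            char a = alignedX.charAt(k);
            char b = alignedY.charAt(k);
            int cost = columnCost(a, b);
            total += cost;
            rows.append(a).append(' ').append(b).append(' ').append(cost).append('\n');
        }
        return "Edit distance = " + total + "\n" + rows;
    }

    public static void main(String[] args) {
        // The alignment from the example, with gaps already inserted.
        System.out.print(format("AACAGTTACC", "TA-AGGT-CA"));
    }
}
```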
You will read in the file like you did in Project3. You will also probably want to use LCS as a model for your programming. (Here is the archived Eclipse project for LCS including in particular LCS.java.)
You can use the following code to read from the data files included in sequence.zip.

    import java.io.FileInputStream; // put this at the top of your source code outside the class definition

    public static void main(String[] args) {
        try {
            System.setIn(new FileInputStream("resources/" + args[0]));
        } catch (Exception e) {
            System.err.printf("Exception caught: %s\n", e.toString());
            System.exit(0);
        }
        String x = StdIn.readLine();
        String y = StdIn.readLine();
        printEditDistance(x, y);
    }
- For running long strings such as those in ecoli10000.txt, Java may run out of heap. To increase heap size, you can add the following option to "VM Arguments" in "Run Configuration". This is below where you usually put arguments for your Java program.

      -Xmx1000m
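As a rough check on why the heap fills up, the sketch below estimates the memory needed by a table of int with (N+1) rows and (N+1) columns, and prints the heap limit that the running JVM reports. Storing the table as a two-dimensional int array is an assumption about your implementation, the class and method names are hypothetical, and the estimate ignores array headers.

```java
final class HeapEstimateSketch {
    // Approximate bytes for an (n+1) x (n+1) table of 4-byte ints (array headers ignored).
    static long tableBytes(int n) {
        long side = n + 1L;
        return 4L * side * side;
    }

    public static void main(String[] args) {
        long needed = tableBytes(10000);
        long available = Runtime.getRuntime().maxMemory();
        System.out.printf("table for N = 10000: about %d MB%n", needed / (1024 * 1024));
        System.out.printf("max heap reported by the JVM: %d MB%n", available / (1024 * 1024));
    }
}
```

For N = 10000 the table alone comes to roughly 400 million bytes, which is why a limit on the order of 1000m leaves room for it.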
Required Methods
The class EditDistance should correctly implement the following methods:

    /**
     * @param x a non-null String
     * @param y a non-null String
     * @return the edit distance between x and y
     *
     * This procedure should use a recursive, not dynamic programming, approach
     * to compute the edit distance
     */
    private static int recursiveEditDistance(String x, String y)

    /**
     * @param x a non-null String
     * @param y a non-null String
     * @return the edit distance between x and y
     *
     * This procedure should use dynamic programming to compute the edit distance
     */
    private static int editDistance(String x, String y)

    /**
     * @param x a non-null String
     * @param y a non-null String
     *
     * This procedure should use dynamic programming to compute the edit distance
     * and print it and an optimal alignment in the vertical format shown in the
     * project assignment.
     * NOTE: There may be multiple optimal alignments.
     * This procedure needs to print one optimal alignment.
     */
    private static void printEditDistance(String x, String y)

    /**
     * @param x a non-null String
     * @param y a non-null String
     *
     * Prints out the edit distance between x and y and the time taken to compute it
     * using the recursive version recursiveEditDistance
     */
    public static void timeRecursiveEditDistance(String x, String y)

    /**
     * @param x a non-null String
     * @param y a non-null String
     *
     * Prints out the edit distance between x and y and the time taken to compute it
     * using the dynamic programming version editDistance
     */
    public static void timeEditDistance(String x, String y)

    /**
     * @param dnaLength a non-negative int
     * @return a random String of length dnaLength comprised of the four chars A, T, G, and C
     */
    public static String randomDNAString(int dnaLength)
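For the two helper-style requirements (random input and timing), the sketch below shows one possible shape. It is not the required implementation. The class HelperSketch and the method timeMillis are hypothetical names, and using java.util.Random and System.nanoTime is one choice among several.

```java
import java.util.Random;

final class HelperSketch {
    private static final char[] BASES = {'A', 'T', 'G', 'C'};
    private static final Random RANDOM = new Random();

    // Builds a string of the requested length by picking one of the four bases at random.
    static String randomDNAString(int dnaLength) {
        StringBuilder dna = new StringBuilder(dnaLength);
        for (int i = 0; i < dnaLength; i++) {
            dna.append(BASES[RANDOM.nextInt(BASES.length)]);
        }
        return dna.toString();
    }

    // Runs the task once and returns the elapsed wall-clock time in milliseconds.
    static double timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1e6;
    }

    public static void main(String[] args) {
        String x = randomDNAString(12);
        String y = randomDNAString(12);
        // Stand-in task; in the project this would be a call to one of the edit-distance methods.
        double elapsed = timeMillis(() -> x.compareTo(y));
        System.out.printf("%s %s took %.3f ms%n", x, y, elapsed);
    }
}
```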
Gradesheet
Code (50 pts)
- recursiveEditDistance (10 pts)
- editDistance (10 pts)
- printEditDistance (20 pts)
- timeRecursiveEditDistance, timeEditDistance, randomDNAString (10 pts)
Comments and Style (20 pts)
- Adequate comments (5 pts)
- Single copy of code used by editDistance and printEditDistance (10 pts)
- Choice of names, indentation, other style issues (5 pts)
Questions in readme.txt (30 pts)
- running time of recursiveEditDistance (10 pts)
- distance and timing for the various E. coli data files (5 pts)
- doubling hypothesis (5 pts)
- running time estimation (10 pts)
Total: 100 pts
Submission
You should submit via Moodle. You must submit the whole project with the EditDistance class (as a zip file).
Again, here are our step-by-step instructions.
You must also answer the questions in the attached readme.txt file.
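For the doubling hypothesis and running time estimation items on the gradesheet, the arithmetic can be sketched as follows. This assumes the running time follows a power law T(N) = a * N^b, which is the usual form of a doubling hypothesis. The class and method names are hypothetical, and the timings in main are made-up placeholders rather than measurements.

```java
final class DoublingSketch {
    // Estimates the exponent b in T(N) = a * N^b from timings at N and 2N:
    // T(2N) / T(N) = 2^b, so b = log2 of the ratio.
    static double estimateExponent(double timeAtN, double timeAt2N) {
        return Math.log(timeAt2N / timeAtN) / Math.log(2.0);
    }

    // Predicts the time for an input that is `factor` times larger than the measured one.
    static double predict(double timeAtN, double exponent, double factor) {
        return timeAtN * Math.pow(factor, exponent);
    }

    public static void main(String[] args) {
        // Placeholder timings in seconds, not real measurements.
        double b = estimateExponent(1.0, 4.0);
        System.out.printf("estimated exponent: %.2f%n", b);
        System.out.printf("predicted time at 10x the size: %.1f s%n", predict(1.0, b, 10.0));
    }
}
```

Substitute your own measured times before drawing any conclusion, and measure at several sizes, since a single pair of timings can be noisy.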