Global Sequence Alignment
Start: 10/27/2016
Due: 11/9/2016, by the beginning of class
Overview
This assignment is also taken from the projects on the book’s web site: the Global Sequence Alignment project. Be sure to read the comments below for clarifications and suggestions.
In order to make sense of the questions in the readme.txt file, you will need to read Section 4.1 of the textbook.
You should work on this project individually.
Getting Started
You should get started by reading through the Global Sequence Alignment write-up and working through one or two examples by hand. In other words, come up with some short DNA sequences and compute their edit distance using the table method illustrated in the write-up for the sequences AACAGTTACC and TAAGGTCA. You do not have to hand this in, and you may work with a partner on this part. Note that one of the formulas includes 0/1; this does not mean the division of 0 by 1, but rather means either 0 or 1 depending on a condition listed earlier (0 if the two characters being compared match, 1 if they differ).
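To make the 0/1 term concrete, here is a minimal Java sketch of the per-column costs. It assumes the costs visible in the sample output in this handout (0 for a match, 1 for a mismatch, 2 for a gap). The class and method names are illustrative only and are not part of the required structure.

```java
// Illustrative helper; the names here are hypothetical, not required by the assignment.
class PenaltyDemo {
    // Assumed cost of aligning a character with a gap.
    static final int GAP = 2;

    // The "0/1" term: 0 if the two characters match, 1 if they differ.
    static int penalty(char a, char b) {
        return (a == b) ? 0 : 1;
    }

    public static void main(String[] args) {
        System.out.println(penalty('A', 'A')); // match
        System.out.println(penalty('A', 'T')); // mismatch
        System.out.println(GAP);               // gap
    }
}
```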
Required Structure
Instead of giving you a template Java file to work from, we are going to have you create the code more or less from scratch. However, in order to enforce some uniformity (which will make the grading easier), we request that you do certain things:
- The name of the Java project should be “Sequence”.
- The name of the main class should be “EditDistance”.
- You should create a “documents” folder in Sequence where you will put a “readme.txt” file described in the “What you must do/hand in” section below.
- You should create a “lib” folder in Sequence, put stdlib.jar into it, and add stdlib.jar to the build path for the project.
- You should have a “resources” folder in Sequence (just like in NBody) that contains the contents of sequence.zip from the book’s web site.
Your final program should take a command-line argument specifying an input file, and it should print out the edit distance and an optimal alignment, displayed in a vertical format as in the following example.
Edit distance = 7
A T 1
A A 0
C - 2
A A 0
G G 0
T G 1
T T 0
A - 2
C C 0
C A 1
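The vertical format can be read as one column of the alignment per row: a character of x, a character of y (or a gap, "-"), and the cost of that column. The sketch below only illustrates the output format. It takes two strings that are already aligned (gaps included) and prints them with per-column costs, assuming the costs 0, 1, and 2 seen in the example. It does not compute an optimal alignment, and the class and method names are hypothetical.

```java
// Format-only sketch; it does NOT search for an optimal alignment.
class AlignmentFormatDemo {
    static final int GAP = 2; // assumed gap cost, as in the example output

    // Cost of one column: 2 if either side is a gap, else 0 for a match, 1 for a mismatch.
    static int columnCost(char a, char b) {
        if (a == '-' || b == '-') return GAP;
        return (a == b) ? 0 : 1;
    }

    // Builds the vertical display for two already-aligned strings of equal length.
    // The total is the cost of this alignment; it equals the edit distance
    // only when the alignment passed in is optimal.
    static String format(String alignedX, String alignedY) {
        if (alignedX.length() != alignedY.length()) {
            throw new IllegalArgumentException("aligned strings must have equal length");
        }
        int total = 0;
        StringBuilder rows = new StringBuilder();
        for (int i = 0; i < alignedX.length(); i++) {
            char a = alignedX.charAt(i);
            char b = alignedY.charAt(i);
            int cost = columnCost(a, b);
            total += cost;
            rows.append(a).append(' ').append(b).append(' ').append(cost).append('\n');
        }
        return "Edit distance = " + total + "\n" + rows;
    }

    public static void main(String[] args) {
        // The alignment from the example above, written with explicit gaps.
        System.out.print(format("AACAGTTACC", "TA-AGGT-CA"));
    }
}
```

Summing the last column of the example (1 + 0 + 2 + 0 + 0 + 1 + 0 + 2 + 0 + 1) gives the reported distance of 7.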
You will read in the file like you did in Project 3. You will also probably want to use LCS as a model for your programming. Here is the archived Eclipse project for LCS developed by Max, including in particular LCS.java. More importantly, please take a look at the videos of Max live-coding the LCS example: here and here. (The first video presents an easy-to-follow solution for the LCS problem; the second video fine-tunes that solution into a version similar to the textbook’s. You should follow video 1 closely and make sure you understand the solution. Video 2 is for students who are interested in fine-tuning the LCS solution.)
You can use the following code to read from the data files included in sequence.zip.
// put the following line at the top of your source code outside the class definition
import java.io.FileInputStream;

public static void main(String[] args) {
    try {
        System.setIn(new FileInputStream("resources/" + args[0]));
    } catch (Exception e) {
        System.err.printf("Exception caught: %s\n", e.toString());
        System.exit(0);
    }
    String x = StdIn.readLine();
    String y = StdIn.readLine();
    printEditDistance(x, y);
}
When running on long strings such as those in ecoli10000.txt, Java may run out of heap space. To increase the heap size, you can add the following option to “VM Arguments” in “Run Configuration”. This is below where you usually put arguments for your Java program.
-Xmx1000m
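As a rough check on why the heap can run out, the sketch below estimates the size of a full table of ints and compares it with the heap limit the JVM reports. It assumes a dynamic programming table with one int per cell and, as the file name suggests, two strings of roughly 10,000 characters each. The class and method names are hypothetical.

```java
// Back-of-the-envelope heap check; names are illustrative only.
class HeapCheckDemo {
    // Approximate bytes for an (m+1) x (n+1) table of ints,
    // ignoring per-array header overhead.
    static long tableBytes(int m, int n) {
        return (long) (m + 1) * (n + 1) * Integer.BYTES;
    }

    public static void main(String[] args) {
        long needed = tableBytes(10000, 10000);
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.printf("Table needs about %d MB; JVM max heap is about %d MB%n",
                needed / (1024 * 1024), maxHeap / (1024 * 1024));
    }
}
```

Under these assumptions the table alone takes about 400 million bytes, so running with and without the -Xmx option should show a different heap limit in the second number.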
Required Methods
The class EditDistance should correctly implement the following methods:
/**
* @param x a non-null String
* @param y a non-null String
* @return the edit distance between x and y
*
* This procedure should use a recursive, not dynamic programming, approach
* to compute the edit distance
*/
private static int recursiveEditDistance(String x, String y)
/**
* @param x a non-null String
* @param y a non-null String
* @return the edit distance between x and y
*
* This procedure should use dynamic programming to compute the edit distance
*/
private static int editDistance(String x, String y)
/**
* @param x a non-null String
* @param y a non-null String
*
* This procedure should use dynamic programming to compute the edit distance
* and print it and an optimal alignment in the vertical format shown in the
* project assignment.
* NOTE: There may be multiple optimal alignments.
* This procedure needs to print one optimal alignment.
*/
private static void printEditDistance(String x, String y)
/**
* @param x a non-null String
* @param y a non-null String
*
* Prints out the edit distance between x and y and the time taken to compute it
* using the recursive version recursiveEditDistance
*/
public static void timeRecursiveEditDistance(String x, String y)
/**
* @param x a non-null String
* @param y a non-null String
*
* Prints out the edit distance between x and y and the time taken to compute it
* using the dynamic programming version editDistance
*/
public static void timeEditDistance(String x, String y)
/**
* @param dnaLength a non-negative int
* @return a random String of length dnaLength comprised of the four chars A, T, G, and C
*/
public static String randomDNAString(int dnaLength)
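For the two timing methods, one common pattern is to read System.nanoTime() before and after the call and report the difference. The sketch below shows only that pattern on a placeholder task. The helper name timeSeconds and the use of a Runnable are assumptions for illustration, not part of the required method list.

```java
// Timing pattern sketch; timeSeconds is a hypothetical helper.
class TimingDemo {
    // Runs the task once and returns the elapsed wall-clock time in seconds.
    static double timeSeconds(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1e9;
    }

    public static void main(String[] args) {
        // Placeholder work; in the assignment this would be the call being timed.
        double seconds = timeSeconds(() -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100000; i++) {
                sb.append('A');
            }
        });
        System.out.printf("Elapsed: %.6f seconds%n", seconds);
    }
}
```

A single run like this is noisy, especially for short inputs, so it is most useful for comparing how the two versions scale as the strings get longer.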
Gradesheet
We will use this gradesheet when grading your lab.
Submission
You should submit via Moodle. You must submit the whole project with the EditDistance class (as a zip file). Again, here are our step-by-step instructions. You must also answer the questions in the attached readme.txt file.