Longest Common Subsequence

San Skulrattanakulchai

October 31, 2016

Longest Common Subsequence

A Java string is essentially a sequence of characters. For example, the Java string constant AFRICA is a character sequence of length 6. Like array elements, the characters of a Java string are indexed starting from 0. For a string X, we will write X[i..j) to mean the substring of X consisting of all characters from position i to position j − 1 inclusive. For example, if X is AFRICA, then X[1..4) is FRI and X[0..6) is AFRICA itself.
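Java's String.substring(beginIndex, endIndex) uses the same half-open convention, so X[i..j) can be written as X.substring(i, j). The sketch below illustrates this; the class name SubstringDemo and the helper slice are made up for the example.

```java
class SubstringDemo {
    // Returns X[i..j): the characters at positions i, i + 1, ..., j - 1.
    // (slice is a hypothetical helper; it only wraps String.substring.)
    static String slice(String x, int i, int j) {
        return x.substring(i, j);
    }

    public static void main(String[] args) {
        String x = "AFRICA";
        System.out.println(x.length());      // 6
        System.out.println(slice(x, 1, 4));  // FRI
        System.out.println(slice(x, 0, 6));  // AFRICA
    }
}
```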

If one deletes characters at certain positions from a given string X, what remains is called a subsequence of X. For example, deleting the characters at positions 1 and 4 from AFRICA leaves us with the string ARIA. So we may say that ARIA is a subsequence of AFRICA. Deleting no characters is also permitted. So, for instance, AFRICA is considered a subsequence of itself.
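One way to test the subsequence relation is a single left-to-right scan that matches the characters of the candidate in order. The following is a minimal sketch of that idea; the names SubsequenceCheck and isSubsequence are illustrative only.

```java
class SubsequenceCheck {
    // Returns true if z can be obtained from x by deleting zero or more characters.
    static boolean isSubsequence(String z, String x) {
        int k = 0; // number of characters of z matched so far
        for (int i = 0; i < x.length() && k < z.length(); i++) {
            if (x.charAt(i) == z.charAt(k)) {
                k++; // x[i] is kept; every unmatched x[i] counts as deleted
            }
        }
        return k == z.length();
    }

    public static void main(String[] args) {
        System.out.println(isSubsequence("ARIA", "AFRICA"));   // true
        System.out.println(isSubsequence("AFRICA", "AFRICA")); // true
        System.out.println(isSubsequence("AIR", "AFRICA"));    // false
    }
}
```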

A sequence Z is a common subsequence of sequences X and Y if Z is a subsequence of both X and Y. For instance, DIN is a common subsequence of DYNAMICPROGRAMMING and DIVIDEANDCONQUER.

A longest common subsequence (LCS) of sequences X and Y is a common subsequence of X and Y of maximum possible length. For instance, DYNAMICPROGRAMMING and DIVIDEANDCONQUER have DICOR as an LCS. This is because DICOR is a common subsequence of the two strings and no common subsequence of length 6 exists. DICON is another LCS, so an LCS is not necessarily unique.

Problem

Let two sequences X = X_0 X_1 … X_{m−1} and Y = Y_0 Y_1 … Y_{n−1} be given. We want to find an LCS of X and Y.

Dynamic Programming Solution

For 0 ≤ i ≤ m and 0 ≤ j ≤ n, let opt(i, j) be the length of an LCS of X[i..m) and Y[j..n). When i = m or j = n, one of the two substrings is empty.

We seek opt(0, 0), the length of an LCS of X and Y themselves.

Optimal Substructure Property

Suppose Z = Z[0..k) is an LCS of X[i..m) and Y[j..n).

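If X_i = Y_j, then necessarily X_i = Z_0: otherwise Z would be a common subsequence of X[i + 1..m) and Y[j + 1..n), and prepending X_i to Z would give a longer common subsequence of X[i..m) and Y[j..n). We can then show that Z[1..k) is an LCS of X[i + 1..m) and Y[j + 1..n).

If X_i ≠ Y_j, then X_i ≠ Z_0 or Y_j ≠ Z_0. If X_i ≠ Z_0, we can show that Z is an LCS of X[i + 1..m) and Y[j..n). If Y_j ≠ Z_0, we can show that Z is an LCS of X[i..m) and Y[j + 1..n).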

We know that one of the above cases must occur. This gives us the following recurrence.

Recurrence

opt(i, j) = 0, if i = m or j = n  
          = opt(i + 1, j + 1) + 1, if 0 ≤ i < m, 0 ≤ j < n, and X_i = Y_j  
          = max{opt(i, j + 1), opt(i + 1, j)}, if 0 ≤ i < m, 0 ≤ j < n, and X_i ≠ Y_j  
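The recurrence translates almost line for line into a recursive method with memoization. This is only a sketch under that reading, not necessarily how the table is meant to be computed in practice; the class LcsMemo and its method names are invented for illustration, and the recursion depth grows with m + n, so it suits short strings.

```java
import java.util.Arrays;

class LcsMemo {
    private final String x, y;
    private final int[][] memo; // memo[i][j] caches opt(i, j); -1 means not computed yet

    LcsMemo(String x, String y) {
        this.x = x;
        this.y = y;
        memo = new int[x.length() + 1][y.length() + 1];
        for (int[] row : memo) {
            Arrays.fill(row, -1);
        }
    }

    // Length of an LCS of x[i..m) and y[j..n), following the three cases of the recurrence.
    int opt(int i, int j) {
        if (i == x.length() || j == y.length()) {
            return 0;                                        // case 1: a substring is empty
        }
        if (memo[i][j] >= 0) {
            return memo[i][j];                               // already computed
        }
        int value;
        if (x.charAt(i) == y.charAt(j)) {
            value = opt(i + 1, j + 1) + 1;                   // case 2: first characters match
        } else {
            value = Math.max(opt(i, j + 1), opt(i + 1, j));  // case 3: drop one first character
        }
        memo[i][j] = value;
        return value;
    }

    static int lcsLength(String x, String y) {
        return new LcsMemo(x, y).opt(0, 0);
    }

    public static void main(String[] args) {
        System.out.println(lcsLength("satan", "saint"));                          // 3
        System.out.println(lcsLength("DYNAMICPROGRAMMING", "DIVIDEANDCONQUER"));  // 5
    }
}
```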

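Example dynamic programming table

The table shows opt(i, j) for X = satan (rows) and Y = saint (columns). The row and column labeled . hold the base cases i = m and j = n.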

. s a i n t .
s 3 2 1 1 1 0
a 2 2 1 1 1 0
t 2 2 1 1 1 0
a 2 2 1 1 0 0
n 1 1 1 1 0 0
. 0 0 0 0 0 0

Longest Common Subsequence Algorithm

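Step 1. Fill in a table of opt( ⋅ , ⋅ ) values, plus a companion table of maximizers that records, for each entry, which case of the recurrence produced its value. Because opt(i, j) depends only on opt(i + 1, j + 1), opt(i, j + 1), and opt(i + 1, j), we can fill in the table row by row (bottom to top), column by column (right to left), or diagonal by diagonal, as long as those three entries are computed first.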

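Step 2. Find the LCS by following maximizer pointers, starting from opt(0, 0) and stopping at row m or column n. Each diagonal move contributes the matched character X_i to the LCS.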
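The two steps can be sketched as follows, assuming a bottom-to-top, right-to-left fill order and a byte code per entry as the maximizer. The class LcsTable, the constants MATCH, SKIP_Y, and SKIP_X, and the tie-breaking rule (prefer moving right when both options are equal) are choices made for this example, not part of the algorithm as stated.

```java
class LcsTable {
    // Maximizer codes: which case of the recurrence produced opt[i][j].
    static final byte MATCH = 0;   // X_i = Y_j, pointer to (i + 1, j + 1)
    static final byte SKIP_Y = 1;  // pointer to (i, j + 1)
    static final byte SKIP_X = 2;  // pointer to (i + 1, j)

    final String x, y;
    final int[][] opt;      // opt[i][j] = length of an LCS of x[i..m) and y[j..n)
    final byte[][] choice;  // companion table of maximizers

    // Step 1: fill both tables, last row first and right to left within a row.
    LcsTable(String x, String y) {
        this.x = x;
        this.y = y;
        int m = x.length(), n = y.length();
        opt = new int[m + 1][n + 1];    // row m and column n stay 0 (base case)
        choice = new byte[m + 1][n + 1];
        for (int i = m - 1; i >= 0; i--) {
            for (int j = n - 1; j >= 0; j--) {
                if (x.charAt(i) == y.charAt(j)) {
                    opt[i][j] = opt[i + 1][j + 1] + 1;
                    choice[i][j] = MATCH;
                } else if (opt[i][j + 1] >= opt[i + 1][j]) {
                    opt[i][j] = opt[i][j + 1];
                    choice[i][j] = SKIP_Y;  // ties go to the right neighbor (an arbitrary choice)
                } else {
                    opt[i][j] = opt[i + 1][j];
                    choice[i][j] = SKIP_X;
                }
            }
        }
    }

    // Step 2: follow maximizer pointers from (0, 0) and collect matched characters.
    String lcs() {
        StringBuilder z = new StringBuilder();
        int i = 0, j = 0;
        while (i < x.length() && j < y.length()) {
            if (choice[i][j] == MATCH) {
                z.append(x.charAt(i));
                i++;
                j++;
            } else if (choice[i][j] == SKIP_Y) {
                j++;
            } else {
                i++;
            }
        }
        return z.toString();
    }

    public static void main(String[] args) {
        LcsTable t = new LcsTable("satan", "saint");
        System.out.println(t.opt[0][0]); // 3
        System.out.println(t.lcs());     // sat (a different tie-breaking rule could yield san)
    }
}
```

Both tables take (m + 1)(n + 1) entries, and the traceback makes at most m + n moves.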