Longest Common Subsequence

San Skulrattanakulchai

October 30, 2019

Notation

A Kotlin string is implemented as an array of characters.
Individual characters in a Kotlin string can be accessed through a nonnegative integer index using the [] notation.
For a string $X$, we write $X[i..j)$ to denote the substring of $X$ consisting of all characters from position $i$ to position $j-1$ inclusive. (Note that this is not Kotlin notation; it is simply our convenient notation for talking about substrings.)
For example, if $X$ is AFRICA, then $X[1..4)$ is FRI and $X[0..6)$ is AFRICA itself.
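
The half-open notation happens to line up with Kotlin's standard `substring(startIndex, endIndex)`, whose end index is also exclusive. The following is only a small sketch of that correspondence; the helper name `halfOpen` is ours, introduced for illustration, and is not part of Kotlin.

```kotlin
// Hypothetical helper mirroring the notes' X[i..j) notation.
// String.substring(startIndex, endIndex) excludes endIndex, just like X[i..j).
fun halfOpen(x: String, i: Int, j: Int): String = x.substring(i, j)

// Single characters are read with the [] notation, e.g. "AFRICA"[1] is 'F'.
fun charAt(x: String, i: Int): Char = x[i]
```

With these, `halfOpen("AFRICA", 1, 4)` gives FRI and `halfOpen("AFRICA", 0, 6)` gives AFRICA, matching the examples above.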

Definitions

If one deletes characters at certain positions from a string $X$, what remains is called a subsequence of $X$.
For example, deleting the characters at positions 1 and 4 from AFRICA leaves us with the string ARIA. So we may say that ARIA is a subsequence of AFRICA.
Deleting no characters is also permitted. So, for instance, AFRICA is considered a subsequence of itself.
A sequence $Z$ is a common subsequence of sequences $X$ and $Y$ if $Z$ is a subsequence of both $X$ and $Y$.
For instance, DIN is a common subsequence of DYNAMICPROGRAMMING and DIVIDEANDCONQUER.
A longest common subsequence (LCS) of sequences $X$ and $Y$ is a common subsequence of $X$ and $Y$ of maximum possible length.
For instance, DYNAMICPROGRAMMING and DIVIDEANDCONQUER have DICOR as an LCS. This is because DICOR is their common subsequence and no common subsequence of length 6 exists.
DICON is another LCS, so an LCS need not be unique.
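
The definitions above can be checked mechanically. Here is a minimal sketch, assuming sequences are represented as Kotlin strings; the function names `isSubsequenceOf` and `isCommonSubsequence` are ours. It scans $X$ once from left to right and matches the characters of $Z$ greedily.

```kotlin
// Returns true if z can be obtained from x by deleting zero or more characters.
fun isSubsequenceOf(z: String, x: String): Boolean {
    var matched = 0                      // number of characters of z matched so far
    for (c in x) {
        if (matched < z.length && z[matched] == c) matched++
    }
    return matched == z.length
}

// z is a common subsequence of x and y if it is a subsequence of both.
fun isCommonSubsequence(z: String, x: String, y: String): Boolean =
    isSubsequenceOf(z, x) && isSubsequenceOf(z, y)
```

For instance, `isSubsequenceOf("ARIA", "AFRICA")` is true, while `isSubsequenceOf("AIR", "AFRICA")` is false because no R follows the I in AFRICA.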

Problem

Let two sequences $X=X_0X_1\dots X_{m-1}$ and $Y=Y_0Y_1\dots Y_{n-1}$ be given. We want to find an LCS of $X$ and $Y$.

Trying to solve the problem in a straightforward manner, we could do the following:

keep track of the longest common subsequence found so far (initially empty)
for each subsequence Z of X do
    if Z is also a subsequence of Y then
        if Z is longer than the longest found so far, replace it with Z

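However, $X$ has $2^m$ subsequences when counted by the set of positions deleted, and checking whether one of them is a subsequence of $Y$ takes time proportional to $n$. The running time is therefore exponential in $m$, which is impractical for all but very short inputs. We need a better way.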
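
To make the cost concrete, here is a sketch of the straightforward algorithm in Kotlin. It is an illustration under our own assumptions, not a recommended method: subsequences of $X$ are enumerated as bitmasks over its positions, the names `occursIn` and `bruteForceLcs` are ours, and the length limit of 20 is an arbitrary guard we chose.

```kotlin
// Greedy left-to-right check that z is a subsequence of y.
fun occursIn(z: CharSequence, y: String): Boolean {
    var k = 0                            // characters of z matched so far
    for (c in y) {
        if (k < z.length && z[k] == c) k++
    }
    return k == z.length
}

// Brute force: try every subset of positions of x (2^m of them).
// Only practical for very short x; the limit of 20 is an arbitrary guard.
fun bruteForceLcs(x: String, y: String): String {
    require(x.length <= 20) { "brute force is exponential in x.length" }
    var best = ""                        // longest common subsequence found so far
    for (mask in 0 until (1 shl x.length)) {
        // Build the subsequence of x selected by the 1-bits of mask.
        val z = StringBuilder()
        for (i in x.indices) {
            if (((mask shr i) and 1) == 1) z.append(x[i])
        }
        if (z.length > best.length && occursIn(z, y)) best = z.toString()
    }
    return best
}
```

Every extra character in $X$ doubles the number of masks the outer loop visits, which is exactly the exponential growth we want to avoid.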

Solution by dynamic programming

For $0\le i \le m$ and $0\le j \le n$, let OPT$(i, j)$ be the length of an LCS of $X[i..m)$ and $Y[j..n)$. (When $i=m$ or $j=n$, the corresponding substring is empty.)
We seek OPT$(0, 0)$.

Optimal substructure property

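Suppose $0\le i < m$ and $0\le j < n$, and $Z=Z[0..k)$ is an LCS of $X[i..m)$ and $Y[j..n)$.
If $X_i=Y_j$, then necessarily $X_i=Z_0$: otherwise $Z$ would be a common subsequence of $X[i+1..m)$ and $Y[j+1..n)$, and prepending $X_i$ to it would give a common subsequence of $X[i..m)$ and $Y[j..n)$ longer than $Z$. We can show that $Z[1..k)$ is an LCS of $X[i+1 .. m)$ and $Y[j+1 .. n)$.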
If $X_i\ne Y_j$, then $X_i\ne Z_0$ or $Y_j\ne Z_0$.
- If $X_i\ne Z_0$, we can show that $Z$ is an LCS of $X[i+1..m)$ and $Y[j..n)$.
- If $Y_j\ne Z_0$, we can show that $Z$ is an LCS of $X[i..m)$ and $Y[j+1..n)$.
We know that one of the above cases must occur.
This gives us the following recurrence. \[ \mbox{OPT}(i, j) = \left\{ \begin{array}{ll} 0, & \mbox{if $i=m$ or $j=n$} \\ \mbox{OPT}(i+1, j+1) + 1, & \mbox{if $0 \le i < m$, and $0 \le j < n$, and $X_i = Y_j$} \\ \max\{\,\mbox{OPT}(i,j+1),\ \mbox{OPT}(i+1,j)\,\}, & \mbox{if $0\le i < m$, and $0\le j < n$, and $X_i\ne Y_j$} \end{array} \right. \]
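
The recurrence can be transcribed almost line for line into a recursive function. The following is a sketch assuming Kotlin strings as input; the names `lcsLengthMemo`, `opt`, and `memo` are ours. Each OPT value is cached after its first computation, so every pair $(i, j)$ is evaluated only once.

```kotlin
// Top-down evaluation of the OPT recurrence with memoization.
// memo[i][j] caches OPT(i, j); -1 marks "not yet computed".
fun lcsLengthMemo(x: String, y: String): Int {
    val m = x.length
    val n = y.length
    val memo = Array(m + 1) { IntArray(n + 1) { -1 } }

    fun opt(i: Int, j: Int): Int {
        if (i == m || j == n) return 0              // base case: one suffix is empty
        if (memo[i][j] != -1) return memo[i][j]
        val value = if (x[i] == y[j]) {
            opt(i + 1, j + 1) + 1                   // case X_i = Y_j
        } else {
            maxOf(opt(i, j + 1), opt(i + 1, j))     // case X_i != Y_j
        }
        memo[i][j] = value
        return value
    }

    return opt(0, 0)                                // we seek OPT(0, 0)
}
```

The recursion depth can reach $m+n$, so for long inputs a table filled iteratively is the safer choice.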

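Example OPT table

The table below shows OPT$(i, j)$ for $X$ = APRIL (rows, indexed by $i$) and $Y$ = MAR (columns, indexed by $j$). The unlabeled last row is the base case $i = m$; the all-zero column for the base case $j = n$ is not shown.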

	M	A	R
A	2	2	1
P	1	1	1
R	1	1	1
I	0	0	0
L	0	0	0
	0	0	0
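
To see how the recurrence produces these entries, two of them can be worked by hand, reading the rows as $X$ = APRIL and the columns as $Y$ = MAR. Since $X_0 = Y_1 =$ A, the second case of the recurrence applies:
\[ \mbox{OPT}(0, 1) = \mbox{OPT}(1, 2) + 1 = 1 + 1 = 2. \]
Since $X_0 =$ A and $Y_0 =$ M differ, the third case applies:
\[ \mbox{OPT}(0, 0) = \max\{\,\mbox{OPT}(0, 1),\ \mbox{OPT}(1, 0)\,\} = \max\{\,2,\ 1\,\} = 2. \]
So an LCS of APRIL and MAR has length 2, and AR is one.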

LCS algorithm

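[Step 1] Fill in a table of OPT$(\cdot, \cdot)$ values. This can be done row-by-row, column-by-column, or diagonal-by-diagonal, as long as OPT$(i+1, j+1)$, OPT$(i, j+1)$, and OPT$(i+1, j)$ are filled in before OPT$(i, j)$. For example, fill the rows from $i=m$ down to $i=0$, and each row from $j=n$ down to $j=0$. The table has $(m+1)(n+1)$ entries, and each takes constant time to compute.
[Step 2] Find an LCS by following maximizer pointers, starting from OPT$(0, 0)$. The maximizer of OPT$(i, j)$ is the entry on the right-hand side of the recurrence that determines its value; whenever the pointer followed leads to OPT$(i+1, j+1)$, the character $X_i$ is the next character of the LCS.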
Note: The maximizers can be found by pre-computation and kept in a table in Step 1, or they can be computed on the fly in Step 2.
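
The two steps can be sketched in Kotlin as follows. This is one possible realization under our own assumptions: the table is filled row-by-row from the bottom, the maximizers are computed on the fly in Step 2 instead of being stored, and ties between the two options in the $\max$ are broken arbitrarily toward $j+1$. The names `optTable` and `lcs` are ours.

```kotlin
// Step 1: fill the (m+1) x (n+1) table of OPT values, bottom row first and
// each row right to left, so the entries a cell depends on are already filled.
fun optTable(x: String, y: String): Array<IntArray> {
    val m = x.length
    val n = y.length
    val opt = Array(m + 1) { IntArray(n + 1) }     // row m and column n stay 0
    for (i in m - 1 downTo 0) {
        for (j in n - 1 downTo 0) {
            opt[i][j] = if (x[i] == y[j]) {
                opt[i + 1][j + 1] + 1
            } else {
                maxOf(opt[i][j + 1], opt[i + 1][j])
            }
        }
    }
    return opt
}

// Step 2: walk from (0, 0), recomputing the maximizer at each cell on the fly.
fun lcs(x: String, y: String): String {
    val opt = optTable(x, y)
    val z = StringBuilder()
    var i = 0
    var j = 0
    while (i < x.length && j < y.length) {
        when {
            x[i] == y[j] -> { z.append(x[i]); i++; j++ }   // X_i is part of the LCS
            opt[i][j + 1] >= opt[i + 1][j] -> j++          // tie broken toward j + 1
            else -> i++
        }
    }
    return z.toString()
}
```

For example, `lcs("APRIL", "MAR")` returns AR. A different tie-breaking rule may return a different LCS of the same length, since an LCS is not unique in general.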