Longest Common Subsequence

From Rice Wiki
Revision as of 04:06, 4 March 2024 by Rice (talk | contribs)

Problem description

A sequence Z is a subsequence of another sequence X if all the elements in Z also appear in order in X. Notably, the elements do not need to be consecutive.

In the longest-common-subsequence problem (LCS), we wish to find a maximum-length common subsequence of X and Y.

Overview

  • Runtime complexity:
  • Space complexity:
  • Approach: Dynamic programming

Approach: dynamic programming

Consider and , and let be an LCS of and .

Property 1

If , then and is an LCS of and .

Property 2

If and , then is an LCS of and .

Property 3

If and , then is an LCS of and .

Subproblems

Let us have be the length of the LCS for and . We have

Property 4

The first line is trivial. The second line stems from property 1 and makes enough sense to me.

The third line stems from property 3. This property tells us that depending on the last , must either be the LCS of and or the LCS of and . We don't know which one it is, so we simply compare them. Whichever is greater would be the LCS.


Implementation

Based on property 4, we have the following DP algorithm

LCS-Length(X, Y):
    m = X.length
    n = Y.length
    let c[0...m, 0...n] be a new table  // cache
    // Initialize c (property 4.1)
    for i = 0 to n:
        c[0, i] = 0
    for i = 0 to m:
        c[i, 0] = 0

    let traceback[0...m, 0...n] be a new table

    // Construct c top -> down, left -> right
    for row = 1 to n:
        for col = 1 to m:
            if (X[col] == Y[row]):
                // Property 4.2
                c[col, row] = c[col - 1, row - 1] + 1
                traceback[col, row] = (col - 1, row - 1)
                continue
            else:
                // Property 4.3
                left = c[col - 1, row]
                top = c[col, row - 1]
                if (left > top):
                    c[col, row] = c[col - 1, row]
                    traceback[col, row] = (col - 1, row)
                else:
                    c[col, row] = c[col, row - 1]
                    traceback[col, row] = (col, row - 1)

    return (c, traceback)

Printing shouldn't be that hard. You just trace back and record whenever they are the same value. I think.

Optimizations

We can reduce the asymptotic space requirements since it only needs one row and column at a time (although stopping us from reconstructing the actual subsequence).

We can also remove the entire traceback array and just compare which value is used for the calculation. This halves the space requirement but doesn't change anything asymptotically.