Longest Common Subsequence: Difference between revisions

From Rice Wiki
No edit summary
 
(6 intermediate revisions by 2 users not shown)
Line 1: Line 1:
= Problem description =
{{Infobox Algorithm|runtime=O(m*n)|space=O(m*n)|class=[[Dynamic Programming]]}}
 
A sequence ''Z'' is a subsequence of another sequence ''X'' if all the elements in ''Z'' also appear in order in ''X''. Notably, the elements do ''not'' need to be consecutive.
A sequence ''Z'' is a subsequence of another sequence ''X'' if all the elements in ''Z'' also appear in order in ''X''. Notably, the elements do ''not'' need to be consecutive.


In the '''longest-common-subsequence problem (LCS)''', we wish to find a maximum-length common subsequence of ''X'' and ''Y''.
In the '''longest-common-subsequence problem (LCS)''', we wish to find a maximum-length common subsequence of ''X'' and ''Y''.
= Overview =
* '''Runtime complexity:''' <math>O(m*n)</math>
* '''Space complexity:''' <math>O(m*n)</math>
* '''Approach:''' Dynamic programming


= Approach: dynamic programming =
= Approach: dynamic programming =
Line 11: Line 18:
<math>X</math> and <math>Y</math>.
<math>X</math> and <math>Y</math>.


==== Property 1 ====
'''Property 1'''
 
If <math>x_m = y_n</math>, then <math>z_k = x_m = y_n</math> and
If <math>x_m = y_n</math>, then <math>z_k = x_m = y_n</math> and
<math>Z_{k-1}</math> is an LCS of <math>X_{m-1}</math> and
<math>Z_{k-1}</math> is an LCS of <math>X_{m-1}</math> and
<math>Y_{n-1}</math>.
<math>Y_{n-1}</math>.


==== Property 2 ====
'''Property 2'''
 
If <math>x_m \neq y_n</math> and <math>Z_k \neq x_m</math>, then
If <math>x_m \neq y_n</math> and <math>Z_k \neq x_m</math>, then
<math>Z</math> is an LCS of <math>X_{m-1}</math> and
<math>Z</math> is an LCS of <math>X_{m-1}</math> and
<math>Y</math>.
<math>Y</math>.


==== Property 3 ====
'''Property 3'''
 
If <math>x_m \neq y_n</math> and <math>Z_k \neq y_n</math>, then
If <math>x_m \neq y_n</math> and <math>Z_k \neq y_n</math>, then
<math>Z</math> is an LCS of <math>X</math> and
<math>Z</math> is an LCS of <math>X</math> and
Line 31: Line 41:
<math>X_i</math> and <math>Y_j</math>. We have
<math>X_i</math> and <math>Y_j</math>. We have


==== Property 4 ====
'''Property 4'''


<math>
<math>
c[i,j] = \begin{cases}
c[i,j] = \begin{cases}
     0 & i = 0 or j = 0 \\
     0 & i = 0 or j = 0 \\
     c[i-1, j-1] + 1 & x_i = y_i \\
     c[i-1, j-1] + 1 & x_i = y_i \\
     max(c[i-1, j], c[i - 1, j]) & x_i \neq y_i
     max(c[i-1, j], c[i, j - 1 ]) & x_i \neq y_i
\end{cases}
\end{cases}
</math>
</math>
Line 49: Line 59:
<math>X</math> and <math>Y_{n-1}</math>. We don't know which one it is,
<math>X</math> and <math>Y_{n-1}</math>. We don't know which one it is,
so we simply compare them. Whichever is greater would be the LCS.
so we simply compare them. Whichever is greater would be the LCS.


= Implementation =
= Implementation =
Line 89: Line 98:
     return (c, traceback)
     return (c, traceback)
</pre>
</pre>
Printing shouldn't be that hard. You just trace back and record whenever
they are the same value. I think.
= Optimizations =
We can reduce the asymptotic space requirements since it only needs one
row and column at a time (although stopping us from reconstructing the
actual subsequence).
We can also remove the entire traceback array and just compare which
value is used for the <math>c</math> calculation. This halves the space
requirement but doesn't change anything asymptotically.
[[Category:Algorithms]]

Latest revision as of 04:41, 11 March 2024

A sequence Z is a subsequence of another sequence X if all the elements in Z also appear in order in X. Notably, the elements do not need to be consecutive.

In the longest-common-subsequence problem (LCS), we wish to find a maximum-length common subsequence of X and Y.

Overview

  • Runtime complexity:
  • Space complexity: Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle O(m*n)}
  • Approach: Dynamic programming

Approach: dynamic programming

Consider Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle X = \langle x_1, x_2, \ldots, x_m \rangle} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle Y = \langle y_1, y_2, \ldots, y_n \rangle } , and let Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle Z = \langle z_1, z_2, \ldots, z_k \rangle} be an LCS of Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle X} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle Y} .

Property 1

If Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle x_m = y_n} , then Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle z_k = x_m = y_n} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle Z_{k-1}} is an LCS of Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle X_{m-1}} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle Y_{n-1}} .

Property 2

If Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle x_m \neq y_n} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle Z_k \neq x_m} , then Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle Z} is an LCS of Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle X_{m-1}} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle Y} .

Property 3

If Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle x_m \neq y_n} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle Z_k \neq y_n} , then Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle Z} is an LCS of Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle X} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle Y_{n-1}} .

Subproblems

Let us have Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle c[i,j]} be the length of the LCS for Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle X_i} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle Y_j} . We have

Property 4

Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle c[i,j] = \begin{cases} 0 & i = 0 or j = 0 \\ c[i-1, j-1] + 1 & x_i = y_i \\ max(c[i-1, j], c[i, j - 1 ]) & x_i \neq y_i \end{cases} }

The first line is trivial. The second line stems from property 1 and makes enough sense to me.

The third line stems from property 3. This property tells us that depending on the last Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle Z_k} , Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle Z} must either be the LCS of Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle X_{m-1}} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle Y} or the LCS of Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle X} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle Y_{n-1}} . We don't know which one it is, so we simply compare them. Whichever is greater would be the LCS.

Implementation

Based on property 4, we have the following DP algorithm

LCS-Length(X, Y):
    m = X.length
    n = Y.length
    let c[0...m, 0...n] be a new table  // cache
    // Initialize c (property 4.1)
    for i = 0 to n:
        c[0, i] = 0
    for i = 0 to m:
        c[i, 0] = 0

    let traceback[0...m, 0...n] be a new table

    // Construct c top -> down, left -> right
    for row = 1 to n:
        for col = 1 to m:
            if (X[col] == Y[row]):
                // Property 4.2
                c[col, row] = c[col - 1, row - 1] + 1
                traceback[col, row] = (col - 1, row - 1)
                continue
            else:
                // Property 4.3
                left = c[col - 1, row]
                top = c[col, row - 1]
                if (left > top):
                    c[col, row] = c[col - 1, row]
                    traceback[col, row] = (col - 1, row)
                else:
                    c[col, row] = c[col, row - 1]
                    traceback[col, row] = (col, row - 1)

    return (c, traceback)

Printing shouldn't be that hard. You just trace back and record whenever they are the same value. I think.

Optimizations

We can reduce the asymptotic space requirements since it only needs one row and column at a time (although stopping us from reconstructing the actual subsequence).

We can also remove the entire traceback array and just compare which value is used for the Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle c} calculation. This halves the space requirement but doesn't change anything asymptotically.