Longest Common Subsequence: Difference between revisions
No edit summary |
No edit summary |
||
(9 intermediate revisions by 2 users not shown) | |||
Line 1: | Line 1: | ||
= | {{Infobox Algorithm|runtime=O(m*n)|space=O(m*n)|class=[[Dynamic Programming]]}} | ||
A sequence ''Z'' is a subsequence of another sequence ''X'' if all the elements in ''Z'' also appear in order in ''X''. Notably, the elements do ''not'' need to be consecutive. | A sequence ''Z'' is a subsequence of another sequence ''X'' if all the elements in ''Z'' also appear in order in ''X''. Notably, the elements do ''not'' need to be consecutive. | ||
In the '''longest-common-subsequence problem (LCS)''', we wish to find a maximum-length common subsequence of ''X'' and ''Y''. | In the '''longest-common-subsequence problem (LCS)''', we wish to find a maximum-length common subsequence of ''X'' and ''Y''. | ||
= Overview = | |||
* '''Runtime complexity:''' <math>O(m*n)</math> | |||
* '''Space complexity:''' <math>O(m*n)</math> | |||
* '''Approach:''' Dynamic programming | |||
= Approach: dynamic programming = | = Approach: dynamic programming = | ||
Consider <math>X = \langle x_1, x_2, \ldots, x_m \rangle</math> and | Consider <math>X = \langle x_1, x_2, \ldots, x_m \rangle</math> and | ||
Line 12: | Line 17: | ||
let <math>Z = \langle z_1, z_2, \ldots, z_k \rangle</math> be an LCS of | let <math>Z = \langle z_1, z_2, \ldots, z_k \rangle</math> be an LCS of | ||
<math>X</math> and <math>Y</math>. | <math>X</math> and <math>Y</math>. | ||
'''Property 1''' | |||
If <math>x_m = y_n</math>, then <math>z_k = x_m = y_n</math> and | If <math>x_m = y_n</math>, then <math>z_k = x_m = y_n</math> and | ||
<math>Z_{k-1}</math> is an LCS of <math>X_{m-1}</math> and | <math>Z_{k-1}</math> is an LCS of <math>X_{m-1}</math> and | ||
<math>Y_{n-1}</math>. | <math>Y_{n-1}</math>. | ||
'''Property 2''' | |||
If <math>x_m \neq y_n</math> and <math>Z_k \neq x_m</math>, then | If <math>x_m \neq y_n</math> and <math>Z_k \neq x_m</math>, then | ||
<math>Z</math> is an LCS of <math>X_{m-1}</math> and | <math>Z</math> is an LCS of <math>X_{m-1}</math> and | ||
<math>Y</math>. | <math>Y</math>. | ||
'''Property 3''' | |||
If <math>x_m \neq y_n</math> and <math>Z_k \neq y_n</math>, then | If <math>x_m \neq y_n</math> and <math>Z_k \neq y_n</math>, then | ||
<math>Z</math> is an LCS of <math>X</math> and | <math>Z</math> is an LCS of <math>X</math> and | ||
<math>Y_{n-1}</math>. | <math>Y_{n-1}</math>. | ||
=== Subproblems === | |||
Let us have <math>c[i,j]</math> be the length of the LCS for | |||
<math>X_i</math> and <math>Y_j</math>. We have | |||
'''Property 4''' | |||
<math> | |||
c[i,j] = \begin{cases} | |||
0 & i = 0 or j = 0 \\ | |||
c[i-1, j-1] + 1 & x_i = y_i \\ | |||
max(c[i-1, j], c[i, j - 1 ]) & x_i \neq y_i | |||
\end{cases} | |||
</math> | |||
The first line is trivial. The second line stems from property 1 and | |||
makes enough sense to me. | |||
The third line stems from property 3. This property tells us that | |||
depending on the last <math>Z_k</math>, <math>Z</math> must either be | |||
the LCS of <math>X_{m-1}</math> and <math>Y</math> or the LCS of | |||
<math>X</math> and <math>Y_{n-1}</math>. We don't know which one it is, | |||
so we simply compare them. Whichever is greater would be the LCS. | |||
= Implementation = | |||
Based on property 4, we have the following DP algorithm | |||
<pre> | |||
LCS-Length(X, Y): | |||
m = X.length | |||
n = Y.length | |||
let c[0...m, 0...n] be a new table // cache | |||
// Initialize c (property 4.1) | |||
for i = 0 to n: | |||
c[0, i] = 0 | |||
for i = 0 to m: | |||
c[i, 0] = 0 | |||
let traceback[0...m, 0...n] be a new table | |||
// Construct c top -> down, left -> right | |||
for row = 1 to n: | |||
for col = 1 to m: | |||
if (X[col] == Y[row]): | |||
// Property 4.2 | |||
c[col, row] = c[col - 1, row - 1] + 1 | |||
traceback[col, row] = (col - 1, row - 1) | |||
continue | |||
else: | |||
// Property 4.3 | |||
left = c[col - 1, row] | |||
top = c[col, row - 1] | |||
if (left > top): | |||
c[col, row] = c[col - 1, row] | |||
traceback[col, row] = (col - 1, row) | |||
else: | |||
c[col, row] = c[col, row - 1] | |||
traceback[col, row] = (col, row - 1) | |||
return (c, traceback) | |||
</pre> | |||
Printing shouldn't be that hard. You just trace back and record whenever | |||
they are the same value. I think. | |||
= Optimizations = | |||
We can reduce the asymptotic space requirements since it only needs one | |||
row and column at a time (although stopping us from reconstructing the | |||
actual subsequence). | |||
We can also remove the entire traceback array and just compare which | |||
value is used for the <math>c</math> calculation. This halves the space | |||
requirement but doesn't change anything asymptotically. | |||
[[Category:Algorithms]] |
Latest revision as of 04:41, 11 March 2024
A sequence Z is a subsequence of another sequence X if all the elements in Z also appear in order in X. Notably, the elements do not need to be consecutive.
In the longest-common-subsequence problem (LCS), we wish to find a maximum-length common subsequence of X and Y.
Overview
- Runtime complexity:
- Space complexity:
- Approach: Dynamic programming
Approach: dynamic programming
Consider and , and let be an LCS of and .
Property 1
If , then and is an LCS of and .
Property 2
If and , then is an LCS of and .
Property 3
If and , then is an LCS of and .
Subproblems
Let us have be the length of the LCS for and . We have
Property 4
The first line is trivial. The second line stems from property 1 and makes enough sense to me.
The third line stems from property 3. This property tells us that depending on the last , must either be the LCS of and or the LCS of and . We don't know which one it is, so we simply compare them. Whichever is greater would be the LCS.
Implementation
Based on property 4, we have the following DP algorithm
LCS-Length(X, Y): m = X.length n = Y.length let c[0...m, 0...n] be a new table // cache // Initialize c (property 4.1) for i = 0 to n: c[0, i] = 0 for i = 0 to m: c[i, 0] = 0 let traceback[0...m, 0...n] be a new table // Construct c top -> down, left -> right for row = 1 to n: for col = 1 to m: if (X[col] == Y[row]): // Property 4.2 c[col, row] = c[col - 1, row - 1] + 1 traceback[col, row] = (col - 1, row - 1) continue else: // Property 4.3 left = c[col - 1, row] top = c[col, row - 1] if (left > top): c[col, row] = c[col - 1, row] traceback[col, row] = (col - 1, row) else: c[col, row] = c[col, row - 1] traceback[col, row] = (col, row - 1) return (c, traceback)
Printing shouldn't be that hard. You just trace back and record whenever they are the same value. I think.
Optimizations
We can reduce the asymptotic space requirements since it only needs one row and column at a time (although stopping us from reconstructing the actual subsequence).
We can also remove the entire traceback array and just compare which value is used for the calculation. This halves the space requirement but doesn't change anything asymptotically.