8-2 Longest Common Subsequence

[chinese][english]

(請注意:中文版本並未隨英文版本同步更新!)

Given a sequence, we can delete any elements to form a subsequence of the original sequence. For instance, given a string s = "uvwxyz", we can delete v and x to get a subsequence "uwyz". For any given two sequences a and b, the similarity between them can be defined as the length of the longest common subsequence (LCS for short) of these two sequences, which can be computed efficiently by DP.

對於任一個字串,我們可以任意刪除其中的幾個字元,剩下的字串就稱為原字串的子字串。例如,假設原字串是s = "uvwxyz",我們可以刪除 v 和 x 來得到子字串 "uwyz"。「最長共同子字串」(Longest common subsequence,簡稱 LCS )的目標是要找出在兩個字串中,共同出現且前後次序一致的子字串。假設 LCS(a, b) 是字串 ab 的最長共同子字串的長度,則我們可以使用 DP 的方法來進行 LCS 的計算。

Let us define the optimum-value function LCS(a, b) as the length of the longest common subsequence between a and b. Then the recursive formula for LCS can be defined as follows:

  1. LCS(ax, by) = LCS(a, b)+1 if x = y.
  2. LCS(ax, by) = max(LCS(ax, b), LCS(a, by)) if x ≠ y.

The boundary conditions are LCS(a, []) = 0, LCS([], b) = 0.

相關的邊界條件則是:LCS(a, []) = 0, LCS([], b) = 0。

The following example demonstrates a typical result of LCS of strings "prosperity" and "properties":

Example 1: lcs01.mstr1 = 'prosperity'; str2 = 'properties'; plotOpt = 1; [lcscount, lcsPath, lcsStr, lcsTable] = lcs(str1, str2, plotOpt);

As shown in the plot, the circle positions indicate common characters.

It should be noted that the identified sequence is not unique for the above example. The identified sequence is "properi" in the above example, but one can verify that "perpert" is also equally good. In other words, a vanilla DP can only find the one of the best solutions. If you want to find all of the best, you need to modify the back tracking procedure. Moreover, if you want to find the top-n solutions, then you need to invoke the "n-best" version of DP which is a lot more complicated.


Data Clustering and Pattern Recognition (資料分群與樣式辨認)