8-3 Edit Distance

[english][all]

(請注意:中文版本並未隨英文版本同步更新!)

ED 是另一種計算計算兩個字串之間距離的方法,其目標是在計算第一個字串要經過多少次的刪除(Delete)、插入(Insert)、代換(Substitute)等三個基本字串運算,才能得到第二個字串。

We can also use the concept of DP to compute the edit distance between two strings. Let the optimum-value function ED(a, b) defined as the edit distance between strings a and b. Then the recursive formula for ED is shown next, assuming the costs of "delete" and "insert" are 1, while "substitute" is 2.

  1. ED(ax, by) = ED(a, b) if x = y.
  2. ED(ax, by) = min(ED(a, b)+2, ED(ax, b)+1, ED(a, by)+1) if x ≠ y.
相關的邊界條件則是:ED(a, []) = len(a), ED([], b) = len(b)。

Hint
一般我們在 DOS 使用 fc 或是在 UNIX 使用 diff 來比較兩個文字檔案的不同處,就是使用 ED 來進行快速比對。

The following example demonstrates a typical result of edit distance of strings "prosperity" and "properties":

Example 1: editDistance01.mstr1 = 'prosperity'; str2 = 'properties'; substituteCost=2; plotOpt = 1; [minDist, edPath, edTable] = editDistance(str1, str2, substituteCost, plotOpt);

As shown in the plot, the circle positions indicate common characters.

The next example demonstrates the result of edit distance between "execution" and "intention", which is the example used in 3.11 of "Speech and Language Processing".

Example 2: editDistance02.mstr1 = 'execution'; str2 = 'intention'; substituteCost=2; plotOpt = 1; [minDist, edPath, edTable] = editDistance(str1, str2, substituteCost, plotOpt);


Data Clustering and Pattern Recognition (資料分群與樣式辨認)