KMP Pattern Matching Algorithm

Source: Internet
Author: User
Today let's talk about pattern matching. What is a pattern matching algorithm? It is simply a substring search algorithm. For example, given the string a = "abcabc" and the string to match b = "abc", the first position at which b occurs in a is index 0, and that counts as a successful match. Before getting to the KMP algorithm, consider the traditional approach: given two strings, we compare them character by character and solve the problem by brute force. I wrote an example of this in advance.
 
<pre name="code" class="java">
/**
 * Plain (brute-force) pattern matching.
 *
 * @param s the main string (text)
 * @param t the pattern to search for
 * @return index of the first occurrence of t in s, or -1 if there is none
 */
private static int strIndex(String s, String t) {
    // last start position at which t still fits inside s
    int end = s.length() - t.length();
    int index = -1;
    for (int i = 0; i <= end; i++) {
        // k is the current position in the main string
        int k = i;
        // starting at i, compare the characters one by one
        for (int j = 0; j < t.length(); j++) {
            if (s.charAt(k) == t.charAt(j)) {
                k++;
            } else {
                break;
            }
        }
        // all t.length() characters matched
        if (k == i + t.length()) {
            index = i;
            break;
        }
    }
    return index;
}
</pre>


 
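This does the job, but its efficiency leaves a lot to be desired: for a text of length n and a pattern of length m the worst-case time complexity is O(n*m), i.e. on the order of O(n*n). With very long strings, such as searching through a whole article, who knows how long you would have to wait. We always stand on the shoulders of giants: earlier researchers thought about this problem long ago and came up with the KMP pattern matching algorithm. First, a word on where the name comes from.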
The KMP algorithm is named after the three people who proposed it: it was discovered at the same time by D. E. Knuth, V. R. Pratt and J. H. Morris, so it is known as the Knuth-Morris-Pratt algorithm (KMP for short), after the initials of their surnames. The essential difference from the brute-force approach is that KMP cleverly eliminates the backtracking of the text pointer i. After a mismatch it only has to determine the position j in the pattern at which to resume, which lowers the complexity from O(n*m) in the worst case to O(m+n), for a text of length n and a pattern of length m.
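To make the complexity difference concrete, here is a minimal sketch that counts character comparisons for both strategies on a deliberately unfavourable input. The class name `ComparisonCount`, the `comparisons` counter, the `repeat` helper and the test strings are assumptions made for this illustration; the KMP routine follows the next[]-based scheme developed in the rest of this article, and only comparisons made during the search itself are counted.

```java
final class ComparisonCount {
    /** Character comparisons made by the most recent search (search phase only). */
    static long comparisons;

    /** Brute force: after a mismatch, restart one position further right. */
    static int bruteForce(String s, String t) {
        comparisons = 0;
        for (int i = 0; i + t.length() <= s.length(); i++) {
            int j = 0;
            while (j < t.length()) {
                comparisons++;
                if (s.charAt(i + j) != t.charAt(j)) {
                    break;
                }
                j++;
            }
            if (j == t.length()) {
                return i;
            }
        }
        return -1;
    }

    /** KMP: the text index i never moves backwards. */
    static int kmp(String s, String t) {
        comparisons = 0;
        if (t.isEmpty()) {
            return 0;
        }
        // build next[] for the pattern
        int[] next = new int[t.length()];
        next[0] = -1;
        int j = 0;
        int k = -1;
        while (j < t.length() - 1) {
            if (k == -1 || t.charAt(j) == t.charAt(k)) {
                j++;
                k++;
                next[j] = k;
            } else {
                k = next[k];
            }
        }
        // search
        int i = 0;
        j = 0;
        while (i < s.length() && j < t.length()) {
            if (j == -1) {
                // even t[0] failed at this position: move on in the text
                i++;
                j = 0;
                continue;
            }
            comparisons++;
            if (s.charAt(i) == t.charAt(j)) {
                i++;
                j++;
            } else {
                j = next[j];
            }
        }
        return j == t.length() ? i - j : -1;
    }

    /** Small helper so the sketch does not depend on String.repeat (Java 11+). */
    static String repeat(char c, int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int x = 0; x < n; x++) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = repeat('a', 1000) + "b";
        String t = repeat('a', 50) + "b";
        System.out.println("brute force: index " + bruteForce(s, t) + ", comparisons " + comparisons);
        System.out.println("kmp:         index " + kmp(s, t) + ", comparisons " + comparisons);
    }
}
```

On this input both searches report the same index, but the brute-force count grows with the product of the two lengths, while the KMP count stays within about twice the text length.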
Let's first recall how the original brute-force matching proceeds. On every mismatch the pattern is shifted right by a single position and comparison restarts from the first character of the pattern, so the text pointer i jumps back to just after the previous starting position.
The biggest difference between KMP and the ordinary algorithm is that KMP does not re-examine the part that has already matched. It uses the "partial match" obtained so far to slide the pattern to the right as far as possible, instead of one position at a time, and resumes comparing at the character where the mismatch occurred. To do this it defines a next[] array. next[j] = k means: when the character at index j of the pattern (its (j+1)-th character) fails to match, k is the index in the pattern that should next be compared with the text character s[i]. After a mismatch, j does not always have to restart from 0: if the first k characters of the pattern equal the k characters just before index j, matching can resume at index k of the pattern, because those k characters are already known to match.

More precisely, next[j] is the length of the longest proper suffix of T[0...j-1] that is also a prefix of it. That is, next[j] = k > 0 means T[0...k-1] == T[j-k...j-1]. When next[j] >= 0, the text pointer i stays where it is and the pattern pointer j moves to next[j] before matching continues. This guards against missing a match, because a pattern whose head and tail partially coincide may still match in full at the shifted position. If k = 0, matching restarts from j = 0, aligned exactly at index i where the mismatch occurred. (next[0] = -1 marks the case where even the first pattern character failed, so i simply advances by one.) So the remaining job is to compute the next array.
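Read literally, the definition above can be turned into a slow reference implementation that checks every candidate length k directly. This is only a sketch for understanding the definition, not the method KMP actually uses; the class and method names are made up for this example.

```java
import java.util.Arrays;

final class NextByDefinition {
    /**
     * next[0] = -1; for j > 0, next[j] is the largest k < j such that
     * t[0..k-1] equals t[j-k..j-1] (0 if no such k exists).
     */
    static int[] nextByDefinition(String t) {
        int[] next = new int[t.length()];
        for (int j = 0; j < t.length(); j++) {
            if (j == 0) {
                next[j] = -1;
                continue;
            }
            next[j] = 0;
            // try the longest candidate first; k < j keeps the prefix "proper"
            for (int k = j - 1; k > 0; k--) {
                if (t.substring(0, k).equals(t.substring(j - k, j))) {
                    next[j] = k;
                    break;
                }
            }
        }
        return next;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(nextByDefinition("abcaa"))); // [-1, 0, 0, 0, 1]
        System.out.println(Arrays.toString(nextByDefinition("aaaa")));  // [-1, 0, 1, 2]
    }
}
```

Because it compares substrings for every j and every k, this version takes cubic time in the pattern length; it is useful only as a check against a faster implementation.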
So how is the next array actually computed?
By definition next[0] = -1. Suppose next[j] = k, i.e. T[0...k-1] == T[j-k...j-1].

1) If T[j] == T[k], then T[0...k] == T[j-k...j], so clearly next[j+1] = next[j] + 1 = k + 1.

2) If T[j] != T[k], how should k move? The answer is k = next[k]. This was the hardest point for me to understand. My reading is that k falls back to the previous, shorter match. Say 3 characters at the head and the tail were equal, and after one more character is appended the comparison fails. k then drops to next[k], the length of the next shorter prefix that is also a suffix, which might be 2, and the first 2 and last 2 characters are compared. If that fails too, k falls back again. In the end k may reach 0, which means that once the new character is added there is no common head and tail left, and j has to start from 0 again. The code is as follows:

<pre name="code" class="java">
/**
 * Computes the next[] array.
 *
 * @param t the pattern
 * @return the next[] array of t (empty for an empty pattern)
 */
private static int[] getNext(String t) {
    int[] next = new int[t.length()];
    if (t.isEmpty()) {
        // nothing to compute; also avoids writing next[0] into an empty array
        return next;
    }
    next[0] = -1;
    int j = 0;
    int k = -1;
    while (j < t.length() - 1) {
        if (k == -1 || t.charAt(j) == t.charAt(k)) {
            // T[j] == T[k]: the matching prefix grows by one
            j++;
            k++;
            next[j] = k;
        } else {
            // T[j] != T[k]: fall back to the next shorter candidate
            k = next[k];
        }
    }
    // print the array for inspection
    for (int i : next) {
        System.out.print(i + ": ");
    }
    System.out.print("\n");
    return next;
}
</pre>
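The fallback step k = next[k] in case 2 is easier to follow when it is logged. The sketch below repeats the same construction loop and records every fallback; the class name `NextFallbackTrace`, the `fallbacks` list and the sample pattern "aabaaab" are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class NextFallbackTrace {
    /** One entry per fallback step, filled by buildNext (illustrative logging only). */
    static final List<String> fallbacks = new ArrayList<>();

    static int[] buildNext(String t) {
        fallbacks.clear();
        int[] next = new int[t.length()];
        if (t.isEmpty()) {
            return next;
        }
        next[0] = -1;
        int j = 0;
        int k = -1;
        while (j < t.length() - 1) {
            if (k == -1 || t.charAt(j) == t.charAt(k)) {
                j++;
                k++;
                next[j] = k;
            } else {
                // case 2: record the fallback, then take it
                fallbacks.add("j=" + j + ": k " + k + " -> " + next[k]);
                k = next[k];
            }
        }
        return next;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(buildNext("aabaaab"))); // [-1, 0, 1, 0, 1, 2, 2]
        for (String line : fallbacks) {
            System.out.println(line);
        }
    }
}
```

For this pattern the log shows two kinds of fallback: at j=2 the value of k drops all the way to -1, while at j=5 a single fallback from 2 to 1 is enough for the comparison to succeed again, so next[6] becomes 2 rather than 0.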
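Applying this method to the pattern abcaa gives the next[] array -1, 0, 0, 0, 1. When the fifth character (the final a, at index 4) fails to match, next[4] is 1 because the first a and the fourth a are equal. So j moves straight to index 1: the pattern's first a is slid under the position of its fourth a, and comparison resumes at b. The complete KMP algorithm is then: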
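<pre name="code" class="java">
/**
 * KMP pattern matching.
 *
 * @param s    the main string (text)
 * @param t    the pattern
 * @param next the next[] array of t
 * @return index of the first occurrence of t in s, or -1 if there is none
 */
private static int kmpStrIndex(String s, String t, int[] next) {
    if (t.isEmpty()) {
        // an empty pattern matches at index 0, as in strIndex
        return 0;
    }
    int i = 0;
    int j = 0;
    while (i < s.length() && j < t.length()) {
        if (j == -1 || s.charAt(i) == t.charAt(j)) {
            i++;
            j++;
        } else {
            // i stays put, j falls back
            j = next[j];
        }
        if (j == t.length()) {
            return i - j;
        }
    }
    return -1;
}
</pre>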
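To watch the search phase handle exactly the situation described above (a mismatch on the fifth character of abcaa), the following sketch logs every fallback of j. The class name `KmpTrace`, the `fallbacks` list and the text "abcabcaa" are assumptions introduced for this example; the search logic mirrors kmpStrIndex.

```java
import java.util.ArrayList;
import java.util.List;

final class KmpTrace {
    /** One entry per fallback of j, filled by search (illustrative logging only). */
    static final List<String> fallbacks = new ArrayList<>();

    static int[] buildNext(String t) {
        int[] next = new int[t.length()];
        if (t.isEmpty()) {
            return next;
        }
        next[0] = -1;
        int j = 0;
        int k = -1;
        while (j < t.length() - 1) {
            if (k == -1 || t.charAt(j) == t.charAt(k)) {
                j++;
                k++;
                next[j] = k;
            } else {
                k = next[k];
            }
        }
        return next;
    }

    static int search(String s, String t) {
        fallbacks.clear();
        if (t.isEmpty()) {
            return 0;
        }
        int[] next = buildNext(t);
        int i = 0;
        int j = 0;
        while (i < s.length() && j < t.length()) {
            if (j == -1 || s.charAt(i) == t.charAt(j)) {
                i++;
                j++;
            } else {
                // i is left untouched; only j changes
                fallbacks.add("i=" + i + ": j " + j + " -> " + next[j]);
                j = next[j];
            }
        }
        return j == t.length() ? i - j : -1;
    }

    public static void main(String[] args) {
        System.out.println(search("abcabcaa", "abcaa")); // 3
        System.out.println(fallbacks);                   // [i=4: j 4 -> 1]
    }
}
```

The single log entry shows the text index staying at 4 while j drops from 4 to 1, after which the remaining characters match without any further fallback.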
The idea behind KMP is still quite hard to grasp, or at least that is how it felt to me on first reading. You have to think it over again and again and work through it on paper. That led me to another thought: doesn't the JDK already have a string matching method? It does: String.contains(), although it returns a boolean. What algorithm does it use underneath? Is it also based on the KMP idea?
 
<pre name="code" class="java">
/**
 * Returns true if and only if this string contains the specified
 * sequence of char values.
 *
 * @param s the sequence to search for
 * @return true if this string contains s, false otherwise
 * @throws NullPointerException if s is null
 * @since 1.5
 */
public boolean contains(CharSequence s) {
    return indexOf(s.toString()) > -1;
}
</pre>
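As a quick illustration of the difference in return types, the sketch below calls both `String.contains` and `String.indexOf` on the example strings from the start of this article. The class name `ContainsDemo` and the `describe` helper are made up for this example.

```java
final class ContainsDemo {
    /** Summarises what contains and indexOf report for the same search. */
    static String describe(String text, String pattern) {
        return "contains=" + text.contains(pattern) + ", indexOf=" + text.indexOf(pattern);
    }

    public static void main(String[] args) {
        System.out.println(describe("abcabc", "abc")); // contains=true, indexOf=0
        System.out.println(describe("abcabc", "abd")); // contains=false, indexOf=-1
        // indexOf can also start searching from a given position
        System.out.println("abcabc".indexOf("abc", 1)); // 3
    }
}
```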
Let's look at the indexOf method that it calls:
<pre name="code" class="java">
public int indexOf(String str) {
    return indexOf(str, 0);
}

public int indexOf(String str, int fromIndex) {
    return indexOf(value, offset, count, str.value, str.offset, str.count, fromIndex);
}
</pre>

This overload passes a whole bunch of parameters along. Time to reveal the answer:

<pre name="code" class="java">
/**
 * Code shared by String and StringBuffer to do searches. The source is the
 * character array being searched, and the target is the string being searched for.
 *
 * @param source       the characters being searched.
 * @param sourceOffset offset of the source string.
 * @param sourceCount  count of the source string.
 * @param target       the characters being searched for.
 * @param targetOffset offset of the target string.
 * @param targetCount  count of the target string.
 * @param fromIndex    the index to begin searching from.
 */
static int indexOf(char[] source, int sourceOffset, int sourceCount, char[] target, int targetOffset,
        int targetCount, int fromIndex) {
    // Early argument checks; source here is simply the main string.
    if (fromIndex >= sourceCount) {
        return (targetCount == 0 ? sourceCount : -1);
    }
    if (fromIndex < 0) {
        fromIndex = 0;
    }
    if (targetCount == 0) {
        return fromIndex;
    }

    // Take the first character and compute the largest start offset,
    // sourceCount - targetCount. Computing max like this already suggests
    // that a brute-force comparison is coming.
    char first = target[targetOffset];
    int max = sourceOffset + (sourceCount - targetCount);

    for (int i = sourceOffset + fromIndex; i <= max; i++) {
        /* Look for first character. */
        // Find the first matching position first, to avoid needless work below.
        if (source[i] != first) {
            while (++i <= max && source[i] != first);
        }

        /* Found first character, now look at the rest of v2 */
        if (i <= max) {
            int j = i + 1;
            int end = j + targetCount - 1;
            // Once found, compare the rest, again with a plain for loop;
            // there is no sign of KMP at all.
            for (int k = targetOffset + 1; j < end && source[j] == target[k]; j++, k++);

            if (j == end) {
                /* Found whole string. */
                return i - sourceOffset;
            }
        }
    }
    return -1;
}
</pre>

The result is disappointing: the JDK uses the ordinary brute-force approach here. I don't know whether Sun will improve this algorithm in the future. Perhaps its authors only had short, simple strings in mind, so many factors were not taken into consideration. That concludes the analysis of the KMP algorithm; I hope you found something useful in it. Finally, here is the test class I used today:

<pre name="code" class="java">
package Kmp;

/**
 * Pattern matching algorithms.
 *
 * @author lyq
 */
public class Client {
    public static void main(String[] args) {
        // main string
        String s = "ababcaabcacbab";
        // pattern
        String t = "abcaa";
        // position of the first match
        int position = strIndex(s, t);
        System.out.println(position);
        int[] next = getNext(t);
        position = kmpStrIndex(s, t, next);
        System.out.println("kmp:" + position);
    }

    /**
     * Plain (brute-force) pattern matching.
     *
     * @param s the main string (text)
     * @param t the pattern to search for
     */
    private static int strIndex(String s, String t) {
        // last start position at which t still fits inside s
        int end = s.length() - t.length();
        int index = -1;
        for (int i = 0; i <= end; i++) {
            // k is the current position in the main string
            int k = i;
            // starting at i, compare the characters one by one
            for (int j = 0; j < t.length(); j++) {
                if (s.charAt(k) == t.charAt(j)) {
                    k++;
                } else {
                    break;
                }
            }
            // all t.length() characters matched
            if (k == i + t.length()) {
                index = i;
                break;
            }
        }
        return index;
    }

    /**
     * Computes the next[] array.
     *
     * @param t the pattern
     */
    private static int[] getNext(String t) {
        int[] next = new int[t.length()];
        if (t.isEmpty()) {
            // nothing to compute; also avoids writing next[0] into an empty array
            return next;
        }
        next[0] = -1;
        int j = 0;
        int k = -1;
        while (j < t.length() - 1) {
            if (k == -1 || t.charAt(j) == t.charAt(k)) {
                j++;
                k++;
                next[j] = k;
            } else {
                k = next[k];
            }
        }
        for (int i : next) {
            System.out.print(i + ": ");
        }
        System.out.print("\n");
        return next;
    }

    /**
     * KMP pattern matching.
     *
     * @param s    the main string (text)
     * @param t    the pattern
     * @param next the next[] array of t
     */
    private static int kmpStrIndex(String s, String t, int[] next) {
        if (t.isEmpty()) {
            // an empty pattern matches at index 0, as in strIndex
            return 0;
        }
        int i = 0;
        int j = 0;
        while (i < s.length() && j < t.length()) {
            if (j == -1 || s.charAt(i) == t.charAt(j)) {
                i++;
                j++;
            } else {
                // i stays put, j falls back
                j = next[j];
            }
            if (j == t.length()) {
                return i - j;
            }
        }
        return -1;
    }

    /**
     * Code shared by String and StringBuffer to do searches. The source is the
     * character array being searched, and the target is the string being searched for.
     *
     * @param source       the characters being searched.
     * @param sourceOffset offset of the source string.
     * @param sourceCount  count of the source string.
     * @param target       the characters being searched for.
     * @param targetOffset offset of the target string.
     * @param targetCount  count of the target string.
     * @param fromIndex    the index to begin searching from.
     */
    static int indexOf(char[] source, int sourceOffset, int sourceCount, char[] target, int targetOffset,
            int targetCount, int fromIndex) {
        // Early argument checks; source here is simply the main string.
        if (fromIndex >= sourceCount) {
            return (targetCount == 0 ? sourceCount : -1);
        }
        if (fromIndex < 0) {
            fromIndex = 0;
        }
        if (targetCount == 0) {
            return fromIndex;
        }

        // Take the first character and compute the largest start offset,
        // sourceCount - targetCount; this already suggests a brute-force comparison.
        char first = target[targetOffset];
        int max = sourceOffset + (sourceCount - targetCount);

        for (int i = sourceOffset + fromIndex; i <= max; i++) {
            /* Look for first character. */
            // Find the first matching position first, to avoid needless work below.
            if (source[i] != first) {
                while (++i <= max && source[i] != first);
            }

            /* Found first character, now look at the rest of v2 */
            if (i <= max) {
                int j = i + 1;
                int end = j + targetCount - 1;
                // Once found, compare the rest with a plain for loop; no sign of KMP.
                for (int k = targetOffset + 1; j < end && source[j] == target[k]; j++, k++);

                if (j == end) {
                    /* Found whole string. */
                    return i - sourceOffset;
                }
            }
        }
        return -1;
    }
}
</pre>
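Since the JDK's `String.indexOf` is a trustworthy reference, a KMP implementation can also be checked against it on many random inputs rather than on a single hand-picked pair. The following is a minimal sketch of such a check; the class name `KmpCrossCheck`, the two-letter alphabet, the string lengths and the seed are arbitrary assumptions, and the KMP routine is a compact restatement of getNext plus kmpStrIndex.

```java
import java.util.Random;

final class KmpCrossCheck {
    /** KMP search using the same next[] convention as the article (next[0] = -1). */
    static int kmpIndexOf(String s, String t) {
        if (t.isEmpty()) {
            return 0;
        }
        int[] next = new int[t.length()];
        next[0] = -1;
        int j = 0;
        int k = -1;
        while (j < t.length() - 1) {
            if (k == -1 || t.charAt(j) == t.charAt(k)) {
                j++;
                k++;
                next[j] = k;
            } else {
                k = next[k];
            }
        }
        int i = 0;
        j = 0;
        while (i < s.length() && j < t.length()) {
            if (j == -1 || s.charAt(i) == t.charAt(j)) {
                i++;
                j++;
            } else {
                j = next[j];
            }
        }
        return j == t.length() ? i - j : -1;
    }

    /** Random string over the alphabet {a, b}; a tiny alphabet forces many partial matches. */
    static String randomString(Random rnd, int len) {
        StringBuilder sb = new StringBuilder(len);
        for (int i = 0; i < len; i++) {
            sb.append(rnd.nextBoolean() ? 'a' : 'b');
        }
        return sb.toString();
    }

    /** Runs the given number of random searches and counts disagreements with String.indexOf. */
    static int mismatches(int rounds, long seed) {
        Random rnd = new Random(seed);
        int bad = 0;
        for (int r = 0; r < rounds; r++) {
            String s = randomString(rnd, rnd.nextInt(13));    // text length 0..12
            String t = randomString(rnd, 1 + rnd.nextInt(4)); // pattern length 1..4
            if (kmpIndexOf(s, t) != s.indexOf(t)) {
                bad++;
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        System.out.println("disagreements: " + mismatches(10000, 42L));
    }
}
```

A result of zero disagreements does not prove the implementation correct, but short random strings over a two-letter alphabet exercise the fallback path far more often than ordinary text does.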


 
 


 

 

 
 
