Experience the KMP algorithm

Source: Internet
Author: User

In C/C++ programming, string searches are usually done with the standard library's strstr() function. Because most programs search strings only occasionally, efficiency is rarely a concern. In fact, the time complexity of a straightforward implementation is not encouraging: to find a substring of length m in a string of length n, the worst case is O(n * m), so the running time grows in proportion to the substring length m. Is there a more efficient algorithm?
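The worst case is easy to provoke. The following is a minimal sketch, not the code of any particular strstr() implementation: a naive matcher (the names `NaiveSearchDemo`, `naiveIndexOf`, and `repeat` are made up for this illustration) that counts character comparisons on a text of 1000 'a' characters and a pattern of 99 'a' characters followed by 'b'.

```java
import java.util.Arrays;

// Illustrative only: a naive matcher that counts character comparisons.
class NaiveSearchDemo {
    static long comparisons;

    /** Returns the index of the first occurrence of pattern in text, or -1. */
    static int naiveIndexOf(String text, String pattern) {
        comparisons = 0;
        int n = text.length();
        int m = pattern.length();
        for (int i = 0; i + m <= n; i++) {
            int j = 0;
            while (j < m) {
                comparisons++;
                if (text.charAt(i + j) != pattern.charAt(j)) {
                    break; // mismatch: the next attempt restarts at i + 1
                }
                j++;
            }
            if (j == m) {
                return i;
            }
        }
        return -1;
    }

    /** Builds a string made of count copies of c. */
    static String repeat(char c, int count) {
        char[] chars = new char[count];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    public static void main(String[] args) {
        String text = repeat('a', 1000);
        String pattern = repeat('a', 99) + "b";
        int index = naiveIndexOf(text, pattern);
        // (n - m + 1) * m = 901 * 100 comparisons
        System.out.println(index + " after " + comparisons + " comparisons");
        // prints: -1 after 90100 comparisons
    }
}
```

Every one of the 901 starting positions matches 99 characters before failing on the last one, which is exactly the (n - m + 1) * m pattern behind the O(n * m) bound.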

The KMP (Knuth-Morris-Pratt) algorithm precomputes, for each character position in the pattern string, the index to fall back to after a mismatch. This avoids unnecessary backtracking during pattern matching and reduces the time complexity to O(m + n).

The greatest benefit of the KMP string search (matching) algorithm is not that it is faster than strstr, but that it does not backtrack. This is a wonderful feature: the target text can be searched with nothing more than a function that returns the next character (in WINX, this function is called get). That is undoubtedly very convenient for users of the KMP algorithm.

In general, the KMP string search (matching) algorithm of WINX is easy to use. The one thing to note is that, unlike most matching routines, after a successful match the current position in the target text points to the end of the matched string rather than its start. For example, strstr("1234 abcdefg", "abc") in the C library returns a pointer to the 'a' in "abcdefg", whereas the KMP algorithm of WINX leaves the position at the 'd' in "defg".
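To make the "next character only" idea concrete, here is a minimal sketch in Java that searches a java.io.Reader without ever going back. It does not reproduce the WINX interface: the names `StreamKmpDemo`, `buildRollback`, `skipPast`, and `restAfterMatch` are assumptions made for this illustration, and the rollback table uses the convention described later in this article (first entry -1).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical sketch; none of these names come from WINX.
class StreamKmpDemo {

    /** Rollback table with rollback[0] == -1; pattern must be non-empty. */
    static int[] buildRollback(String pattern) {
        int[] rollback = new int[pattern.length()];
        rollback[0] = -1;
        for (int i = 1; i < rollback.length; i++) {
            int k = rollback[i - 1];
            while (k >= 0 && pattern.charAt(i - 1) != pattern.charAt(k)) {
                k = rollback[k];
            }
            rollback[i] = k + 1;
        }
        return rollback;
    }

    /**
     * Reads characters one at a time until pattern has just been consumed.
     * Returns true on a match (the reader is then positioned right after
     * the match) and false if the input ends first.
     */
    static boolean skipPast(Reader in, String pattern) throws IOException {
        if (pattern.isEmpty()) {
            return true;
        }
        int[] rollback = buildRollback(pattern);
        int j = 0; // index of the next pattern character to compare
        int c;
        while ((c = in.read()) != -1) {
            // Only the pattern index rolls back; the reader never does.
            while (j >= 0 && c != pattern.charAt(j)) {
                j = rollback[j];
            }
            j++;
            if (j == pattern.length()) {
                return true;
            }
        }
        return false;
    }

    /** Returns what is left of text after the first match, or null. */
    static String restAfterMatch(String text, String pattern) {
        try (Reader in = new StringReader(text)) {
            if (!skipPast(in, pattern)) {
                return null;
            }
            StringBuilder rest = new StringBuilder();
            for (int c; (c = in.read()) != -1; ) {
                rest.append((char) c);
            }
            return rest.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(restAfterMatch("1234 abcdefg", "abc")); // prints defg
    }
}
```

Because each character is read exactly once, the position after a successful match is naturally the end of the match, which is consistent with the behavior described above.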

The indexOf method of the String class in the JDK does not use KMP; it is basically the simplest possible search:

    /**
     * Code shared by String and StringBuffer to do searches. The source is the
     * character array being searched, and the target is the string being
     * searched for.
     *
     * @param source       the characters being searched.
     * @param sourceOffset offset of the source string.
     * @param sourceCount  count of the source string.
     * @param target       the characters being searched for.
     * @param targetOffset offset of the target string.
     * @param targetCount  count of the target string.
     * @param fromIndex    the index to begin searching from.
     */
    static int indexOf(char[] source, int sourceOffset, int sourceCount,
            char[] target, int targetOffset, int targetCount, int fromIndex) {
        // if the start position is beyond the end of the source string
        if (fromIndex >= sourceCount) {
            // return the source length if the target string is empty;
            // otherwise return -1, which means the match fails
            return (targetCount == 0 ? sourceCount : -1);
        }
        // correct the fromIndex
        if (fromIndex < 0) {
            fromIndex = 0;
        }
        // if the target string is empty, return fromIndex
        if (targetCount == 0) {
            return fromIndex;
        }
        // first char to match
        char first = target[targetOffset];
        /*
         * A small optimization. Say the source string length is 9 and the
         * target string length is 7. Then the 3rd character (index 2) of the
         * source string is the last chance to match the whole target string.
         * Beyond it, fewer than 7 characters remain in the source string, so
         * they cannot match a target string of length 7.
         */
        int max = sourceOffset + (sourceCount - targetCount);
        // loop from the first candidate position to max
        for (int i = sourceOffset + fromIndex; i <= max; i++) {
            /* Look for first character. */
            if (source[i] != first) {
                // using i <= max, not i < max
                while (++i <= max && source[i] != first)
                    ;
            }
            /* Found first character, now look at the rest of v2 */
            if (i <= max) {
                int j = i + 1;
                int end = j + targetCount - 1;
                // using j < end, not j <= end
                for (int k = targetOffset + 1; j < end
                        && source[j] == target[k]; j++, k++)
                    ;
                if (j == end) {
                    /* Found whole string. */
                    return i - sourceOffset;
                }
                // if the match fails, i++ and loop again; there are two
                // iterators for the two loops, i and j.
            }
        }
        return -1;
    }

This is the search algorithm used by Java's String. It walks the source string with two indexes, i and j, but in essence it still backtracks: during each attempt j runs ahead of i, and when the attempt fails the next one starts again from i + 1, so the characters between i and j are read a second time. That is backtracking.

The advantage of KMP is that it never backtracks. This is not only an efficiency advantage; it also makes for a more natural implementation when only a single forward-moving pointer is available. Using two indexes on an array is no trouble, of course, but when searching a file or an input stream, going back is very awkward. KMP search is described below.

The core of the KMP algorithm is that the pointer into the source string never moves backward. That is not hard to achieve once you realize the key point: the characters you would backtrack over are already known. For example, when searching for "abcdeg" in "abcdefg", the first five characters "abcde" match, and the sixth characters, f and g, do not. At this point the search algorithm above increments i and restarts the whole match, which is the backtracking. But backtracking can be avoided entirely: a mismatch at the sixth character means the first five characters matched, so in this example we already know that the first five characters of the source string are "abcde". This is the foundation of KMP search.

So let's set the source string aside and look only at the target string, "abcdeg". What does it mean when, during the search, character [n] of the source string fails to match character [m] of the target string? It means that all the preceding characters matched; otherwise we would not have got this far. That is, the m characters [n-m] to [n-1] of the source string equal the m characters [0] to [m-1] of the target string. Since this equality is known in advance, why go back over those characters again and again? What to do next can be worked out once, before searching, because [n-m] to [n-1] of the source string are known. There is therefore no need to move the source string back to n-m+1 each time.

For example, if you search for "ababc" in "abababc", the first mismatch is as follows:

    0 1 2 3 4 5 6
    a b a b a b c
    a b a b c
            ^

At this point, moving the pointer back to position 1 of the source string is pointless: that character is b, which does not match the a at the start of the target string. We also know that characters 0 to 3 of the source string are the same as the first four characters of the target string: both are abab. The idea of KMP is to make full use of this known condition: "never backtrack in the source string, backtrack as little as possible in the target string, and then continue searching". So where should the target string be rolled back to? That depends on the content of the part already matched.

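Let S denote the source string and T the target string, and suppose S[n] and T[m] fail to match (note: because of the mismatch, S[n] is treated as unknown here). The only known part of the source string is S[n-m] to S[n-1]. Suppose we can find the largest k such that S[n-k]...S[n-1] = T[0]...T[k-1] (0 < k < m). Then S does not have to move back at all: T is rolled back to index k, and the search continues by comparing S[n] with T[k]. If no such k exists, T is rolled back to index 0.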

In the preceding example, the value of k is 2, and the next status of KMP search is:

    0 1 2 3 4 5 6
    a b a b a b c
        a b a b c
            ^

Then, the matching is successful.
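The value k = 2 can be checked mechanically. Since the m matched source characters are identical to T[0]...T[m-1], k is simply the length of the longest proper prefix of that matched part which is also its suffix. The sketch below finds it by brute force; `OverlapDemo` and `largestOverlap` are names invented for this illustration, and this is not how KMP itself computes the value.

```java
// Brute-force check of the overlap k used in the example above.
class OverlapDemo {

    /**
     * Largest k with 0 <= k < m such that the first k characters of
     * target[0..m) equal its last k characters.
     */
    static int largestOverlap(String target, int m) {
        for (int k = m - 1; k > 0; k--) {
            // compare target[0..k) with target[m-k..m)
            if (target.regionMatches(0, target, m - k, k)) {
                return k;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        // Mismatch at index 4 of "ababc": the matched part is "abab".
        System.out.println(largestOverlap("ababc", 4));  // prints 2
        // Mismatch at index 5 of "abcdeg": the matched part is "abcde".
        System.out.println(largestOverlap("abcdeg", 5)); // prints 0
    }
}
```

For "abcdeg" nothing overlaps, so after the mismatch the target string restarts from index 0 while the source pointer stays where it is.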

Therefore, the core of the KMP algorithm is to find such a k for every position of the target string and store these values in an array F. Then, whenever position m of the target string fails to match, the target string is rolled back to F[m] and matching continues. Once this array is built, KMP search is 80% complete.

The following is the method for constructing the array F.

Here the target string plays two roles at once: source string and target string, so S and T below are the same string. Building the array F is a step-by-step process in which each entry relies on the earlier ones. Start with F[0]. A mismatch at index 0 means the very first character failed to match, so nothing is known about the source string and the only option is to move the source string forward. In F we mark this case with -1, that is, F[0] = -1. F[1] is always 0. What really has to be computed is F[2] through the end, as follows. Note that F[i] is the index T must be rolled back to when T[i] fails to match. How is F[i] calculated? First take j = F[i-1] and check whether T[i-1] == T[j]; if so, F[i] = j + 1. The principle is recursive: F[i-1] is the index T rolls back to after a mismatch at i-1, so if the character at that index equals T[i-1], the matched prefix can be extended by one and F[i] = F[i-1] + 1. Otherwise set j = F[j] and compare T[i-1] with T[j] again, repeating until j reaches -1, in which case F[i] = 0. The specific code is as follows:

    /**
     * Each value of the rollback array means: when source[i] mismatches
     * pattern[j], KMP resumes matching source[i] against
     * pattern[rollback[j]]. If rollback[j] == -1, the current source[i] can
     * never match the pattern; i is then increased by 1 and j is set to 0,
     * which restarts the match at source[i+1] against pattern[0].
     *
     * @param pattern the pattern characters; must not be empty
     * @return the rollback array, with the same length as pattern
     */
    private static int[] getRollbackArray(char[] pattern) {
        // Java zero-initializes int arrays, so every entry starts at 0.
        int[] rollback = new int[pattern.length];
        rollback[0] = -1;
        for (int i = 1; i < rollback.length; i++) {
            char prevChar = pattern[i - 1];
            int prevRollback = i - 1;
            // follow the chain of rollback values until prevChar extends a
            // prefix or the chain reaches -1
            while (prevRollback >= 0) {
                int previousRollBackIdx = rollback[prevRollback];
                if ((previousRollBackIdx == -1)
                        || (prevChar == pattern[previousRollBackIdx])) {
                    rollback[i] = previousRollBackIdx + 1;
                    break;
                } else {
                    prevRollback = rollback[prevRollback];
                }
            }
        }
        return rollback;
    }
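To tie the table back to the earlier example, the sketch below prints the rollback array for "ababc" and "abcdeg". It is a condensed restatement of the same recurrence under the same F[0] = -1 convention; the names `RollbackTableDemo` and `rollbackTable` are hypothetical and used only here.

```java
import java.util.Arrays;

// Condensed form of the rollback construction, for illustration only.
class RollbackTableDemo {

    /** Builds the rollback array; pattern must be non-empty. */
    static int[] rollbackTable(String pattern) {
        int[] rollback = new int[pattern.length()];
        rollback[0] = -1;
        for (int i = 1; i < rollback.length; i++) {
            int k = rollback[i - 1];
            // fall back along the chain until pattern[i-1] extends a prefix
            while (k >= 0 && pattern.charAt(i - 1) != pattern.charAt(k)) {
                k = rollback[k];
            }
            rollback[i] = k + 1;
        }
        return rollback;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(rollbackTable("ababc")));
        // prints [-1, 0, 0, 1, 2]
        System.out.println(Arrays.toString(rollbackTable("abcdeg")));
        // prints [-1, 0, 0, 0, 0, 0]
    }
}
```

The last entry for "ababc" is 2: a mismatch at index 4 rolls the target string back to index 2, which is the k = 2 step shown in the diagram above.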

Unlike the description above, the code does not hard-code F[1] = 0; the loop computes it, and the result is always 0. With this rollback array in hand, KMP search follows naturally:

    /**
     * Search for the pattern chars in the source chars.
     *
     * @param source  the characters being searched
     * @param pattern the characters being searched for
     * @return the index of the first match in source, or -1 if none
     */
    public static int searchKMP(char[] source, char[] pattern) {
        // validation
        if (source == null || source.length == 0 || pattern == null
                || pattern.length == 0) {
            return -1;
        }
        // get the rollback array.
        int[] rollback = getRollbackArray(pattern);
        // index into pattern, pointing at the char to compare next.
        int currMatch = 0;
        int len = pattern.length;
        // i points at the source char to compare next; it never decreases
        for (int i = 0; i < source.length;) {
            // if the current chars match
            if ((currMatch == -1) || (source[i] == pattern[currMatch])) {
                /*
                 * Advance both indexes by one to compare the next chars.
                 * Note that currMatch == -1 means even the first char of
                 * the pattern could not be matched, so i moves on by one
                 * and currMatch becomes 0.
                 */
                i++;
                currMatch++;
                /*
                 * If the end of the pattern is reached, the match
                 * succeeded; return the index of the first matched char.
                 */
                if (currMatch == len) {
                    return i - len;
                }
            } else {
                /*
                 * On a mismatch, roll back only the pattern index; the
                 * source index i stays where it is.
                 */
                currMatch = rollback[currMatch];
            }
        }
        return -1;
    }

The following are several testing methods:

    // JUnit 4 imports; these belong at the top of the source file.
    import org.junit.Assert;
    import org.junit.Test;

    @Test
    public void testRollBackArray() {
        int[] expectedRollback = new int[] { -1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0,
                0, 0, 0, 0, 0, 1, 2, 3, 0, 0, 0, 0, 0 };
        int[] rollback = getRollbackArray("PARTICIPATE IN PARACHUTE"
                .toCharArray());
        Assert.assertArrayEquals("Rollback array compare failed to match!",
                expectedRollback, rollback);
    }

    @Test
    public void testKMPSearchMatch() {
        // the pattern starts at index 5 of the source
        int matchIndex = searchKMP(
                "aaaaaababacbaslierjalsdzmflkasjf".toCharArray(),
                "ababacb".toCharArray());
        Assert.assertEquals(5, matchIndex);
        // a string always matches itself at index 0
        matchIndex = searchKMP(
                "aaaaaababacbaslierjalsdzmflkasjf".toCharArray(),
                "aaaaaababacbaslierjalsdzmflkasjf".toCharArray());
        Assert.assertEquals(0, matchIndex);
    }

    @Test
    public void testKMPSearchNoMatch() {
        int matchIndex = searchKMP("ABCABCDABABCDABCDABDE".toCharArray(),
                "hjABCDABD".toCharArray());
        Assert.assertEquals(-1, matchIndex);
    }

Put the three pieces of code in a class, and KMP search is complete.

Before studying the KMP algorithm, I had read many articles saying that KMP has a cost and only pays off when both the text being searched and the search string are long. As far as I can see, though, KMP is also advantageous for everyday searches. Constructing the rollback array is not complicated, although it does require extra array space. Matching is still considerably faster, and the text being searched never has to be backtracked. The only cost of KMP is therefore one extra array, whose memory footprint is about twice that of the target String (a String is backed by a char array, a char is 16 bits like a short, and an int is twice the size of a char). Could it be that KMP is not used simply to save memory?
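The speed claim can be made a little more concrete by counting comparisons. The sketch below is an instrumented variant written for this illustration (the names `KmpCostDemo`, `kmpIndexOf`, and `repeat` are assumptions, and only comparisons made during the search phase are counted). It runs on a text of n = 1000 'a' characters and a pattern of 99 'a' characters followed by 'b', an input on which a naive scan needs (n - m + 1) * m = 90,100 comparisons.

```java
import java.util.Arrays;

// Instrumented KMP search that counts search-phase comparisons.
class KmpCostDemo {
    static long comparisons;

    /** Returns the index of the first match, or -1; pattern must be non-empty. */
    static int kmpIndexOf(char[] source, char[] pattern) {
        comparisons = 0;
        int[] rollback = new int[pattern.length];
        rollback[0] = -1;
        for (int i = 1; i < pattern.length; i++) {
            int k = rollback[i - 1];
            while (k >= 0 && pattern[i - 1] != pattern[k]) {
                k = rollback[k];
            }
            rollback[i] = k + 1;
        }
        int j = 0;
        for (int i = 0; i < source.length;) {
            if (j == -1) {
                // the first pattern char failed: move on without comparing
                i++;
                j = 0;
                continue;
            }
            comparisons++;
            if (source[i] == pattern[j]) {
                i++;
                j++;
                if (j == pattern.length) {
                    return i - j;
                }
            } else {
                j = rollback[j]; // i is never decreased
            }
        }
        return -1;
    }

    /** Builds an array made of count copies of c. */
    static char[] repeat(char c, int count) {
        char[] chars = new char[count];
        Arrays.fill(chars, c);
        return chars;
    }

    public static void main(String[] args) {
        char[] source = repeat('a', 1000);
        char[] pattern = repeat('a', 100);
        pattern[99] = 'b';
        int index = kmpIndexOf(source, pattern);
        System.out.println(index + " after " + comparisons + " comparisons");
        // prints: -1 after 1901 comparisons
    }
}
```

Each comparison either advances i or rolls j back, so the search phase never exceeds 2n comparisons. On short, ordinary inputs the difference is far smaller than in this adversarial case, so the sketch shows the bound rather than typical behavior.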

This article is available at http://www.nowamagic.net/librarys/veda/detail/1137.
