[Algorithm series 14] String matching: the Morris-Pratt string search algorithm

Source: Internet
Author: User

Objective

As we saw earlier, neither the brute-force string matching algorithm nor the Rabin-Karp algorithm is efficient. To improve an algorithm, however, we first need to understand in detail how it works. We already know that brute-force matching is slow, and that Rabin-Karp tries to improve on it with a hash function. The problem is that the worst-case complexity of Rabin-Karp is the same as that of brute force: both are O(nm), where n is the length of the text string and m is the length of the pattern string.

We clearly need a different approach. To find one, let's look at what is wrong with brute-force matching; studying how it works points directly to the answer.

A brute-force matching algorithm checks each character of the text string against the first character of the pattern string. On a match, it compares the second character of the pattern string with the next character of the text string, and so on. The problem is that when a mismatch occurs, we have to roll back several positions in the text string. Within this approach, there is little room for optimization.
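The procedure just described can be sketched in C++ as follows. This is a minimal illustration rather than code from the original article; the function name bruteForceSearch is an assumption made for this example.

```cpp
#include <string>

// Brute-force search: returns the index of the first occurrence of
// `pattern` in `text`, or -1 if there is none.
// The name bruteForceSearch is illustrative, not from the original article.
int bruteForceSearch(const std::string& text, const std::string& pattern) {
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    for (int start = 0; start + m <= n; ++start) {
        int k = 0;
        // Compare the pattern with the text, character by character.
        while (k < m && text[start + k] == pattern[k]) {
            ++k;
        }
        if (k == m) {
            return start;  // every pattern character matched
        }
        // Mismatch: the next attempt begins at start + 1, so up to k text
        // characters that were already examined are compared again.
    }
    return -1;
}
```

For example, bruteForceSearch("abcabd", "abd") returns 3. The rollback is the step from start + k back to start + 1 after a mismatch, which is where the repeated work comes from.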


In brute-force string matching, a mismatch forces a rollback, and characters that have already been matched are compared again!

The problem is clear: once a mismatch occurs, we must roll back and restart from a position in the text string that has already been examined. In our example, we have checked the first, second, third, and fourth characters when the mismatch between the pattern string and the text string occurs, so we have to go back and start the comparison again from the second character of the text string.

This is clearly wasted work, because we already know that the pattern string starts with the character "a" and that no such character occurs between position 1 and position 3. So how do we avoid this unnecessary repetition?

Overview

James H. Morris and Vaughan Pratt answered this question with an algorithm of their own, first described in a 1970 technical report; the better-known Knuth-Morris-Pratt refinement was published in 1977. Their algorithm skips many useless comparisons, which makes it more efficient than brute-force matching. Let's look at it in detail. The key idea is to use the information gathered while comparing the pattern string against a possible match.


Morris-Pratt moves forward to the next possible matching position, skipping unnecessary comparisons!

First, we preprocess the pattern string to find, for each position, where matching can resume. Then we start searching for matches; on a mismatch, we know exactly where to jump to, skipping the comparisons that cannot succeed.

Generating the table of next comparison positions

This is the cleverest part of the Morris-Pratt algorithm, and it is the key step in overcoming the weakness of brute-force matching. Let's look at a few cases.


Clearly, if the pattern string consists only of distinct characters, then on a mismatch we can restart from the first character of the pattern string at the text position where the mismatch occurred, without rolling back!

However, if a character repeats in the pattern string and a mismatch occurs after the repeated character, we must resume looking for a possible match from that repeated character.


If the pattern string contains repeated characters, the next-position table is slightly different!

Finally, if more than one character repeats in the pattern string, the "next" table records the position to resume from in each case.
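To make the three cases above concrete, here is a small sketch that builds the "next" table for a pattern string. It assumes the convention in which entry 0 is -1 and the table has one more entry than the pattern has characters; the function name buildNextTable is hypothetical.

```cpp
#include <string>
#include <vector>

// Builds the "next" table for `pattern` (illustrative helper name).
// next[i] is the pattern position to resume from when a mismatch occurs
// at pattern position i; next[0] is -1, meaning "advance in the text".
std::vector<int> buildNextTable(const std::string& pattern) {
    const int m = static_cast<int>(pattern.size());
    std::vector<int> next(m + 1);
    int i = 0;
    int j = next[0] = -1;
    while (i < m) {
        // Fall back until pattern[i] extends a shorter matched prefix.
        while (j > -1 && pattern[i] != pattern[j]) {
            j = next[j];
        }
        next[++i] = ++j;
    }
    return next;
}

// Tables produced for three sample patterns:
//   "abcd" (all distinct)         -> -1 0 0 0 0
//   "abab" (repeated characters)  -> -1 0 0 1 2
//   "aaaa" (one character only)   -> -1 0 1 2 3
```

With distinct characters every entry after the first is 0, so matching always restarts at the beginning of the pattern string. With repetitions, the larger entries mark how much of the pattern string is already known to match.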

With this table of "next" possible positions, we can begin searching for the pattern string in the text string.

Implementation

Implementing the Morris-Pratt algorithm is not difficult. First we preprocess the pattern string, and then we perform the search. The original article's implementation is in PHP; here we use C++.

    /*--------------------------------
     * Date: 2015-02-05
     * Author: sjf0115
     * Title: Morris-Pratt string matching algorithm
     * Blog:
     *--------------------------------*/
    #include <iostream>
    #include <string>
    #include <vector>
    using namespace std;

    // Preprocessing: nextTable[i] is the pattern position to resume from
    // after a mismatch at pattern position i (nextTable[0] is -1).
    void PreprocessMorrisPratt(const string& pattern, vector<int>& nextTable) {
        int i = 0;
        int j = nextTable[0] = -1;
        int size = pattern.size();
        while (i < size) {
            while (j > -1 && pattern[i] != pattern[j]) {
                j = nextTable[j];
            }
            nextTable[++i] = ++j;
        }
    }

    // Returns the index of the first occurrence of pattern in text, or -1.
    int SubString(const string& text, const string& pattern) {
        int m = pattern.size();
        int n = text.size();
        if (m == 0) {
            return 0;  // an empty pattern matches at position 0
        }
        // std::vector replaces the non-standard variable-length array.
        vector<int> nextTable(m + 1);
        PreprocessMorrisPratt(pattern, nextTable);
        int i = 0, j = 0;
        while (j < n) {
            // On a mismatch, jump within the pattern; j never moves backward.
            while (i > -1 && pattern[i] != text[j]) {
                i = nextTable[i];
            }
            i++;
            j++;
            if (i >= m) {
                return j - i;
            }
        }
        return -1;
    }

    int main() {
        string text("Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quisque eleifend nisi viverra ipsum elementum porttitor quis at justo. Aliquam ligula felis dignissim sit amet lobortis eget lacinia ac augue. Quisque nec est elit, nec ultricies magna. Ut mi libero, dictum sit amet mollis non, aliquam et augue! ");
        string pattern("mollis");
        int result = SubString(text, pattern);
        cout << "Subscript Location " << result << endl;
        return 0;
    }

Complexity

This algorithm needs some extra time and space for preprocessing. Preprocessing the pattern string takes O(m), where m is the length of the pattern string, and the search itself takes O(n+m), where n is the length of the text string. The good news is that the preprocessing for a given pattern string is done only once, after which you can run as many searches for that pattern as you need!
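One way to take advantage of the one-time preprocessing is to keep the table together with the pattern string. The sketch below is an assumption about how that could be organized, not part of the original article: the class name MorrisPrattMatcher and its findAll method are hypothetical. The table is built once in the constructor and reused for every search, and findAll reports every occurrence, including overlapping ones.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper class: preprocesses the pattern once and reuses
// the "next" table for every later search.
class MorrisPrattMatcher {
public:
    explicit MorrisPrattMatcher(std::string pattern)
        : pattern_(std::move(pattern)), next_(pattern_.size() + 1) {
        const int m = static_cast<int>(pattern_.size());
        int i = 0;
        int j = next_[0] = -1;
        while (i < m) {
            while (j > -1 && pattern_[i] != pattern_[j]) {
                j = next_[j];
            }
            next_[++i] = ++j;
        }
    }

    // Returns the start index of every occurrence of the pattern in `text`.
    std::vector<int> findAll(const std::string& text) const {
        std::vector<int> hits;
        const int m = static_cast<int>(pattern_.size());
        const int n = static_cast<int>(text.size());
        if (m == 0) {
            return hits;  // this sketch reports nothing for an empty pattern
        }
        int i = 0;
        for (int j = 0; j < n; ++j) {
            while (i > -1 && pattern_[i] != text[j]) {
                i = next_[i];
            }
            ++i;
            if (i == m) {
                hits.push_back(j + 1 - m);
                i = next_[i];  // keep going; allows overlapping matches
            }
        }
        return hits;
    }

private:
    std::string pattern_;    // declared first: next_ is sized from it
    std::vector<int> next_;
};
```

For instance, a matcher constructed with "aba" reports positions 0 and 2 for the text "ababa", and the same object can then be applied to any number of other texts without rebuilding the table.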

The original article shows a chart here that plots O(n+m) for a 5-letter pattern string against O(nm).
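The difference between O(n+m) and O(nm) can also be observed by counting character comparisons directly. The following sketch is an illustration under stated assumptions: the function names countBruteForce and countMorrisPratt are hypothetical, and only comparisons made during the search are counted, not those made while building the table.

```cpp
#include <string>
#include <vector>

// Illustrative helpers (names are assumptions, not from the original
// article). Each returns the number of text-versus-pattern character
// comparisons made until the first match or the end of the text.

long countBruteForce(const std::string& text, const std::string& pattern) {
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    long comparisons = 0;
    for (int start = 0; start + m <= n; ++start) {
        int k = 0;
        while (k < m) {
            ++comparisons;
            if (text[start + k] != pattern[k]) break;
            ++k;
        }
        if (k == m) break;  // first match found
    }
    return comparisons;
}

long countMorrisPratt(const std::string& text, const std::string& pattern) {
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    if (m == 0) return 0;

    // Build the "next" table (these comparisons are not counted).
    std::vector<int> next(m + 1);
    int i = 0;
    int j = next[0] = -1;
    while (i < m) {
        while (j > -1 && pattern[i] != pattern[j]) j = next[j];
        next[++i] = ++j;
    }

    long comparisons = 0;
    i = 0;
    for (j = 0; j < n; ++j) {
        while (i > -1) {
            ++comparisons;
            if (pattern[i] == text[j]) break;
            i = next[i];
        }
        ++i;
        if (i == m) break;  // first match found
    }
    return comparisons;
}
```

On a text of 1000 copies of "a" and the 5-letter pattern "aaaab", which never matches, brute force makes 5 comparisons at each of the 996 starting positions, 4980 in total, while the Morris-Pratt search makes 1996, which is below 2n.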

Application

Advantages

Its search complexity is O(n+m), which is faster than the brute-force algorithm and the worst case of the Rabin-Karp algorithm
It is fairly easy to implement

Disadvantages

Requires additional O(m) space and time for preprocessing
Can be optimized slightly further (Knuth-Morris-Pratt)

Conclusion

Clearly this algorithm is useful, because it improves on brute-force matching in a very elegant way. On the other hand, we should be aware that faster string search algorithms exist, such as the Boyer-Moore algorithm. Still, the Morris-Pratt algorithm is useful in many cases, so it is worth understanding how it works.

Original link

Computer Algorithms: Morris-Pratt String Searching

