"Data structure" string &KMP substring matching algorithm

Source: Internet
Author: User

String

Programs interact with people, so they almost always have to handle textual information. How to represent human language inside a computer is therefore a real problem, and the string is the answer. Formally, a string can be regarded as a kind of linear table whose elements are characters drawn from a chosen character set. But a string expresses its meaning as a whole, and this gives it some special characteristics. With a general linear table we mostly care about the relationship between the elements and the table, and about relationships and operations among the elements; with a string we often need to examine and manipulate the table as a whole.

Basic string concepts such as length, size, and substring are familiar to anyone with a little programming experience, so I will not go over them. An abstract data type (ADT) for strings might look like this:

    ADT String:
        String(self, sseq)          # create a string from the character sequence sseq
        is_empty(self)              # test whether the string is empty
        len(self)                   # return the length of the string
        char(self, index)           # return the character at the given index
        substr(self, start, end)    # slice the string and return the resulting substring
        match(self, string)         # find the first position of string in this string; return -1 if absent
        concat(self, string)        # build the concatenation of this string and string
        subst(self, str1, str2)     # build the string in which every occurrence of str1 is replaced by str2

These are the basic operations of a string class. Most are simple; only match and subst are complicated, because both depend on substring search, which is discussed later.
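As a rough illustration, the ADT can be sketched as a thin wrapper around Python's built-in str. The class name String and the choice to delegate everything to str are assumptions made for this sketch, not something the article prescribes.

```python
class String:
    """A minimal sketch of the string ADT, delegating to Python's built-in str."""

    def __init__(self, sseq):
        self._s = "".join(sseq)          # build the string from a character sequence

    def is_empty(self):
        return len(self._s) == 0

    def len(self):
        return len(self._s)

    def char(self, index):
        return self._s[index]

    def substr(self, start, end):
        return String(self._s[start:end])

    def match(self, string):
        return self._s.find(string._s)   # str.find returns -1 when absent

    def concat(self, string):
        return String(self._s + string._s)

    def subst(self, str1, str2):
        return String(self._s.replace(str1._s, str2._s))

    def __str__(self):
        return self._s


s = String("ababc")
print(s.match(String("abc")))                # 2
print(s.subst(String("ab"), String("x")))    # xxc
```

Here match and subst simply reuse str.find and str.replace; the rest of the article is about what such a search has to do internally.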

Basic implementation of strings

Because a string is essentially a linear table, it is natural to implement it as either a sequential table or a linked list, following the usual classification of linear tables. A string can also be stored in a form between the two: the character sequence is split into fragments held in a set of storage blocks, and link fields connect the blocks. In C and the languages descended from it, shorter strings are usually implemented as sequential tables. Sequential tables come in two kinds, the all-in-one sequential table and the separated sequential table, and the separated kind is the dynamic sequential table that can grow as needed. Which one to use depends on the requirements. If the size of a string is fixed when it is created, an all-in-one sequential table is enough; Python's str type is immutable, so this layout suits it. Other languages may require string variables whose content can change dynamically, and then a dynamic sequential table is needed.

For immutable strings there is also the question of how to tell where the string ends. There are two solutions. The first is to follow the sequential table and keep some extra information in the string, such as its length. The second is to append a terminating code to the end of the string; this code can then no longer be used as an ordinary character. C uses the second method (a terminating NUL character), while Python's str, as described below, records the length.
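The two ways of marking the end of a string can be sketched with Python bytes objects. The four function names and the one-byte length header are hypothetical choices for this illustration; real implementations use wider length fields.

```python
def pack_with_length(text):
    """Store the length in a one-byte header, followed by the characters."""
    data = text.encode("ascii")
    if len(data) > 255:
        raise ValueError("this sketch uses a single length byte")
    return bytes([len(data)]) + data


def unpack_with_length(buf):
    n = buf[0]                           # read the length from the header
    return buf[1:1 + n].decode("ascii")


def pack_with_terminator(text):
    """Append a NUL byte, which therefore cannot appear inside the string."""
    data = text.encode("ascii")
    if b"\x00" in data:
        raise ValueError("NUL is reserved as the terminator")
    return data + b"\x00"


def unpack_with_terminator(buf):
    end = buf.index(b"\x00")             # scan for the terminator: O(n)
    return buf[:end].decode("ascii")


print(pack_with_length("abc"))           # b'\x03abc'
print(pack_with_terminator("abc"))       # b'abc\x00'
```

With a stored length, finding the end is a constant-time lookup; with a terminator, the length has to be found by scanning.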

As for character encoding, younger languages such as Python and Java encode strings with the Unicode character set by default, while most older languages default to ASCII or extended ASCII.
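A small example of what Unicode strings mean in practice in Python 3: the length of a str counts characters (code points), not bytes, and the byte count depends on the encoding chosen. The sample text is arbitrary.

```python
s = "KMP算法"

print(len(s))                      # 5 characters (code points)
print(len(s.encode("utf-8")))      # 9 bytes: 3 ASCII letters + 2 CJK characters of 3 bytes each

try:
    s.encode("ascii")              # ASCII cannot represent the CJK characters
except UnicodeEncodeError:
    print("not representable in ASCII")
```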

A string in Python

The following describes Python strings from the point of view of data structures and algorithms. I have covered the specific methods and properties of strings many times before; see my other articles for those.

First, a Python string is an immutable type: the length and content of the object are fixed when it is created. Because different string objects have different lengths and contents, a string object in Python is laid out as a sequential table with roughly three regions: the length of the string, some other information about the string (for example, information the interpreter uses to manage the object), and the character storage area.

The operations of the str class fall into three groups: obtaining information about a str object (len, isdigit, and so on); creating new str objects, either from existing objects or from scratch (slicing, formatting, replacing, and so on); and searching within a string (count, endswith, find, and so on).

Among these operations, len and access to a character by index are clearly O(1). Most of the others have to scan the whole string: searching, in, not in, isdigit and the like take at least O(n) time.
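The points above can be checked directly in the interpreter. This is only a small demonstration with an arbitrary sample string; it shows behaviour, not internal layout.

```python
s = "hello world"

try:
    s[0] = "H"                       # str does not support item assignment
except TypeError as exc:
    print("immutable:", exc)

t = s.replace("world", "there")      # builds a new object; s is unchanged
print(s, "|", t)                     # hello world | hello there

print(len(s), s[4])                  # constant-time operations: 11 o
print("wor" in s, s.find("wor"), s.isdigit())   # scanning operations: True 6 False
```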

String matching

The basic implementation of strings and their operations is similar to that of linear tables and needs little discussion. What does need to be discussed in detail is the problem of string matching.

String matching looks simple, but there is a great deal of theory behind it. First, it matters: it is used everywhere in practice, including text search, spam filtering, search engines crawling hundreds of millions of pages, and even matching sequences of the four bases in DNA analysis (it is said that more than half of today's computing power is spent matching DNA sequences). Because string matching is so important, it has been studied extensively and many matching algorithms of varying complexity exist. Two of them are described below.

Before looking at specific algorithms, define a few terms. The target string is the string that is searched; it is the long one, the raw material. The pattern string is the string that is searched for; it is the short one, the tool. The target is generally much longer than the pattern.

Naive matching algorithm

Naive matching is, as the name implies, a very simple algorithm. Its basic idea rests on two points:

1. Compare the pattern string with the target string character by character, from left to right.

2. When a mismatch is found, end the current round and start the next round one character further along in the target string.

The simple matching algorithm is easy to implement:

    def naive_matching(t, p):
        m, n = len(p), len(t)
        i, j = 0, 0
        while i < m and j < n:
            if p[i] == t[j]:              # characters agree: advance in both strings
                i, j = i + 1, j + 1
            else:                         # mismatch: restart the pattern one position further along
                i, j = 0, j - i + 1
        if i == m:
            return j - i                  # position of the pattern in the target
        return -1

It is easy to see that this algorithm is inefficient. The main reason is the backtracking that occurs during execution: when the pattern fails to match, it moves forward by only one character, and matching restarts from subscript 0 of the pattern at the next character of the target. The worst case is when every round finds the mismatch only at the last character of the pattern, and the match itself lies at the very end of the target, for example the pattern 00001 against the target 00000000000000000000000000001. For a pattern of length m and a target of length n, this worst case needs n-m+1 rounds of m comparisons each, so the total cost is m*(n-m+1), which is O(m*n), a quadratic complexity.

The naive algorithm is inefficient because it treats every character comparison as an independent action and makes no use of the fact that a string is a whole. In mathematical terms, it implicitly assumes that the characters of the target and the pattern are completely random with infinitely many possible values, so that any two comparisons between pattern and target are independent of each other. In reality the pattern is fixed, and earlier comparisons carry information. The KMP algorithm described next improves on the naive algorithm in exactly this way.
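The worst-case count can be checked with a small experiment. The sketch below is the naive algorithm with a comparison counter added; the function name naive_matching_count and the test target of 28 zeros followed by a 1 are choices made for this illustration.

```python
def naive_matching_count(t, p):
    """Naive matching that also counts character comparisons."""
    m, n = len(p), len(t)
    i, j, comparisons = 0, 0, 0
    while i < m and j < n:
        comparisons += 1                 # one comparison of p[i] with t[j]
        if p[i] == t[j]:
            i, j = i + 1, j + 1
        else:
            i, j = 0, j - i + 1
    pos = j - i if i == m else -1
    return pos, comparisons


t = "0" * 28 + "1"
p = "00001"
pos, comparisons = naive_matching_count(t, p)
n, m = len(t), len(p)
print(pos, comparisons, m * (n - m + 1))   # 24 125 125
```

Every one of the 25 possible alignments costs the full 5 comparisons, which is the m*(n-m+1) figure from the text.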

KMP algorithm

The KMP algorithm was proposed by Knuth, Morris, and Pratt; the name KMP comes from the initials of these three people.

The basic idea of KMP is to extract information from each round of matching the pattern against the target, and to use it to skip rounds that cannot succeed. Consider the following example: the target string is ababcabcacbab and the pattern string is abcac.

In the first round the match fails at position 2 of the pattern (positions here and below are zero-based subscripts), while positions 0 and 1 matched. This tells us that character 1 of the target is b. Since characters 0 and 1 of the pattern differ, there is no point in aligning the start of the pattern with position 1 of the target, so the second round of the naive algorithm is unnecessary. KMP instead moves the pattern two characters to the right, to position 2 of the target. In this second KMP round, positions 0 to 3 of the pattern match and position 4 fails. The naive algorithm would next try the pattern at positions 3 and 4 of the target, and both rounds are again unnecessary. But because the matched part abca ends with an a, which is also the first character of the pattern, the pattern may only be moved three characters to the right, so that no correct match is missed. If the pattern were abcdc and the target ababcdb..., the pattern could be moved four characters.

To summarize and abstract the KMP approach: the key is to take the position at which the previous round failed, combine it with information obtained by analysing the pattern, and then move the pattern in one "big step", saving the useless rounds in between.

This raises some questions. How do we determine how many characters to move? And "analysing the pattern" is too abstract: how is it analysed, and what information do we get?

To answer these questions we need some notation. Write the target string as t0 t1 t2 ... tj ... and the pattern string as p0 p1 ... pi .... The first thing to be clear about is that, whatever the target, for a fixed pattern the distance to move after a mismatch at a particular pattern character is fixed. This may sound odd, but consider: when pi fails to match, all of p0 to pi-1 have already matched, so we already know the content of that part of the target. It is then not surprising that the number of characters to move can be worked out from this content alone. This also sends a clear signal for the algorithm: how far to move after a mismatch at a given character is a property of the pattern itself and has nothing to do with the target. So before the actual matching we can compute, for every position in the pattern, how far to move when a mismatch occurs there, and this data speeds up the matching. We call this analysis pattern preprocessing. Preprocessing should produce a list pnext of the same length as the pattern, where pnext[i] records, for a mismatch at pi, the length of the longest prefix of the substring p0 ... pi-1 that is also a suffix of it (the concept of the longest equal prefix and suffix is explained in detail below).

Preprocessing also meets a special case: position 0 of any pattern. If the very first character fails, there is no previously matched substring and so no prefix to speak of, so pnext[0] is defined to be -1.

So why is pnext constructed this way? The book explains it with a figure; the argument goes as follows.

When pi fails to match tj, positions 0 to i-1 of the pattern are identical to the corresponding part of the target, so the target just before tj can be written as p0 ... pi-1. Moving the pattern to the right for the next round means finding a position k such that, when pk is compared with tj, p0 ... pk-1 agrees exactly with the k characters of the target just before tj, which are pi-k ... pi-1. Now pi-k ... pi-1 is a suffix of p0 ... pi-1, and p0 ... pk-1 is a prefix of it (for a string s, the prefixes are s[0:n] and the suffixes are s[n:], where 0 <= n <= len(s)). So finding k becomes the problem of finding the length of a prefix of p0 ... pi-1 that equals a suffix of it. The smaller k is, the farther the pattern moves; as noted earlier, to avoid missing any correct match the move should be as short as possible. So when several values of k qualify, that is, when several prefixes equal suffixes, we take the longest one. The whole of p0 ... pi-1 is excluded, since it would mean no move at all; the empty string is allowed, and it means that no non-empty equal prefix and suffix exists, so p0 is compared with tj next.

How to compute pnext

The problem has now become how to compute pnext: for each character of the pattern, find the length of the longest equal prefix and suffix of the substring consisting of all the characters before it.

For a simple pattern such as ababc we can compute by hand. Position 0 is -1. Position 1 asks for the longest equal prefix and suffix of "a", which is 0 (the string itself does not count). Position 2 asks the same for "ab", also 0. Position 3 is for "aba", which has the equal prefix and suffix "a" of length 1, so the value is 1. Position 4 is for "abab", whose longest equal prefix and suffix is "ab", so the value is 2. The final pnext is [-1, 0, 0, 1, 2].
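The hand calculation can be reproduced by a brute-force function that follows the definition literally. The name pnext_by_definition is hypothetical, and the function is deliberately slow; it is only meant as a reference against which a faster method can be checked.

```python
def pnext_by_definition(p):
    """Compute pnext directly from its definition (slow, for checking only)."""
    pnext = [-1] * len(p)
    for i in range(1, len(p)):
        matched = p[:i]                          # p0 ... p(i-1)
        # try the longest proper prefix first, down to the empty string
        for k in range(i - 1, -1, -1):
            if matched[:k] == matched[i - k:]:   # prefix of length k equals suffix of length k
                pnext[i] = k
                break
    return pnext


print(pnext_by_definition("ababc"))   # [-1, 0, 0, 1, 2]
print(pnext_by_definition("abcac"))   # [-1, 0, 0, 0, 1]
```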

To compute pnext with a function, we can reason by mathematical induction:

1. pnext[0] is -1.

2. Suppose pnext[i] = k is already known, that is, p0 ... pk-1 is the longest prefix of p0 ... pi-1 that equals a suffix of it. The next characters after this prefix and this suffix are pk and pi. If pk == pi, the equal prefix and suffix each grow by one character, so pnext[i+1] = k+1. If they differ, the current prefix cannot be extended, and we have to fall back and look for a shorter prefix that, with one more character appended, equals a suffix. The shorter prefix must itself equal a suffix of p0 ... pi-1. Because p0 ... pk-1 is identical to the last k characters of p0 ... pi-1, any prefix of p0 ... pk-1 that equals a suffix of p0 ... pk-1 also equals a suffix of p0 ... pi-1, and the longest such prefix has length pnext[k]. So we set k = pnext[k] and compare pk with pi again; this is why searching by "shorter prefix plus one character" is correct. This part is a little confusing in the code; I thought about it for an afternoon and half the night before finding the answer in a paper.

This gives the function that generates pnext:

    def gen_pnext(p):
        i, k, m = 0, -1, len(p)
        pnext = [-1] * m                  # initialize pnext
        while i < m - 1:
            if k == -1 or p[i] == p[k]:   # prefix extends by one character (or restarts from empty)
                i, k = i + 1, k + 1
                pnext[i] = k
            else:                         # fall back to the next shorter equal prefix
                k = pnext[k]
        return pnext

Passing a pattern string to this function returns its pnext list, which is then used during matching. When a mismatch occurs, look up the k value for that pattern position in pnext, align p[k] with the target character at which the mismatch occurred, and start the next round of matching.

As an example, apply this to the abcac pattern above, whose pnext is [-1, 0, 0, 0, 1], with the target ababcabcacbab. The first round fails at position 2 of the pattern, whose k value is 0, so p[0] (a) is aligned with position 2 of the target (a) for the second round. The second round fails at position 4 of the pattern (c), whose k value is 1, so p[1] (b) is aligned with position 6 of the target for the third round. The third round matches successfully.
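The rounds of this example can be traced in code. The sketch below records every mismatch as a tuple (position in target, position in pattern, k looked up in pnext); the function name kmp_trace is an assumption for this illustration, and gen_pnext is repeated so that the block runs on its own.

```python
def gen_pnext(p):
    """Build the pnext table for pattern p."""
    i, k, m = 0, -1, len(p)
    pnext = [-1] * m
    while i < m - 1:
        if k == -1 or p[i] == p[k]:
            i, k = i + 1, k + 1
            pnext[i] = k
        else:
            k = pnext[k]
    return pnext


def kmp_trace(t, p):
    """Run KMP and record (j, i, pnext[i]) at every mismatch."""
    pnext = gen_pnext(p)
    i, j = 0, 0
    mismatches = []
    while j < len(t) and i < len(p):
        if i == -1 or t[j] == p[i]:
            i, j = i + 1, j + 1
        else:
            mismatches.append((j, i, pnext[i]))
            i = pnext[i]                 # j stays where it is: no backtracking
    pos = j - i if i == len(p) else -1
    return pos, mismatches


pos, mismatches = kmp_trace("ababcabcacbab", "abcac")
print(pos)          # 5
print(mismatches)   # [(2, 2, 0), (6, 4, 1)]
```

The two recorded mismatches are exactly the two failed rounds described in the text, and the match is found at position 5 of the target.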

Abstracting this process and combining it with pnext gives the complete KMP matching function:

    def matching_KMP(t, p, pnext):
        j, i = 0, 0
        n, m = len(t), len(p)
        while j < n and i < m:
            if i == -1 or t[j] == p[i]:
                # i == -1 means even p[0] failed against t[j], so move on to the next target character.
                # Otherwise p[i] is the p[k] that was aligned with the position of the last mismatch.
                j, i = j + 1, i + 1
            else:
                i = pnext[i]
        if i == m:                        # the whole pattern has been matched
            return j - i
        return -1

Now look at the complexity of KMP. Generating pnext takes O(m) time, where m is the length of the pattern. The matching function has a very similar structure, and its time complexity is O(n), where n is the length of the target. Together, the whole KMP algorithm is O(m+n). Since usually m << n, the complexity can be taken as approximately O(n). After this long detour, the O(m*n) of naive matching has finally been reduced to O(n).
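The linear bound can be observed by counting comparisons on the same kind of input that is worst for the naive algorithm. The function name kmp_count and the test strings are assumptions for this sketch. It relies on the standard argument that every comparison either advances j or shifts the pattern to the right, so there are at most 2n comparisons in the matching loop.

```python
def gen_pnext(p):
    """Build the pnext table for pattern p."""
    i, k, m = 0, -1, len(p)
    pnext = [-1] * m
    while i < m - 1:
        if k == -1 or p[i] == p[k]:
            i, k = i + 1, k + 1
            pnext[i] = k
        else:
            k = pnext[k]
    return pnext


def kmp_count(t, p):
    """KMP matching that also counts character comparisons."""
    pnext = gen_pnext(p)
    i, j, comparisons = 0, 0, 0
    n, m = len(t), len(p)
    while j < n and i < m:
        if i != -1:
            comparisons += 1             # t[j] is about to be compared with p[i]
        if i == -1 or t[j] == p[i]:
            i, j = i + 1, j + 1
        else:
            i = pnext[i]
    pos = j - i if i == m else -1
    return pos, comparisons


t = "0" * 28 + "1"
p = "00001"
pos, comparisons = kmp_count(t, p)
print(pos, comparisons, 2 * len(t))      # the count stays below 2n
```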

Improving the pnext generation function

The part of the pnext generation algorithm that sets pnext[i] can be improved a little further. When a match fails at pi we know pi != tj. If in addition pi == pk for k = pnext[i], then pk cannot equal tj either, and there is no need to compare them. In other words, if the character following the longest equal prefix is the same as the character at which the match failed, aligning pk with tj is certain to fail too, so we can go straight to pnext[k]. This moves the pattern farther and may improve efficiency (although I am not entirely sure I understand it). The modified function is as follows:

    def gen_pnext2(p):
        i, k, m = 0, -1, len(p)
        pnext = [-1] * m
        while i < m - 1:
            if k == -1 or p[i] == p[k]:
                i, k = i + 1, k + 1
                # i and k have both been incremented, so p[i] and p[k] are now the characters
                # just after the suffix and the prefix; they are not necessarily equal.
                if p[i] == p[k]:
                    pnext[i] = pnext[k]   # aligning p[k] would fail as well: skip ahead
                else:
                    pnext[i] = k
            else:
                k = pnext[k]
        return pnext
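To see what the improvement changes, the two tables can be compared on a pattern with many repeated characters. The block below repeats both generators and the matcher so that it is self-contained; the pattern aaaab and the target aaabaaaab are arbitrary test inputs chosen for this sketch.

```python
def gen_pnext(p):
    i, k, m = 0, -1, len(p)
    pnext = [-1] * m
    while i < m - 1:
        if k == -1 or p[i] == p[k]:
            i, k = i + 1, k + 1
            pnext[i] = k
        else:
            k = pnext[k]
    return pnext


def gen_pnext2(p):
    i, k, m = 0, -1, len(p)
    pnext = [-1] * m
    while i < m - 1:
        if k == -1 or p[i] == p[k]:
            i, k = i + 1, k + 1
            if p[i] == p[k]:
                pnext[i] = pnext[k]      # p[k] would fail too, so skip it
            else:
                pnext[i] = k
        else:
            k = pnext[k]
    return pnext


def matching_KMP(t, p, pnext):
    j, i = 0, 0
    n, m = len(t), len(p)
    while j < n and i < m:
        if i == -1 or t[j] == p[i]:
            j, i = j + 1, i + 1
        else:
            i = pnext[i]
    return j - i if i == m else -1


p = "aaaab"
print(gen_pnext(p))    # [-1, 0, 1, 2, 3]
print(gen_pnext2(p))   # [-1, -1, -1, -1, 3]

t = "aaabaaaab"
print(matching_KMP(t, p, gen_pnext(p)), matching_KMP(t, p, gen_pnext2(p)))   # 4 4
```

Both tables lead to the same match position; with the second table a mismatch on one of the leading a characters jumps straight past the target character instead of retrying each shorter run of a.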

Scenarios suited to KMP, and other algorithms

Many scenarios match one pattern string against many target strings. The pnext table of the pattern can then be generated once and reused, which improves efficiency. This is the scenario that suits KMP best.
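A minimal sketch of this reuse, assuming a hypothetical helper find_in_all that searches a list of targets for one pattern; gen_pnext and matching_KMP are repeated so that the block runs on its own.

```python
def gen_pnext(p):
    i, k, m = 0, -1, len(p)
    pnext = [-1] * m
    while i < m - 1:
        if k == -1 or p[i] == p[k]:
            i, k = i + 1, k + 1
            pnext[i] = k
        else:
            k = pnext[k]
    return pnext


def matching_KMP(t, p, pnext):
    j, i = 0, 0
    n, m = len(t), len(p)
    while j < n and i < m:
        if i == -1 or t[j] == p[i]:
            j, i = j + 1, i + 1
        else:
            i = pnext[i]
    return j - i if i == m else -1


def find_in_all(targets, p):
    """Search every target for p, building the pnext table only once."""
    pnext = gen_pnext(p)                 # O(m), paid a single time
    return [matching_KMP(t, p, pnext) for t in targets]


print(find_in_all(["ababcabcacbab", "abcabc", "xxabcac"], "abcac"))   # [5, -1, 2]
```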

Because KMP never backtracks in the target, it also supports matching while reading: it never goes back to reread, so the part of the target already matched does not need to be kept. This makes KMP well suited to processing large amounts of data read from an external source.
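Matching while reading can be sketched as a function that consumes the target one character at a time from any iterator and never looks at an earlier character again. The name kmp_stream is an assumption for this sketch, and it assumes a non-empty pattern.

```python
def gen_pnext(p):
    i, k, m = 0, -1, len(p)
    pnext = [-1] * m
    while i < m - 1:
        if k == -1 or p[i] == p[k]:
            i, k = i + 1, k + 1
            pnext[i] = k
        else:
            k = pnext[k]
    return pnext


def kmp_stream(chars, p):
    """Return the start of the first match of p in the character stream, or -1.

    chars may be any iterable of characters; each one is read exactly once.
    """
    pnext = gen_pnext(p)
    i = 0
    for j, c in enumerate(chars):
        while i != -1 and c != p[i]:     # fall back inside the pattern only
            i = pnext[i]
        i += 1                           # c matched p[i], or i was -1
        if i == len(p):
            return j - i + 1
    return -1


stream = iter("ababcabcacbab")           # stands in for data arriving from outside
print(kmp_stream(stream, "abcac"))       # 5
```

Only the current character, the pattern, and its pnext table are held in memory, whatever the length of the stream.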

There are other algorithms, such as the BM (Boyer-Moore) algorithm, which can be much faster than KMP in some scenarios. In short, string matching is a deep subject.

Regular expressions

The string matching discussed above is only simple matching against a fixed pattern string. Matching requirements in real problems can be far more complex than that. Simple substring matching is in fact a special case of the general pattern matching problem, in which a single pattern describes a whole set of strings to be matched. When this set is very large, possibly even infinite, an effective way of describing and matching it has to be designed.

One effective approach is to design a pattern language: a pattern is written as a string, and that string describes the set of strings to match. Pattern languages have been studied extensively. But as a pattern language becomes more complex, its matching algorithm becomes more complex too, and pattern matching can become very expensive or even unsolvable; such a pattern language is of no practical value. A useful pattern language is a balance between descriptive power and processing cost.

Regular expressions have stood the test of practice and have become almost a standard pattern language. The basic building block of a regular expression is still the character, but characters are divided into ordinary characters and characters with special meanings. An ordinary character stands for itself; a special character has a special meaning. To use a special character as an ordinary one, it must be preceded by an escape symbol. Regular expressions have the following basic properties:

Ordinary characters in a regular expression match only themselves.

If α and β are regular expressions, they can be combined in sequence as "αβ": if α matches the string s and β matches t, then αβ matches s+t.

α and β can also be combined as a choice, "α|β", which matches either s or t.

Regular expressions have wildcards: a symbol that stands for any character, combined with a repetition symbol, matches text of any length and any content. For example, ".*" is such an expression.

The specific meaning and usage of the special characters is not repeated here; see my notes on Python's re module. Here are some examples from the book. "abc" matches only "abc". "a(b*)(c*)" matches every string that starts with one a, followed by any number of b's and then any number of c's. "a((b|c)*)" matches every string that starts with one a followed by any sequence of b's and c's.
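These examples can be tried with Python's standard re module, whose syntax includes all of the constructs above. re.fullmatch is used so that the whole string has to match; the test strings are arbitrary.

```python
import re

print(bool(re.fullmatch(r"abc", "abc")))              # True
print(bool(re.fullmatch(r"a(b*)(c*)", "abbbcc")))     # True: one a, then b's, then c's
print(bool(re.fullmatch(r"a(b*)(c*)", "acb")))        # False: a b after a c is not allowed
print(bool(re.fullmatch(r"a((b|c)*)", "acbbc")))      # True: b's and c's in any order

# An escaped special character is treated as an ordinary one.
print(bool(re.fullmatch(r"a\.b", "a.b")), bool(re.fullmatch(r"a\.b", "axb")))   # True False
```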

An algorithm for implementing regular expressions

A real regular expression engine is certainly very complex. Here we take a simplified regular expression language containing a few of the most common special symbols and try to implement it in Python.

The special symbols in this simplified regular expression system are:

. matches any single character

^ anchors the match at the beginning of the target string

$ anchors the match at the end of the target string

* lets the single character before the asterisk repeat zero or more times

Examples of regular expressions in this system: "a*b."; "^ab*c.$"; "aa*bc.*bca"
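The four symbols keep essentially the same meaning in Python's re module, so the example patterns can be tried there to get a feel for what they match. This is only a cross-check with a much richer engine, not the simplified implementation itself; note that in re the dot does not match a newline by default and $ can also match just before a trailing newline. The target strings are arbitrary.

```python
import re

m = re.search(r"a*b.", "xaabcy")
print(m.span())                                        # (1, 5): matches "aabc"

print(bool(re.search(r"^ab*c.$", "abbbcd")))           # True: the whole string fits
print(bool(re.search(r"^ab*c.$", "xabcd")))            # False: must start at position 0

m = re.search(r"aa*bc.*bca", "xaabcxxbcay")
print(m.start())                                       # 1
```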

Consider a simple regular expression matching algorithm: a function match takes the regular expression and the target string as parameters and returns the position in the target string at which a matching substring starts, or -1 if there is none.

    def match(re, text):
        rlen = len(re)
        tlen = len(text)

        def match_at(re, i, text, j):
            """Check whether the text starting at text[j] matches the pattern
            starting at re[i]. The start index i is a parameter rather than
            always 0 because the asterisk handling has to resume matching
            from the middle of the pattern."""
            while True:
                if i == rlen:
                    # the pattern has been matched all the way to its end
                    return True
                if re[i] == "$":
                    # $ must be the last pattern character and the text must be exhausted
                    return i + 1 == rlen and j == tlen
                if i + 1 < rlen and re[i + 1] == "*":
                    # re[i] is the character before the asterisk,
                    # i + 2 is the subscript of the character after it
                    return match_star(re[i], re, i + 2, text, j)
                if j == tlen or (re[i] != '.' and re[i] != text[j]):
                    # either the text is exhausted while pattern remains, or re[i]
                    # is not a wildcard and differs from text[j]: this attempt fails
                    return False
                i, j = i + 1, j + 1

        def match_star(c, re, i, text, j):
            """Handle c*: skip zero or more occurrences of c in text, checking
            after each skip whether the rest of the pattern matches."""
            for n in range(j, tlen + 1):
                if match_at(re, i, text, n):
                    return True
                if n == tlen or (text[n] != c and c != '.'):
                    # text[n] cannot be absorbed by c*, so no longer skip is possible
                    break
            return False

        if re.startswith("^"):
            # an anchored pattern may only match at the start of the text,
            # so matching begins at subscript 1 of the pattern
            return 0 if match_at(re, 1, text, 0) else -1
        for n in range(tlen + 1):
            # try each start position in turn and return the first that matches
            if match_at(re, 0, text, n):
                return n
        return -1

Further regular expression features are better covered together with the notes on the re module, so I will add them there. That is all for now.

