Chapter 11 • Strings

Source: Internet
Author: User

ADT

Definition: A string is a finite sequence of characters drawn from an alphabet.
Data structure: A string can be implemented with a vector or a list.
Features: Compared with a general linear sequence, a string has distinctive characteristics: the number of distinct characters it is built from is very small, while its length is usually several orders of magnitude larger.

Several terms:

    1. The empty string is a substring, prefix, and suffix of every string
    2. Every string is a substring, prefix, and suffix of itself
    3. Substrings, prefixes, and suffixes that are strictly shorter than the original string are called proper substrings, proper prefixes, and proper suffixes

As an ADT, its standard interface is as follows

    1. length() returns the length of string s
    2. charAt(i) returns the character at position i of string s
    3. substr(i, k) returns the substring of s that starts at position i and has length k
    4. prefix(k) returns the prefix of length k
    5. suffix(k) returns the suffix of length k
    6. concat(T) appends another string T to s
    7. equal(T) determines whether string s is equal to another string T
    8. indexOf(P): string matching, a generalization of equal(); determines whether another string P is equal to some substring of s and, if so, where
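
The interface above can be sketched as a thin wrapper around Python's built-in str. This is only an illustration: the class name String, the snake_case method names, and the reading of substr(i, k) as "k characters starting at position i" are assumptions, and index_of simply delegates to str.find rather than to any of the algorithms discussed later.

```python
class String:
    """Illustrative wrapper over str exposing the ADT operations listed above."""

    def __init__(self, chars=""):
        self._s = str(chars)

    def length(self):
        return len(self._s)

    def char_at(self, i):
        return self._s[i]

    def substr(self, i, k):
        # assumed meaning: the k characters starting at position i, s[i, i + k)
        return String(self._s[i:i + k])

    def prefix(self, k):
        return String(self._s[:k])

    def suffix(self, k):
        # the last k characters; k == 0 yields the empty string
        return String(self._s[len(self._s) - k:])

    def concat(self, t):
        return String(self._s + str(t))

    def equal(self, t):
        return self._s == str(t)

    def index_of(self, p):
        # position of the first occurrence of p, or -1 if p is not a substring
        return self._s.find(str(p))

    def __str__(self):
        return self._s
```

For example, String("data structures").index_of("struct") gives the position of the first occurrence, and the same call with a pattern that does not occur gives -1.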

An efficient implementation of indexOf is the focus of this chapter.

String matching

    • What is string matching?
      Determining whether one string (the pattern string P, of length m) is a substring of another string (the text string T, of length n).
    • What can string matching do?

      1. Detection: determine whether P appears in T
      2. Location: find the position where P first appears in T
      3. Counting: count how many times P appears in T
      4. Enumeration: list every position where P appears in T
    • How to evaluate performance
      String matching is unusual in that a randomly chosen pattern almost never matches, so evaluating performance on uniformly random inputs is not informative. Success and failure should be evaluated separately.
      Success: take a random substring of length m from T as P and analyze the average complexity
      Failure: use a random P and measure the average complexity
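
The four uses of string matching listed above can be illustrated with Python's built-in string search. The function names below are hypothetical, and the built-in str.find stands in for whichever matching algorithm is actually used.

```python
def occurrences(T, P):
    """Enumeration: every position i where P occurs in T (overlaps included)."""
    found = []
    i = T.find(P)
    while i != -1:
        found.append(i)
        i = T.find(P, i + 1)  # resume one position to the right of the last hit
    return found


def detect(T, P):
    """Detection: does P appear in T at all?"""
    return P in T


def locate(T, P):
    """Location: position of the first occurrence, or -1 if there is none."""
    return T.find(P)


def count(T, P):
    """Counting: number of (possibly overlapping) occurrences."""
    return len(occurrences(T, P))
```

Note that overlapping occurrences are counted here, which differs from Python's own str.count.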

Brute-force matching

Starting from the first position of the text string, the pattern string is compared with the text character by character. When a comparison fails, the pattern is shifted to the next position of the text string and compared again, until an alignment succeeds.

The code for this algorithm is implemented as follows:
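
A minimal Python sketch of brute-force matching, written with two pointers so that the fallback step is explicit. The function name and the convention of returning -1 on failure are assumptions made for illustration.

```python
def brute_force_match(T, P):
    """Return the first position of P in T, or -1 if P does not occur."""
    n, m = len(T), len(P)
    i = j = 0                  # i indexes the text, j indexes the pattern
    while j < m and i < n:
        if T[i] == P[j]:       # characters agree: advance both pointers
            i += 1
            j += 1
        else:                  # mismatch: text pointer falls back, pattern resets
            i = i - j + 1
            j = 0
    # the loop ends with j == m exactly when the whole pattern was matched
    return i - j if j == m else -1
```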

Analysis of Complexity:

    • In the best case, the first alignment succeeds and the complexity is O(m).
    • In the worst case, every alignment fails only when the last character of the pattern string is compared, so the complexity is O(m * (n - m + 1)); because n >> m, this can be treated as O(n * m).
      The smaller the alphabet or the longer the pattern string, the more likely the worst case is to occur.
    • When the alphabet is large, brute-force matching reaches O(n) efficiency in the average case.
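
The worst-case bound can be checked by counting character comparisons directly. The sketch below is only an illustration; the function name and the test inputs (a text of twenty 0s and the pattern 00001) are assumptions chosen so that every alignment fails at the last character.

```python
def count_comparisons(T, P):
    """Brute-force match that also reports how many character comparisons it made."""
    n, m = len(T), len(P)
    comparisons = 0
    for i in range(n - m + 1):          # try every alignment position
        for j in range(m):
            comparisons += 1
            if T[i + j] != P[j]:
                break                   # mismatch: abandon this alignment
        else:
            return i, comparisons       # no break: the whole pattern matched
    return -1, comparisons


# Worst case: each of the n - m + 1 alignments costs m comparisons.
print(count_comparisons("0" * 20, "00001"))   # (-1, 80), i.e. 5 * (20 - 5 + 1)
```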
KMP algorithm

Let's go back to the brute-force algorithm and see why it is inefficient.

In a typical run, the brute-force algorithm starts each alignment at the first character of the pattern and compares character by character until a mismatch occurs at some position. The time spent on each alignment is proportional to the length of the matched prefix.
In the worst case, a given character of the text string is compared with every character of the pattern string, m times in total.

The worst-case analysis above shows that each alignment needs m comparisons before the mismatch is found at the last position. This can be understood as follows:

Many prefixes of the pattern string match the text string locally, and the cost of the brute-force algorithm is spent mainly on these prefixes.

However, most of these local prefix comparisons are unnecessary, or at least need not be repeated.

Memory • Experience • Predictive power

Since these characters were compared successfully in the previous alignment, we already know everything about them. We can make full use of this information to improve matching efficiency.

Let T[i] and P[j] denote the pair of characters currently being compared.
When this pair mismatches, the brute-force algorithm moves both character pointers back: i = i - j + 1; j = 0, and then resumes comparing from there. However, the fallback is unnecessary.

After the previous round of comparisons we already know that the substring T[i-j, i) matched successfully. In the example considered here, where the matched prefix of P consists of identical characters, the first j-1 comparisons of the next alignment are therefore certain to succeed as well. So i can be left unchanged, j set to j - 1, and the comparison continued.

The next alignment then costs only 1 comparison, saving j-1 comparisons in total.

This process can be understood as: "shift P one position to the right relative to T, then resume comparing at the position of the previous mismatch."
In this way, the information provided by the earlier successful comparisons (memory) not only avoids moving the text pointer back but also lets the pattern string shift as far to the right as possible (experience).

Consider another example, in which T[i] mismatches P[4]. With i kept where it is, by how many positions should P be shifted to the right?

If a match is to be possible at the new alignment, then at least the characters to the left of T[i] must match.

In this example, a right shift of 1 or 2 is futile; only with a right shift of 3 do the characters to the left of T[i] match. So i-1 is the leftmost position that achieves this. We can therefore shift P right by 3 positions directly (equivalently, i remains unchanged and j = 1) and continue comparing.

So how is this shift distance determined?

Lookup table

The first thing to make clear is that this shift distance (equivalently, the pattern character that will face T[i] after the mismatch) depends only on the pattern string, not on the text string.

At a mismatch, the text string can be divided into four parts: the prefix, the substring that has already been matched, the mismatched character, and the suffix. The prefix and suffix clearly have no effect on the shift distance. The matched substring does have an effect, but it is identical to the corresponding part of the pattern string, so it can be treated as part of the pattern string.

The replacement character, in turn, depends only on the pattern string and on the position j at which P[j] mismatched. In a pattern string of length m, there are at most m possibilities for P[j].

The core of the KMP algorithm is to process all m cases in advance and collect the results in a lookup table. Once a mismatch occurs at some P[j], we simply look up the table to find the character that replaces P[j]. This strategy is therefore less a matter of powerful memory than of having a plan prepared in advance for every situation.

Based on the previous analysis, the KMP algorithm (version 1) is obtained:
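
A minimal Python sketch of this first version. The function name kmp_match and the choice to pass the lookup table in as an argument (called nxt here) are assumptions made for illustration; the table used in the usage line was worked out by hand for the pattern "REGRET".

```python
def kmp_match(T, P, nxt):
    """Return the first position of P in T, or -1; nxt is the lookup table for P."""
    n, m = len(T), len(P)
    i = j = 0                          # i indexes the text, j indexes the pattern
    while j < m and i < n:
        if j < 0 or T[i] == P[j]:      # matched (or j fell onto the sentinel at -1)
            i += 1
            j += 1
        else:                          # mismatch: i stays, only the pattern shifts
            j = nxt[j]
    return i - j if j == m else -1


print(kmp_match("REGREGRET", "REGRET", [-1, 0, 0, 0, 1, 2]))   # 3
```

Unlike the brute-force version, the text pointer i never moves backwards.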

As you can see, this algorithm is much the same as the brute-force algorithm except for the else branch, which handles a mismatch: KMP looks up the next pattern character in the table and aligns it with the mismatched text character T[i]. The other difference is the extra condition "j < 0", which is explained later.

The principle of the next table

The analysis so far shows how powerful the next table is. But how is the table produced, and on what principle?
Let's look at the necessary condition for shifting the pattern string to the right.

Take a typical mismatch scenario again.
Suppose T[i] and P[j] mismatch and we shift the pattern string to the right so that P[t] is aligned with T[i] and matching continues. Then the prefix P[0, t) of the shifted pattern must match T[i-t, i), which is the same as P[j-t, j). P[0, t) and P[j-t, j) are, respectively, a proper prefix and a proper suffix of P[0, j).

In other words, the necessary condition is that within the prefix P[0, j) of the pattern string, P[0, t) == P[j-t, j). All values of t that satisfy this relation form a set N(P, j).

So, once a mismatch occurs, we can take a t directly from N(P, j) and continue matching. The t we take is the maximum of all t in the set. As mentioned earlier under "Memory • Experience • Predictive power":

In this example, a right shift of 1 or 2 is futile; only with a right shift of 3 do the characters to the left of T[i] match. So i-1 is the leftmost position that achieves this.

The leftmost position corresponds to the largest t, that is, to the smallest shift distance. Choosing it guarantees that no possible match is missed. Conversely, every alignment that KMP skips has been proven unable to succeed.
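
The set N(P, j) and the choice of its maximum can be computed directly from the definition. The sketch below is deliberately naive and is meant only to make the definition concrete; the function names are hypothetical, the pattern "REGRET" is an illustrative choice, and -1 is used as a placeholder when the set is empty.

```python
def candidate_set(P, j):
    """N(P, j): every t in [0, j) with P[0, t) == P[j - t, j)."""
    return [t for t in range(j) if P[:t] == P[j - t:j]]


def next_by_definition(P):
    """next[j] = max N(P, j); -1 stands in for the empty set at j == 0."""
    return [max(candidate_set(P, j), default=-1) for j in range(len(P))]


print(candidate_set("REGRET", 4))      # [0, 1]
print(next_by_definition("REGRET"))    # [-1, 0, 0, 0, 1, 2]
```

With this pattern, a mismatch at P[4] gives t = 1, so the pattern shifts right by j - t = 3 positions and comparison resumes with j = 1.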

The wildcard sentinel

Can the set N(P, j) be empty?
In fact, as long as j > 0, the set contains at least the element t == 0. If j == 0, the prefix P[0, j) is empty, meaning that the mismatch occurred at the very first character of the pattern string; in this case the set is empty.

To handle this case, we introduce a "wildcard sentinel": the first entry of the next table is set to -1. This is equivalent to adding a dummy character to the left of the pattern string (at rank -1) that matches every character.

    • In the KMP algorithm (version 1) above, the condition of the match-success branch has an extra clause, j < 0, which indicates a successful match against the sentinel; the text pointer and the pattern pointer then both advance by one character.
    • This is really a case of match failure, but with the sentinel it is logically turned into a match success, and the outcome is exactly the same.

Introducing and placing sentinels is thus a very clever technique in program and algorithm design, and KMP is a typical example. Its ingenuity shows in two ways. First, in the code, it makes the description of the algorithm more concise. Second, by setting up an imaginary model, it makes our understanding of the algorithm more unified and deeper. Many well-known physicists, Galileo among them, were adept at carrying out so-called thought experiments in their minds. This kind of imagined model in computer science is similar to the thought experiment in physics.

Construction of the next table

Constructing the next table is equivalent to matching the pattern string against itself, so its implementation needs only a slight modification of the KMP algorithm.
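
A minimal Python sketch of this self-matching construction. The function name build_next is an assumption; the loop has the same shape as the KMP main loop, with the pattern playing the role of both text and pattern.

```python
def build_next(P):
    """Build the next table of pattern P in O(m) time by matching P against itself."""
    m = len(P)
    nxt = [-1] * m             # nxt[0] = -1 is the wildcard sentinel
    t, j = -1, 0               # j scans P as the "text", t scans P as the "pattern"
    while j < m - 1:
        if t < 0 or P[j] == P[t]:
            j += 1
            t += 1
            nxt[j] = t         # P[0, t) == P[j - t, j) is the longest such pair
        else:
            t = nxt[t]         # fall back within the pattern, exactly as KMP does
    return nxt


print(build_next("chinchilla"))   # [-1, 0, 0, 0, 0, 1, 2, 3, 0, 0]
```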

Performance of the KMP algorithm

Here we use the "observation variable" method. Associate with the loop of the KMP algorithm a variable k that strictly increases with every iteration. Once an upper bound on k is known, it bounds the number of iterations and hence the complexity.

Because k is O(n), the matching loop of the KMP algorithm runs in O(n) time. Constructing the next table uses the same algorithm as KMP itself, applied to the pattern, so its complexity is O(m). Therefore, for a text string of length n and a pattern string of length m, the complexity of the KMP algorithm is O(n+m).
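
One way to make the argument concrete is sketched below. The specific choice k = 2i - j is an assumption (a common choice for this proof); i and j are the text and pattern pointers of the KMP loop, and next is the lookup table.

```latex
k = 2i - j \qquad (\text{initially } i = j = 0,\ \text{so } k = 0)

\text{match branch: } i \leftarrow i + 1,\; j \leftarrow j + 1
  \;\Longrightarrow\; \Delta k = 2 - 1 = 1

\text{mismatch branch: } j \leftarrow \mathrm{next}[j] \le j - 1
  \;\Longrightarrow\; \Delta k = j - \mathrm{next}[j] \ge 1

\text{at termination: } i \le n,\; j \ge -1
  \;\Longrightarrow\; k = 2i - j \le 2n + 1 = O(n)
```

Since k starts at 0, grows by at least 1 per iteration, and never exceeds 2n + 1, the loop runs at most 2n + 1 times.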

Improvement of KMP algorithm

Although the KMP algorithm achieves O(n) efficiency, it still has an obvious flaw in some cases.

Consider a case in which the same text character fails four consecutive comparisons against the pattern string. Only the first failure is necessary; the other three are not. Once the first comparison has shown that the pattern character 0 does not match the text character 1, the three 0s that precede it in the pattern cannot match that 1 either. Since we already hold this information about the pattern string (the next table), we should use it to avoid the unnecessary comparisons. To do so, the next table needs to be revised.

The next table is not only the embodiment of the algorithm's strategy but also the concrete carrier of the information contained in each pattern string. It concisely captures the essential characteristics of the pattern string.

Revising the construction logic is simple: when determining each entry of the next table, in addition to checking the self-similarity of the pattern, we also check a "dissimilarity" condition: the replacement character must differ from the character that just mismatched.

If self-similarity is how the pattern string exploits the "experience" of earlier successful comparisons, then dissimilarity is the "lesson" drawn from the earlier failed comparison.

The improved version of the next table build algorithm is thus available:

In the match-success branch of the construction, a condition is added: if P[j] and P[t] are the same, N[j] is assigned N[t]; otherwise it is assigned t as before.
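
A minimal Python sketch of the improved construction, assuming the same loop structure as the basic version. The function name is hypothetical, and the pattern "000010" is an illustrative choice that exhibits the repeated-0 situation described above.

```python
def build_next_improved(P):
    """Next table with the extra rule: never retry a character equal to the one that failed."""
    m = len(P)
    nxt = [-1] * m
    t, j = -1, 0
    while j < m - 1:
        if t < 0 or P[j] == P[t]:
            j += 1
            t += 1
            # if P[j] == P[t], retrying P[t] would fail for the same reason,
            # so reuse the already-computed entry for t instead
            nxt[j] = t if P[j] != P[t] else nxt[t]
        else:
            t = nxt[t]
    return nxt


print(build_next_improved("000010"))   # [-1, -1, -1, -1, 3, -1]
```

For comparison, the basic construction gives [-1, 0, 1, 2, 3, 0] for the same pattern, which is what causes the three redundant comparisons.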

Finally, we return to the comparison between the brute-force algorithm and the KMP algorithm. Although the brute-force algorithm is inefficient in the worst case, its average efficiency is close to O(n) when the alphabet is large enough. In other words, KMP has little advantage in that case. In fact, KMP is typically used for matching binary strings (alphabet size 2).

BM algorithm

String matching is a local test made up of several character comparisons: a match succeeds only if every pair of characters is equal. Conversely, as soon as one pair of unequal characters is found, we can immediately declare a mismatch. In terms of computing cost, then, establishing that two strings are equal is not the same as establishing that they are unequal. The BM algorithm makes full use of this asymmetry to further improve matching efficiency. It employs two strategies: bad characters and good suffixes.

Bad-character strategy

KMP cleverly rules out a large number of alignment positions and so saves much computation. However, for ruling out an alignment position the successful comparisons do not matter; what actually does the work is the failed comparison. We should therefore hope for such failures to occur more often and earlier. In the extreme, we might perform only these failing comparisons to rule out the corresponding alignment positions.

Our goal is not so much to speed up the match as to accelerate the failure.

Each failed comparison teaches us a "lesson", but mismatches at different positions are not equally valuable. The further toward the end of the pattern the mismatch occurs, the more valuable it is, because it lets us rule out more alignment positions.

Let's look at an example of this matching strategy:

When a mismatch occurs against the text character "Tao", we look up where "Tao" occurs in the pattern string and shift the pattern so that this occurrence is aligned with the text character. If the character does not occur in the pattern, the pattern string is shifted past it entirely. In total, only 8 comparisons are made, fewer than the length of the text string (12).

BC table

In a typical step of the BM algorithm, when P[j] mismatches the text character 'X', we need to find the nearest 'X' in the prefix of the pattern string and shift it to the mismatch position j. The displacement depends only on the mismatch position j and on the rank of 'X' in the pattern string, so we can build a BC table and compute all displacements in advance. As with the KMP algorithm, we add a "wildcard sentinel" to the left of the first character of the pattern string to handle the case where 'X' does not occur in the pattern. In addition, if the rank of 'X' in the pattern string lies to the right of j, the displacement would be negative; we obviously should not move the pattern string to the left, so in this case we simply shift it right by one position.

The core of the KMP algorithm is the next table; similarly, the core of the BM algorithm is the BC table and the GS table.

So how is the BC table built?

The construction is simple: record the rank of every character that occurs in the pattern string, keeping the last rank when a character occurs more than once. The space complexity is the length of the BC table, that is, the alphabet size, O(s). The time complexity is O(s+m). In fact, the first for loop (which initializes the table) can be omitted, leaving a time complexity of O(m).

Using the BC table alone, the BM algorithm achieves O(n/m) efficiency in the best case, but its worst-case efficiency is O(n*m).
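
A minimal Python sketch of BM with the bad-character strategy only. The function names are hypothetical; a dictionary with a default of -1 plays the role of the BC table and its sentinel, which is one way of skipping the O(s) initialization loop mentioned above.

```python
def build_bc(P):
    """BC table: for each character of P, the rank of its last occurrence."""
    bc = {}
    for j, ch in enumerate(P):
        bc[ch] = j                     # later occurrences overwrite earlier ones
    return bc


def bm_bc_match(T, P):
    """BM using only bad-character shifts; returns first position of P in T, or -1."""
    n, m = len(T), len(P)
    bc = build_bc(P)
    i = 0                              # current alignment: P[0] faces T[i]
    while i + m <= n:
        j = m - 1
        while j >= 0 and P[j] == T[i + j]:
            j -= 1                     # compare from right to left
        if j < 0:
            return i                   # every character matched
        # align the last occurrence of the bad character with T[i + j];
        # absent characters map to -1 (the sentinel), and a negative
        # displacement is replaced by a shift of one position
        i += max(1, j - bc.get(T[i + j], -1))
    return -1
```

In the best case each alignment is dismissed after a single comparison and the pattern jumps m positions, which is where the O(n/m) bound comes from.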

Good-suffix strategy

The bad-character strategy makes good use of the "lessons" of mismatches, but to improve performance fully the "experience" of successful comparisons must be used as well. This "experience" is very similar to the idea behind the KMP algorithm.

When the pattern string mismatches at position j, the suffix P(j, m) has already been matched successfully. After the pattern string P is shifted, the substring P(k, k+m-j) that takes its place must still match that part of the text string, and the new character at position k should at least differ from the mismatched character 'Y' (as in the improved KMP algorithm). This too depends only on j and P, so a GS table can be built in advance.

GS table

Two concepts are introduced: the MS table and the SS table.

MS[j] denotes the longest substring that, among all substrings with P[j] as their last character, matches a suffix of the pattern string P. For example, when j is 8, MS[8] is "RICE"; when j is 3, MS[3] is "ICE". SS[j] is the length of the substring MS[j]. Building the GS table thus comes down to building the SS table.

The construction algorithm is as follows:


Although the GS table construction contains a doubly nested loop, the overall time complexity is O(m), because the cumulative number of inner-loop iterations cannot exceed the range of the variable lo (i). (For the proof, see Deng Junhui, Data Structure Exercise Analysis (C++ Language Edition), Tsinghua University Press, September 2013, ISBN 7-302-33065-3, p. 215.)

Performance analysis and comparison

    • The space complexity of the BM algorithm is O(s+m) (BC table + GS table)
    • Building the BC and GS tables takes O(s+m) time
    • Search takes O(n/m) in the best case and O(n+m) in the worst case (similar to KMP)

A comprehensive comparison of string matching algorithms

The vertical axis, from low to high, marks n/m, n, and n*m; Pr is inversely proportional to the size of the alphabet, that is, 1/s. As you can see, the larger the alphabet, the closer brute-force matching (BF) comes to linear efficiency; the smaller the alphabet, the closer it comes to n*m. The KMP algorithm stays at a linear level regardless of the size of the alphabet. The BM algorithm with the BC strategy alone takes full advantage of a large alphabet, but its disadvantage with a small alphabet is just as obvious. Finally, the BM algorithm that combines the BC and GS strategies keeps the best-case efficiency while its worst-case efficiency does not exceed the linear level.

This shows that the BM algorithm is very well suited to string matching over large alphabets.

Karp-Rabin algorithm

The Karp-Rabin algorithm is an "alternative" kind of string matching algorithm: it converts strings into numbers and compares the numbers directly. Since comparing two numbers takes constant time, the overall efficiency can reach O(n).

"All things are numbers" (Pythagoras)
Numbers are the origin of all things: this is a belief many people hold. One of its most steadfast believers, Gödel, was also its most skillful practitioner. To prove his incompleteness theorems, he invented a simple yet powerful numbering scheme that identifies almost every component of a logical system with a natural number.
Extended reading: Visiting Leibniz: Collision with Master through time and space

We can prove that every vector of natural numbers corresponds to exactly one natural number (its fingerprint), and that the conversion is reversible.

So for strings, we can number each character of the alphabet, and each string can then be represented as a natural number in base s. For example, each English word corresponds to a natural number in base 26.
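
As a small illustration of this numbering, the sketch below maps a lowercase English word to a base-26 number and back. The function names and the digit assignment (a = 0, ..., z = 25) are assumptions; with this assignment the mapping is one-to-one among words of the same length, so the length is passed to the decoder.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"    # assumed digit assignment: a = 0, ..., z = 25


def fingerprint(word):
    """Read the word as a base-26 number, most significant character first."""
    value = 0
    for ch in word:
        value = value * len(ALPHABET) + ALPHABET.index(ch)
    return value


def decode(value, length):
    """Invert fingerprint() for a word of the given length."""
    chars = []
    for _ in range(length):
        value, digit = divmod(value, len(ALPHABET))
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))


print(fingerprint("cab"))   # 1353 = 2 * 26**2 + 0 * 26 + 1
print(decode(1353, 3))      # cab
```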

However, when the alphabet is large and the pattern string is long, the fingerprint of the pattern string P becomes very large; it may exceed 64 bits, which makes it hard to store. Moreover, when fingerprints are that large, computing and comparing them can no longer be regarded as constant-time operations.

The remedy is to compress the fingerprint with a hash function.

Another problem is that computing hash() takes O(|P|) time each time, so the Karp-Rabin algorithm would still need O(n*m) time overall, which is clearly unacceptable. The remedy is fast fingerprint computation: adjacent fingerprints are correlated, and using this correlation the next fingerprint can be obtained from the previous one in O(1) time. (For the proof, see Deng Junhui, Data Structures (C++ Language Edition), 3rd edition, Tsinghua University Press, September 2013, ISBN 7-302-33064-6, p. 330.)
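
A minimal Python sketch of Karp-Rabin matching with a rolling fingerprint. The function name, the base 256, and the modulus 1,000,000,007 are illustrative assumptions, not values taken from the text; because different substrings can share a hash value, a hash hit is confirmed by a direct comparison.

```python
def karp_rabin_match(T, P, base=256, modulus=1_000_000_007):
    """Return the first position of P in T, or -1, using a rolling hash."""
    n, m = len(T), len(P)
    if m > n:
        return -1
    high = pow(base, max(m - 1, 0), modulus)   # weight of the leading character
    hp = ht = 0
    for k in range(m):                         # fingerprints of P and of T[0, m)
        hp = (hp * base + ord(P[k])) % modulus
        ht = (ht * base + ord(T[k])) % modulus
    for i in range(n - m + 1):
        if hp == ht and T[i:i + m] == P:       # verify to rule out a hash collision
            return i
        if i + m < n:
            # O(1) update: drop T[i], shift by one digit, append T[i + m]
            ht = ((ht - ord(T[i]) * high) * base + ord(T[i + m])) % modulus
    return -1
```

Each window's fingerprint is derived from the previous one in constant time, so apart from the verification of hash hits the scan costs O(n).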

Summary

    1. The chapter first introduces how strings are implemented and what characterizes them, defines the abstract interface, and introduces several related terms. Because the string structure itself is relatively simple, the chapter focuses on string matching algorithms.
    2. It presents the idea behind brute-force matching and analyzes its efficiency, which is lower when the alphabet is small.
    3. By analyzing why brute-force matching is inefficient, it arrives at the KMP algorithm, which uses the next table to save comparisons and achieves linear matching time.
    4. KMP is still not the most efficient option: with a large alphabet, brute-force matching is also linear on average. The BM algorithm is designed for large alphabets, where most comparisons fail, and uses the "lessons" of failed comparisons to reach a best-case efficiency of O(n/m).
    5. Finally, it introduces the Karp-Rabin algorithm, which approaches matching from a different angle and can also achieve linear efficiency.

Note:
Some material referred to in the article appears to be missing; it is added below:
Analysis of Complexity:
Http://7xt4i9.com1.z0.glb.clouddn.com/16-5-15/3095954.jpg
Fast fingerprint updating algorithm
Http://7xt4i9.com1.z0.glb.clouddn.com/16-5-15/98400059.jpg
Http://7xt4i9.com1.z0.glb.clouddn.com/16-5-15/84329949.jpg
