Optimizing Regular Expressions (Sanjiang ferry)

As with the comprehensive MySQL optimization covered in an earlier post, it pays to build a thorough optimization strategy for tools that are used everywhere: whether you program in PHP, Perl, Python, C++, C#, Java, or anything else, you are very likely to use tools such as MySQL and regular expressions.

First, some background on regular expressions that you may not know; it will be useful for the optimizations later.
The familiar grep (global regular expression print) marks the origin of today's regular expressions: the concept was proposed by neuroscientists, formalized by mathematicians, used inside IBM without wide adoption, and finally became widespread through the standalone grep tool. Regular expressions under POSIX (Portable Operating System Interface) are divided into two flavors: BREs (basic regular expressions) and EREs (extended regular expressions). A POSIX program must support one of the two, and each has its own characteristics worth understanding. For details, see the section "Comparison of the three common regular expression flavors" in the earlier article on the Shell Scripting Learning Guide.

Regular expression matching engines fall into two broad categories: the DFA (deterministic finite automaton) and the NFA (non-deterministic finite automaton). Compiler-theory references and wikis cover the underlying principles in depth. In practice, regex engines are usually divided into three types:

1. DFA engines run in linear time because they never need to backtrack (and therefore never test the same character twice). A DFA engine also guarantees that the longest possible string is matched. However, because a DFA contains only a finite set of states, it cannot match a pattern containing backreferences, and because it does not construct an explicit expansion, it cannot capture subexpressions.
2. Traditional NFA engines run a so-called "greedy" backtracking algorithm, testing all possible expansions of the regular expression in a specific order and accepting the first match found. Because a traditional NFA constructs a specific expansion of the regex to achieve a successful match, it can capture subexpressions and match backreferences. However, because it backtracks, a traditional NFA can visit exactly the same state multiple times (when the state is reached through different paths), so in the worst case it can be extremely slow. And because it accepts the first match it finds, it can also leave other (possibly longer) matches undiscovered.
3. POSIX NFA engines work like traditional NFA engines, except that they keep backtracking until they can guarantee that the longest possible match has been found. As a result, a POSIX NFA engine is slower than a traditional NFA engine, and you cannot make it prefer a shorter match over a longer one by changing the order in which it explores alternatives.

Based on these engine differences, we can summarize two universal rules:
1. The match that starts leftmost wins.
2. The standard quantifiers (* + ? {n,m}) are greedy: they attempt to match first.

Here is a simple optimization example:
Suppose '.*[0-9][0-9]' is matched against the string "ABCD12EFGHIJKLMNOPQRSTUVW". First '.*' matches the entire line, but then the two digits cannot match, so '.*' gives back the character 'W'; the digits still fail, so it gives back 'V', and so on. When it has given back everything up to '2', one digit matches but two digits still cannot, so it continues and gives back '1' as well, and only then does the whole pattern match. Once we understand this behavior, we should avoid triggering it: if we just want two digits somewhere in a string, match two digits directly and do not write the '.*' wildcard. Studying optimization really means studying the underlying implementation, because optimizing is about making the tool produce the desired result as cheaply as possible; if you do not understand how a tool works underneath, you cannot know which tool is efficient in which situation.
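A minimal sketch of this example using Python's `re` module (a traditional NFA engine). Both patterns find the same digits, but the second avoids the swallow-then-give-back dance entirely, because `re.search` slides the starting position forward itself:

```python
import re

s = "ABCD12EFGHIJKLMNOPQRSTUVW"

# With a leading '.*' the engine first swallows the whole line, then
# gives characters back one at a time until the two digits can match.
m1 = re.search(r'.*([0-9][0-9])', s)

# Without the wildcard, the engine simply advances the start position
# until two digits match -- no give-back phase.
m2 = re.search(r'[0-9][0-9]', s)

print(m1.group(1))  # '12'
print(m2.group())   # '12'
```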

From the introduction to DFAs and NFAs above, we know the difference between them: simply put, an NFA is an expression-driven engine, while a DFA is a text-driven engine. In general, a DFA engine examines each character of the target string only once, while an NFA may backtrack, so the text-driven DFA is usually faster than the expression-driven NFA.

A careful reader may already see that avoiding NFA backtracking as much as possible is the central problem when optimizing for an NFA engine. Whether to use lazy (ignore-priority) matching is another point worth optimizing. Lazy matching deserves a brief explanation, since it relates to the universal rules above: when the engine reaches a position where it can either attempt a match or skip it, a greedy quantifier attempts the match first, while a lazy quantifier skips first. The following examples show how these two behaviors work and why they matter for optimization:

Using ab?c to match abc, the program flow looks like this:
First 'a' matches without any problem. Then, at 'b?', the engine must decide whether or not to match 'b'; quantifiers are greedy by default, so it attempts the match first and stores the other choice as a saved (backtracking) state. 'b' matches, then 'c' matches successfully, the program ends, and the saved state is discarded.

If ab?c is instead used to match 'ac', the engine tries to match 'b' when it reaches 'b?'; when that fails, it backtracks, that is, returns to the last saved good state, then continues by matching 'c' and ends successfully. This process is backtracking, and anyone who has studied algorithms will find it familiar: like a last-in-first-out stack, it always returns to the most recent valid state first.

Now consider lazy matching: using ab??c to match 'ac', the engine first matches 'a' and then reaches 'b??'. This time it gives up quantifier priority, skips the 'b' and tries 'c' first, and the match ends successfully without any of the earlier backtracking.

Finally, look at an unsuccessful match: let ab?x try to match 'abc'. The engine matches 'a', then tries 'b', then tries 'x' against 'c' and fails; it backtracks to the state where 'b' was skipped and tries 'x' against 'b', which still fails. It then moves the starting position to 'b' and tries again, fails, tries starting at 'c', and finally reports that no match is possible.
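The three walkthroughs above can be condensed into a few lines of Python's `re` (a traditional NFA), as a quick sanity check of greedy versus lazy behavior:

```python
import re

# Greedy 'b?' tries to match 'b' first; lazy 'b??' tries to skip it first.
assert re.fullmatch(r'ab?c', 'abc').group() == 'abc'  # greedy: 'b' is matched
assert re.fullmatch(r'ab?c', 'ac').group() == 'ac'    # greedy: backtracks, skips 'b'
assert re.fullmatch(r'ab??c', 'ac').group() == 'ac'   # lazy: skips 'b', no backtracking

# A doomed pattern: every alternative and start position is tried
# before failure is reported.
assert re.search(r'ab?x', 'abc') is None
```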

In general, the patterns you write need care on two points: avoiding backtracking, and deciding where to work around the greedy-by-default rule. The examples above are very simple, but avoiding backtracking can reduce the time complexity of a match from quadratic O(n²) to linear O(n), in the ideal case, of course.

The backtracking of * and + works like the process above. x*, for example, can be thought of as x?x?x?x?... or as (x(x(x...)?)?)?. Imagine this nesting many levels deep; a single failed x then forces backtracking up through the layers. Likewise, matching an expression such as .*[0-9] first lets .* match the entire string and then backtracks, giving characters back one at a time to test whether each is a digit, when it could simply have matched one digit directly. So, as mentioned above, do not write wildcards like .* unless they are really necessary.

In addition, a DFA does not support lazy matching, only greedy matching. The greedy nature of .* is easy to overlook, and careless use produces results you may not want. For example, suppose we want to match the contents of parentheses with \((.*)\) and the target string is ABCD(AAAA)EFG(GGG)H. By the nature of .*, the match runs from the first '(' to the end of the line, then gives back one character at a time until a ')' can match. Here the problem appears: the final capture is AAAA)EFG(GGG, which is not the result we expect. The regex we actually need is \(([^()]*)\). This kind of mistake needs particular care with HTML tags. Take <b>123</b>456<b>789</b>: matching with <b>.*</b> in order to do a replacement goes wrong for the same reason, and the negated-class trick breaks down once tags nest, as in <b>123<b>456</b>. One workable approach is to give up greedy matching: <b>.*?</b> makes .* yield priority, so the engine tries </b> first at each position and lets .*? absorb a character only when </b> fails. You may notice there is still a problem: against <b>123<b>456</b>, this match returns <b>123<b>456</b>, while what we expect is <b>456</b>. A better solution uses the regex lookaround features, which are worth studying separately.
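The parenthesis and tag examples above can be reproduced with Python's `re`; the target strings are the ones from the text, lowercased into valid patterns:

```python
import re

s = 'ABCD(AAAA)EFG(GGG)H'

# Greedy '.*' runs to the end of the line and gives characters back,
# so it pairs the first '(' with the *last* ')'.
print(re.findall(r'\((.*)\)', s))      # ['AAAA)EFG(GGG']

# A negated class can never cross a ')', so each pair matches separately.
print(re.findall(r'\(([^()]*)\)', s))  # ['AAAA', 'GGG']

# The same idea for tags: a lazy quantifier stops at the first '</b>'.
html = '<b>123</b>456<b>789</b>'
print(re.findall(r'<b>(.*?)</b>', html))  # ['123', '789']
```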

The above covers the attempt-first (greedy) and skip-first (lazy) modes; used properly, both help regex optimization. There is a third mode: the atomic group, (?>expression). An atomic group matches exactly like a normal group, but once the expression inside it succeeds, the result is locked in and all saved states within it are discarded. Consider an example: let \w+: try to match HelloWorld. We can see at a glance that this cannot match, since the string contains no colon, but for comparison with the atomic version, here is the process: first \w+ matches to the end of the string, then ':' is tried against nothing and fails, so \w+ gives back 'd' and ':' is tried there, then 'l', then 'r', and so on all the way back to 'H' before failure is finally reported. With the atomic group (?>\w+): against HelloWorld, the process is: match to the end of the line, find that the colon cannot match, and report failure immediately. Since \w can never match a symbol, whatever \w+ consumes can never be a colon, so there is no point keeping the saved states \w+ generates; the atomic group eliminates those states and the backtracking they would cause. If you want to try this, make sure your regex tool supports atomic groups.
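A sketch of this example in Python. Note an assumption: Python's `re` only gained native `(?>...)` syntax in version 3.11, so the classic portable emulation `(?=(pattern))\1` is used here instead: the lookahead captures the full `\w+` run (a lookahead leaves no backtracking states behind once it succeeds), and the backreference consumes exactly that text:

```python
import re

target = 'HelloWorld'

# Plain '\w+:' backtracks: \w+ grabs 'HelloWorld', then gives back one
# character at a time, retrying ':' at every position before failing.
assert re.search(r'\w+:', target) is None

# Atomic-group emulation that works on any Python version: if ':' fails
# after \1, there are no saved states inside the group to return to.
atomic = r'(?=(\w+))\1:'
assert re.search(atomic, target) is None

# Both patterns still succeed when a colon is actually present.
assert re.search(atomic, 'ab:cd').group() == 'ab:'
```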

There are also possessive quantifiers: ?+, *+, ++, {m,n}+. With these, the quantifier still matches greedily, but unlike a normal greedy quantifier it never gives anything back; that is, the saved states generated while the quantifier matched are discarded.

For a multi-branch alternation such as a|b|c, a traditional NFA matches the branches in order and exhausts the saved states of each branch in turn. This sequential behavior suggests an optimization: put the branch most likely to succeed as early as possible.

Most of what has been said relates to the NFA, and most regex optimization work targets NFA engines. In the precompilation phase, DFA and NFA engines each convert the regular expression into the internal form their own algorithm needs; the DFA needs more memory and compiles more slowly than the NFA, but once the actual match runs, the DFA is faster, and a badly written regex can even leave an NFA in the awkward position of never finishing the match at all. The reason the NFA nevertheless remains mainstream is that it provides functionality a DFA cannot: for example, none of the matching modes described above are available in a DFA.

NFA and DFA engines are not mutually exclusive: some tools ship both, giving themselves the DFA's efficiency and the NFA's expressiveness. GNU grep and GNU awk, for example, use an efficient DFA engine for ordinary matching tasks and switch to an NFA engine when features the DFA cannot provide are required.

That was a somewhat dense tour of the DFA and NFA regex engines along with some optimization examples. We have also seen that for a DFA engine there are not many optimization strategies beyond writing the expression as precisely as possible, with as few match attempts as possible. Regular expressions on NFA engines leave much more room, but there you must also know whether your tool is built on a traditional NFA or a POSIX NFA; some problems exist only for one engine and matter little for the other.

Avoid backtracking, and above all avoid exponentially growing backtracking. Take the expression ([^/]+)*: for each character it must decide whether the character belongs to the + quantifier or to the * quantifier, so failing to match a string of length 10 requires 1023 backtracks (the first attempt is not a backtrack); that is growth on the order of 2^n. If the string grows to 20, there are over a million possibilities, often taking seconds; at 30, over a billion, and the match can run for hours; past 40, you would wait more than a year. This behavior actually gives us a way to identify which kind of engine a tool uses:
1. If an expression that cannot match still returns a result quickly, it is probably a DFA (only probably).
2. If an expression that can match returns its result quickly, it is a traditional NFA.
3. If it is always very slow, it is a POSIX NFA.

The first test says only "probably", because an NFA with advanced optimizations can also return results quickly.
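A minimal sketch of the exponential blow-up, using Python's `re` (a traditional NFA). The pattern `^(a+)+b$` is a stand-in with the same nested-quantifier shape as ([^/]+)*, chosen so that failure is guaranteed; the subject is kept short on purpose, since each extra character roughly doubles the work:

```python
import re
import time

# '(a+)+' can partition a run of 'a's in exponentially many ways; when
# the trailing 'b' never matches, the NFA must try them all.
pattern = re.compile(r'^(a+)+b$')

for n in (8, 12, 16):
    subject = 'a' * n + 'c'          # guaranteed failure
    start = time.perf_counter()
    assert pattern.match(subject) is None
    print(n, round(time.perf_counter() - start, 4))
# The printed time grows roughly as 2^n -- keep n small, or a single
# call can run for minutes.
```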

Another point: the backtracking cost of an alternation is high. Compare a|b|c|d|e|f with the character class [a-f]: the class needs only a simple test, while the alternation leaves six saved states to backtrack through at every position.

Many regex compilers now perform optimizations you are not aware of, but common-sense optimizations are always worthwhile, provided you note which of them your tool already handles:

1. If your regex tool supports it, use non-capturing parentheses when you do not need to refer back to the bracketed text: (?:expression).
2. If parentheses are not necessary, do not add them.
3. Do not abuse character classes: for a single character such as [.], just write \..
4. Use the anchors ^ and $; they speed up locating the match.
5. Expose required literal text from quantifiers, e.g. write x+ as xx*, and a{2,4} as aaa{0,2}.
6. Factor the common start out of an alternation, e.g. the|this into th(?:e|is) (if your regex engine does not support that syntax, use th(e|is)). Anchors in particular must stand alone, since many regex compilers apply special anchor optimizations: change ^123|^abc to ^(?:123|abc), and likewise keep $ independent where possible.
7. Distribute an expression that follows an alternation into the alternation itself, so a failure in any branch is detected without first exiting the alternation. Use this optimization with caution.
8. Choosing between lazy and greedy matching depends on your situation. If unsure, use greedy matching; it is faster than lazy matching.
9. Splitting a large regular expression into several small ones helps efficiency considerably.
10. Simulate an anchor: use a suitable lookahead to check the viable starting position first. To match the 12 months, the first character can be tested up front: (?=[jfmasond])(?:jan|feb|...|dec). Apply this according to the actual situation; sometimes the overhead of the lookaround outweighs the gain.
11. In many cases, atomic groups and possessive quantifiers can greatly improve speed.
12. Avoid near-endless matches like (this|that)*. The (...+)* construction mentioned above is similar.
13. If a cheap preliminary match can shorten the target string significantly, running several regex passes in sequence can be very effective in practice.
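A few of these tips can be checked directly in Python's `re`; tips 1, 3, and 6 are illustrated below (the sample strings are invented for the demonstration):

```python
import re

# Tip 6: factor the common head out of an alternation. Both forms match
# exactly the same strings; the factored one abandons non-'th' positions
# after a single failed test instead of two.
slow = re.compile(r'the|this')
fast = re.compile(r'th(?:e|is)')
for text in ('see the cat', 'this one', 'throb'):
    a, b = slow.search(text), fast.search(text)
    assert (a is None) == (b is None)
    if a:
        assert a.group() == b.group()

# Tip 3: for one literal character, \. does the job of [.] without
# invoking the character-class machinery.
assert re.search(r'\.', 'a.b').group() == '.'

# Tip 1: (?:...) groups without the bookkeeping cost of a capture.
assert re.fullmatch(r'(?:ab)+', 'ababab') is not None
```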
PS: The wording here may be a bit muddled. The main reference is Mastering Regular Expressions (3rd edition) by Jeffrey E. F. Friedl; I also read two other people's blog posts, whose content is essentially covered here.
