Mastering regular expressions (the regex engine)

Last Update:2018-12-06 Source: Internet

Author: User

Tags: posix, expression, engine


Earlier material described the usage of some simple metacharacters commonly used in regular expressions. However, unless you understand the core of how regular expressions match, you will never make a real breakthrough in this area.

This article focuses on the core of regular expressions: the regex engine.

　　3. Regular Expression Engine

Regex engines fall into two main categories: DFA (deterministic finite automaton) and NFA (nondeterministic finite automaton). Both have a long history, but NFA has the longer one. Tools that use an NFA include .NET, PHP, Ruby, Perl, Python, GNU Emacs, ed, sed, vi, grep, and even some versions of egrep and awk. Tools that use a DFA include egrep, awk, lex, and flex. Some systems use a hybrid engine that picks the appropriate engine for each task (or even uses different engines for different parts of the same expression) to balance features and speed.

Both NFA and DFA have been evolving for many years, producing many needless variants and a confusing landscape. The POSIX standard was introduced to bring order to this: it specifies the metacharacters and features an engine should support. Surface details aside, DFA engines already conformed to the new standard, but traditional NFA engines produce different results, so an NFA has to be modified to comply. Regex engines can therefore be roughly divided into three types: DFA, traditional NFA, and POSIX NFA.

Consider using 'to(nite|knight|night)' to match the text '...tonight...'. The engine starts at the first position in the string (a position here is not a character but the gap between two adjacent characters). At each step it looks at one part of the expression and checks whether the "current text" (the characters after the current position) matches that part. If so, it moves on to the next part of the expression. If not, it advances one character in the string and starts over, until every part of the expression matches, that is, until the whole expression matches. In this example the first element of the expression is 't', so the engine retries from successive positions in the string until it finds a 't'. It then checks whether the next character can be matched by 'o', and if so goes on to the following element, which is 'nite', 'knight', or 'night'. The engine tries these three alternatives in order. Trying 'nite' works as before: match 'n', then 'i', then 't', and finally 'e'. If that attempt fails, the engine tries the next alternative, and so on until it either finds a match or reports failure. Because control passes from one element of the expression to the next, we call this approach "expression-dominated".
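The walk-through above can be checked in JavaScript, whose engine is expression-dominated. This is only a minimal sketch: the sample sentences and variable names are made up for illustration and do not come from the article.

```javascript
// Sample text is an assumption chosen for illustration.
const text = "see you tonight at nine";
const m = /to(nite|knight|night)/.exec(text);
console.log(m[0], m[1], m.index); // tonight night 8

// Branches are tried left to right; the first branch that lets the whole
// expression succeed wins, even if a later branch would match more text.
const ordered = /to(night|nightly)/.exec("tonightly");
console.log(ordered[0]); // tonight
```

The second call shows why the order of the branches matters in this kind of engine: 'nightly' is never tried because 'night' already succeeded.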

Unlike the expression-dominated NFA, a DFA engine keeps track of all "currently valid" match possibilities as it scans the string. In this example, the engine scans '...tonight...'. When it reaches the t, it marks the 't' in the expression to record that a match is possible from here, then continues scanning. The o also matches. After scanning the n, two possibilities remain ('knight' has been eliminated). By the time it reaches the g, only one possibility is left. Once the h and t have matched, the engine sees that the whole expression has matched and reports success. We call this approach "text-dominated", because each character scanned from the string controls the engine.

From this logic it is easy to see that the text-dominated DFA engine is generally faster. The expression-dominated NFA engine may waste time because it has to try different subexpressions against the same text. During an NFA match, a character in the target text may be examined by the expression many times (the number of times each character is examined is not known in advance, which is a handy way to remember the name "nondeterministic finite automaton"). In contrast, a DFA engine examines each character of the target text at most once (the amount of work per character is fixed, hence "deterministic finite automaton"). There may be a hundred ways for a DFA to reach a result, but since it tracks them all at once, it makes no difference which way the expression is written; rewriting the expression does not affect efficiency. An NFA, however, is driven by the expression, so rewriting the expression can save a great deal of work.

For that reason, everything explained from here on concerns NFA engines.
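The idea of tracking every still-valid possibility while scanning the text can be sketched as follows. This is a simplified illustration and not a real DFA: the hypothetical helper scan treats 'to(nite|knight|night)' as three literal words and stops as soon as one of them is complete.

```javascript
// Hypothetical helper: each scanned character filters the set of words that
// are still possible, so the text drives the process.
function scan(text, words) {
  for (let start = 0; start < text.length; start++) {
    let alive = words.slice();              // every possibility is still valid
    const trace = [];
    for (let i = 0; alive.length > 0; i++) {
      if (alive.some((w) => w.length === i)) {
        return { index: start, length: i, trace }; // one word is complete
      }
      const ch = text[start + i];
      alive = alive.filter((w) => w[i] === ch);    // drop what this character rules out
      trace.push(`${ch}: ${alive.join(" ")}`);
    }
  }
  return null;
}

const result = scan("...tonight...", ["tonite", "toknight", "tonight"]);
console.log(result.index, result.length); // 3 7
console.log(result.trace.join("\n"));
```

In the printed trace, the line for n lists two remaining words and the line for g lists only one, which mirrors the description above. Each character of the text is looked at once per starting position.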

　　4. Backtracking

What is backtracking? Let's look at an example. We use 'a(b|c)d' to try to match the string "cabb". The engine starts at the position before the 'c' and looks at the expression. The first element is 'a', which cannot match there. The engine moves to the position between 'c' and 'a' and looks at the expression again. This time 'a' matches, so it looks at the rest of the expression and finds two possible paths. It records both and picks one, here the branch that matches 'b'. The character after 'a' is 'b', so that matches. Looking at the expression again, 'd' is needed next, but the next character in the string is 'b', so this path fails and the engine returns to the point where it made the choice. This is called backtracking. The engine then tries to match 'c' after the 'a', but 'a' is followed by 'b', so this path fails too, and no alternatives remain. The engine now advances to the position between 'a' and 'b' and tries the expression from the start; the next character is 'b', which 'a' cannot match, so the engine keeps advancing. When it finds that no starting position leads to a match, it reports failure. If a complete match is found along the way, the engine stops and reports success (a traditional NFA stops at once; a POSIX NFA goes on to try every possible match and then selects one of them).

Backtracking, then, arises whenever the engine faces more than one choice while matching. When one choice fails to match, going back to try another is called backtracking. In fact, optimizing a regular expression largely means minimizing the number of backtracks.
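The steps described above can be sketched as a tiny hand-written matcher for 'a(b|c)d'. It is an illustration under simplifying assumptions, not how a real engine is implemented; matchAt and search are hypothetical names.

```javascript
// Hand-written stand-in for a(b|c)d that records where it has to backtrack.
function matchAt(text, start, trace) {
  if (text[start] !== "a") return null;          // 'a' must match first
  for (const branch of ["b", "c"]) {             // try each branch in order
    if (text[start + 1] !== branch) {
      trace.push(`at ${start}: branch '${branch}' does not match`);
      continue;
    }
    if (text[start + 2] === "d") return text.slice(start, start + 3);
    // 'd' failed: return to the saved choice point and try the next branch
    trace.push(`at ${start}: branch '${branch}' matched but 'd' failed, backtrack`);
  }
  return null;
}

function search(text) {
  const trace = [];
  for (let start = 0; start < text.length; start++) { // advance the start position
    const match = matchAt(text, start, trace);
    if (match !== null) return { match, trace };
  }
  return { match: null, trace };
}

const failed = search("cabb");
console.log(failed.match, failed.trace);
console.log(search("cacd").match);     // acd
console.log(/a(b|c)d/.exec("cabb"));   // null, the built-in engine agrees
```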

　　4.1 Backtracking: match-first and ignore-first quantifiers

The book "Mastering Regular Expressions" calls these "match first" and "ignore first"; many people online call them "greedy" and "non-greedy". They mean the same thing, so use whichever name you like.

The match-first (greedy) quantifiers are ?, +, *, and {}. A match-first quantifier tries to match first; only if that later fails does the engine backtrack and choose to skip. For example, matching 'ab*' against "abb" yields "abb". After 'a' matches, the engine has two options: try to match the following b, or skip it. Because the quantifier is match-first, the engine tries to match b, succeeds, and now has "ab". It then faces the same choice again, again chooses to match, and gets "abb". No characters are left, so the engine reports a successful match.

The ignore-first (non-greedy) quantifiers are formed by adding a ? after ?, +, *, or {}, giving ??, +?, *?, and {}?. For example, matching 'ab??' against "abb" yields "a" rather than "ab". After 'a' matches, the engine first chooses not to match b, because the quantifier is ignore-first. It then looks at the expression, finds that it has reached the end, and immediately reports a successful match.

Example 1:

            var reg1 = /ab?/;
            var reg2 = /ab??/;
            var result1 = reg1.exec("abc");
            var result2 = reg2.exec("abc");
            document.write(result1 + " " + result2);

Result: ab a

Example 2:

            var reg1 = /ab+/;
            var reg2 = /ab+?/;
            var result1 = reg1.exec("abbbc");
            var result2 = reg2.exec("abbbc");
            document.write(result1 + " " + result2);

Result: abbb ab

Example 3:

            var reg1 = /ab*/;
            var reg2 = /ab*?/;
            var result1 = reg1.exec("abbbc");
            var result2 = reg2.exec("abbbc");
            document.write(result1 + " " + result2);

Result: abbb a

Example 4:

            var reg1 = /ab{2,4}/;
            var reg2 = /ab{2,4}?/;
            var result1 = reg1.exec("abbbbbbc");
            var result2 = reg2.exec("abbbbbbc");
            document.write(result1 + " " + result2);

Result: abbbb abb

Now a slightly more complex match-first example: use 'c.*d' to match the string "caaadc". Once 'c' matches, '.*' matches all the way through the final 'c' to the end of the string. The engine then tries to match 'd' from the expression, but no characters are left, so it backtracks to the point where '.*' matched the final 'c' and has '.*' give that 'c' up. Now 'd' is tried against that 'c' and fails, so the engine backtracks again and has '.*' give up the 'd' as well. This time 'd' matches, so the engine stops and reports success. The result is "caaad".

Now the ignore-first case: use 'a.*?d' to match the string "caaadc". Once 'a' matches, the engine has two paths. It chooses to skip and tries to match 'd' right away, but the first 'a' in "caaadc" is followed by another 'a', so that fails. The engine backtracks to the last choice and this time lets '.*?' match one character, giving "aa". It then faces the same choice, again chooses to skip, finds another 'a' that cannot match 'd', backtracks, chooses to match, and has "aaa". Skipping once more, it finds that 'd' matches, reports success, and returns "aaad".
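Both walk-throughs can be confirmed directly. The sketch below runs the two expressions on "caaadc" and then, as an added illustration with a made-up string "caaadcd", shows a case where the two kinds of quantifier give different results for the same text.

```javascript
const greedy = /c.*d/.exec("caaadc");
const lazy = /a.*?d/.exec("caaadc");
console.log(greedy[0]); // caaad
console.log(lazy[0]);   // aaad

// With a second d in the text, .* runs to the last d and .*? stops at the first.
const greedyLong = /a.*d/.exec("caaadcd");
const lazyShort = /a.*?d/.exec("caaadcd");
console.log(greedyLong[0]); // aaadcd
console.log(lazyShort[0]);  // aaad
```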

I hope these examples can explain these two different situations clearly!

　　4.2 Backtracking and atomic groups

Sometimes we do not want the engine to try certain backtracks. An atomic group, '(?>...)', solves this. Once the subexpression inside the parentheses has matched, what it matched is fixed (atomic) and does not change for the rest of the match, unless the whole atomic group is abandoned and then re-entered as a result of backtracking outside it. The following simple example illustrates this "fixed" behavior.

'!.*!' matches "!Hello!", but if '.*' is placed in an atomic group, as in '!(?>.*)!', the expression cannot match. In both cases '.*' grabs as many characters as possible, including the final '!'. The atomic group, however, never gives back characters it has matched, so the results differ.

This example has no practical value, but atomic grouping matters a great deal: it can improve matching efficiency and gives precise control over what can and cannot match. Unfortunately, JavaScript does not support it.
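JavaScript regular expressions have no '(?>...)' syntax, so the second expression cannot be written literally there. A commonly used workaround, which is not from this article, is to capture inside a lookahead and then consume the capture with a backreference, '(?=(...))\1'. The sketch below assumes that emulation; the variable names are illustrative.

```javascript
const plain = /!.*!/;
// Emulation of !(?>.*)! : the lookahead captures what .* would take, and \1
// then consumes exactly that text with no way to give characters back.
const atomicLike = /!(?=(.*))\1!/;

console.log(plain.test("!Hello!"));      // true
console.log(atomicLike.test("!Hello!")); // false
```

The emulation fails for the same reason given above: '.*' has already taken the final '!', and nothing can make it return that character.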

　　4.3 Backtracking and possessive quantifiers

The possessive quantifiers are *+, ++, ?+, and {}+. They are currently provided only by java.util.regex and PCRE (and therefore PHP), but they are likely to become more widespread. A possessive quantifier behaves like the ordinary match-first quantifier, except that once it has matched something it never gives it back. In this it resembles an atomic group '(?>...)'. In a sense possessive quantifiers are only a convenience, since they can be simulated with atomic groups: '.++' and '(?>.+)' give the same result, although a smart enough implementation can optimize possessive quantifiers further.

　　4.4 Backtracking and lookaround

A lookaround construct does not match any characters; it matches only a position in the text, much like the word boundary '\b' and the anchors '^' and '$'.

'(?=...)' is called positive lookahead. For example, 'x(?=y)' matches x only if it is followed by y; only x is part of the match, and y is not.

'(?!...)' is called negative lookahead. For example, 'x(?!y)' matches x only if it is not followed by y; only x is part of the match, and y is not.

Once the engine leaves a lookaround construct, any saved states inside it are discarded immediately, so backtracking outside the lookaround can never return to a saved state inside it. For example, use 'ab\w+c' and 'ab(?=\w+)c' to match the string "abbbbc": the first expression succeeds and the second fails.
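The two expressions just mentioned can be run as written. The last lines are an added illustration, not from the article, showing that the lookahead inspects text without consuming it.

```javascript
const consuming = /ab\w+c/.exec("abbbbc");
const lookahead = /ab(?=\w+)c/.exec("abbbbc");
console.log(consuming[0]); // abbbbc
console.log(lookahead);    // null, because 'c' is tried right after "ab"

// Without the trailing 'c' the lookahead succeeds, but only "ab" is matched.
const zeroWidth = /ab(?=\w+)/.exec("abbbbc");
console.log(zeroWidth[0], zeroWidth[0].length); // ab 2
```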

Example 1:

            var reg = /ab(?=c)/;
            var result1 = reg.exec("abcd");
            var result2 = reg.exec("abbc");
            document.write(result1 + " " + result2);

Result: ab null

Example 2:

            var reg = /ab(?!c)/;
            var result1 = reg.exec("abdc");
            var result2 = reg.exec("abcd");
            document.write(result1 + " " + result2);

Result: ab null

Example 3:

            var reg1 = /ab\w+bc/;
            var reg2 = /ab(?=\w+)c/;
            var result1 = reg1.exec("abbbbbcb");
            var result2 = reg2.exec("abbbbbbc");
            document.write(result1 + " " + result2);

Result: abbbbbc null

These explanations are admittedly rough and may be revised later. The key point is that lookbehind and negative lookbehind, possessive quantifiers, and atomic groups all address the problem of backtracking (JavaScript did not support them at the time of writing, so using them means switching to another language). Backtracking is the main culprit behind slow expressions. These topics deserve a more detailed treatment another time; they are not easy to read or write about.

　　5. Create efficient regular expressions

Perl, Java, .NET, Python, PHP, and the JavaScript we are familiar with all use expression-dominated NFA engines, where a small change to an expression can have a large effect on how it matches. Issues that do not exist for a DFA matter a great deal for an NFA. Because an NFA engine gives the user precise control, regular expressions can be crafted with care.

　　5.1 make a good start

Ordinary text contains many letters and digits but few special characters. One simple change is to swap the order of the branches of an alternation, which can pay off. For example, use '(:|\w)*' and '(\w|:)*' to match the string "ab13_b:bbbb:c34d". A colon generally appears less often than a letter or digit, so in this example the first expression is less efficient than the second.

Example:

            var reg1 = /(:|\w)*/;
            var reg2 = /(\w|:)*/;
            var result1 = reg1.exec("ab13_b:bbbb:c34d");
            var result2 = reg2.exec("ab13_b:bbbb:c34d");
            document.write(result1 + " " + result2);
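Both expressions return the same match, so the difference is only in the work done. A rough way to see it is to count how many branch tests each ordering needs per character. The sketch below is a simplified model, not a measurement of a real engine; countBranchAttempts and the two predicates are hypothetical names.

```javascript
const isColon = (ch) => ch === ":";
const isWord = (ch) => /\w/.test(ch);

// For each character, try the branches in order and stop at the first that fits.
function countBranchAttempts(text, branches) {
  let attempts = 0;
  for (const ch of text) {
    for (const branch of branches) {
      attempts++;
      if (branch(ch)) break;
    }
  }
  return attempts;
}

const text = "ab13_b:bbbb:c34d";
const colonFirst = countBranchAttempts(text, [isColon, isWord]); // models (:|\w)*
const wordFirst = countBranchAttempts(text, [isWord, isColon]);  // models (\w|:)*
console.log(colonFirst, wordFirst); // 30 18
```

Putting the more frequent branch first saves one failed test for every letter, digit, or underscore in the text.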

　　5.2 Mismatch

Even when the text cannot match, the engine may still do a great deal of work before giving up. We can make it report failure sooner. For example, use '".*"!' and '"[^"]*"!' to match the string: The name "McDonald's" is said "makudonarudo" in Japan. The first expression backtracks far more than the second.
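The difference in effort can be approximated with a small step counter. This is a simplified model under stated assumptions, not a real engine: the hypothetical function countSteps handles only the shape "opening quote, greedy star, closing quote, exclamation mark", and its counts are illustrative rather than exact engine figures.

```javascript
// `allowed` decides which characters the star may consume.
function countSteps(text, allowed) {
  let steps = 0;
  for (let start = 0; start < text.length; start++) {
    steps++;                                   // test the opening quote
    if (text[start] !== '"') continue;
    let end = start + 1;
    while (end < text.length && allowed(text[end])) {
      end++;                                   // the star consumes one character
      steps++;
    }
    for (let pos = end; pos > start; pos--) {  // give characters back one by one
      steps++;                                 // test the closing quote and '!'
      if (text[pos] === '"' && text[pos + 1] === "!") return { matched: true, steps };
    }
  }
  return { matched: false, steps };
}

const text = `The name "McDonald's" is said "makudonarudo" in Japan`;
const dot = countSteps(text, () => true);             // models ".*"!
const negated = countSteps(text, (ch) => ch !== '"'); // models "[^"]*"!
console.log(dot, negated);
```

Both versions fail, but the model using the negated class stops consuming at the next quote and therefore has far fewer characters to give back.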

　　5.3 Alternation is expensive

Alternation is one of the main causes of backtracking. For example, use 'u|v|w|x|y|z' and '[uvwxyz]' to match the string: The name "McDonald's" is said "makudonarudo" in Japan. '[uvwxyz]' succeeds after only 34 attempts, whereas 'u|v|w|x|y|z' needs 6 backtracks at each position, a total of 198 backtracks before reaching the same result.

Use alternation sparingly.

　　5.4 eliminate unnecessary parentheses

'(?:.)*' and '.*' are equivalent, but not every implementation treats them the same way, so use the latter in place of the former; '.*' is actually faster.

　　5.5 remove unnecessary character groups

A character class containing a single character is redundant, because it still has to be processed as a character class. For example, '[.]' can be written as '\.'.

　　5.6 Equivalent Conversion of quantifiers

Some people habitually write '\d\d\d\d', while others prefer the quantifier form '\d{4}'. For an NFA the two differ in efficiency, but which is faster varies by tool. If the tool optimizes quantifiers, '\d{4}' is faster, unless the version without the quantifier happens to benefit from even more optimization.

　　5.7 use non-capturing parentheses

If you do not need to refer to the text matched inside parentheses, use non-capturing parentheses '(?:...)'. This saves the time spent capturing and reduces the amount of state kept for backtracking. Since capturing uses memory, it reduces memory usage as well.
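The visible effect of non-capturing parentheses is that the match result carries no extra group. The sketch below uses a made-up pattern and string; it shows only the shape of the result and does not measure time or memory.

```javascript
const capturing = /(ab|cd)+x/.exec("abcdx");
const nonCapturing = /(?:ab|cd)+x/.exec("abcdx");

console.log(capturing[0], nonCapturing[0]);  // abcdx abcdx
console.log(capturing.length, capturing[1]); // 2 cd  (group 1 holds the last repetition)
console.log(nonCapturing.length);            // 1     (nothing was captured)
```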

　　5.8 extract required elements

Many regex engines apply local optimizations that depend on the engine being able to identify text that must be present for any match to succeed, so exposing that text by hand makes it more likely that the engine will recognize it. Replace 'x+' with 'xx*' to expose the required 'x'. '-{2,4}' can be written as '---{0,2}'. Replace '(?:this|that)' with 'th(?:is|at)' to expose the required 'th'.
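Rewrites like these are only safe if both forms match exactly the same text. The sketch below checks that for a few made-up sample strings; the two-to-four dashes case is written as '---{0,2}', that is, two literal dashes followed by up to two more. The helper sameResult is a hypothetical name.

```javascript
const pairs = [
  [/x+/, /xx*/],
  [/-{2,4}/, /---{0,2}/],
  [/(?:this|that)/, /th(?:is|at)/],
];
const samples = ["axxxb", "a-b", "a-----b", "is that it", "nothing here", "this"];

// Two expressions agree on a string if both fail, or both match the same
// text at the same index.
function sameResult(a, b, s) {
  const ra = a.exec(s);
  const rb = b.exec(s);
  if (ra === null || rb === null) return ra === rb;
  return ra[0] === rb[0] && ra.index === rb.index;
}

const allSame = pairs.every(([a, b]) => samples.every((s) => sameResult(a, b, s)));
console.log(allSame); // true
```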

　　5.9 ignore priority and match priority

Whether to use an ignore-first or a match-first quantifier is usually dictated by what the expression needs to do. For example, '^.*:' is completely different from '^.*?:', because the former matches up to the last colon and the latter up to the first. If the target data contains only one colon, however, the two expressions match the same text, and the trade-off is less clear-cut. The rule of thumb: if the target string is long and you expect the colon near the start, use the ignore-first quantifier; if you expect the colon near the end, use the match-first quantifier. If the data is random and you have no idea where the colon will be, use the match-first quantifier, because it is generally better optimized than the others.
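The difference between the two expressions, and the single-colon case where they agree, can be seen with two made-up strings:

```javascript
const twoColons = "key: value: more";
const lastColon = /^.*:/.exec(twoColons)[0];
const firstColon = /^.*?:/.exec(twoColons)[0];
console.log(lastColon);  // key: value:
console.log(firstColon); // key:

const oneColon = "key: value";
const greedyOne = /^.*:/.exec(oneColon)[0];
const lazyOne = /^.*?:/.exec(oneColon)[0];
console.log(greedyOne === lazyOne); // true, both give "key:"
```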

　　5.10 split regular expression

Sometimes applying several small regular expressions is much faster than applying one large one. For example, to check whether a long string contains the name of a month, testing 'January', 'February', 'March', and so on one after another is much faster than testing 'January|...|...'.

For more optimization techniques, see "Mastering Regular Expressions"; only some of the more approachable ones are listed here. In fact, once you understand how the regex engine matches and how backtracking works, you can optimize the expressions you write.
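The two approaches might be written as follows. This is a minimal sketch with hypothetical function names; it only shows that both give the same answer. Whether the split form is actually faster depends on the engine and the data, so it is worth measuring before relying on it.

```javascript
const months = [
  "January", "February", "March", "April", "May", "June", "July",
  "August", "September", "October", "November", "December",
];

const big = new RegExp(months.join("|"));            // one large alternation
const smalls = months.map((name) => new RegExp(name)); // twelve small expressions

function containsMonthBig(text) {
  return big.test(text);
}

function containsMonthSplit(text) {
  return smalls.some((re) => re.test(text)); // stops at the first hit
}

const samples = ["Invoice dated 3 March 2018", "no month here", "Due in December"];
for (const s of samples) {
  console.log(s, containsMonthBig(s), containsMonthSplit(s));
}
```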

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
