An In-Depth Look at PCRE's Maximum Backtracking and Recursion Limits

Source: Internet
Author: User

Today, Tank asked a question about the following regular expression:
/<script>.*?<\/script>/i

When the string to be matched is longer than 100014 characters, the call does not return a correct result:
$reg = "/<script>.*?<\/script>/is";
$str = "<script>********</script>"; // the length is greater than 100014
$ret = preg_replace($reg, "", $str); // returns NULL

Does a regular expression limit the length of the string it can match?
No, of course not. The cause is that the PCRE extension of PHP provides two configuration settings:
pcre.backtrack_limit // maximum number of backtracks
pcre.recursion_limit // maximum recursion (nesting) depth

The default value of backtrack_limit is 100000 (since PHP 5.3.7 the default is 1000000).
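
As a minimal sketch, both limits can be inspected and changed at runtime with PHP's standard ini_get() and ini_set() functions. The value 1000000 below is only an illustrative choice, not a recommendation from the article.

```php
<?php
// Read the limits currently in effect; ini_get() returns them as strings.
$backtrackLimit = ini_get('pcre.backtrack_limit');
$recursionLimit = ini_get('pcre.recursion_limit');
printf("backtrack_limit=%s recursion_limit=%s\n", $backtrackLimit, $recursionLimit);

// Change the backtrack limit for the current request only.
// 1000000 is an illustrative value chosen for this sketch.
ini_set('pcre.backtrack_limit', '1000000');
printf("backtrack_limit is now %s\n", ini_get('pcre.backtrack_limit'));
```

The same settings can also be placed in php.ini when the change should apply to every request.
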
The problem is related to the backtrack_limit setting. To understand why, the key concept is "backtracking".
This regular expression uses non-greedy matching. Simply put, a non-greedy quantifier first matches nothing, records the alternative state, and hands control to the next element of the pattern. When the subsequent match fails, the engine backtracks to the recorded state, lets the quantifier consume one more character, and tries again.
For example:
Source string: aaab
Regex: .*?b

When matching starts, ".*?" gets control first. Because it is non-greedy, it prefers to match nothing and hands control to the next element of the pattern, "b". "b" fails to match the first character of the source string ("a"), so the engine backtracks and returns control to ".*?". This time ".*?" matches one character, "a", and hands control to "b" again. The process repeats until the match succeeds; in total, three backtracks occur.
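
The walk-through above can be sketched as a toy model. This is not PCRE's implementation; the function name lazyMatchBacktracks and its return shape are hypothetical, and it only counts the hand-offs between a lazy ".*?" and a following literal.

```php
<?php
// Toy model of a lazy ".*?" followed by a literal, anchored at position 0.
// Hypothetical helper for illustration; PCRE's real engine works differently.
function lazyMatchBacktracks(string $subject, string $literal): array
{
    $consumed   = 0; // characters currently taken by ".*?"
    $backtracks = 0; // times control returned to ".*?" after the literal failed
    $length     = strlen($subject);

    while (true) {
        // ".*?" hands control to the literal at the current position.
        if (substr($subject, $consumed, strlen($literal)) === $literal) {
            return ['matched' => true, 'backtracks' => $backtracks];
        }
        if ($consumed >= $length) {
            return ['matched' => false, 'backtracks' => $backtracks];
        }
        // The literal failed: backtrack and let ".*?" take one more character.
        $backtracks++;
        $consumed++;
    }
}

$result = lazyMatchBacktracks('aaab', 'b');
printf("matched=%s backtracks=%d\n", $result['matched'] ? 'yes' : 'no', $result['backtracks']);
```

For the source string "aaab" the model reports three backtracks, one for each "a" that the literal "b" fails on.
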
Now let's return to the example at the beginning of the article. The default backtrack_limit is 100000. Excluding the 8-character opening "<script>" and the 9-character closing "</script>", a source string of length 100014 contains 99997 characters.
In addition, because of the logic of the match function, the backtrack count in this example increases by a further 3 (for details, see the match function in pcrelib/pcre_exec.c), so by the time "</script>" is matched, the backtrack count in PCRE is exactly 100000 and the match completes normally.
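
The arithmetic behind that threshold can be written out as a short sketch. The fixed overhead of 3 is taken from the article's description of the PCRE version it examined; other PCRE versions may count differently.

```php
<?php
$totalLength = 100014;                    // longest subject that still matches, per the text
$openLength  = strlen('<script>');        // 8
$closeLength = strlen('</script>');       // 9
$bodyLength  = $totalLength - $openLength - $closeLength;

$fixedOverhead = 3; // extra count the article attributes to match() in pcre_exec.c
$backtracks    = $bodyLength + $fixedOverhead;

printf("body=%d backtracks=%d\n", $bodyLength, $backtracks);
```
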
However, if one more character is added, the backtrack count exceeds 100000, and the match fails and exits.
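
The effect is easier to reproduce with a deliberately small limit than with the exact threshold, which depends on the PCRE version. The sketch below assumes a PHP build where pcre.jit can be switched off, so that the interpreter counts each backtrack; the limit of 1000 and the 50000-character body are arbitrary demonstration values.

```php
<?php
// Assumption: disabling the JIT keeps the per-backtrack counting described above.
ini_set('pcre.jit', '0');

$reg = '/<script>.*?<\/script>/is';
$str = '<script>' . str_repeat('*', 50000) . '</script>';

// With a tiny limit the lazy quantifier runs out of backtracks.
ini_set('pcre.backtrack_limit', '1000');
$failed      = preg_replace($reg, '', $str);
$failedError = preg_last_error();

// With a generous limit the same call succeeds.
ini_set('pcre.backtrack_limit', '1000000');
$succeeded = preg_replace($reg, '', $str);

var_dump($failed);
var_dump($failedError === PREG_BACKTRACK_LIMIT_ERROR);
var_dump($succeeded);
```
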
Since PHP 5.2, the following function is available:
int preg_last_error(void)
Returns the error code of the last PCRE regex execution.

We should always check the return value of this function. A non-zero value means that the previous regex function call hit an error. For the example in this article, it returns PREG_BACKTRACK_LIMIT_ERROR.
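
One way to make this check hard to forget is a small wrapper. The function checkedPregReplace below is a hypothetical helper, not part of PHP; it is a sketch that turns a silent NULL into an exception naming the error constant.

```php
<?php
// Hypothetical helper: preg_replace() that throws instead of returning NULL.
function checkedPregReplace(string $pattern, string $replacement, string $subject): string
{
    $result = preg_replace($pattern, $replacement, $subject);
    $error  = preg_last_error();

    if ($result === null || $error !== PREG_NO_ERROR) {
        // Map the most common error codes to their constant names.
        $names = [
            PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
            PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
            PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
            PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
        ];
        $name = $names[$error] ?? ('error code ' . $error);
        throw new RuntimeException('preg_replace failed: ' . $name);
    }

    return $result;
}

echo checkedPregReplace('/b+/', '', 'abbbc'), "\n";
```

Callers can then catch RuntimeException in one place instead of comparing every result against NULL.
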
Incidentally, if non-greedy matching leads to too many backtracks, performance will inevitably suffer. Writing the regular expression carefully avoids this problem. For example, the regex from the beginning of the article can be rewritten as:
/<script>[^<]*<\/script>/i

This does not cause nearly as much backtracking, although it only matches script bodies that contain no "<".
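
The difference between the two patterns can be observed under the same small limit. This is a sketch under stated assumptions: the JIT is switched off, and the limit of 1000 and the body length of 50000 are arbitrary demonstration values.

```php
<?php
ini_set('pcre.jit', '0');                // assumption: use the interpreter's counting
ini_set('pcre.backtrack_limit', '1000'); // deliberately tiny, for demonstration only

$str = '<script>' . str_repeat('*', 50000) . '</script>';

// Lazy quantifier: one backtrack per character of the body.
$lazy      = preg_replace('/<script>.*?<\/script>/is', '', $str);
$lazyError = preg_last_error();

// Negated character class: the body is consumed in a single step.
$negated      = preg_replace('/<script>[^<]*<\/script>/i', '', $str);
$negatedError = preg_last_error();

var_dump($lazy === null, $lazyError);
var_dump($negated, $negatedError);
```
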
recursion_limit limits the maximum recursion depth of the regex engine. If this value is set too high, the stack space may be exhausted. The default of 100000 may be too large...
For example, take a string of length 10000 and the following seemingly "simple" regex:
// The default recursion_limit value is 100000.
$reg = "/(.+?)+/is";
$str = str_pad("laruence", 10000, "a"); // the length is 10000
$ret = preg_replace($reg, "", $str);

This causes a core dump, because the recursion is too deep and the stack overflows.
Of course, you can work around the problem temporarily by enlarging the stack. For example, after raising the stack size to 20 MB, the code above runs normally, but this is definitely not the right solution. The right solution is to optimize the regular expression.
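
Another stopgap, sketched below, is to lower pcre.recursion_limit so that the engine gives up with an error code instead of exhausting the stack. The value 1000 is an arbitrary choice for this sketch, and the outcome depends on the PHP and PCRE versions: some builds report PREG_RECURSION_LIMIT_ERROR, others may report a different error or simply match.

```php
<?php
ini_set('pcre.jit', '0');                // assumption: use the interpreter path
ini_set('pcre.recursion_limit', '1000'); // illustrative value, far below the default

$reg = '/(.+?)+/is';
$str = str_pad('laruence', 10000, 'a');  // the length is 10000

$ret = preg_replace($reg, '', $str);
$err = preg_last_error();

if ($ret === null) {
    // The engine stopped early; $err holds one of the PREG_*_ERROR codes.
    echo "regex aborted with error code $err\n";
} else {
    echo "regex completed, result length " . strlen($ret) . "\n";
}
```
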
Finally, regular expressions are convenient but hard to use well. Especially when processing large amounts of text, a carelessly designed regular expression easily leads to deep recursion. For performance reasons, we recommend using plain string functions instead of regular expressions wherever they can do the job.
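
As a sketch of that advice, the script-stripping task from the beginning of the article can be done with stripos() and substr() alone. The function stripScriptBlocks is a hypothetical helper written for this illustration, and like the original regex it only recognizes a bare "<script>" tag without attributes.

```php
<?php
// Hypothetical helper: remove <script>...</script> blocks without any regex.
function stripScriptBlocks(string $html): string
{
    $open  = '<script>';
    $close = '</script>';
    $out   = '';
    $pos   = 0;

    // stripos() is case-insensitive, mirroring the /i flag of the regex.
    while (($start = stripos($html, $open, $pos)) !== false) {
        $end = stripos($html, $close, $start + strlen($open));
        if ($end === false) {
            break; // unterminated block: leave the rest untouched
        }
        $out .= substr($html, $pos, $start - $pos); // keep text before the block
        $pos  = $end + strlen($close);              // skip past the closing tag
    }

    return $out . substr($html, $pos);
}

echo stripScriptBlocks('a<script>x</script>b<SCRIPT>y</Script>c'), "\n";
```

Because the function scans forward once and never backtracks, its cost does not depend on pcre.backtrack_limit or pcre.recursion_limit.
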
