Today, Tank asked the following question:
When the length of the string to be matched is greater than 100014, the correct result is not obtained:
$reg = "/<script>.*?<\/script>/is";
$str = "<script>********</script>"; // length greater than 100014
$ret = preg_replace($reg, "", $str); // returns NULL
Is there a length limit for matching strings?
Of course not. The real cause lies in two settings provided by PHP's PCRE extension:
pcre.backtrack_limit // maximum number of backtracks
pcre.recursion_limit // maximum nesting (recursion) depth
The default backtrack_limit is 100000 (100,000) in the PHP version discussed here; PHP 5.3.7 raised the default to 1000000.
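To see which values are actually in effect on a given installation, the two settings can be read with ini_get and changed at runtime with ini_set. The sketch below is only an illustration; the value 200000 is an arbitrary example, not a recommendation.

```php
<?php
// Read the two limits currently in effect (ini_get returns strings).
$backtrackLimit = ini_get('pcre.backtrack_limit');
$recursionLimit = ini_get('pcre.recursion_limit');

echo "pcre.backtrack_limit = {$backtrackLimit}\n";
echo "pcre.recursion_limit = {$recursionLimit}\n";

// Both settings can be changed at runtime.
// 200000 is an arbitrary example value.
ini_set('pcre.backtrack_limit', '200000');
echo "new pcre.backtrack_limit = ", ini_get('pcre.backtrack_limit'), "\n";
```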
This problem is related to the backtrack_limit setting. The key to understanding its cause is to understand what "backtracking" means.
This regex uses non-greedy (lazy) matching. Put simply, whenever a non-greedy quantifier can either match or not match, it prefers not to match. It records the alternative state and hands control to the next element of the regular expression; if the subsequent match fails, the engine backtracks to the recorded state and tries again.
As an example:
Source string: aaab
Regex: .*?b
At the start of the match, ".*?" gets control first. Because it is non-greedy, it prefers not to match and hands control to the next element, "b". At position 1 of the source string, "b" fails to match (the character there is "a"), so the engine backtracks and returns control to ".*?". This time ".*?" matches one character, "a", and hands control to "b" again. This repeats until the match is finally found; in total, 3 backtracks occur.
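The walkthrough above can be mimicked with a toy counter. This is not how PCRE is implemented; countLazyBacktracks is a hypothetical helper that only models ".*?" followed by a single literal character, matched from offset 0, and counts how often control returns to the lazy quantifier.

```php
<?php
/**
 * Toy model of matching ".*?X" at offset 0, where X is one literal character.
 * Returns the number of backtracks, or null if there is no match.
 */
function countLazyBacktracks(string $subject, string $next): ?int
{
    $backtracks = 0;
    $length = strlen($subject);

    // $consumed is the number of characters ".*?" currently holds.
    for ($consumed = 0; $consumed < $length; $consumed++) {
        if ($subject[$consumed] === $next) {
            return $backtracks; // the element after ".*?" matched
        }
        // The next element failed: backtrack, let ".*?" take one more character.
        $backtracks++;
    }

    return null; // no match starting at offset 0
}

echo countLazyBacktracks('aaab', 'b'), "\n"; // prints 3
```

The count grows linearly with the number of characters ".*?" has to skip, which is why a long subject can exhaust backtrack_limit.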
Now let's return to the example at the beginning of the article. The default backtrack_limit is 100000. For a source string of length 100014, subtracting the 8 characters of the opening "<script>" and the 9 characters of the closing "</script>" leaves 99,997 characters for ".*?" to step through.
In addition, because of the match function's own logic (see the match function in pcrelib/pcre_exec.c), the pattern from the beginning of the article adds another 3 to the backtracking count, so the count in PCRE is exactly 100000 by the time matching reaches the closing "</script>". The match therefore succeeds and exits normally.
Add just one more character and the backtracking count exceeds 100000, so the match fails and exits.
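The arithmetic behind the 100014 threshold can be written out as a small sketch. The overhead of 3 is taken from the article as-is; the variable names are only illustrative.

```php
<?php
$open  = strlen('<script>');   // 8 characters
$close = strlen('</script>');  // 9 characters
$total = 100014;               // longest subject that still matches, per the article

$lazyChars  = $total - $open - $close; // characters ".*?" has to step through
$overhead   = 3;                       // extra count from the match function's own logic
$backtracks = $lazyChars + $overhead;

printf("%d + %d = %d\n", $lazyChars, $overhead, $backtracks); // prints "99997 + 3 = 100000"
```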
Since PHP 5.2, the following function is available:
int preg_last_error(void)
Returns the error code of the last PCRE regex execution.
We should check this function's return value regularly. A non-zero value means the last regex call failed; for the example in this article, it returns PREG_BACKTRACK_LIMIT_ERROR.
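One way to make this check hard to forget is to wrap the call. The helper below, safePregReplace, is hypothetical and not part of PHP; it is a minimal sketch that turns a PCRE failure into an exception instead of a silent NULL.

```php
<?php
/**
 * Hypothetical helper (not part of PHP): wraps preg_replace and throws
 * when PCRE reports an error or the call returns NULL.
 */
function safePregReplace(string $pattern, string $replacement, string $subject): string
{
    $result = preg_replace($pattern, $replacement, $subject);
    $error  = preg_last_error();

    if ($result === null || $error !== PREG_NO_ERROR) {
        throw new RuntimeException("preg_replace failed, PCRE error code {$error}");
    }

    return $result;
}

echo safePregReplace('/\d+/', '#', 'order 42'), "\n"; // prints "order #"
```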
Finally, a side note: because non-greedy patterns cause so much backtracking, they inevitably carry a performance cost. Writing the regex appropriately avoids the problem. For example, change the regex from the beginning of the article to:
/<script>[^<]*<\/script>/i
This version does not cause nearly as much backtracking.
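The difference can be observed by running both patterns under a deliberately small backtrack limit. The sketch below assumes a limit of 100 and a 10,000-character body purely for illustration; the lazy pattern is expected to hit the limit while the negated character class is expected to succeed, though exact counting differs between PCRE versions. Note that "[^<]*" is not a drop-in replacement: it will not match a script body that itself contains "<".

```php
<?php
// Small limit chosen only to make the effect visible with a short subject.
ini_set('pcre.backtrack_limit', '100');

$str = '<script>' . str_repeat('*', 10000) . '</script>';

// Lazy version from the beginning of the article.
$lazy    = preg_replace('/<script>.*?<\/script>/is', '', $str);
$lazyErr = preg_last_error();

// Negated character class version.
$negated    = preg_replace('/<script>[^<]*<\/script>/i', '', $str);
$negatedErr = preg_last_error();

var_dump($lazy === null, $lazyErr === PREG_BACKTRACK_LIMIT_ERROR);
var_dump($negated === '', $negatedErr === PREG_NO_ERROR);
```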
recursion_limit limits the maximum nesting depth of a regular expression match. If this value is set too large, the process may run out of stack space. The default of 100000 seems a bit too big ...
For example, for a string with a length of 10000, the following seemingly "simple" regex:
// default recursion_limit is 100000
$reg = "/(.+?)+/is";
$str = str_pad("laruence", 10000, "a"); // length of 10,000
$ret = preg_replace($reg, "", $str);
causes a core dump, because the excessive nesting overflows the stack.
Of course, you can enlarge the stack to work around the problem temporarily; for example, with the stack size set to 20M, the code above runs normally. But this is certainly not the ideal solution. The fundamental fix is still to optimize the regex.
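Since the default recursion_limit is described above as too large, another stopgap is to lower it so that PCRE gives up with an error code instead of exhausting the stack. The sketch below assumes a limit of 1000, an arbitrary illustrative value. Which error code comes back, or whether the call simply succeeds, depends on the PCRE version and on whether JIT is enabled, so the code reports the outcome rather than assuming one.

```php
<?php
// Illustrative value only: low enough that PCRE stops long before
// the process stack could be exhausted.
ini_set('pcre.recursion_limit', '1000');

$reg = '/(.+?)+/is';
$str = str_pad('laruence', 10000, 'a'); // length of 10,000

$ret = preg_replace($reg, '', $str);

if ($ret === null) {
    // On PCRE builds that recurse on the stack this is expected to be
    // PREG_RECURSION_LIMIT_ERROR; other builds may report another code.
    echo 'PCRE error code: ', preg_last_error(), "\n";
} else {
    echo 'replaced, result length: ', strlen($ret), "\n";
}
```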
Finally: regular expressions are easy to pick up but hard to use well. Especially when processing large amounts of text, a carelessly designed regex can easily lead to deep nesting. With performance in mind as well, it is recommended to use plain string functions instead of regular expressions wherever possible.
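As an illustration of that last recommendation, the script-stripping task from the beginning of the article can be done with plain string functions, so no PCRE limit is involved at all. stripScriptBlocks is a hypothetical helper, and like the regex in the article it assumes the tags carry no attributes.

```php
<?php
/**
 * Hypothetical helper: removes every "<script>...</script>" block using
 * string functions only. Tag matching is case-insensitive, like the /i flag.
 */
function stripScriptBlocks(string $html): string
{
    $open  = '<script>';
    $close = '</script>';
    $out   = '';
    $pos   = 0;

    while (($start = stripos($html, $open, $pos)) !== false) {
        $end = stripos($html, $close, $start + strlen($open));
        if ($end === false) {
            break; // unterminated block: leave the rest untouched
        }
        $out .= substr($html, $pos, $start - $pos); // text before the block
        $pos  = $end + strlen($close);              // continue after the block
    }

    return $out . substr($html, $pos);
}

$str = 'a<script>' . str_repeat('*', 200000) . '</script>b';
echo stripScriptBlocks($str), "\n"; // prints "ab"
```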