A discussion of greedy and non-greedy matching, backtracking, and regular expression efficiency

Source: Internet
Author: User
Let's start with the basics. What is a greedy regular expression, and what is a non-greedy one? Put another way, what is a match-first quantifier, and what is an ignore-first quantifier?
If the concepts are unfamiliar, an example will help.
Suppose someone wants to filter out everything between <script> and </script>, and writes the following regex and code:

$str = preg_replace('%<script>.+?</script>%i', '', $str); // non-greedy

This looks fine, but it is not. Suppose the input is:

$str = '<script<script>alert(document.cookie)</script>>alert(document.cookie)</script>';

After the code above has processed it, the result is:

$str = '<script<script>alert(document.cookie)</script>>alert(document.cookie)</script>';
$str = preg_replace('%<script>.+?</script>%i', '', $str); // non-greedy
print_r($str);
The output of $str is <script>alert(document.cookie)</script>

That is still not the effect the author wanted. The quantifier above is non-greedy, which some people also call lazy. A non-greedy quantifier is marked by a ? appended to the quantifier, as in +?, *? and ?? (the last one is a special case that I will cover in a later post). In other words, the trailing ? makes the quantifier non-greedy; without it, the quantifier is greedy. For example:

$str = '<script<script>alert(document.cookie)</script>>alert(document.cookie)</script>';
$str = preg_replace('%<script>.+</script>%i', '', $str); // greedy
print_r($str);
The output of $str is just <script. That does not seem quite right either. Do you know how to rewrite the regex?
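To see the two modes side by side on a simpler input, here is a minimal sketch. The sample string and the variable names are assumptions made for this illustration and are not part of the original code.

```php
<?php
// Hypothetical sample input with two separate tag pairs.
$html = '<b>one</b> and <b>two</b>';

// Greedy: .+ grabs as much as possible, then gives characters back
// until </b> can match, so the match ends at the LAST </b>.
preg_match('%<b>.+</b>%', $html, $greedy);

// Non-greedy: .+? grabs as little as possible and grows one character
// at a time, so the match ends at the FIRST </b>.
preg_match('%<b>.+?</b>%', $html, $lazy);

echo $greedy[0], "\n"; // <b>one</b> and <b>two</b>
echo $lazy[0], "\n";   // <b>one</b>
```

Neither mode is safer by itself for filtering; they simply stop at different closing tags.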

That covers the difference between greedy and non-greedy matching. Next, let's talk about the backtracking that greedy and non-greedy quantifiers cause. Start with a small example.
The regular expression is \w*(\d+) and the string is cfc456n. What does $1 capture?

If your answer is 456, then sorry, that is wrong. The result is not 456 but 6. Do you know why?

CFC4N explains. When the regex engine matches \w*(\d+) against the string cfc456n, \w* goes first. Because it is greedy, \w* matches every character of cfc456n and then hands over to \d+, but nothing is left for \d+ to match. So \w* reluctantly gives back one character for \d+ to try; before giving it back, the engine records a position to return to, which is the backtracking point. \d+ tries to match n and fails, so it asks \w* to give back another character. \w* records another backtracking point and gives back one more character. Now \w* matches only cfc45 and has given back 6n. \d+ tries 6 and succeeds (the n after it is not a digit, so \d+ stops there), and the engine reports a successful match. That is why (\d+) captures 6, not 456.
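The result can be checked directly with preg_match. This is a minimal sketch; the variable names are chosen for the example.

```php
<?php
$str = 'cfc456n';

// Greedy \w* first swallows the whole string, then gives back one
// character at a time until \d+ can match at least one digit.
preg_match('/\w*(\d+)/', $str, $m);

echo $m[0], "\n"; // cfc456 (the whole match)
echo $m[1], "\n"; // 6      (group 1)
```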

Now change the regular expression above to \w*?(\d+) (note that it is now non-greedy) and keep the string cfc456n. What does $1 capture this time?
A student answers: the result is 456.
Yes, that is right, it is 456. CFC4N timidly asks: why is it 456?
Here is why.
Regular expressions have a rule that the quantifier gets the first attempt, so \w*? is the first to try matching the string cfc456n. Because \w*? is non-greedy, the engine lets it match only one character at a time and then hands control to the \d+ behind it to try the next character, recording a position so that it can return and try again if that attempt fails. That position is the backtracking point.

The quantifier on \w is *, meaning zero or more times, so it starts with zero: \w*? matches the empty string, records a backtracking point, and hands control to \d+. \d+ tries the first character of cfc456n, c, and fails, so control returns to \w*?, which matches that c. Because it is non-greedy, it takes only this one character, records a backtracking point, and hands control to \d+, which tries f and fails. Control returns to \w*?, which matches f and records a backtracking point; \d+ then tries the second c and fails. \w*? matches that c and records a backtracking point (at this point \w*? has matched cfc), then hands control to \d+.

\d+ tries 4 and succeeds. Its quantifier is +, meaning one or more times, so it keeps going: 5 matches, 6 matches, and the next character, n, does not, so \d+ gives up control. Nothing follows \d+ in the regular expression, so the whole match is declared successful: the overall match is cfc456 and the first group is 456. So, dear students, do you now see why the answer to the earlier question is 456?
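The walk-through above can be replayed with a small hand-rolled trace. This is only an illustration of the described steps, not how PCRE works internally; the loop, the $trace array and the use of \G with an offset are choices made for this sketch.

```php
<?php
$str = 'cfc456n';

// Replay the non-greedy steps by hand: \w*? starts with zero
// characters and grows by one each time \d+ fails.
$trace = [];
for ($len = 0; $len <= strlen($str); $len++) {
    $held = substr($str, 0, $len); // what \w*? has matched so far
    // Hand control to \d+ at the current position (\G anchors at $len).
    if (preg_match('/\G\d+/', $str, $d, 0, $len)) {
        $trace[] = "\\w*? holds '$held': \\d+ matches '{$d[0]}'";
        break;
    }
    $trace[] = "\\w*? holds '$held': \\d+ fails, go back to \\w*?";
    // \w*? may only grow if the next character is a word character.
    if (!preg_match('/\G\w/', $str, $w, 0, $len)) {
        break;
    }
}
echo implode("\n", $trace), "\n";

// The real engine agrees with the hand trace.
preg_match('/\w*?(\d+)/', $str, $m);
echo $m[1], "\n"; // 456
```

The trace has four entries: three failed attempts by \d+ (at c, f and c) and the final successful one at 4.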

So, do the examples above make the principles of greedy and non-greedy matching clear? Do you now know when to use greedy and when to use non-greedy matching on your own strings?
Brother Bird's article discusses the following regular expression and program:

$reg = "/<script>.*?<\/script>/is";
$str = "<script>********</script>"; // length greater than 100014
$ret = preg_replace($reg, "", $str); // returns NULL

This happens because of excessive backtracking: the engine backtracks so many times that it runs out of stack space, and preg_replace returns NULL instead of the processed string.
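When preg_replace returns NULL, preg_last_error() reports what went wrong. The sketch below lowers pcre.backtrack_limit and turns off the PCRE JIT purely so that a short string is enough to reproduce a failure; both settings and the 5000-character length are assumptions made for this demonstration, not values from Brother Bird's article.

```php
<?php
// Assumptions for the demo only: a tiny limit and no JIT, so that a
// short input is enough to make the engine give up.
ini_set('pcre.jit', '0');
ini_set('pcre.backtrack_limit', '1000');

$reg = '/<script>.*?<\/script>/is';
$str = '<script>' . str_repeat('*', 5000) . '</script>';

$ret = preg_replace($reg, '', $str);
$err = preg_last_error(); // read it before any other preg_* call

if ($ret === null && $err === PREG_BACKTRACK_LIMIT_ERROR) {
    echo "preg_replace gave up: backtrack limit exceeded\n";
}
```

Checking for NULL matters because an unchecked NULL is easy to mistake for an empty result.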

Let's look at another example.
The string is:

$str = '<script>123456</script>';

The regular expressions are:

$strRegex1 = '%<script>.+<\/script>%';
$strRegex2 = '%<script>.+?<\/script>%';
$strRegex3 = '%<script>(?:(?!<\/script>).)+<\/script>%';

How many times does each of these three regular expressions backtrack?
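As far as I know, PHP's preg_* functions do not report a backtrack count, so the sketch below only confirms that the three patterns match the same text on this input; where they differ is in how much work the engine does to get there. The array keys are labels invented for this example.

```php
<?php
$str = '<script>123456</script>';

// The three patterns from the text, keyed by made-up labels.
$patterns = [
    'greedy'    => '%<script>.+<\/script>%',
    'lazy'      => '%<script>.+?<\/script>%',
    'lookahead' => '%<script>(?:(?!<\/script>).)+<\/script>%',
];

$matches = [];
foreach ($patterns as $name => $re) {
    preg_match($re, $str, $m);
    $matches[$name] = $m[0]; // text matched by this pattern
    echo $name, ': ', $m[0], "\n";
}
```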

For the answer, see the next article: PHP regular expression efficiency: backtracking and atomic grouping
