This article covers regular expression optimization, mainly for the NFA-based regular expression engines in common use today; for background, see the earlier article on how regular expression matching works (regular expression matching principles). From the example above, we can infer that what chiefly affects the performance of NFA regular expressions (used by GNU Emacs, Java, grep, less, more, the .NET languages, the PCRE library, Perl, PHP, Python, Ruby, sed, and vi) is backtracking. Reducing the number of backtracks (that is, reducing repeated attempts over the same characters) is the main way to improve performance. Let's take a look at an example:
Source string: <script type="text/javascript">adsfadfsdasfsdafdsfsadfsa</script>
Requirement: match the entire <script ...>...</script> element, including the tags themselves.
Common approach (1): because <script may be followed by letters, whitespace, special symbols, and so on, and the JS code inside may itself contain various tags, the simple approach is:
Regular expression: <script.*?>.*?</script> (test tool used: RegexBuddy)
This takes 115 steps in total, with 48 backtracks. The reason is that we use ".", which by default matches every character except \n, so each lazy .*? has to test the token that follows it after every character it consumes.
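The remark that "." skips \n by default is easy to check. The following is a minimal sketch using Python's re module (the article itself used RegexBuddy, so this is only an illustration); the multi-line sample string is made up for the demonstration.

```python
import re

# Pattern from the text: lazy dots for the attributes and for the body.
pattern = r"<script.*?>.*?</script>"

one_line = '<script type="text/javascript">adsfadfsdasfsdafdsfsadfsa</script>'
# Hypothetical sample with newlines inside the script body.
multi_line = '<script type="text/javascript">\nvar a = 1;\n</script>'

# On a single line the pattern matches the whole element.
print(re.search(pattern, one_line).group(0) == one_line)    # True

# With a newline in the body, "." cannot cross it and the search fails.
print(re.search(pattern, multi_line))                        # None

# re.DOTALL makes "." match "\n" as well.
print(re.search(pattern, multi_line, re.DOTALL).group(0) == multi_line)  # True
```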
Method (2): looking at the structure, what follows <script can be any character except ">", and the JS content between the pair of <script> tags can be defined as any character except "<". (This is only to illustrate the optimization method; in real pages, "<" commonly appears inside script tags.)
Regular expression: <script[^?>]+>[^<]+</script>
19 steps and 0 backtracks! That is only about 16% of the original step count (19 of 115), a several-fold performance improvement.
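The two patterns can be compared side by side. This is a rough sketch in Python, which does not report step or backtrack counts the way RegexBuddy does, so it only checks that both patterns match the same text and prints an informal timing. The timing will vary by machine and engine and need not mirror the step counts quoted above. The with_lt string is a made-up sample illustrating the caveat about "<" inside script content.

```python
import re
import timeit

source = '<script type="text/javascript">adsfadfsdasfsdafdsfsadfsa</script>'

lazy_dot = re.compile(r"<script.*?>.*?</script>")        # method (1)
negated = re.compile(r"<script[^?>]+>[^<]+</script>")    # method (2)

# Both patterns match the whole element on the article's source string.
assert lazy_dot.search(source).group(0) == source
assert negated.search(source).group(0) == source

# Informal timing only; no particular ratio is guaranteed.
for name, rx in (("lazy dot", lazy_dot), ("negated class", negated)):
    seconds = timeit.timeit(lambda: rx.search(source), number=100_000)
    print(f"{name}: {seconds:.3f}s")

# Caveat from the text: a "<" inside the script body defeats method (2).
with_lt = '<script type="text/javascript">if (a < b) { run(); }</script>'
print(negated.search(with_lt))                # None
print(lazy_dot.search(with_lt) is not None)   # True
```

The narrower pattern is faster to reject and cheaper to match, but it is only correct when the input really cannot contain the excluded character.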
From the above we can see that different regular expressions matching the same characters can differ greatly in performance. Reducing backtracking is the best way to improve performance, and the main method is: "Use metacharacters with the smallest possible range, and avoid overly broad ones!" The general rules are as follows:
1. Use the correct boundary matchers (^, $, \b, \B, etc.) to limit where in the string the search takes place.
2. Use specific metacharacters and character classes (\d, \w, \s, etc.), and use "." less.
3. Use the correct quantifiers (+, *, ?, {n,m}); if the length of the match can be bounded, so much the better.
4. Use non-capturing groups (?:...) and atomic groups to avoid capturing text you do not need.
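The four rules above can be sketched in Python as follows. All sample strings and patterns here are made up for illustration. Atomic groups are left out of the code because Python's re module only supports them from version 3.11 onward.

```python
import re

# Rule 1: anchors and word boundaries restrict where a match may occur.
anchored = re.compile(r"^\d{4}-\d{2}-\d{2}$")
print(bool(anchored.search("2024-01-31")))         # True
print(bool(anchored.search("on 2024-01-31 ok")))   # False
print(re.findall(r"\bcat\b", "cat concat cat."))   # ['cat', 'cat']

# Rule 2: a specific class stops where the data stops; "." runs on.
print(re.search(r"id=(\d+)", "id=42;x").group(1))  # 42
print(re.search(r"id=(.+)", "id=42;x").group(1))   # 42;x

# Rule 3: a bounded quantifier rejects over-long input outright.
print(re.fullmatch(r"\d{1,3}", "123") is not None)  # True
print(re.fullmatch(r"\d{1,3}", "1234"))             # None

# Rule 4: (?:...) groups without storing a capture.
print(re.search(r"(?:ab)+(\d)", "abab1").groups())  # ('1',)
print(re.search(r"(ab)+(\d)", "abab1").groups())    # ('ab', '1')
```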
For example, suppose I want to match some English letters followed by digits, such as abc1234. I can write "\w+\d+", or I can write "[a-zA-Z]+\d+". In the first pattern, \w+ initially matches all of abc1234 and then has to backtrack so that \d+ can match. That takes 4 steps in total, while the second pattern needs only 2, cutting the steps in half! Well, that's all for today. Comments and discussion are welcome!
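As a footnote to the abc1234 example: the backtracking can be made visible by adding capture groups, which are not part of the original patterns and are used here only to show which characters each part ends up with. A minimal sketch in Python:

```python
import re

text = "abc1234"

# \w also matches digits, so \w+ first consumes the whole string and then
# gives back one character so that \d+ has something to match.
broad = re.match(r"(\w+)(\d+)", text)
print(broad.groups())    # ('abc123', '4')

# [a-zA-Z]+ stops at the first digit by itself; nothing is given back.
narrow = re.match(r"([a-zA-Z]+)(\d+)", text)
print(narrow.groups())   # ('abc', '1234')
```

Both patterns match the full string, but the split between the groups shows that the broad class overshot and had to retreat.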