Regular Expression Performance Optimization (Writing Efficient Regular Expressions)

Source: Internet
Author: User

This article covers regular expression optimization, mainly for the NFA-style engines in common use today; for background, see "Regular expression matching parsing process analysis (regular expression matching principle)". From the example in that article we can infer that what chiefly affects the performance of NFA-class regular expressions (common tools and languages: GNU Emacs, Java, grep, less, more, .NET languages, the PCRE library, Perl, PHP, Python, Ruby, sed, vi) is backtracking. Reducing the number of backtracks (that is, how often the engine loops back to re-examine the same characters) is the main way to improve performance. Let's take a look at an example:

Source string: <script type="text/javascript">adsfadfsdasfsdafdsfsadfsa</script>

Requirement: match the whole <script ...>...</script> element, that is, everything inside the tags plus the opening and closing tags themselves.

Common approach (1): attributes, whitespace, special symbols and so on may follow <script, and the JS code inside may itself contain all kinds of tags. So the simple approach is:

Regular expression: <script.*?>.*?</script> (test tool used: RegexBuddy)

This takes 115 steps in total, with 48 backtracks, because we used ".", which by default matches every character except \n.
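The remark that "." skips \n by default also affects correctness, not only speed. The following is a minimal sketch using Python's re module (a choice made here for illustration; the article's own measurements come from RegexBuddy), and the multi-line sample input is hypothetical:

```python
import re

# Hypothetical sample: a script element whose body spans several lines.
multiline = '<script type="text/javascript">\nvar a = 1;\n</script>'

pattern = r'<script.*?>.*?</script>'

# By default "." does not match "\n", so the pattern cannot cross the body.
default_match = re.search(pattern, multiline)

# With re.DOTALL, "." matches "\n" as well and the whole element is found.
dotall_match = re.search(pattern, multiline, re.DOTALL)

print(default_match is None)                # True
print(dotall_match.group(0) == multiline)   # True
```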
Method (2): looking at the structure more closely, what follows <script up to the closing bracket can be any character except ">", and the JS content between the pair of <script> tags can be defined as any character except "<". (This only illustrates the optimization technique; in real pages, "<" commonly appears inside script tags.)

Regular expression: <script[^>]+>[^<]+</script>

19 steps and 0 backtracks! That is roughly 17% of the original step count, so the match is several times cheaper.
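To check that the two approaches agree on the sample string, and to see the caveat about "<" in practice, here is a small sketch in Python. Python's re module does not report step or backtrack counts, so this only compares results; the negated-class pattern is written here as <script[^>]+>[^<]+</script>, and the second input string is a hypothetical example.

```python
import re

source = '<script type="text/javascript">adsfadfsdasfsdafdsfsadfsa</script>'

# Approach (1): lazy dot. Approach (2): negated character classes.
lazy_dot = re.compile(r'<script.*?>.*?</script>')
negated = re.compile(r'<script[^>]+>[^<]+</script>')

# Both patterns match the entire sample element.
lazy_full = lazy_dot.search(source).group(0) == source
negated_full = negated.search(source).group(0) == source

# Hypothetical script body containing "<": approach (2) no longer matches.
with_lt = '<script type="text/javascript">if (a < b) { run(); }</script>'
lazy_ok = lazy_dot.search(with_lt) is not None
negated_ok = negated.search(with_lt) is not None

print(lazy_full, negated_full)  # True True
print(lazy_ok, negated_ok)      # True False
```

The narrower pattern is cheaper because each character class can only stop in one place, but that same narrowness is why it rejects a body containing "<".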
From the above we can see that different regular expressions matching the same characters can differ greatly in performance. Reducing backtracking is the best way to improve it, and the main method is: "Use the metacharacter with the narrowest range that fits, and avoid overly broad metacharacters!" The general rules are as follows:

1. Use the correct boundary matchers (^, $, \b, \B, etc.) to restrict where in the string the search can start.
2. Use specific metacharacters and character classes (\d, \w, \s, etc.), and use "." sparingly.
3. Use the correct quantifier (+, *, ?, {n,m}); if the length of the match can be bounded, do so.
4. Use non-capturing groups (?:...) and atomic groups to avoid capturing text you do not need.
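The four rules above can be sketched in Python as follows. The sample text and variable names are hypothetical, chosen only for illustration. Atomic groups are left out of the code because support varies by engine (Python's re module added them in version 3.11).

```python
import re

text = "order 42 shipped; preorder 7 pending"

# Rule 1: \b pins the match to a word boundary, so "preorder 7" is skipped.
rule1 = re.findall(r'\border \d+', text)

# Rule 2: \d considers only digits, unlike "." which accepts almost anything.
rule2 = re.findall(r'\d+', text)

# Rule 3: a bounded quantifier rejects input that is too long.
rule3_short = re.fullmatch(r'\d{1,3}', '123') is not None
rule3_long = re.fullmatch(r'\d{1,3}', '1234') is not None

# Rule 4: a non-capturing group still groups the alternation,
# but the engine does not have to record its text.
capturing = re.search(r'(order|preorder) (\d+)', text).groups()
non_capturing = re.search(r'(?:order|preorder) (\d+)', text).groups()

print(rule1)                    # ['order 42']
print(rule2)                    # ['42', '7']
print(rule3_short, rule3_long)  # True False
print(capturing)                # ('order', '42')
print(non_capturing)            # ('42',)
```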

For example, suppose I want to match some English letters followed by a number, such as abc1234. I can write "\w+\d+" or "[a-zA-Z]+\d+". In the first version, \w+ initially consumes all of abc1234 and then has to backtrack so that \d+ can match, for a total of 4 steps. The second version needs only 2 steps, half as many! That's all for today; comments and discussion are welcome.
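For readers who want to see the backtracking in the abc1234 example for themselves, one indirect way is to add capture groups and inspect what each part ends up holding. This is a minimal sketch in Python; the capture groups are an addition for illustration and are not part of the patterns discussed above.

```python
import re

s = "abc1234"

# Greedy \w+ first takes all seven characters, then gives one back
# so that \d+ has something to match.
broad = re.match(r'(\w+)(\d+)', s).groups()

# [a-zA-Z]+ cannot consume digits, so it stops at "abc" with no give-back.
narrow = re.match(r'([a-zA-Z]+)(\d+)', s).groups()

print(broad)   # ('abc123', '4')
print(narrow)  # ('abc', '1234')
```

The split in the first result shows that \w+ overshot and had to retreat by one character, which is exactly the extra work the narrower class avoids.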
