Regular expression efficiency

Source: Internet
Author: User
Tags: repetition

If you write regular expressions purely to test your own skill or to pull off special tricks (such as using a regular expression to find prime numbers or solve linear equations), efficiency is not a concern. If an expression will run only once or twice, or a few dozen times, optimization makes little difference either. But if the expression will run millions or tens of millions of times, efficiency becomes a serious issue.

For ease of discussion, two concepts are defined first.
False match: the expression matches more than it should. Some text clearly does not meet the requirement, yet the regular expression still "hits" it. For example, if you use \d{11} to match an 11-digit mobile phone number, \d{11} matches not only valid phone numbers but also strings such as 98765432100 that are obviously not phone numbers. We call this a false match.
Missed match: the expression is too narrow. Some text is genuinely wanted, but the regular expression does not cover it. For example, if you use \d{18} to match an 18-digit ID number, you miss the numbers that end in the letter x.
A regular expression may produce only false matches (the condition is too loose, so its scope is wider than the target text), only missed matches (it describes just one of the several forms the target text can take), or both. For example, using \w+\.com to match domain names ending in .com falsely matches a string such as abc_.com (a legal domain name cannot contain an underscore, but \w matches the underscore) and misses a domain name such as ab-c.com (a legal domain name can contain a hyphen, but \w does not match the hyphen).
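The two concepts can be checked directly. The following is a minimal Python sketch using the three patterns from the text; the test strings other than 98765432100, abc_.com and ab-c.com are made-up placeholders, and fullmatch is used on the assumption that the whole string is supposed to be a phone number, ID number, or domain name.

```python
import re

# The three example patterns from the text.
phone = re.compile(r"\d{11}")
id_number = re.compile(r"\d{18}")
domain = re.compile(r"\w+\.com")

# False match: 98765432100 is not a phone number, yet \d{11} accepts it.
print(phone.fullmatch("98765432100") is not None)            # True

# Missed match: an 18-character ID ending in x is rejected by \d{18}.
# (17 digits followed by x; a placeholder, not a real ID number.)
print(id_number.fullmatch("12345678901234567x") is not None)  # False

# Both problems in one pattern:
print(domain.fullmatch("abc_.com") is not None)  # True  (false match)
print(domain.fullmatch("ab-c.com") is not None)  # False (missed match)
```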
A precise regular expression produces neither false matches nor missed matches. In reality, of course, you can often see only a limited sample of text and must write rules based on it, yet those rules will then be applied to a huge volume of text. In that situation the goal is to eliminate false matches and missed matches as far as possible (even if not completely) and to improve running efficiency. The experience presented in this article is aimed mainly at that situation.
Master the syntax details. Regular expression syntax is broadly the same across languages, but the details differ. Knowing the details of the regex syntax of the language you use is the foundation for writing correct and efficient regular expressions. For example, in Perl the range matched by \w is equivalent to [a-zA-Z0-9_] for ASCII text. Perl does not support variable-length repetition inside a positive lookbehind (for example (?<=.*)ABC), whereas .NET does. JavaScript did not support lookbehind at all (for example (?<=ab)c) until ECMAScript 2018 added it, while Perl and Python support fixed-width lookbehind. Chapter 3 of Mastering Regular Expressions, "Overview of Regular Expression Features and Flavors", clearly lists the similarities and differences among the major flavors, and shorter comparisons of several common languages and tools are also available. Whatever you use, you should at least learn the regex syntax details of your working language thoroughly.
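As an illustration of how such details show up in practice, here is a small Python sketch. It only demonstrates the behaviour of Python's own re module; other languages behave differently, which is exactly the point. The helper name compiles is made up for this example.

```python
import re

# In Python 3, \w in a str pattern is Unicode-aware by default, so it
# matches more than [a-zA-Z0-9_] unless the re.ASCII flag is given.
unicode_word = re.compile(r"\w+")
ascii_word = re.compile(r"\w+", re.ASCII)

print(unicode_word.fullmatch("café") is not None)  # True
print(ascii_word.fullmatch("café") is not None)    # False

# Fixed-width lookbehind is accepted by Python's re module ...
fixed = re.compile(r"(?<=ab)c")
print(fixed.search("abc").group())  # c


# ... but variable-width lookbehind is rejected when the pattern is compiled.
def compiles(pattern: str) -> bool:
    """Return True if the re module accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(compiles(r"(?<=.*)ABC"))  # False
```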
Coarse first, then fine; add first, then subtract. When you use regex syntax to describe the target text, you can sketch the overall frame first and then refine the details step by step. Take the mobile phone number again: first define it as \d{11}, which cannot miss anything; then refine it to 1[358]\d{9}, which is a big step forward (whether the second digit is really limited to 3, 5, or 8 is not examined here; the example only illustrates the process of gradual refinement). The purpose is to eliminate missed matches first (start by matching as much as possible, which is the addition) and then eliminate false matches bit by bit (the subtraction). Working in this order makes it harder to overlook a case, so you move steadily toward the goal of "no false matches, no missed matches".
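The refinement step can be sketched in Python as follows. The sample numbers are invented for illustration, and the variable names are not from the original text.

```python
import re

coarse = re.compile(r"\d{11}")        # step 1: addition, so nothing is missed
refined = re.compile(r"1[358]\d{9}")  # step 2: subtraction, trimming false matches

# Made-up 11-digit strings; only the first two start with 13 or 15.
samples = ["13812345678", "15098765432", "98765432100", "12012345678"]

coarse_hits = [s for s in samples if coarse.fullmatch(s)]
refined_hits = [s for s in samples if refined.fullmatch(s)]

print(len(coarse_hits))  # 4
print(refined_hits)      # ['13812345678', '15098765432']
```

Every string accepted by the refined pattern is also accepted by the coarse one, so the refinement only removes candidates and cannot introduce a new missed match relative to the first step.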
Leave some leeway. The text you can see is limited, while the text the expression will run against is massive and, for now, unseen. In such a situation you should step outside the circle of the text you can see, think more broadly, and exercise some "strategic foresight" when writing the expression. For example, you often receive spam text messages such as "发*票" and "发#漂" (disguised spellings of 发票, "invoice"). If you want to write a rule to block such annoying messages, you should not only write an expression that matches the current text, 发[*#](?:票|漂), but also anticipate the "variants" that may appear, with something like 发.(?:票|漂|飘). Specific domains may have their own targeted rules for this, so I will not go into detail. The purpose is to eliminate missed matches and extend the life cycle of the regular expression.
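A rough Python sketch of the spam example follows. It assumes the spam keyword is 发票 ("invoice") with a filler symbol inserted, and that the two patterns read 发[*#](?:票|漂) and 发.(?:票|漂|飘); the "future variant" strings are hypothetical.

```python
import re

# Matches only the variants seen so far.
narrow = re.compile(r"发[*#](?:票|漂)")
# Leaves leeway: any single filler character, one more look-alike ending.
broad = re.compile(r"发.(?:票|漂|飘)")

seen = ["发*票", "发#漂"]
future = ["发@票", "发-飘"]  # hypothetical variants not seen yet

print([bool(narrow.search(s)) for s in seen + future])
# [True, True, False, False]
print([bool(broad.search(s)) for s in seen + future])
# [True, True, True, True]
```

The broader pattern trades a small risk of false matches for a longer useful life, which is the balance this tip is about.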
Be explicit. Be careful with metacharacters such as the dot, and avoid unbounded quantifiers such as the asterisk and the plus sign as far as possible. If you can determine the range, for example \w, do not use the dot; if you can predict the number of repetitions, do not use an unbounded quantifier. For example, suppose you write a script to extract Twitter messages, and the body of a message in the XML is structured as <span class="MSG">...</span> with no angle brackets in the body. Then <span class="MSG">[^<]{1,480}</span> is a better approach than <span class="MSG">.*</span>, for two reasons: first, [^<] guarantees that the matched text cannot run past the next less-than sign; second, the explicit length range {1,480} is based on the approximate length a Twitter message can have. Whether 480 is the right length is open to debate, but the way of thinking is worth borrowing. To put it more strongly, "abusing the dot, the asterisk, and the plus sign is wasteful and irresponsible".
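The difference is easy to see on a line that contains more than one message. The sketch below is in Python; the page fragment is invented, and a capturing group is added to each pattern only so that the extracted body can be printed.

```python
import re

# Hypothetical fragment with two messages on the same line.
page = '<span class="MSG">first</span><b>x</b><span class="MSG">second</span>'

greedy = re.compile(r'<span class="MSG">(.*)</span>')
explicit = re.compile(r'<span class="MSG">([^<]{1,480})</span>')

# The greedy dot runs on to the last </span> on the line.
print(greedy.findall(page))
# ['first</span><b>x</b><span class="MSG">second']

# The explicit class stops at the next less-than sign.
print(explicit.findall(page))
# ['first', 'second']
```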
Don't let the last straw break the camel's back. Every time you use ordinary capturing parentheses () instead of non-capturing parentheses (?:...), some memory is set aside in case you come back for the captured text. Run such an expression an unlimited number of times and it is like piling on straw after straw until the camel is finally crushed. Get into the habit of using (?:...) wherever you do not need the capture.
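A small Python sketch of the difference follows. The file-name pattern is an invented example, and the sketch only shows that non-capturing groups store nothing; how much time or memory this saves depends on the regex engine and is not measured here.

```python
import re

# Capturing parentheses: the engine records two substrings per match.
with_capture = re.compile(r"(\w+)\.(jpg|png|gif)")
# Non-capturing parentheses: grouping for the alternation only.
without_capture = re.compile(r"\w+\.(?:jpg|png|gif)")

print(with_capture.groups)     # 2
print(without_capture.groups)  # 0

print(with_capture.fullmatch("photo.png").groups())     # ('photo', 'png')
print(without_capture.fullmatch("photo.png").groups())  # ()
print(without_capture.fullmatch("photo.png").group())   # photo.png
```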
Prefer simple over complex. Splitting a complex regular expression into two or more simple ones both reduces the difficulty of writing it and improves running efficiency. For example, the substitution s/^\s+|\s+$//g, used to strip whitespace from the beginning and end of a line, is in theory less efficient than s/^\s+//g; s/\s+$//g;. The example comes from chapter 5 of Mastering Regular Expressions, which comments that it is "almost always the fastest, and obviously the easiest to understand". Fast and easy to understand, so why not? At work we also have other reasons to split an expression of the form C = (A|B) into two expressions A and B that are executed separately. For example, although a hit by either A or B on the required text pattern counts as a successful match, if one sub-expression (say A) produces false matches, the overall accuracy of C suffers because of A, no matter how efficient and precise the other sub-expression (say B) is.
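The whitespace example translates to Python roughly as follows. The function names are made up for this sketch, and it only shows that the two forms give the same result; relative speed varies by engine and input, so it is best measured (for instance with the standard timeit module) rather than assumed.

```python
import re

LEADING = re.compile(r"^\s+")
TRAILING = re.compile(r"\s+$")
BOTH = re.compile(r"^\s+|\s+$")


def trim_combined(text: str) -> str:
    """One pass with an alternation, like s/^\\s+|\\s+$//g."""
    return BOTH.sub("", text)


def trim_split(text: str) -> str:
    """Two simple passes, like s/^\\s+//; s/\\s+$//."""
    return TRAILING.sub("", LEADING.sub("", text))


sample = "  hello world \t\n"
print(repr(trim_combined(sample)))  # 'hello world'
print(repr(trim_split(sample)))     # 'hello world'
```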
Anchor cleverly. Sometimes we need to match "the" as a word (with a space on each side), not as the ordered sequence t-h-e inside another word (for example, the "the" in "together"). Using anchors such as ^, $, and \b in the right places effectively improves the efficiency of finding successful matches and rejecting unsuccessful ones.
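A short Python sketch of the effect of \b on the "the" example; the sample sentence is invented.

```python
import re

text = "the cat and the dog stay together in the theatre"

loose = re.compile(r"the")         # also hits "together" and "theatre"
anchored = re.compile(r"\bthe\b")  # whole word only

print(len(loose.findall(text)))     # 5
print(len(anchored.findall(text)))  # 3
```

Besides removing the two false matches, the anchored pattern lets the engine give up on a candidate position as soon as the word-boundary check fails.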

The above sums up some experience in improving the efficiency of regular expressions (learned at work, from books, and from my own practice), collected here. If you have other experience that is not mentioned here, you are welcome to discuss it.
