Efficiency is not a problem if you write a regular expression purely as a challenge, to pull off some special effect (for example, using regular expressions to compute prime numbers or solve linear equations); and if an expression only needs to run once or twice, or a few dozen times, the difference made by optimization is not great. However, if the regular expression you write will run millions of times, efficiency becomes a big problem.
To make the discussion easier, let us first define two concepts.
Error match: the expression matches more than is required; some text that does not meet the requirements is still "hit" by the pattern. For example, if you use \d{11} to match an 11-digit mobile phone number, \d{11} will match not only valid phone numbers but also strings such as 98765432100, which is clearly not a phone number. We call this an error match.
Leak match: the expression matches too narrowly; some text is exactly what is needed, but the pattern fails to cover it. For example, using \d{18} to match an 18-character ID number will miss the ones that end with the letter X.
A regular expression you write may produce only error matches (the conditions are too loose, so its scope is wider than the target text), only leak matches (it describes just one of several cases in the target text), or both at once. For example, using \w+\.com to match domain names ending in .com will both mistakenly match strings like abc_.com (legal domain names do not contain underscores, but \w does include the underscore) and miss domain names like ab-c.com (legal domain names may contain hyphens, but \w does not match the hyphen).
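Both defects of the \w+\.com example above are easy to demonstrate. A minimal sketch in Python, used here only as a convenient flavor to test with:

```python
import re

domain = re.compile(r"\w+\.com")

# Error match: \w includes "_", so an illegal underscore domain is accepted.
assert domain.search("abc_.com") is not None
# Leak match: \w does not include "-", so a legal hyphenated domain is missed.
assert domain.fullmatch("ab-c.com") is None
```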
A precise regular expression is one with no error matches and no leak matches. There is, of course, a common situation in the real world: you can only see a limited sample of text, you write rules based on that sample, and those rules are then applied to a huge body of text. In that situation, our goal is to reduce (if not completely eliminate) error matches and leak matches as far as possible, and to improve running efficiency. The experience presented in this article is mainly aimed at that situation.
Master the grammatical details. Regular expressions have roughly the same syntax across languages, but the details differ. Knowing the details of the regex flavor of the language you use is the basis for writing correct and efficient regular expressions. For example, in Perl the range matched by \w is equivalent to [A-Za-z0-9_]; Perl's regexes do not support variable repetition inside lookbehind (for example (?<=.*)abc), but the .NET flavor does support this feature; and JavaScript does not support lookbehind at all (for example (?<=ab)c), while Perl and Python do. Chapter 3 of Mastering Regular Expressions, which surveys the features of the major flavors, clearly lists their similarities and differences, and briefly covers several commonly used languages and tools. At a minimum, you should study the regex details of the language you actually work in.
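These flavor differences are easy to verify for yourself. A small sketch in Python, whose re module (like Perl) allows only fixed-width lookbehind:

```python
import re

# Fixed-width lookbehind is supported by Python's re module:
assert re.search(r"(?<=ab)c", "abc").group() == "c"

# Variable repetition inside lookbehind (allowed in .NET) is rejected:
compiled = False
try:
    re.compile(r"(?<=.*)abc")
    compiled = True
except re.error:
    pass  # re raises "look-behind requires fixed-width pattern"
assert not compiled
```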
First coarse, then fine; first add, then subtract. When using regular expression syntax to describe and delimit the target text, you can sketch the frame in broad strokes first and then work through the details step by step. Take the mobile phone number example again: first write \d{11}, which is never wrong as a starting point, and then refine it to 1[358]\d{9}, a big step forward (whether the second digit can only be 3, 5, or 8 is not the point here; the example only illustrates the process of gradual refinement). The goal of the first step is to eliminate leak matches (start by matching as broadly as possible, that is, addition), and the goal of the second is to eliminate error matches bit by bit (subtraction). Working in this order, it is harder to make mistakes along the way, and easier to reach the goal of "no error matches, no leak matches."
Leave room. The text samples you can see are finite, but the text to be matched is massive and, for now, invisible. In such situations, when writing regular expressions you need to step outside the circle of the text you can see, open up your thinking, and be "strategically forward-looking." For example, you may often receive spam messages such as "发*票" or "发#漂" (disguised variants of 发票, "invoice"). If you want to write rules to block such annoying spam, you should be able not only to write 发[*#](?:票|漂), which matches the text currently seen, but also to anticipate "variants" that may appear later, such as 发.(?:票|漂|飘|瓢). This may only apply to targeted rules in specific domains, so I will not go into more detail. The purpose is to eliminate leak matches and prolong the life cycle of the regular expression.
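A hypothetical sketch of the two filter rules above, showing how the forward-looking version survives a variant the narrow one has never seen:

```python
import re

narrow = re.compile(r"发[*#](?:票|漂)")     # covers only the samples seen so far
broad  = re.compile(r"发.(?:票|漂|飘|瓢)")  # anticipates new separators and character variants

assert narrow.search("发*票")        # both rules catch the known sample
assert not narrow.search("发-飘")    # a new variant slips past the narrow rule
assert broad.search("发-飘")         # the forward-looking rule still catches it
```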
Be specific. Concretely: use characters like the dot with caution, and avoid arbitrary quantifiers such as the asterisk and the plus sign wherever possible. As long as you can determine the character range, such as \w, do not use the dot; as long as you can predict the number of repetitions, do not use arbitrary quantifiers. For example, suppose you are writing a script to analyze Twitter messages, the XML body of a message has the form <span class="msg">...</span>, and the body contains no angle brackets. Then <span class="msg">[^<]{1,480}</span> reflects better thinking than <span class="msg">.*</span>, for two reasons: first, using [^<] guarantees that the match cannot run past the next "<"; second, the length range {1,480} is based on the approximate maximum length of a Twitter message. Of course, whether 480 is the right bound can still be debated, but this way of thinking is worth borrowing. To put it bluntly, "abusing dots, asterisks, and plus signs is not environmentally friendly, and is irresponsible."
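The difference between the two patterns shows up as soon as two messages appear in the same document. A small sketch, assuming the <span class="msg"> markup described above:

```python
import re

html = '<span class="msg">hello world</span><span class="msg">second</span>'

greedy   = re.compile(r'<span class="msg">.*</span>')
specific = re.compile(r'<span class="msg">[^<]{1,480}</span>')

# The greedy dot runs past the first closing tag and swallows both spans:
assert greedy.search(html).group().count("</span>") == 2
# [^<] cannot cross a "<", so each match stays inside a single span:
assert specific.search(html).group() == '<span class="msg">hello world</span>'
```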
Don't let a straw crush the camel. Every time you use a capturing group () instead of a non-capturing group (?:...), a piece of memory is set aside, waiting for you to come back and read it. If such a regular expression runs an enormous number of times, it is like piling up straws until the camel is finally crushed. Get into the habit of using (?:...) groups whenever you do not need the capture.
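The difference is visible in what each match object carries around afterwards. A minimal illustration in Python:

```python
import re

capturing     = re.compile(r"(\d{4})-(\d{2})-(\d{2})")
non_capturing = re.compile(r"(?:\d{4})-(?:\d{2})-(?:\d{2})")

# Capturing groups keep three substrings around after every match:
assert capturing.search("2024-05-01").groups() == ("2024", "05", "01")
# Non-capturing groups save nothing that will not be used:
assert non_capturing.search("2024-05-01").groups() == ()
```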
Prefer simple over complex. Splitting one complex regular expression into two or more simple ones reduces the difficulty of writing them and improves running efficiency. For example, the expression s/^\s+|\s+$//g, used to strip whitespace from the beginning and end of a line, is in theory less efficient than s/^\s+//; s/\s+$//;. This example comes from Chapter 5 of Mastering Regular Expressions, which notes that the two-step version is "almost always fastest, and clearly the easiest to understand." Fast and easy to understand: why not? There is another reason to split a combined expression c = (a|b) into two expressions a and b executed separately: even if both a and b can hit the desired text, a single subexpression (say a) that produces error matches will drag down the overall accuracy of c, no matter how efficient and precise the other subexpression (say b) is.
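The Perl substitutions above translate directly into Python; both forms produce the same result, and the split version is the one the book recommends:

```python
import re

s = "   hello world   "

# One combined pass with alternation:
combined = re.sub(r"^\s+|\s+$", "", s)
# Two simple anchored passes, the form recommended in Mastering Regular Expressions:
two_pass = re.sub(r"\s+$", "", re.sub(r"^\s+", "", s))

assert combined == two_pass == "hello world"
```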
Position cleverly. Sometimes we need to match the only when it stands alone as the word "the" (with boundaries on both sides), not as part of another word (for example, the t-h-e inside together). Using anchors such as ^, $, and \b at the right time can effectively improve the efficiency of finding successful matches and of ruling out unsuccessful ones.
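The "the" example can be checked directly; without word-boundary anchors, the pattern also fires inside other words:

```python
import re

text = "They worked together, then the team left."

unanchored = re.findall(r"the", text, flags=re.IGNORECASE)
anchored   = re.findall(r"\bthe\b", text, flags=re.IGNORECASE)

assert len(unanchored) == 4   # also hits inside "They", "together", and "then"
assert anchored == ["the"]    # word boundaries keep only the standalone word
```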
The above is a summary of my experience in improving the precision and running efficiency of regular expressions (learned at work, from books, and from my own practice), organized here. If you have other experience that is not mentioned here, you are welcome to discuss it.