Original address: http://iregex.org/blog/regex-optimizing.html
If you write regular expressions only to stretch your skills and produce special effects (say, testing primality or solving linear equations with a regex), efficiency is not a concern: the expression runs a handful of times, and optimization makes little difference. But if a regular expression will run millions or tens of millions of times, efficiency becomes a serious problem. I have collected several rules of thumb for improving the running efficiency of regular expressions — some learned at work, some from books, some from my own experience — and post them here. If you have experience not mentioned here, please feel free to share it.
For ease of exposition, let me first define two concepts.
- Mismatching: the regular expression matches more than it should; some text clearly does not meet the requirement, yet is still "hit" by the expression. For example, if `\d{11}` is used to match 11-digit mobile phone numbers, it matches not only valid phone numbers but also strings such as 98765432100, which is obviously not a mobile phone number. We call such a match a mismatch (false match).
- Miss matching: the regular expression matches less than it should; some text is required, but the expression does not cover that case. For example, using `\d{18}` to match 18-digit ID card numbers will miss the numbers that end with the letter X.
A regular expression may produce only mismatches (the rule is far too loose, covering more than the target text), only misses (it covers just one of the several cases in the target text), or both at once. For example, `\w+\.com` is meant to match domain names ending in .com, but it will mismatch strings containing an underscore (valid domain names do not contain underscores, yet `\w` matches the underscore), and it will miss domains such as ab-c.com (valid domain names may contain hyphens, but `\w` does not match the hyphen).
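A minimal Python sketch of both failure modes, using the domain example above:

```python
import re

# \w+\.com is meant to match domain names ending in .com.
pattern = re.compile(r"\w+\.com")

# Mismatch: \w includes the underscore, so an invalid name is accepted.
bad_hit = pattern.fullmatch("ab_c.com")    # matches, though underscores are invalid

# Miss: \w excludes the hyphen, so a valid name is rejected.
good_miss = pattern.fullmatch("ab-c.com")  # None, though ab-c.com is valid

print(bad_hit is not None, good_miss is None)  # True True
```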
A precise regular expression produces neither mismatches nor misses. In reality, of course, we can often see only a limited sample of text, write rules based on that sample, and then apply those rules to a sea of text. In that situation, eliminating mismatches and misses as far as possible (completely, if we can) while improving running efficiency is the goal, and the experience presented in this article is aimed mainly at it.
- Know the syntax details. Regular expressions share a common core syntax across languages, but the details differ, and understanding the details of the flavor used by your language is the basis for writing correct, efficient expressions. For example, in Perl `\w` is equivalent to the character class [a-zA-Z0-9_]; the Perl flavor does not support variable-length lookbehind (e.g. `(?<=.*)abc`), while the .NET flavor does; JavaScript does not support lookbehind at all (e.g. `(?<=ab)c`), while Perl and Python do. The "Overview of Regular Expression Features and Flavors" chapter of Mastering Regular Expressions clearly lists the similarities and differences among the major flavors, and this article has also briefly compared common languages and tools. As a practical minimum, you should know in detail the regex syntax of the language you work in.
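These flavor differences are easy to check from code. A small Python sketch — Python's `re` module supports fixed-width lookbehind but, like Perl, rejects variable-length lookbehind:

```python
import re

# Fixed-width lookbehind is supported: match "c" only when preceded by "ab".
m = re.search(r"(?<=ab)c", "abc")
print(m.group())  # c

# Variable-length lookbehind is not: compiling (?<=.*)abc raises re.error.
try:
    re.compile(r"(?<=.*)abc")
    variable_lookbehind_ok = True
except re.error:
    variable_lookbehind_ok = False
print(variable_lookbehind_ok)  # False
```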
- Rough first, then refined; add, then subtract. A regular expression describes and defines the target text. As with a sketch, outline the frame first, then fill in the details step by step. Take the mobile phone number example again: start with `\d{11}`, which is not wrong as far as it goes; then refine it to `1[358]\d{9}`, which is a big step forward (whether the second digit is exactly 3, 5, or 8 is not the point here; the example only illustrates stepwise refinement). The idea is to eliminate misses first (match as broadly as possible at the start: addition), then eliminate mismatches bit by bit (subtraction). Working this way keeps the thinking clear and moves steadily toward the goal of "no mismatches, no misses".
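The refinement step can be checked directly. A minimal Python sketch (the sample phone number is made up for illustration):

```python
import re

rough = re.compile(r"\d{11}")         # step 1: any 11 digits (no misses)
refined = re.compile(r"1[358]\d{9}")  # step 2: subtract mismatches

# The rough pattern still mismatches an obvious non-phone-number...
print(rough.fullmatch("98765432100") is not None)    # True
# ...while the refined pattern rejects it and keeps plausible numbers.
print(refined.fullmatch("98765432100") is not None)  # False
print(refined.fullmatch("13812345678") is not None)  # True
```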
- Leave room. As noted above, the text samples we can see are limited, while the text to be matched is massive and, for the moment, invisible. When writing a regular expression in this situation, step outside the circle of text you can see, open up your thinking, and show some "strategic foresight". For example, you often receive spam messages such as "发*票" and "发#漂" (obfuscated "invoice-selling" spam). If you want a rule to block such annoying messages, you should not only write an expression that matches the text seen so far, such as 发[*#](?:票|漂), but also anticipate possible variants, such as 发.(?:票|漂). What counts as a plausible variant depends on the specific domain; the aim is to eliminate misses and prolong the useful life of the expression.
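In Python, the blocking rule and its forward-looking variant might look like the following sketch (the spam strings are the ones from the example above; "发@票" stands in for a not-yet-seen variant):

```python
import re

spam = ["发*票", "发#漂", "发@票"]  # the last one has not been seen yet

strict = re.compile(r"发[*#](?:票|漂)")  # matches only the text seen so far
loose = re.compile(r"发.(?:票|漂)")      # leaves room for other separators

print([bool(strict.search(s)) for s in spam])  # [True, True, False]
print([bool(loose.search(s)) for s in spam])   # [True, True, True]
```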
- Be specific. Concretely: use the dot metacharacter with caution, and avoid open-ended quantifiers such as * and + whenever you can. When the character range can be determined, use a class such as `\w` instead of the dot; when the number of repetitions can be predicted, avoid open-ended quantifiers. For example, suppose you write a script to extract Twitter messages, and the XML body of a message has the structure `<span class="msg">...</span>`, with no angle brackets inside the body. Then `<span class="msg">[^<]{1,480}</span>` is a better-conceived pattern than `<span class="msg">.*</span>`, for two reasons: first, `[^<]` guarantees the match cannot run past the next opening angle bracket; second, the bound {1,480} reflects the range of lengths a Twitter message can have. Whether 480 is exactly right is debatable, but the approach is worth learning. To put it bluntly: "careless dots, asterisks, and plus signs are wasteful and irresponsible".
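A short Python sketch of why the specific pattern behaves better once a page contains more than one message (the sample HTML is made up):

```python
import re

html = '<span class="msg">hello</span><span class="msg">world</span>'

# The greedy dot runs past the first </span> all the way to the last one,
# swallowing both messages in a single match.
greedy = re.search(r'<span class="msg">.*</span>', html)
print(greedy.group() == html)  # True: one match spanning the whole string

# [^<] cannot cross an angle bracket, so each message is matched separately.
specific = re.findall(r'<span class="msg">[^<]{1,480}</span>', html)
print(len(specific))  # 2
```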
- Don't load straw onto the camel. Every time you use capturing parentheses `(...)` instead of non-capturing parentheses `(?:...)`, a piece of memory is set aside so the captured text can be accessed again later. Run such an expression an unbounded number of times and you are piling straw upon straw until the camel's back finally breaks. Get into the habit of using `(?:...)` wherever the parentheses exist only for grouping.
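A minimal Python sketch of the difference:

```python
import re

# A capturing group stores the last text it matched so you can read it back...
capturing = re.match(r"(ab)+c", "ababc")
print(capturing.groups())      # ('ab',) -- memory kept for the capture

# ...while a non-capturing group keeps nothing, which is all we need
# when the parentheses are there only for grouping.
non_capturing = re.match(r"(?:ab)+c", "ababc")
print(non_capturing.groups())  # ()
```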
- Prefer simple to complex. Splitting a complex regular expression into two or more simple ones reduces both the difficulty of writing it and the cost of running it. For example, the one-pass expression for trimming whitespace from both ends of a line, `s/^\s+|\s+$//g;`, is in theory slower than the two-pass version `s/^\s+//; s/\s+$//;`. This example comes from Chapter 5 of Mastering Regular Expressions, where the two-pass version is described as "almost always fastest, and obviously the easiest to understand". Fast and easy to understand — why not? There is another reason to split a regular expression of the form `c = (a|b)` into the two expressions a and b: it isolates faults. If one alternative (say a) produces mismatches, then no matter how efficient and precise the other subexpression (say b) is, the overall accuracy of c still suffers because of a.
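The trimming example, transcribed from the Perl substitutions above into Python (both forms produce the same result; the performance claim is the book's, not verified here):

```python
import re

s = "  hello world  "

# One complex pass: the alternation makes the engine consider both branches.
one_pass = re.sub(r"^\s+|\s+$", "", s)

# Two simple passes: each substitution is trivial on its own.
two_pass = re.sub(r"\s+$", "", re.sub(r"^\s+", "", s))

print(one_pass == two_pass == "hello world")  # True
```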
- Position cleverly. Sometimes what we need to match is the word "the" (with a boundary on each side), not merely the letters t-h-e in order (which also appear in, say, "together"). Using the anchors `^`, `$`, and `\b` where appropriate helps the engine confirm a successful match, or abandon a failing one, much more quickly.
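A minimal Python sketch of the word-boundary anchor:

```python
import re

# \b pins the match to word boundaries, so the t-h-e inside "together"
# is rejected instead of being reported as a hit.
word = re.compile(r"\bthe\b")

in_sentence = word.search("put it in the box")  # the standalone word
in_together = word.search("we sing together")   # only t-h-e inside a word

print(in_sentence is not None, in_together is None)  # True True
```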
Conclusion: Mastering Regular Expressions summarizes the common optimization techniques far more systematically than I have here. My overall impression from reading it, however, remained superficial and soon faded; it would have been much better to digest the book's material systematically.