Preface:
As mentioned in the abstract, regular expressions are a large body of knowledge. They are not just a simple metacharacter table, and they cannot be explained in a few words.
Someone once commented: "... if there have been a few great things in the history of computing, the regular expression is regarded as one of them, alongside the web, Lisp, hash algorithms, UNIX, the relational model, and object orientation, and there are definitely no more than 20 such things..."
This may not be enough to catch your attention. You have heard of regular expressions, and you can read one with a metacharacter table at hand, but you rarely use them in actual development...
This is indeed the case. Are regular expressions still alive? Where did they go?
The answer is that regular expressions have permeated our programming languages, operating systems, and applications. For example, many high-level languages provide string functions such as find(), and many operating systems provide file content search commands (such as grep on Linux), all of which are related to regular expressions.
So, since regular expressions have "disappeared" (been absorbed), do we still need to learn them? Of course. Regular expressions are a technique, and understanding a technique is much more important than mastering a tool.
-------
Contents
1. How regular expressions work
2. Regular expression engines
3. Lookaround
4. Backtracking
5. Regular expression optimization
6. How to write efficient regular expressions?
7. Common pitfalls
8. Summary
9. Tables [metacharacter table] [mode modifier table] [special metacharacter table]
-------
I. How Regular Expressions Work
Applying a regular expression to a target string proceeds as follows:
1. Regular expression compilation
The regular expression syntax is checked. If it is correct, the expression is compiled into an internal form.
2. Transmission begins
The transmission "positions" the regex engine at the start of the target string.
P.S. A brief explanation of the "transmission": it is a mechanism inside the regex engine. For example, when [abc] is applied to the string family, the first attempt, at f, fails; the next attempt, at a in the second position, succeeds. Note: who controls this position-by-position processing (try the first position, then the second after a failure, ...)? That's right, it is the so-called transmission.
3. Component tests
The regex engine tries to match the regular expression against the text. It does not only advance character by character, it also backtracks (backtracking is a key topic and is explained in detail later).
4. Finding a match
The engine determines whether the match succeeds or fails. The exact process depends on the type of regex engine: for example, it may return success as soon as it finds the first string that matches, or it may find the first qualifying string, keep searching, and return the longest qualifying string.
5. Transmission bump-along
If no match is found at the current position, the transmission drives the engine to start a new round of attempts from the next character after the current position.
6. Overall match failure
If the transmission has driven the engine all the way to the end of the target string without finding a match, the overall match fails. (The description is this laborious in order to stay close to the internal principles.)
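The six steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any real engine is implemented: the function name search_with_bump_along is hypothetical, and the anchored attempt at each position is delegated to the standard re module.

```python
import re

def search_with_bump_along(pattern, text):
    """Hypothetical helper: try the pattern at each start position in turn."""
    compiled = re.compile(pattern)           # step 1: compile once
    for start in range(len(text) + 1):       # steps 2 and 5: position, then bump along
        m = compiled.match(text, start)      # step 3: one attempt anchored at `start`
        if m is not None:
            return start, m.group()          # step 4: first successful attempt wins
    return None                              # step 6: every start position failed

result = search_with_bump_along(r"[abc]", "family")
print(result)  # (1, 'a')
```

The attempt at f (position 0) fails, the loop bumps along to position 1, and the attempt at a succeeds, which mirrors the [abc] against family example.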
II. Regular Expression Engines
The so-called regex engine type is really a classification. As mentioned earlier, regular expressions are a technique that anyone can use to solve problems, and approaches to solving problems vary. In other words, implementations of regular expressions differ, and so do their rules. As a result, after a long period of development, several schools formed, each applying different rules.
The following are the common schools (regex engine types):
1. NFA
2. DFA
3. POSIX NFA
4. DFA/NFA hybrid
We don't need to know the classification criteria for each engine. We just need to understand how they differ and which commonly used tools belong to which type. It's very simple:
1. NFA
Tools of this type: Java, GNU Emacs, grep, .NET, PHP, Python, Ruby, etc.
Difference: the NFA can be called a regex-directed engine. Its matching efficiency is closely tied to the regular expression itself (for example, to the order of the alternatives in an alternation).
2. DFA
Tools of this type: awk, egrep, flex, lex, MySQL, procmail, etc.
Difference: the DFA is called a text-directed engine. Its matching efficiency depends on the text (the target string). Equivalent expressions in different forms have the same efficiency, for example [a-d] and [abcd]. Note that in an NFA the two differ in efficiency, and the former is generally better.
3. POSIX NFA
Tools of this type: mawk, Mortice Kern Systems' utilities, etc.
Difference: whether or not a match has already been found, all possibilities must be tried in order to find the longest string that can be matched.
4. DFA/NFA hybrid
Tools of this type: GNU awk, GNU grep/egrep, Tcl
Difference: this type of engine is arguably the best and most mature, and it is relatively well optimized. It combines the advantages of DFA and NFA, but few tools currently use it.
-------
After all this, what we need to remember is:
Before using a tool that supports regular expressions, first find out the type of its engine. This is extremely important because different engines work differently. For example, the three regex libraries in PHP are all of the NFA type, so their matching is closely tied to the expression. We should therefore optimize our expressions sensibly to improve efficiency.
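To see why the engine type matters, here is a small sketch using Python's re module, which the list above places in the NFA group. It shows that the order of alternatives changes the result in a backtracking engine; the variable names are only for illustration.

```python
import re

text = "ab"

# Same alternatives, different order: a backtracking (NFA) engine takes the
# first alternative that leads to an overall match, not the longest one.
short_first = re.match(r"a|ab", text).group()
long_first = re.match(r"ab|a", text).group()

print(short_first)  # a
print(long_first)   # ab
```

An engine with leftmost-longest semantics (DFA or POSIX NFA) would be expected to report ab for both orderings, which is why the same expression can behave differently from tool to tool.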
III. Lookaround
[Strictly speaking, this topic does not need its own section, because it is only a small part of regular expressions. But some people have heard of it without understanding it and assume it is very advanced, so we discuss it separately (it is really not difficult).]
1. What is "lookaround"?
Taken literally, lookaround means looking around, and that is exactly what a regex lookaround does: when the transmission reaches a position, the engine first looks to the left or right to check whether this is the position we are looking for.
For example, we use (this|that) to match "there is a boy lying under that tree".
Obviously, this expression is very inefficient in an NFA engine, which works like this:
First, at the first t, it checks this character by character and finds that i and e do not match, then checks that character by character and finds that a and e do not match;
The transmission moves forward one position, to h, and checks this character by character... then that...;
The transmission moves forward one position, to e, ...
...
That is a lot of useless work. How can we optimize it?
You can factor out the common prefix (one of the common optimization methods, summarized later) and convert the expression to th(is|at).
Of course, here we want to solve the problem with lookaround: (?=th)(this|that). What if you cannot read the leading (?=...)?
It doesn't matter. This is a positive lookahead. It means: walk through the text from the beginning, stop at any position that is followed by th, and only there try the part of the expression after (?=th), namely (this|that). [Note: a position that is not followed by th is skipped without trying the alternation at all. Isn't that a little more efficient?]
After this optimization, the number of comparisons drops. Admittedly, using lookaround here is making a fuss over a small problem. It is just a simple example of applying lookaround.
2. Types of lookaround (all very useful)
Type | Regular expression | Condition for a successful match
Positive lookahead | (?=...) | The subexpression can match the text to the right.
Positive lookbehind | (?<=...) | The subexpression can match the text to the left.
Negative lookahead | (?!...) | The subexpression cannot match the text to the right.
Negative lookbehind | (?<!...) | The subexpression cannot match the text to the left.
P.S. "Left" and "right" above refer to the left and right of the current match position, which is a position between characters rather than a character as in ordinary matching. For example:
Use the positive lookahead (?=a)abc to match the string family. The initial position is in front of f, not on f. Why?
Because [a lookaround construct does not match any characters; it only matches a specific position in the text]. When the current position is between f and a, the positive lookahead succeeds, and the engine starts checking abc character by character.
As we can see, a lookahead can restrict the positions at which the real comparison starts, and thus reduce the number of attempts.
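The point that a lookaround matches a position rather than characters can be checked directly. The following is a minimal sketch with Python's re module; the variable names are illustrative, and it demonstrates behaviour only, not speed.

```python
import re

text = "there is a boy lying under that tree"

plain = re.search(r"(this|that)", text)
guarded = re.search(r"(?=th)(this|that)", text)

# Both forms find the same word at the same position ...
print(plain.group(), plain.start())
print(guarded.group(), guarded.start())

# ... because the lookahead consumes nothing. On its own it succeeds at
# index 0 (the position in front of "there") with an empty match.
probe = re.search(r"(?=th)", text)
print(repr(probe.group()), probe.start())  # '' 0
```

The empty match of the probe shows that (?=th) only tests what follows the current position, so the text it inspected is still available to the rest of the expression.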
3. Applications of lookaround
Lookaround is mostly used for expression optimization and for other special cases (when you really need it; in general, lookaround can be replaced by other, more complex constructs).
For example, to match the word the in "the land belongs to these animals", how can we avoid matching the the inside these?
Word boundaries come to mind easily (if the engine supports them): use \bthe\b for a global match.
In fact, for this example we can also use the(?!\w) to achieve the goal. Even though the first part matches the the inside these, it does not matter: (?!\w) rules these out. (Here the negative lookahead requires that the character after e is not a word character. Specifically, \w is equivalent to [a-zA-Z0-9_]. It may not be the best fit here, but it serves to illustrate the point.)
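Both approaches can be tried side by side. This is a small sketch with Python's re module, where findall plays the role of the global match; the variable names are illustrative.

```python
import re

text = "the land belongs to these animals"

with_boundaries = re.findall(r"\bthe\b", text)
with_lookahead = re.findall(r"the(?!\w)", text)

print(with_boundaries)  # ['the']
print(with_lookahead)   # ['the']
```

Note that the lookahead version only checks the right-hand side, so unlike the word-boundary version it would still match the the at the end of a word such as bathe.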
IV. Backtracking (backtracking is a key issue that must be covered before discussing optimization)
Simply put, backtracking means going back to a branch that has not been tried yet (or returning to a saved state). People new to regular expressions find the first wording easier to understand; the second is more accurate.
A simple example: use .*! to match the string "An idle youth, a needy age!", An old saying said.
First, .* can match any number of characters (the dot stands for any character, * for any number of them), and * is greedy (that is, * matches as much as possible).
So .* matches the entire string (from A to the final period). Then the engine finds that ! cannot match. What now?
.* must give back part of what it matched so that ! gets a chance to match. It gives back the period at the end of the sentence, and ! still cannot match.
It gives back more. This time it is d, which still cannot match.
...
After the ! of age! is given back, the match succeeds.
Throughout this process, .* takes the entire string and is then forced to give characters back until ! can match. This giving back is backtracking (simply put, the engine goes back).
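The give-back process can be mimicked by hand. The sketch below is a simplification that assumes a single-line string and only models the expression .*! at position 0; the function name greedy_dot_star_then_bang is hypothetical, and the quotation marks are left out of the sample text for simplicity.

```python
import re

text = "An idle youth, a needy age! An old saying said."

def greedy_dot_star_then_bang(s):
    """Hypothetical helper: take everything, then give back one character at a time."""
    end = len(s)                      # `.*` first consumes the whole string
    given_back = 0
    while end >= 0:
        if s.startswith("!", end):    # can `!` match right after what `.*` kept?
            return s[:end + 1], given_back
        end -= 1                      # backtrack: `.*` gives back one character
        given_back += 1
    return None, given_back           # no `!` anywhere: the attempt fails

matched, given_back = greedy_dot_star_then_bang(text)
print(matched)
print(given_back)
print(matched == re.match(r".*!", text).group())  # True
```

The count includes the ! itself, since .* has to release it before the literal ! in the expression can match it.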
-------
Backtracking like this is obviously wasted, time-consuming work. A major part of optimization is reducing the number of backtracking steps.
From another angle, reducing backtracking improves matching efficiency, or in other words shortens the time between the engine starting work and reporting a result (success or failure). Isn't that exactly what optimization is?
V. Regular Expression Optimization
1. Efficiency indicators
There are two reference indicators for evaluating the efficiency of a regular expression: the number of attempts (comparisons) and the number of backtracking steps.
Provided the expression stays correct, the fewer attempts and backtracking steps the better: the fewer there are, the faster the match (or the faster the result is reported).
2. Optimization operations
There are two kinds of optimization operations:
A. Speed up certain operations
This depends on the internal implementation of the specific engine. For example, \d is generally faster than [0-9], and in an NFA engine [0-9] is faster than [0123456789].
B. Avoid redundant operations
That is, restrict the match precisely. For example, the lookaround example above restricts the starting positions of the match, which can greatly improve efficiency.
Of course, this kind of optimization involves trade-offs. If restricting the position takes a lot of time and overall matching efficiency drops, the optimization is not worthwhile.
Should you optimize? To what extent? That has to be weighed against the specific application scenario.
3. Common optimization methods
There are many optimization methods. Only the most commonly used ones are listed here (for more, see the reference book).
A. Eliminate unnecessary parentheses
In many cases, () is added only to delimit the scope of a quantifier or alternation, not to capture the matched text. In this case, replace the capturing parentheses () with non-capturing parentheses (?:...), which reduces memory overhead and also improves efficiency.
B. Remove unnecessary character classes
Some people habitually use a character class such as [.] to represent a single special character. It can be replaced by \. and similarly [*] by \* and so on.
C. Avoid repeated compilation
This is worth noting when using regular expressions from a host language. For example, to apply a regular expression to a piece of text in Java, you must first compile the regular expression. The same regular expression only needs to be compiled once, so the compilation step should not be placed inside a loop. This avoids repeated compilation and saves time.
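The text uses Java as its example; the same habit looks like this in Python. This is a minimal sketch in which the sample lines and the name error_re are made up for illustration.

```python
import re

lines = ["error: disk full", "ok", "error: timeout"]

# Compile once, outside the loop ...
error_re = re.compile(r"^error: (.+)$")

messages = []
for line in lines:
    m = error_re.match(line)   # ... and reuse the compiled object inside it
    if m:
        messages.append(m.group(1))

print(messages)  # ['disk full', 'timeout']
```

Python's re module also caches recently compiled patterns behind functions such as re.match, so the penalty for forgetting this is usually smaller there, but an explicit compile keeps the cost visible and works the same way in languages without such a cache.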
D. Use the start anchor
This is a good habit to develop. For example, for most regular expressions that begin with .*, you can put ^ or \A in front to indicate the start of a line or of the string. What is the advantage?
In some less sophisticated engines, the effect of this optimization is very noticeable. Imagine that an expression beginning with .* finds no match after a full round of attempts on the target string. Without ^ or \A in front, the engine starts a new round of attempts from the second character of the target string... which is obviously pointless (we know that the first round already settles the result, so a 2nd or n-th round is not needed at all).
Some mature engines optimize such expressions automatically: if they detect .* at the start with no ^ or \A before it, they implicitly add the start anchor to avoid the pointless attempts.
For us, adding a start anchor before a leading .* should become a habit.
E. Expose literal text
For example, xx* is faster than x+, xxxx{0,2} is faster than x{3,5}, and th(?:is|at) is faster than (?:this|that).
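Before rewriting an expression for speed, it is worth confirming that the rewritten form still accepts the same strings. The sketch below checks the three rewrites from item E against a handful of samples; the helper name same_language and the sample list are assumptions made for this illustration, and the check says nothing about speed.

```python
import re

pairs = [
    (r"x+", r"xx*"),
    (r"x{3,5}", r"xxxx{0,2}"),
    (r"(?:this|that)", r"th(?:is|at)"),
]
samples = ["", "x", "xx", "xxx", "xxxx", "xxxxx", "xxxxxx", "this", "that", "thus"]

def same_language(a, b, inputs):
    """Hypothetical helper: True if both patterns accept exactly the same samples."""
    return all(bool(re.fullmatch(a, s)) == bool(re.fullmatch(b, s)) for s in inputs)

for original, exposed in pairs:
    print(original, exposed, same_language(original, exposed, samples))
```

Whether the exposed form is actually faster depends on the engine, so any claimed gain should be measured in the tool you use.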
VI. How to Write Efficient Regular Expressions?
Keep these goals in mind when writing a regular expression:
1. Match the expected text
2. Exclude unwanted text
3. Keep it manageable and understandable
4. Ensure efficiency and get a result (match succeeded / match failed) as soon as possible
The first two points ensure the correctness of the expression; the last two require an appropriate trade-off between efficiency and ease of use. This is the principle for writing regular expressions.
A classic saying sums up the general principle: don't throw the baby out with the bathwater.
VII. Common Pitfalls
1. The difference between [-./], [.-/], and [./-]
At first glance there seems to be no difference. In fact, the first and third are equivalent: the character at the current position must be a hyphen, a dot, or a slash.
The second expression is wrong. It says that the character at the current position must be any character in the range from the dot to the slash (simply put, here - denotes a range, as in [A-Z]). Which characters lie between the dot and the slash depends on the character set. In ASCII and Unicode the dot and slash are adjacent, so the class matches only those two and no longer matches the hyphen, which is not what we intended.
When using - in a character class, check its position carefully to avoid this kind of error.
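The three character classes are easy to compare directly. This is a quick sketch with Python's re module; the dictionary name results is only for illustration.

```python
import re

results = {}
for cls in (r"[-./]", r"[.-/]", r"[./-]"):
    # Record which of the three characters each class accepts.
    results[cls] = [ch for ch in "-./" if re.fullmatch(cls, ch)]
    print(cls, results[cls])
```

In Python the middle form accepts the dot and the slash but not the hyphen, because the hyphen between them is read as a range operator rather than a literal.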
2. The difference between ^ inside and outside []
Outside a character class, ^ indicates the start of a line and $ the end of a line. Inside a character class, ^ means "not" when it comes first ([^...] is called a negated character class) and is an ordinary character anywhere else ([...^]).
3. [AB]* and (A*|B*)
The two look equivalent, but there is a special case: the former can match ABA while the latter cannot. In addition, the former is more efficient.
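The difference shows up as soon as the two letters are mixed. This is a minimal check with Python's re module, using fullmatch so that the whole string must be matched; the variable names are illustrative.

```python
import re

class_star = bool(re.fullmatch(r"[AB]*", "ABA"))
alt_star_mixed = bool(re.fullmatch(r"(A*|B*)", "ABA"))
alt_star_uniform = bool(re.fullmatch(r"(A*|B*)", "AAA"))

print(class_star)        # True
print(alt_star_mixed)    # False
print(alt_star_uniform)  # True
```

The alternation picks either a run of A or a run of B, so it can never cover a string that contains both letters.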
4. Pitfalls with quantifiers (? + *)
When quantifiers are nested, examine the semantics carefully to avoid endless loops (runaway backtracking). For example, "(\\.|[^\\"]+)*" is meant to match a double-quoted string in the text, where the quoted text may contain double quotes escaped with backslashes. When the closing quote is missing, the nested quantifiers make the engine try an enormous number of combinations, and the attempt almost never finishes.
Nested quantifiers do not necessarily lead to such loops. In short, treat nested quantifiers in an expression with caution.
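One common way to defuse this particular expression is to drop the inner +, so that only one quantifier is left to decide how the text is divided. The sketch below compares the two forms with Python's re module; the names nested and flat are illustrative, and the nested form is deliberately run only on a short input.

```python
import re

nested = re.compile(r'"(\\.|[^\\"]+)*"')   # the risky form from the text
flat = re.compile(r'"(\\.|[^\\"])*"')      # same idea without the inner +

ok = r'say "a \"quoted\" word" now'
print(nested.search(ok).group() == flat.search(ok).group())  # True

# A long string with no closing quote: the flat form gives up quickly.
unterminated = '"' + "a" * 2000
print(flat.search(unterminated))  # None

# The nested form is only tried on a short input here, because the number of
# ways to split the run of a's between the inner + and the outer * doubles
# with each extra character.
print(nested.search('"' + "a" * 12))  # None
```

On well-formed input both forms return the same match; the difference only appears when the match has to fail.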
VIII. Summary
My personal view of regular expressions is:
If you do not fully understand regular expressions, try not to use them to solve complicated problems (or to write very long regular expressions); some of the traps will leave you scratching your head. Constructing a perfect regular expression takes careful thought. In ordinary applications, string matching is easier to control with program code.
Of course, this is not to say that regular expressions should be avoided as much as possible (that would be giving up eating for fear of choking). In some cases regular expressions are irreplaceable (for example, extracting URLs from text...).
Furthermore, we should understand regular expressions thoroughly even if we do not need them right now.
IX. Tables [metacharacter table] [mode modifier table] [special metacharacter table]
1. Metacharacter table (metacharacters supported by most tools)
Metacharacter | Name | Description
^ | Caret | Matches the position at the start of a line.
$ | Dollar sign | Matches the position at the end of a line.
. | Period | Matches any character (generally not the newline \n at the end of a line).
[] | Character class | Matches any one of the characters in the brackets (exactly one character must be matched).
[^] | Negated character class | Matches any character except those in the brackets (exactly one character must be matched).
\char | Escape | Gives char its other meaning. For example, \^ is an ordinary ^ character rather than the start of a line.
() | Capturing parentheses | Delimit the scope of a quantifier, or capture the matched text (the captured text is available through a backreference).
(?:) | Non-capturing parentheses | Same as parentheses, but the text is not captured.
? | Question mark | Quantifier: the item to its left is optional.
* | Asterisk | Quantifier: the item to its left may appear any number of times (or not at all).
+ | Plus sign | Quantifier: the item to its left appears at least once, with no upper limit.
{min,max} | Interval | Quantifier: the item to its left appears at least min and at most max times.
{num} | Exact interval | Quantifier: the item to its left must appear exactly num times.
| | Vertical bar | Or: implements alternation between several choices.
\< | Word boundary | Start position of a word.
\> | Word boundary | End position of a word.
\num | Backreference | The text captured by the num-th pair of capturing parentheses (pairs are counted by the order in which their left parentheses appear; pay attention to nested parentheses).
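The backreference row and its note on counting nested parentheses can be illustrated briefly. This is a small sketch with Python's re module; the sample strings are made up.

```python
import re

# \1 must repeat exactly the text that group 1 captured.
doubled = re.compile(r"\b(\w+) \1\b")
m = doubled.search("this is is a test")
print(m.group())   # is is
print(m.group(1))  # is

# Groups are numbered by the position of their left parenthesis,
# so the outer pair is group 1 and the nested pairs are groups 2 and 3.
nested_groups = re.match(r"((a)(b))", "ab").groups()
print(nested_groups)  # ('ab', 'a', 'b')
```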
2. Mode modifier table (examples of mode modifiers; they may differ between tools)
Modifier | Description
i | Case-insensitive matching.
g | Global match: find all matches in the target text. By default, only the first match is found.
x | Free-spacing mode: the regular expression can be spread over multiple lines and can contain comments.
m | Enhanced line-anchor mode: splits the text into logical lines so that ^ and $ match at the start and end of each line rather than only at the start and end of the entire string.
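As the table notes, modifiers differ between tools. As one concrete case, here is how the four modifiers map onto Python's re module, which spells them as flags and has no g modifier at all; the sample text is made up.

```python
import re

text = "Error: one\nerror: two\n"

# i -> re.IGNORECASE, m -> re.MULTILINE, x -> re.VERBOSE
pattern = re.compile(
    r"""
    ^error:      # literal prefix at the start of a line
    \s*(\w+)$    # the word that follows, up to the end of the line
    """,
    re.IGNORECASE | re.MULTILINE | re.VERBOSE,
)

# Python has no g modifier; findall (or finditer) returns every match instead.
found = pattern.findall(text)
print(found)  # ['one', 'two']
```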
3. Special metacharacters (supported by some tools)
Metacharacter | Description
\d | Digit, equivalent to [0-9]
\D | Non-digit, equivalent to [^0-9]
\w | Word character (letters, digits, and underscore), equivalent to [a-zA-Z0-9_]
\W | Non-word character, equivalent to [^a-zA-Z0-9_]
\s | Whitespace, such as space, tab, form feed, carriage return, and line feed
\S | Non-whitespace character
\b | Word boundary: the position at the start or end of a word
-------
Statement: all of the content above comes from the author's understanding of the reference book.
Reference book: Mastering Regular Expressions (by Jeffrey E. F. Friedl)
Book review: the book is well thought out in its chapter arrangement, interspersed content, and even typography (a special typographic device: for every question raised in the book, the answer appears only after you turn the page). This is helpful for an in-depth understanding of regular expressions. If you are interested, refer to it.