1. What is a regular expression?
Basically, a regular expression is a pattern that describes a certain amount of text. "Regex" is short for "regular expression". This article writes <regex> to denote a specific regular expression.
A literal piece of text is the most basic pattern: it simply matches that same text.
2. Different Regular Expression Engines
A regular expression engine is a piece of software that processes regular expressions. Usually the engine is part of a larger application. Different regular expression engines are not fully compatible with one another. This tutorial focuses on the Perl 5 flavor, which is the most widely used, and also mentions some differences from other engines. Many modern engines are similar but not identical, for example the .NET regular expression library and the JDK regular expression package.
3. Literal characters
The most basic regular expression consists of a single literal character, for example <a>. It matches the first occurrence of "a" in the string. In the string "Jack is a boy", the "a" after the "J" is matched; the second "a" is not.
The regular expression can match the second "a" too, but only if you tell the engine to continue searching after the first match. In a text editor you would use "Find Next". In a programming language there is usually a function that lets you continue searching forward from the position of the previous match.
Similarly, <cat> matches "cat" in "About cats and dogs". This tells the engine to find a <c>, immediately followed by an <a>, immediately followed by a <t>.
Note that regular expression engines are case sensitive by default. <cat> does not match "Cat" unless you tell the engine to ignore case.
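The article does not tie these ideas to a particular language. As a minimal sketch, assuming Python's standard `re` module, literal matching, "find next" and case-insensitive matching could look like this:

```python
import re

text = "Jack is a boy"

# The first "a" is the one right after "J".
first = re.search(r"a", text)

# "Find next": continue searching after the end of the previous match.
second = re.compile(r"a").search(text, first.end())

# Matching is case sensitive unless the engine is told otherwise.
sensitive = re.search(r"cat", "About Cats and dogs")
insensitive = re.search(r"cat", "About Cats and dogs", re.IGNORECASE)

print(first.start(), second.start())   # 1 8
print(sensitive)                       # None
print(insensitive.group())             # Cat
```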
· Special characters
12 characters are reserved for special purposes. They are:
[ ] \ ^ $ . | ? * + ( )
These special characters are also called metacharacters.
If you want to use any of these characters as a literal in a regular expression, you need to escape it with a backslash "\". For example, to match "1+1=2", the correct regular expression is <1\+1=2>.
<1+1=2> is also a valid regular expression, but it does not match "1+1=2". Instead, it matches "111=2" in "123+111=234", because "+" has a special meaning here (repeat the preceding item one or more times).
In programming languages, note that some special characters are processed by the compiler first and only then passed to the regular expression engine. The regular expression <1\+1=2> must therefore be written as "1\\+1=2" in C++ source code. To match "c:\temp" you need the regular expression <c:\\temp>, which in C++ source code becomes "c:\\\\temp".
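The same two layers of escaping exist in other languages. The sketch below assumes Python's `re` module, where a raw string (`r"..."`) removes the compiler-level layer so that only the regex-level escaping remains:

```python
import re

# Escaped plus: matches the literal text "1+1=2".
literal = re.search(r"1\+1=2", "1+1=2")

# Unescaped plus: "1+" means "one or more 1s", so a different substring matches.
unescaped = re.search(r"1+1=2", "123+111=234")

# Without a raw string, every backslash has to be doubled once more.
raw_pattern = r"c:\\temp"
plain_pattern = "c:\\\\temp"
path = "c:\\temp"  # the actual text c:\temp

print(literal.group())               # 1+1=2
print(unescaped.group())             # 111=2
print(raw_pattern == plain_pattern)  # True
print(re.fullmatch(raw_pattern, path) is not None)  # True

# re.escape() adds the regex-level backslashes for you.
escaped_ok = re.fullmatch(re.escape("1+1=2"), "1+1=2") is not None
```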
· Non-printable characters
Special character sequences can be used to represent certain non-printable characters:
<\t> represents a tab (0x09)
<\r> represents a carriage return (0x0D)
<\n> represents a line feed (0x0A)
Note that Windows text files end a line with "\r\n", while Unix uses "\n".
4. Internal Working Mechanism of the Regular Expression Engine
Knowing how the regular expression engine works helps you understand quickly why a regular expression does not behave as you expected.
There are two kinds of engines: text-directed engines and regex-directed engines. Jeffrey Friedl calls them DFA and NFA engines, respectively. This article discusses regex-directed engines, because some very useful features, such as lazy quantifiers and backreferences, can only be implemented in a regex-directed engine. It is therefore no surprise that this kind of engine is currently the most popular.
It is easy to tell whether an engine is text-directed or regex-directed. If backreferences or lazy quantifiers are available, you can be certain the engine is regex-directed. You can also run the following test: apply the regular expression <regex|regex not> to the string "regex not". If the match is "regex", the engine is regex-directed. If the match is "regex not", it is text-directed. The reason is that a regex-directed engine is "eager": it reports the first match it finds.
· A regex-directed engine always returns the leftmost match
This is a very important point to understand: even if a "better" match could be found later in the string, a regex-directed engine always returns the leftmost match.
When <cat> is applied to "He captured a catfish for his cat", the engine first compares <c> with "H", which fails. It then compares <c> with "e", which also fails, and so on until the fourth character, where <c> matches "c". <a> matches the fifth character. <t> then fails to match the sixth character, "p". The engine restarts the match attempt at the fifth character. This continues until, starting at the 15th character, <cat> matches the "cat" in "catfish". The engine eagerly returns this first match and does not go on to look for other, "better" matches.
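Both the eagerness test and the leftmost-match rule can be checked directly. This is a small sketch assuming Python's `re` module, which is a regex-directed (backtracking) engine:

```python
import re

# Eagerness test: a regex-directed engine takes the first alternative that works.
alternation = re.search(r"regex|regex not", "regex not")

# Leftmost match: "cat" inside "catfish" wins over the later word "cat".
sentence = "He captured a catfish for his cat"
leftmost = re.search(r"cat", sentence)

print(alternation.group())   # regex
print(leftmost.start())      # 14 (zero-based, i.e. the 15th character)
```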
5. Character Set
A character set is a set of characters enclosed in a pair of square brackets. With a character set you tell the engine to match exactly one of several characters. To match an "a" or an "e", use <[ae]>. You can use <gr[ae]y> to match "gray" or "grey". This is especially useful when you do not know whether the text uses American or British spelling. Conversely, <gr[ae]y> does not match "graay" or "graey". The order of the characters inside the set does not matter; the result is the same.
You can use a hyphen "-" inside a character set to define a range of characters. <[0-9]> matches a single digit from 0 to 9. You can use more than one range: <[0-9a-fA-F]> matches a single hexadecimal digit, regardless of case. You can also combine ranges with single characters: <[0-9a-fxA-FX]> matches a hexadecimal digit or the letter X. Again, the order of the characters and ranges does not affect the result.
· Application of character sets
Find a word that may be misspelled, such as <sep[ae]r[ae]te> or <li[cs]en[cs]e>.
Find an identifier in a programming language: <[A-Za-z_][A-Za-z_0-9]*>. ("*" means repeat zero or more times.)
Find a C-style hexadecimal number: <0[xX][A-Fa-f0-9]+>. ("+" means repeat one or more times.)
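The character set applications above can be tried out as follows. This is a sketch assuming Python's `re` module; the sample strings are made up for illustration:

```python
import re

# One character out of a set: "a" or "e".
spellings = re.findall(r"gr[ae]y", "gray grey graay graey")

# Tolerating a common misspelling.
separate = re.findall(r"sep[ae]r[ae]te", "separate seperate saparate")

# Identifier: a letter or underscore, then letters, digits or underscores.
identifier = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")

# C-style hexadecimal literal.
hex_literal = re.compile(r"0[xX][A-Fa-f0-9]+")
hex_found = hex_literal.findall("0x1F 0XABC 0xZZ")

print(spellings)                                 # ['gray', 'grey']
print(separate)                                  # ['separate', 'seperate']
print(identifier.fullmatch("_tmp1") is not None) # True
print(identifier.fullmatch("9lives"))            # None
print(hex_found)                                 # ['0x1F', '0XABC']
```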
· Negated character sets
A character set is negated when the opening square bracket "[" is immediately followed by a caret "^". The resulting set matches any character that is not listed between the brackets. Unlike ".", a negated character set also matches line break characters.
It is important to remember that a negated character set still has to match a character. <q[^u]> does not mean "match a q that is not followed by a u". It means "match a q followed by a character that is not a u". It therefore does not match the q in "Iraq", but it does match the q and the following space in "Iraq is a country". The space is part of the match because it is "a character that is not a u".
If you want to match only the q, on the condition that it is not followed by a u, you can use lookahead, which is described later.
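A short check of this behaviour, sketched with Python's `re` module (an assumption; the article itself is language-neutral):

```python
import re

# A negated set must consume a character, so a "q" at the very end cannot match.
at_end = re.search(r"q[^u]", "Iraq")

# Here the space after "q" is consumed and becomes part of the match.
with_space = re.search(r"q[^u]", "Iraq is a country")

# Unlike the dot, a negated set also matches a line break.
dot_newline = re.search(r".", "\n")
negated_newline = re.search(r"[^a]", "\n")

print(at_end)                    # None
print(repr(with_space.group()))  # 'q '
print(dot_newline)               # None
print(negated_newline is not None)  # True
```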
· Metacharacters in character sets
Note that only four characters have a special meaning inside a character set: "]", "\", "^" and "-". "]" ends the character set, "\" escapes, "^" negates, and "-" defines a range. The other metacharacters are ordinary characters inside a character set and do not need to be escaped. For example, to search for an asterisk or a plus sign, you can use <[+*]>. Escaping the ordinary metacharacters anyway does no harm, but it reduces readability.
To use a backslash "\" as a literal character inside a character set, escape it with another backslash. <[\\x]> matches a backslash or an x. The characters "]", "^" and "-" can either be escaped with a backslash or be placed where they cannot have their special meaning. We recommend the latter because it is more readable. A "^" that is placed anywhere other than directly after the opening bracket "[" is a literal caret rather than a negation: <[x^]> matches an x or a caret. <[]x]> matches a "]" or an "x". <[-x]> or <[x-]> matches a hyphen or an x.
· Character Set abbreviations
Some character sets are used so often that shorthand forms exist for them.
<\d> stands for <[0-9]>.
<\w> stands for a "word character". Exactly which characters this covers varies between implementations; in most of them it includes <[A-Za-z0-9_]>.
<\s> stands for a "whitespace character". This also depends on the implementation; in most of them it includes the space, the tab, and the line break characters <\r\n>.
Shorthand character sets can be used both inside and outside square brackets. <\s\d> matches a whitespace character followed by a digit. <[\s\d]> matches a single character that is either whitespace or a digit. <[\da-fA-F]> matches a hexadecimal digit.
Negated shorthand character sets:
<\S> = <[^\s]>
<\W> = <[^\w]>
<\D> = <[^\d]>
· Repeated character sets
If you repeat a character set with the "?", "*" or "+" operator, you repeat the entire character set, not just the character it happened to match. The regular expression <[0-9]+> matches both 837 and 222.
If you want to repeat only the character that was matched, you can use a backreference. Backreferences are discussed later.
6. Use ?, * or +
?: tells the engine to match the preceding item zero times or once. In effect, it makes the preceding item optional.
+: tells the engine to match the preceding item one or more times.
*: tells the engine to match the preceding item zero or more times.
<<[A-Za-z][A-Za-z0-9]*>> matches an HTML tag without attributes. Here the inner "<" and ">" are literal characters that are part of the pattern. The first character set matches a letter, and the second matches a letter or a digit.
It might seem that we could use <<[A-Za-z0-9]+>> instead, but that would also match "<1>". This simpler regular expression is still good enough when you know that the text you are searching does not contain such invalid tags.
· Limiting repetition
Many modern regular expression implementations let you specify how many times an item is repeated. The syntax is {min,max}, where min and max are non-negative integers. If the comma is present but max is omitted, there is no upper limit. If both the comma and max are omitted, the item is repeated exactly min times.
Therefore, {0,} is the same as *, and {1,} is the same as +.
You can use <\b[1-9][0-9]{3}\b> to match a number between 1000 and 9999 ("\b" denotes a word boundary). <\b[1-9][0-9]{2,4}\b> matches a number between 100 and 99999.
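The two range patterns can be verified against a few sample numbers. A sketch assuming Python's `re` module; the sample text is invented for the example:

```python
import re

four_digits = re.compile(r"\b[1-9][0-9]{3}\b")       # 1000 to 9999
three_to_five = re.compile(r"\b[1-9][0-9]{2,4}\b")   # 100 to 99999

text = "7 42 100 999 1000 9999 99999 100000 0123"

found_four = four_digits.findall(text)
found_range = three_to_five.findall(text)

print(found_four)    # ['1000', '9999']
print(found_range)   # ['100', '999', '1000', '9999', '99999']
```

The word boundaries are what keep "100000" and "0123" out: without them the patterns would match part of a longer run of digits.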
· Watch out for greediness
Suppose you want to use a regular expression to match an HTML tag. You know that the input is a valid HTML file, so the regular expression does not need to exclude invalid tags. Anything between two angle brackets should therefore be an HTML tag.
Many newcomers first think of the regular expression <<.+>>. They are then surprised by the result for the test string "This is a <em>first</em> test". You might expect the regular expression to return "<em>" and then, when matching continues, "</em>".
But it does not. The regular expression matches "<em>first</em>", which is clearly not what we wanted. The reason is that "+" is greedy: it makes the engine repeat the preceding item as many times as possible. The engine backtracks only when that repetition makes the overall match fail. It then gives up the last repetition and tries the remainder of the regular expression again.
Like "+", the "?" and "*" quantifiers are greedy.
· A closer look inside the regular expression engine
Let's see how the engine matches the previous example. The first token is "<", which is a literal. The second token is ".", which matches the character "e", and the "+" lets the dot keep matching characters until the end of the line. At that point the dot fails to match the line break ("." does not match line breaks). The engine moves on to the next token and tries to match ">". So far "<.+" has matched "<em>first</em> test". The engine tries to match ">" against the line break and fails. It therefore backtracks: "<.+" now matches "<em>first</em> tes", and the engine tries to match ">" against "t". This obviously fails as well. The process continues until "<.+" matches "<em>first</em" and ">" matches ">". The engine has now found the match "<em>first</em>". Remember that a regex-directed engine is "eager": it reports the first match it finds and does not keep backtracking, even though a "better" match such as "<em>" may exist. So, because "+" is greedy, the engine returns the leftmost, longest match.
· Replacing greediness with laziness
One way to fix the problem above is to make the "+" lazy instead of greedy. You do this by putting a question mark "?" directly after the "+". The same works for the "*", "{}" and "?" quantifiers. In the example above we can therefore use "<.+?>". Let's look at how the engine processes it.
Again, the token "<" matches the first "<" in the string. The next token is ".", this time repeated by a lazy "+". This tells the engine to repeat the preceding item as few times as possible. The engine therefore matches "." with the character "e" and then tries to match ">" with "m", which fails. The engine backtracks, but unlike in the previous example the repetition is lazy, so the engine expands it rather than shrinking it: "<.+" now extends to "<em". The engine tries the next token ">" again and this time it succeeds. The engine reports "<em>" as a successful match. That is essentially the whole process.
· An alternative to laziness
There is an even better alternative: a greedy repetition of a negated character set, "<[^>]+>". With lazy repetition the engine backtracks once for every character before it finds a successful match. With the negated character set no backtracking is needed.
Finally, remember that this tutorial only covers regex-directed engines. Text-directed engines do not backtrack, and they also do not support lazy quantifiers.
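The three variants discussed in this section can be compared side by side. A minimal sketch, assuming Python's `re` module:

```python
import re

text = "This is a <em>first</em> test"

greedy = re.findall(r"<.+>", text)      # greedy: runs to the last ">"
lazy = re.findall(r"<.+?>", text)       # lazy: stops at the first ">"
negated = re.findall(r"<[^>]+>", text)  # negated set: cannot cross a ">"

print(greedy)    # ['<em>first</em>']
print(lazy)      # ['<em>', '</em>']
print(negated)   # ['<em>', '</em>']
```

The lazy and negated-set versions return the same matches here; the difference described in the text is in how much backtracking the engine needs to get there.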
7. Use "." to match almost any character
In regular expressions, "." is one of the most commonly used symbols. Unfortunately, it is also one of the most commonly misused.
"." matches a single character, whatever that character is. The only exception is the line break character. The engines discussed in this tutorial do not match line breaks with the dot by default. So by default "." is shorthand for the character set [^\n\r] (Windows) or [^\n] (Unix).
This exception exists for historical reasons. The earliest tools that used regular expressions were line-based: they read a file line by line and applied the regular expression to each line separately. In those tools the string never contained a line break, so "." never matched one.
Modern tools and languages can apply regular expressions to very large strings or even entire files. All regular expression implementations discussed in this tutorial offer an option that makes "." match every character, including line breaks. In tools such as RegexBuddy, EditPad Pro and PowerGREP you simply tick "dot matches newline". In Perl, the mode in which "." matches line breaks is called "single-line mode". Unfortunately this is a confusing term, because there is also a "multi-line mode". Multi-line mode only affects the anchors at the start and end of a line, while single-line mode only affects ".".
Other languages and regular expression libraries have adopted the Perl terminology. When using the regular expression classes in the .NET Framework, you activate single-line mode with a statement like this: Regex.Match("string", "regex", RegexOptions.Singleline)
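For comparison, here is a sketch of the same switch in Python's `re` module (an assumption, since the article only shows .NET), where single-line mode is called `re.DOTALL` (or `re.S`):

```python
import re

text = "first line\nsecond line"

# By default the dot stops at the line break, so the pattern cannot span lines.
default_mode = re.search(r"first.*second", text)

# With DOTALL the dot also matches "\n".
single_line = re.search(r"first.*second", text, re.DOTALL)

print(default_mode)                # None
print(repr(single_line.group()))   # 'first line\nsecond'
```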
· Use the dot sparingly
The dot is arguably the most powerful metacharacter. It lets you be lazy: put in a dot and almost any character matches. The problem is that it also often matches characters it should not match.
Here is a simple example. Let's match a date in mm/dd/yy format while allowing the user to choose the separator. A solution that quickly comes to mind is <\d\d.\d\d.\d\d>. It appears to work, since it matches the date "02/12/03". The problem is that 02512703 is also accepted as a valid date.
<\d\d[-/.]\d\d[-/.]\d\d> looks like a better solution. Remember that the dot is not a metacharacter inside a character set. This solution is still far from perfect, since it matches "99/99/99". <[0-1]\d[-/.][0-3]\d[-/.]\d\d> goes a step further, although it still matches "19/39/99". How perfect your regular expression needs to be depends on what you want to achieve. If you are validating user input, make it as strict as possible. If you are only parsing a known source that you know contains no bad data, a reasonably good regular expression that matches what you are looking for is enough.
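The progression from loose to stricter date patterns can be checked like this. A sketch assuming Python's `re` module; `re.fullmatch` is used so that each test string has to match as a whole:

```python
import re

loose = re.compile(r"\d\d.\d\d.\d\d")
better = re.compile(r"\d\d[-/.]\d\d[-/.]\d\d")
stricter = re.compile(r"[0-1]\d[-/.][0-3]\d[-/.]\d\d")

samples = ["02/12/03", "02512703", "99/99/99", "19/39/99"]

# For each sample, record which of the three patterns accept it.
results = {
    s: (
        loose.fullmatch(s) is not None,
        better.fullmatch(s) is not None,
        stricter.fullmatch(s) is not None,
    )
    for s in samples
}

for sample, accepted in results.items():
    print(sample, accepted)
# 02/12/03 (True, True, True)
# 02512703 (True, False, False)
# 99/99/99 (True, True, False)
# 19/39/99 (True, True, True)
```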
8. Start-of-string and end-of-string anchors
Anchors are different from ordinary regular expression tokens. They do not match any character. Instead, they match a position before or after characters. "^" matches the position before the first character of the string. <^a> matches the a in the string "abc". <^b> does not match anything in "abc".
Similarly, "$" matches the position after the last character of the string. Therefore <c$> matches the c in "abc".
· Applications of anchors
Anchors are very important when you validate user input in a programming language. If you want to check that the input is an integer, use <^\d+$>.
User input often contains superfluous leading or trailing whitespace. You can use <^\s*> and <\s*$> to match the leading or trailing whitespace.
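A minimal sketch of both applications, assuming Python's `re` module. The helper names `is_integer` and `trim` are hypothetical and exist only for this illustration:

```python
import re

def is_integer(text):
    """Return True if the whole input consists of digits (hypothetical helper)."""
    return re.search(r"^\d+$", text) is not None

def trim(text):
    """Strip leading and trailing whitespace using the two anchored patterns."""
    text = re.sub(r"^\s*", "", text)
    return re.sub(r"\s*$", "", text)

print(is_integer("12345"))      # True
print(is_integer("12a45"))      # False
print(is_integer(" 12345"))     # False
print(repr(trim("  hello  ")))  # 'hello'
```

Without the anchors, <\d+> would happily accept "12a45" because it finds digits somewhere in the string.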
· Using "^" and "$" as start-of-line and end-of-line anchors
Suppose you have a string that contains several lines, for example "first line\r\nsecond line" (where "\r\n" represents a line break). You will often want to process each line separately rather than the string as a whole. Almost all regular expression engines therefore offer an option that extends the meaning of these two anchors. "^" then matches at the start of the string (before the "f") and after each line break (between "\r\n" and "s"). Similarly, "$" matches at the end of the string (after the last "e") and before each line break (between "e" and "\r\n").
In .NET, the following code makes the anchors match before and after each line break: Regex.Match("string", "regex", RegexOptions.Multiline)
Application: string str = Regex.Replace(original, "^", ">", RegexOptions.Multiline) inserts ">" at the beginning of each line.
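The same idea, sketched in Python's `re` module (an assumption; the article's own example is .NET), where the option is called `re.MULTILINE`:

```python
import re

original = "first line\nsecond line"

# Without MULTILINE, "^" only matches at the very start of the string.
single = re.sub(r"^", ">", original)

# With MULTILINE, "^" also matches right after each line break.
quoted = re.sub(r"^", ">", original, flags=re.MULTILINE)

# "$" likewise matches before each line break in multi-line mode.
last_words = re.findall(r"\w+$", original, flags=re.MULTILINE)

print(repr(single))   # '>first line\nsecond line'
print(repr(quoted))   # '>first line\n>second line'
print(last_words)     # ['line', 'line']
```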
· Absolute anchors
<\A> matches only at the start of the entire string, and <\Z> matches only at the end of the entire string. Even in multi-line mode, <\A> and <\Z> never match at line breaks inside the string.
Although \Z and $ (outside multi-line mode) match only at the end of the string, there is one exception. If the string ends with a line break, \Z and $ match at the position before that line break rather than at the very end of the string. This "improvement" was introduced by Perl and has been adopted by many regular expression implementations, including Java and .NET. If you apply <^[a-z]+$> to "joe\n", the match is "joe", not "joe\n".
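Engines differ on this detail, so it is worth testing the one you use. The sketch below assumes Python's `re` module, whose "$" behaves as described above but whose `\Z` is stricter than Perl's and matches only at the very end of the string:

```python
import re

text = "joe\n"

# "$" matches before the trailing line break, so "joe" is found.
dollar = re.search(r"^[a-z]+$", text)

# In Python, \Z does NOT match before a trailing line break.
strict_end = re.search(r"[a-z]+\Z", text)

# \A ignores multi-line mode, while "^" does not.
multi = "one\ntwo"
absolute_start = re.findall(r"\A\w+", multi, flags=re.MULTILINE)
line_start = re.findall(r"^\w+", multi, flags=re.MULTILINE)

print(repr(dollar.group()))  # 'joe'
print(strict_end)            # None
print(absolute_start)        # ['one']
print(line_start)            # ['one', 'two']
```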
9. Word boundaries
The metacharacter <\b> is also an anchor that matches a position. Such a match is a zero-length match.
Four kinds of positions count as word boundaries:
1) The position before the first character of the string (if the first character is a word character)
2) The position after the last character of the string (if the last character is a word character)
3) Between a word character and a non-word character, where the non-word character follows the word character
4) Between a non-word character and a word character, where the word character follows the non-word character
A "word character" is a character that can be matched by <\w>, and a "non-word character" is a character that can be matched by <\W>. In most regular expression implementations, the word characters include <[a-zA-Z0-9_]>.
For example, <\b4\b> matches a 4 that stands on its own rather than being part of a larger number. This regular expression does not match the 4 in "44".
In other words, <\b> matches at the start and end of a sequence of letters and digits.
The negation of the word boundary is <\B>. It matches at any position between two word characters or between two non-word characters.
· A closer look inside the regular expression engine
Let's apply the regular expression <\bis\b> to the string "This island is beautiful". The engine first processes the token <\b>. Because \b is zero-length, the position before the first character "T" is examined. "T" is a word character and nothing precedes it, so \b matches. However, <i> then fails to match the first character "T". The process continues through the string. Between the fourth character "s" and the fifth character, a space, <\b> matches, but the space does not match <i>. Moving on to the sixth character "i", <\b> matches between the space and the "i", and <is> matches the sixth and seventh characters. However, the eighth character "l" is a word character directly after the "s", so the second <\b> fails and the match attempt fails. At the 13th character "i", <\b> matches because the preceding character is a space, and <is> matches "is". The engine then tries to match the second <\b>. Because the 15th character, a space, forms a word boundary with the "s", the match succeeds. The engine "eagerly" returns this successful match.
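The outcome of this walk-through can be confirmed with a short sketch, again assuming Python's `re` module:

```python
import re

sentence = "This island is beautiful"

# Only the standalone word "is" has a word boundary on both sides.
whole_word = re.search(r"\bis\b", sentence)

# Without the boundaries, the "is" inside "This" is found first.
anywhere = re.search(r"is", sentence)

# A lone 4 matches, the 4s inside "44" do not.
lone_four = re.search(r"\b4\b", "4 44")
inside_44 = re.search(r"\b4\b", "44")

print(whole_word.start())  # 12 (zero-based, i.e. the 13th character)
print(anywhere.start())    # 2
print(lone_four.start())   # 0
print(inside_44)           # None
```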
10. Alternation
"|" in a regular expression denotes alternation. You can use it to match one of several possible regular expressions.
If you want to search for the text "cat" or "dog", you can use <cat|dog>. If you want more options, simply extend the list: <cat|dog|mouse|fish>.
Alternation has the lowest precedence of all regular expression operators. It tells the engine to match either everything to the left of the "|" or everything to the right of it. You can use parentheses to limit the scope of the alternation. For example, <\b(cat|dog)\b> tells the engine to treat (cat|dog) as a single unit.
· Remember that the regular expression engine is eager
The engine is eager: it stops searching as soon as it finds a valid match. Under certain conditions the order of the alternatives therefore affects the result. Suppose you want to use a regular expression to search for a list of function names in a programming language: Get, GetValue, Set or SetValue. An obvious solution is <Get|GetValue|Set|SetValue>. Let's see what happens when we search "SetValue".
<Get> and <GetValue> both fail, and then <Set> matches. Because a regex-directed engine is eager, it returns this first successful match, "Set", instead of continuing to look for a better one.
Contrary to what we expected, the regular expression does not match the entire string. There are several possible solutions. First, taking the eagerness of the engine into account, we can change the order of the alternatives. With <GetValue|Get|SetValue|Set> the longest alternatives are tried first. We can also combine the four alternatives into two: <Get(Value)?|Set(Value)?>. Because the question mark is greedy, SetValue is always tried before Set.
A better solution is to use word boundaries: <\b(Get|GetValue|Set|SetValue)\b> or <\b(Get(Value)?|Set(Value)?)\b>. Going one step further, since all alternatives share the same ending, we can optimize the regular expression to <\b(Get|Set)(Value)?\b>.
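Each of the variants above can be run against "SetValue" to see which text is actually returned. A sketch assuming Python's `re` module:

```python
import re

target = "SetValue"

naive = re.search(r"Get|GetValue|Set|SetValue", target).group()
reordered = re.search(r"GetValue|Get|SetValue|Set", target).group()
optional = re.search(r"Get(Value)?|Set(Value)?", target).group()
bounded = re.search(r"\b(Get|GetValue|Set|SetValue)\b", target).group()
compact = re.search(r"\b(Get|Set)(Value)?\b", target).group()

print(naive)      # Set
print(reordered)  # SetValue
print(optional)   # SetValue
print(bounded)    # SetValue
print(compact)    # SetValue
```

In the word-boundary version the engine still tries "Set" first, but the closing <\b> fails between "t" and "V", which forces it to backtrack into the alternation until "SetValue" is tried.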
11. Groups and backreferences
By placing part of a regular expression inside parentheses, you group that part together. You can then apply a regular expression operator, such as a repetition operator, to the entire group.
Note that only parentheses "()" can be used to form groups. "[]" defines a character set, and "{}" defines a repetition.
When "()" is used to define a group, the engine numbers the groups in order and stores what each one matched. The text matched by a group can then be referred to with a backreference of the form "\number". <\1> refers to the first group, <\2> to the second, and so on; <\n> refers to the nth group. <\0> refers to the entire match. Let's look at an example.
Suppose you want to match the opening and closing tags of an HTML element together with the text between them. For example, in "<B>This is a test</B>" we want to match <B> and </B> and the text in between. We can use the following regular expression: "<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>"
First, "<" matches the first character of "<B>". Then [A-Z] matches "B". [A-Z0-9]* matches zero or more alphanumeric characters, and [^>]* matches zero or more characters other than ">". The ">" of the regular expression then matches the ">" of "<B>". Next, the engine lazily matches the characters before the closing tag until it reaches "</". The "\1" in the regular expression then refers to the text matched earlier by the group "([A-Z][A-Z0-9]*)", which in this example is the tag name "B". The closing tag that has to match is therefore "</B>".
You can refer to the same group more than once. <([a-c])x\1x\1> matches "axaxa", "bxbxb" and "cxcxc". If a group referred to by number did not take part in the match, the backreference is empty.
A backreference cannot be used inside the group it refers to: <([abc]\1)> is an error. For the same reason you cannot use <\0> inside the regular expression itself; it can only be used in replacement operations.
Backreferences cannot be used inside a character set. In <(a)[\1b]>, the <\1> is not a backreference; it may be interpreted as an octal escape instead.
Capturing groups slow the engine down because it has to store what each group matched. If you do not need a backreference, you can tell the engine not to store the group, for example <Get(?:Value)>. The "?:" after the "(" tells the engine not to store the text matched by the group (Value) for later reference.
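The tag-matching example, repeated backreferences and non-capturing groups can all be tried out with a few lines. This is a sketch assuming Python's `re` module; the mismatched sample string is made up for illustration:

```python
import re

# \1 must repeat whatever tag name the first group captured.
tag_pair = re.compile(r"<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>")

matched = tag_pair.search("<B>This is a test</B>")
mismatched = tag_pair.search("<B>This is a test</I>")

# The same group can be referenced several times.
repeated_ok = re.fullmatch(r"([a-c])x\1x\1", "axaxa")
repeated_bad = re.fullmatch(r"([a-c])x\1x\1", "axbxa")

# A non-capturing group matches the same text but stores nothing.
non_capturing = re.search(r"Get(?:Value)", "GetValue")

print(matched.group(0))        # <B>This is a test</B>
print(matched.group(1))        # B
print(mismatched)              # None
print(repeated_ok is not None) # True
print(repeated_bad)            # None
print(non_capturing.groups())  # ()
```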
· Repetition and backreferences
When a repetition operator is applied to a group, the stored backreference is overwritten on every iteration, and only the text of the last iteration is kept. For example, <([abc]+)=\1> matches "cab=cab", but <([abc])+=\1> does not. When ([abc]) first matches "c", "\1" holds "c"; then ([abc]) goes on to match "a" and "b", and in the end "\1" holds "b". So the second regular expression matches "cab=b".
Application: checking for doubled words. When editing text it is easy to type a word twice, such as "the the". You can use <\b(\w+)\s+\1\b> to detect such doubled words. To delete the second word, simply use the replace function with "\1" as the replacement text.
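Both points can be demonstrated briefly. A sketch assuming Python's `re` module; the sample sentence is invented for the example:

```python
import re

# Group around the repetition: \1 holds the whole run "cab".
whole_run = re.fullmatch(r"([abc]+)=\1", "cab=cab")

# Repetition around the group: \1 only holds the last iteration, "b".
last_only_fail = re.fullmatch(r"([abc])+=\1", "cab=cab")
last_only_ok = re.fullmatch(r"([abc])+=\1", "cab=b")

# Removing doubled words by replacing the whole match with the first group.
doubled = re.compile(r"\b(\w+)\s+\1\b")
cleaned = doubled.sub(r"\1", "This is is a test of the the pattern")

print(whole_run is not None)   # True
print(last_only_fail)          # None
print(last_only_ok.group(1))   # b
print(cleaned)                 # This is a test of the pattern
```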
· Named groups and references to them
In PHP and Python, you can name a group with <(?P<name>group)>. In this syntax, "name" is the name of the group. You can refer to the group with (?P=name).
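A short sketch of the Python syntax, using the standard `re` module. The group name `word` and the sample sentence are chosen only for this illustration:

```python
import re

# (?P<word>...) names the group; (?P=word) refers back to it.
doubled = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")

match = doubled.search("it is is fine")

# In the replacement text a named group is written as \g<word>.
cleaned = doubled.sub(r"\g<word>", "it is is fine")

print(match.group("word"))  # is
print(match.group(1))       # is  (named groups are still numbered)
print(cleaned)              # it is fine
```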
· Named groups in .NET
The .NET Framework also supports named groups. Unfortunately, the Microsoft programmers decided to invent their own syntax instead of following the Perl and Python conventions. At the time this article was written, no other regular expression implementation supported the syntax invented by Microsoft.
Here is an example in .NET:
(?<first>group)(?'second'group)
As you can see, .NET offers two syntaxes for creating a named group: one uses angle brackets "<>", the other uses single quotes "''". Angle brackets are more convenient inside strings, while single quotes are more useful in ASP code, where "<>" is used for HTML tags.
To refer to a named group, use \k<name> or \k'name'.
In a search-and-replace operation, you can use "${name}" to refer to a named group.
12. Regular expression matching modes
The regular expression engines discussed in this tutorial support three matching modes:
/i makes the regular expression case insensitive.
/s enables "single-line mode", in which the dot "." matches line break characters.
/m enables "multi-line mode", in which "^" and "$" match at the positions before and after line break characters.
· Turning modes on and off inside the regular expression
If you insert a modifier such as (?ism) into the regular expression, the modifier only applies to the part of the regular expression to its right. (?-i) turns case insensitivity off again. You can test this quickly: <(?i)te(?-i)st> should match "TEst", but not "teST" or "TEST".
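Support for mid-pattern modifiers varies between engines. The sketch below assumes Python's `re` module, which does not accept a bare `(?i)` or `(?-i)` in the middle of a pattern in recent versions, but offers the scoped form `(?i:...)` (Python 3.6 and later) for the same effect; whole-pattern modes are passed as flags:

```python
import re

# Only "te" is case insensitive; "st" must be lowercase.
scoped = re.compile(r"(?i:te)st")

upper_te = scoped.fullmatch("TEst")
upper_st = scoped.fullmatch("teST")
all_upper = scoped.fullmatch("TEST")

# The three modes as flags: IGNORECASE (/i), DOTALL (/s), MULTILINE (/m).
flagged = re.search(r"^SECOND.LINE$", "first\nsecond\nline",
                    re.IGNORECASE | re.DOTALL | re.MULTILINE)

print(upper_te is not None)   # True
print(upper_st)               # None
print(all_upper)              # None
print(repr(flagged.group()))  # 'second\nline'
```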
13. Atomic groups and preventing backtracking
In some special cases, backtracking can make the engine extremely inefficient.
Let's look at an example: we want to match a string in which the fields are separated by commas and the 12th field starts with P.
A regular expression that easily comes to mind is <^(.*?,){11}P>. It works well under normal circumstances. In the extreme case, however, when the 12th field does not start with P, catastrophic backtracking occurs. Suppose the string to search is "1,2,3,4,5,6,7,8,9,10,11,12,13". At first the regular expression matches successfully all the way up to the 12th field. At that point it has consumed "1,2,3,4,5,6,7,8,9,10,11,". The next token, <P>, does not match the "1" of "12", so the engine backtracks. The regular expression has now consumed "1,2,3,4,5,6,7,8,9,10,11". Continuing from there, the next token is the dot <.>, which can match the following comma ",". But then <,> does not match the "1" of "12". The match fails and backtracking continues. As you can imagine, the number of such backtracking combinations is very large, and the engine may even crash.
There are several ways to prevent such massive backtracking:
A simple solution is to make the match as precise as possible by using a negated character set instead of the dot. With the regular expression <^([^,\r\n]*,){11}P>, the number of backtracking steps on failure is reduced to 11.
Another solution is to use an atomic group.
The purpose of an atomic group is to make the engine fail faster, which effectively prevents massive backtracking. The syntax of an atomic group is <(?>regex)>. Everything inside (?>...) is treated as a single unit. Once the match fails, the engine backtracks directly to the part of the regular expression before the atomic group. The earlier example can be written with an atomic group as <^(?>(.*?,){11})P>. As soon as the 12th field fails to match, the engine backtracks to the <^> in front of the atomic group.
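The two field-matching patterns give the same answers; they differ only in how much work a failure costs. A sketch assuming Python's `re` module, with made-up sample lines:

```python
import re

lazy_dot = re.compile(r"^(.*?,){11}P")          # dot can also swallow commas
negated = re.compile(r"^([^,\r\n]*,){11}P")     # each group is exactly one field

failing = "1,2,3,4,5,6,7,8,9,10,11,12,13"       # 12th field does not start with P
passing = "1,2,3,4,5,6,7,8,9,10,11,P12,13"      # 12th field starts with P

lazy_fail = lazy_dot.search(failing)
negated_fail = negated.search(failing)
lazy_pass = lazy_dot.search(passing)
negated_pass = negated.search(passing)

print(lazy_fail, negated_fail)   # None None
print(lazy_pass.group())         # 1,2,3,4,5,6,7,8,9,10,11,P
print(negated_pass.group())      # 1,2,3,4,5,6,7,8,9,10,11,P
```

With only 13 short fields the extra backtracking of the lazy-dot version is not noticeable; it grows quickly as the line gets longer. The sketch sticks to the negated character set because Python's `re` only gained atomic groups in version 3.11.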
14. Lookahead and lookbehind
Perl 5 introduced two powerful constructs: "lookahead" and "lookbehind", collectively called "lookaround". They are also known as "zero-length assertions". Like anchors such as <^> and <$>, they are zero-length, meaning that they do not consume characters of the string. The difference is that lookaround actually matches characters, but then gives up the match and returns only the result: match or no match. This is why they are called "assertions". They do not consume characters in the string; they only assert whether a match is possible.
Almost all regular expression implementations discussed in this article support lookahead and lookbehind. The exception is JavaScript, which traditionally supported only lookahead (lookbehind was added to the language in ECMAScript 2018).
· Positive and negative lookahead
Suppose we want to find a q that is not followed by a u. That is, either nothing follows the q, or the character that follows it is not a u. A solution using negative lookahead is <q(?!u)>. The syntax of negative lookahead is <(?!lookahead content)>.
Positive lookahead is similar: <(?=lookahead content)>.
If the lookahead content contains a capturing group, that group creates a backreference as usual. The lookahead itself, however, does not create a backreference and is not counted in the backreference numbering. This is because the text matched by the lookahead is discarded and only the result (match or no match) is kept. If you want to keep what the lookahead matched as a backreference, put a capturing group inside it: <(?=(regex))>.
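The point about backreferences can be checked with a short Python sketch. The pattern <q(?=(u\w))> and the test word "quit" are assumptions chosen for the illustration.

```python
import re

plain = re.compile(r"q(?=u)")           # lookahead without a capturing group
capturing = re.compile(r"q(?=(u\w))")   # capturing group inside the lookahead

# The lookahead itself is not counted as a group.
print(plain.groups, capturing.groups)   # 0 1

m = capturing.search("quit")
print(m.group(0))   # q   (the overall match does not include the lookahead text)
print(m.group(1))   # ui  (the group inside the lookahead is still captured)
print(m.span())     # (0, 1)
```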
· Positive and negative lookbehind
Lookbehind has the same effect as lookahead, but works in the opposite direction.
The syntax of negative lookbehind is: <(?<!lookbehind content)>
The syntax of positive lookbehind is: <(?<=lookbehind content)>
Compared with lookahead, there is an extra left angle bracket that indicates the direction.
Example: <(?<!a)b> matches a "b" that is not preceded by an "a".
Note that with lookahead, the regular expression inside the lookaround is matched starting at the current position in the string. With lookbehind, the engine first steps back from the current position (by one character in the example above) and then matches the regular expression inside the lookaround.
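As a small sketch in Python, the two lookbehind forms can be compared on the same text. The sample string "cab bed" is made up for the illustration.

```python
import re

text = "cab bed"   # "b" at index 2 follows an "a"; "b" at index 4 follows a space

# Negative lookbehind: "b" not preceded by "a".
not_after_a = [m.start() for m in re.finditer(r"(?<!a)b", text)]
# Positive lookbehind: "b" preceded by "a".
after_a = [m.start() for m in re.finditer(r"(?<=a)b", text)]

print(not_after_a)  # [4]
print(after_a)      # [2]
```

In both cases the match itself is only the single character "b"; the character examined by the lookbehind is not part of it.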
· A closer look inside the regular expression engine
Let's look at a simple example.
Apply the regular expression <q(?!u)> to the string "Iraq". The first token of the regular expression is <q>. As we know, the engine scans through the string until <q> matches. The fourth character "q" matches, and it is followed by nothing (the end of the string). The next token is the lookahead. The engine notes that it has entered the regular expression inside a lookahead. The next token is <u>, which does not match the void after "q", so the regular expression inside the lookahead fails. Because the lookahead is negative, this failure means that the lookahead as a whole succeeds. The match "q" is therefore returned.
Now apply the same regular expression to "quit". <q> matches "q". The next token is the <u> inside the lookahead, which matches the second character "u" of the string. The engine advances to the next character "i". However, the engine notes that the regular expression inside the lookahead is complete and has matched. It discards the matched text, which takes it back to the character "u".
Because the lookahead is negative, the successful match inside it causes the lookahead as a whole to fail. The engine therefore has to backtrack. Finally, since there is no other "q" for <q> to match, the entire match fails.
To make sure the implementation of lookahead is clear, let's apply <q(?=u)i> to "quit". <q> matches "q" first. The lookahead then matches "u" successfully; the matched text is discarded and only the result is kept. The engine steps back from the character "i" to "u". Because the lookahead succeeded, the engine continues with the next token <i>. But <i> does not match "u", so this attempt fails. Since there is no other "q" in the string, the entire regular expression fails.
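The outcomes of the walk-throughs above can be reproduced with a few lines of Python. The helper `first_match` is a hypothetical convenience function written for this sketch, and the last call (<q(?=u)> on "quit") is an extra case added for contrast.

```python
import re

def first_match(pattern, text):
    """Return (matched text, start index) of the first match, or None."""
    m = re.search(pattern, text)
    return None if m is None else (m.group(0), m.start())

print(first_match(r"q(?!u)", "Iraq"))    # ('q', 3)  negative lookahead succeeds at the end
print(first_match(r"q(?!u)", "quit"))    # None      "u" follows the only "q"
print(first_match(r"q(?=u)i", "quit"))   # None      <i> is tried against "u"
print(first_match(r"q(?=u)", "quit"))    # ('q', 0)  the "u" is checked but not consumed
```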
· More on the internal mechanism of the regular expression engine
Let's apply <(?<=a)b> to "thingamabob". The engine starts with the first token of the regular expression and the first character of the string. In this example the first token is a lookbehind, so the engine steps back one character and checks whether "a" matches there. The engine cannot step back, because there is no character before "t". The lookbehind therefore fails. The engine moves on to the next character "h". Once again it temporarily steps back one character and checks whether "a" matches. It finds a "t", so the lookbehind fails again.
The lookbehind keeps failing until the regular expression reaches the "m" in the string, where the positive lookbehind matches. Because the lookbehind is zero-length, the current position in the string is still "m". The next token is <b>, which fails to match "m". The next character is the second "a" in the string. The engine temporarily steps back one character and finds that <a> does not match "m".
The next character is the first "b" in the string. The engine temporarily steps back one character, finds that the lookbehind is satisfied, and <b> matches "b". The entire regular expression has therefore matched, and it returns the first "b" in the string.
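The position-by-position walk-through can be mimicked in Python. This is only a sketch of the idea, not how the engine is implemented: it tests the lookbehind <(?<=a)> and the literal "b" separately at every position, relying on the fact that a compiled pattern's `match(string, pos)` can still see the characters before `pos`.

```python
import re

text = "thingamabob"
lookbehind = re.compile(r"(?<=a)")

lookbehind_ok = []   # positions where an "a" precedes the current character
full_match = []      # positions where the lookbehind holds AND the character is "b"
for pos, char in enumerate(text):
    if lookbehind.match(text, pos) is not None:
        lookbehind_ok.append(pos)
        if char == "b":
            full_match.append(pos)

print(lookbehind_ok)  # [6, 8]  the "m" and the first "b"
print(full_match)     # [8]     only the first "b" satisfies both conditions
print(re.search(r"(?<=a)b", text).span())  # (8, 9)
```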
· Applications of lookahead and lookbehind
Let's look at an example: finding a 6-character word that contains "cat".
First, we can solve the problem without lookaround, for example:
<cat\w{3}|\wcat\w{2}|\w{2}cat\w|\w{3}cat>
Easy enough! However, this method becomes clumsy when you need to find words of 6 to 12 characters that contain "cat", "dog", or "mouse".
Now let's look at a solution that uses lookahead. In this example we have two basic requirements: first, the word must be 6 characters long; second, the word must contain "cat".
The regular expression that meets the first requirement is <\b\w{6}\b>. The regular expression that meets the second requirement is <\b\w*cat\w*\b>.
By combining the two, we can get the following regular expression:
<(?=\b\w{6}\b)\b\w*cat\w*\b>
Working through the matching process is left to the reader. Note, however, that lookahead does not consume characters. Therefore, once the lookahead has established that the word is six characters long, the engine continues with the rest of the regular expression from the start of the word.
With some optimization, we finally get the following regular expression:
<\b(?=\w{6}\b)\w{0,3}cat\w*>
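Both lookahead versions can be tried out in Python. The sample sentence is invented for this sketch; it mixes 6-character words containing "cat" with words that are too short, too long, or lack "cat".

```python
import re

combined = re.compile(r"(?=\b\w{6}\b)\b\w*cat\w*\b")
optimized = re.compile(r"\b(?=\w{6}\b)\w{0,3}cat\w*")

text = "bobcat catnip scatter locate cat category kitten"

print(combined.findall(text))   # ['bobcat', 'catnip', 'locate']
print(optimized.findall(text))  # ['bobcat', 'catnip', 'locate']
```

In the optimized version the lookahead has already fixed the word length at six, so "cat" can be preceded by at most three word characters, which is why <\w{0,3}> replaces <\w*>.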
15. Conditional tests in regular expressions
The syntax of a conditional test is <(?ifthen|else)>. The "if" part can be a lookahead or lookbehind expression. With lookahead, the syntax becomes <(?(?=regex)then|else)>. The else part is optional.
If the if part is true, the Regular Expression Engine tries to match the then part; otherwise, the engine tries to match the else part.
Note that lookahead does not consume any characters, so the then part or the else part is matched starting at the position where the if test began.
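Python's `re` module accepts only a group number or name as the condition, not a lookaround, so the following sketch emulates <(?(?=regex)then|else)> with an alternation of the form <(?:(?=regex)then|(?!regex)else)>. This is a swapped-in technique, and the token rule used here (digits if the text starts with a digit, letters otherwise) is an assumption made up for the example.

```python
import re

# if: the next character is a digit; then: match digits; else: match letters.
# Emulation of a lookahead conditional with two guarded alternatives.
token = re.compile(r"(?:(?=\d)\d+|(?!\d)[A-Za-z]+)")

print(token.fullmatch("2024") is not None)  # True   then part matches
print(token.fullmatch("abc") is not None)   # True   else part matches
print(token.fullmatch("12ab") is not None)  # False  then part cannot cover "ab"
print(token.search("12ab").group(0))        # 12     then part starts where the test began
```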
16. Adding comments to regular expressions
The syntax for adding a comment to a regular expression is: <(?#comment)>
For example, add a comment for a regular expression used to match a valid date:
(?#year)(19|20)\d\d[-/.](?#month)(0[1-9]|1[012])[-/.](?#day)(0[1-9]|[12][0-9]|3[01])
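The commented pattern can be used unchanged in Python, whose `re` module supports the <(?#comment)> syntax. In this sketch the year is written as four digits, and the test dates are assumptions chosen for the illustration. The pattern checks the format only; it does not know how many days each month has.

```python
import re

# Inline comments are ignored by the engine; the string is split across
# lines only for readability.
date = re.compile(
    r"(?#year)(19|20)\d\d[-/.]"
    r"(?#month)(0[1-9]|1[012])[-/.]"
    r"(?#day)(0[1-9]|[12][0-9]|3[01])"
)

m = date.fullmatch("2024-02-29")
print(m.groups())                    # ('20', '02', '29')
print(date.fullmatch("2024-13-01"))  # None  (month 13 is rejected)
```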