A Regular Expression Tutorial


Preface:
About six months ago I became interested in regular expressions. I found a lot of material on the Internet and read many tutorials, and eventually discovered that the tutorial shipped with the regular expression tool RegexBuddy is very well written; it is arguably the best regular expression tutorial I have seen. So I have long wanted to translate it. This article is a translation of the tutorial that Jan Goyvaerts wrote for RegexBuddy. Copyright belongs to the original author. Reprinting is welcome, but out of respect for the work of the original author and the translator, please credit the source. Thank you.
1. What is a regular expression
Basically, a regular expression is a pattern that describes a certain amount of text. "Regex" is short for "regular expression". This article uses <<regex>> to denote a specific regular expression. A piece of literal text is the most basic pattern: it simply matches the same text.
2. Different regular expression engines
A regular expression engine is a piece of software that processes regular expressions. Typically the engine is part of a larger application. In the software world, different regular expression engines are not fully compatible with one another. This tutorial focuses on Perl 5-style engines, which are the most widely used, and mentions some differences from other engines. Many modern engines are similar but not identical, for example the .NET regex library and the JDK regex package.
3. Text symbols
The most basic regular expression consists of a single literal character, such as <<a>>, which matches the first occurrence of the character "a" in the string. Take the string "Jack is a boy": the "a" after "J" is matched, and the second "a" is not. The regular expression can also match the second "a", but only if you tell the engine to continue searching after the first match. In a text editor you can use "Find Next". In a programming language there is usually a function that lets you continue searching from the position where the previous match ended. Similarly, <<cat>> matches "cat" in "About cats and dogs". This is equivalent to telling the engine to find a <<c>>, immediately followed by an <<a>>, immediately followed by a <<t>>. Note that regular expression engines are case sensitive by default: <<cat>> does not match "Cat" unless you tell the engine to ignore case.
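The literal matching described above can be tried in Python, whose re module is assumed here as a stand-in for the Perl 5-style engines in this tutorial. The sample string "About Cats and dogs" with a capital C is an illustration, not part of the original text.

```python
import re

text = "Jack is a boy"

# The first search reports the leftmost "a", the one right after "J".
first = re.search(r"a", text)

# finditer keeps searching after each match, like "Find Next" in an editor.
positions = [m.start() for m in re.finditer(r"a", text)]

# Matching is case sensitive unless the engine is told otherwise.
strict = re.search(r"cat", "About Cats and dogs")
relaxed = re.search(r"cat", "About Cats and dogs", re.IGNORECASE)
```

Here positions holds 0-based indexes, so the first "a" is at index 1.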
• Special characters
Among literal characters, 11 are reserved for special purposes. They are: [ \ ^ $ . | ? * + ( ). These special characters are also called metacharacters.
If you want to use one of these characters as a literal in a regular expression, you need to escape it with a backslash "\". For example, to match "1+1=2" the correct expression is <<1\+1=2>>. Note that <<1+1=2>> is also a valid regular expression, but it does not match "1+1=2"; instead it matches "111=2" in "123+111=234", because "+" has a special meaning here (repeat one or more times). In programming languages, be aware that some special characters are processed by the compiler before the string is passed to the regex engine. So the regular expression <<1\+1=2>> must be written as "1\\+1=2" in C++ source code. To match "C:\temp" you need the regular expression <<C:\\temp>>, which in C++ becomes "C:\\\\temp".
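A minimal sketch of the escaping rules above, using Python instead of C++ as an assumption. Python raw strings (r"...") avoid the double escaping that compiled languages need, and re.escape builds the escaped form of a literal for you.

```python
import re

# Unescaped "+" is a quantifier: the pattern means "one or more 1s, then 1=2".
unescaped = re.search(r"1+1=2", "123+111=234")

# Escaping "+" makes it a literal plus sign.
escaped = re.search(r"1\+1=2", "1+1=2")

# Without a raw string, each backslash has to be doubled in the source code.
same_pattern = "1\\+1=2" == r"1\+1=2"

# re.escape produces the escaped form of arbitrary literal text.
path = re.search(re.escape("C:\\temp"), "dir C:\\temp")
```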
• Non-printable characters
You can use special character sequences to represent certain non-printable characters:
<<\t>> represents a tab (0x09)
<<\r>> represents a carriage return (0x0D)
<<\n>> represents a line feed (0x0A)
Note that Windows text files use "\r\n" to end a line, while UNIX uses "\n".
4. The internal working mechanism of the regular expression engine
Knowing how the regular expression engine works helps you quickly understand why a regular expression does not behave as you expect. There are two types of engines: text-directed engines and regex-directed engines. Jeffrey Friedl calls them DFA and NFA engines, respectively. This article is about regex-directed engines, because some very useful features, such as lazy quantifiers and backreferences, can only be implemented in a regex-directed engine. So it is no surprise that this is the most popular kind of engine today. You can easily tell whether the engine you are using is text-directed or regex-directed. If backreferences or lazy quantifiers are implemented, you can be sure the engine is regex-directed. You can also run the following test: apply the regular expression <<regex|regex not>> to the string "regex not". If the match is "regex", the engine is regex-directed. If the match is "regex not", it is text-directed. The reason is that a regex-directed engine is "eager": it reports the first match it finds.
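The engine test described above is easy to run. This sketch assumes Python's re module, which is a backtracking engine.

```python
import re

# The engine tries the alternatives left to right and stops at the first success.
m = re.search(r"regex|regex not", "regex not")
result = m.group()
```

The shorter alternative wins here, which is the behavior the text attributes to regex-directed engines.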
• The regex-directed engine always returns the leftmost match
This is a very important point to understand: even if a "better" match could be found later, a regex-directed engine always returns the leftmost match. When <<cat>> is applied to "He captured a catfish for his cat", the engine first compares <<c>> with "H", and fails. It then compares <<c>> with "e", and fails again. This goes on until the fourth character, where <<c>> matches "c". <<a>> then matches the fifth character. At the sixth character, <<t>> fails to match "p". The engine then starts over, trying the whole pattern from the fifth character. Only when it starts at the 15th character does <<cat>> match the "cat" in "catfish". The engine eagerly returns this first match without looking any further for a better one.
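The walkthrough above can be checked with a short sketch, again assuming Python's re module. Python counts positions from 0, so the 15th character has index 14.

```python
import re

text = "He captured a catfish for his cat"
m = re.search(r"cat", text)

# Index 14 is the "cat" inside "catfish", not the final word "cat".
found_at = m.start()
```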
5. Character Set
A character set is a set of characters enclosed in a pair of square brackets "[]". With a character set you can tell the regex engine to match only one of several characters. To match an "a" or an "e", use <<[ae]>>. You can use <<gr[ae]y>> to match either gray or grey, which is especially useful when you do not know whether the text you are searching uses American or British spelling. On the other hand, <<gr[ae]y>> will not match graay or graey. The order of the characters inside the set does not matter; the results are the same. You can use a hyphen "-" to define a range of characters inside a set. <<[0-9]>> matches a single digit from 0 to 9. You can use more than one range: <<[0-9a-fA-F]>> matches a single hexadecimal digit, case-insensitively. You can also combine ranges with single characters: <<[0-9a-fxA-FX]>> matches a hexadecimal digit or the letter x. Again, the order of the characters and ranges has no effect on the result.
• Some applications of character sets
Find a word even if it is misspelled, such as <<sep[ae]r[ae]te>> or <<li[cs]en[cs]e>>.
Find identifiers in a programming language: <<[A-Za-z_][A-Za-z_0-9]*>>. (* means repeat 0 or more times)
Find C-style hexadecimal numbers: <<0[xX][A-Fa-f0-9]+>>. (+ means repeat one or more times)
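The character set examples above can be sketched as follows, assuming Python's re module. The sample input strings are made up for illustration.

```python
import re

# One of "a" or "e" between "gr" and "y", exactly once.
spellings = re.findall(r"gr[ae]y", "gray grey graay graey")

# A letter or underscore, then any number of letters, digits or underscores.
identifiers = re.findall(r"[A-Za-z_][A-Za-z_0-9]*", "x1 = _tmp + 42")

# "0x" or "0X" followed by one or more hexadecimal digits.
hex_numbers = re.findall(r"0[xX][A-Fa-f0-9]+", "mask = 0xFF & 0X1a")
```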
• Negated character sets
Placing a caret "^" immediately after the opening bracket "[" negates the character set: it then matches any character that is not listed between the square brackets. Unlike ".", a negated character set can match line break characters. The important thing to remember is that a negated character set still has to match a character. <<q[^u]>> does not mean "a q that is not followed by a u". It means "a q followed by a character that is not a u". So it will not match the q in "Iraq", but it will match the q and the space in "Iraq is a country". The space is part of the match because it is "a character that is not a u". If you want to match only the q, on the condition that it is not followed by a u, this can be done with lookahead, which is covered later.
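A small sketch of the point above, assuming Python's re module: the negated set must consume a character, and that character may be a space or a line break.

```python
import re

end_of_string = re.search(r"q[^u]", "Iraq")            # nothing follows the q
with_space = re.search(r"q[^u]", "Iraq is a country")  # the space is consumed
with_newline = re.search(r"q[^u]", "Iraq\n")           # a line break counts too
```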
• Metacharacters inside character sets
Note that only 4 characters have a special meaning inside a character set: "] \ ^ -". "]" ends the character set definition, "\" is the escape character, "^" negates the set, and "-" defines a range. The other common metacharacters are ordinary characters inside a character set and do not need to be escaped. For example, to search for an asterisk or a plus sign you can use <<[+*]>>. Your regular expression will still work if you escape the usual metacharacters inside a set, but it becomes less readable. To use a backslash as a literal character inside a character set, escape it with another backslash: <<[\\x]>> matches a backslash or an x. The characters "]", "^" and "-" can either be escaped with a backslash or be placed where they cannot take on their special meaning. We recommend the latter because it is more readable. The caret "^" is a literal anywhere except immediately after the opening bracket "[", so <<[x^]>> matches an x or a ^. <<[]x]>> matches a "]" or an "x". <<[-x]>> or <<[x-]>> matches a "-" or an "x".
• Shorthand character sets
Because some character sets are used very often, shorthand notations exist for them.
<<\d>> stands for <<[0-9]>>.
<<\w>> stands for a word character. Exactly which characters it includes varies between implementations. In most regex implementations the word character set is <<[A-Za-z0-9_]>>.
<<\s>> stands for a whitespace character. This also depends on the implementation. In most implementations it includes the space and the tab, as well as the carriage return and line feed <<\r\n>>.
Shorthand character sets can be used both inside and outside square brackets. <<\s\d>> matches a whitespace character followed by a digit. <<[\s\d]>> matches a single character that is either whitespace or a digit. <<[\da-fA-F]>> matches a hexadecimal digit.
Shorthands for negated character sets:
<<\S>> = <<[^\s]>>
<<\W>> = <<[^\w]>>
<<\D>> = <<[^\d]>>
• Repetition of character sets
If you repeat a character set with the "?", "*" or "+" operator, you repeat the entire character set, not just the character that it happened to match. The regular expression <<[0-9]+>> matches 837 as well as 222. If you want to repeat only the character that was matched, you can use a backreference. Backreferences are covered later.
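To make the difference concrete, here is a sketch assuming Python's re module. The pattern <<([0-9])\1+>> is an illustration added here, not taken from the text above; it uses the backreference syntax that is explained in section 11.

```python
import re

# Each repetition of the set may pick a different digit.
any_digits = re.compile(r"[0-9]+")

# \1 must repeat exactly the digit that the group captured.
same_digit = re.compile(r"([0-9])\1+")
```

With these two patterns, any_digits accepts both "837" and "222", while same_digit accepts only "222".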
6. Repetition with ?, * and +
?: tells the engine to match the preceding character 0 times or once. In effect, it makes the preceding character optional.
+: tells the engine to match the preceding character 1 or more times.
*: tells the engine to match the preceding character 0 or more times.
<< <[A-Za-z][A-Za-z0-9]*> >> matches an HTML tag without attributes; "<" and ">" are literal characters here. The first character set matches a letter, and the second character set matches a letter or a digit, repeated zero or more times. It may seem that we could also use << <[A-Za-z0-9]+> >>, but that would also match <1>. Still, that regular expression is good enough when you know the string you are searching does not contain such invalid tags.
• Restrictive repetition
Many modern regex implementations let you specify how many times a character is repeated. The syntax is {min,max}, where min and max are non-negative integers. If the comma is present but max is omitted, there is no upper limit. If both the comma and max are omitted, the repetition is exactly min times. So {0,} is the same as *, and {1,} is the same as +. You can use <<\b[1-9][0-9]{3}\b>> to match a number between 1000 and 9999 ("\b" denotes a word boundary). <<\b[1-9][0-9]{2,4}\b>> matches a number between 100 and 99999.
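The two range patterns above can be tried as follows, assuming Python's re module. The lists of sample numbers are made up to probe the edges of each range.

```python
import re

# Exactly four digits, the first one non-zero: 1000 to 9999.
four_digits = re.findall(r"\b[1-9][0-9]{3}\b", "999 1000 9999 10000 0123")

# Three to five digits, the first one non-zero: 100 to 99999.
three_to_five = re.findall(r"\b[1-9][0-9]{2,4}\b", "99 100 99999 100000")
```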
• Beware of greediness
Suppose you want to match an HTML tag with a regular expression. You know the input is a valid HTML file, so the regular expression does not need to exclude invalid tags: anything between two angle brackets should be an HTML tag. Many regex beginners first think of the regular expression << <.+> >>, and are then surprised by the result. For the test string "This is a <EM>first</EM> test", you might expect it to return <EM> and then, when the search continues, </EM>. But that is not what happens. The regular expression matches "<EM>first</EM>", which is obviously not what we want. The reason is that "+" is greedy: it makes the engine repeat the preceding character as many times as possible. Only if this causes the overall match to fail does the engine backtrack, that is, give up the last repetition and then try the remainder of the regular expression again. Like "+", the "?" and "*" repetitions are also greedy.
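The surprising result is easy to reproduce. This is a minimal sketch assuming Python's re module.

```python
import re

text = "This is a <EM>first</EM> test"

# The greedy "+" runs to the end of the line and backtracks only as far as
# the last ">", so both tags and the text between them end up in one match.
greedy = re.search(r"<.+>", text).group()
```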
• Deep inside the regular expression engine
Let's look at how the regex engine matches the previous example. The first token is "<", a literal character. The second token is ".", which matches the character "E"; the "+" then keeps matching the remaining characters up to the end of the line. At the line break the dot fails ("." does not match line breaks). The engine then moves on to the next token and tries to match ">". At this point "<.+" has matched "<EM>first</EM> test". The engine tries to match ">" against the line break and fails. So the engine backtracks: "<.+" now matches "<EM>first</EM> tes". The engine then tries to match ">" against "t", which obviously fails. This process continues until "<.+" matches "<EM>first</EM" and ">" matches ">". The engine has then found the match "<EM>first</EM>". Remember that a regex-directed engine is eager: it reports the first match it finds rather than continuing to backtrack, even though there might be a better match such as "<EM>". So because "+" is greedy, the engine returns the longest of the leftmost matches.
• Replacing greediness with laziness
One way to fix the problem above is to make "+" lazy instead of greedy. You do this by putting a question mark "?" after the "+". The same works for the "*", "{}" and "?" repetitions. So in the example above we can use "<.+?>". Let's look at how the engine processes it. Again, the token "<" matches the first "<" in the string. The next token is ".", this time repeated by a lazy "+", which tells the engine to repeat the preceding character as few times as possible. So the engine matches "." with the character "E" and then tries to match ">" against "M", which fails. The engine backtracks, but unlike in the previous example, because the repetition is lazy, backtracking extends the repetition instead of shortening it. So "<.+" is now extended to "<EM". The engine again tries the next token ">", and this time it succeeds. The engine reports that "<EM>" has been matched. Otherwise the process is much the same as before.
• An alternative to laziness
There is an even better alternative: a greedy repetition of a negated character set, "<[^>]+>". This is better because with lazy repetition the engine backtracks for every character before it finds a successful match, whereas the negated character set needs no backtracking. Finally, remember that this tutorial only covers regex-directed engines. Text-directed engines do not backtrack, but they do not support lazy repetition either.
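Both fixes can be compared side by side in a short sketch, assuming Python's re module.

```python
import re

text = "This is a <EM>first</EM> test"

# Lazy repetition: stop at the first ">" that allows a match.
lazy = re.findall(r"<.+?>", text)

# Negated character set: the repetition can never run past a ">".
negated = re.findall(r"<[^>]+>", text)
```

Both patterns return the two tags separately; they differ only in how much work the engine does to get there.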
7. Use "." to match almost any character
In regular expressions the dot "." is one of the most commonly used symbols. Unfortunately, it is also one of the most commonly misused. The "." matches a single character, without caring what that character is. The only exception is the line break character: in the engines discussed in this tutorial, the dot does not match line breaks by default. So by default "." is shorthand for the character set [^\n\r] (Windows) or [^\n] (UNIX). This exception exists for historical reasons. The earliest tools that used regular expressions were line-based: they read a file line by line and applied the regular expression to each line separately. In those tools the string never contained a line break, so "." never matched one. Modern tools and languages can apply regular expressions to very large strings or even entire files. All the regex implementations discussed in this tutorial provide an option that makes "." match every character, including line breaks. In tools such as RegexBuddy, EditPad Pro or PowerGREP, you simply tick the "dot matches newline" option. In Perl, the mode in which "." matches line breaks is called "single-line mode". Unfortunately this is a very confusing term, because there is also a so-called "multi-line mode". Multi-line mode only affects the anchors at the start and end of a line, whereas single-line mode only affects ".". Other languages and regex libraries have adopted Perl's terminology. When you use the regex classes of the .NET Framework, you can activate single-line mode with a statement such as: Regex.Match("string", "regex", RegexOptions.Singleline)
• Use the dot sparingly
The dot is arguably the most powerful metacharacter. It allows you to be lazy: with a dot you can match almost any character. The problem, however, is that it often matches characters that should not be matched. A simple example illustrates this. Suppose we want to match a date in "mm/dd/yy" format, but let the user choose the separator. A solution that quickly comes to mind is <<\d\d.\d\d.\d\d>>. It looks like it matches the date "02/12/03" fine. The problem is that 02512703 is also considered a valid date. <<\d\d[-/.]\d\d[-/.]\d\d>> is a better solution. Remember that the dot is not a metacharacter inside a character set. This solution is still far from perfect: it matches "99/99/99". <<[0-1]\d[-/.][0-3]\d[-/.]\d\d>> goes a step further, although it still matches "19/39/99". How perfect your regular expression needs to be depends on what you want to achieve. If you are validating user input, it needs to be as strict as possible. If you are just parsing a known source that you know contains no bad data, a reasonably good regular expression that matches the text you are searching for is enough.
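The three date patterns above can be compared directly. This sketch assumes Python's re module; the variable names are made up for illustration.

```python
import re

loose = re.compile(r"\d\d.\d\d.\d\d")                     # dot as separator
strict = re.compile(r"\d\d[-/.]\d\d[-/.]\d\d")            # explicit separators
stricter = re.compile(r"[0-1]\d[-/.][0-3]\d[-/.]\d\d")    # limited leading digits

# The dot accepts any character, so a plain run of digits passes as a date.
digits_pass_loose = loose.fullmatch("02512703") is not None
digits_pass_strict = strict.fullmatch("02512703") is not None
```

Even the strictest of the three still accepts "19/39/99", as the text points out.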
8. Anchoring of string start and end
Anchors differ from ordinary regex tokens: they do not match any characters. Instead, they match a position before or after characters. "^" matches the position before the first character of the string. <<^a>> matches the "a" in the string "abc". <<^b>> does not match anything in "abc". Similarly, "$" matches the position after the last character of the string, so <<c$>> matches the "c" in "abc".
• Application of anchoring
Anchors are important when you validate user input in a programming language. To verify that the user entered an integer, use <<^\d+$>>. User input often contains extra leading or trailing spaces. You can use <<^\s*>> and <<\s*$>> to match leading and trailing whitespace.
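A minimal sketch of both uses, assuming Python's re module. The helper names is_integer and trim are hypothetical and exist only for this illustration.

```python
import re

def is_integer(user_input: str) -> bool:
    # ^ and $ anchor the digits to the start and end of the input.
    return re.search(r"^\d+$", user_input) is not None

def trim(user_input: str) -> str:
    # Remove leading whitespace first, then trailing whitespace.
    without_leading = re.sub(r"^\s*", "", user_input)
    return re.sub(r"\s*$", "", without_leading)
```

For example, is_integer(trim("  42  ")) accepts input that is_integer alone would reject.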
• Use "^" and "$" as the starting and ending anchors of the line
Suppose you have a string that contains more than one line, for example "first line\n\rsecond line" (where \n\r represents a line break). Often you need to process each line separately rather than the string as a whole. For this reason, almost all regex engines provide an option that extends the meaning of the two anchors. "^" then matches at the start of the string (before the f) and also after each line break (between \n\r and s). Similarly, "$" matches at the end of the string (after the last e) and also before each line break (between e and \n\r). In .NET, you make the anchors match before and after each line break with code like the following: Regex.Match("string", "regex", RegexOptions.Multiline). As an application: string str = Regex.Replace(original, "^", ">", RegexOptions.Multiline) inserts ">" at the beginning of each line.
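The same line-prefixing trick can be sketched in Python, used here as an assumption in place of .NET. The flag re.MULTILINE plays the role of RegexOptions.Multiline, and the sample string uses a plain "\n" line break.

```python
import re

text = "first line\nsecond line"

# Without MULTILINE, ^ matches only at the very start of the string.
single = re.sub(r"^", ">", text)

# With MULTILINE, ^ also matches right after each line break.
multi = re.sub(r"^", ">", text, flags=re.MULTILINE)
```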
• Absolute anchors
<<\A>> matches only at the start of the entire string, and <<\Z>> matches only at the end of the entire string. Even in "multi-line mode", <<\A>> and <<\Z>> never match at line breaks. Although \Z and $ match only at the end of the string, there is one exception. If the string ends with a line break, \Z and $ match at the position before that line break rather than at the very end of the string. This "improvement" was introduced by Perl and has been followed by many regex implementations, including Java, .NET and others. If you apply <<^[a-z]+$>> to "joe\n", the match is "joe" rather than "joe\n".
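The trailing line break exception can be observed with "$" in Python, which is assumed here as the test engine. Note one difference that applies to Python specifically: its \Z is stricter than the Perl-style \Z described above and matches only at the very end of the string.

```python
import re

# "$" matches just before the final line break, so the match is "joe".
m = re.search(r"^[a-z]+$", "joe\n")
matched = m.group()

# Python's \Z does not make the exception for a trailing line break.
python_Z = re.search(r"[a-z]+\Z", "joe\n")
```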
9. Word boundaries
The metacharacter <<\b>> is also an anchor that matches a position. Such a match is zero-length. There are 4 kinds of positions that count as word boundaries:
1 Before the first character of the string, if that character is a word character
2 After the last character of the string, if that character is a word character
3 Between a word character and a non-word character, where the non-word character immediately follows the word character
4 Between a non-word character and a word character, where the word character immediately follows the non-word character
A "word character" is a character that can be matched by "\w", and a "non-word character" is one that can be matched by "\W". In most regex implementations, the word characters are <<[a-zA-Z0-9_]>>. For example, <<\b4\b>> matches a 4 that stands on its own rather than being part of a larger number; it does not match the 4s in "44". In other words, you can almost say that <<\b>> matches at the start and end of an alphanumeric sequence. The negation of the word boundary is <<\B>>: it matches at any position between two word characters or between two non-word characters.
• Deep inside the regular expression engine
Let's apply the regular expression <<\bis\b>> to the string "This island is beautiful". The engine processes the token <<\b>> first. Since \b is zero-length, it examines the position before the first character "T". Because "T" is a word character and there is nothing before it, \b matches this word boundary. Then <<i>> fails to match the first character "T". The engine keeps advancing through the string. Between the fourth character "s" and the fifth character, a space, <<\b>> matches, but the space does not match <<i>>. Moving on, <<\b>> matches between the space and the sixth character "i", and <<is>> matches the sixth and seventh characters. However, the position before the eighth character is not a word boundary, so the second <<\b>> fails and the attempt fails again. At the 13th character "i", a word boundary is formed with the preceding space, and <<is>> matches "is". The engine then tries the second <<\b>>. Because the "s" and the 15th character, a space, form a word boundary, the match succeeds. The engine eagerly returns this successful match.
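The word boundary behavior walked through above can be verified with a short sketch, assuming Python's re module. The sample string for <<\b4\b>> is made up for illustration.

```python
import re

# The "is" inside "This" and "island" is skipped; the standalone word matches.
m = re.search(r"\bis\b", "This island is beautiful")
start = m.start()    # 0-based index 12, i.e. the 13th character

# Only a 4 with non-word characters (or nothing) on both sides qualifies.
standalone_four = re.findall(r"\b4\b", "4 44 a4 4.5")
```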
10. Alternation
The "|" in a regular expression represents alternation. It lets you match one of several possible regular expressions. To search for the words "cat" or "dog", use <<cat|dog>>. To add more options, just extend the list: <<cat|dog|mouse|fish>>. The alternation operator has the lowest precedence of all regex operators; that is, it tells the engine to match either everything to the left of the "|" or everything to its right. You can use parentheses to limit the scope of the alternation, as in <<\b(cat|dog)\b>>, which tells the engine to treat (cat|dog) as a single regex unit.
• Mind the eagerness of the regex engine
The regex engine is eager: as soon as it finds a valid match, it stops searching. Therefore, under certain conditions the order of the alternatives affects the result. Suppose you want to use a regular expression to search for a list of function names: Get, GetValue, Set or SetValue. An obvious solution is <<Get|GetValue|Set|SetValue>>. Let's see what happens when we search "SetValue". <<Get>> and <<GetValue>> fail, and then <<Set>> matches. Because a regex-directed engine is eager, it returns this first successful match, "Set", rather than going on to look for a better one. Contrary to what we expected, the regular expression does not match the entire string. There are several possible solutions. One is to take the eagerness of the engine into account and change the order of the alternatives, for example <<GetValue|Get|SetValue|Set>>, so that the longest alternative is tried first. We can also combine the four alternatives into two: <<Get(Value)?|Set(Value)?>>. Because the question mark is greedy, SetValue is always tried before Set.
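The effect of the order of alternatives can be seen in a few lines. This sketch assumes Python's re module.

```python
import re

# "Set" comes before "SetValue", so the eager engine stops at "Set".
naive = re.search(r"Get|GetValue|Set|SetValue", "SetValue").group()

# Longest alternatives first.
reordered = re.search(r"GetValue|Get|SetValue|Set", "SetValue").group()

# The greedy "?" tries to include "Value" before doing without it.
optional = re.search(r"Get(Value)?|Set(Value)?", "SetValue").group()
```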

A better solution is to use word boundaries:
<<\b(Get|GetValue|Set|SetValue)\b>> or <<\b(Get(Value)?|Set(Value)?)\b>>. Going further, since all the alternatives share the same ending, we can optimize the regular expression to <<\b(Get|Set)(Value)?\b>>.
11. Groups and backreferences
By placing part of a regular expression in parentheses you turn it into a group. You can then apply regex operations, such as the repetition operators, to the whole group. Note that only parentheses "()" form groups: "[]" defines a character set and "{}" defines a repetition. When a group is defined with "()", the regex engine numbers the groups in order and stores what each of them matched. To refer back to what a group matched, use the "\number" notation: <<\1>> refers to the first group, <<\2>> to the second group, and so on; <<\n>> refers to the nth group. <<\0>> refers to the entire match. Let's look at an example. Suppose you want to match the opening and closing tags of an HTML element together with the text in between, for example <B>This is a test</B>: we want to match <B> and </B> as well as the text between them. We can use the following regular expression: "<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>"
First, "<" matches the first character "<" of "<B>". Then [A-Z] matches "B", [A-Z0-9]* matches zero or more alphanumeric characters, and [^>]* matches zero or more characters that are not ">". The ">" in the regular expression then matches the ">" of "<B>". Next, the engine lazily matches the characters before the closing tag until it encounters "</". The "\1" in the regular expression then refers to what the group "([A-Z][A-Z0-9]*)" matched earlier, in this case the tag name "B". So the closing tag that has to be matched is "</B>". You can refer to the same group more than once: <<([a-c])x\1x\1>> matches "axaxa", "bxbxb" and "cxcxc". If the group referred to by a number did not take part in the match, the referenced content is simply empty. A backreference cannot be used inside the group it refers to: <<([abc]\1)>> is an error. For the same reason, you cannot use <<\0>> inside the regular expression itself; it can only be used in replacement operations. A backreference also cannot be used inside a character set: the <<\1>> in <<(a)[\1b]>> is not a backreference. Inside a character set, <<\1>> may be interpreted as an octal escape. Backreferences slow the engine down because it has to store what each group matched. If you do not need a backreference, you can tell the engine not to store a group, for example <<Get(?:Value)>>. The "?:" right after the "(" tells the engine that the group (Value) should not store its match for backreferencing.
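The tag-matching example can be sketched as follows, assuming Python's re module. The mismatched input with "</I>" is made up to show that the backreference really ties the closing tag to the opening one.

```python
import re

tag = re.compile(r"<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>")

m = tag.search("<B>This is a test</B>")
whole, name = m.group(0), m.group(1)

# \1 holds "B", so a closing tag with another name does not match.
mismatched = tag.search("<B>This is a test</I>")

# (?:...) groups without capturing, so the match has no numbered groups.
non_capturing = re.search(r"Get(?:Value)", "GetValue")
```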
• Repetition and backreferences
When a repetition operator is applied to a group, the stored backreference is overwritten on every repetition and only the last match is kept. For example, <<([abc]+)=\1>> matches "cab=cab", but <<([abc])+=\1>> does not. The reason is that when ([abc]) first matches "c", "\1" holds "c"; ([abc]) then goes on to match "a" and "b", so in the end "\1" holds "b". The expression therefore matches "cab=b". Application: checking for doubled words. When editing text it is easy to type a word twice, such as "the the". Such doubled words can be detected with <<\b(\w+)\s+\1\b>>. To remove the second word, simply use the replace function with "\1" as the replacement.
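Both points can be checked with a short sketch, assuming Python's re module. The sentence used for the doubled-word check is made up for illustration.

```python
import re

# The group contains the repetition, so \1 holds all of "cab".
full = re.search(r"([abc]+)=\1", "cab=cab")

# The repetition is outside the group, so \1 holds only the last letter, "b".
miss = re.search(r"([abc])+=\1", "cab=cab")
hit = re.search(r"([abc])+=\1", "cab=b")

# Doubled words: replace "word word" with the captured word.
doubled = re.compile(r"\b(\w+)\s+\1\b")
fixed = doubled.sub(r"\1", "Paris in the the spring")
```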
• Group naming and referencing
In PHP and Python, you can name a group with <<(?P<name>group)>>. Here "?P<name>" is the syntax that names the group, and name is the name you give it. You can refer to the group with <<(?P=name)>>.
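As a sketch of the Python syntax, here is the doubled-word pattern rewritten with a named group. The group name "word" and the sample sentence are made up for illustration.

```python
import re

# (?P<word>...) names the group; (?P=word) refers back to what it matched.
doubled = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")

m = doubled.search("it was the the best")
repeated = m.group("word")
```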
• Named groups in .NET
The .NET Framework also supports named groups. Unfortunately, the Microsoft programmers decided to invent their own syntax instead of following the Perl and Python convention. At the time the tutorial was written, no other regex implementation supported the syntax Microsoft invented. Here is a .NET example: (?<first>group)(?'second'group). As you can see, .NET offers two syntaxes for creating a named group: one uses angle brackets "<>" and the other uses single quotes "''". Angle brackets are more convenient inside strings, while single quotes are more useful in ASP code, where "<>" is used for HTML tags. To refer to a named group, use \k<name> or \k'name'. In a search-and-replace, you can refer to a named group with "${name}".
12. Matching modes
The regex engines discussed in this tutorial all support three matching modes:
<</i>> makes the regular expression case-insensitive.
<</s>> turns on "single-line mode", in which the dot "." matches line break characters.
<</m>> turns on "multi-line mode", in which "^" and "$" match at the positions after and before each line break.
• Open or close a pattern inside a regular expression
If you insert a mode modifier such as (?ism) inside a regular expression, it applies only to the part of the regular expression to its right. (?-i) turns case insensitivity off again. You can test this quickly: <<(?i)te(?-i)st>> should match "TEst", but not "teST" or "TEST".
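The three modes map onto flags in Python's re module, which is assumed here as the example engine. Python has no bare switch for turning a mode off in the middle of a pattern, so the scoped form (?i:...) is used below to approximate the on/off modifiers described above.

```python
import re

ignore_case = re.search(r"cat", "CAT", re.IGNORECASE)       # like /i
dot_all = re.search(r"a.b", "a\nb", re.DOTALL)               # like /s
multi_line = re.findall(r"^\w+", "one\ntwo", re.MULTILINE)   # like /m

# Case-insensitive for "te" only; "st" must match exactly.
scoped = re.compile(r"(?i:te)st")
```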
13. Atomic groups and preventing backtracking
In some special cases, backtracking can make the engine extremely inefficient. Let's look at an example: we want to match a string in which the fields are delimited by commas and the 12th field begins with P. The regular expression <<^(.*?,){11}P>> readily comes to mind. It works well under normal circumstances. In the extreme case, however, catastrophic backtracking occurs when the 12th field does not begin with P. Suppose the string to search is "1,2,3,4,5,6,7,8,9,10,11,12,13". At first the regular expression matches successfully up to the 12th field. At that point the preceding part of the regular expression has consumed "1,2,3,4,5,6,7,8,9,10,11,", and the next token <<P>> does not match the "1" of "12". So the engine backtracks, after which the regular expression has consumed "1,2,3,4,5,6,7,8,9,10,11". It continues with the next attempt: the token being repeated is the dot <<.>>, which can match the following comma ",". However, <<,>> then does not match the "1" of "12". The match fails and backtracking continues. As you can imagine, the number of such backtracking combinations is enormous, and this can cause the engine to crash.
There are several scenarios for preventing such a huge backtracking:
A simple solution is to make the match as accurate as possible. Use the inverse character set instead of the dot number. For example, we use the following regular expression <<^ ([^,\r\n]*,) {11}p>&gt, so that the number of failed backtracking can be reduced to 11 times. Another scenario is to use atomic groups. The purpose of the atomic group is to make the regular engine fail a little faster. Therefore, it can effectively prevent massive backtracking. The syntax of an atomic group is << (?> regular expression) >>. All regular expressions located between (?>) are considered to be a single regular symbol. Once the match fails, the engine will backtrack back to the regular expression section in front of the atomic group. The preceding example uses atomic groups to express <<^ (?> (. *?,) {one}) p>>. Once the 12th field match fails, the engine goes back to the <<^>> in front of the atomic group.
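The three variants above can be compared with a small sketch. It assumes Python's `re` module; atomic groups are only available there from Python 3.11, so that variant is guarded by a version check. The helper name `twelfth_field_starts_with_p` and the sample lines are made up for illustration.

```python
import re
import sys

LAZY_DOT = re.compile(r"^(.*?,){11}P")        # prone to heavy backtracking
NEGATED = re.compile(r"^([^,\r\n]*,){11}P")   # a field can no longer swallow a comma

# Atomic groups need Python 3.11+; on older versions this variant is skipped.
ATOMIC = re.compile(r"^(?>(.*?,){11})P") if sys.version_info >= (3, 11) else None

def twelfth_field_starts_with_p(pattern, line):
    """Return True if the 12th comma-separated field of line starts with P."""
    return pattern.match(line) is not None

good = "1,2,3,4,5,6,7,8,9,10,11,P12,13"
bad = "1,2,3,4,5,6,7,8,9,10,11,12,13"

for name, pattern in [("lazy dot", LAZY_DOT), ("negated class", NEGATED), ("atomic", ATOMIC)]:
    if pattern is not None:
        print(name, twelfth_field_starts_with_p(pattern, good),
              twelfth_field_starts_with_p(pattern, bad))
```

All variants give the same answers on these inputs; the difference lies in how much work the engine does before it reports the failure on the second line.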
14. Lookahead and lookbehind
Perl 5 introduced two powerful constructs: lookahead and lookbehind. Together they are also known as "zero-length assertions". Like anchors, they have zero length (meaning that they do not consume any of the string being matched). The difference is that lookahead and lookbehind actually match characters, but then discard the match and return only the result: match or no match. That is why they are called "assertions". They do not consume characters in the string; they only assert whether a match is possible. Almost all of the regular expression implementations discussed in this article support both lookahead and lookbehind. The one exception noted in the original tutorial is JavaScript, which at that time supported only lookahead.
• Positive and negative lookahead
As mentioned earlier, suppose we want to find a "q" that is not followed by a "u": either nothing follows the "q", or the character that follows is not a "u". A solution using negative lookahead is <<q(?!u)>>. The syntax for negative lookahead is <<(?!regex)>>. Positive lookahead is very similar: <<(?=regex)>>. If the regex inside the lookahead contains a capturing group, that group still produces a backreference. The lookahead itself, however, does not produce a backreference and is not counted when groups are numbered, because the lookahead's match is discarded and only the verdict (match or no match) is kept. If you want to keep what the lookahead matched as a backreference, use <<(?=(regex))>>.
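A minimal sketch of the lookahead syntax, assuming Python's `re` module (which uses the same <<(?!...)>> and <<(?=...)>> notation). The variable names are illustrative only.

```python
import re

no_u = re.compile(r"q(?!u)")       # a q that is not followed by a u

m_iraq = no_u.search("Iraq")       # the final q has nothing after it
m_quit = no_u.search("quit")       # the only q is followed by u

# A capturing group inside the lookahead keeps what the lookahead saw.
keep = re.search(r"q(?=(u))", "quit")

print(m_iraq.start(), m_quit, keep.group(0), keep.group(1))
```

Note that the overall match in the last case is just "q": the "u" is available through the group, but it was not consumed.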
• Positive and negative lookbehind
Lookbehind has the same effect as lookahead, but works in the opposite direction. The syntax for negative lookbehind is <<(?<!regex)>>, and the syntax for positive lookbehind is <<(?<=regex)>>. Compared with lookahead, there is an extra left angle bracket that indicates the direction. Example: <<(?<!a)b>> matches a "b" that is not preceded by an "a". It is worth noting that lookahead tries to match its regular expression starting at the current string position, whereas lookbehind first steps back one character from the current position and then tries to match its regular expression.
• Inside the regular expression engine
Let's look at a simple example: the regular expression <<q(?!u)>> applied to the string "Iraq". The first token of the regular expression is <<q>>. As we know, the engine steps through the string until <<q>> matches. When the fourth character, "q", is matched, nothing follows it (the string ends there). The next token is the lookahead. The engine notes that it is now inside a lookahead. The next token is <<u>>, which cannot match at the end of the string, so the regular expression inside the lookahead fails. Because this is a negative lookahead, that failure means the lookahead as a whole succeeds. So the match "q" is returned.
Now let's apply the same regular expression to "quit". <<q>> matches the "q". The next token is the <<u>> inside the lookahead, which matches the second character, "u", in the string. The engine advances to the next character, "i". However, the engine notices that the lookahead is now complete and that the regular expression inside it has matched. The engine then discards the matched text, which takes it back to the character "u". Because the lookahead is negative, a successful match inside it means the lookahead as a whole fails, so the engine has to backtrack. Finally, because there is no other "q" for <<q>> to match, the whole match fails.
To make sure the mechanics of lookahead are clear, let's apply <<q(?=u)i>> to "quit". <<q>> first matches "q". Then the lookahead successfully matches "u"; the matched text is discarded and only the verdict (a match is possible) is kept. The engine steps back from the character "i" to "u". Because the lookahead succeeded, the engine continues with the next token, <<i>>, and finds that <<i>> does not match "u". So this attempt fails, and because there is no other "q" later in the string, the entire regular expression fails to match.
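The zero-length behaviour walked through above can be observed directly by looking at match positions. This is a small sketch assuming Python's `re` module; the third pattern, <<q(?=u)ui>>, is added here only for contrast and does not come from the tutorial.

```python
import re

# The lookahead checks for "u" but the match itself covers only the "q".
peek = re.search(r"q(?=u)", "quit")
peek_span = peek.span()

# After the lookahead the engine is still standing on "u", so "i" cannot match.
stuck = re.search(r"q(?=u)i", "quit")

# Consuming the "u" explicitly after the lookahead lets the match proceed.
moved_on = re.search(r"q(?=u)ui", "quit")

print(peek_span, stuck, moved_on.group())
```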
• More on the internals of the regular expression engine
Let's apply <<(?<=a)b>> to "thingamabob". The engine starts with the lookbehind and the first character of the string. In this example, the lookbehind tells the engine to step back one character and check whether an "a" can be matched there. The engine cannot step back, because there is no character before the "t". So the lookbehind fails. The engine moves on to the next character, "h". Again it briefly steps back one character and checks whether an "a" can be matched. It finds a "t", so the lookbehind fails again. The lookbehind keeps failing until the engine reaches the "m" in the string, where the positive lookbehind succeeds because the preceding character is "a". Because the lookbehind is zero-length, the current position in the string is still "m". The next token is <<b>>, which fails to match "m". The next character is the second "a" in the string. The engine briefly steps back one character and finds that <<a>> does not match "m". The next character is the first "b" in the string. The engine steps back one character, finds that the lookbehind is satisfied, and <<b>> matches the "b". So the entire regular expression matches, and the result is the first "b" in the string.
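The result of this walk-through can be checked with a short sketch, again assuming Python's `re` module. The negative lookbehind <<(?<!a)b>> from the previous subsection is included for comparison.

```python
import re

word = "thingamabob"

# Positive lookbehind: a "b" that is preceded by "a".
after_a = re.search(r"(?<=a)b", word)

# Negative lookbehind: a "b" that is not preceded by "a".
not_after_a = re.search(r"(?<!a)b", word)

print(after_a.start(), not_after_a.start())
```

The positive form finds the first "b" (index 8, right after "a"), while the negative form skips it and finds the last "b" (index 10, after "o").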
• Applying lookahead and lookbehind
Let's look at an example: finding a six-letter word that contains "cat". First, we can solve the problem without lookaround, for example with <<cat\w{3}|\wcat\w{2}|\w{2}cat\w|\w{3}cat>>. That is easy enough. But when the requirement becomes finding a word of 6 to 12 letters that contains "cat", "dog" or "mouse", this approach becomes awkward. Let's see how lookahead handles it. In this example there are two basic requirements: first, the word must be six letters long; second, the word must contain "cat". The regular expression for the first requirement is <<\b\w{6}\b>>. The regular expression for the second requirement is <<\b\w*cat\w*\b>>. Combining the two gives <<(?=\b\w{6}\b)\b\w*cat\w*\b>>. The details of the matching process are left to the reader. Note, however, that lookahead does not consume characters, so once a word has been found to satisfy the six-letter condition, the engine continues matching the rest of the regular expression from the position where the check began. After some optimization, we arrive at the following regular expression: <<\b(?=\w{6}\b)\w{0,3}cat\w*>>
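A minimal sketch of the two lookahead-based patterns above, assuming Python's `re` module. The sample sentence is invented for illustration; it mixes six-letter words with shorter and longer words that also contain "cat".

```python
import re

COMBINED = re.compile(r"(?=\b\w{6}\b)\b\w*cat\w*\b")
OPTIMIZED = re.compile(r"\b(?=\w{6}\b)\w{0,3}cat\w*")

sample = "bobcat cattle catalog concat cat scatter"  # hypothetical test words

combined_hits = COMBINED.findall(sample)
optimized_hits = OPTIMIZED.findall(sample)

print(combined_hits)
print(optimized_hits)
```

Both patterns should pick out the same six-letter words; "catalog" and "scatter" are too long and "cat" is too short, so the lookahead rejects them before the rest of the pattern is tried.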
15. Conditionals in regular expressions
The syntax for a conditional is <<(?ifthen|else)>>. The "if" part can be a lookahead or lookbehind expression. With lookahead, the syntax becomes <<(?(?=regex)then|else)>>, where the else part is optional. If the "if" part evaluates to true, the engine attempts to match the "then" part; otherwise it attempts to match the "else" part. Remember that lookaround does not consume any characters, so the "then" or "else" part starts matching at the position where the "if" test began.
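Not every engine accepts a lookaround as the condition. Python's standard `re` module, for instance, only supports conditionals that test whether a capturing group took part in the match. The sketch below therefore does not use the conditional syntax itself; it emulates <<(?(?=regex)then|else)>> with the alternation <<(?:(?=regex)then|(?!regex)else)>>, which behaves the same way. The rule being checked (three digits if the token starts with a digit, letters otherwise) and the helper name `accepts` are made up for illustration.

```python
import re

# Emulates the conditional (?(?=\d)\d{3}|[A-Za-z]+):
#   if a digit comes next, require exactly three digits,
#   otherwise require one or more letters.
EMULATED = re.compile(r"^(?:(?=\d)\d{3}|(?!\d)[A-Za-z]+)$")

def accepts(token: str) -> bool:
    """Return True if token satisfies the if/then/else rule above."""
    return EMULATED.match(token) is not None

for token in ["123", "abc", "12", "1ab", "ab1"]:
    print(token, accepts(token))
```

Because the lookahead consumes nothing, both branches start matching at the same position as the test, which is exactly the behaviour described above.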
16. Adding comments to a regular expression
The syntax for adding a comment inside a regular expression is <<(?#comment)>>. Example: adding comments to a regular expression that matches a valid date:
  (?#year)(19|20)\d\d[- /.](?#month)(0[1-9]|1[012])[- /.](?#day)(0[1-9]|[12][0-9]|3[01])
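The commented date pattern can be used as it stands in engines that support <<(?#comment)>>. A short sketch, assuming Python's `re` module, which ignores such comments when matching:

```python
import re

DATE = re.compile(
    r"(?#year)(19|20)\d\d[- /.](?#month)(0[1-9]|1[012])[- /.](?#day)(0[1-9]|[12][0-9]|3[01])"
)

valid = DATE.fullmatch("2024-02-29")
invalid = DATE.fullmatch("2024-13-01")  # there is no month 13

# The first group captures only the century, because the pattern groups (19|20).
print(valid.groups(), invalid)
```

The comments do not create groups, so the group numbers are the same as in the uncommented pattern.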
