Preface:
Six months ago I became interested in regular expressions. I found a lot of material on the Internet and read many tutorials, and eventually discovered, while using the regular expression tool RegexBuddy, that its tutorial is very well written. It is the best regular expression tutorial I have seen, so I had wanted to translate it ever since. That wish was not fulfilled until the May 1 long holiday, and this article is the result. As for the title, using "simple" may seem too clichéd. But after reading through the original, I felt that only "simple" accurately conveys how the tutorial felt to me, so I could not avoid the cliché.
This article is a translation of the tutorial written by Goyvaerts for RegexBuddy. Copyright belongs to the original author. Reprinting is welcome, but out of respect for the work of the original author and the translator, please credit the source. Thank you.
1. What is a regular expression
Basically, a regular expression is a pattern used to describe a certain amount of text. "Regex" is short for "regular expression". This article uses <<regex>> to denote a specific regular expression.
A piece of literal text is the most basic pattern: it simply matches the same text.
2. Different regular expression engines
A regular expression engine is a piece of software that can process regular expressions. Typically, the engine is part of a larger application. In the software world, different regular expression engines are not fully compatible with each other. This tutorial focuses on the Perl 5 style of engine, which is the most widely used. We will also mention some differences from other engines. Many modern engines are similar but not identical, for example the .NET regex library and the JDK regex package.
3. Literal characters
The most basic regular expression consists of a single literal character, such as <<a>>, which matches the first occurrence of the character "a" in the string. For example, in the string "Jack is a boy", the "a" after the "J" is matched, and the second "a" is not.
The regular expression can also match the second "a", but only if you tell the regex engine to start searching after the first match. In a text editor, you can use "Find Next". In a programming language, there is usually a function that lets you continue searching from the position where the previous match ended.
Similarly, <<cat>> matches "cat" in "About cats and dogs". This is equivalent to telling the regex engine to find a <<c>>, immediately followed by an <<a>>, immediately followed by a <<t>>.
Note that regular expression engines are case sensitive by default. <<cat>> does not match "Cat" unless you tell the engine to ignore case.
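The behaviour described above can be tried out with a short sketch. It assumes Python's standard `re` module as the engine (the article itself is not tied to Python), and the variable names are only illustrative.

```python
import re

text = "Jack is a boy"
pattern = re.compile("a")

# First match: the "a" right after the "J".
first = pattern.search(text)

# "Find Next": resume the search where the previous match ended.
second = pattern.search(text, first.end())

# <<cat>> finds "cat" inside the word "cats".
cat_match = re.search("cat", "About cats and dogs")

# Matching is case sensitive unless the engine is told to ignore case.
strict = re.search("cat", "Cat")
relaxed = re.search("cat", "Cat", re.IGNORECASE)

print(first.start(), second.start(), cat_match.start(), strict, relaxed.group())
```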
• Special characters
Among the literal characters, 11 are reserved for special purposes. They are:
[ \ ^ $ . | ? * + ( ) These special characters are also called metacharacters.
If you want to use these characters as literal characters in a regular expression, you need to escape them with a backslash "\". For example, to match "1+1=2", the correct expression is <<1\+1=2>>.
Note that <<1+1=2>> is also a valid regular expression. However, it does not match "1+1=2". Instead it matches "111=2" in "123+111=234", because "+" has a special meaning here (repeat one or more times).
In programming languages, note that some special characters are processed by the compiler before being passed to the regex engine. So the regular expression <<1\+1=2>> must be written as "1\\+1=2" in C++. To match "c:\temp", you need the regular expression <<c:\\temp>>. In C++ source code, that regular expression becomes "c:\\\\temp".
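The same escaping rules can be checked with a small sketch, here assuming Python's `re` module rather than C++. Python's raw strings (`r"..."`) are one way to avoid doubling backslashes, and `re.escape` is a standard helper that escapes metacharacters for you. The variable names are illustrative.

```python
import re

# Escaped plus: matches the literal text "1+1=2".
literal = re.search(r"1\+1=2", "1+1=2")

# Unescaped plus means "one or more 1s", so a different text matches.
repeated = re.search(r"1+1=2", "123+111=234")
no_match = re.search(r"1+1=2", "1+1=2")

# Without a raw string, every regex backslash is doubled in source code,
# just like in the C++ string literals described in the text.
same_pattern = ("c:\\\\temp" == r"c:\\temp")
path = re.search(r"c:\\temp", "c:\\temp")

# re.escape() inserts the escaping backslashes automatically.
escaped = re.escape("1+1=2")
escaped_match = re.search(escaped, "1+1=2")
```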
• Non-printable characters
You can use special character sequences to represent certain non-printable characters:
<<\t>> represents a tab (0x09)
<<\r>> represents a carriage return (0x0D)
<<\n>> represents a line feed (0x0A)
Note that Windows text files use "\r\n" to end a line, while Unix uses "\n".
4. The internal working mechanism of the regular expression engine
Knowing how the regular expression engine works can help you quickly understand why a regular expression doesn't work as you expect.
There are two types of engines: text-directed engines and regex-directed engines. Jeffrey Friedl calls them DFA and NFA engines, respectively. This article covers regex-directed engines, because some very useful features, such as lazy quantifiers and backreferences, can only be implemented in a regex-directed engine. So it is no surprise that this type of engine is currently the most popular.
You can easily tell whether the engine you are using is text-directed or regex-directed. If backreferences or lazy quantifiers are supported, you can be sure the engine is regex-directed. You can also run the following test: apply the regular expression <<regex|regex not>> to the string "regex not". If the match is "regex", the engine is regex-directed. If the match is "regex not", it is text-directed. This is because a regex-directed engine is "eager": it is keen to report the first match it finds.
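The test above is easy to run. The sketch below applies it to Python's `re` module, chosen here only as an example engine. The label strings are illustrative.

```python
import re

# Apply the test pattern to the test string from the text.
result = re.search(r"regex|regex not", "regex not").group()

# A regex-directed engine stops at the first alternative that succeeds.
engine_kind = "regex-directed" if result == "regex" else "text-directed"
print(result, "->", engine_kind)
```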
• A regex-directed engine always returns the leftmost match
This is a very important point to understand: even if a "better" match could be found later, a regex-directed engine always returns the leftmost match.
When <<cat>> is applied to "He captured a catfish for his cat", the engine first compares <<c>> with "H", which fails. The engine then compares <<c>> with "e", which also fails. At the fourth character, <<c>> matches "c". <<a>> matches the fifth character. At the sixth character, <<t>> fails to match "p". The engine then restarts the match attempt from the fifth character. When it reaches the 15th character, <<cat>> matches "cat" in "catfish". The engine eagerly returns this first match and does not continue looking for a better one.
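The walk-through above can be imitated for a purely literal pattern. The helper `leftmost_literal` below is a hypothetical illustration, not how any real engine is implemented. It only shows the "try each start position, report the first success" behaviour and compares the result with Python's `re.search`.

```python
import re

def leftmost_literal(pattern: str, text: str) -> int:
    """Return the index of the leftmost occurrence of a literal pattern, or -1."""
    for start in range(len(text) - len(pattern) + 1):
        # Compare the pattern character by character at this start position.
        if all(text[start + i] == ch for i, ch in enumerate(pattern)):
            return start  # eager: report the first success and stop
    return -1

text = "He captured a catfish for his cat"
manual = leftmost_literal("cat", text)      # 0-based index of the 15th character
builtin = re.search("cat", text).start()
```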
5. Character Set
A character set is a set of characters enclosed in a pair of square brackets "[]". With a character set, you can tell the regex engine to match just one of several characters. To match an "a" or an "e", use <<[ae]>>. You can use <<gr[ae]y>> to match "gray" or "grey". This is especially useful when you are not sure whether the text you are searching uses American or British spelling. Conversely, <<gr[ae]y>> will not match "graay" or "graey". The order of the characters in the set does not matter, and the results are the same.
You can use a hyphen "-" to define a range of characters in a character set. <<[0-9]>> matches a single digit from 0 to 9. You can use more than one range: <<[0-9a-fA-F]>> matches a single hexadecimal digit, case insensitively. You can also combine ranges with single characters: <<[0-9a-fxA-FX]>> matches a hexadecimal digit or the letter x. Again, the order of the characters and ranges has no effect on the result.
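A quick sketch of these character sets, assuming Python's `re` module. The sample strings are made up for illustration.

```python
import re

spelling = re.compile(r"gr[ae]y")
hex_digit = re.compile(r"[0-9a-fA-F]")

# Only the two valid spellings match; "graay" and "graey" do not.
both = spelling.findall("gray grey graay graey")

# Each match is a single hexadecimal digit.
hex_only = hex_digit.findall("0x1F, g")
```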
• Some applications of character sets
Find a word that may be misspelled, such as <<sep[ae]r[ae]te>> or <<li[cs]en[cs]e>>.
Find identifiers in a programming language: <<[A-Za-z_][A-Za-z_0-9]*>>. (* means repeat zero or more times)
Find C-style hexadecimal numbers: <<0[xX][A-Fa-f0-9]+>>. (+ means repeat one or more times)
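The three applications above can be sketched as follows, assuming Python's `re` module. The input strings are invented for the example.

```python
import re

misspelled = re.compile(r"sep[ae]r[ae]te")
identifier = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")
c_hex = re.compile(r"0[xX][A-Fa-f0-9]+")

# Catches the correct spelling and a common misspelling.
words = misspelled.findall("separate seperate saperate")

# Identifiers may not start with a digit, so "42" is skipped.
names = identifier.findall("x1 = _tmp + 42")

# Both "0x" and "0X" prefixes are accepted.
numbers = c_hex.findall("mask = 0xFF | 0X1a")
```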
• Negated character sets
Placing a caret "^" right after the opening bracket "[" negates the character set. The result is that the character set matches any character that is not in the square brackets. Unlike ".", a negated character set can match line break characters.
The important thing to remember is that a negated character set must still match a character. <<q[^u]>> does not mean "a q that is not followed by a u". It means "a q followed by a character that is not a u". So it will not match the q in "Iraq", but it will match the q and the space in "Iraq is a country". The space is indeed part of the match, because it is "a character that is not a u".
If you want to match only a q, on the condition that the q is followed by a character that is not a u, you can use lookahead, which is covered later.
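A small sketch of the <<q[^u]>> example, assuming Python's `re` module. The last two lines illustrate the point that a negated set, unlike the dot in its default mode, matches a line break.

```python
import re

# Nothing follows the q, so the negated set has no character to match.
no_match = re.search(r"q[^u]", "Iraq")

# The space after the q is consumed as part of the match.
with_space = re.search(r"q[^u]", "Iraq is a country")

# A negated set matches a line break; the dot (by default) does not.
newline_ok = re.search(r"q[^u]", "Iraq\n")
dot_newline = re.search(r"q.", "Iraq\n")
```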
• Meta characters in the character set
It is important to note that only 4 characters have a special meaning inside a character set. They are: "] \ ^ -". "]" ends the character set definition, "\" is the escape character, "^" negates the set, and "-" defines a range. The other usual metacharacters are ordinary characters inside a character set and do not need to be escaped. For example, to search for an asterisk * or a plus +, you can use <<[+*]>>. If you do escape the usual metacharacters, your regular expression still works fine, but it becomes less readable.
To use a backslash "\" as a literal character inside a character set, you need to escape it with another backslash. <<[\\x]>> matches a backslash or an x. "]", "^" and "-" can either be escaped with a backslash or placed in a position where they cannot take on their special meaning. We recommend the latter because it improves readability. For example, if you place "^" anywhere except right after the opening bracket "[", it is treated as a literal character rather than as negation: <<[x^]>> matches an x or a ^. <<[]x]>> matches a "]" or an "x". <<[-x]>> or <<[x-]>> matches a "-" or an "x".
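These placement rules can be checked in Python's `re` module, which follows the Perl-style conventions described here. Other engines may differ, particularly for <<[]x]>>. The sample strings are illustrative.

```python
import re

# Usual metacharacters are plain literals inside a set.
plus_or_star = re.findall(r"[+*]", "1+2*3")

# A literal backslash inside a set must itself be escaped.
backslash_or_x = re.findall(r"[\\x]", "a\\x")

# "^" not in first position, "]" in first position, "-" at either edge.
x_or_caret = re.findall(r"[x^]", "x^y")
bracket_or_x = re.findall(r"[]x]", "a]x")
hyphen_first = re.findall(r"[-x]", "a-x")
hyphen_last = re.findall(r"[x-]", "a-x")
```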
• Shorthand character sets
Because some character sets are very common, there are shorthand notations for them.
<<\d>> represents <<[0-9]>>.
<<\w>> represents a word character. This varies somewhat between regex implementations. In most implementations, the word character set is <<[A-Za-z0-9_]>>.
<<\s>> represents a whitespace character. This also depends on the implementation. In most implementations it includes the space and tab characters, as well as carriage return and line feed, <<\r\n>>.
Shorthand character sets can be used inside or outside square brackets. <<\s\d>> matches a whitespace character followed by a digit. <<[\s\d]>> matches a single whitespace character or a digit. <<[\da-fA-F]>> matches a hexadecimal digit.
Shorthand for negated character sets:
<<[\S]>> = <<[^\s]>>
<<[\W]>> = <<[^\w]>>
<<[\D]>> = <<[^\d]>>
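A brief sketch of the shorthand sets, assuming Python's `re` module. In Python 3 these shorthands cover Unicode by default, so the `re.ASCII` flag is used here to keep them close to the ASCII sets described above.

```python
import re

# Whitespace followed by a digit.
space_then_digit = re.findall(r"\s\d", "a 1\t2b3", re.ASCII)

# Inside brackets: a single whitespace character or a digit.
space_or_digit = re.findall(r"[\s\d]", "a 1b", re.ASCII)

# Shorthand combined with ranges: hexadecimal digits.
hex_word = re.fullmatch(r"[\da-fA-F]+", "1aF9", re.ASCII)

# \S and [^\s] select the same characters.
non_space_runs = re.findall(r"\S+", "a b\tc", re.ASCII)
negated_runs = re.findall(r"[^\s]+", "a b\tc", re.ASCII)
```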
• Repetition of character sets
If you repeat a character set with the "?", "*" or "+" operator, you repeat the entire character set, not just the character it matched. The regular expression <<[0-9]+>> matches both 837 and 222.
If you want to repeat only the character that was matched, you can use a backreference. Backreferences are covered later.
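The difference can be sketched in Python's `re` module. The backreference pattern <<([0-9])\1+>> is an illustration chosen for this sketch and does not come from the text above: the group captures one digit and `\1` requires that same digit again.

```python
import re

any_digits = re.compile(r"[0-9]+")       # repeats the whole set
same_digit = re.compile(r"([0-9])\1+")   # repeats only the digit that matched

mixed_ok = any_digits.fullmatch("837")
repeated_ok = any_digits.fullmatch("222")

same_repeated = same_digit.fullmatch("222")
same_mixed = same_digit.fullmatch("837")
```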
6. Repetition with ?, * and +
?: tells the engine to match the preceding token zero times or once. In effect, it makes the preceding token optional.
+: tells the engine to match the preceding token one or more times.
*: tells the engine to match the preceding token zero or more times.
<< <[A-Za-z][A-Za-z0-9]*> >> matches an HTML tag without attributes. The "<" and ">" are literal characters. The first character set matches a letter, and the second character set matches a letter or a digit.
It might seem that we could also use << <[A-Za-z0-9]+> >>. But that would also match <1>. This regular expression is still good enough when you know that the string you are searching does not contain such invalid tags.
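The two tag patterns can be compared with a short sketch, assuming Python's `re` module. The <<colou?r>> pattern is an extra illustration of "?" added for this sketch, and the sample string is made up.

```python
import re

strict_tag = re.compile(r"<[A-Za-z][A-Za-z0-9]*>")
loose_tag = re.compile(r"<[A-Za-z0-9]+>")

sample = "<b> <h1> <1>"
strict_hits = strict_tag.findall(sample)  # rejects the invalid tag <1>
loose_hits = loose_tag.findall(sample)    # also accepts <1>

# "?" makes the preceding "u" optional.
optional = re.findall(r"colou?r", "color colour")
```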
• Restrictive repetition
Many modern regex implementations let you specify how many times a token is repeated. The syntax is {min,max}. Both min and max are non-negative integers. If the comma is present but max is omitted, there is no upper limit. If both the comma and max are omitted, the token is repeated exactly min times.
So {0,} is the same as *, and {1,} is the same as +.
You can use <<\b[1-9][0-9]{3}\b>> to match a number between 1000 and 9999 ("\b" denotes a word boundary). <<\b[1-9][0-9]{2,4}\b>> matches a number between 100 and 99999.
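A sketch of the bounded repetitions above, assuming Python's `re` module. The list of test numbers is invented to probe the boundaries.

```python
import re

four_digits = re.compile(r"\b[1-9][0-9]{3}\b")
three_to_five = re.compile(r"\b[1-9][0-9]{2,4}\b")

text = "99 100 1000 9999 10000 99999 100000"
in_1000_9999 = four_digits.findall(text)
in_100_99999 = three_to_five.findall(text)

# {0,} behaves like * and {1,} behaves like +.
zero_or_more = re.fullmatch(r"a{0,}", "")
one_or_more = re.fullmatch(r"a{1,}", "")
```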
• Beware of Greed
Suppose you want to match an HTML tag with a regular expression. You know the input will be a valid HTML file, so the regular expression does not need to exclude invalid tags. Anything between a pair of angle brackets should therefore be an HTML tag.
Many regex beginners first think of the regular expression << <.+> >>, and they are surprised by the result. For the test string "This is a <EM>first</EM> test", you might expect it to return <EM> and then, when matching continues, </EM>.
But that is not what happens. The regular expression matches "<EM>first</EM>". Obviously this is not the result we want. The reason is that "+" is greedy. In other words, "+" makes the regex engine repeat the preceding token as many times as possible. The engine backtracks only if this repetition causes the overall match to fail. That is, it gives up the last repetition and then tries the remainder of the regular expression.
Like "+", the "?" and "*" quantifiers are also greedy.
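The greedy behaviour is easy to reproduce. A minimal sketch, assuming Python's `re` module:

```python
import re

text = "This is a <EM>first</EM> test"

# Greedy "+" grabs as much as possible, then backtracks to the last ">".
greedy = re.search(r"<.+>", text).group()
print(greedy)
```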
• Deep inside the regular expression engine
Let's look at how the regex engine matches the previous example. The first token is "<", which is a literal character. The second token is ".", which matches the character "E", and then "+" keeps matching the remaining characters up to the end of the line. At the line break, the match fails ("." does not match line breaks). The engine then moves on to the next regex token and tries to match ">". So far, "<.+" has matched "<EM>first</EM> test". The engine tries to match ">" against the line break, and fails. The engine then backtracks, so that "<.+" matches "<EM>first</EM> tes". The engine then tries to match ">" against "t", which obviously fails. This process continues until "<.+" matches "<EM>first</EM" and ">" matches ">". The engine has then found the match "<EM>first</EM>". Remember that a regex-directed engine is eager, so it reports the first match it finds rather than continuing to backtrack, even though there might be a better match such as "<EM>". So we can see that, because "+" is greedy, the regex engine returns the longest leftmost match.
• Laziness instead of greed
One possible solution to the problem above is to make "+" lazy instead of greedy. You do this by putting a question mark "?" after the "+". The same works for repetition with "*", "{}" and "?". So in the example above we can use << <.+?> >>. Let's look at how the regex engine processes it.
Again, the regex token "<" matches the first "<" in the string. The next token is ".", this time repeated by a lazy "+". This tells the regex engine to repeat the preceding token as few times as possible. So the engine matches "." with the character "E" and then tries to match ">" against "M", which fails. The engine backtracks, but unlike in the previous example, because the repetition is lazy the engine extends it rather than shrinking it, so "<.+" now expands to "<EM". The engine continues with the next token ">". This time the match succeeds, and the engine reports "<EM>" as a successful match. That is roughly the whole process.
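The lazy variant can be compared with the greedy one in a short sketch, assuming Python's `re` module, which supports the same `+?` syntax:

```python
import re

text = "This is a <EM>first</EM> test"

# Lazy "+?" stops at the first ">" it can reach, so each tag is found separately.
lazy_tags = re.findall(r"<.+?>", text)

# The greedy version, for comparison, returns one long match.
greedy_tags = re.findall(r"<.+>", text)
```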
• An alternative to laziness
There is an even better alternative: use greedy repetition with a negated character set, << <[^>]+> >>. This is better because with lazy repetition the engine backtracks for each character before it finds a successful match, whereas the negated character set needs no backtracking.
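A minimal sketch of the negated-character-set alternative, assuming Python's `re` module. It only checks that the result matches the lazy version and does not measure the backtracking itself.

```python
import re

text = "This is a <EM>first</EM> test"

# Greedy repetition of "anything except >" cannot run past the closing bracket.
negated_tags = re.findall(r"<[^>]+>", text)
lazy_result = re.findall(r"<.+?>", text)
```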
Finally, remember that this tutorial only covers regex-directed engines. Text-directed engines do not backtrack, but they also do not support lazy quantifiers.
7. Using "." to match almost any character
In regular expressions, "." is one of the most commonly used symbols. Unfortunately, it is also one of the most easily misused.
"." matches a single character without caring what that character is. The only exception is the line break characters. In the engines discussed in this tutorial, "." does not match line breaks by default. So by default, "." is shorthand for the character set [^\n\r] (Windows) or [^\n] (Unix).
This exception exists for historical reasons. The early tools that used regular expressions were line based. They read a file line by line and applied the regular expression to each line separately. In these tools, the string never contained line breaks, so "." never matched one.
Modern tools and languages can apply regular expressions to very large strings and even entire files. All regex implementations discussed in this tutorial provide an option to make "." match all characters, including line breaks. In tools such as RegexBuddy, EditPad Pro or PowerGREP, you can simply select the "dot matches newline" option. In Perl, the mode in which "." matches line breaks is called "single-line mode". Unfortunately, this is a very confusing term, because there is also a so-called "multi-line mode". Multi-line mode only affects the anchors at the start and end of a line, whereas single-line mode only affects ".".
Other languages and regex libraries have adopted the terminology defined by Perl. When you use the regex classes in the .NET Framework, you can activate single-line mode with a statement similar to the following: Regex.Match("string", "regex", RegexOptions.Singleline)
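For comparison, Python's `re` module exposes the same switch under the name `re.DOTALL` (alias `re.S`). A minimal sketch with a made-up two-line string:

```python
import re

text = "line one\nline two"

# By default the dot refuses to match the line break.
default_mode = re.search(r"one.line", text)

# With DOTALL ("single-line mode" in Perl terms) the dot matches it.
dotall_mode = re.search(r"one.line", text, re.DOTALL)
```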
• Use the dot "." sparingly
The dot may well be the most powerful metacharacter. It allows you to be lazy: with a dot, you can match almost any character. The problem is that it often matches characters that should not be matched.
Here is a simple example. Let's see how to match a date in "mm/dd/yy" format while letting the user choose the separator. A solution that quickly comes to mind is <<\d\d.\d\d.\d\d>>. It seems to match the date "02/12/03". The problem is that 02512703 is also considered a valid date.
<<\d\d[-/.]\d\d[-/.]\d\d>> looks like a better solution. Remember that the dot is not a metacharacter inside a character set. This solution is still far from perfect: it matches "99/99/99". <<[0-1]\d[-/.][0-3]\d[-/.]\d\d>> goes a step further, although it still matches "19/39/99". How perfect your regular expression needs to be depends on what you want to achieve. If you are validating user input, it needs to be as exact as possible. If you are just parsing a known source that you know contains no bad data, a reasonably good regular expression that matches what you are looking for is enough.
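The three date patterns can be compared side by side. A sketch assuming Python's `re` module, where `fullmatch` is used so that the whole input must be a date:

```python
import re

loose = re.compile(r"\d\d.\d\d.\d\d")
separators = re.compile(r"\d\d[-/.]\d\d[-/.]\d\d")
ranged = re.compile(r"[0-1]\d[-/.][0-3]\d[-/.]\d\d")

def accepts(pattern: re.Pattern, value: str) -> bool:
    """Return True if the whole value matches the pattern."""
    return pattern.fullmatch(value) is not None

for value in ["02/12/03", "02512703", "99/99/99", "19/39/99"]:
    print(value, accepts(loose, value), accepts(separators, value), accepts(ranged, value))
```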
8. Anchors for the start and end of a string
Anchors differ from ordinary regex tokens in that they do not match any characters. Instead, they match a position before or after characters. "^" matches the position before the first character of the string. <<^a>> matches the "a" in the string "abc". <<^b>> does not match anything in "abc".
Similarly, "$" matches the position after the last character of the string. So <<c$>> matches the "c" in "abc".
• Application of Anchoring
Anchors are important when you validate user input in a programming language. If you want to verify that the user entered an integer, use <<^\d+$>>.
User input often contains extra leading or trailing whitespace. You can use <<^\s*>> and <<\s*$>> to match leading or trailing whitespace.
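Both uses of anchors can be sketched in Python's `re` module. The helper names `is_integer` and `trim` are hypothetical, chosen only for this example.

```python
import re

def is_integer(user_input: str) -> bool:
    """Accept the input only if it consists of digits from start to end."""
    return re.search(r"^\d+$", user_input) is not None

def trim(user_input: str) -> str:
    """Remove leading and trailing whitespace using anchored patterns."""
    without_leading = re.sub(r"^\s*", "", user_input)
    return re.sub(r"\s*$", "", without_leading)

print(is_integer("42"), is_integer("4a2"), repr(trim("  42 \t")))
```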
• Use "^" and "$" as the starting and ending anchors of the line
Suppose you have a string that contains more than one line, for example "first line\n\rsecond line" (where \n\r represents a line break). You often need to process each line separately rather than the whole string. For this reason, almost all regex engines provide an option that extends the meaning of the two anchors. "^" then matches at the start of the string (before the f) and after each line break (between \n\r and s). Similarly, "$" matches at the end of the string (after the last e) and before each line break (between e and \n\r).
In .NET, the following code makes the anchors match before and after each line break: Regex.Match("string", "regex", RegexOptions.Multiline)
Application: string str = Regex.Replace(original, "^", ">", RegexOptions.Multiline) inserts ">" at the beginning of each line.
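The same option exists in Python's `re` module as `re.MULTILINE`. The sketch below mirrors the .NET application above and uses a plain "\n" line break in its sample string.

```python
import re

text = "first line\nsecond line"

# Without the option, "^" matches only at the start of the string.
starts_default = re.findall(r"^\w+", text)

# With MULTILINE, "^" and "$" also match after and before each line break.
starts_multi = re.findall(r"^\w+", text, re.MULTILINE)
ends_multi = re.findall(r"\w+$", text, re.MULTILINE)

# Insert ">" at the beginning of every line.
quoted = re.sub(r"^", ">", text, flags=re.MULTILINE)
```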
• Absolute Anchoring
<<\A>> matches only at the start of the entire string, and <<\Z>> matches only at the end of the entire string. Even in "multi-line mode", <<\A>> and <<\Z>> never match at line breaks.
Although \Z and $ only match at the end of the string, there is one exception. If the string ends with a line break, \Z and $ match at the position before that line break rather than at the very end of the string. This "improvement" was introduced by Perl and has been followed by many regex implementations, including Java, .NET and others. If you apply <<^[a-z]+$>> to "joe\n", the match is "joe" rather than "joe\n".
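The trailing line break exception for "$" can be observed in Python's `re` module as well. One caveat is specific to Python and is not part of the text above: Python's `\Z` matches only at the very end of the string, so it does not share the exception. The sample strings are illustrative.

```python
import re

# "$" matches before the final line break, so the match is "joe".
dollar = re.search(r"^[a-z]+$", "joe\n")

# Python-specific: \Z insists on the very end of the string.
strict_end = re.search(r"[a-z]+\Z", "joe\n")

# \A ignores MULTILINE, while "^" starts matching on every line.
absolute_start = re.findall(r"\A[a-z]+", "joe\nann", re.MULTILINE)
line_starts = re.findall(r"^[a-z]+", "joe\nann", re.MULTILINE)
```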