Preface:
Half a year ago, I became interested in regular expressions. I searched the Internet for material and read many tutorials. Eventually, while using the regular expression tool RegexBuddy, I found that its tutorial is very good; it is arguably the best regular expression tutorial I have seen. So I have long wanted to translate it.
This article is a translation of the tutorial written by Jan Goyvaerts for RegexBuddy. The copyright belongs to the original author. You are welcome to reprint this article, but out of respect for the work of the original author and the translator, please cite the source. Thank you!
1. What is a regular expression
Basically, a regular expression is a pattern used to describe a certain amount of text. "Regex" is short for "regular expression". This article uses <<regex>> to denote a specific regular expression.
A piece of text is the most basic pattern, which simply matches the same text.
2. Different regular expression engines
A regular expression engine is a piece of software that can process regular expressions. Typically, the engine is part of a larger application. In the software world, different regular expression engines are not fully compatible with each other. This tutorial focuses on the Perl 5 style of engine, which is the most widely used. We'll also mention some of the differences from other engines. Many modern engines are very similar but not exactly the same, for example the .NET regex library and the JDK regex package.
3. Text symbols
The most basic regular expression consists of a single literal character, such as <<a>>. It matches the first occurrence of the character "a" in the string. In the string "Jack is a boy", the "a" after the "J" is matched. The second "a" is not matched.
The regular expression can also match the second "a", but only when you tell the regular expression engine to continue searching from the position where the first match ended. In a text editor, you can use "Find Next". In a programming language, there is usually a function that lets you continue searching onward from the end of the previous match.
Similarly, <<cat>> matches "cat" in "About cats and dogs". This is equivalent to telling the regular expression engine to find a <<c>>, immediately followed by an <<a>>, immediately followed by a <<t>>.
Note that the regular expression engine is case-sensitive by default. <<cat>> does not match "Cat" unless you tell the engine to ignore case.
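The behaviour described above can be tried out with a short sketch. Python and its standard re module are used here as an assumption (the tutorial itself is not tied to any language); the variable names are made up for illustration.

```python
import re

text = "Jack is a boy"

# re.search reports only the first (leftmost) occurrence of "a".
first = re.search(r"a", text)
first_pos = first.start()            # 1, the "a" after "J"

# Continue from where the first match ended, like "Find Next" in an editor.
second = re.compile(r"a").search(text, first.end())
second_pos = second.start()          # 8, the standalone "a"

# Matching is case-sensitive unless a flag says otherwise.
strict = re.search(r"cat", "Cat")                    # None
relaxed = re.search(r"cat", "Cat", re.IGNORECASE)    # matches "Cat"

print(first_pos, second_pos, strict, relaxed.group())
```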
For literal characters, 12 characters are reserved for special purposes. They are:
[ ] \ ^ $ . | ? * + ( )
These special characters are also known as meta-characters.
If you want to use these characters as literal characters in a regular expression, you need to escape them with a backslash "\". For example, if you want to match "1+1=2", the correct expression is <<1\+1=2>>.
It is important to note that <<1+1=2>> is also a valid regular expression. However, it does not match "1+1=2"; instead it matches "111=2" in "123+111=234", because "+" here has a special meaning (repeat one or more times).
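A small sketch of the difference between the escaped and unescaped plus sign, again assuming Python's re module. re.escape is a standard-library helper that escapes metacharacters in literal text for you.

```python
import re

# Unescaped: "+" is a quantifier, so the pattern means "one or more 1s, then 1=2".
unescaped = re.search(r"1+1=2", "123+111=234")
no_match = re.search(r"1+1=2", "1+1=2")        # None: there is no "11" before "="

# Escaped: "\+" is a literal plus sign.
escaped = re.search(r"1\+1=2", "1+1=2")

# re.escape builds a pattern that matches the given text literally.
auto = re.search(re.escape("1+1=2"), "1+1=2")

print(unescaped.group())   # 111=2
print(escaped.group())     # 1+1=2
```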
In programming languages, note that some special characters are processed by the compiler before the string is passed to the regular expression engine. So the regular expression <<1\+1=2>> must be written as "1\\+1=2" in C++ source code. To match "C:\temp", you need the regular expression <<C:\\temp>>. In C++ source code, this regular expression becomes "C:\\\\temp".
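The same two layers of escaping exist in other languages. As an illustration (Python is an assumption here, not the tutorial's language), an ordinary string literal needs the doubled backslashes, while a raw string literal passes backslashes to the engine unchanged.

```python
import re

path = "C:\\temp"    # the 7-character string C:\temp

# Ordinary literal: the language consumes one level of backslashes,
# so four backslashes in source reach the regex engine as two.
plain = re.search("C:\\\\temp", path)

# Raw literal: what you type is what the regex engine receives.
raw = re.search(r"C:\\temp", path)

print(len(path), plain.group() == raw.group())   # 7 True
```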
You can use special character sequences to represent certain non-printable characters:
<<\t>> represents a tab (0x09)
<<\r>> represents a carriage return (0x0D)
<<\n>> represents a line feed (0x0A)
Note that Windows text files use "\r\n" to end a line, while UNIX uses "\n".
4. The internal working mechanism of the regular expression engine
Knowing how the regular expression engine works helps you quickly understand why a regular expression doesn't work as you expect.
There are two types of engines: text-directed engines and regex-directed engines. Jeffrey Friedl calls them DFA and NFA engines, respectively. This article discusses regex-directed engines, because some very useful features, such as lazy quantifiers and backreferences, can only be implemented in a regex-directed engine. So it's no surprise that this is the most popular kind of engine at the moment.
You can easily tell whether the engine you are using is text-directed or regex-directed. If backreferences or lazy quantifiers are supported, you can be sure the engine is regex-directed. You can also run the following test: apply the regular expression <<regex|regex not>> to the string "regex not". If the match is "regex", the engine is regex-directed. If the match is "regex not", it is text-directed. The reason is that a regex-directed engine is "eager": it is impatient to report the first match it finds.
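The test above can be run directly. This sketch assumes Python's re module, which is a backtracking (regex-directed) engine, so the shorter alternative wins.

```python
import re

# A regex-directed engine tries the alternatives left to right and
# reports the first one that succeeds.
result = re.search(r"regex|regex not", "regex not").group()

print(result)   # regex
```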
- The regex-directed engine always returns the leftmost match
This is an important point to understand: even if a "better" match could be found later in the string, a regex-directed engine always returns the leftmost match.
When applying <<cat>> to "He captured a catfish for his cat", the engine first compares <<c>> with "H", which fails. Then it compares <<c>> with "e", which also fails. This continues until the fourth character, where <<c>> matches "c". <<a>> then matches the fifth character. At the sixth character, <<t>> fails to match "p". The engine then starts over, trying the match again from the fifth character. Only when it starts at the 15th character does <<cat>> match the "cat" in "catfish". The regular expression engine eagerly returns this first match without looking for any other, possibly better, match.
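To see the leftmost match reported first, here is a minimal sketch, assuming Python's re module. Note that Python counts positions from 0, so the 15th character has index 14.

```python
import re

text = "He captured a catfish for his cat"

# search() stops at the leftmost match: the "cat" inside "catfish".
first = re.search(r"cat", text)
first_span = first.span()            # (14, 17)

# finditer() keeps going and also finds the later, standalone "cat".
all_starts = [m.start() for m in re.finditer(r"cat", text)]

print(first_span, all_starts)        # (14, 17) [14, 30]
```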
5. Character Set
A character set is a set of characters enclosed in square brackets "[]". Using a character set, you can tell the regular expression engine to match only one out of several characters. If you want to match an "a" or an "e", use <<[ae]>>. You can use <<gr[ae]y>> to match gray or grey. This is especially useful when you're not sure whether the text you're searching uses American or British spelling. Conversely, <<gr[ae]y>> does not match graay or graey. The order of the characters inside a character set does not matter; the result is the same.
You can use the hyphen "-" to define a range of characters inside a character set. <<[0-9]>> matches a single digit from 0 to 9. You can use more than one range. <<[0-9a-fA-F]>> matches a single hexadecimal digit, regardless of case. You can also combine ranges with single characters. <<[0-9a-fxA-FX]>> matches a hexadecimal digit or the letter x in either case. Again, the order of the characters and ranges has no effect on the result.
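A quick sketch of character sets and ranges, assuming Python's re module. The sample strings are invented for illustration.

```python
import re

# One character out of the set: matches "gray" and "grey" only.
spellings = re.findall(r"gr[ae]y", "gray grey graay graey")

# Two ranges plus a third: any single hexadecimal digit, either case.
hex_digits = re.findall(r"[0-9a-fA-F]", "x7Fg")

print(spellings)    # ['gray', 'grey']
print(hex_digits)   # ['7', 'F']
```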
- Some applications of character sets
Find a word that may be misspelled, such as <<sep[ae]r[ae]te>> or <<li[cs]en[cs]e>>.
Find identifiers in a programming language: <<[A-Za-z_][A-Za-z_0-9]*>>. (* means repeat zero or more times)
Find C-style hexadecimal numbers: <<0[xX][A-Fa-f0-9]+>>. (+ means repeat one or more times)
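The three applications above might look like this in practice. This is a sketch assuming Python's re module, with made-up input strings.

```python
import re

# Catch common misspellings of "separate".
misspelled = re.findall(r"sep[ae]r[ae]te", "separate seperate")

# Identifiers: a letter or underscore, then letters, digits or underscores.
identifiers = re.findall(r"[A-Za-z_][A-Za-z_0-9]*", "_count2 = total + 1")

# C-style hexadecimal numbers.
hex_numbers = re.findall(r"0[xX][A-Fa-f0-9]+", "mask = 0x1F | 0Xff")

print(misspelled)    # ['separate', 'seperate']
print(identifiers)   # ['_count2', 'total']
print(hex_numbers)   # ['0x1F', '0Xff']
```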
- Negated character sets
Placing a caret "^" immediately after the opening square bracket "[" negates the character set. The character set then matches any character that is not listed inside the square brackets. Unlike ".", a negated character set does match line break characters.
It is important to remember that a negated character set must still match one character. <<q[^u]>> does not mean "match a q that is not followed by a u". It means "match a q followed by a character that is not a u". So it does not match the q in "Iraq", but it does match the q and the following space in "Iraq is a country". The space is part of the match, because it is "a character that is not a u".
If you only want to match the q itself, on the condition that it is followed by something other than a u, you can use the lookahead that we'll discuss later.
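The point that a negated set must consume a character can be checked with a short sketch (Python's re module assumed).

```python
import re

pattern = re.compile(r"q[^u]")

# Nothing follows the q, so [^u] has no character to match.
at_end = pattern.search("Iraq")

# Here the space after the q is consumed as the "not a u" character.
with_space = pattern.search("Iraq is a country").group()

print(at_end, repr(with_space))   # None 'q '
```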
- Metacharacters in the character set
It is important to note that only 4 characters have a special meaning inside a character set. They are: "]", "\", "^" and "-". "]" ends the character set definition; "\" escapes the next character; "^" negates the set; "-" defines a range. The other usual metacharacters are ordinary characters inside a character set and do not need to be escaped. For example, to search for an asterisk or a plus sign, you can use <<[+*]>>. Of course, your regular expression will work just as well if you do escape those metacharacters, but that reduces readability.
Inside a character set, to use a backslash "\" as a literal character rather than as the escape character, you need to escape it with another backslash. <<[\\x]>> matches a backslash or an x. The characters "]", "^" and "-" can either be escaped with a backslash or be placed in a position where they cannot take on their special meaning. We recommend the latter because it improves readability. For example, place "^" anywhere except immediately after the opening bracket "[", and it is treated as a literal character rather than as negation: <<[x^]>> matches an x or a ^. <<[]x]>> matches a "]" or an "x". <<[-x]>> or <<[x-]>> matches a "-" or an "x".
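The placement rules above can be verified one by one. This sketch assumes Python's re module, which follows the same conventions for these cases. The inputs are invented.

```python
import re

plus_or_star = re.findall(r"[+*]", "2+3*4")    # metacharacters need no escape
backslash_or_x = re.findall(r"[\\x]", r"a\x")  # escaped backslash inside a set
x_or_caret = re.findall(r"[x^]", "x^y")        # "^" not first: literal
bracket_or_x = re.findall(r"[]x]", "a]x")      # "]" first: literal
hyphen_or_x = re.findall(r"[-x]", "a-x")       # "-" first: literal

print(plus_or_star, backslash_or_x, x_or_caret, bracket_or_x, hyphen_or_x)
```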
- Shorthand for character sets
Because some character sets are very common, there are shorthand notations for them.
<<\d>> represents <<[0-9]>>;
<<\w>> represents a word character. Exactly which characters this includes varies between regular expression implementations. In most implementations, the word character set is <<[A-Za-z0-9_]>>.
<<\s>> represents a "whitespace character". This also depends on the implementation. In most implementations it includes the space and tab characters as well as the line break characters <<\r\n>>.
A shorthand character set can be used both inside and outside square brackets. <<\s\d>> matches a whitespace character followed by a digit. <<[\s\d]>> matches a single whitespace character or digit. <<[\da-fA-F]>> matches a hexadecimal digit.
Shorthands for the negated character sets:
<<[\S]>> = <<[^\s]>>
<<[\W]>> = <<[^\w]>>
<<[\D]>> = <<[^\d]>>
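A brief sketch of the shorthands, assuming Python's re module. One caveat specific to Python 3: for text strings, \d, \w and \s also match non-ASCII digits, letters and whitespace unless the re.ASCII flag is given. The inputs below are plain ASCII, so this does not affect the results.

```python
import re

# Outside brackets: whitespace followed by a digit.
space_then_digit = re.findall(r"\s\d", "a 1\t2b3")

# Inside brackets: a single whitespace character or digit.
space_or_digit = re.findall(r"[\s\d]", "a 1b")

# Shorthand combined with ranges: hexadecimal digits.
hex_digits = re.findall(r"[\da-fA-F]", "0xBEEF")

# Negated shorthand: anything that is not a digit.
non_digits = re.findall(r"\D", "a1b2")

print(space_then_digit, space_or_digit, hex_digits, non_digits)
```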
- Repetition of character sets
If you repeat a character set with the "?", "*" or "+" operator, you repeat the entire character set, not just the character that it matched. The regular expression <<[0-9]+>> matches 837 as well as 222.
If you want to repeat only the character that was matched, you can use a backreference to achieve that. We'll discuss backreferences later.
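To make the contrast concrete, here is a small sketch assuming Python's re module. The backreference pattern <<([0-9])\1+>> is an illustrative choice, not taken from the tutorial: it captures one digit and then requires that same digit again.

```python
import re

# Repeating the set: each repetition may match a different digit.
any_digits = re.fullmatch(r"[0-9]+", "837")

# Repeating the matched character: \1 must equal what the group captured.
same_digit_837 = re.fullmatch(r"([0-9])\1+", "837")
same_digit_222 = re.fullmatch(r"([0-9])\1+", "222")

print(any_digits.group(), same_digit_837, same_digit_222.group())
# 837 None 222
```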
6. Repetition with ?, * and +
?: tells the engine to match the preceding character zero or one time. In effect, it makes the preceding character optional.
+: tells the engine to match the preceding character one or more times
*: tells the engine to match the preceding character zero or more times
<< <[A-Za-z][A-Za-z0-9]*> >> matches an HTML tag without attributes; "<" and ">" are literal characters. The first character set matches one letter, and the second character set matches a letter or digit, zero or more times.
It may seem that we could also use << <[A-Za-z0-9]+> >>. But that would also match <1>. Still, this regular expression is good enough when you know that the string you are searching does not contain such invalid tags.
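The three quantifiers and the two tag patterns can be compared side by side. This is a sketch assuming Python's re module; the <<colou?r>> pattern is an added illustration of "?", not from the tutorial.

```python
import re

# "?" makes the preceding character optional.
optional_u = re.findall(r"colou?r", "color colour")

tags = "<B> <h1> <1>"

# Letter first, then letters or digits: rejects <1>.
strict_tags = re.findall(r"<[A-Za-z][A-Za-z0-9]*>", tags)

# Looser version: also accepts the invalid tag <1>.
loose_tags = re.findall(r"<[A-Za-z0-9]+>", tags)

print(optional_u)    # ['color', 'colour']
print(strict_tags)   # ['<B>', '<h1>']
print(loose_tags)    # ['<B>', '<h1>', '<1>']
```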
Many modern regular expression implementations let you specify how many times a character is repeated. The syntax is {min,max}, where min and max are non-negative integers. If the comma is present but max is omitted, there is no upper limit. If both the comma and max are omitted, the character is repeated exactly min times.
So {0,} is the same as *, and {1,} is the same as +.
You can use <<\b[1-9][0-9]{3}\b>> to match a number between 1000 and 9999 ("\b" represents a word boundary). <<\b[1-9][0-9]{2,4}\b>> matches a number between 100 and 99999.
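The two number patterns can be checked against values just inside and outside their ranges. A minimal sketch, assuming Python's re module:

```python
import re

# Exactly four digits, the first one non-zero: 1000 to 9999.
four_digits = re.findall(r"\b[1-9][0-9]{3}\b", "999 1000 9999 10000")

# Three to five digits, the first one non-zero: 100 to 99999.
three_to_five = re.findall(r"\b[1-9][0-9]{2,4}\b", "99 100 99999 100000")

print(four_digits)     # ['1000', '9999']
print(three_to_five)   # ['100', '99999']
```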
Suppose you want to match an HTML tag with a regular expression. You know that the input will be a valid HTML file, so the regular expression does not need to exclude those invalid tags. So if the content is between two angle brackets, it should be an HTML tag.
Many beginners will first think of the regular expression << <.+> >>, and they will be surprised by what it does with the test string "This is a <EM>first</EM> test". You might expect it to return "<EM>" first, and then "</EM>" when you continue matching.
But that is not what happens. The regular expression matches "<EM>first</EM>". Obviously this is not the result we want. The reason is that "+" is greedy. That is, "+" causes the regular expression engine to repeat the preceding token as many times as possible. The engine backtracks only if this repetition causes the rest of the regular expression to fail. In that case, it gives up the last repetition and then tries the rest of the regular expression again.
Like "+", the repetitions "?" and "*" are also greedy.
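The greedy behaviour is easy to reproduce. A minimal sketch, assuming Python's re module:

```python
import re

text = "This is a <EM>first</EM> test"

# Greedy "+": the dot grabs as much as possible, so the match
# runs from the first "<" to the last ">".
greedy = re.search(r"<.+>", text).group()

print(greedy)   # <EM>first</EM>
```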
- Deep inside the regular expression engine
Let's take a look at how the regex engine processes the previous example. The first token is "<", which is a literal character. The second token is ".", which matches the character "E", and then "+" lets the dot keep matching the remaining characters until the end of the line. At the line break, the dot fails ("." does not match a line break). The engine then moves on to the next token in the regular expression and tries to match ">". So far, "<.+" has matched "<EM>first</EM> test". The engine tries to match ">" against the line break, which fails. The engine then backtracks. Now "<.+" matches "<EM>first</EM> tes". The engine then tries to match ">" against "t". Obviously, this fails too. This process continues until "<.+" matches "<EM>first</EM" and ">" matches ">". So the engine has found the match "<EM>first</EM>". Remember that a regex-directed engine is "eager": it is impatient to report the first match it finds, rather than continuing to backtrack, even though there may be a better match such as "<EM>". So we can see that, because "+" is greedy, the regular expression engine returns the longest match at the leftmost position.
- Laziness instead of greediness
One possible fix for the problem above is to make "+" lazy instead of greedy. You do this by putting a question mark "?" right after the "+". The same works for repetition with "*", "{}" and "?". So in the example above we can use "<.+?>". Let's look at how the regular expression engine processes it.
Again, the token "<" matches the first "<" in the string. The next token is ".". This time it is repeated by a lazy "+", which tells the engine to repeat the dot as few times as possible. So the engine matches "." with the character "E", and then tries to match ">" against "M", which fails. The engine backtracks, but unlike in the previous example, the repetition is lazy, so backtracking extends the match instead of shrinking it: "<.+" is now extended to "<EM". The engine continues with the next token, ">". This time it matches successfully. The engine then reports "<EM>" as a successful match. That is roughly how the whole process goes.
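The lazy version can be tried in the same way. A minimal sketch, assuming Python's re module:

```python
import re

text = "This is a <EM>first</EM> test"

# Lazy "+?": the dot is repeated as few times as possible,
# so each tag is matched on its own.
lazy = re.findall(r"<.+?>", text)

print(lazy)   # ['<EM>', '</EM>']
```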
- An alternative to laziness
There is an even better alternative. You can use greedy repetition with a negated character set: "<[^>]+>". This is better because with lazy repetition the engine backtracks for every character before it finds a successful match, whereas the negated character set requires no backtracking.
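On the sample string, the negated character set gives the same result as the lazy quantifier. A minimal sketch, assuming Python's re module:

```python
import re

text = "This is a <EM>first</EM> test"

# [^>]+ can never run past a ">", so greedy repetition is safe here.
tags = re.findall(r"<[^>]+>", text)

print(tags)   # ['<EM>', '</EM>']
```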
Finally, keep in mind that this tutorial only covers regex-directed engines. Text-directed engines do not backtrack. However, they also do not support lazy repetition.
7. Using "." to match almost any character
In the regular expression, the "." is one of the most commonly used symbols. Unfortunately, it is also one of the most easily misused symbols.
"." matches a single character, no matter what that character is. The only exception is the line break character. In the engines discussed in this tutorial, "." does not match line breaks by default. So by default, "." is shorthand for the character set [^\n\r] (Windows) or [^\n] (Unix).
This exception exists for historical reasons. The earliest tools that used regular expressions were line-based. They read a file line by line and applied the regular expression to each line separately. In these tools, the string never contained a line break, so "." never matched one.
Modern tools and languages can apply regular expressions to very large strings or even entire files. All the regular expression implementations discussed in this tutorial provide an option that makes "." match all characters, including line breaks. In tools such as RegexBuddy, EditPad Pro or PowerGREP, you can simply select "dot matches newline". In Perl, the mode in which "." matches line breaks is called "single-line mode". Unfortunately, this is a very confusing term, because there is also a so-called "multi-line mode". Multi-line mode affects only the anchors at the start and end of a line, and single-line mode affects only ".".
Other languages and regular expression libraries have adopted Perl's terminology. When using the regular expression classes in the .NET framework, you can activate single-line mode with a statement similar to the following: Regex.Match("string", "regex", RegexOptions.Singleline)
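As a comparison, and purely as an illustration in another language, Python's re module offers the same option under a different name, re.DOTALL (also spelled re.S). A minimal sketch:

```python
import re

text = "line one\nline two"

# Default: "." refuses to match the line break between the two lines.
default = re.search(r"one.line", text)

# With DOTALL ("single-line mode" in Perl terms) the dot matches "\n" too.
dotall = re.search(r"one.line", text, re.DOTALL).group()

print(default, repr(dotall))   # None 'one\nline'
```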
- Conservative use of the dot "."
The dot is arguably the most powerful metacharacter. It allows you to be lazy: you can match almost any character with a dot. The problem, however, is that it often matches characters that it should not match.
A simple example illustrates this. Let's see how to match a date in "mm/dd/yy" format while allowing the user to choose the separator. One solution that quickly comes to mind is <<\d\d.\d\d.\d\d>>. It looks like it can match the date "02/12/03". The problem is that 02512703 is also considered a valid date.
<<\d\d[-/.]\d\d[-/.]\d\d>> seems to be a better solution. Remember that the dot is not a metacharacter inside a character set. This solution is still far from perfect: it matches "99/99/99". <<[0-1]\d[-/.][0-3]\d[-/.]\d\d>> goes a step further, although it still matches "19/39/99". How perfect your regular expression needs to be depends on what you want to achieve. If you want to validate user input, it needs to be as close to perfect as possible. If you just want to parse a known source that you know contains no bad data, a regular expression that is good enough to match the characters you are looking for will do.
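The three date patterns can be compared on the problem inputs mentioned above. A sketch assuming Python's re module; the helper name accepts is hypothetical.

```python
import re

loose = re.compile(r"\d\d.\d\d.\d\d")
better = re.compile(r"\d\d[-/.]\d\d[-/.]\d\d")
stricter = re.compile(r"[0-1]\d[-/.][0-3]\d[-/.]\d\d")

def accepts(pattern, text):
    """Hypothetical helper: True if the whole text matches the pattern."""
    return pattern.fullmatch(text) is not None

print(accepts(loose, "02512703"))      # True: the dot matched "5" and "7"
print(accepts(better, "02512703"))     # False
print(accepts(better, "99/99/99"))     # True: digits are not range-checked
print(accepts(stricter, "99/99/99"))   # False
print(accepts(stricter, "19/39/99"))   # True: still not a real date
```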
8. Anchoring of string start and end
Anchors are different from ordinary regular expression tokens: they do not match any characters. Instead, they match a position before or after characters. "^" matches the position before the first character of the string. <<^a>> matches the "a" in the string "abc". <<^b>> does not match anything in "abc".
Similarly, "$" matches the position after the last character in the string. So <<c$>> matches the "c" in "abc".
It is important to use anchors when validating user input in a programming language. If you want to verify that the user input is an integer, use <<^\d+$>>.
User input often contains extra leading or trailing spaces. You can use <<^\s*>> and <<\s*$>> to match leading or trailing spaces.
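A sketch of both uses of anchors, validation and trimming, assuming Python's re module. The function names is_integer and trim are made up for illustration.

```python
import re

def is_integer(text):
    """Hypothetical validator: the input must be digits from start to end."""
    return re.search(r"^\d+$", text) is not None

def trim(text):
    """Hypothetical helper: strip leading, then trailing whitespace."""
    text = re.sub(r"^\s*", "", text)
    return re.sub(r"\s*$", "", text)

starts_with_b = re.search(r"^b", "abc")        # None: "b" is not at the start
ends_with_c = re.search(r"c$", "abc").group()  # "c"

print(is_integer("123"), is_integer("12a"), repr(trim("  42  ")))
# True False '42'
```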
- Use "^" and "$" as the beginning and end of the line anchoring
If you have a string that contains more than one line, for example "first line\nsecond line" (where \n represents a line break), it is often necessary to process each line separately rather than the string as a whole. For this reason, almost all regular expression engines provide an option that extends the meaning of the two anchors. "^" then matches at the start of the string (before the "f") as well as right after each line break (between \n and "s"). Similarly, "$" matches at the end of the string (after the last "e") as well as right before each line break (between "e" and \n).
In .NET, the following code makes the anchors match before and after each line break: Regex.Match("string", "regex", RegexOptions.Multiline)
Application: string str = Regex.Replace(original, "^", ">", RegexOptions.Multiline) inserts ">" at the beginning of each line.
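For comparison, the same replacement can be sketched in Python (an assumption for illustration; the tutorial's own example is .NET). The corresponding flag in Python's re module is re.MULTILINE.

```python
import re

original = "first line\nsecond line"

# Without the flag, "^" matches only at the very start of the string.
single = re.sub(r"^", ">", original)

# With MULTILINE, "^" also matches right after each line break.
quoted = re.sub(r"^", ">", original, flags=re.MULTILINE)

print(repr(single))   # '>first line\nsecond line'
print(repr(quoted))   # '>first line\n>second line'
```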
<<\A>> matches only at the start of the entire string, and <<\Z>> matches only at the end of the entire string. Even in multi-line mode, <<\A>> and <<\Z>> never match at line breaks inside the string.
Although \Z and $ only match at the end of the string (when multi-line mode is off), there is one exception. If the string ends with a line break, \Z and $ match at the position before that line break rather than at the very end of the string. This "improvement" was introduced by Perl and has been followed by many regular expression implementations, including Java and .NET. If you apply <<^[a-z]+$>> to "joe\n", the match is "joe" rather than "joe\n".
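The "joe\n" example can be reproduced with "$" in Python's re module (assumed here for illustration). One difference to be aware of: Python's own \Z is stricter than the Perl-style one described above and matches only at the very end of the string, so it is shown separately.

```python
import re

# "$" matches before the trailing line break, so the match is "joe".
dollar = re.search(r"^[a-z]+$", "joe\n").group()

# Python's \Z matches only at the very end of the string,
# so the trailing "\n" prevents a match here.
python_big_z = re.search(r"^[a-z]+\Z", "joe\n")

print(repr(dollar), python_big_z)   # 'joe' None
```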