Regular Expression [reprinted]




I became interested in regular expressions six months ago. I found a lot of material on the Internet and read many tutorials, and eventually, while using the regular expression tool RegexBuddy, I discovered that its tutorial was very well written. It is arguably the best regular expression tutorial I have ever seen, so I had long wanted to translate it. That wish was not fulfilled until this May Day holiday, and this article is the result. The title of this article may seem overused. However, after reading the original, I feel that only "in-depth and simple" conveys the experience this tutorial gave me, so I could not avoid it.

This document is a translation of the tutorial written by Jan Goyvaerts for RegexBuddy. The copyright belongs to the original author. You are welcome to reprint it. However, out of respect for the work of the original author and the translator, please indicate the source. Thank you!

1. What is a regular expression?

Basically, a regular expression is a pattern that describes a certain amount of text. "Regex" is short for "regular expression". This article writes <regex> to denote a specific regular expression.

A literal piece of text is the most basic pattern. It simply matches the same text.

2. Different Regular Expression Engines

A regular expression engine is a piece of software that can process regular expressions. Usually, the engine is part of a larger application. In the software world, different regular expression engines are not fully compatible with each other. This tutorial focuses on the Perl 5 flavor, which is the most widely used. We will also mention some differences from other engines. Many modern engines are similar, but not identical, for example the .NET regular expression library and the JDK regular expression package.

3. Text symbols

The most basic regular expression consists of a single literal character, for example <a>. It matches the first occurrence of that character in the string. For the string "Jack is a boy", the "a" after the "J" is matched. The second "a" is not matched.

The regular expression can also match the second "a", but only if you tell the regular expression engine to continue searching after the first match. In a text editor, you can use "Find Next". In a programming language, there is usually a function that lets you continue searching from the position where the previous match ended.

Similarly, <cat> matches "cat" in "about cats and dogs". This tells the regular expression engine to find a <c>, immediately followed by an <a>, immediately followed by a <t>.

Note that regular expression engines are case sensitive by default. <cat> does not match "Cat" unless you tell the engine to ignore case.

· Special characters

11 characters are reserved for special purposes (the closing square bracket "]" is special only after an opening "["). They are:

[ \ ^ $ . | ? * + ( )

These special characters are also called metacharacters.

If you want to use one of these characters as a literal character in a regular expression, you need to escape it with a backslash "\". For example, to match "1+1=2", the correct expression is <1\+1=2>.

<1+1=2> is also a valid regular expression. However, it does not match "1+1=2". Instead it matches "111=2" in "123+111=234", because "+" has a special meaning here (repeat one or more times).

In programming languages, note that some special characters are processed by the compiler first and only then passed to the regular expression engine. Therefore, the regular expression <1\+1=2> must be written as "1\\+1=2" in C++ source code. To match "c:\temp", you need the regular expression <c:\\temp>. In C++ source code, this regular expression becomes "c:\\\\temp".
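The escaping rules can be tried out with a short sketch. Python's `re` module is used here purely as an illustration (the tutorial itself targets Perl-style engines in general), and the variable names are made up for this example.

```python
import re

# Escaped plus: matches the literal text "1+1=2".
escaped = re.search(r"1\+1=2", "1+1=2")

# Unescaped plus: "1+" means "one or more 1s", so the literal text no longer
# matches, but "111=2" inside "123+111=234" does.
unescaped_literal = re.search(r"1+1=2", "1+1=2")
unescaped_other = re.search(r"1+1=2", "123+111=234")

# In an ordinary (non-raw) string literal, every backslash meant for the regex
# engine must be doubled, just as in a C++ string literal.
path_regex_plain = "c:\\\\temp"   # the engine receives c:\\temp
path_regex_raw = r"c:\\temp"      # same pattern written as a raw string
path_match = re.search(path_regex_plain, "c:\\temp")  # subject text is c:\temp

print(escaped.group(), unescaped_literal, unescaped_other.group(), path_match.group())
```

Raw string literals are Python's way of avoiding the doubled backslashes; languages without such a feature need the four-backslash form.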

· Non-printable characters

Special character sequences can be used to represent certain non-printable characters:

<\t> represents a tab (0x09)

<\r> represents a carriage return (0x0D)

<\n> represents a line feed (0x0A)

Note that Windows uses "\r\n" to end a line, while UNIX uses "\n".

4. Internal Working Mechanism of the Regular Expression Engine

Knowing how the Regular Expression Engine works helps you quickly understand why a regular expression does not work as expected.

There are two types of engines: text-directed engines and regex-directed (regular expression-oriented) engines. Jeffrey Friedl calls them DFA and NFA engines, respectively. This article discusses regex-directed engines, because some very useful features, such as lazy quantifiers and backreferences, can only be implemented in a regex-directed engine. So it is not surprising that this type of engine is currently the most popular.

You can easily tell whether an engine is text-directed or regex-directed. If backreferences or lazy quantifiers are available, you can be sure that the engine is regex-directed. You can also perform the following test: apply the regular expression <regex|regex not> to the string "regex not". If the match is "regex", the engine is regex-directed. If the match is "regex not", it is text-directed. The reason is that a regex-directed engine is "eager": it tries the alternatives in order and reports the first match it finds.
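The alternation test is easy to run. The sketch below uses Python's `re` module as an example engine; the variable `engine_kind` is a hypothetical name used only here.

```python
import re

# Apply <regex|regex not> to "regex not" and see which alternative wins.
m = re.search(r"regex|regex not", "regex not")

# A regex-directed engine stops at the first alternative that succeeds.
engine_kind = "regex-directed" if m.group() == "regex" else "text-directed"
print(m.group(), "->", engine_kind)
```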

· The regex-directed engine always returns the leftmost match

This is a very important point to understand: even if a "better" match could be found later in the string, a regex-directed engine always returns the leftmost match.

When <cat> is applied to "he captured a catfish for his cat", the engine first compares <c> with "h", which fails. It then compares <c> with "e", which also fails, and so on until the fourth character, where <c> matches "c". <a> then matches the fifth character. <t>, however, fails to match the sixth character, "p". The engine therefore starts over at the fifth character and tries <c> again. Eventually <cat> matches the "cat" in "catfish", starting at the 15th character. Because the engine is eager, it returns this first match and does not continue to look for other, "better" matches.
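A minimal sketch of the leftmost-match behavior, again assuming Python's `re` module as the engine. Note that Python counts positions from 0, so index 14 is the 15th character. `re.finditer` plays the role of "find next" by resuming after each match.

```python
import re

text = "he captured a catfish for his cat"

# The first match is the "cat" inside "catfish", not the standalone word.
first = re.search(r"cat", text)

# Continuing the search from the end of each match finds the later one too.
all_starts = [m.start() for m in re.finditer(r"cat", text)]

print(first.start(), first.group(), all_starts)
```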

5. Character Set

A character set is a set of characters enclosed in a pair of square brackets. With a character set, you can tell the regular expression engine to match exactly one out of several characters. If you want to match an "a" or an "e", use <[ae]>. You can use <gr[ae]y> to match "gray" or "grey". This is especially useful when you are not sure whether the text you are searching uses American or British spelling. Conversely, <gr[ae]y> does not match "graay" or "graey". The order of the characters in a character set does not matter; the result is the same.

You can use a hyphen "-" inside a character set to define a range of characters. <[0-9]> matches a single digit between 0 and 9. You can use more than one range. <[0-9a-fA-F]> matches a single hexadecimal digit, regardless of case. You can also combine ranges with single characters. <[0-9a-fxA-FX]> matches a hexadecimal digit or the letter X. Again, the order of the characters and ranges does not affect the result.
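As a quick illustration of sets and ranges, here is a sketch using Python's `re` module; the sample strings are invented for this example.

```python
import re

# A set matches exactly one character, so "graay" and "graey" are rejected.
grey_hits = re.findall(r"gr[ae]y", "gray grey graay graey")

# Two ranges in one set: a single hexadecimal digit of either case.
hex_digits = re.findall(r"[0-9a-fA-F]", "x7Gb")

# Ranges combined with single characters: a hexadecimal digit or the letter X.
hex_or_x = re.findall(r"[0-9a-fxA-FX]", "x7Gb")

print(grey_hits, hex_digits, hex_or_x)
```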

· Application of character sets

Find a word even if it is misspelled, such as <sep[ae]r[ae]te> or <li[cs]en[cs]e>.

Find an identifier in a programming language: <[A-Za-z_][A-Za-z_0-9]*>. (* means repeat zero or more times.)

Find a C-style hexadecimal number: <0[xX][A-Fa-f0-9]+>. (+ means repeat one or more times.)
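The three applications above can be sketched as follows, assuming Python's `re` module; the input strings are made up for illustration.

```python
import re

# Words that may be misspelled.
separate_hits = re.findall(r"sep[ae]r[ae]te", "separate seperate sepurate")
license_hits = re.findall(r"li[cs]en[cs]e", "license licence lisence")

# Programming-language identifiers: a letter or underscore, then letters,
# digits or underscores. The number 42 is not an identifier.
identifiers = re.findall(r"[A-Za-z_][A-Za-z_0-9]*", "x1 = _tmp + 42")

# C-style hexadecimal numbers: "0x" alone has no digits and is skipped.
hex_numbers = re.findall(r"0[xX][A-Fa-f0-9]+", "0x1F 0XaB 0x 12")

print(separate_hits, license_hits, identifiers, hex_numbers)
```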

· Negated character sets

A character set is negated when the opening square bracket "[" is immediately followed by a caret "^". The negated set matches any character that is not listed in the square brackets. Unlike ".", a negated character set also matches line break characters.

It is important to remember that a negated character set still has to match a character. <q[^u]> does not mean "a q that is not followed by a u". It means "a q followed by a character that is not a u". Therefore, it does not match the q in "Iraq", but it does match the q and the following space in "Iraq is a country". The space is part of the match, because it is "a character that is not a u".

If you want to match only the q, on the condition that it is not followed by a u, you can use lookahead, which is described later.
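The difference is easy to see in a small experiment. This sketch assumes Python's `re` module. The last pattern uses negative lookahead, `(?!u)`, which is presumably the lookahead technique the text defers to a later section; it is shown here only as a preview.

```python
import re

# Nothing follows the q in "Iraq", so the negated set has nothing to match.
no_match = re.search(r"q[^u]", "Iraq")

# Here the space after the q is consumed as part of the match.
with_space = re.search(r"q[^u]", "Iraq is a country")

# A negated set, unlike ".", also accepts a line break.
with_newline = re.search(r"q[^u]", "Iraq\n")

# Negative lookahead checks what follows without consuming it.
only_q = re.search(r"q(?!u)", "Iraq")

print(no_match, repr(with_space.group()), repr(with_newline.group()), only_q.group())
```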

· Metacharacters in character sets

Note that only four characters have a special meaning inside a character set: "]", "\", "^" and "-". "]" ends the character set definition, "\" escapes the next character, "^" negates the set, and "-" defines a range. The other metacharacters are ordinary characters inside a character set and do not need to be escaped. For example, to search for an asterisk or a plus sign, you can use <[+*]>. Escaping the other metacharacters does no harm and your regular expression will still work, but it reduces readability.

To include the backslash "\" in a character set as a literal character rather than as the escape character, you need to escape it with another backslash. <[\\x]> matches a backslash or an x. "]", "^" and "-" can either be escaped with a backslash or be placed in a position where they cannot take on their special meaning. We recommend the latter because it improves readability. For example, if the caret "^" is placed anywhere except directly after the opening bracket "[", it is a literal character rather than a negation. <[x^]> matches an x or a caret. <[]x]> matches a "]" or an x. <[-x]> and <[x-]> both match a hyphen or an x.
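The placement rules can be checked with a short sketch. It assumes Python's `re` module, which accepts all of these forms; some other engines (JavaScript, for example) do not accept an unescaped "]" directly after the opening bracket, so treat this as engine-dependent.

```python
import re

# Ordinary metacharacters need no escaping inside a set.
plus_or_star = re.findall(r"[+*]", "1+2*3")

# A literal backslash must still be escaped.
backslash_or_x = re.findall(r"[\\x]", "a\\bx")   # subject text is a\bx

# Caret anywhere but first, "]" first, hyphen first or last: all literal.
x_or_caret = re.findall(r"[x^]", "x^y")
bracket_or_x = re.findall(r"[]x]", "a]x")
hyphen_first = re.findall(r"[-x]", "a-x")
hyphen_last = re.findall(r"[x-]", "a-x")

print(plus_or_star, backslash_or_x, x_or_caret, bracket_or_x, hyphen_first, hyphen_last)
```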

· Character Set abbreviations

Some character sets are used so often that shorthand notations exist for them.

<\d> stands for <[0-9]>.

<\w> stands for a "word character". Exactly which characters it includes depends on the regular expression implementation. In most implementations it includes at least <[A-Za-z0-9_]>.

<\s> stands for a "whitespace character". This, too, depends on the implementation. In most implementations it includes the space, the tab, and the line break characters <\r\n>.

Shorthand character sets can be used inside or outside square brackets. <\s\d> matches a whitespace character followed by a digit. <[\s\d]> matches a single character that is either whitespace or a digit. <[\da-fA-F]> matches a hexadecimal digit.
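A brief sketch of the shorthands inside and outside brackets, assuming Python's `re` module. As the text notes, the exact contents of these sets vary by implementation; the ASCII inputs below avoid those differences.

```python
import re

# Outside brackets: a whitespace character followed by a digit (two characters).
ws_then_digit = re.findall(r"\s\d", "a 1\t2b3")

# Inside brackets: one character that is either whitespace or a digit.
ws_or_digit = re.findall(r"[\s\d]", "a 1b")

# A shorthand combined with ranges: one hexadecimal digit.
hex_digit = re.findall(r"[\da-fA-F]", "0xBEEF")

print(ws_then_digit, ws_or_digit, hex_digit)
```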

Negated shorthand character sets

<\S> = <[^\s]>

<\W> = <[^\w]>

<\D> = <[^\d]>

· Repeated character sets

If you apply one of the repetition operators "?", "*" or "+" to a character set, you repeat the entire character set, not just the character it matched. The regular expression <[0-9]+> matches "837" as well as "222".

If you want to repeat only the character that was matched, you can use backreferences. We will discuss backreferences later.
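To make the distinction concrete, here is a sketch assuming Python's `re` module. The pattern `([0-9])\1+` is one possible backreference form (a captured digit followed by one or more copies of that same digit); it is an illustration, not necessarily the form used later in the tutorial.

```python
import re

# Repeating the set: each repetition may pick a different digit.
set_837 = re.fullmatch(r"[0-9]+", "837")
set_222 = re.fullmatch(r"[0-9]+", "222")

# Repeating the matched character: \1 must equal what group 1 captured.
same_837 = re.fullmatch(r"([0-9])\1+", "837")
same_222 = re.fullmatch(r"([0-9])\1+", "222")

print(bool(set_837), bool(set_222), bool(same_837), bool(same_222))
```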

6. Using ?, * and +

?: tells the engine to match the preceding character zero times or once. In effect, it makes the preceding character optional.

+: tells the engine to match the preceding character one or more times.

*: tells the engine to match the preceding character zero or more times.

<<[A-Za-z][A-Za-z0-9]*>> matches an HTML tag without attributes. Here the inner "<" and ">" are literal characters. The first character set matches a letter, and the second character set matches a letter or a digit.

It may seem that we could use <<[A-Za-z0-9]+>> instead. However, that would also match "<1>". Still, this regular expression is good enough when you know that the text you are searching does not contain such invalid tags.
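The two tag patterns can be compared side by side. This sketch assumes Python's `re` module, and the sample text is invented.

```python
import re

sample = "<b> <h1> <1>"

# Strict form: the first character after "<" must be a letter.
strict_tags = re.findall(r"<[A-Za-z][A-Za-z0-9]*>", sample)

# Loose form: any run of letters and digits, so "<1>" slips through.
loose_tags = re.findall(r"<[A-Za-z0-9]+>", sample)

print(strict_tags, loose_tags)
```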

· Limiting repetition

Many modern regular expression implementations allow you to specify how many times a character is repeated. The syntax is {min,max}, where min and max are non-negative integers. If the comma is present but max is omitted, the maximum is unlimited. If both the comma and max are omitted, the character is repeated exactly min times.

Therefore, {0,} is the same as *, and {1,} is the same as +.

You can use <\b[1-9][0-9]{3}\b> to match a number between 1000 and 9999 ("\b" denotes a word boundary). <\b[1-9][0-9]{2,4}\b> matches a number between 100 and 99999.
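A small sketch of bounded repetition, assuming Python's `re` module; the number lists are made up to show which values fall inside and outside each range.

```python
import re

# Exactly four digits with a non-zero leading digit: 1000 to 9999.
four_digit = re.findall(r"\b[1-9][0-9]{3}\b", "999 1000 9999 10000 0123")

# Three to five digits with a non-zero leading digit: 100 to 99999.
three_to_five = re.findall(r"\b[1-9][0-9]{2,4}\b", "99 100 99999 100000")

# {0,} behaves like *, and {1,} behaves like +.
star_like = re.fullmatch(r"a{0,}", "") is not None
plus_like = re.fullmatch(r"a{1,}", "") is None

print(four_digit, three_to_five, star_like, plus_like)
```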

· Watch out for greediness

Suppose you want to use a regular expression to match an HTML tag. You know that the input is a valid HTML file, so the regular expression does not need to exclude invalid tags. Anything between two angle brackets can therefore be treated as an HTML tag.

Many newcomers first think of the regular expression <<.+>>. They are then surprised by the result on the test string "This is a <em>first</em> test". You might expect the regex to return "<em>" and then, when matching continues, "</em>".

But it does not. The regular expression matches "<em>first</em>", which is clearly not what we want. The reason is that "+" is greedy: it makes the engine repeat the preceding character as often as possible. Only when that causes the rest of the regular expression to fail does the engine backtrack, that is, give up the last repetition and try the remainder of the regular expression again.

Like "+", the repetition operators "?" and "*" are greedy.

· A closer look inside the regular expression engine

Let's look at how the regular expression engine matches the previous example. The first token is "<", a literal character, which matches the first "<" in the string. The second token is ".", which matches "e", and the "+" keeps repeating it to match the following characters until the end of the line is reached. There "." fails ("." does not match line breaks, and at the end of the string there is nothing left to match). The engine then moves on to the next token and tries to match ">". At this point, "<.+" has matched "<em>first</em> test". The attempt to match ">" at the end of the line fails, so the engine backtracks: "<.+" now matches "<em>first</em> tes", and the engine tries to match ">" against "t". This obviously fails too. The process continues until "<.+" matches "<em>first</em" and ">" matches the following ">". The engine has now found the match "<em>first</em>". Remember that a regex-directed engine is eager: it reports the first match it finds and does not backtrack further, even though a "better" match such as "<em>" may exist. So, because "+" is greedy, the engine returns the leftmost, longest match.

· Replacing greed with laziness

One possible way to fix this problem is to make "+" lazy instead of greedy. You do this by putting a question mark "?" after the "+". The same works for the repetition operators "*", "{}" and "?". In the example above we can therefore use <<.+?>>. Let's look at how the regular expression engine processes it.

Again, the token "<" matches the first "<" in the string. The next token is ".", this time repeated by a lazy "+". This tells the engine to repeat the preceding character as few times as possible. So the engine matches "." with "e" and then tries to match ">" against "m", which fails. The engine backtracks, but differently from the previous example: because the repetition is lazy, the engine expands it rather than shrinking it, so "<.+" now matches "<em". The engine then tries the next token, ">", again, and this time it succeeds. The engine reports "<em>" as a successful match. When matching continues, "</em>" is found in the same way.
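Greedy and lazy repetition can be compared directly on the test string. The sketch assumes Python's `re` module, which supports lazy quantifiers in the Perl style.

```python
import re

text = "This is a <em>first</em> test"

# Greedy "+": grabs as much as possible, then backtracks to the last ">".
greedy = re.findall(r"<.+>", text)

# Lazy "+?": grabs as little as possible, expanding only when needed.
lazy = re.findall(r"<.+?>", text)

print(greedy, lazy)
```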

· An alternative to laziness

There is a better alternative: a greedy repetition combined with a negated character set, <<[^>]+>>. With lazy repetition, the engine backtracks for every character it adds before it finds a successful match. With the negated character set, no backtracking is needed.
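A minimal sketch of the negated-set approach, assuming Python's `re` module and the same test string as before.

```python
import re

text = "This is a <em>first</em> test"

# "[^>]+" can never run past a ">", so each tag is matched on its own
# without the engine having to back up.
negated = re.findall(r"<[^>]+>", text)

print(negated)
```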

Finally, remember that this tutorial only discusses regex-directed engines. Text-directed engines do not backtrack. They also do not support lazy repetition.

7. Using "." to match almost any character

In regular expressions, "." is one of the most commonly used symbols. Unfortunately, it is also one of the most commonly misused.

"." matches a single character, no matter what that character is. The only exception is the newline character: the engines discussed in this tutorial do not match newline characters with "." by default. So by default, "." is shorthand for the character set <[^\n\r]> (Windows) or <[^\n]> (UNIX).

This exception exists for historical reasons. The earliest tools that used regular expressions were line-based. They read a file line by line and applied the regular expression to each line separately. In these tools the string never contained a newline character, so "." never matched one.

Modern tools and languages can apply a regular expression to very large strings or even entire files. All regular expression implementations discussed in this tutorial provide an option that makes "." match all characters, including newlines. In tools such as RegexBuddy, EditPad Pro and PowerGREP, you simply tick the "dot matches newline" option. In Perl, the mode in which "." also matches newlines is called "single-line mode". Unfortunately, this is a confusing term, because there is also a so-called "multi-line mode". Multi-line mode only affects the anchors at the start and end of a line, while single-line mode only affects ".".

Other languages and regular expression libraries have adopted Perl's terminology. When using the regular expression classes of the .NET Framework, you activate single-line mode with a statement such as: Regex.Match("string", "regex", RegexOptions.Singleline)
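For comparison, here is the same option sketched in Python, whose `re` module calls it `re.DOTALL` (also available as `re.S`) rather than "single line". The sample text is invented.

```python
import re

text = "first\nsecond"

# By default "." refuses to match the newline between the two words.
default_dot = re.search(r"first.second", text)

# With DOTALL, "." matches any character, including the newline.
dotall_dot = re.search(r"first.second", text, flags=re.DOTALL)

print(default_dot, repr(dotall_dot.group()))
```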

· Use "." sparingly

The dot is arguably the most powerful metacharacter. It allows you to be lazy: put in a dot, and it matches almost any character. The problem is that it often also matches characters it should not match.

Here is a simple example. Suppose we want to match a date in mm/dd/yy format, but we want to let the user choose the separator. A solution that quickly comes to mind is <\d\d.\d\d.\d\d>. It appears to work, since it matches the date "02/12/03". The problem is that "02512703" is also considered a valid date.

<\d\d[-/.]\d\d[-/.]\d\d> looks like a better solution. Remember that the dot is not a metacharacter inside a character set. This solution is still far from perfect, since it matches "99/99/99". <[0-1]\d[-/.][0-3]\d[-/.]\d\d> goes a step further, although it still matches "19/39/99". How perfect your regular expression needs to be depends on what you want to do with it. If you are validating user input, it has to be as strict as possible. If you are only parsing a known source that you know contains no invalid data, a reasonably good regular expression that matches what you are looking for is enough.
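The progression of date patterns can be sketched as follows, assuming Python's `re` module. The helper `accepts` and the pattern names are hypothetical, introduced only for this example.

```python
import re

def accepts(pattern: str, candidate: str) -> bool:
    """Return True if the whole candidate string matches the pattern."""
    return re.fullmatch(pattern, candidate) is not None

dot_pattern = r"\d\d.\d\d.\d\d"                     # any separator at all
set_pattern = r"\d\d[-/.]\d\d[-/.]\d\d"             # only -, / or .
range_pattern = r"[0-1]\d[-/.][0-3]\d[-/.]\d\d"     # restricts leading digits

results = {
    "dot accepts 02/12/03": accepts(dot_pattern, "02/12/03"),
    "dot accepts 02512703": accepts(dot_pattern, "02512703"),
    "set accepts 02512703": accepts(set_pattern, "02512703"),
    "set accepts 99/99/99": accepts(set_pattern, "99/99/99"),
    "range accepts 99/99/99": accepts(range_pattern, "99/99/99"),
    "range accepts 19/39/99": accepts(range_pattern, "19/39/99"),
}
for label, value in results.items():
    print(label, value)
```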

8. String start and end anchors

Anchors differ from ordinary regular expression tokens: they do not match any character. Instead, they match a position before or after characters. "^" matches the position before the first character of the string. <^a> matches the "a" in the string "abc". <^b> does not match "abc" at all.

Similarly, "$" matches the position after the last character of the string. So <c$> matches the "c" in "abc".

· Applications of anchors

When validating user input in a programming language, using anchors is very important. If you want to verify that the user's input is an integer, use <^\d+$>.

User input often contains superfluous leading or trailing whitespace. You can use <^\s*> and <\s*$> to match leading and trailing whitespace, respectively.
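Both uses of anchors can be sketched in a few lines, assuming Python's `re` module. The function `is_integer` is a hypothetical helper. The trimming pattern uses `\s+` instead of `\s*` so that it never matches the empty string; that is a small deviation from the patterns quoted above, chosen to keep the substitution simple.

```python
import re

def is_integer(user_input: str) -> bool:
    """Accept the input only if it consists entirely of digits."""
    return re.search(r"^\d+$", user_input) is not None

# Without the anchors, "12a" would pass because it contains digits.
checks = [is_integer("123"), is_integer("12a"), is_integer("a123")]

# Strip leading and trailing whitespace in one substitution.
trimmed = re.sub(r"^\s+|\s+$", "", "  hello world \t")

print(checks, repr(trimmed))
```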

· Using "^" and "$" as start-of-line and end-of-line anchors

Suppose you have a string that contains multiple lines, for example "First line\nSecond line" (where \n represents a newline character). Often you need to process each line separately rather than the string as a whole. Therefore, almost all regular expression engines provide an option that extends the meaning of these two anchors. "^" then matches at the start of the string (before the "F") and also after each newline character (between "\n" and "S"). Similarly, "$" matches at the end of the string (after the last "e") and also before each newline character (between "e" and "\n").

In .NET, the following code makes the anchors match before and after each newline character: Regex.Match("string", "regex", RegexOptions.Multiline)

Application: string str = Regex.Replace(original, "^", ">", RegexOptions.Multiline) inserts ">" at the beginning of each line.
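The same idea sketched in Python, where the option is called `re.MULTILINE`. This is an analogue of the .NET call above, not a translation of it, and it assumes a recent Python 3 version for the handling of empty matches in `re.sub`.

```python
import re

original = "First line\nSecond line"

# In multi-line mode "^" matches at the start of every line.
quoted = re.sub(r"^", ">", original, flags=re.MULTILINE)

# "$" matches before the newline and at the very end of the string.
line_ends = [m.start() for m in re.finditer(r"$", original, flags=re.MULTILINE)]

print(repr(quoted), line_ends)
```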

· Absolute anchoring

<\A> matches only at the start of the entire string, and <\Z> matches only at the end of the entire string. Even in multi-line mode, <\A> and <\Z> never match at line breaks inside the string.

Although \Z and $ normally match only at the end of the string, there is one exception. If the string ends with a newline character, \Z and $ also match at the position before that newline, not only at the very end of the string. This "improvement" was introduced by Perl and has been adopted by many regular expression implementations, including Java and .NET. If you apply <^[a-z]+$> to "joe\n", the match is "joe" rather than "joe\n".
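The trailing-newline exception can be observed with a short sketch. It uses Python's `re` module, which is a useful contrast: Python's `$` follows the Perl behavior described above, but Python's `\Z` matches only at the very end of the string, so it differs from the Perl, Java and .NET `\Z` discussed here.

```python
import re

text = "joe\n"

# "$" matches just before the final newline, so the match is "joe".
dollar = re.search(r"^[a-z]+$", text)

# In Python, \Z means "absolute end of string": the trailing newline blocks it.
strict_end = re.search(r"^[a-z]+\Z", text)

print(repr(dollar.group()), strict_end)
```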


