Linux Regular expressions

Source: Internet
Author: User
Tags modifier expression engine

I. Introduction to regular expressions

1. What is a regular expression

Regular expressions were initially popularized by Unix tools such as sed and grep. In code they are often abbreviated as RE, regex or regexp. A regular expression is essentially a small, highly specialized programming language, and many programming languages support string manipulation through regular expressions. Perl, for example, has a powerful regular expression engine built in.

2. What regular expressions can do

Regular expressions are applied mainly to text. They let you specify the rules that a string must satisfy, and then use those rules to match, find, replace, or split text. In general, regular expressions can perform the following operations on a given text:

    • Match validation: determine whether an entire string conforms to the rule specified by the regular expression, for example whether it is a well-formed email address or mobile phone number. When a regular expression is used for validation, it is usually necessary to add ^ at the start and $ at the end of the expression so that it must match the entire string.

    • Find and replace: determine whether a string contains a substring that satisfies the rule, such as finding the IP addresses in a piece of text. The substrings found can also be replaced with other content.

    • String splitting and substring extraction: building on substring lookup, a string can also be split by using the substrings that match the regular expression as delimiters.
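The three uses listed above can be sketched with Python's re module. The sample text and the deliberately loose IP-like pattern are assumptions chosen only for illustration; they are not taken from the article.

```python
import re

# Hypothetical sample data, chosen only for illustration.
text = "Servers: 10.0.0.1, 10.0.0.2; gateway 192.168.1.1"
ip_pattern = r"\d+\.\d+\.\d+\.\d+"  # deliberately loose IP-like pattern

# Match validation: anchor with ^ and $ so the whole string must conform.
is_valid = re.search(r"^" + ip_pattern + r"$", "10.0.0.1") is not None
whole_text_valid = re.search(r"^" + ip_pattern + r"$", text) is not None

# Find: collect every substring that satisfies the rule.
found = re.findall(ip_pattern, text)

# Replace: substitute each matching substring with other content.
masked = re.sub(ip_pattern, "<ip>", text)

# Split: use the substrings matched by a pattern as delimiters.
parts = re.split(r"[,;]\s*", "a, b;c")

print(is_valid, whole_text_valid, found, masked, parts)
```

Validation and searching use the same pattern here; only the anchors differ.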

II. Characters in regular expressions

Regular expressions are applied mainly to text, and their most basic function is text matching. Because text is made up of individual characters, a regular expression is ultimately matched character by character. The characters in a regular expression are divided into ordinary characters and metacharacters, and a regular expression combines the two to express a particular matching rule.

1. Ordinary characters

Most characters simply match themselves. These are called ordinary characters, such as digits (0-9) and letters (a-z, A-Z). For example, the regular expression hello123 matches the string 'hello123', because the expression consists only of ordinary characters and contains no metacharacters. We can also tell the engine to ignore letter case, in which case hello123 also matches strings such as 'Hello123', 'HellO123', and 'HELLO123'.

Tip: There is no need to memorize which characters are ordinary. We only need to know which characters are metacharacters; every other character is ordinary.
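As a small illustration of ordinary characters and case-insensitive matching, here is a sketch using Python's re module. The choice of Python and of the re.IGNORECASE flag is an assumption of this example; other languages expose the same option under different names.

```python
import re

pattern = "hello123"  # ordinary characters only, no metacharacters

# Ordinary characters match themselves wherever they occur in the text.
exact = re.search(pattern, "say hello123 now")

# Matching is case-sensitive by default, so this search finds nothing.
case_sensitive = re.search(pattern, "HELLO123")

# With the ignore-case option, differently cased strings match as well.
ignore_case = re.search(pattern, "HellO123", re.IGNORECASE)

print(exact.group(), case_sensitive, ignore_case.group())
```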

2. Meta-characters

As mentioned above, besides matching characters literally, regular expressions can also match loosely according to rules. Expressing those rules requires special characters that, by default, do not match their own literal values but instead stand for special functions. These metacharacters include: ., [, ], (, ), {, }, *, +, ?, ^, $, \, |. Their use is explained in detail below. The main difficulty in learning regular expressions lies in understanding how the regular expression engine works and in using these metacharacters fluently and flexibly.

Hint: What if you want to match the literal value of a metacharacter? One of the metacharacters, the backslash \, escapes the others: putting \ in front of a metacharacter makes it match its literal value.
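A minimal Python sketch of escaping. The price string is a made-up example, and re.escape is a helper from Python's re module rather than something the article itself uses.

```python
import re

price = "Total: $3.50 (approx.)"  # hypothetical sample text

# Unescaped, "." matches any character, so the pattern also matches "3x50".
loose = re.search(r"3.50", "3x50")
strict = re.search(r"3\.50", "3x50")

# Escaped with a backslash, "\$" and "\." match the literal characters.
literal = re.search(r"\$3\.50", price)

# re.escape() puts the backslashes in front of the metacharacters for us.
auto = re.search(re.escape("(approx.)"), price)

print(loose is not None, strict, literal.group(), auto.group())
```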

III. Metacharacters in detail

Now let's look at which complex matching functions the metacharacters in a regular expression can accomplish.

1. Single Character matching

Note: metacharacters inside [] lose their original special meaning:

  • Some metacharacters take on a new special meaning inside []. For example, '^' as the first character inside [] negates the set; anywhere else inside [] it represents itself (an ordinary character).

  • Some metacharacters become ordinary characters, such as '.', '*', '+', '?', '$'.

  • Some ordinary characters become special. For example, '-' placed between two characters inside [] denotes a range of digits or letters; as the first character inside [] it represents itself (an ordinary character).

  • To use a literal '-', '^' or ']' inside [], precede it with a backslash, or place '-' or ']' in the first position and '^' anywhere other than the first position.
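The rules for characters inside [] can be checked with a short Python sketch. The test strings are arbitrary examples, and the behavior shown is that of Python's re module.

```python
import re

# "^" as the first character inside [] negates the set.
negated = re.findall(r"[^abc]", "abcd")

# "^" anywhere else inside [] is an ordinary character.
caret = re.findall(r"[a^]", "a^b")

# ".", "*", "+", "?" and "$" are ordinary characters inside [].
literals = re.findall(r"[.*+?$]", "a.b*c$")

# "-" between two characters denotes a range; in first position it is literal.
in_range = re.findall(r"[a-c]", "abcd-")
hyphen = re.findall(r"[-ac]", "abcd-")

# A backslash keeps "]", "-" and "^" literal in any position.
escaped = re.findall(r"[\]\-\^]", "a]-^")

print(negated, caret, literals, in_range, hyphen, escaped)
```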

2. Predefined character sets

A backslash followed by a particular letter represents a predefined set of characters.
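For instance, Python's re module provides \d (digits), \w (word characters: letters, digits and underscore), \s (whitespace) and the uppercase forms as their complements. The sketch below uses a made-up sample string; other engines may define slightly different sets.

```python
import re

sample = "Room 42, floor_3\tok"  # hypothetical sample text

digits = re.findall(r"\d", sample)          # single digits
words = re.findall(r"\w+", sample)          # runs of letters, digits, underscore
spaces = re.findall(r"\s", sample)          # whitespace, including the tab
non_digits = re.findall(r"\D+", "a1b22c")   # \D is the complement of \d

print(digits, words, spaces, non_digits)
```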

3. Matching a number of occurrences: quantifiers

In regular expressions, we can also specify how many times a character may occur.

We can draw the following conclusions:

    • {0,1} or {,1} is equivalent to ?

    • {1,} is equivalent to +

    • {0,} is equivalent to *

We prefer ?, + and * because they are shorter to write and keep the whole regular expression concise.

Note: when ? is placed after another quantifier (?, +, *, {m,n}), it has an additional function: it switches that quantifier from greedy mode (match as many times as possible) to non-greedy mode (match as few times as possible). This is described in detail later.
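The equivalences above can be spot-checked in Python. This is only a sketch over a handful of sample strings, not a proof, and the helper name same_on_samples is hypothetical.

```python
import re

samples = ["", "a", "aa", "aaa"]

def same_on_samples(p1, p2):
    """Return True if both patterns accept exactly the same sample strings."""
    return all(
        bool(re.fullmatch(p1, s)) == bool(re.fullmatch(p2, s)) for s in samples
    )

pairs = [
    (r"a{0,1}", r"a?"),
    (r"a{,1}", r"a?"),
    (r"a{1,}", r"a+"),
    (r"a{0,}", r"a*"),
]
results = [same_on_samples(p1, p2) for p1, p2 in pairs]

# A trailing "?" switches a quantifier from greedy to non-greedy.
greedy = re.search(r"a{1,3}", "aaaa").group()
lazy = re.search(r"a{1,3}?", "aaaa").group()

print(results, greedy, lazy)
```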

4. Boundary matching

A regular expression can also match a boundary position, such as the beginning or end of a string, or the beginning or end of a word.
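A short Python sketch of boundary matching, using ^ and $ for the string boundaries and \b for word boundaries as Python's re module defines them. The sample text is an arbitrary example.

```python
import re

text = "cat concat cat"

# Without boundaries, "cat" is also found inside "concat".
anywhere = re.findall(r"cat", text)

# \b matches the position at the beginning or end of a word.
whole_words = re.findall(r"\bcat\b", text)

# ^ and $ match the beginning and the end of the string.
starts_with_cat = re.search(r"^cat", text) is not None
ends_with_cat = re.search(r"cat$", text) is not None
starts_with_concat = re.search(r"^concat", text) is not None

print(anywhere, whole_words, starts_with_cat, ends_with_cat, starts_with_concat)
```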

5. Logic and Grouping

6. Special construction

Note: "does not consume string content", as mentioned above, means that the construct only tests for a match and does not advance the current position in the string being matched, so several such tests can be applied at the same position. The password-matching regular expression given later relies on this feature.
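To make "does not consume string content" concrete, here is a sketch using the lookahead constructs (?=...) and (?!...) in Python. The strings are arbitrary examples.

```python
import re

# The lookahead checks that "bar" follows, but "bar" is not part of the match.
m = re.search(r"foo(?=bar)", "foobar")
matched_text = m.group()
end_position = m.end()  # the match ends right after "foo"

# A negative lookahead succeeds only if the given text does NOT follow.
neg_fail = re.search(r"foo(?!bar)", "foobar")
neg_ok = re.search(r"foo(?!bar)", "foobaz")

# Because lookaheads consume nothing, several can be tested at one position.
both = re.search(r"^(?=.*\d)(?=.*[a-z]).+$", "abc123")
only_letters = re.search(r"^(?=.*\d)(?=.*[a-z]).+$", "abcdef")

print(matched_text, end_position, neg_fail, neg_ok.group(), both.group(), only_letters)
```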

IV. Examples of common regular expressions

Writing a suitable regular expression is often time-consuming, so it is worth keeping some common ones on hand. Note, however, that no one can claim their regular expressions are completely rigorous, and no two sets of matching requirements are exactly the same. The following are just a few common regular expressions that I wrote myself; you are welcome to leave a comment to discuss them.

Note: These are only the core matching rules. In practice, you need to add the appropriate boundary characters to the start and end of each regular expression as the situation requires, such as ^, $, \A, \Z, \b, \B.

Match a network address (URL)

[a-zA-Z]+://[^\s]+

Note that a network address is not necessarily a web address (an HTTP or HTTPS link); it may be an FTP address, and so on. To match addresses of a particular protocol, such as HTTP or HTTPS links, you can write:

(https?://)?[^\s]+

Match an IP address

The simplest notation:

(\d+[.]){3}\d+

A strict notation, which limits each number to the range 0-255:

(?:(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])[.]){3}(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])
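The difference between the loose and the strict approach can be sketched in Python. The octet sub-pattern below is one possible way to restrict each number to 0-255, and the helper name is_ip is hypothetical; re.fullmatch plays the role of the ^ and $ anchors.

```python
import re

# One possible sub-pattern for a number from 0 to 255 (an assumption of this sketch).
octet = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"
strict_ip = re.compile(r"(?:" + octet + r"[.]){3}" + octet)
loose_ip = re.compile(r"(\d+[.]){3}\d+")

def is_ip(candidate):
    """Return True if the whole string is a dotted quad with octets in 0-255."""
    return strict_ip.fullmatch(candidate) is not None

for candidate in ["192.168.0.1", "255.255.255.255", "256.1.1.1", "1.2.3"]:
    print(candidate, is_ip(candidate))

# The loose pattern accepts out-of-range numbers.
print(loose_ip.fullmatch("999.999.999.999") is not None)
```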
Match an email address

The simplest notation:

\S+@\S+\.\S+

A strict notation (guarantees that there is only one @ character):

[^\s@]+@[^\s@]+\.[^\s@]+

To be very rigorous, different mailbox providers must be distinguished, because NetEase (126 and 163 mailboxes), QQ, Hotmail, and Gmail each have different rules about which characters a mailbox name may contain.

Match a NetEase mailbox: 6-18 characters, containing only letters, digits and underscores, and beginning with a letter

[a-zA-Z]\w{5,17}@(126|163)\.com

Match a QQ mailbox: 3-18 characters, containing only letters, digits, dots, hyphens and underscores

[\w.-]{3,18}@qq\.com

To match several kinds of mailbox rigorously with a single regular expression, for example both NetEase and QQ mailboxes, you can write:

(?:[a-zA-Z]\w{5,17}@(126|163)\.com)|(?:[\w.-]{3,18}@qq\.com)

Of course, you can also match against several regular expressions separately and then combine the results in program logic.
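The last approach, several expressions combined by program logic, might look like the following Python sketch. The patterns are the email patterns of this section as read here, the classify function is a hypothetical helper, and the addresses are made up.

```python
import re

simple = re.compile(r"\S+@\S+\.\S+")
single_at = re.compile(r"[^\s@]+@[^\s@]+\.[^\s@]+")
netease = re.compile(r"[a-zA-Z]\w{5,17}@(126|163)\.com")
qq = re.compile(r"[\w.-]{3,18}@qq\.com")

def classify(address):
    """Try the provider-specific patterns first, then the generic one."""
    if netease.fullmatch(address):
        return "netease"
    if qq.fullmatch(address):
        return "qq"
    if single_at.fullmatch(address):
        return "other"
    return None

for address in ["alice_01@163.com", "a.b-c@qq.com", "bob@example.org", "a@@b.com"]:
    print(address, classify(address))

# The simplest pattern lets a second "@" through; the strict one does not.
print(simple.fullmatch("a@@b.com") is not None)
```

Placing the hyphen last inside the brackets keeps it literal; Python's re module rejects a hyphen placed directly after \w as a bad character range.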

Check whether a password is valid

For simple requirements, such as only non-whitespace characters and a password length of 6-18 characters:

^\S{6,18}$

For more complex requirements, such as a password that must contain digits, uppercase letters, lowercase letters and punctuation, the special constructs described earlier, (?=...) and (?!...), are needed:

(?=^.{6,8}$)(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*\W+)

If the password must contain all of these and nothing else (only digits, uppercase letters, lowercase letters and punctuation), you can write:

(?=^.{6,8}$)(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*\W+)(?!.*[^\d\Wa-zA-Z])
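A Python sketch of the password checks, assuming the lookahead-based reading of the expressions in this section. The helper names and the sample passwords are hypothetical.

```python
import re

simple_rule = re.compile(r"^\S{6,18}$")
strong_rule = re.compile(
    r"(?=^.{6,8}$)(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*\W+)"
)

def is_simple_ok(password):
    """6-18 characters, none of them whitespace."""
    return simple_rule.search(password) is not None

def is_strong(password):
    """6-8 characters with a digit, a lowercase letter, an uppercase letter
    and a non-word character, each tested by its own lookahead."""
    return strong_rule.search(password) is not None

for password in ["Ab1!xyz", "abcdefg", "Ab1!", "Ab1!xyzABCD"]:
    print(password, is_simple_ok(password), is_strong(password))
```

Every condition is a lookahead tested at position 0, so the order of the conditions does not matter.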
Match a Mainland XXX number (15 or 18 digits)

\d{15}|(\d{18}|(\d{17}[xX]))

Tip: XXX numbers currently come in 15-digit and 18-digit forms. When the resident XXX system was introduced in 1985, the numbers issued had 15 digits. The XXX numbers issued from 1999 have 18 digits, because the year was expanded from two digits to four and a check code was added at the end. The two kinds of XXX number will coexist for quite a long time.

Match a date (year-month-day)

(\d{2}|\d{4})-((0?[1-9])|(1[0-2]))-((0?[1-9])|([12][0-9])|(3[01]))

Match a 24-hour time (hours:minutes:seconds)

(((0?|1)[0-9])|(2[0-3])):([0-5][0-9]):([0-5][0-9])
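The date and time expressions can be tried out in Python as follows. This is only a sketch: re.fullmatch stands in for the ^ and $ anchors, and the sample values are arbitrary.

```python
import re

date_re = re.compile(
    r"(\d{2}|\d{4})-((0?[1-9])|(1[0-2]))-((0?[1-9])|([12][0-9])|(3[01]))"
)
time_re = re.compile(r"(((0?|1)[0-9])|(2[0-3])):([0-5][0-9]):([0-5][0-9])")

def is_date(value):
    return date_re.fullmatch(value) is not None

def is_time(value):
    return time_re.fullmatch(value) is not None

for value in ["2024-02-29", "24-2-9", "2024-13-01", "2024-12-32"]:
    print(value, is_date(value))

for value in ["23:59:59", "7:05:09", "24:00:00", "12:60:00"]:
    print(value, is_time(value))
```

The date expression checks only the shape of the value. It does not know how many days each month has, so a value such as 2024-02-31 is still accepted.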
Other common regular expressions

Content to match                                  Regular expression
QQ number                                         [1-9]\d{4,}
Fixed-line telephone number in mainland China     (\d{3,4}-)?\d{7,8}
Mobile phone number in mainland China             1\d{10}
Postcode in mainland China                        \d{6}
Chinese characters                                [\u4e00-\u9fa5]
Chinese and full-width punctuation                [\u3000-\u301e\ufe10-\ufe19\ufe30-\ufe44\ufe50-\ufe6b\uff01-\uffee]
Words that do not contain abc                     \b((?!abc)\w)+\b
Positive integers                                 [1-9]\d*
Negative integers                                 -[1-9]\d*
Non-negative integers (positive integers and 0)   0|[1-9]\d*
Non-positive integers (negative integers and 0)   0|-[1-9]\d*
Integers                                          0|-?[1-9]\d*
Positive floating point numbers                   \d+\.\d+
Negative floating point numbers                   -\d+\.\d+
Floating point numbers                            -?\d+\.\d+

Again, add the appropriate boundary characters to the start and end of these regular expressions as the situation requires, such as ^, $, \A, \Z, \b, \B.
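Why the boundary characters matter can be seen from a small Python sketch that uses the six-digit postcode pattern \d{6} as an example. The input strings are made up.

```python
import re

postcode = r"\d{6}"

# Without boundaries the pattern happily matches part of a longer number.
unanchored = re.search(postcode, "id 1234567890")

# With ^ and $ (or \A and \Z) the whole string must be exactly six digits.
anchored_fail = re.search(r"^\d{6}$", "1234567890")
anchored_ok = re.search(r"\A\d{6}\Z", "100000")

# With \b the six digits must form a complete word inside a longer text.
in_text = re.findall(r"\b\d{6}\b", "100000 and 1234567")

print(unanchored.group(), anchored_fail, anchored_ok.group(), in_text)
```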

V. The matching process of regular expressions

Repeated matching of characters by means of quantifiers (such as ?, +, *, {m,n}, {m,}) is a major advantage of regular expressions over ordinary string processing, and an important part of regular expressions. Quantifiers have a very significant influence on the matching process, so before describing that process we must introduce two important classes of quantifier:

    • Match-first quantifiers: the quantifiers described above are match-first quantifiers. They include ?, +, *, {m,n}, but not {m}, which fixes the number of repetitions and so leaves nothing to choose.

    • Ignore-first quantifiers: adding a question mark after a match-first quantifier turns it into an ignore-first quantifier. They include ??, +?, *?, {m,n}?.

If these two terms are unfamiliar, you have probably heard of the following two:

    • Greedy mode (non-lazy matching): as the name implies, the sub-expression modified by the quantifier matches as many characters as possible, provided that the whole expression can still match.

    • Non-greedy mode (lazy matching): the sub-expression modified by the quantifier matches as few characters as possible, provided that the whole expression can still match.

The relationship between them is:

    • A sub-expression modified by a match-first quantifier uses greedy mode (non-lazy matching);

    • A sub-expression modified by an ignore-first quantifier uses non-greedy mode (lazy matching).

In other words, the two pairs of terms describe the same thing. Let us analyze the matching process in greedy and non-greedy modes with an example:

    • String to match: 'abcbd'

    • Greedy mode regular expression: a[bcd]*b

    • Non-greedy mode regular expression: a[bcd]*?b
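The outcome of the two expressions on this string can be checked with Python's re module, a backtracking engine that supports both modes. This is only a sketch of the result, not of the internal steps.

```python
import re

text = "abcbd"

# Greedy: [bcd]* first takes "bcbd", then backtracks until the final "b" fits.
greedy = re.search(r"a[bcd]*b", text).group()

# Non-greedy: [bcd]*? first takes nothing, and the final "b" matches at once.
lazy = re.search(r"a[bcd]*?b", text).group()

print(greedy, lazy)
```

Both expressions succeed, but they stop at different places: the greedy one ends at the last possible b, the non-greedy one at the first.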

1. Greedy Pattern matching process analysis

2. Non-greedy pattern matching process analysis

3. Summary

Greedy mode and non-greedy mode affect the matching behavior of sub-expressions modified by quantifiers. In greedy mode the sub-expression matches as much as possible provided that the whole expression can match; in non-greedy mode it matches as little as possible under the same condition. Note also that non-greedy mode is supported only by some NFA engines. As for efficiency, when both modes produce the same result, greedy mode is usually the more efficient, because it tends to involve less backtracking.

4. Supplementary examples

Here is a good example of the greedy matching process that I came across. The example comes from another article.

    • First, "<" gets control and tries to match starting at position 0. It is tried against the character "a" and fails, which ends the first round. The second round starts at position 1 and fails in the same way. The third round starts at position 2, where "<" matches the character "<" successfully, and control passes to "d".

    • "d" tries to match the character "d" and succeeds, and control passes to "i". This process repeats until ">" matches the character ">" and control passes to ".*".

    • ".*" is greedy, so starting from the character "t" it matches all the way to the end of the string, and then passes control to "<".

    • "<" tries to match at the end of the string and fails, so the engine looks back for a saved state to backtrack to and returns control to ".*". ".*" gives up one character, "c", and passes control to "<" again, which tries to match, fails, and triggers another backtrack. This process repeats until ".*" gives up the character "<" it had matched (in effect giving up the whole substring from that "<" through the final "cc"). Then "<" matches the character "<" successfully and control passes to "/".

    • Next, "/", "d", "i", "v" and ">" match their corresponding characters successfully, and at this point the whole regular expression has matched.
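The walk-through above can be reproduced with a short Python sketch. The source string and the expression <div>.*</div> are assumptions reconstructed from the steps (a string that begins with "aa<div>t" and ends with "cc"); they are not quoted from the original example.

```python
import re

# Assumed input, reconstructed from the steps described above.
html = "aa<div>test1</div>bb<div>test2</div>cc"

# Greedy: ".*" runs to the end of the string, then gives characters back
# until the final "</div>" can match.
greedy = re.search(r"<div>.*</div>", html)

# Non-greedy, for comparison: ".*?" stops at the first "</div>".
lazy = re.search(r"<div>.*?</div>", html)

print(greedy.start(), greedy.group())
print(lazy.start(), lazy.group())
```

Both matches start at position 2, the first "<" in the string. They differ only in where they end.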

VI. Flags in regular expressions

As described above, greedy and non-greedy modes affect the matching behavior of sub-expressions modified by quantifiers; the flags described here affect how the regular expression works as a whole. Programming languages usually provide predefined constants for these flags, which you can look up in the relevant documentation when you need them. The commonly used flags are as follows:

    • Ignore case: by default, regular expressions are case-sensitive when matching; with this flag, letter case is ignored.

    • Dot matches any character: this flag affects the metacharacter '.'. By default '.' matches any character except a newline; when this flag is specified, '.' matches any character, including a newline.

    • Multi-line matching: this flag affects the metacharacters '^' and '$'. By default they match only the beginning and end of the whole string; when this flag is specified, they also match the beginning and end of each line.
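In Python's re module these three flags are available as the constants re.IGNORECASE, re.DOTALL and re.MULTILINE; other languages use different names. A minimal sketch with a made-up two-line string:

```python
import re

text = "first line\nSecond line"

# Ignore case: matching is case-sensitive unless the flag is given.
cs = re.search(r"second", text)
ci = re.search(r"second", text, re.IGNORECASE)

# Dot matches any character: by default "." stops at the newline.
dot_default = re.search(r"first.*Second", text)
dot_all = re.search(r"first.*Second", text, re.DOTALL)

# Multi-line: "^" and "$" also match at the start and end of each line.
starts_default = re.findall(r"^\w+", text)
starts_multi = re.findall(r"^\w+", text, re.MULTILINE)
ends_multi = re.findall(r"\w+$", text, re.MULTILINE)

print(cs, ci.group(), dot_default, dot_all is not None)
print(starts_default, starts_multi, ends_multi)
```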
VII. References

      • https://docs.python.org/3.5/howto/regex.html

      • http://blog.csdn.net/lxcnn/article/details/4756030

      • A classic summary chart (I saved it long ago and have forgotten the source; if you know it, please tell me and I will add the link here. Thank you.)

