Regular Expression Basics

Last Update:2015-05-07 Source: Internet

Author: User

Tags control characters expression engine


I have finally settled on CSDN as the place to organize the notes I have published over the past few years, and this article was migrated there as part of that effort. CSDN also supports Markdown syntax, which is great!

[Craftsman's water: http://blog.csdn.net/yanbober]

I. Overview

A regular expression is a single string that describes and matches a whole set of strings conforming to a syntax rule. Text editors commonly use regular expressions to find and replace text that matches a pattern. Because regular expressions are mainly applied to text, they are supported by many editors, from the small but well-known EditPlus to large applications such as Microsoft Word and Visual Studio. (PS: beginners often find regular expressions baffling, but once you understand them you will see how powerful they are!)

Given a regular expression and a string, we can do the following:

Test whether the string satisfies the regular expression's filtering logic (this is called a "match");
Extract the specific parts we want from the string.

Regular expressions have the following characteristics:

They are very flexible, logical, and functional;
They can express complex string handling quickly and concisely;
They are hard to read for newcomers.

Tools worth setting up before you start:

1. RegexPal is recommended as an online regular expression tester.

RegexPal is a JavaScript-based online tool for testing regular expressions. The upper input box takes the regular expression (the matching rule), and the lower input box takes the text to match against. You can also set options such as case-insensitive and multiline matching as needed.

2. RegexBuddy 3 is recommended as a locally installed tool.

RegexBuddy is a regular expression editor that helps you write the expressions you need. It is also useful for understanding expressions written by others.

3. RegExr, another handy tool for learning

No explanation needed here; look it up yourself.

Summary:

The tools need little explanation. They are all simple enough to figure out on your own. Now let's move on to the real material.

II. Pattern Matching Basics

Matching string literals:

Original string: "Yanbo"
Regular Expression: "Yanbo"

This is the simplest and most direct kind of matching: a regular expression that consists only of literal characters.

Matching digits:

Regular Expression: "\d", "[0-9]", or "[0123456789]"
Original string: "3"

The three regular expressions above have the same effect: each matches a single digit from 0 to 9 in the target string, and only one digit. Each form has its use. "\d" stands for any digit, "[m-n]" stands for one digit in the range m to n, and "[abcd]" matches one character from the listed set. In particular, it is a mistake to think that "[0123456789]" matches the string "0123456789"; it matches exactly one character! You can also combine the forms:

Regular Expression: "[015-7]"
Matching numbers: 0, 1, 5, 6, and 7
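
As a rough illustration using Python's built-in re module (the sample text is made up for this sketch), the three digit patterns can be compared side by side, together with the mixed class "[015-7]":

```python
import re

text = "Order 3 ships in 15 days"  # hypothetical sample text

# Each pattern matches exactly one digit per match, so "15" yields two matches.
results = {p: re.findall(p, text) for p in (r"\d", r"[0-9]", r"[0123456789]")}
for pattern, found in results.items():
    print(pattern, found)

# A class can mix single characters and ranges: 0, 1, 5, 6 or 7.
subset = re.findall(r"[015-7]", "0123456789")
print(subset)
```

All three patterns return the same list here because the sample contains only ASCII digits.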

Matching non-digit characters:

Regular Expression: "\D", "[^0-9]", or "[^\d]"
Match string: a non-digit character (remember: this also matches just one character).

The three regular expressions above have the same effect: each matches a single non-digit character. A "^" at the start of "[]" negates the class, excluding the characters listed after the "^".

Matching word and non-word characters:

First, note that these match a single word character or non-word character, not a whole word!

The shorthand "\w" matches any word character (letters, digits, and the underscore).
"\D" matches non-digit characters, including spaces, punctuation (quotation marks, hyphens, backslashes, square brackets), and other characters.
For English text, "\w" matches the same characters as "[_a-zA-Z0-9]".
"\W" matches non-word characters (spaces, punctuation, and anything else that is not a letter, digit, or underscore).
For English text, "\W" matches the same characters as "[^_a-zA-Z0-9]".
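
A small Python sketch of the word and non-word shorthands follows. The sample string is hypothetical, and the re.ASCII flag is an assumption added so that "\w" is limited to the English character set described above:

```python
import re

sample = "user_42 says: hi!"  # hypothetical sample text

# re.ASCII restricts \w and \W to ASCII letters, digits and underscore.
word_chars = re.findall(r"\w", sample, flags=re.ASCII)
non_word_chars = re.findall(r"\W", sample, flags=re.ASCII)

# The explicit character class is equivalent for English text.
explicit = re.findall(r"[_a-zA-Z0-9]", sample)

print("".join(word_chars))
print(non_word_chars)
```

Each list element is a single character, which reflects the point that these shorthands match one character at a time rather than a word.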

More shorthand characters are listed below. Note that not every regular expression engine recognizes all of them.

Shorthand	Description
\a	Alert (bell) character
[\b]	Backspace character
\c x	Control character
\d	Digit
\D	Non-digit
\w	Word character
\W	Non-word character
\0	Null character
\x xx	Hexadecimal value of a character
\u xxxx	Unicode value of a character

Matching whitespace and non-whitespace characters:

Regular Expression: "\s" or "[ \t\n\r]"
Matching result: whitespace characters (space, tab, line feed, and carriage return)

Note that "[ \t\n\r]" begins with a literal space before "\t\n\r". Without it, the class would not match the space character.

Regular Expression: "\S", "[^\s]", or "[^ \t\n\r]"
Matching result: non-whitespace characters (anything except space, tab, line feed, and carriage return).
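
The following Python sketch (with a made-up sample string) compares "\s" with the explicit class that starts with a literal space, and shows "\S" as the complement:

```python
import re

sample = "a b\tc\nd\re"  # space, tab, line feed and carriage return between letters

whitespace = re.findall(r"\s", sample)
explicit = re.findall(r"[ \t\n\r]", sample)   # note the leading space in the class
without_space = re.findall(r"[\t\n\r]", sample)
non_whitespace = re.findall(r"\S", sample)

print(whitespace, without_space, non_whitespace)
```

Dropping the leading space from the class means the space character is no longer matched, which is the pitfall mentioned above.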

Besides "\s", there are shorthands for other, less common whitespace characters:

Shorthand	Description
\s	Whitespace character
\S	Non-whitespace character
\f	Form feed
\h	Horizontal whitespace
\H	Non-horizontal whitespace
\n	Line feed
\r	Carriage return
\t	Horizontal tab
\v	Vertical tab
\V	Non-vertical whitespace

Matching any character:

"." matches any single character except a line terminator (some modes, such as dotall, make it match line terminators too).

Regular Expression: "\b\w{3}\b"
Matching string: a three-character word

In the expression above, the shorthand "\b" matches a word boundary without consuming any characters. A boundary is usually written on both sides of the word. The following is a special case.

Regular Expression: "a.c\."
Matching string: axc.

The expression above matches "axc.", where "x" can be any character and the final character must be a literal "." because it is escaped with a backslash.
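
To make the difference between the unescaped and the escaped dot concrete, here is a minimal Python sketch. The candidate strings are invented for illustration:

```python
import re

pattern = re.compile(r"a.c\.")

# fullmatch requires the whole candidate to match the pattern.
candidates = ("axc.", "a-c.", "axcx", "ac.")
results = {s: bool(pattern.fullmatch(s)) for s in candidates}
print(results)

# By default "." does not match a line feed; re.DOTALL changes that.
default_mode = bool(re.fullmatch(r"a.c\.", "a\nc."))
dotall_mode = bool(re.fullmatch(r"a.c\.", "a\nc.", flags=re.DOTALL))
print(default_mode, dotall_mode)
```

"axcx" fails because the last position must be a literal period, and "ac." fails because the unescaped dot must consume exactly one character.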

Hands-on demonstration:

Now that we have covered some entry-level regular expressions, let's put them to use. The tools from the first section are too simple for that, so we will use sed, the Linux stream editor. (PS: if you are not familiar with sed, look it up on Google or Baidu.)

Here we take the string "This is YanBo's Blog!" and output it as an HTML second-level heading.

sed command:

echo "This is YanBo's Blog!" | sed 's/^/<h2>/;s/$/<\/h2>/p;q'

The command above runs in a Linux terminal as follows:

echo prints "This is YanBo's Blog!" and the pipe ("|") passes that output to sed as its input.
By default, sed copies each input line to the output, so without the -n option the line printed by p appears a second time.
s/^/<h2>/ adds the opening HTML second-level heading tag at the start of the line (^).
The semicolon (;) separates commands.
s/$/<\/h2>/ adds the closing HTML second-level heading tag at the end of the line ($).
The p command prints the affected line.
The q command ends the sed program.
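
The same idea can be sketched in Python instead of sed (Python is an assumption made for illustration; the article itself uses sed). Because "^" and "$" match positions rather than characters, substituting them inserts text without removing anything:

```python
import re

line = "This is YanBo's Blog!"

# "^" matches the empty position at the start of the line,
# so the opening tag is inserted in front of the text.
opened = re.sub(r"^", "<h2>", line)

# "$" matches the empty position at the end of the line.
wrapped = re.sub(r"$", "</h2>", opened)

print(wrapped)
```

The input here has no trailing line feed; with one, "$" could also match just before it, so real input may need stripping first.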
 
 
Summary: This part covered basic matching with regular expressions, which is an entry-level skill. Next we move on to more advanced topics.

III. Regular Expressions: Boundaries

Warm-up: Boundaries are one of the core concepts of regular expressions. Understanding assertions (zero-width assertions) is enough to get started.

An assertion (zero-width assertion) marks a boundary. It consumes no characters; instead of matching characters, it matches a position in the string.
Start and end of a string or line:

"^" matches the start of a line or string, or the start of the entire document.
"$" matches the end of a line or string.

Example:
Regular Expression: "^word$"
Match string: word (only a string that consists of exactly "word", starting with w and ending with d).
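
A short Python sketch of the two anchors follows. The test strings are hypothetical, and re.MULTILINE is Python's name for the mode in which "^" and "$" also match at line boundaries:

```python
import re

exact = re.compile(r"^word$")
checks = {s: bool(exact.search(s)) for s in ("word", "a word", "words")}
print(checks)

lines = "word\nsword\nword"

# Without MULTILINE the anchors refer to the whole string only.
whole_string = re.findall(r"^word$", lines)

# With MULTILINE they also match at the start and end of each line.
per_line = re.findall(r"^word$", lines, flags=re.MULTILINE)

print(whole_string, per_line)
```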
Word and non-word boundaries:

"\bxxx\b" matches xxx as a whole word, with a word boundary on each side.
"\b" is a zero-width assertion. It may look as though it matches a space or the start of a line, but it actually matches a zero-width position.
"\B" matches a non-word boundary, that is, a position that is not at the edge of a word.

Example:
Regular Expression: "\Ba\B"
Matching string: the "a" in "fhrrhahhr" (neither side of this "a" is a word boundary).
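
The contrast between "\b" and "\B" can be sketched in Python as follows. The sentence is a made-up sample; the second string is the one from the example above:

```python
import re

text = "a cat has a hat"  # hypothetical sample text

# "\ba\b": the letter a standing alone as a word.
standalone = [m.start() for m in re.finditer(r"\ba\b", text)]

# "\Ba\B": the letter a with word characters on both sides.
inside = [m.start() for m in re.finditer(r"\Ba\B", text)]

print(standalone, inside)

# The article's example: the a inside "fhrrhahhr".
position = re.search(r"\Ba\B", "fhrrhahhr").start()
print(position)
```

Both assertions are zero-width: every match is the single character "a", and only its position decides which pattern applies.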
Other anchors:

"\A" is similar to "^": it anchors the match to the start of the subject. It is not available in every regular expression implementation, but it works in Perl and PCRE. To match the end of the subject, use "\Z"; "\z" can also be used in some contexts.

Example:
Regular Expression: "\Aaaaa\Z"
Matching string: "aaaa" (the subject both starts and ends with aaaa)
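
Python's re module also supports "\A" and "\Z", so the difference from "^" and "$" can be sketched there. One caveat: in Python, "\Z" matches only at the very end of the string, which corresponds to "\z" in Perl and PCRE. The multi-line subject is invented for this sketch:

```python
import re

subject = "aaaa\nbbbb\naaaa"  # hypothetical three-line subject

# In multiline mode, ^ and $ match at every line boundary...
line_anchored = re.findall(r"^aaaa$", subject, flags=re.MULTILINE)

# ...but \A and \Z still refer to the subject as a whole.
at_start = [m.start() for m in re.finditer(r"\Aaaaa", subject, flags=re.MULTILINE)]
at_end = [m.start() for m in re.finditer(r"aaaa\Z", subject, flags=re.MULTILINE)]

whole = bool(re.search(r"\Aaaaa\Z", "aaaa"))

print(line_anchored, at_start, at_end, whole)
```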
Metacharacters as literals:

The characters between "\Q" and "\E" are matched as a string literal. The 15 metacharacters ". ^ $ * + ? | ( ) { } [ ] \ -" have special meanings in regular expressions and are used to write patterns. The hyphen (-) indicates a range inside square brackets and has no special meaning elsewhere. If you type these characters directly into a regular expression, they are not matched literally. To match them literally, place them between "\Q" and "\E", or put a backslash ("\") in front of each one.

Example:
Regular Expression: "\Q$\E" or "\$"
Matching character: the $ character itself
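
Python's re module does not support "\Q...\E", so the sketch below swaps in re.escape, which produces the backslash-escaped form of a literal instead. The price text is a hypothetical sample:

```python
import re

price_text = "Total: $5.00 (approx.)"  # hypothetical sample text
literal = "$5.00"

# re.escape adds a backslash before each metacharacter in the literal.
pattern = re.escape(literal)
found = re.findall(pattern, price_text)

# Escaping a single metacharacter by hand works the same way.
dollars = re.findall(r"\$", price_text)

# Unescaped, "$" is an anchor: it matches the empty position at the end.
unescaped = re.findall(r"$", price_text)

print(pattern, found, dollars, unescaped)
```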
Hands-on practice: As in the previous section, we could keep adding tags with the Linux sed command. The sed command (i) inserts text before a position in a file or string; its counterpart (a) appends text after a position. Practical sed (or grep, vi, and Vim) examples are not given here; look them up and try them yourself. The focus here is on regular expressions.

Summary: This part covered boundaries and assertions (zero-width assertions). Nothing more to summarize; let's move on to the essence of regular expressions.
IV. Alternation, Grouping, and Backreferences

Alternation:

Alternation matches one of several alternative patterns. For example, to count how often "the" appears in any of its forms (the, The, THE) in "the android developer need Fix bug on the bug system.", use alternation.

Regular Expression: "(the|The|THE)" or "(?i)the"
Original string: "the android developer need Fix bug on the bug system."
Matching result: the,

The first regular expression matches the three listed spellings; the second matches "the" in any combination of upper and lower case.
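
Counting the occurrences can be sketched in Python as follows. The sentence is a made-up sample that contains all three spellings:

```python
import re

sentence = "The fix is on the bug system, says THE developer."  # hypothetical

# Alternation: one of the three listed spellings.
alternation = re.findall(r"(the|The|THE)", sentence)

# Inline option: any mix of upper and lower case.
insensitive = re.findall(r"(?i)the", sentence)

print(len(alternation), alternation)
print(len(insensitive), insensitive)
```

The two patterns agree on this sample, but only "(?i)the" would also match spellings such as "tHe".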
The following options (modifiers) are also available. Note that not every platform supports all of them:

Option	Description	Supported platforms
(?d)	Unix lines	Java
(?i)	Case-insensitive	PCRE, Perl, Java
(?J)	Allow duplicate names	PCRE
(?m)	Multiline	PCRE, Perl, Java
(?s)	Single line (dotall)	PCRE, Perl, Java
(?u)	Unicode case	Java
(?U)	Default match is lazy	PCRE
(?x)	Ignore whitespace and comments	PCRE, Perl, Java
(?-...)	Unset or turn off options	PCRE
Subpatterns:

A subpattern is one or more groups within a group, that is, a pattern inside a pattern. In most cases a subpattern can only match once the pattern before it has matched, but there are exceptions. For example, in "(the|The|THE)" no alternative depends on another, because the first one that matches is used. That example has three subpatterns: the, The, and THE. Subpatterns can be written in many ways; here we focus on subpatterns in parentheses.

Example (a subpattern whose match depends on the preceding pattern):
Regular Expression: "(t|T)h(e|E)"
Match: the,

In the example above, the second subpattern "(e|E)" depends on the first subpattern "(t|T)".

Note that a subpattern does not require parentheses! For example:
Regular Expression: "\b[tT]h[eE]"
Match: the,

Here the character class "[tT]" can be regarded as the first subpattern, and "[eE]" as the second.
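
The two ways of writing the subpatterns can be compared with a short Python sketch. The sample text is hypothetical:

```python
import re

text = "the The thE THe"  # hypothetical sample; "THe" has an uppercase H

# Parenthesized subpatterns: each group captures what it matched.
grouped = [m.group(0) for m in re.finditer(r"(t|T)h(e|E)", text)]
captured = re.search(r"(t|T)h(e|E)", "The").groups()

# Character classes match the same text but capture nothing.
with_classes = re.findall(r"\b[tT]h[eE]", text)

print(grouped, captured, with_classes)
```

"THe" is skipped by both patterns because the middle "h" is written as a literal lowercase letter.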
Capturing groups and backreferences:

When all or part of a pattern is enclosed in a pair of parentheses, the text it matches is captured and stored temporarily in memory. The captured text can be reused through backreferences such as:

"\1", "\2" or "$1", "$2", and so on for each captured group.
sed accepts only the "\1" form.

Example (a backreference with the Linux sed command):
echo "YanBo is an Android Developer!" | sed -En 's/(YanBo is) (an Android Developer)/\2 \1/p'
Output: an Android Developer YanBo is!
Explanation:
-E makes sed use extended regular expressions (ERE), so the parentheses group without being escaped.
-n suppresses sed's default printing of each line.
\2 and \1 in the replacement refer to capture groups 2 and 1.
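
The sed substitution above has a close Python counterpart, sketched here with re.sub. The second part, which finds a doubled word, is an extra illustration that is not in the original:

```python
import re

line = "YanBo is an Android Developer!"

# In the replacement, \2 and \1 refer to the second and first capture groups.
swapped = re.sub(r"(YanBo is) (an Android Developer)", r"\2 \1", line)
print(swapped)

# A backreference can also be used inside the pattern itself:
# \1 must repeat exactly what group 1 captured.
doubled = re.search(r"\b(\w+) \1\b", "fix the the bug")
print(doubled.group(1))
```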
Named groups:

A named group is a group that has a name. You can then refer to the group by name instead of by number.

Named group syntax:

Syntax	Description
(?<name>...)	Named group
(?'name'...)	Another way to name a group
(?P<name>...)	Named group in Python
\k<name>	Reference by name in Perl
\k'name'	Reference by name in Perl
\g{name}	Reference by name in Perl
\k{name}	Reference by name in .NET
(?P=name)	Reference by name in Python
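
The Python rows of the table can be tried out with a minimal sketch. The group names (year, month, day, word) and the sample strings are assumptions chosen for illustration:

```python
import re

# (?P<name>...) defines a named group.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
match = date.search("Last Update: 2015-05-07")
print(match.group("year"), match.groupdict())

# \g<name> refers to a named group in a replacement string.
reordered = date.sub(r"\g<day>/\g<month>/\g<year>", "2015-05-07")
print(reordered)

# (?P=name) refers back to a named group inside the pattern.
doubled = re.search(r"\b(?P<word>\w+) (?P=word)\b", "fix the the bug")
print(doubled.group("word"))
```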
       
      
    
Non-capturing groups:

Non-capturing groups do not store their content in memory. Use them when you do not need to refer back to a group. Because nothing is stored, non-capturing groups perform better.

Example:
A capturing group is written as: "(the|The|THE)"
If you do not need any backreference, write it as: "(?:the|The|THE)"
Case-insensitive: "(?i)(?:the)" or "(?:(?i)the)" or (recommended) "(?i:the)"
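
A brief Python sketch of the difference follows. The trailing " end" and " END" in the patterns are additions made only to give the groups some context:

```python
import re

capturing = re.compile(r"(the|The|THE) end")
non_capturing = re.compile(r"(?:the|The|THE) end")

# Both match the same text, but only one stores a group.
print(capturing.groups, non_capturing.groups)
print(capturing.search("The end").group(1))
print(non_capturing.search("The end").group(0))

# (?i:...) limits case-insensitivity to the group itself.
scoped = re.compile(r"(?i:the) END")
print(bool(scoped.search("tHe END")), bool(scoped.search("the end")))
```

In the last pattern only "the" is case-insensitive, so "END" must still be uppercase.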
Atomic groups:

An atomic group is another kind of non-capturing group. If the regular expression engine backtracks, an atomic group turns backtracking off, but only for the part inside the atomic group, not for the entire expression. Syntax:

"(?>the)"

Backtracking is one reason regular expressions can be slow.

Summary: No summary here; let's keep going. The next part is more advanced!
VI. Regular Expressions: Quantifiers

Greedy, lazy, and possessive:

Quantifiers are greedy by default. A greedy quantifier first tries to match as much of the string as it can, then backs off one character at a time until the overall match succeeds, so it tends to consume the most resources.

A lazy quantifier uses the opposite strategy. It starts at the beginning of the target, takes as few characters as possible, and extends one character at a time, matching the entire string only as a last resort. To make a quantifier lazy, add a question mark (?) after the ordinary quantifier.

A possessive quantifier grabs as much of the target as it can and then tries to complete the match, but it makes only one attempt and never backtracks. To make a quantifier possessive, add a plus sign (+) after the ordinary quantifier.
Matching with *, +, and ?:

The following basic quantifiers are greedy by default.

Syntax	Description
?	Zero or one
+	One or more
*	Zero or more

For example:
Regular Expression: "9+"
Match: one or more 9s
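
The three basic quantifiers can be tried out with a few lines of Python. All sample strings are invented for this sketch:

```python
import re

# "+": one or more 9s, taking as many as possible each time.
nines = re.findall(r"9+", "19 990 9999 x")

# "?": the u is optional.
optional = re.findall(r"colou?r", "color colour")

# "*": zero or more b after the a.
star = re.findall(r"ab*", "a ab abbb")

print(nines, optional, star)
```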
Matching a specific number of times:

The following brace quantifiers are the most precise quantifiers. They are greedy by default.

Syntax	Description
{n}	Match exactly n times
{n,}	Match n or more times
{m,n}	Match m to n times
{0,1}	Same as ?, zero or one
{1,}	Same as +, one or more
{0,}	Same as *, zero or more
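
A small Python sketch of the brace quantifiers follows. The sample string is hypothetical, and "\b" is added on both sides so that only whole runs of 7s are counted:

```python
import re

digits = "7 77 777 7777"  # hypothetical sample text

exactly_three = re.findall(r"\b7{3}\b", digits)
two_or_more = re.findall(r"\b7{2,}\b", digits)
two_to_three = re.findall(r"\b7{2,3}\b", digits)

print(exactly_three, two_or_more, two_to_three)

# The brace forms are equivalent to the basic quantifiers.
same_as_plus = re.findall(r"7{1,}", digits) == re.findall(r"7+", digits)
same_as_star = re.findall(r"7{0,}", digits) == re.findall(r"7*", digits)
same_as_question = re.findall(r"7{0,1}", digits) == re.findall(r"7?", digits)
print(same_as_plus, same_as_star, same_as_question)
```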
       
      
    
Lazy quantifiers:

In practice, lazy quantifiers behave as follows:

Regular Expression: "8?"
Match: zero or one 8
Regular Expression: "8??" (lazy)
Match: no 8 is matched, because a lazy quantifier matches as few as possible.
Regular Expression: "8*?" (lazy)
Match: no 8 is matched, because a lazy quantifier matches as few as possible.
Regular Expression: "8+?" (lazy)
Match: a single 8 is matched.
Regular Expression: "8{3,8}?" (lazy)
Match: three 8s are matched.
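
These results can be reproduced with a short Python sketch. The subject of eight 8s is an assumption, and the greedy "8{3,8}" is added for comparison:

```python
import re

subject = "88888888"  # eight 8s, assumed subject
patterns = [r"8?", r"8??", r"8*?", r"8+?", r"8{3,8}?", r"8{3,8}"]

# re.match anchors at the start of the subject; group(0) is the matched text.
matched = {p: re.match(p, subject).group(0) for p in patterns}

for pattern, text in matched.items():
    print(f"{pattern:10} -> {text!r}")
```

For "8??" and "8*?" the match succeeds but is the empty string, which is what "no 8 is matched" means here.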
Lazy quantifier table:

Syntax	Description
??	Lazy, zero or one
+?	Lazy, one or more
*?	Lazy, zero or more
{n}?	Lazy, exactly n times
{n,}?	Lazy, n or more times
{m,n}?	Lazy, m to n times
Possessive quantifier table:

Syntax	Description
?+	Possessive, zero or one
++	Possessive, one or more
*+	Possessive, zero or more
{n}+	Possessive, exactly n times
{n,}+	Possessive, n or more times
{m,n}+	Possessive, m to n times
Example:
Regular Expression: "1.*+"
Match: the whole run of 1s is matched.
Regular Expression: ".*+1"
Match: no match, because ".*+" consumes everything and never backtracks to give up a 1.
Regular Expression: ".*1"
Match: a string that ends with 1 (greedy mode; the engine backtracks to find the final 1).

Summary: The quantifiers introduced here are key to regular expression efficiency. Nothing more to explain; let's keep going.
VII. Regular Expressions: Lookaround

A lookaround is a non-capturing group, also known as a zero-width assertion.

Positive lookahead:

Example:
Regular Expression: "(?i)aaa (?=bbb)"
Original string: "aaa ccc bbb aaa bbb ccc AAA"
Match: only the second "aaa" is matched.

The expression above matches aaa only when the word is followed by bbb. Positive lookahead achieves this.
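
Positive lookahead can be checked with a short Python sketch on the string from the example. The pattern is written here with a literal space after "aaa", which is an assumption about the intended spacing:

```python
import re

text = "aaa ccc bbb aaa bbb ccc AAA"

# (?=bbb) requires "bbb" to follow, but does not consume it.
hits = [(m.start(), m.group(0)) for m in re.finditer(r"(?i)aaa (?=bbb)", text)]
print(hits)
```

The matched text is "aaa " with its trailing space; "bbb" is only checked, which is why the assertion is called zero-width.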
Negative lookahead:

Negative lookahead is the negation of positive lookahead.

Example:
Regular Expression: "(?i)aaa (?!bbb)"
Original string: "aaa ccc bbb aaa bbb ccc"
Match: only the first "aaa" is matched.

The expression above matches aaa only when the word is not followed by bbb. Negative lookahead achieves this.
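
The negative form can be sketched the same way in Python. As before, the literal space after "aaa" in the pattern is an assumption about the intended spacing:

```python
import re

text = "aaa ccc bbb aaa bbb ccc"

# (?!bbb) succeeds only where "bbb" does NOT follow.
hits = [(m.start(), m.group(0)) for m in re.finditer(r"(?i)aaa (?!bbb)", text)]
print(hits)
```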
Positive lookbehind:

Positive lookbehind looks in the opposite direction from positive lookahead.

Example:
Regular Expression: "(?<=aaa) bbb"
Original string: "aaa ccc bbb aaa bbb ccc AAA"
Match: only the second "bbb" is matched.
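
A matching Python sketch for positive lookbehind follows. The pattern includes a literal space before "bbb", which is an assumption about the intended spacing; Python also requires the lookbehind to have a fixed width, which "aaa" satisfies:

```python
import re

text = "aaa ccc bbb aaa bbb ccc AAA"

# (?<=aaa) requires "aaa" immediately before the match, without consuming it.
hits = [(m.start(), m.group(0)) for m in re.finditer(r"(?<=aaa) bbb", text)]
print(hits)
```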
Negative lookbehind:

Negative lookbehind is the negation of positive lookbehind, written "(?<!...)".
 Example:
 Regular Expression :"(?
Summary: The examples in this part speak for themselves, so no further summary is needed.

Conclusion

That covers most of what regular expressions offer. As for how to learn them: practice boldly, think each pattern through, and then verify it in an editor.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
