Introduction to regular expressions

Source: Internet
Author: User

If you have not used regular expressions before, you may not be familiar with this term or concept. However, they are not as novel as you might think.

Recall how you search for files on your hard disk. You almost certainly use the ? and * characters to help find the files you are looking for. The ? character matches a single character in a file name, while * matches zero or more characters. A pattern such as 'data?.dat' finds the following files:

Data1.dat

Data2.dat

Datax.dat

DataN.dat

If you use the * character instead of the ? character, the number of files found increases. 'data*.dat' matches all of the following file names:

Data.dat

Data1.dat

Data2.dat

Data12.dat

Datax.dat

DataXYZ.dat

Although this method of searching for files is certainly useful, it is also very limited. The limited ability of the ? and * wildcard characters gives you an idea of what regular expressions can do, but regular expressions are far more powerful and flexible.
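To make the link between wildcards and regular expressions concrete, here is a minimal JavaScript sketch that translates a wildcard pattern into a regular expression. The helper name wildcardToRegExp and the case-insensitive flag are assumptions made for this illustration; they do not come from any particular file system or library.

```javascript
// Hypothetical helper: translate a file-name wildcard into a RegExp.
// "?" becomes "." (exactly one character); "*" becomes ".*" (zero or more).
function wildcardToRegExp(wildcard) {
  // Escape regex metacharacters other than ? and *, which are translated below.
  const escaped = wildcard.replace(/[.+^${}()|[\]\\]/g, "\\$&");
  const body = escaped.replace(/\?/g, ".").replace(/\*/g, ".*");
  // Anchor both ends so the whole name must match. The "i" flag imitates a
  // case-insensitive file system (an assumption of this sketch).
  return new RegExp("^" + body + "$", "i");
}

const names = ["Data.dat", "Data1.dat", "Data12.dat", "DataXYZ.dat"];
const single = names.filter((n) => wildcardToRegExp("data?.dat").test(n));
const many = names.filter((n) => wildcardToRegExp("data*.dat").test(n));

console.log(single); // names with exactly one character after "Data"
console.log(many);   // every name in the list
```

Everything a wildcard can express fits in two regular expression constructs, which hints at how much more the full syntax can describe.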

-------------------------------------------------------
2. Early origins

The "ancestors" of regular expressions can be traced back to early research on how the human nervous system works. Warren McCulloch and Walter Pitts, two neuroscientists, developed a mathematical way of describing these neural networks.

In 1956, an American mathematician named Stephen Kleene, building on the earlier work of McCulloch and Pitts, published a paper entitled "Representation of Events in Nerve Nets and Finite Automata" that introduced the concept of regular expressions. Regular expressions were the expressions used to describe what he called "the algebra of regular sets", hence the term "regular expression".

Subsequently, this work found its way into some early efforts with computational search algorithms by Ken Thompson, the principal inventor of Unix. The first practical application of regular expressions was in the QED editor.

And the rest, as they say, is history. Regular expressions have been an important part of text-based editors and search tools ever since.
--------------------------------------------------------
3. Using regular expressions

In a typical search and replace operation, you must provide the exact text you want to find. This technique may be sufficient for simple search and replace tasks in static text, but because of its lack of flexibility, it is difficult or even impossible to search for dynamic text.

Using regular expressions, you can:

1. Test for a pattern within a string. For example, you can test an input string to see whether a phone number pattern or a credit card number pattern occurs within it. This is known as data validation.

2. Replace text. You can use a regular expression to identify specific text in a document and then either remove it completely or replace it with other text.

3. Extract a substring from a string based on a pattern match. You can find specific text within a document or an input field.
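The three uses listed above can be sketched in a few lines of JavaScript. The phone number pattern here (three digits, a hyphen, four digits) is an assumption chosen only for illustration; real validation rules are usually stricter.

```javascript
// Illustrative pattern (an assumption): three digits, a hyphen, four digits.
const phone = /\d{3}-\d{4}/;
const text = "Call 555-0142 today";

// 1. Test for a pattern within a string (data validation).
const hasPhone = phone.test(text);

// 2. Replace the text that matches the pattern.
const redacted = text.replace(phone, "[hidden]");

// 3. Extract the substring that matches the pattern.
const found = text.match(phone);
const extracted = found ? found[0] : null;

console.log(hasPhone, redacted, extracted);
```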

For example, if you need to search an entire Web site to remove some outdated material and replace some HTML formatting tags, you can use a regular expression to test each file and see whether it contains the material or the HTML formatting tags you are looking for. That way, you can narrow the affected files down to those that contain the material to be removed or changed. You can then use a regular expression to remove the outdated material and, finally, use regular expressions again to search for and replace the tags that need replacing.

Another example of where a regular expression is useful is in a language that is not known for its string-handling ability. VBScript, a subset of Visual Basic, has a rich set of string-handling functions. JScript, like C, does not. Regular expressions significantly improve string handling in JScript. However, regular expressions may also be more efficient to use in VBScript, because they let you perform multiple string manipulations in a single expression.

4. Regular expression syntax
A regular expression is a pattern of text that consists of ordinary characters (for example, the letters a through z) and special characters, known as metacharacters. The pattern describes one or more strings to match when searching a body of text. The regular expression serves as a template for matching a character pattern against the string being searched.

Here are some examples of regular expressions that you might encounter:

JScript    VBScript    Matches

/^[ \t]*$/    "^[ \t]*$"    Matches a blank line.

/\d{2}-\d{5}/    "\d{2}-\d{5}"    Validates an ID number consisting of 2 digits, a hyphen, and another 5 digits.

/<(.*)>.*<\/\1>/    "<(.*)>.*<\/\1>"    Matches an HTML tag.
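The three kinds of pattern listed above (blank line, ID number, HTML tag) can be tried out with the following JavaScript sketch. The sample strings are assumptions invented for this illustration.

```javascript
const blankLine = /^[ \t]*$/;      // only spaces and tabs between start and end
const idNumber = /\d{2}-\d{5}/;    // 2 digits, a hyphen, 5 digits
const htmlTag = /<(.*)>.*<\/\1>/;  // \1 must repeat the captured tag name

console.log(blankLine.test("  \t "));         // true
console.log(blankLine.test("  x "));          // false
console.log(idNumber.test("12-34567"));       // true
console.log(htmlTag.test("<b>bold</b>"));     // true
console.log(htmlTag.test("<b>bold</i>"));     // false, closing tag differs
```

Note that the ID pattern is not anchored, so it also succeeds on a longer string that merely contains such a number; adding ^ and $ would restrict it to the whole input.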


The following table contains the complete list of metacharacters and their behavior in the context of regular expressions:

Character Description

\ Marks the next character as a special character, a literal, a back reference, or an octal escape. For example, 'n' matches the character "n". '\n' matches a newline character. The sequence '\\' matches "\" and '\(' matches "(".

^ Matches the position at the beginning of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position following '\n' or '\r'.

$ Matches the position at the end of the input string. If the Multiline property of the RegExp object is set, $ also matches the position preceding '\n' or '\r'.

* Matches the preceding subexpression zero or more times. For example, 'zo*' matches "z" and "zoo". * is equivalent to {0,}.

+ Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}.

? Matches the preceding subexpression zero or one time. For example, 'do(es)?' matches the "do" in "do" or "does". ? is equivalent to {0,1}.

{n} n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the "o" in "Bob", but it matches the two o's in "food".

{n,} n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the "o" in "Bob", but it matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.

{n,m} m and n are non-negative integers, where n <= m. Matches at least n and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that you cannot put a space between the comma and the numbers.

? When this character immediately follows any of the other quantifiers (*, +, ?, {n}, {n,}, {n,m}), the matching pattern is non-greedy. A non-greedy pattern matches as little of the searched string as possible, whereas the default greedy pattern matches as much of the searched string as possible. For example, in the string "oooo", 'o+?' matches a single "o", while 'o+' matches all the o's.

. Matches any single character except "\n". To match any character including '\n', use a pattern such as '[\s\S]'.

(pattern) Matches pattern and captures the match. The captured match can be retrieved from the resulting Matches collection, using the SubMatches collection in VBScript or the $0...$9 properties in JScript. To match the parentheses characters themselves, use '\(' or '\)'.

(?:pattern) Matches pattern but does not capture the match, that is, it is a non-capturing match that is not stored for possible later use. This is useful for combining parts of a pattern with the "or" character (|). For example, 'industr(?:y|ies)' is a more economical expression than 'industry|industries'.

(?=pattern) Positive lookahead. Matches the search string at any point where a string matching pattern begins. This is a non-capturing match, that is, the match is not captured for possible later use. For example, 'Windows (?=95|98|NT|2000)' matches "Windows" in "Windows 2000" but not "Windows" in "Windows 3.1". Lookaheads do not consume characters, that is, after a match occurs, the search for the next match begins immediately following the last match, not after the characters that made up the lookahead.

(?!pattern) Negative lookahead. Matches the search string at any point where a string not matching pattern begins. This is a non-capturing match, that is, the match is not captured for possible later use. For example, 'Windows (?!95|98|NT|2000)' matches "Windows" in "Windows 3.1" but does not match "Windows" in "Windows 2000". Lookaheads do not consume characters, that is, after a match occurs, the search for the next match begins immediately following the last match, not after the characters that made up the lookahead.

x|y Matches either x or y. For example, 'z|food' matches "z" or "food". '(z|f)ood' matches "zood" or "food".

[xyz] A character set. Matches any one of the enclosed characters. For example, '[abc]' matches the 'a' in "plain".

[^xyz] A negative character set. Matches any character that is not enclosed. For example, '[^abc]' matches the 'p' in "plain".

[a-z] A range of characters. Matches any character in the specified range. For example, '[a-z]' matches any lowercase alphabetic character in the range 'a' through 'z'.

[^a-z] A negative range of characters. Matches any character that is not in the specified range. For example, '[^a-z]' matches any character that is not in the range 'a' through 'z'.

\b Matches a word boundary, that is, the position between a word and a space. For example, 'er\b' matches the 'er' in "never" but not the 'er' in "verb".

\B Matches a non-word boundary. 'er\B' matches the 'er' in "verb" but not the 'er' in "never".

\cx Matches the control character indicated by x. For example, \cM matches a Control-M, or carriage return, character. The value of x must be in the range A-Z or a-z. Otherwise, c is treated as a literal 'c' character.

\d Matches a digit character. Equivalent to [0-9].

\D Matches a non-digit character. Equivalent to [^0-9].

\f Matches a form-feed character. Equivalent to \x0c and \cL.

\n Matches a newline character. Equivalent to \x0a and \cJ.

\r Matches a carriage return character. Equivalent to \x0d and \cM.

\s Matches any white space character, including space, tab, form-feed, and so on. Equivalent to [ \f\n\r\t\v].

\S Matches any non-white-space character. Equivalent to [^ \f\n\r\t\v].

\t Matches a tab character. Equivalent to \x09 and \cI.

\v Matches a vertical tab character. Equivalent to \x0b and \cK.

\w Matches any word character, including the underscore. Equivalent to '[A-Za-z0-9_]'.

\W Matches any non-word character. Equivalent to '[^A-Za-z0-9_]'.

\xn Matches n, where n is a hexadecimal escape value. Hexadecimal escape values must be exactly two digits long. For example, '\x41' matches "A". '\x041' is equivalent to '\x04' followed by "1". This allows ASCII codes to be used in regular expressions.

\num Matches num, where num is a positive integer. A back reference to a captured match. For example, '(.)\1' matches two consecutive identical characters.

\n Identifies either an octal escape value or a back reference. If \n is preceded by at least n captured subexpressions, n is a back reference. Otherwise, n is an octal escape value if n is an octal digit (0-7).

\nm Identifies either an octal escape value or a back reference. If \nm is preceded by at least nm captured subexpressions, nm is a back reference. If \nm is preceded by at least n captured subexpressions, n is a back reference followed by the literal m. If neither of the preceding conditions holds, \nm matches the octal escape value nm when n and m are both octal digits (0-7).

\nml Matches the octal escape value nml when n is an octal digit (0-3) and m and l are both octal digits (0-7).

\un Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).
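A few of the table entries are easier to grasp when run. The following JavaScript sketch exercises greedy and non-greedy quantifiers, a lookahead, a non-capturing group, and a back reference. The sample strings are assumptions chosen for this illustration, and modern JavaScript engines may differ in detail from the JScript and VBScript engines the table describes.

```javascript
// Greedy versus non-greedy quantifiers.
const greedy = "oooo".match(/o+/)[0];  // takes as many o's as possible
const lazy = "oooo".match(/o+?/)[0];   // takes as few o's as possible

// Positive lookahead: "Windows " matches only when a listed version follows,
// and the version itself is not part of the match.
const look = /Windows (?=95|98|NT|2000)/;
const inWin2000 = look.test("Windows 2000");
const inWin31 = look.test("Windows 3.1");

// Non-capturing group combined with alternation.
const industry = /industr(?:y|ies)/;
const plural = industry.test("industries");

// Back reference: \1 repeats whatever group 1 captured.
const doubled = "letter".match(/(.)\1/)[0];

console.log(greedy, lazy, inWin2000, inWin31, plural, doubled);
```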

5. Building regular expressions

Regular expressions are constructed in the same way that arithmetic expressions are created. That is, small expressions are combined using a variety of metacharacters and operators to create larger expressions.

You construct a regular expression by putting the various components of the expression pattern between a pair of delimiters. For JScript, the delimiters are a pair of forward slash (/) characters. For example:

/expression/

For VBScript, a pair of quotation marks ("") is used to delimit the regular expression. For example:

"expression"

In both of the examples shown above, the regular expression pattern (expression) is stored in the Pattern property of the RegExp object.
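In present-day JavaScript the two delimiter styles correspond roughly to a regular expression literal and to a pattern string passed to the RegExp constructor. The sketch below assumes a modern JavaScript engine, where the pattern text is exposed through the source property rather than a Pattern property.

```javascript
// Literal form: the pattern sits between a pair of forward slashes.
const literal = /expression/;

// String form: the pattern is written between quotation marks and handed
// to the RegExp constructor.
const built = new RegExp("expression");

// Both objects hold the same pattern text.
console.log(literal.source, built.source);

// Inside a JavaScript string, a backslash must itself be escaped before it
// reaches the regular expression engine.
const digitsLiteral = /\d+/;
const digitsBuilt = new RegExp("\\d+");
console.log(digitsLiteral.test("42"), digitsBuilt.test("42"));
```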

<<------------------------------------------------------>>

6. Order of precedence
Once you have constructed a regular expression, it is evaluated much like an arithmetic expression, that is, from left to right and following an order of precedence.

The following table lists the regular expression operators in order of precedence, from highest to lowest:

Operator Description

\ Escape

(), (?:), (?=), [] Parentheses and brackets

*, +, ?, {n}, {n,}, {n,m} Quantifiers

^, $, \anymetacharacter Anchors and sequences

| Alternation
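The practical effect of this ordering shows up when alternation or quantifiers are mixed with other operators. The following JavaScript sketch uses made-up patterns to show how parentheses change the grouping.

```javascript
// Alternation has the lowest precedence, so /^a|b$/ means
// "starts with a" OR "ends with b".
const loose = /^a|b$/;
// Parentheses have higher precedence and group the alternatives first.
const grouped = /^(a|b)$/;

console.log(loose.test("axyz"));   // true
console.log(grouped.test("axyz")); // false

// A quantifier binds more tightly than a sequence: /ab*/ repeats only "b".
console.log(/^ab*$/.test("abbb"));   // true
console.log(/^ab*$/.test("abab"));   // false
console.log(/^(ab)*$/.test("abab")); // true
```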

<<---------------------------------------------------------->>

7. Ordinary characters
Ordinary characters consist of all printing and nonprinting characters that are not explicitly designated as metacharacters. This includes all uppercase and lowercase alphabetic characters, all digits, all punctuation marks, and some symbols.

The simplest form of a regular expression is a single ordinary character that matches itself in the searched string. For example, the single-character pattern 'A' matches the letter 'A' wherever it appears in the searched string. Here are some examples of single-character regular expression patterns:

/a/
/7/
/M/

The equivalent VBScript single-character regular expressions are:

"a"
"7"
"M"

You can combine a number of single characters to form a larger expression. For example, the following JScript regular expression is nothing more than an expression created by combining the single-character expressions 'a', '7', and 'M'.

/a7M/

The equivalent VBScript expression is:

"a7M"

Notice that there is no concatenation operator. All that is required is to put one character after another.
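As a small illustration of this juxtaposition rule, the JavaScript sketch below uses the characters a, 7, and M with sample strings that are assumptions made for the example. Matching is case-sensitive unless a flag says otherwise.

```javascript
// Each single-character pattern matches its own character anywhere in the text.
const hasA = /a/.test("banana");
const has7 = /7/.test("route 7");

// Characters written side by side form a longer pattern that must match
// them consecutively and in order.
const combined = /a7M/;
const adjacent = combined.test("xa7My");
const separated = combined.test("a 7 M");

console.log(hasA, has7, adjacent, separated);
```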

<<<<<<<<<<<<<<<<<<<<<<<<<<< <<<<<<<>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>

8. Special characters

There are a number of metacharacters that require special treatment when you try to match them. To match these special characters, you must first escape them, that is, precede them with a backslash character (\). The following table shows those special characters and their meanings:

Special Character Description

$ Matches the position at the end of the input string. If the Multiline property of the RegExp object is set, $ also matches the position preceding '\n' or '\r'. To match the $ character itself, use \$.

( ) Marks the beginning and end of a subexpression. Subexpressions can be captured for later use. To match these characters, use \( and \).

* Matches the preceding subexpression zero or more times. To match the * character, use \*.

+ Matches the preceding subexpression one or more times. To match the + character, use \+.

. Matches any single character except the newline character \n. To match ., use \.

[ Marks the beginning of a bracket expression. To match [, use \[.

? Matches the preceding subexpression zero or one time, or indicates a non-greedy quantifier. To match the ? character, use \?.

\ Marks the next character as a special character, a literal, a back reference, or an octal escape. For example, 'n' matches the character 'n'. '\n' matches a newline character. The sequence '\\' matches "\" and '\(' matches "(".

^ Matches the position at the beginning of the input string, except when used in a bracket expression, where it negates the character set. To match the ^ character itself, use \^.

{ Marks the beginning of a quantifier expression. To match {, use \{.

| Indicates a choice between two items. To match |, use \|.
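Escaping by hand is error-prone when the text to be matched comes from elsewhere, so a small helper is often written for it. The following JavaScript sketch assumes a hypothetical function named escapeRegExp that places a backslash before each special character in the table above; the price string is likewise invented for the example.

```javascript
// Hypothetical helper: prefix every special character from the table above
// with a backslash so that the text is matched literally.
function escapeRegExp(text) {
  return text.replace(/[$()*+.?[\\^{|]/g, "\\$&");
}

const price = "$5.00 (approx.)";
const literalPrice = new RegExp(escapeRegExp(price));

console.log(escapeRegExp(price));
console.log(literalPrice.test("Total: $5.00 (approx.)")); // true

// Without the backslash, the period matches any character.
console.log(/\$5\.00/.test("$5x00")); // false
console.log(/\$5.00/.test("$5x00"));  // true
```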


9. Nonprinting characters

There are a number of useful nonprinting characters that must be used occasionally. The following table shows the escape sequences used to represent those nonprinting characters:

Character Meaning

\cx Matches the control character indicated by x. For example, \cM matches a Control-M, or carriage return, character. The value of x must be in the range A-Z or a-z. Otherwise, c is treated as a literal 'c' character.

\f Matches a form-feed character. Equivalent to \x0c and \cL.

\n Matches a newline character. Equivalent to \x0a and \cJ.

\r Matches a carriage return character. Equivalent to \x0d and \cM.

\s Matches any white space character, including space, tab, form-feed, and so on. Equivalent to [ \f\n\r\t\v].

\S Matches any non-white-space character. Equivalent to [^ \f\n\r\t\v].

\t Matches a tab character. Equivalent to \x09 and \cI.

\v Matches a vertical tab character. Equivalent to \x0b and \cK.
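The equivalences in this table can be checked directly. The following JavaScript sketch, which assumes a modern JavaScript engine and uses invented sample strings, confirms that the different spellings of carriage return and tab match the same character and shows \s and \S at work.

```javascript
// \cM, \x0d, and \r all denote the carriage return character.
const crForms = [/\cM/, /\x0d/, /\r/].map((re) => re.test("\r"));

// \t, \x09, and \cI all denote the tab character.
const tabForms = [/\t/, /\x09/, /\cI/].map((re) => re.test("\t"));

// \s matches any white space character; \S matches everything else.
const words = "name\tage\r\ncity".split(/\s+/);
const visible = " a\tb\n".match(/\S/g);

console.log(crForms, tabForms, words, visible);
```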

<<<<<<<<<<<<<<<<<<<<<<<<<<< <<<<<<<<>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>

10. Character Matching

A period (.) matches any single printing or nonprinting character in a string, except a newline character (\n). The following JScript regular expression matches 'aac', 'abc', 'acc', 'adc', and so on, as well as 'a1c', 'a2c', 'a-c', and 'a#c':

/a.c/

The equivalent VBScript regular expression is:

"a.c"

If you are trying to match a string containing a file name, where a period (.) is part of the input string, you do so by preceding the period in the regular expression with a backslash (\) character. For example, the following JScript regular expression matches 'filename.ext':

/filename\.ext/

For VBScript, the equivalent expression appears as follows:

"filename\.ext"
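The difference between a bare period and an escaped one can be seen in the following JavaScript sketch. The test strings are assumptions chosen to show one matching and one non-matching case for each pattern.

```javascript
// A bare period stands for any single character except a newline.
const anyChar = /a.c/;
const samples = ["abc", "a1c", "a#c", "ac", "a\nc"];
const dotResults = samples.map((s) => anyChar.test(s));

// An escaped period matches only a literal period.
const escapedDot = /filename\.ext/;
const bareDot = /filename.ext/;

console.log(dotResults);
console.log(escapedDot.test("filename.ext"), escapedDot.test("filenameXext"));
console.log(bareDot.test("filenameXext"));
```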

These expressions are still quite limited. They only let you match any single character. Often it is more useful to match specified characters from a list. For example, if your input text contains chapter headings that are expressed numerically as Chapter 1, Chapter 2, and so on, you might want to find those chapter headings.


Bracket expressions

You can create a list of matching characters by placing one or more individual characters within square brackets ([ and ]). When characters are enclosed in brackets, the list is called a bracket expression. Within brackets, as anywhere else, ordinary characters represent themselves, that is, they match an occurrence of themselves in the input text. Most special characters lose their meaning when they occur inside a bracket expression. Here are some exceptions:

1. The ']' character ends a list if it is not the first item. To match the ']' character in a list, place it first, immediately following the opening '['.

2. The '\' character continues to be the escape character. To match the '\' character, use '\\'.

The characters enclosed in a bracket expression match only a single character, at the position in the regular expression where the bracket expression appears. The following JScript regular expression matches 'Chapter 1', 'Chapter 2', 'Chapter 3', 'Chapter 4', and 'Chapter 5':

/Chapter [12345]/

To match the same chapter headings in VBScript, use the following expression:

"Chapter [12345]"

Notice that the word 'Chapter' and the space that follows it are fixed in position relative to the characters within the brackets. The bracket expression is therefore used to specify only the set of characters that may occupy the single character position immediately following the word 'Chapter' and a space. That is the ninth character position.

If you want to express the matching characters using a range instead of the characters themselves, you can separate the beginning and ending characters of the range with the hyphen (-) character. The character value of each character determines its relative order within a range. The following JScript regular expression contains a range expression that is equivalent to the bracketed list shown above.

/Chapter [1-5]/

An expression with the same functionality in VBScript appears as follows:

"Chapter [1-5]"

When a range is specified in this manner, both the starting and ending values are included in the range. It is important to note that the starting value must precede the ending value in Unicode sort order.
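The equivalence of the list and the range, and the rule about range order, can be checked with the following JavaScript sketch. The sample titles are assumptions, and the error behavior shown for a reversed range is that of modern JavaScript engines, which may not be how JScript or VBScript report it.

```javascript
const list = /Chapter [12345]/;
const range = /Chapter [1-5]/;

const titles = ["Chapter 1", "Chapter 5", "Chapter 6", "Chapter A"];
const byList = titles.filter((t) => list.test(t));
const byRange = titles.filter((t) => range.test(t));
console.log(byList, byRange); // both contain the same two titles

// A range whose starting value follows its ending value is rejected
// when the pattern is compiled.
let reversedRejected = false;
try {
  new RegExp("Chapter [5-1]");
} catch (err) {
  reversedRejected = err instanceof SyntaxError;
}
console.log(reversedRejected);
```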


