Getting Started with Regular Expressions (Regex): Learning and Using Metacharacters (Special Characters)

Source: Internet
Author: User

What is a regular expression?
Regular expressions (English: Regular Expression, often abbreviated in code as regex, RegExp, or re) are a concept from computer science. A regular expression uses a single string to describe and match a set of strings that conform to a certain syntactic rule. Regular expressions are useful in almost every programming language. They can be divided into basic regular expressions, extended regular expressions, and advanced regular expressions. Basic regular expressions are commonly used in Linux shells, while the advanced regular expression syntax essentially evolved from Perl. Today's common programming languages (PHP, Perl, Python, Java, C#) all support advanced regular expressions.

Why do we have to learn regular expressions?
The regular expressions in high-level programming languages are almost all derived from Perl, so their syntax is nearly identical. Once you have learned regular expressions in one language, you can use them in almost every other. It is like SQL: once you know the syntax, it works almost unchanged across backends such as MySQL and MSSQL. This commonality is one reason to learn regular expressions. The other reason is that regular expressions are a powerful text-matching tool. Much text-matching work is really difficult to do without them. For example, to extract phone numbers from a string using plain string searches, we would need to write loops and conditional checks, which takes a lot of code and development time. With a regular expression, one line of code is enough. Another example is matching all paired HTML tags: done by hand, this means handling the nesting level by level, which is complex and may not be finished in a few hours. With a regular expression, it may take only a few minutes.

Regular expression string format
Now that we know why regular expressions matter and how widely they apply, we can look at their general format. A regular expression consists of a string of ordinary characters plus special characters (metacharacters). For example, to match "a string that starts with ab, followed by digits", write ab\d+, where ab are ordinary characters, \d stands for any digit 0-9, and + means the preceding element may appear one or more times. It really does look easy!
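A minimal sketch of this pattern in use. Python's re module is used here only as a convenient assumption (the pattern itself is not specific to Python), and the sample text and the helper name find_ab_numbers are made up for illustration.

```python
import re

# "ab" are ordinary characters; \d+ means one or more digits.
AB_NUMBER = re.compile(r"ab\d+")

def find_ab_numbers(text):
    """Return every substring made of 'ab' followed by one or more digits."""
    return AB_NUMBER.findall(text)

print(find_ab_numbers("ab123 abc ab7 xab42"))  # ['ab123', 'ab7', 'ab42']
```

Note that "abc" is skipped because + requires at least one digit after "ab".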

Basic, extended, and advanced regular expressions differ somewhat in the special characters they support. Many special characters can be combined to form new matching rules. We will not go too deep here: knowing the common metacharacters is generally enough to write most everyday regular expressions.

The following are common metacharacters of JavaScript regular expressions:

Character Description
\ Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character "n", while '\n' matches a newline character. The sequence '\\' matches "\" and '\(' matches "(".
^ Matches the start position of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position after '\n' or '\r'.
$ Matches the end position of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'.
* Matches the preceding subexpression zero or more times. For example, 'zo*' matches "z" and "zoo". * is equivalent to {0,}.
+ Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
? Matches the preceding subexpression zero or one time. For example, 'do(es)?' matches "do" and "does". ? is equivalent to {0,1}.
{n} n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the "o" in "Bob", but matches the two o's in "food".
{n,} n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the "o" in "Bob", but matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m} m and n are non-negative integers, where n <= m. Matches at least n and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no space between the comma and the numbers.
? When this character immediately follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), the matching pattern is non-greedy. A non-greedy pattern matches as little of the searched string as possible, while the default greedy pattern matches as much as possible. For example, in the string "oooo", 'o+?' matches a single "o", while 'o+' matches all the o's.
. Matches any single character except "\n". To match any character including '\n', use a pattern like '[\s\S]'.
(pattern) Matches pattern and captures the match. The captured match can be retrieved from the resulting Matches collection, using the SubMatches collection in VBScript or the $1…$9 properties in JScript. To match parenthesis characters, use '\(' or '\)'.
(?:pattern) Matches pattern but does not capture the match; that is, it is a non-capturing match and is not stored for later use. This is useful for combining parts of a pattern with the "or" character (|). For example, 'industr(?:y|ies)' is a more economical expression than 'industry|industries'.
(?=pattern) Positive lookahead. Matches the search string at any point where a string matching pattern begins. This is a non-capturing match; that is, the match is not stored for later use. For example, 'Windows (?=95|98|NT|2000)' matches "Windows " in "Windows 2000", but not in "Windows 3.1". Lookaheads do not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
(?!pattern) Negative lookahead. Matches the search string at any point where a string not matching pattern begins. This is a non-capturing match; that is, the match is not stored for later use. For example, 'Windows (?!95|98|NT|2000)' matches "Windows " in "Windows 3.1", but not in "Windows 2000". Lookaheads do not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
x|y Matches either x or y. For example, 'z|food' matches "z" or "food". '(z|f)ood' matches "zood" or "food".
[xyz] A character set. Matches any one of the enclosed characters. For example, '[abc]' matches the "a" in "plain".
[^xyz] A negated character set. Matches any character that is not enclosed. For example, '[^abc]' matches the "p" in "plain".
[a-z] A range of characters. Matches any character in the specified range. For example, '[a-z]' matches any lowercase alphabetic character in the range "a" to "z".
[^a-z] A negated range of characters. Matches any character that is not in the specified range. For example, '[^a-z]' matches any character that is not in the range "a" to "z".
\b Matches a word boundary, that is, the position between a word and a space. For example, 'er\b' matches the "er" in "never", but not the "er" in "verb".
\B Matches a non-word boundary. 'er\B' matches the "er" in "verb", but not the "er" in "never".
\cx Matches the control character indicated by x. For example, \cM matches a Control-M or carriage return character. The value of x must be in the range A-Z or a-z. Otherwise, c is treated as a literal "c" character.
\d Matches a digit character. Equivalent to [0-9].
\D Matches a non-digit character. Equivalent to [^0-9].
\f Matches a form feed character. Equivalent to \x0c and \cL.
\n Matches a newline (line feed) character. Equivalent to \x0a and \cJ.
\r Matches a carriage return character. Equivalent to \x0d and \cM.
\s Matches any whitespace character, including space, tab, form feed, and so on. Equivalent to [ \f\n\r\t\v].
\S Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t Matches a tab character. Equivalent to \x09 and \cI.
\v Matches a vertical tab character. Equivalent to \x0b and \cK.
\w Matches any word character, including the underscore. Equivalent to '[A-Za-z0-9_]'.
\W Matches any non-word character. Equivalent to '[^A-Za-z0-9_]'.
\xn Matches n, where n is a hexadecimal escape value. The hexadecimal escape value must be exactly two digits long. For example, '\x41' matches "A". '\x041' is equivalent to '\x04' followed by "1". This allows ASCII codes to be used in regular expressions.
\num Matches num, where num is a positive integer. A backreference to a captured match. For example, '(.)\1' matches two consecutive identical characters.
\n Identifies either an octal escape value or a backreference (here n is a digit, not the newline escape). \n is a backreference if it is preceded by at least n captured subexpressions. Otherwise, \n is an octal escape value if n is an octal digit (0-7).
\nm Identifies either an octal escape value or a backreference. If \nm is preceded by at least nm captured subexpressions, nm is a backreference. If \nm is preceded by at least n captured subexpressions, n is a backreference followed by the literal m. If neither condition holds, \nm matches the octal escape value nm when n and m are both octal digits (0-7).
\nml Matches the octal escape value nml when n is an octal digit (0-3) and m and l are octal digits (0-7).
\un Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).
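A few of the table's examples can be tried out directly. The sketch below uses Python's re module as an assumption of convenience; the table describes JavaScript, but the specific constructs used here (quantifiers, non-capturing groups, lookahead, \b and \B, backreferences) are written the same way in both. The variable names are illustrative only.

```python
import re

# Greedy vs. non-greedy: o+ takes every "o", o+? stops after the first one.
greedy = re.match(r"o+", "oooo").group()
lazy = re.match(r"o+?", "oooo").group()

# Non-capturing group combined with alternation.
words = re.findall(r"industr(?:y|ies)", "industry and industries")

# Lookahead checks the version number without consuming it.
known = re.search(r"Windows (?=95|98|NT|2000)", "Windows 2000")
not_known = re.search(r"Windows (?=95|98|NT|2000)", "Windows 3.1")
other = re.search(r"Windows (?!95|98|NT|2000)", "Windows 3.1")

# \b needs a word boundary after "er"; \B needs the opposite.
at_boundary = re.search(r"er\b", "never verb").start()
inside_word = re.search(r"er\B", "never verb").start()

# Backreference: \1 repeats whatever group 1 captured.
doubled = re.search(r"(.)\1", "food").group()

print(greedy, lazy)              # oooo o
print(words)                     # ['industry', 'industries']
print(repr(known.group()))       # 'Windows '
print(not_known)                 # None
print(repr(other.group()))       # 'Windows '
print(at_boundary, inside_word)  # 3 7
print(doubled)                   # oo
```

The lookahead results show that the matched text is only "Windows " (with the space): the version number decides whether the match succeeds but is not part of it.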

From the metacharacters above, we can see that many of them actually stand for a set of ordinary characters. As a result, there are often several regular expressions that match the same strings. For example, to match the digits 0-9 you can use [0-9], \d, or [0123456789]. All three work; all roads lead to Rome. So which regular expression is better, with higher performance and faster matching? In a test of 100,000 matching loops the differences were small, but \d was faster than [0-9], and [0-9] was faster than [0123456789]. In terms of conciseness, \d is also the simplest. So when writing expressions, prefer the metacharacters that represent character sets. They are compact and fast!
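The equivalence of the three spellings, and one rough way to time them, can be sketched as follows. This assumes Python's re and timeit modules; the article does not say which language or harness produced its measurements, and timings vary by engine and version, so no numbers are shown here. SAMPLE, digits_found, and time_pattern are hypothetical names for illustration.

```python
import re
import timeit

CANDIDATES = [r"\d", r"[0-9]", r"[0123456789]"]
SAMPLE = "order 66 shipped on day 7"

def digits_found(pattern):
    """Return every single-digit match of pattern in SAMPLE."""
    return re.findall(pattern, SAMPLE)

def time_pattern(pattern, loops=100_000):
    """Time `loops` searches of SAMPLE with a precompiled pattern."""
    compiled = re.compile(pattern)
    return timeit.timeit(lambda: compiled.search(SAMPLE), number=loops)

for candidate in CANDIDATES:
    # All three candidates find the same digits; only the timing may differ.
    print(candidate, digits_found(candidate), f"{time_pattern(candidate):.3f}s")
```

One caveat for Python specifically: on text strings, \d also matches non-ASCII decimal digits unless the re.ASCII flag is used, so it is only strictly equivalent to [0-9] on ASCII input such as the sample here.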

How do you write regular expressions?
To write a regular expression, start by analyzing the characteristics of the strings to be matched, then gradually add the metacharacters and ordinary characters that are needed. Matching proceeds from left to right.

For example: We want to match a mobile phone number.

1. Analyze the string's characteristics: a mobile phone number consists of digits, starts with 1, and is 11 digits long.

2. Start with 1\d, a 1 followed by a digit. This can also be written as 1[0-9].

3. The number is 11 digits long, so extend it to 1\d{10}: a 1 followed by ten more digits, 11 characters in total. This can also be written as 1[0-9]{10}. The {} quantifier specifies how many times the element to its left is repeated.

4. The whole string must be exactly 11 digits, so both ends need to be anchored. The final expression is: ^1\d{10}$
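The steps above can be checked with a short sketch. Python's re module is an assumption here, and is_mobile and the sample numbers are made up for illustration.

```python
import re

# A leading 1 followed by exactly ten more digits, anchored at both ends.
MOBILE = re.compile(r"^1\d{10}$")

def is_mobile(text):
    """Return True if text is an 11-digit number starting with 1."""
    return MOBILE.match(text) is not None

print(is_mobile("13800138000"))   # True
print(is_mobile("23800138000"))   # False: does not start with 1
print(is_mobile("1380013800"))    # False: only 10 digits
print(is_mobile("138001380001"))  # False: 12 digits
```

In Python, $ also matches just before a trailing newline, so a stricter check could use \Z or re.fullmatch instead.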

For example: we want to match a QQ number.

1. Analyze the characteristics of a QQ number: it consists of digits, has at least 5 digits, its first digit is not 0, and its maximum length is currently 11 digits.

2. Define the first character first: [1-9]\d means the first character is 1 to 9, followed by a digit.

3. The first digit is followed by 4 to 10 more digits: [1-9]\d{4,10}

4. The whole string must satisfy the rule above, so both ends are anchored and the expression is written as: ^[1-9]\d{4,10}$
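A sketch of the same check in Python (an assumption, as are the name is_qq and the sample values). It uses a closing $ so that overlong strings are rejected; the second pattern shows what happens when that anchor is left out.

```python
import re

# First digit 1-9, then 4 to 10 more digits: 5 to 11 digits in total.
QQ = re.compile(r"^[1-9]\d{4,10}$")
# Same pattern without the end anchor, for comparison.
QQ_NO_END = re.compile(r"^[1-9]\d{4,10}")

def is_qq(text):
    """Return True if text is a 5 to 11 digit number that does not start with 0."""
    return QQ.match(text) is not None

print(is_qq("10000"))         # True: shortest valid length
print(is_qq("1234"))          # False: too short
print(is_qq("012345"))        # False: starts with 0
print(is_qq("123456789012"))  # False: 12 digits

# Without $, the first 11 digits of an overlong string still match.
print(QQ_NO_END.match("123456789012").group())  # 12345678901
```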

For example: we want to match an IP address.

1. Analyze the structure of an IP address: it has four sections, each in the range 0-255, separated by ".".

2. First write a pattern for one 0-255 section. It breaks down into five cases: 0-9 (one digit), 10-99 (two digits), 100-199 (three digits), 200-249 (three digits starting with 2), and 250-255 (three digits starting with 25).

[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]

"|" means "or". It has the lowest precedence, so everything on its left and everything on its right, which can each be a combination of several metacharacters and ordinary characters, is treated as a whole.

3. A section followed by "." occurs three times, so the next attempt is:

[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]\.

Because the dot is a metacharacter, it needs to be escaped. But this is still not right. Since "|" has the lowest precedence, the final \. is combined only with the last alternative, as 25[0-5]\. What we want is any of the five cases, followed by a ".". The correct form is:

([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.

That is what is required. However, every pair of parentheses () creates a submatch, and its content appears in the match result. Here the parentheses were added only to control precedence, so the submatch content is not needed. It can be suppressed with ?:, and the result becomes:

(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.

4. One section is now matched, and it needs to be repeated three times. We can simply write the previous expression three times:

Method One: (?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.

Method Two: treat the section as a group and repeat it 3 times: ((?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}. Suppressing the submatch of the outer group as well, this becomes:

(?:(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}

This expression may look dizzying, but a long expression is really just built up bit by bit. Using a repetition count here simplifies the result considerably.

5. Finally, match the last 0-255 section.

(?:(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])

That is, add one more 0-255 match at the end. Then add the start and end anchors, and it becomes:

^(?:(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$
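The precedence problem from step 3 and the final anchored pattern can both be tried in a short sketch. Python's re module is an assumption, and SECTION, ungrouped, grouped, and is_ipv4 are illustrative names that do not come from the article.

```python
import re

# One 0-255 section, deliberately left ungrouped to show the precedence issue.
SECTION = r"[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]"

# Without a group, the trailing \. belongs only to the last alternative.
ungrouped = re.compile(SECTION + r"\.")
# With a non-capturing group, the \. follows whichever alternative matched.
grouped = re.compile(r"(?:" + SECTION + r")\.")

print(bool(ungrouped.fullmatch("7.")))  # False
print(bool(grouped.fullmatch("7.")))    # True

# A capturing group stores the section; a non-capturing group stores nothing.
print(re.match(r"(" + SECTION + r")\.", "192.").groups())  # ('192',)
print(grouped.match("192.").groups())                      # ()

# Full pattern: three "section + dot" groups, one last section, both anchors.
IPV4 = re.compile(r"^(?:(?:" + SECTION + r")\.){3}(?:" + SECTION + r")$")

def is_ipv4(text):
    """Return True if text is four dot-separated sections, each 0-255."""
    return IPV4.match(text) is not None

print(is_ipv4("192.168.1.100"))  # True
print(is_ipv4("256.1.1.1"))      # False: 256 is out of range
print(is_ipv4("1.2.3"))          # False: only three sections
```

Building the pattern from a single SECTION string is a design choice for readability; the resulting expression is the same as the one written out in full above.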

[Figure: reading all IP-format addresses out of a passage of text]

In that expression, (?=...) is a positive lookahead: the text on its left is matched only if the text that follows on its right satisfies the lookahead pattern.
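The figure's exact expression cannot be recovered from the text, so the following is only a sketch of the same idea in Python. The \b at the start, the (?=\D|$) lookahead at the end, the sample text, and the name find_ips are all assumptions chosen for illustration.

```python
import re

SECTION = r"(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])"

# \b keeps the match from starting in the middle of a longer number.
# (?=\D|$) requires a non-digit or the end of the text after the last
# section, without making that character part of the match.
IP_IN_TEXT = re.compile(r"\b(?:" + SECTION + r"\.){3}" + SECTION + r"(?=\D|$)")

def find_ips(text):
    """Return every IP-format address found in text."""
    return IP_IN_TEXT.findall(text)

sample = "Server 192.168.1.100 replied; gateway is 10.0.0.1, and 256.1.1.1 is invalid."
print(find_ips(sample))  # ['192.168.1.100', '10.0.0.1']
```

Without the lookahead, the last section would stop at its first successful alternative, so "192.168.1.100" would be reported as "192.168.1.1". The lookahead forces the engine to try the longer alternatives until the whole number has been taken.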

Having written all these examples, we can see that even this long expression grew out of a very simple one. That is not surprising: long regular expressions are derived from simple ones by adding to them step by step. Discussion and feedback are welcome!
