Parsing JS Regular Expressions

Source: Internet
Author: User
A regular expression is an object that describes a pattern of characters.

The RegExp and String objects in JavaScript define methods that use regular expressions to perform powerful pattern matching and text search-and-replace functions.

In JavaScript, regular expressions are represented by RegExp objects. You can create a RegExp object with the RegExp() constructor, or with a special literal syntax that was added in JavaScript 1.2. Just as a string literal is written as characters enclosed in quotation marks, a regular expression literal is written as characters enclosed between a pair of slashes. JavaScript code may therefore contain lines like this:

var pattern = /s$/;

This line creates a new RegExp object and assigns it to the variable pattern. This particular RegExp object matches any string that ends with the letter "s". The RegExp() constructor can define an equivalent regular expression:

var pattern = new RegExp("s$");

Creating a RegExp object is easy, whether with a literal or with the RegExp() constructor. The harder task is describing the desired pattern of characters in regular expression syntax. JavaScript uses a fairly complete subset of Perl's regular expression syntax.

A regular expression pattern is a series of characters. Most characters, including all letters and digits, simply describe characters to be matched literally. Thus the regular expression /Java/ matches any string that contains the substring "Java". Other characters are not matched literally but have special meanings. The regular expression /s$/ contains two characters.

The first character, "s", matches itself literally. The second, "$", is a special character that matches the end of a string. The regular expression /s$/ therefore matches any string that ends with the letter "s".
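As a quick check of the two creation forms described above, the sketch below builds the same pattern with a literal and with the constructor. The names literalPattern, constructedPattern, and endsWithS, and the sample strings, are illustrative assumptions.

```javascript
// Two equivalent ways to build a pattern that matches strings ending in "s".
const literalPattern = /s$/;
const constructedPattern = new RegExp("s$");

// test() returns true when the pattern matches somewhere in the string.
function endsWithS(text) {
  return literalPattern.test(text) && constructedPattern.test(text);
}

console.log(endsWithS("regular expressions")); // true
console.log(endsWithS("pattern"));             // false

// /Java/ matches any string that contains the substring "Java".
console.log(/Java/.test("I like JavaScript")); // true
```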

1. Literal characters

We have seen that all letters and digits in a regular expression match themselves literally. JavaScript regular expressions also support certain non-alphabetic characters through escape sequences that begin with a backslash. For example, the sequence "\n" matches a literal line feed in a string. Many punctuation characters have special meanings in regular expressions. The literal characters and their meanings are listed below:

Literal characters in regular expressions

Character        Matches
________________________________
Letter or digit  Itself
\f               Form feed
\n               Line feed
\r               Carriage return
\t               Tab
\v               Vertical tab
\/               A literal /
\\               A literal \
\.               A literal .
\*               A literal *
\+               A literal +
\?               A literal ?
\|               A literal |
\(               A literal (
\)               A literal )
\[               A literal [
\]               A literal ]
\{               A literal {
\}               A literal }
\XXX             The character specified by the octal number XXX
\xnn             The character specified by the hexadecimal number nn
\cX              The control character ^X. For example, \cI is equivalent to \t, and \cJ is equivalent to \n

___________________________________________________

To use one of these punctuation characters literally in a regular expression, put a "\" before it.
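The effect of escaping can be seen in a short sketch. The patterns and sample strings here are assumptions chosen for illustration; only standard RegExp behavior is used.

```javascript
// A backslash turns a metacharacter into a literal character.
const literalDot = /1\.5/; // matches the text "1.5" only
const anyChar = /1.5/;     // an unescaped "." matches any character except a line break

console.log(literalDot.test("1.5")); // true
console.log(literalDot.test("1x5")); // false
console.log(anyChar.test("1x5"));    // true

// Escape sequences stand for characters that are hard to type.
console.log(/\t/.test("a\tb"));    // true: tab
console.log(/\x41/.test("A"));     // true: hexadecimal 41 is "A"
console.log(/\cJ/.test("line\n")); // true: \cJ is the line feed, same as \n
```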

2. Character classes

Individual literal characters can be combined into a character class by placing them within square brackets. A character class matches any one character that it contains, so the regular expression /[abc]/ matches any one of the letters "a", "b", or "c". You can also define a negated character class, which matches any character except those contained within the brackets. A negated character class is specified by placing a ^ as the first character inside the left bracket; for example, /[^abc]/ matches any one character other than "a", "b", or "c". Character classes can use a hyphen to indicate a range of characters, so the class of all letters and digits is /[a-zA-Z0-9]/.

Because certain character classes are so commonly used, JavaScript's regular expression syntax includes special characters and escape sequences to represent them. For example, \s matches a space, a tab, or any other whitespace character, and \S matches any character that is not whitespace.

Regular expression character classes

Character   Matches
____________________________________________________
[...]       Any one character between the brackets
[^...]      Any one character not between the brackets
.           Any character except a line break; equivalent to [^\n]
\w          Any word character; equivalent to [a-zA-Z0-9_]
\W          Any non-word character; equivalent to [^a-zA-Z0-9_]
\s          Any whitespace character; equivalent to [ \t\n\r\f\v]
\S          Any non-whitespace character; equivalent to [^ \t\n\r\f\v]
\d          Any digit; equivalent to [0-9]
\D          Any character other than a digit; equivalent to [^0-9]
[\b]        A literal backspace (special case)
________________________________________________________________
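The classes in the table can be tried out with a small helper. firstMatch is a hypothetical name introduced for this sketch, and the sample strings are assumptions.

```javascript
// Hypothetical helper: returns the first substring of text that the
// pattern matches, or null when there is no match.
function firstMatch(pattern, text) {
  const result = text.match(pattern);
  return result ? result[0] : null;
}

console.log(firstMatch(/[abc]/, "xyzb"));       // "b": any one of a, b, c
console.log(firstMatch(/[^abc]/, "abcd"));      // "d": anything except a, b, c
console.log(firstMatch(/[a-zA-Z0-9]/, "-_7"));  // "7": first letter or digit
console.log(firstMatch(/\D/, "42!"));           // "!": first non-digit
console.log(firstMatch(/\w/, "$5.98"));         // "5": first word character
console.log(firstMatch(/\s/, "a b") === " ");   // true: the space is matched
```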

3. Repetition

With the regular expression syntax covered so far, you can describe a two-digit number as /\d\d/ and a four-digit number as /\d\d\d\d/. However, there is not yet any way to describe a number with an arbitrary number of digits, or a string of three letters followed by an optional digit. These more complex patterns use regular expression syntax that specifies how many times an element of the expression may be repeated.

A character that specifies repetition always follows the pattern to which it applies. Because certain kinds of repetition are so common, there are special characters to represent them. For example, "+" matches one or more occurrences of the preceding pattern. The table below summarizes the repetition syntax. First, some examples:

/\d{2,4}/       // Match between two and four digits.

/\w{3}\d?/      // Match exactly three word characters and an optional digit.

/\s+Java\s+/    // Match "Java" with one or more whitespace characters before and after it.

/[^"]*/         // Match zero or more characters that are not quotation marks.

Regular expression repetition characters

Character   Meaning
__________________________________________________________________
{n,m}       Match the previous item at least n times but no more than m times
{n,}        Match the previous item n or more times
{n}         Match the previous item exactly n times
?           Match the previous item zero or one time; that is, the previous item is optional. Equivalent to {0,1}
+           Match the previous item one or more times. Equivalent to {1,}
*           Match the previous item zero or more times. Equivalent to {0,}
___________________________________________________________________
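The repetition examples above can be run against sample input to see how much text each one consumes. matchedText is a hypothetical helper, and the sample strings are assumptions made for this sketch.

```javascript
// Hypothetical helper: returns the text matched by pattern, or null.
function matchedText(pattern, text) {
  const m = pattern.exec(text);
  return m ? m[0] : null;
}

console.log(matchedText(/\d{2,4}/, "year 12345"));     // "1234": at most four digits
console.log(matchedText(/\w{3}\d?/, "abc1"));          // "abc1"
console.log(matchedText(/\w{3}\d?/, "abcd"));          // "abc": the digit is optional
console.log(matchedText(/\s+Java\s+/, "I  Java  it")); // "  Java  ": whitespace included
console.log(matchedText(/[^"]*/, 'say "hi"'));         // "say ": stops at the first quote
```

Repetition is greedy by default, which is why /\d{2,4}/ takes four digits rather than two.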

4. Alternation, grouping, and references

Regular expression syntax also includes special characters for specifying alternatives, grouping subexpressions, and referring back to earlier subexpressions. The | character separates alternatives. For example, /ab|cd|ef/ matches the string "ab", the string "cd", or the string "ef", and /\d{3}|[a-z]{4}/ matches either three digits or four lowercase letters.

Parentheses have several purposes in regular expressions. The first is to group separate items into a single subexpression, so that the items can be treated as one unit by *, +, ?, or |. For example, /Java(Script)?/ matches "Java" followed by an optional "Script", and /(ab|cd)+|ef/ matches either the string "ef" or one or more repetitions of "ab" or "cd".

The second purpose of parentheses is to define subpatterns within the complete pattern. When a regular expression matches a target string, you can extract the portions of the target string that matched any particular parenthesized subpattern. For example, suppose we are looking for one or more lowercase letters followed by one or more digits. We might use the pattern /[a-z]+\d+/. But if we only really care about the digits at the end of each match, we can put that part of the pattern in parentheses, /[a-z]+(\d+)/, and then extract the digits from any match we find and parse them as a number.
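A minimal sketch of that extraction follows. trailingNumber is a hypothetical helper name; it relies on exec() returning an array whose element 1 holds the text matched by the first parenthesized group.

```javascript
// Hypothetical helper: returns the number that follows the first run of
// lowercase letters in text, or null if there is no such run.
function trailingNumber(text) {
  const m = /[a-z]+(\d+)/.exec(text);
  // m[0] is the whole match; m[1] is the text captured by (\d+).
  return m ? parseInt(m[1], 10) : null;
}

console.log(trailingNumber("item42"));    // 42
console.log(trailingNumber("no digits")); // null
```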

Another use of parenthesized subexpressions is to refer back to a subexpression later in the same regular expression. This is done by following a \ character with one or more digits. The digits refer to the position of the parenthesized subexpression within the regular expression. For example, \1 refers to the first parenthesized subexpression, and \3 refers to the third. Because subexpressions can be nested within one another, it is the position of the left parenthesis that is counted.

For example, in the following regular expression the nested subexpression ([Ss]cript) is referred to as \2:
/([Jj]ava([Ss]cript)?)\sis\s(fun\w*)/

A reference to an earlier subexpression does not refer to the pattern of that subexpression but to the text that matched it. References are therefore more than a shortcut for retyping a repeated part of a regular expression: they enforce a constraint that separate parts of a string contain exactly the same characters. For example, the following regular expression matches zero or more characters within single or double quotation marks. However, it does not require the opening and closing quotation marks to match (that is, both single or both double):

/['"][^'"]*['"]/

To require the quotation marks to match, we can use a reference:

/(['"])[^'"]*\1/

Here \1 matches whatever the first parenthesized subexpression matched. In this example, it enforces the constraint that the closing quotation mark match the opening one. Note that if the number following the backslash is greater than the number of parenthesized subexpressions, it is interpreted as an octal escape sequence rather than a reference. You can avoid confusion by always writing such escapes with the full three digits, for example \044 instead of \44. The alternation, grouping, and reference characters are summarized below:

Character   Meaning
______________________________________
|           Alternation. Match either the subexpression to the left or the subexpression to the right.
(...)       Grouping. Group items into a single unit that can be used with *, +, ?, |, and so on. Also remember the characters that match this group for use with later references.
\n          Match the same characters that were matched when group number n was first matched. Groups are subexpressions within (possibly nested) parentheses. Group numbers are assigned by counting left parentheses from left to right.
______________________________________
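The difference a back-reference makes, and the way nested groups are numbered, can be sketched as follows. The variable names and sample strings are assumptions; the nested pattern includes an optional "Script" part as an illustration.

```javascript
// Without a reference, the opening and closing quotes may differ.
const looseQuotes = /['"][^'"]*['"]/;
// With \1, the closing quote must be the same character as the opening one.
const matchedQuotes = /(['"])[^'"]*\1/;

console.log(looseQuotes.test("'mixed\""));   // true
console.log(matchedQuotes.test("'mixed\"")); // false
console.log(matchedQuotes.test('"same"'));   // true

// Alternation inside a repeated group.
console.log(/(ab|cd)+|ef/.exec("abcdab")[0]); // "abcdab"

// Groups are numbered by the position of their left parenthesis.
const nested = /([Jj]ava([Ss]cript)?)\sis\s(fun\w*)/.exec("JavaScript is funny");
console.log(nested[1]); // "JavaScript"
console.log(nested[2]); // "Script"
console.log(nested[3]); // "funny"
```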

5. Specifying match position

As we have seen, many elements of a regular expression match a single character in a string. For example, \s matches a single whitespace character. Other elements match the zero-width positions between characters instead of actual characters. For example, \b matches a word boundary, the boundary between a \w (word) character and a \W (non-word) character. Elements such as \b do not specify any characters to be used in the matched string; they specify legal positions at which a match can occur. These elements are sometimes called anchors because they anchor the pattern to a specific position in the search string. The most common anchors are ^, which ties the pattern to the beginning of the string, and $, which anchors it to the end.

For example, to match the word "JavaScript" on a line by itself, we can use the regular expression /^JavaScript$/. If we want to search for "Java" used as a word by itself (not as a prefix, as it is in "JavaScript"), we might try the pattern /\sJava\s/, which requires whitespace before and after the word. But this has two problems. First, if "Java" appears at the beginning or the end of a string, the pattern does not match unless there is whitespace at that end. Second, when the pattern does find a match, the matched string it returns has leading and trailing whitespace, which is not what we want. So instead of matching actual whitespace characters with \s, we match word boundaries with \b. The resulting expression is /\bJava\b/.

The following are the regular expression anchor characters:

Character   Meaning
____________________________________________________________________
^           Match the beginning of the string and, in multiline searches, the beginning of a line
$           Match the end of the string and, in multiline searches, the end of a line
\b          Match a word boundary, that is, the position between a \w character and a \W character (note that [\b] matches backspace)
\B          Match a position that is not a word boundary
_____________________________________________________________________
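The two problems with /\sJava\s/ and the fix with \b can be checked directly. spacedPattern and boundedPattern are names assumed for this sketch, as are the sample strings.

```javascript
const spacedPattern = /\sJava\s/;  // requires real whitespace on both sides
const boundedPattern = /\bJava\b/; // requires only word boundaries

// Problem 1: no whitespace before "Java" at the start of the string.
console.log(spacedPattern.test("Java is fun"));  // false
console.log(boundedPattern.test("Java is fun")); // true

// Problem 2: the whitespace becomes part of the matched text.
console.log("I like Java a lot".match(spacedPattern)[0].length);  // 6
console.log("I like Java a lot".match(boundedPattern)[0].length); // 4

// "Java" as a prefix of a longer word is not matched.
console.log(boundedPattern.test("JavaScript is fun")); // false

// ^ and $ tie the pattern to both ends of the string.
console.log(/^JavaScript$/.test("JavaScript"));       // true
console.log(/^JavaScript$/.test("JavaScript rocks")); // false
```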

6. Flags

Regular expression syntax has one final element: the flags (also called attributes) of a regular expression, which specify high-level pattern-matching rules. Unlike the rest of the syntax, flags are written outside the / characters: they do not appear between the two slashes but follow the second slash. JavaScript 1.2 supports two flags. The i flag specifies that pattern matching should be case-insensitive. The g flag specifies that pattern matching should be global, that is, that all matches within the searched string should be found. The two flags can be combined to perform a global, case-insensitive match.

For example, to do a case-insensitive search for the first occurrence of the word "Java" (or "java", "JAVA", and so on), we can use the case-insensitive regular expression /\bJava\b/i. To find all occurrences of "Java" in a string, we add the g flag: /\bJava\b/gi.

The following are the regular expression flags:

Character   Meaning
_________________________________________
i           Perform case-insensitive matching.
g           Perform a global match, that is, find all matches rather than stopping after the first.
_________________________________________
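A short sketch of the i and g flags in use follows. findJavaWords is a hypothetical helper and the sample sentence is an assumption; String.prototype.match with a g pattern returns an array of all matches, or null when there are none.

```javascript
// Hypothetical helper: returns every whole-word occurrence of "java",
// in any letter case, found in text.
function findJavaWords(text) {
  // g: find all matches; i: ignore case.
  return text.match(/\bjava\b/gi) || [];
}

const flagSample = "Java, java and JAVA, but not JavaScript";

console.log(/\bJava\b/.exec(flagSample)[0]);  // "Java": first case-sensitive match
console.log(/\bjava\b/i.exec(flagSample)[0]); // "Java": first match with case ignored
console.log(findJavaWords(flagSample).join(" ")); // "Java java JAVA"
```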

Apart from the g and i flags, JavaScript 1.2 regular expressions have no other flag-like features. However, if you set the static multiline property of the RegExp constructor to true, pattern matching is performed in multiline mode. In this mode, the anchors ^ and $ match not only the beginning and end of the search string but also the beginning and end of lines within it. For example, the pattern /Java$/ matches "Java" but does not match "Java\nis fun". If we set the multiline property, the latter is matched as well (later versions of the language express the same thing with the m flag, as in /Java$/m):

RegExp.multiline = true;
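The static RegExp.multiline property is a legacy feature. The sketch below assumes an engine that follows ECMAScript 3 or later, where multiline mode is requested per pattern with the m flag; the names singleLine, multiLine, and twoLines are illustrative.

```javascript
const twoLines = "Java\nis fun";

const singleLine = /Java$/; // "$" matches only at the end of the string
const multiLine = /Java$/m; // "$" also matches just before a line break

console.log(singleLine.test("Java"));   // true
console.log(singleLine.test(twoLines)); // false
console.log(multiLine.test(twoLines));  // true

// "^" behaves the same way at the start of each line.
console.log(/^is/.test(twoLines));  // false
console.log(/^is/m.test(twoLines)); // true
```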

A regular expression object contains a regular expression pattern. It has properties and methods that use that pattern to find or replace text in strings. In addition to the properties of an individual regular expression object, which you create with the RegExp constructor function, the predefined RegExp object has static properties that are set whenever any regular expression is used.

    • Creation:
      Use either a literal or the RegExp constructor.
      Literal: /pattern/flags
      Constructor: new RegExp("pattern"[, "flags"]);
    • Parameters:
      pattern -- the text of the regular expression
      flags -- if present, one of the following values:
      g: global match
      i: ignore case
      gi: both of the above

[Note] In the literal form the parameters are not enclosed in quotation marks, but in the constructor form they must be. For example, /ab+c/i and new RegExp("ab+c", "i") create equivalent regular expressions. In the constructor, the backslash itself must be escaped, because it is also the escape character of string literals (put an extra "\" before each "\"). For example, re = new RegExp("\\w+") is equivalent to re = /\w+/.
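The escaping rule in the note can be verified by comparing the source property of each pattern. The variable names here are assumptions made for the sketch.

```javascript
const fromLiteral = /\w+/;
const fromString = new RegExp("\\w+"); // "\\" in a string literal is one backslash
const unescaped = new RegExp("\w+");   // "\w" in a string literal is just "w"

console.log(fromLiteral.source); // \w+
console.log(fromString.source);  // \w+
console.log(unescaped.source);   // w+

// Flags are passed as the second argument of the constructor.
const literalWithFlag = /ab+c/i;
const constructedWithFlag = new RegExp("ab+c", "i");
console.log(literalWithFlag.test("ABBC"));     // true
console.log(constructedWithFlag.test("ABBC")); // true
```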

Special characters in regular expressions

Character   Meaning
\           For characters that are normally treated literally, indicates that the next character is special and is not to be interpreted literally. For example, /b/ matches the character "b", but with a backslash in front, /\b/ matches a word boundary.
            -or-
            For characters that are normally treated specially, indicates that the next character is to be interpreted literally. For example, * means match the preceding item zero or more times, so /a*/ matches "a", "aa", "aaa", and so on; with a backslash, /a\*/ matches only "a*".
^           Matches the beginning of the input or of a line. /^A/ matches the "A" in "An a" but not the "A" in "an A".
$           Matches the end of the input or of a line. /a$/ matches the "a" in "An a" but not the "a" in "an A".
*           Matches the preceding item zero or more times. /ba*/ matches "b", "ba", "baa", "baaa".
+           Matches the preceding item one or more times. /ba+/ matches "ba", "baa", "baaa".
?           Matches the preceding item zero or one time. /ba?/ matches "b", "ba".
(x)         Matches x and remembers the match in $1...$9.
x|y         Matches either x or y.
{n}         Matches exactly n occurrences of the preceding item.
{n,}        Matches at least n occurrences of the preceding item.
{n,m}       Matches at least n and at most m occurrences of the preceding item.
[xyz]       A character set; matches any one of the enclosed characters.
[^xyz]      A negated character set; matches any character that is not enclosed.
[\b]        Matches a backspace.
\b          Matches a word boundary.
\B          Matches a non-word boundary.
\cX         Where X is a control character letter; /\cM/ matches Ctrl-M.
\d          Matches a digit character. /\d/ is the same as /[0-9]/.
\D          Matches a non-digit character. /\D/ is the same as /[^0-9]/.
\n          Matches a line feed.
\r          Matches a carriage return.
\s          Matches a single whitespace character, including space, \n, \r, \f, \t, and \v.
\S          Matches a single non-whitespace character; the same as /[^ \n\f\r\t\v]/.
\t          Matches a tab.
\v          Matches a vertical tab.
\w          Matches a word character (a letter, digit, or underscore). /\w/ matches the "5" in "$5.98". The same as [a-zA-Z0-9_].
\W          Matches a non-word character. /\W/ matches the "$" in "$5.98". The same as [^a-zA-Z0-9_].

With the syntax covered, here are some practical applications of regular expressions:

Email address validation:
function test_email(strEmail) {
    // One or more of _, a-z, 0-9, then "@", then one or more dot-terminated
    // labels, then a final label of two or three characters.
    var myReg = /^[_a-z0-9]+@([_a-z0-9]+\.)+[a-z0-9]{2,3}$/;
    return myReg.test(strEmail);
}

HTML code masking:
function mask_htmlcode(strInput) {
    // Without the g flag, only the first <tag> is replaced.
    var myReg = /<(\w+)>/;
    return strInput.replace(myReg, "&lt;$1&gt;");
}
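The two helpers above can be exercised with a few inputs to see where their patterns are strict. The sketch re-declares the logic under the hypothetical names looksLikeEmail and maskAllTags so that it is self-contained; maskAllTags is a variant that adds the g flag, which is an assumption about what is usually wanted rather than part of the original code.

```javascript
// Same pattern as the email check above, wrapped in a hypothetical helper.
function looksLikeEmail(address) {
  return /^[_a-z0-9]+@([_a-z0-9]+\.)+[a-z0-9]{2,3}$/.test(address);
}

// Hypothetical variant of the masking helper: the g flag masks every
// opening tag instead of only the first one.
function maskAllTags(input) {
  return input.replace(/<(\w+)>/g, "&lt;$1&gt;");
}

console.log(looksLikeEmail("user_1@example.com")); // true
console.log(looksLikeEmail("User@example.com"));   // false: no i flag, so "U" is rejected
console.log(looksLikeEmail("a.b@example.com"));    // false: "." is not allowed before "@"

console.log(maskAllTags("<b>bold</b> <i>it</i>"));
// &lt;b&gt;bold</b> &lt;i&gt;it</i>
// Closing tags contain "/", which \w does not match, so they are left alone.
```

The pattern is deliberately simple, so it rejects many addresses that are valid in practice; it is best read as an illustration of the syntax rather than a complete validator.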

Properties and methods of regular expression objects

The predefined RegExp object has the following static properties: input, multiline, lastMatch, lastParen, leftContext, rightContext, and $1 through $9. The input and multiline properties can be preset. The values of the other properties are set after the exec or test method runs, according to the circumstances of the match. Many of the properties have both a long name and a short (Perl-style) name that refer to the same value. (JavaScript models its regular expressions on Perl's.)

Properties of regular expression objects

Property        Description
$1...$9         Parenthesized substring matches, if any
$_              See input
$*              See multiline
$&              See lastMatch
$+              See lastParen
$`              See leftContext
$'              See rightContext
constructor     The function that creates an object's prototype
global          Whether to match throughout the entire string (boolean)
ignoreCase      Whether to ignore case when matching (boolean)
input           The string against which the pattern is matched
lastIndex       The index at which to start the next match
lastMatch       The last matched characters
lastParen       The last parenthesized substring match, if any
leftContext     The substring preceding the most recent match
multiline       Whether to search across multiple lines (boolean)
prototype       Allows properties to be added to all objects
rightContext    The substring following the most recent match
source          The text of the pattern


Methods of regular expression objects

Method      Description
compile     Compiles a regular expression object
exec        Executes a search for a match in its string parameter
test        Tests for a match in its string parameter
toSource    Returns an object literal representing the specified object; this value can be used to create a new object. Overrides the Object.toSource method.
toString    Returns a string representing the specified object. Overrides the Object.toString method.
valueOf     Returns the primitive value of the specified object. Overrides the Object.valueOf method.
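As an illustration of exec together with the lastIndex, global, and source properties, the sketch below walks through every match in a string. findNumbers is a hypothetical helper and the sample input is an assumption.

```javascript
// Hypothetical helper: collects every run of digits in text, together with
// where it starts and where the next search will resume.
function findNumbers(text) {
  const pattern = /\d+/g; // with g, exec() resumes from pattern.lastIndex
  const found = [];
  let m;
  while ((m = pattern.exec(text)) !== null) {
    found.push({ value: m[0], index: m.index, next: pattern.lastIndex });
  }
  return found;
}

for (const hit of findNumbers("a1 b22 c333")) {
  console.log(hit.value, hit.index, hit.next);
}
// 1 1 2
// 22 4 6
// 333 8 11

const inspected = /\d+/g;
console.log(inspected.source);     // \d+
console.log(inspected.global);     // true
console.log(inspected.ignoreCase); // false
```

When exec() finally returns null it also resets lastIndex to 0, which is why the loop terminates cleanly and the pattern can be reused.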

Example

Output: "Smith, John"
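The code for this example does not appear in the text. One way to produce that output, offered as a guess rather than as the original listing, is to capture two words and swap them with $1 and $2 in the replacement string. swapNames is a hypothetical name.

```javascript
// Hypothetical helper: turns "First Last" into "Last, First".
function swapNames(fullName) {
  // $1 and $2 in the replacement refer to the first and second groups.
  return fullName.replace(/(\w+)\s(\w+)/, "$2, $1");
}

console.log(swapNames("John Smith")); // Smith, John
```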
