Regular expression parsing in JavaScript

Source: Internet
Author: User

  A regular expression is an object that describes a character pattern.

JavaScript's RegExp objects and string objects define methods that use regular expressions to perform powerful pattern matching and text retrieval and substitution functions.

In JavaScript, regular expressions are represented by RegExp objects. You can create one with the RegExp() constructor, or with the special literal syntax added in JavaScript 1.2. Just as a string literal is a sequence of characters enclosed in quotation marks, a regular expression literal is a sequence of characters enclosed between a pair of slashes (/). JavaScript code may therefore contain a line like this:

var pattern = /s$/;

This line of code creates a new RegExp object and assigns it to the variable pattern. This particular RegExp object matches any string that ends with the letter "s". You can define an equivalent regular expression with the RegExp() constructor, as follows:

var pattern = new RegExp("s$");

Whether you use a regular expression literal or the RegExp() constructor, creating a RegExp object is easy. The more difficult task is describing the desired pattern of characters in regular expression syntax. JavaScript's regular expression syntax is a fairly complete subset of Perl's.

A regular expression pattern is made up of a series of characters. Most characters, including all alphanumeric characters, simply match themselves literally. Thus the regular expression /java/ matches every string that contains the substring "java". Other characters in a regular expression are not matched literally but have special meanings. The regular expression /s$/ contains two characters.

The first, "s", matches itself literally. The second, "$", is a special character that matches the end of the string. So the regular expression /s$/ matches any string that ends with the letter "s".
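
The two ways of creating a pattern can be compared side by side. The following is a minimal sketch; the variable names and the sample strings are made up for illustration.

```javascript
// Two ways to build the same pattern: a literal and the RegExp() constructor.
const literalPattern = /s$/;
const constructedPattern = new RegExp("s$");

// Both describe "ends with the letter s".
const endsWithS = literalPattern.test("regular expressions");      // true
const noTrailingS = constructedPattern.test("regular expression"); // false

// /java/ matches any string that contains the substring "java".
const containsJava = /java/.test("I like javascript"); // true

console.log(endsWithS, noTrailingS, containsJava);
```

Both objects carry the same pattern text in their source property, so either form can be used wherever a RegExp is expected.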

   1. Literal characters

We have seen that all alphabetic characters and digits match themselves literally in regular expressions. JavaScript regular expressions also support certain non-alphabetic characters through escape sequences that begin with a backslash (\). For example, the sequence "\n" matches a literal newline character in a string. Many punctuation marks also have special meanings in regular expressions. The characters and their meanings are listed below:

Literal characters in regular expressions

Character               Matches
________________________________
Alphanumeric character  Itself
\f                      Form feed
\n                      Newline
\r                      Carriage return
\t                      Tab
\v                      Vertical tab
\/                      A literal /
\\                      A literal \
\.                      A literal .
\*                      A literal *
\+                      A literal +
\?                      A literal ?
\|                      A literal |
\(                      A literal (
\)                      A literal )
\[                      A literal [
\]                      A literal ]
\{                      A literal {
\}                      A literal }
\XXX                    The ASCII character specified by the octal number XXX
\xnn                    The ASCII character specified by the hexadecimal number nn
\cX                     The control character ^X. For example, \cI is equivalent to \t, and \cJ is equivalent to \n

___________________________________________________

To match one of these special punctuation characters literally in a regular expression, precede it with a backslash (\).
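
The effect of the backslash can be checked with a few short tests. This is an illustrative sketch only; the patterns and sample strings are not from the original text.

```javascript
// Unescaped, "." is a special character that matches almost any character.
const loose = /3.14/;
// Escaped, "\." matches only a literal period.
const strict = /3\.14/;

const looseOnOther = loose.test("3x14");   // true
const strictOnOther = strict.test("3x14"); // false
const strictOnDot = strict.test("3.14");   // true

// Escape sequences for non-printing characters.
const hasNewline = /\n/.test("line one\nline two"); // true
const tabViaControl = /\cI/.test("\t");             // true: \cI is the tab character
const hexA = /\x41/.test("A");                      // true: 0x41 is the code for "A"

console.log(looseOnOther, strictOnOther, strictOnDot, hasNewline, tabViaControl, hexA);
```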

   2. Character classes

You can combine individual literal characters into a character class by placing them within square brackets. A character class matches any one character that it contains, so the regular expression /[abc]/ matches any one of the letters "a", "b", and "c". You can also define negated character classes, which match any character except those contained within the brackets. To define a negated character class, use a ^ as the first character after the left bracket; for example, /[^abc]/ matches any one character other than "a", "b", and "c". A character class can also use a hyphen to indicate a range of characters, so any letter or digit can be matched with /[a-zA-Z0-9]/.

Because some character classes are very common, JavaScript's regular expression syntax includes special characters and escape sequences to represent them. For example, \s matches a space, a tab, or any other whitespace character, and \S matches any character that is not whitespace.

Character classes in regular expressions

Character  Matches
____________________________________________________
[...]      Any one character between the brackets
[^...]     Any one character not between the brackets
.          Any character except a line terminator such as \n or \r
\w         Any word character, equivalent to [a-zA-Z0-9_]
\W         Any non-word character, equivalent to [^a-zA-Z0-9_]
\s         Any whitespace character, including [ \t\n\r\f\v]
\S         Any character that is not whitespace
\d         Any digit, equivalent to [0-9]
\D         Any character other than a digit, equivalent to [^0-9]
[\b]       A literal backspace (special case)
________________________________________________________________
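
A few of the classes in the table can be tried out as follows. The sketch below uses invented variable names and sample strings.

```javascript
// A class matches any ONE of the characters it lists.
const vowel = /[aeiou]/;
// A negated class matches any one character that is NOT listed.
const nonDigit = /[^0-9]/;

const skyHasVowel = vowel.test("sky");           // false
const seaHasVowel = vowel.test("sea");           // true
const allDigits = !nonDigit.test("12345");       // true: no non-digit was found
const hasLetter = nonDigit.test("123a5");        // true

// Shorthand classes.
const underscoreIsWord = /\w/.test("_");         // true: \w includes the underscore
const dollarIsNonWord = /\W/.test("$");          // true
const hasWhitespace = /\s/.test("no_spaces");    // false

// Inside brackets, \b means backspace, not word boundary.
const backspace = /[\b]/.test("\b");             // true

console.log(skyHasVowel, seaHasVowel, allDigits, hasLetter);
```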

   3. Repetition

With the regular expression syntax covered so far, we can describe a two-digit number as /\d\d/ and a four-digit number as /\d\d\d\d/. But we do not yet have a way to describe a number with an arbitrary number of digits, or a string of three letters followed by an optional digit. These more complex patterns use regular expression syntax that specifies how many times an element of the expression may be repeated.

The characters that specify repetition always follow the pattern to which they apply. Because certain kinds of repetition are very common, there are special characters to represent them. For example, + matches one or more occurrences of the previous pattern. The table below lists the repetition syntax. First, some examples:

/\d{2,4}/ // Match between two and four digits.

/\w{3}\d?/ // Match exactly three word characters and an optional digit.

/\s+java\s+/ // Match "java" with one or more whitespace characters before and after it.

/[^"]*/ // Match zero or more characters that are not quotation marks.


Repetition characters in regular expressions

Character  Meaning
__________________________________________________________________
{n,m}      Match the previous item at least n times but no more than m times
{n,}       Match the previous item n or more times
{n}        Match the previous item exactly n times
?          Match the previous item 0 or 1 times; that is, the previous item is optional. Equivalent to {0,1}
+          Match the previous item 1 or more times. Equivalent to {1,}
*          Match the previous item 0 or more times. Equivalent to {0,}
___________________________________________________________________
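
The repetition syntax can be exercised with a short script. This is a sketch under some assumptions: the ^ and $ anchors are added here so that each test applies to the whole string, and the sample strings are invented.

```javascript
// Between two and four digits, and nothing else.
const twoToFourDigits = /^\d{2,4}$/;
const digitResults = ["7", "42", "2024", "12345"].map((s) => twoToFourDigits.test(s));
// digitResults is [false, true, true, false]

// Three word characters followed by an optional digit.
const optionalDigit = /^\w{3}\d?$/;
const optionalResults = ["abc", "abc1", "abc12"].map((s) => optionalDigit.test(s));
// optionalResults is [true, true, false]

// "java" with one or more whitespace characters on both sides.
const padded = /\s+java\s+/;
const paddedFound = padded.test("I  like java  a lot"); // true
const bareFound = padded.test("java");                  // false

// Note: a space inside the braces stops them from being a quantifier.
// Without the u flag, /\d{2, 4}/ looks for a digit followed by the literal text "{2, 4}".
const withSpace = /^\d{2, 4}$/;
const withSpaceOnDigits = withSpace.test("42"); // false

console.log(digitResults, optionalResults, paddedFound, bareFound, withSpaceOnDigits);
```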


   4. Alternation, grouping, and references

Regular expression syntax also includes special characters for specifying alternatives, grouping subexpressions, and referring to previous subexpressions. The | character separates alternatives. For example, /ab|cd|ef/ matches the string "ab", the string "cd", or the string "ef", and /\d{3}|[a-z]{4}/ matches either three digits or four lowercase letters. Parentheses have several purposes in regular expressions. Their main purpose is to group separate items into a single subexpression, so that the items can be treated as a single unit by *, +, ?, |, and so on. For example, /java(script)?/ matches "java" followed by an optional "script", and /(ab|cd)+|ef/ matches either the string "ef" or one or more repetitions of the strings "ab" or "cd".
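
The patterns above can be tried as follows. In this sketch the ^ and $ anchors and the extra outer parentheses are additions made here so that each test applies to the whole string; they are not part of the patterns in the text.

```javascript
// Alternation: any one of three two-letter strings.
const pairs = /^(ab|cd|ef)$/;
const pairResults = ["ab", "ef", "ac"].map((s) => pairs.test(s));
// pairResults is [true, true, false]

// Three digits or four lowercase letters.
const digitsOrLetters = /^(\d{3}|[a-z]{4})$/;
const mixedResults = ["123", "abcd", "12ab"].map((s) => digitsOrLetters.test(s));
// mixedResults is [true, true, false]

// Grouping lets ? apply to the whole word "script".
const javaOrJavaScript = /^java(script)?$/;
const javaResults = ["java", "javascript", "javascr"].map((s) => javaOrJavaScript.test(s));
// javaResults is [true, true, false]

// One or more repetitions of "ab" or "cd", or else exactly "ef".
const repeated = /^((ab|cd)+|ef)$/;
const repeatedResults = ["abcdab", "ef", "efef", "abef"].map((s) => repeated.test(s));
// repeatedResults is [true, true, false, false]

console.log(pairResults, mixedResults, javaResults, repeatedResults);
```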

In a regular expression, the second purpose of parentheses is to define subpatterns within the complete pattern. When a regular expression successfully matches a target string, you can extract the portions of the target string that matched any particular parenthesized subpattern. For example, suppose we are looking for one or more lowercase letters followed by one or more digits. We can use the pattern /[a-z]+\d+/. But suppose we really care only about the digits at the end of each match. If we put that part of the pattern in parentheses, /[a-z]+(\d+)/, we can extract the digits from any match we find and then process them.
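
Extracting the parenthesized part of a match might look like this. It is a minimal sketch that uses the standard String match method; the sample string and variable names are invented.

```javascript
// Lowercase letters followed by digits; the digits are captured in group 1.
const pattern = /[a-z]+(\d+)/;
const match = "order abc123 shipped".match(pattern);

// match[0] is the whole match, match[1] is the text captured by the parentheses.
const wholeMatch = match[0];                 // "abc123"
const digitsOnly = match[1];                 // "123"
const asNumber = parseInt(digitsOnly, 10);   // 123

console.log(wholeMatch, digitsOnly, asNumber);
```

When there is no match at all, match returns null, so real code would check the result before indexing into it.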

Another use of parenthesized subexpressions is to let us refer back to an earlier subexpression later in the same regular expression. This is done by following a \ character with one or more digits. The digits refer to the position of the parenthesized subexpression within the regular expression. For example, \1 refers to the first parenthesized subexpression and \3 refers to the third. Note that because subexpressions can be nested within others, it is the position of the left parenthesis that is counted.

For example, in the following regular expression, the nested subexpression ([Ss]cript) is referred to as \2:
/([Jj]ava([Ss]cript))\sis\s(fun\w*)/


A reference to an earlier subexpression of a regular expression does not refer to the pattern of that subexpression, but to the text that matched the pattern. So references are not merely a shortcut for retyping a repeated part of a regular expression; they also enforce a constraint that separate parts of a string contain exactly the same characters. For example, the following regular expression matches zero or more characters within single or double quotation marks. However, it does not require the opening and closing quotation marks to match (that is, both double quotes or both single quotes):

/['"][^'"]*['"]/

To require the opening and closing quotation marks to match, we can use a reference:

/(['"])[^'"]*\1/

The \1 matches whatever the first parenthesized subexpression matched. In this example, it enforces the constraint that the closing quotation mark match the opening one. Note that if a backslash is followed by a number greater than the number of parenthesized subexpressions, it is parsed as an octal escape sequence rather than as a reference. You can avoid the confusion by always writing the full three digits of an escape sequence, for example \044 instead of \44. The following are the alternation, grouping, and reference characters of regular expressions:

Character  Meaning
________________ ______________________
|          Alternation. Matches either the subexpression to the left of the symbol or the subexpression to its right
(...)      Grouping. Groups several items into a single unit that can be used with *, +, ?, and |, and remembers the characters that match the group for use in later references
\n         Matches the same characters that were matched by group number n. Groups are subexpressions (possibly nested) within parentheses. Group numbers are assigned by counting left parentheses from left to right
_________________ _____________________
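
The difference that the \1 reference makes, and the way groups are numbered, can be seen in a short test. In this sketch the ^ and $ anchors are added so that the whole string must match, and the sample strings are invented.

```javascript
// Any quote at either end: the two quotes do not have to agree.
const anyQuotes = /^['"][^'"]*['"]$/;
// \1 must repeat whatever quote character group 1 matched.
const matchingQuotes = /^(['"])[^'"]*\1$/;

const mixed = "'mixed\"";   // opens with a single quote, closes with a double quote
const anyAcceptsMixed = anyQuotes.test(mixed);            // true
const matchingAcceptsMixed = matchingQuotes.test(mixed);  // false
const acceptsDouble = matchingQuotes.test('"double"');    // true
const acceptsSingle = matchingQuotes.test("'single'");    // true

// Groups are numbered by the position of their left parenthesis.
const nested = /([Jj]ava([Ss]cript))\sis\s(fun\w*)/;
const groups = "JavaScript is fun".match(nested);
// groups[1] is "JavaScript", groups[2] is "Script", groups[3] is "fun"

console.log(anyAcceptsMixed, matchingAcceptsMixed, groups[1], groups[2], groups[3]);
```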

   5. Specifying a match position

We have seen that many elements of a regular expression match a single character in a string. For example, \s matches a single whitespace character. Other regular expression elements match the zero-width position between characters rather than actual characters. For example, \b matches a word boundary, that is, the boundary between a \w character and a \W character. Elements such as \b do not specify any characters of the matched string; they specify legal positions at which a match can occur. These elements are sometimes called regular expression anchors, because they anchor the pattern to a specific position in the searched string. The most commonly used anchors are ^, which ties the pattern to the beginning of the string, and $, which ties it to the end of the string.

For example, to match the word "JavaScript" on a line by itself, we can use the regular expression /^JavaScript$/. To search for "Java" as a word by itself (not as a prefix, as in "JavaScript"), we could try the pattern /\sJava\s/, which requires whitespace before and after the word. But this has two problems. First, if "Java" appears at the beginning or end of the string, the pattern does not match, because there is no whitespace on that side. Second, when the pattern does find a match, the matched string it returns has whitespace at both ends, which is not what we want. So instead of matching actual whitespace with \s, we match word boundaries with \b. The resulting expression is /\bJava\b/.

The following are the anchor characters of regular expressions:


Character  Meaning
____________________________________________________________________
^          Matches the beginning of the string and, in multiline searches, the beginning of a line
$          Matches the end of the string and, in multiline searches, the end of a line
\b         Matches a word boundary, that is, the position between a \w character and a \W character (note: [\b] matches backspace)
\B         Matches a position that is not a word boundary
_____________________________________________________________________
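
The two problems with /\sJava\s/ and the way \b avoids them can be checked directly. The sample sentences in this sketch are invented.

```javascript
const wholeString = /^JavaScript$/;
const spaced = /\sJava\s/;
const bounded = /\bJava\b/;

// ^ and $ require the pattern to span the entire string.
const exact = wholeString.test("JavaScript");          // true
const withExtra = wholeString.test("JavaScript rocks"); // false

// Problem 1: \s needs real whitespace, so "Java" at the start of the string is missed.
const spacedAtStart = spaced.test("Java is fun");      // false
const boundedAtStart = bounded.test("Java is fun");    // true

// Problem 2: the \s version returns the surrounding whitespace as part of the match.
const spacedText = "I like Java a lot".match(spaced)[0];   // " Java "
const boundedText = "I like Java a lot".match(bounded)[0]; // "Java"

// \b rejects "Java" when it is only a prefix of a longer word.
const prefixOnly = bounded.test("JavaScript is fun");  // false

// \B is the opposite: it matches only where there is NO word boundary.
const insideWord = /\Bscript/.test("javascript");      // true
const atWordStart = /\Bscript/.test("script");         // false

console.log(exact, withExtra, spacedAtStart, boundedAtStart, spacedText, boundedText);
```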

   6. Flags

The final element of regular expression syntax is the flag (also called an attribute), which specifies high-level pattern-matching rules. Unlike the rest of the syntax, flags are written outside the / characters: they appear after the second slash rather than between the two. JavaScript 1.2 supports two flags. The i flag specifies that matching should be case-insensitive. The g flag specifies that matching should be global, that is, that all matches in the searched string should be found. The two flags can be combined to perform a global, case-insensitive match.

For example, to do a case-insensitive search for the word "java" (or "Java", "JAVA", and so on), we can use the case-insensitive regular expression /\bjava\b/i. To find every occurrence of "java" in a string, we can add the g flag: /\bjava\b/gi.

The following are the regular expression flags:


Character  Meaning
_________________________________________
i          Perform case-insensitive matching
g          Perform a global match, that is, find all matches rather than stopping after the first
_________________________________________
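
The effect of each flag shows up clearly with the String match method. The sample sentence in this sketch is invented for illustration.

```javascript
const text = "Java, JAVA and java are one word; JavaScript is another.";

// i only: case is ignored, but the search stops at the first match.
const first = text.match(/\bjava\b/i);
const firstText = first[0];                  // "Java"

// g and i together: every occurrence, in any case.
const all = text.match(/\bjava\b/gi);        // ["Java", "JAVA", "java"]

// g only: every occurrence, but case must match exactly.
const exactCase = text.match(/\bjava\b/g);   // ["java"]

console.log(firstText, all, exactCase);
```

Note that "JavaScript" is not counted in any of the three searches, because \b requires a word boundary after "java".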

Apart from g and i, regular expressions in JavaScript 1.2 have no other flags. (Later versions of the language added an m flag for multiline matching.) In JavaScript 1.2, if you set the static multiline property of the RegExp constructor to true, pattern matching is done in multiline mode. In this mode, the anchors ^ and $ match not only the beginning and end of the searched string but also the beginning and end of each line within it. For example, the pattern /java$/ matches "java" but does not match "java\nis fun". If we set the multiline property, the latter is matched as well:

RegExp.multiline = true;
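
In engines that follow the ECMAScript standard, the same multiline behavior is normally requested per pattern with the m flag rather than through the static property. The following sketch assumes such an engine; the variable names are invented.

```javascript
const text = "java\nis fun";

// Without m, $ matches only at the very end of the string.
const singleLine = /java$/;
// With m, $ also matches at the end of each line.
const multiLine = /java$/m;

const plainOnWord = singleLine.test("java"); // true
const plainOnText = singleLine.test(text);   // false
const multiOnText = multiLine.test(text);    // true

// The same applies to ^ at the start of a line.
const lineStartPlain = /^is/.test(text);     // false
const lineStartMulti = /^is/m.test(text);    // true

console.log(plainOnWord, plainOnText, multiOnText, lineStartPlain, lineStartMulti);
```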

A regular expression (RegExp) object contains a regular expression pattern. It has properties and methods that use the pattern to match or replace a particular character (or set of characters) in a string. The properties of an individual regular expression are set through the constructor function. In addition, the predefined RegExp object has static properties that are set whenever any regular expression is used.

    • Creation:
      Literal syntax or the regular expression constructor
      Literal syntax: /pattern/flags
      Regular expression constructor: new RegExp("pattern"[, "flags"]);
    • Parameters:
      pattern -- the text of the regular expression
      flags -- if present, one of the following values:
      g: global match
      i: ignore case
      gi: both of the above

[Note] In the literal syntax the parameters are not quoted, while the constructor takes them as quoted strings. For example, /ab+c/i and new RegExp("ab+c", "i") are equivalent. In the constructor, a backslash must itself be escaped with another "\" because the pattern is written as a string. For example: re = new RegExp("\\w+")
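
The equivalence of the two forms, and what happens when the backslash is not doubled, can be verified as follows. This is an illustrative sketch; the variable names are invented.

```javascript
// The literal and the constructor call describe the same pattern and flags.
const literal = /ab+c/i;
const constructed = new RegExp("ab+c", "i");
const samePattern = literal.source === constructed.source; // true
const sameFlags = literal.flags === constructed.flags;     // true

// In a string, "\\w" is a backslash followed by w, which the constructor reads as \w.
const escaped = new RegExp("\\w+");
// With a single backslash, the string "\w+" is just "w+", so the pattern is /w+/.
const unescaped = new RegExp("\w+");

const escapedSource = escaped.source;       // "\\w+" (the two characters \w, then +)
const unescapedSource = unescaped.source;   // "w+"
const escapedMatches = escaped.test("abc");     // true
const unescapedMatches = unescaped.test("abc"); // false: there is no "w" in "abc"

console.log(samePattern, sameFlags, escapedSource, unescapedSource);
```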

Special characters in regular expressions

Character  Meaning
\          Either marks the next character as special: /b/ matches the character "b", but with a backslash in front, /\b/ matches a word boundary.
           Or marks the next special character as literal: "*" normally matches the preceding item 0 or more times, so /a*/ matches "a", "aa", "aaa", but with a backslash in front, /a\*/ matches only "a*".
^          Matches the beginning of the input or of a line; /^a/ matches the "a" in "an A" but does not match anything in "An a"
$          Matches the end of the input or of a line; /a$/ matches the "a" in "An a" but does not match anything in "an A"
*          Matches the preceding item 0 or more times; /ba*/ matches "b", "ba", "baa", "baaa"
+          Matches the preceding item 1 or more times; /ba+/ matches "ba", "baa", "baaa"
?          Matches the preceding item 0 or 1 times; /ba?/ matches "b", "ba"
(x)        Matches x and saves the match in the variables $1...$9
x|y        Matches x or y
{n}        Matches exactly n times
{n,}       Matches n or more times
{n,m}      Matches between n and m times
[xyz]      A character set; matches any one of the characters (or escape sequences) in the set
[^xyz]     Matches any one character that is not in the set
[\b]       Matches a backspace
\b         Matches a word boundary
\B         Matches a position that is not a word boundary
\cX        X is a control letter; /\cM/ matches Ctrl-M
\d         Matches a digit; /\d/ = /[0-9]/
\D         Matches a non-digit character; /\D/ = /[^0-9]/
\n         Matches a line feed
\r         Matches a carriage return
\s         Matches a whitespace character, including space, \n, \r, \f, \t, \v, and so on
\S         Matches a non-whitespace character
\t         Matches a tab
\v         Matches a vertical tab
\w         Matches a character that can form part of a word (a letter, a digit, or an underscore); /\w/ matches the "5" in "$5.98"; equivalent to [a-zA-Z0-9_]
\W         Matches a character that cannot form part of a word; /\W/ matches the "$" in "$5.98"; equivalent to [^a-zA-Z0-9_]

With the syntax covered, here are some examples of regular expressions in practical use:

E-mail address validation:
function test_email(strEmail) {
  // Name characters, "@", one or more dot-terminated domain labels, then a 2-3 character ending.
  var myReg = /^[_a-z0-9]+@([_a-z0-9]+\.)+[a-z0-9]{2,3}$/;
  return myReg.test(strEmail);
}
Masking of HTML code:
function mask_htmlcode(strInput) {
  // Replace the angle brackets around a tag name with HTML entities.
  // Without the g flag, only the first tag in the input is replaced.
  var myReg = /<(\w+)>/;
  return strInput.replace(myReg, "&lt;$1&gt;");
}

Properties and methods of regular expression objects
The predefined RegExp object has the following static properties: input, multiline, lastMatch, lastParen, leftContext, rightContext, and $1 through $9. Of these, input and multiline can be preset. The other properties are assigned values after the exec or test method runs, according to the result. Many of the properties have both a long name and a short (Perl-style) name, and both names refer to the same value. (JavaScript models its regular expressions on Perl's.)
Properties of regular expression objects
Property      Meaning
$1...$9       The substrings matched by the parenthesized groups, if any
$_            See input
$*            See multiline
$&            See lastMatch
$+            See lastParen
$`            See leftContext
$'            See rightContext
constructor   The function that creates the object's prototype
global        Whether to match throughout the entire string (boolean)
ignoreCase    Whether to ignore case when matching (boolean)
input         The string being matched against
lastIndex     The index at which the next match starts
lastMatch     The last matched characters
lastParen     The substring matched by the last parenthesized group
leftContext   The substring to the left of the most recent match
multiline     Whether to match across multiple lines (boolean)
prototype     Allows properties to be added to all objects
rightContext  The substring to the right of the most recent match
source        The text of the regular expression pattern

Methods of regular expression objects
Method     Meaning
compile    Compiles a regular expression
exec       Executes a search for a match
test       Tests for a match
toSource   Returns an object literal that represents the object and whose value can be used to create a new object. Overrides the Object.toSource method.
toString   Returns a string that represents the object. Overrides the Object.toString method.
valueOf    Returns the primitive value of the object. Overrides the Object.valueOf method.
Example
<script language="JavaScript">
var myReg = /(\w+)\s(\w+)/;
var str = "John Smith";
// $1 and $2 refer to the first and second parenthesized groups.
var newstr = str.replace(myReg, "$2, $1");
document.write(newstr);
</script>
This outputs "Smith, John".
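
The exec method and the lastIndex property work together when a pattern has the g flag: each call resumes where the previous match ended. The following sketch illustrates this under that assumption; the sample string and the shape of the found array are invented.

```javascript
// The replace example from above, run outside a browser.
const swapped = "John Smith".replace(/(\w+)\s(\w+)/, "$2, $1"); // "Smith, John"

// With the g flag, each exec call starts searching at lastIndex.
const digits = /\d+/g;
const input = "10 apples, 20 pears, 30 plums";
const found = [];
let m;
while ((m = digits.exec(input)) !== null) {
  // m[0] is the matched text, m.index is where it starts,
  // and digits.lastIndex is where the next search will begin.
  found.push({ text: m[0], index: m.index, next: digits.lastIndex });
}

// After exec returns null, lastIndex is reset to 0.
const finalLastIndex = digits.lastIndex;

console.log(swapped, found, finalLastIndex);
```

Because lastIndex is stored on the RegExp object itself, reusing one global pattern on different strings without resetting it can skip matches; creating a fresh pattern per search avoids that.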

