Pattern matching with JavaScript regular expressions

Source: Internet
Author: User

A regular expression is an object that describes a pattern of characters. JavaScript's RegExp class represents regular expressions, and both String and RegExp define methods that use regular expressions for powerful pattern matching and for text search-and-replace operations. JavaScript's regular expression grammar is a large subset of the Perl 5 regular expression syntax, so programmers with Perl experience will find JavaScript regular expressions easy to learn.

This chapter first describes the regular expression syntax used to describe text patterns. It then explains the String and RegExp methods that use regular expressions.

1. Defining regular expressions

Regular expressions in JavaScript are represented by RegExp objects. You can create a RegExp object with the RegExp() constructor, but more often they are created with a special literal syntax. Just as a string literal is written as characters within quotation marks, a regular expression literal is written as characters within a pair of slashes (/). For example:

var pattern = /s$/;

This line creates a new RegExp object and assigns it to the variable pattern. This particular RegExp object matches any string that ends with the letter "s". The same regular expression can also be defined with the RegExp() constructor, like this:

var pattern = new RegExp("s$");
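As a quick check that the two forms really are interchangeable, the following sketch builds the pattern both ways and tries each against a sample string. The variable names and sample strings are illustrative only, and test() is a standard RegExp method that this part of the text has not yet introduced.

```javascript
// The same pattern, created two ways.
var fromLiteral = /s$/;
var fromConstructor = new RegExp("s$");

// Both objects hold the same pattern text, "s$".
var sameSource = fromLiteral.source === fromConstructor.source;

// test() reports whether the pattern matches anywhere in the string.
var endsWithS = fromLiteral.test("regular expressions");  // true: last character is "s"
var endsWithT = fromConstructor.test("JavaScript");       // false: last character is "t"
```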

RegExp literals and object creation

Literals of primitive types, such as strings and numbers, naturally evaluate to the same value each time they are encountered in a program. Object literals (initializer expressions) such as {} and [], on the other hand, create a new object each time they are evaluated. For example, if you write var a = [] in the body of a loop, each iteration creates a new empty array.

Regular expression literals are a special case. The ECMAScript 3 specification says that a regular expression literal is converted to a RegExp object when the code is parsed, so every evaluation of the same literal returns the same object. The ECMAScript 5 specification reverses this and requires that each evaluation of a regular expression literal return a new object. IE has always implemented the ECMAScript 5 behavior.
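The ECMAScript 5 behavior can be observed directly. In the sketch below, makePattern() is a hypothetical helper whose only job is to evaluate the same literal more than once; in an engine that follows ECMAScript 5, each call should yield a separate object.

```javascript
// Hypothetical helper: the literal /s$/ is evaluated on every call.
function makePattern() {
  return /s$/;
}

var first = makePattern();
var second = makePattern();

// Under ECMAScript 5 semantics these are two distinct objects...
var distinctObjects = first !== second;

// ...so state stored on one does not leak into the other.
first.lastIndex = 3;
var independentState = second.lastIndex === 0;
```

Under the ECMAScript 3 rule, both calls would have returned the same object, and the lastIndex change would have been visible through either variable.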

A regular expression pattern is a sequence of characters. Most characters, including all letters and digits, simply describe characters to be matched literally. For example, the regular expression /java/ matches any string that contains the substring "java". Other characters in a regular expression are not matched literally but have special meaning. For example, the regular expression /s$/ contains two characters: the first, "s", matches itself literally, while the second, "$", is a special metacharacter that matches the end of a string. This regular expression therefore matches any string whose last character is "s".

The following sections describe the various characters and metacharacters used in JavaScript regular expressions.

I. Literal characters

As noted above, all letters and digits in a regular expression match themselves literally. JavaScript regular expression syntax also supports certain non-alphabetic characters through escape sequences that begin with a backslash (\). For example, the sequence \n matches a newline character. The following table lists these escape sequences.

Literal characters in JavaScript regular expressions

Character             Matches
Letters and digits    Themselves
\0                    The NUL character (\u0000)
\t                    Tab (\u0009)
\n                    Line feed (\u000A)
\v                    Vertical tab (\u000B)
\f                    Form feed (\u000C)
\r                    Carriage return (\u000D)
\xnn                  The Latin character specified by the hexadecimal number nn; for example, \x0A is the same as \n
\uxxxx                The Unicode character specified by the hexadecimal number xxxx; for example, \u0009 is the same as \t
\cX                   The control character ^X; for example, \cJ is equivalent to the newline character \n
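The equivalences listed in the table can be spot-checked with a few one-line tests. This is only a sketch; the variable names are illustrative, and each call uses the standard RegExp test() method.

```javascript
// Each escape in the pattern should match the corresponding character.
var hexLineFeed = /\x0A/.test("\n");        // \x0A is the line feed
var unicodeTab = /\u0009/.test("\t");       // \u0009 is the tab
var controlLineFeed = /\cJ/.test("\n");     // \cJ is control-J, the line feed
var nul = /\0/.test("\u0000");              // \0 is the NUL character
var verticalTab = /\v/.test("\u000B");
var formFeed = /\f/.test("\u000C");
var carriageReturn = /\r/.test("\u000D");
```

All seven variables should end up true.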

In regular expressions, many punctuation characters have special meanings. They are:

^ $ . * + ? = ! : | \ / ( ) [ ] { }

The following sections explain the meanings of these characters. Some of them have special meaning only within certain contexts of a regular expression and are treated literally in other contexts. As a general rule, however, if you want to match any of these punctuation characters literally, you must prefix it with \. Other punctuation characters, such as @ and quotation marks, have no special meaning and simply match themselves literally.

If you cannot remember exactly which punctuation characters need to be escaped with a backslash, you may safely place a backslash before any punctuation character. On the other hand, many letters and digits take on a special meaning when preceded by a backslash, so do not escape letters or digits that you want to match literally. To match a backslash itself literally, you must of course escape it with another backslash. For example, the regular expression /\\/ matches any string that contains a backslash.
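The effect of escaping is easiest to see by comparing a pattern with and without the backslash. The sample strings below are made up for illustration.

```javascript
// An unescaped "." matches any character; "\." matches only a period.
var dotMatchesAnything = /a.c/.test("abc");    // true: "." matches "b"
var escapedDotMisses = /a\.c/.test("abc");     // false: there is no literal period
var escapedDotHits = /a\.c/.test("a.c");       // true

// Matching a backslash requires escaping it in the pattern. In the string
// literals below, "\\" is also how a single backslash is written.
var hasBackslash = /\\/.test("C:\\temp");      // true
var noBackslash = /\\/.test("C:/temp");        // false
```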

II. Character classes

Individual literal characters can be combined into a character class by placing them within square brackets. A character class matches any one character that it contains.

Thus, the regular expression /[abc]/ matches any one of the letters "a", "b", or "c".

Negated character classes can also be defined; they match any character except those contained within the brackets. A negated character class is specified by placing a caret (^) as the first character inside the left bracket. The regular expression /[^abc]/ matches any one character other than "a", "b", or "c".

Character classes can use a hyphen to indicate a range of characters. To match any one lowercase letter of the Latin alphabet, use /[a-z]/; to match any letter or digit of the Latin alphabet, use /[a-zA-Z0-9]/.
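A few sample calls illustrate plain, negated, and range classes. The inputs are arbitrary examples chosen for this sketch.

```javascript
// A class matches if any one character of the string belongs to it.
var hasAbc = /[abc]/.test("cup");             // true: "c" is in the class
var lacksAbc = /[abc]/.test("dog");           // false: no a, b, or c

// A negated class needs at least one character outside the listed set.
var onlyAbc = /[^abc]/.test("abc");           // false: every character is a, b, or c
var hasOther = /[^abc]/.test("abcd");         // true: "d" is outside the set

// Ranges are case-sensitive unless both cases are listed.
var hasLowercase = /[a-z]/.test("HTML5");         // false: no lowercase letters
var hasAlphanumeric = /[a-zA-Z0-9]/.test("--7--"); // true: "7" is a digit
```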

Because certain character classes are commonly used, JavaScript regular expression syntax includes special characters and escape sequences to represent them. For example, \s matches the space character, the tab character, and any other Unicode whitespace character; \S matches any character that is not Unicode whitespace. The table below lists these characters and summarizes character class syntax. (Note that several of these character class escape sequences match only ASCII characters and have not been extended to work with Unicode characters. You can, however, explicitly define your own Unicode character classes; for example, /[\u0400-\u04FF]/ matches any one Cyrillic character.)

Character classes for regular expressions

Character   Matches
[...]       Any one character between the brackets
[^...]      Any one character not between the brackets
.           Any character except newline or another Unicode line terminator
\w          Any ASCII word character, equivalent to [a-zA-Z0-9_]
\W          Any character that is not an ASCII word character, equivalent to [^a-zA-Z0-9_]
\s          Any Unicode whitespace character
\S          Any character that is not Unicode whitespace; note that \w and \S are not the same thing
\d          Any ASCII digit, equivalent to [0-9]
\D          Any character other than an ASCII digit, equivalent to [^0-9]
[\b]        A literal backspace (special case)

Note that these special character class escapes can be used within square brackets. Because \s matches any whitespace character and \d matches any digit, /[\s\d]/ matches any one whitespace character or digit.

There is one special case. As you will see later, the escape \b has a special meaning. When used within a character class, however, it represents the backspace character. Thus, to represent a backspace character literally in a regular expression, use the character class with one element: /[\b]/.
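The class escapes, and the two meanings of \b, can be tried out as follows. This is a small sketch with invented inputs; "\u0008" is the backspace character written as a string escape.

```javascript
// Class escapes can be combined inside brackets.
var spaceOrDigit = /[\s\d]/;
var hitsSpace = spaceOrDigit.test("a b");     // true: the space matches \s
var hitsDigit = spaceOrDigit.test("a1");      // true: "1" matches \d
var hitsNeither = spaceOrDigit.test("abc");   // false

// Uppercase escapes negate their lowercase counterparts.
var hasNonSpace = /\S/.test(" \t\n");         // false: nothing but whitespace
var hasNonDigit = /\D/.test("2024");          // false: nothing but digits
var underscoreIsWord = /\w/.test("_");        // true: \w includes the underscore

// Inside brackets, \b is a backspace; outside, it is a word boundary.
var backspaceInClass = /[\b]/.test("\u0008");  // true
var boundaryAlone = /\b/.test("\u0008");       // false: no word character, so no boundary
```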

III. Repetition

With the syntax covered so far, a two-digit number can be described as /\d\d/ and a four-digit number as /\d\d\d\d/. But there is not yet any way to describe a number with an arbitrary number of digits, or a string of three letters followed by an optional digit. These more complex patterns use regular expression syntax that specifies how many times an element of a regular expression may be repeated.

The characters that specify repetition always follow the pattern to which they are applied. Because certain types of repetition are quite common, there are special characters to represent these cases. For example, "+" matches one or more occurrences of the previous pattern. The following table summarizes the repetition syntax.

Repetition characters for regular expressions

Character   Meaning
{n,m}       Match the previous item at least n times but no more than m times
{n,}        Match the previous item n or more times
{n}         Match exactly n occurrences of the previous item
?           Match zero or one occurrences of the previous item; that is, the previous item is optional. Equivalent to {0,1}
+           Match one or more occurrences of the previous item. Equivalent to {1,}
*           Match zero or more occurrences of the previous item. Equivalent to {0,}

Here are some examples:

/\d{2,4}/      // Match between two and four digits
/\w{3}\d?/     // Match exactly three word characters and an optional digit
/\s+java\s+/   // Match "java" with one or more spaces before and after
/[^(]*/        // Match zero or more characters that are not an open parenthesis

Be careful when using "*" and "?". Because these characters may match zero instances of whatever precedes them, they are allowed to match nothing. For example, the regular expression /a*/ actually matches the string "bbbb", because the string contains zero occurrences of the letter a.
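To see what each of these patterns actually picks out, the sketch below applies them to some made-up strings. It relies on the standard String match() method, which returns an array whose first element is the matched text.

```javascript
var digits = "year 2024".match(/\d{2,4}/)[0];                 // "2024"
var wordAndDigit = "abc1".match(/\w{3}\d?/)[0];               // "abc1"
var wordOnly = "abcd".match(/\w{3}\d?/)[0];                   // "abc": the digit is optional
var spacedJava = "I like java a lot".match(/\s+java\s+/)[0];  // " java "
var beforeParen = "foo(bar)".match(/[^(]*/)[0];               // "foo"

// A pattern that can match zero characters succeeds even when nothing is there.
var matchesAnyway = /a*/.test("bbbb");           // true
var matchedText = "bbbb".match(/a*/)[0];         // "": zero occurrences of "a"
```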

Non-greedy repetition

The repetition characters listed in the preceding table match as many times as possible while still allowing any following parts of the regular expression to match. We say that this repetition is "greedy". It is also possible to specify that repetition should be done in a non-greedy way. Simply follow the repetition character or characters with a question mark: "??", "+?", "*?", or even "{1,5}?". For example, the regular expression /a+/ matches one or more occurrences of the letter a. When applied to the string "aaa", it matches all three letters. But /a+?/ matches one or more occurrences of the letter a while matching as few characters as necessary. When applied to the same string, this pattern matches only the first letter a.

Using non-greedy repetition may not always produce the results you expect. Consider the pattern /a+b/, which matches one or more a's followed by the letter b. When applied to the string "aaab", it matches the entire string. Now consider the non-greedy version, /a+?b/. This should match the letter b preceded by the fewest a's possible. When applied to the same string "aaab", you might expect it to match only one a and the final b. In fact, however, this pattern matches the entire string, just like the greedy version. This is because regular expression pattern matching is done by finding the first position in the string at which a match is possible. Since a match is possible starting at the first character of the string, shorter matches starting at later characters are never even considered.
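The four cases discussed above can be reproduced with match(), a standard String method whose first array element is the matched text. This is a minimal sketch using the same strings as the prose.

```javascript
// Greedy versus non-greedy on a run of a's.
var greedy = "aaa".match(/a+/)[0];       // "aaa": as many as possible
var lazy = "aaa".match(/a+?/)[0];        // "a": as few as possible

// With a trailing b, both versions start at the first a and must reach the b.
var greedyWithB = "aaab".match(/a+b/)[0];    // "aaab"
var lazyWithB = "aaab".match(/a+?b/)[0];     // "aaab", not "ab"
```

The last line is the surprising one: the non-greedy quantifier minimizes the repetition count from a fixed starting position, but it does not move the starting position later in the string.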

IV. Alternation, grouping, and references

The regular expression grammar includes special characters for specifying alternatives, grouping subexpressions, and referring to previous subexpressions. The "|" character separates alternatives. For example, /ab|cd|ef/ matches the string "ab" or the string "cd" or the string "ef". And /\d{3}|[a-z]{4}/ matches either three digits or four lowercase letters.

Note that alternatives are considered left to right until a match is found. If the left alternative matches, the right alternative is ignored, even if it would have produced a "better" match. Thus, when the pattern /a|ab/ is applied to the string "ab", it matches only the first letter.
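The left-to-right rule for alternatives is easy to confirm. The sketch below uses match(), a standard String method, on invented inputs; the reordered pattern /ab|a/ is added here purely for contrast.

```javascript
var anyPair = "xxcdxx".match(/ab|cd|ef/)[0];              // "cd"
var threeDigits = "12345".match(/\d{3}|[a-z]{4}/)[0];     // "123"
var fourLetters = "abcdef".match(/\d{3}|[a-z]{4}/)[0];    // "abcd"

// The first alternative that matches wins, even if a later one is longer.
var leftWins = "ab".match(/a|ab/)[0];        // "a"
var longerFirst = "ab".match(/ab|a/)[0];     // "ab": order of alternatives matters
```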

Parentheses have several purposes in regular expressions. One purpose is to group separate items into a single subexpression so that the items can be treated as a single unit by "|", "*", "+", "?", and so on. For example, /java(script)?/ matches "java" followed by the optional "script". And /(ab|cd)+|ef/ matches either the string "ef" or one or more repetitions of either of the strings "ab" or "cd".

Another purpose of parentheses is to define subpatterns within the complete pattern. When a regular expression is successfully matched against a target string, it is possible to extract the portions of the target string that matched any particular parenthesized subpattern (how to obtain these matched substrings is explained later in this chapter). For example, suppose you are looking for one or more lowercase letters followed by one or more digits. You might use the pattern /[a-z]+\d+/. But suppose you only really care about the digits at the end of each match. If you put that part of the pattern in parentheses (/[a-z]+(\d+)/), you can extract the digits from any matches you find, as explained later.
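As a preview of how the remembered text can be retrieved, the following sketch uses the standard String match() method: element 0 of the returned array is the whole match, and element 1 is the text matched by the first parenthesized subpattern. The input strings are illustrative.

```javascript
// Extract just the trailing digits from a letters-then-digits match.
var withDigits = "item42".match(/[a-z]+(\d+)/);
var wholeMatch = withDigits[0];      // "item42"
var digitsOnly = withDigits[1];      // "42"

// A group that is optional and does not take part is reported as undefined.
var withScript = "javascript".match(/java(script)?/);
var withoutScript = "java".match(/java(script)?/);
var optionalPart = withScript[1];    // "script"
var missingPart = withoutScript[1];  // undefined
```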

A related use of parenthesized subexpressions is to allow you to refer back to a subexpression later in the same regular expression. This is done by following a "\" character with a digit or digits. The digits refer to the position of the parenthesized subexpression within the regular expression. For example, \1 refers back to the first subexpression, and \3 refers to the third. Note that, because subexpressions can be nested within others, it is the position of the left parenthesis that is counted. In the following regular expression, for example, the nested subexpression ([Ss]cript) is referred to as \2:

/([Jj]ava([Ss]cript)?)\sis\s(fun\w*)/

A reference to a previous subexpression of a regular expression does not refer to the pattern for that subexpression but rather to the text that matched the pattern. Thus, references can be used to enforce a constraint that separate portions of a string contain exactly the same characters. For example, the following regular expression matches zero or more characters within single or double quotes. However, it does not require the opening and closing quotes to match (that is, both single quotes or both double quotes):

/['"][^'"]*['"]/

To require the quotes to match, use a reference:

/(['"])[^'"]*\1/

The \1 matches whatever the first parenthesized subexpression matched. In this example, it enforces the constraint that the closing quote match the opening quote. This regular expression does not allow single quotes within double-quoted strings or vice versa. It is not possible to use a reference within a character class, so the following does not work as intended:

/(['"])[^\1]*\1/
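The difference between the pattern without a reference and the one with \1 shows up when the quotes are mismatched. In this sketch the names loose and strict, and the sample strings, are illustrative only.

```javascript
var loose = /['"][^'"]*['"]/;      // any quote, then any quote
var strict = /(['"])[^'"]*\1/;     // closing quote must equal the opening one

var mismatched = "'mixed\"";       // opens with ' and closes with "

var looseAcceptsMismatch = loose.test(mismatched);     // true
var strictAcceptsMismatch = strict.test(mismatched);   // false

var strictDouble = strict.test('"ok"');    // true
var strictSingle = strict.test("'ok'");    // true
```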

It is also possible to group items in a regular expression without creating a numbered reference to those items. Instead of grouping the items within "(" and ")", begin the group with "(?:" and end it with ")". Consider the following pattern, for example:

/([Jj]ava(?:[Ss]cript)?)\sis\s(fun\w*)/

Here, the subexpression (?:[Ss]cript) is used simply for grouping, so that the "?" repetition character can be applied to the group. These modified parentheses do not produce a reference, so in this regular expression, \2 refers to the text matched by (fun\w*).
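The change in group numbering can be observed by matching the same sentence with both versions of the pattern. The sketch uses the standard String match() method, whose result array holds the whole match at index 0 and one entry per numbered group after that; the sample sentence is made up.

```javascript
var capturing = /([Jj]ava([Ss]cript)?)\sis\s(fun\w*)/;
var nonCapturing = /([Jj]ava(?:[Ss]cript)?)\sis\s(fun\w*)/;

var a = "JavaScript is fun".match(capturing);
var b = "JavaScript is fun".match(nonCapturing);

// With three capturing groups: a[1] "JavaScript", a[2] "Script", a[3] "fun".
var capturingGroups = a.length - 1;      // 3
var secondCapturing = a[2];              // "Script"

// With (?:...), only two groups are numbered: b[1] "JavaScript", b[2] "fun".
var nonCapturingGroups = b.length - 1;   // 2
var secondNonCapturing = b[2];           // "fun"
```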

The following table summarizes the regular expression alternation, grouping, and reference operators.

Alternation, grouping, and reference characters for regular expressions

Character   Meaning
|           Alternation: match either the subexpression to the left or the subexpression to the right
(...)       Grouping: group items into a single unit that can be used with "*", "+", "?", "|", and so on, and remember the characters that match this group for use with later references
(?:...)     Grouping only: group items into a single unit, but do not remember the characters that match this group
\n          Match the same characters that were matched when group number n was first matched. Groups are subexpressions within (possibly nested) parentheses. Group numbers are assigned by counting left parentheses from left to right. Groups formed with "(?:" are not numbered.

V. Specifying match position

As described earlier, many elements of a regular expression match a single character in a string. For example, \s matches a single whitespace character. Other regular expression elements match the positions between characters rather than actual characters. \b, for example, matches a word boundary: the boundary between a \w (ASCII word character) and a \W (non-word character), or the boundary between an ASCII word character and the beginning or end of a string. Elements such as \b do not specify any characters to be used in a matched string; what they do specify are legal positions at which a match can occur. Sometimes these elements are called regular expression anchors because they anchor the pattern to a specific position in the search string. The most commonly used anchor elements are ^, which ties the pattern to the beginning of the string, and $, which anchors the pattern to the end of the string.

For example, to match the word "javascript" on a line by itself, you can use the regular expression /^javascript$/. If you want to search for "java" as a word by itself (not as a prefix, as it is in "javascript"), you can try the pattern /\sjava\s/, which requires a space before and after the word. But there are two problems with this solution. First, it does not match "java" at the beginning or the end of a string, only when it appears with a space on either side. Second, when this pattern does find a match, the matched string it returns has leading and trailing spaces, which is not quite what is needed. So instead of matching actual space characters with \s, match (or anchor to) word boundaries with \b. The resulting expression is /\bjava\b/. The element \B anchors the match to a location that is not a word boundary. Thus, the pattern /\B[Ss]cript/ matches "JavaScript" and "postscript", but not "script" or "Scripting".
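The two problems with the space-based pattern, and the behavior of \b and \B, can be checked with a handful of calls. The variable names and sample sentences are invented for this sketch; match() is a standard String method whose first array element is the matched text.

```javascript
var spaced = /\sjava\s/;
var bounded = /\bjava\b/;

// Problem 1: the space-based pattern fails at the start of the string.
var spacedAtStart = spaced.test("java is fun");          // false
var boundedAtStart = bounded.test("java is fun");        // true
var boundedPrefix = bounded.test("javascript is fun");   // false: "java" runs into "s"

// Problem 2: the space-based pattern returns the spaces as part of the match.
var spacedText = "I like java a lot".match(spaced)[0];   // " java "
var boundedText = "I like java a lot".match(bounded)[0]; // "java"

// \B requires a position that is not a word boundary.
var nonBoundary = /\B[Ss]cript/;
var inJavaScript = nonBoundary.test("JavaScript");   // true
var inPostscript = nonBoundary.test("postscript");   // true
var bareScript = nonBoundary.test("script");         // false
var scripting = nonBoundary.test("Scripting");       // false
```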

Any regular expression can also be used as an anchor condition. If you include an expression within "(?=" and ")" characters, it is a lookahead assertion: it specifies that the enclosed characters must match, without actually being included in the match. For example, to match the name of a common programming language, but only if it is followed by a colon, you could use /[Jj]ava([Ss]cript)?(?=\:)/. This pattern matches the word "JavaScript" in "JavaScript: The Definitive Guide", but it does not match "Java" in "Java in a Nutshell", because it is not followed by a colon.
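The lookahead example can be run as is. The sketch below uses the standard String match() method to show that the colon is required but is not part of the matched text.

```javascript
var beforeColon = /[Jj]ava([Ss]cript)?(?=\:)/;

// The colon must follow the name, but it is not included in the match.
var matchedName = "JavaScript: The Definitive Guide".match(beforeColon)[0];  // "JavaScript"

// No colon follows "Java" here, so there is no match at all.
var nutshell = beforeColon.test("Java in a Nutshell");   // false
```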

Anchor characters in regular expressions

Character   Meaning
^           Match the beginning of the string and, in multiline searches, the beginning of a line
$           Match the end of the string and, in multiline searches, the end of a line
\b          Match a word boundary, that is, the position between a \w character and a \W character, or between a \w character and the beginning or end of a string (note, however, that [\b] matches a backspace)
\B          Match a position that is not a word boundary
(?=p)       A zero-width positive lookahead assertion: require that the following characters match the pattern p, but do not include those characters in the match
(?!p)       A zero-width negative lookahead assertion: require that the following characters do not match the pattern p


VI. Modifiers

There is one final element of regular expression grammar: modifiers (also called flags), which specify high-level pattern-matching rules. Unlike the rest of regular expression syntax, modifiers are specified outside the "/" characters: rather than appearing between the two slashes, they appear after the second slash. JavaScript as described here (ECMAScript 5) supports three modifiers. The "i" modifier specifies that matching should be case-insensitive. The "g" modifier specifies that matching should be global, that is, all matches within the searched string should be found. The "m" modifier performs matching in multiline mode. In this mode, if the string to be searched contains newlines, the ^ and $ anchors match the beginning and end of each line in addition to the beginning and end of the entire string. For example, the pattern /java$/im matches "java" as well as "Java\nis fun".

These modifiers may be specified in any combination. For example, to do a case-insensitive search for the first occurrence of the word "java" (or "Java", "JAVA", and so on), you can use the case-insensitive regular expression /\bjava\b/i. To find all occurrences of the word in a string, add the g modifier: /\bjava\b/gi.
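The effect of each modifier can be seen by running the same input with and without it. This sketch uses test() and match(), both standard methods; when the g modifier is present, match() returns an array of every matched substring. The sample text is invented.

```javascript
// "m" lets $ match at the end of each line, not only at the end of the string.
var withM = /java$/im.test("Java\nis fun");      // true: $ matches before the newline
var withoutM = /java$/i.test("Java\nis fun");    // false: the string ends with "fun"
var wholeString = /java$/im.test("java");        // true

// "i" ignores case; "g" finds every match rather than just the first.
var sample = "Java, java, JAVA";
var firstOnly = sample.match(/\bjava\b/i)[0];    // "Java"
var every = sample.match(/\bjava\b/gi);          // ["Java", "java", "JAVA"]
```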

2. String methods for pattern matching

So far, this chapter has discussed the grammar of regular expressions but has not shown how those regular expressions are actually used in JavaScript code. This section discusses the String methods that use regular expressions to perform pattern matching and search-and-replace operations. The sections that follow continue the discussion of pattern matching with JavaScript regular expressions by covering the RegExp object and its methods and properties.

String supports four methods that use regular expressions. The simplest is search(). This method takes a regular expression argument and returns either the character position of the start of the first matching substring or -1 if there is no match. For example, the following call returns 4:

"JavaScript".search(/script/i)

If the argument to search() is not a regular expression, it is first converted to one by passing it to the RegExp constructor. search() does not support global searches; it ignores the g modifier of its regular expression argument.
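The rules for search() described above can each be exercised with a one-line call. The inputs other than "JavaScript" are made up for this sketch.

```javascript
var found = "JavaScript".search(/script/i);     // 4: "Script" starts at index 4
var notFound = "JavaScript".search(/perl/);     // -1: no match

// A string argument is converted to a regular expression first,
// so "." here means "any character", not a literal period.
var fromString = "JavaScript".search("Script"); // 4
var dotAsPattern = "a.b".search(".");           // 0: "." matches the "a"

// The g modifier is ignored: only the first match position is returned.
var globalIgnored = "JavaScript".search(/a/g);  // 1
```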

The replace() method performs a search-and-replace operation. It takes a regular expression as its first argument and a replacement string as its second argument. It searches the string on which it is called for matches with the specified pattern. If the regular expression has the g modifier set, replace() replaces all matches in the string with the replacement string; otherwise, it replaces only the first match it finds. If the first argument to replace() is a string rather than a regular expression, the method searches for that string literally rather than converting it to a regular expression with the RegExp() constructor, as search() does. As an example, you can use replace() as follows to give the word "JavaScript" uniform capitalization throughout a string of text:

text.replace(/javascript/gi, "JavaScript");
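The following sketch contrasts the three cases: a global pattern, a non-global pattern, and a plain string as the first argument. The sample text is invented, and since strings are immutable, replace() returns a new string and leaves the original untouched.

```javascript
var text = "javascript and JAVASCRIPT and Javascript";

// With g, every match is replaced.
var all = text.replace(/javascript/gi, "JavaScript");
// "JavaScript and JavaScript and JavaScript"

// Without g, only the first match is replaced.
var firstOnly = text.replace(/javascript/i, "JavaScript");
// "JavaScript and JAVASCRIPT and Javascript"

// A string first argument is matched literally, and only once.
var literalDot = "a.b.c".replace(".", "-");   // "a-b.c"
```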

replace() is more powerful than this, however. Recall that parenthesized subexpressions of a regular expression are numbered from left to right and that the regular expression remembers the text that each subexpression matches. If a "$" followed by a digit appears in the replacement string, replace() replaces those two characters with the text that matches the specified subexpression. This is a very useful feature. You can use it, for example, to replace straight quotation marks in a string with curly quotation marks:

// A quote is a quotation mark, followed by any number of
// non-quotation-mark characters (which are remembered), followed
// by another quotation mark.
var quote = /"([^"]*)"/g;
// Replace the straight quotation marks with curly quotes,
// leaving the quoted text (stored in $1) unchanged.
text.replace(quote, '“$1”');
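Applied to a concrete string, the quote replacement behaves as follows. The sample sentence is an assumption made for this sketch, and the curly quotes are written as the Unicode escapes \u201C and \u201D so the code stays ASCII-only.

```javascript
var quote = /"([^"]*)"/g;
var text = 'He said "hello" and then "goodbye".';

// $1 in the replacement string stands for the text matched by the
// first parenthesized subexpression, i.e. the words between the quotes.
var curly = text.replace(quote, "\u201C$1\u201D");
// 'He said “hello” and then “goodbye”.'
```

Because the pattern carries the g modifier, both quoted phrases are rewritten in a single call.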


