\w cannot match Chinese characters in regular expressions

Source: Internet
Author: User
Tags: character classes, alphanumeric characters

Regular expressions are used for string processing, form validation, and similar tasks. They are practical and efficient, but I am never quite sure of the syntax when I use them, so I often end up looking things up online. I am collecting some frequently used expressions here as a memo. This post is updated from time to time.
Regular expression matching Chinese characters: [\u4e00-\u9fa5]
Regular expression matching double-byte characters (including Chinese characters): [^\x00-\xff]
Application: calculate the length of a string (a double-byte character counts as 2, an ASCII character counts as 1)
String.prototype.len = function () { return this.replace(/[^\x00-\xff]/g, "aa").length; }
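As a rough sketch of the point in the title, the snippet below checks that \w does not match a Chinese character while the explicit range does, and then applies the length rule from above. The name displayLength is my own (hypothetical), and the function is written as a plain helper rather than a String.prototype extension.

```javascript
// Hypothetical helper (not from the original post): counts every
// character outside \x00-\xff as 2 and every other character as 1.
function displayLength(str) {
  return str.replace(/[^\x00-\xff]/g, "aa").length;
}

// "\u4e2d" is the Chinese character zhong, "\u6587" is wen.
// \w only covers [A-Za-z0-9_], so it does not match a Chinese character.
const wordMatchesChinese = /\w/.test("\u4e2d");
// The explicit Unicode range does match it.
const rangeMatchesChinese = /[\u4e00-\u9fa5]/.test("\u4e2d");

console.log(wordMatchesChinese);               // false
console.log(rangeMatchesChinese);              // true
console.log(displayLength("abc\u4e2d\u6587")); // 7 (3 + 2 * 2)
```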
Regular expression matching empty lines: \n[\s| ]*\r
Regular expression matching HTML tags: /<(.*)>.*<\/\1>|<(.*) \/>/
Regular expression matching leading and trailing whitespace: (^\s*)|(\s*$)
String.prototype.trim = function ()
{
    return this.replace(/(^\s*)|(\s*$)/g, "");
}
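A small sketch of the same whitespace expression used as a standalone function, compared against the built-in String.prototype.trim that ES5 and later engines already provide. The name trimBoth is hypothetical.

```javascript
// Hypothetical helper: same regular expression as above, written as a
// plain function instead of a String.prototype extension.
function trimBoth(str) {
  return str.replace(/(^\s*)|(\s*$)/g, "");
}

const padded = "  hello world \t\n";
const trimmed = trimBoth(padded);

console.log(JSON.stringify(trimmed));   // "hello world"
// The built-in trim gives the same result for this input.
console.log(trimmed === padded.trim()); // true
```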
Use regular expressions to parse and convert IP addresses:
The following JavaScript function uses a regular expression to match an IP address and convert it to the corresponding numeric value:
function IP2V(ip)
{
    // Regular expression matching a dotted-quad IP address
    var re = /^(\d+)\.(\d+)\.(\d+)\.(\d+)$/;
    var m = re.exec(ip);
    if (m)
    {
        // Each octet is one base-256 digit of the numeric value
        return m[1] * Math.pow(256, 3) + m[2] * Math.pow(256, 2) + m[3] * 256 + m[4] * 1;
    }
    else
    {
        throw new Error("Not a valid IP address!");
    }
}
However, this task may be easier without regular expressions, by using the split function directly:
var ip = "10.100.20.168"
ip = ip.split(".")
// Each octet is one base-256 digit of the numeric value
alert("The IP value is: " + (ip[0] * 256 * 256 * 256 + ip[1] * 256 * 256 + ip[2] * 256 + ip[3] * 1))
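A minimal sketch of the same conversion with range checking, assuming the intended result is the usual 32-bit value in which each octet is a base-256 digit. The name ipToNumber and the sample address are my own choices for illustration.

```javascript
// Hypothetical helper: converts a dotted-quad IPv4 string to its
// numeric value, treating each octet as a base-256 digit.
function ipToNumber(ip) {
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(ip);
  if (!m) {
    throw new Error("Not a valid IP address: " + ip);
  }
  const octets = m.slice(1).map(Number);
  if (octets.some((o) => o > 255)) {
    throw new Error("Octet out of range in: " + ip);
  }
  // Fold the four octets into one number, most significant first.
  return octets.reduce((acc, o) => acc * 256 + o, 0);
}

console.log(ipToNumber("10.100.20.168")); // 174331048
```

The regular expression only checks the shape of the address; the numeric range of each octet is checked separately because a pattern such as \d{1,3} also accepts values like 300.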
Regular expression matching an email address: \w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*
Regular expression matching a URL: http://([\w-]+\.)+[\w-]+(/[\w-./?%&=]*)?
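Here is a short sketch of how the two expressions might be used for validation. I have added ^ and $ anchors so that the whole string must match, and escaped the slashes and the hyphen for use in a JavaScript regex literal; the sample addresses are made up.

```javascript
// The two patterns above, anchored so the entire input must match.
const emailRe = /^\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$/;
const urlRe = /^http:\/\/([\w-]+\.)+[\w-]+(\/[\w\-.\/?%&=]*)?$/;

console.log(emailRe.test("first.last+tag@example.com"));              // true
console.log(emailRe.test("not-an-email"));                            // false
console.log(urlRe.test("http://www.example.com/path/page.htm?id=1")); // true
console.log(urlRe.test("ftp://example.com"));                         // false
```

Note that the URL pattern as written only accepts the http scheme, so an https address would be rejected unless the pattern is extended.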
A program that uses regular expressions to remove repeated characters from a string: [Note: This program is incorrect. For the reason, see this post.]
var s = "abacabefgeeii"
var s1 = s.replace(/(.).*\1/g, "$1")
var re = new RegExp("[" + s1 + "]", "g")
var s2 = s.replace(re, "")
alert(s1 + s2) // The result is: abeicfg (each character once, but not in the original order)
I once posted on CSDN asking for an expression that removes repeated characters, but never found one. This is the simplest implementation I could think of. The idea is to use a backreference to extract the repeated characters, build a second expression from them to obtain the non-repeated characters, and then concatenate the two. This method may not be suitable for strings where character order matters.
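For comparison, here are two alternative sketches that do keep a predictable order. Neither is the approach described above: the first uses a Set rather than a regular expression, and the second uses a lookahead with a backreference. Both function names are hypothetical.

```javascript
// Keeps the first occurrence of each character, in original order.
function uniqueChars(str) {
  return Array.from(new Set(str)).join("");
}

// Regex-only variant: deletes every character that occurs again later,
// so the LAST occurrence of each character is the one that survives.
// Note that "." does not match line terminators.
function uniqueCharsKeepLast(str) {
  return str.replace(/(.)(?=.*\1)/g, "");
}

console.log(uniqueChars("abacabefgeeii"));         // abcefgi
console.log(uniqueCharsKeepLast("abacabefgeeii")); // cabfgei
```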
A JavaScript program that uses a regular expression to extract the file name from a URL. The result below is page1.
s = "http://www.9499.net/page1.htm"
s = s.replace(/(.*\/){0,}([^\.]+).*/ig, "$2")
alert(s)
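As an alternative sketch, the same result can be obtained without a regular expression by using the standard URL class, which is available in current browsers and in Node.js. The name baseName is hypothetical, and unlike the expression above this version strips only the last extension.

```javascript
// Hypothetical helper: returns the last path segment of a URL without
// its final extension.
function baseName(urlString) {
  const path = new URL(urlString).pathname;           // e.g. "/page1.htm"
  const file = path.slice(path.lastIndexOf("/") + 1); // e.g. "page1.htm"
  const dot = file.lastIndexOf(".");
  // Keep the whole name when there is no extension (or only a leading dot).
  return dot > 0 ? file.slice(0, dot) : file;
}

console.log(baseName("http://www.9499.net/page1.htm")); // page1
```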
Use regular expressions to restrict text box input in a web form:
Allow only Chinese characters: onkeyup="value=value.replace(/[^\u4E00-\u9FA5]/g,'')" onbeforepaste="clipboardData.setData('text',clipboardData.getData('text').replace(/[^\u4E00-\u9FA5]/g,''))"
Allow only full-width characters: onkeyup="value=value.replace(/[^\uFF00-\uFFFF]/g,'')" onbeforepaste="clipboardData.setData('text',clipboardData.getData('text').replace(/[^\uFF00-\uFFFF]/g,''))"
Allow only digits: onkeyup="value=value.replace(/[^\d]/g,'')" onbeforepaste="clipboardData.setData('text',clipboardData.getData('text').replace(/[^\d]/g,''))"
Allow only digits and English letters (\W also lets the underscore through): onkeyup="value=value.replace(/[\W]/g,'')" onbeforepaste="clipboardData.setData('text',clipboardData.getData('text').replace(/[\W]/g,''))"
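The filtering part of those handlers can be sketched as plain functions, which makes the expressions easy to try outside a page. The function names are hypothetical; the last example also shows that \W removes Chinese characters, because \w does not cover them.

```javascript
// Hypothetical helpers mirroring the inline handlers above; each returns
// the input with the disallowed characters removed.
const keepChinese = (v) => v.replace(/[^\u4E00-\u9FA5]/g, "");
const keepFullWidth = (v) => v.replace(/[^\uFF00-\uFFFF]/g, "");
const keepDigits = (v) => v.replace(/[^\d]/g, "");
const keepWordChars = (v) => v.replace(/[\W]/g, "");

// "\u4e2d\u6587" is a two-character Chinese string;
// "\uFF21\uFF22" is full-width "AB".
console.log(keepChinese("abc\u4e2d\u6587123") === "\u4e2d\u6587"); // true
console.log(keepFullWidth("\uFF21\uFF22ab") === "\uFF21\uFF22");   // true
console.log(keepDigits("a1b2c3"));                                 // 123
console.log(keepWordChars("ab_12-\u4e2d"));                        // ab_12
```

In a page, one of these functions could be called from an input event listener instead of the inline onkeyup attribute.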
------------------------------------------
In addition, here is some reference information found via Baidu (it is the construct summary from the Java java.util.regex.Pattern documentation):
Summary of regular-expression constructs
Construct             Matches
Characters
x                     The character x
\\                    The backslash character
\0n                   The character with octal value 0n (0 <= n <= 7)
\0nn                  The character with octal value 0nn (0 <= n <= 7)
\0mnn                 The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\xhh                  The character with hexadecimal value 0xhh
\uhhhh                The character with hexadecimal value 0xhhhh
\t                    The tab character ('\u0009')
\n                    The newline (line feed) character ('\u000A')
\r                    The carriage-return character ('\u000D')
\f                    The form-feed character ('\u000C')
\a                    The alert (bell) character ('\u0007')
\e                    The escape character ('\u001B')
\cx                   The control character corresponding to x
Character classes
[abc]                 a, b, or c (simple class)
[^abc]                Any character except a, b, or c (negation)
[a-zA-Z]              a through z or A through Z, inclusive (range)
[a-d[m-p]]            a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]          d, e, or f (intersection)
[a-z&&[^bc]]          a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]         a through z, and not m through p: [a-lq-z] (subtraction)
Predefined character classes
.                     Any character (may or may not match line terminators)
\d                    A digit: [0-9]
\D                    A non-digit: [^0-9]
\s                    A whitespace character: [ \t\n\x0B\f\r]
\S                    A non-whitespace character: [^\s]
\w                    A word character: [a-zA-Z_0-9]
\W                    A non-word character: [^\w]
POSIX character classes (US-ASCII only)
\p{Lower}             A lowercase letter: [a-z]
\p{Upper}             An uppercase letter: [A-Z]
\p{ASCII}             All ASCII: [\x00-\x7F]
\p{Alpha}             A letter: [\p{Lower}\p{Upper}]
\p{Digit}             A decimal digit: [0-9]
\p{Alnum}             An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Punct}             Punctuation: one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph}             A visible character: [\p{Alnum}\p{Punct}]
\p{Print}             A printable character: [\p{Graph}\x20]
\p{Blank}             A space or a tab: [ \t]
\p{Cntrl}             A control character: [\x00-\x1F\x7F]
\p{XDigit}            A hexadecimal digit: [0-9a-fA-F]
\p{Space}             A whitespace character: [ \t\n\x0B\f\r]
java.lang.Character classes (simple Java character type)
\p{javaLowerCase}     Equivalent to java.lang.Character.isLowerCase()
\p{javaUpperCase}     Equivalent to java.lang.Character.isUpperCase()
\p{javaWhitespace}    Equivalent to java.lang.Character.isWhitespace()
\p{javaMirrored}      Equivalent to java.lang.Character.isMirrored()
Classes for Unicode blocks and categories
\p{InGreek}           A character in the Greek block (simple block)
\p{Lu}                An uppercase letter (simple category)
\p{Sc}                A currency symbol
\P{InGreek}           Any character except one in the Greek block (negation)
[\p{L}&&[^\p{Lu}]]    Any letter except an uppercase letter (subtraction)
Boundary matchers
^                     The beginning of a line
$                     The end of a line
\b                    A word boundary
\B                    A non-word boundary
\A                    The beginning of the input
\G                    The end of the previous match
\Z                    The end of the input but for the final terminator, if any
\z                    The end of the input
Greedy quantifiers
X?                    X, once or not at all
X*                    X, zero or more times
X+                    X, one or more times
X{n}                  X, exactly n times
X{n,}                 X, at least n times
X{n,m}                X, at least n but not more than m times
Reluctant quantifiers
X??                   X, once or not at all
X*?                   X, zero or more times
X+?                   X, one or more times
X{n}?                 X, exactly n times
X{n,}?                X, at least n times
X{n,m}?               X, at least n but not more than m times
Possessive quantifiers
X?+                   X, once or not at all
X*+                   X, zero or more times
X++                   X, one or more times
X{n}+                 X, exactly n times
X{n,}+                X, at least n times
X{n,m}+               X, at least n but not more than m times
Logical operators
XY                    X followed by Y
X|Y                   Either X or Y
(X)                   X, as a capturing group
Back references
\n                    Whatever the nth capturing group matched
Quotation
\                     Nothing, but quotes the following character
\Q                    Nothing, but quotes all characters until \E
\E                    Nothing, but ends quoting started by \Q
Special constructs (non-capturing)
(?:X)                 X, as a non-capturing group
(?idmsux-idmsux)      Nothing, but turns match flags on - off
(?idmsux-idmsux:X)    X, as a non-capturing group with the given flags on - off
(?=X)                 X, via zero-width positive lookahead
(?!X)                 X, via zero-width negative lookahead
(?<=X)                X, via zero-width positive lookbehind
(?<!X)                X, via zero-width negative lookbehind
(?>X)                 X, as an independent, non-capturing group
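The table above describes Java's syntax, but several of its constructs behave the same way in JavaScript. The sketch below uses only greedy and reluctant quantifiers, lookahead, and a non-capturing group; the sample strings are made up.

```javascript
const html = "<b>one</b><b>two</b>";

// Greedy: .* takes as much as it can, so one match spans both elements.
const greedy = html.match(/<b>.*<\/b>/)[0];
// Reluctant: .*? takes as little as it can, so the match stops at the first </b>.
const reluctant = html.match(/<b>.*?<\/b>/)[0];

// Positive lookahead: digits only if followed by "px"; "px" is not consumed.
const px = "12px 30em".match(/\d+(?=px)/)[0];
// Negative lookahead: digits not followed by another digit or by "px".
const notPx = "12px 30em".match(/\d+(?!\d|px)/)[0];

// Non-capturing group: (?:...) groups without creating a numbered capture,
// so (c) is group 1.
const groups = /(?:ab)+(c)/.exec("ababc");

console.log(greedy);    // <b>one</b><b>two</b>
console.log(reluctant); // <b>one</b>
console.log(px, notPx); // 12 30
console.log(groups[1]); // c
```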
--------------------------------------------------------------------------------
Backslashes, escapes, and quoting
The backslash character ('\') serves to introduce escaped constructs, as defined in the table above, as well as to quote characters that would otherwise be interpreted as unescaped constructs. Thus the expression \\ matches a single backslash, and \{ matches a left brace.
It is an error to use a backslash before any alphabetic character that does not denote an escaped construct; these are reserved for future extensions to the regular-expression language. A backslash may be used before a non-alphabetic character regardless of whether that character is part of an unescaped construct.
As required by the Java Language Specification, backslashes within string literals in Java source code are interpreted as either Unicode escapes or other character escapes. It is therefore necessary to double the backslashes in string literals that represent regular expressions, to protect them from interpretation by the Java compiler. For example, the string literal "\b" matches a single backspace character when interpreted as a regular expression, while "\\b" matches a word boundary. The string literal "\(hello\)" is illegal and leads to a compile-time error; to match the string (hello), the string literal "\\(hello\\)" must be used.
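The same doubling is needed in JavaScript whenever a pattern is built from a string with the RegExp constructor instead of a regex literal. A small sketch, with made-up sample text:

```javascript
// In a regex literal the backslash goes straight to the regex engine.
const literal = /\bcat\b/;
// In a string literal the backslash must be doubled to survive string parsing.
const fromString = new RegExp("\\bcat\\b");
// With a single backslash, "\b" is a backspace character (U+0008),
// so this pattern looks for backspace followed by "cat".
const backspace = new RegExp("\bcat");

console.log(literal.test("a cat sat"));    // true
console.log(fromString.test("a cat sat")); // true
console.log(backspace.test("a cat sat"));  // false
console.log(backspace.test("\u0008cat"));  // true

// Matching the literal text "(hello)" needs escaped parentheses.
const parens = new RegExp("\\(hello\\)");
console.log(parens.test("say (hello)"));   // true
console.log(parens.test("say hello"));     // false
```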
Character classes
Character classes may appear within other character classes, and may be composed by the union operator (implicit) and the intersection operator (&&). The union operator denotes a class that contains every character that is in at least one of its operand classes. The intersection operator denotes a class that contains every character that is in both of its operand classes.
The precedence of character-class operators is as follows, from highest to lowest:
1  Literal escape     \x
2  Grouping           [...]
3  Range              a-z
4  Union              [a-e][i-u]
5  Intersection       [a-z&&[aeiou]]
Note that a different set of metacharacters is in effect inside a character class than outside it. For instance, the regular expression . loses its special meaning inside a character class, while the expression - becomes a range-forming metacharacter.
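The change of meaning for . and - inside a class can be checked quickly in JavaScript, where these two characters behave the same way. The union and intersection operators are left out of this sketch.

```javascript
// Outside a class, "." matches any character except a line terminator.
const dotOutside = /^.$/.test("x");       // true
// Inside a class, "." is just a literal dot.
const dotInsideLetter = /^[.]$/.test("x"); // false
const dotInsideDot = /^[.]$/.test(".");    // true

// Between two characters, "-" forms a range.
const rangeHit = /^[a-c]$/.test("b");      // true
// At the end of the class, "-" is a literal hyphen.
const literalMiss = /^[ac-]$/.test("b");   // false
const literalHit = /^[ac-]$/.test("-");    // true

console.log(dotOutside, dotInsideLetter, dotInsideDot);
console.log(rangeHit, literalMiss, literalHit);
```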
Line terminators
A line terminator is a one- or two-character sequence that marks the end of a line of the input character sequence. The following are recognized as line terminators:
A newline (line feed) character ('\n'),
A carriage-return character followed immediately by a newline character ("\r\n"),
A standalone carriage-return character ('\r'),
A next-line character ('\u0085'),
A line-separator character ('\u2028'), or
A paragraph-separator character ('\u2029').
If UNIX_LINES mode is activated, then the only line terminators recognized are newline characters.
The regular expression . matches any character except a line terminator unless the DOTALL flag is specified.
By default, the regular expressions ^ and $ ignore line terminators and only match at the beginning and the end, respectively, of the entire input sequence. If MULTILINE mode is activated, then ^ matches at the beginning of input and after any line terminator except at the end of input. When in MULTILINE mode, $ matches just before a line terminator or at the end of the input sequence.
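JavaScript has close counterparts to two of these modes: the m flag plays the role of MULTILINE, and the s flag (ES2018 and later) plays the role of DOTALL. A small sketch with a made-up two-line string:

```javascript
const text = "first\nsecond";

// Without the m flag, ^ and $ only match at the ends of the whole input.
const anchoredDefault = /^second$/.test(text);    // false
// With the m flag, ^ also matches right after a line terminator.
const anchoredMultiline = /^second$/m.test(text); // true

// Without the s flag, "." does not match the newline.
const dotDefault = /first.second/.test(text);     // false
// With the s flag, "." matches line terminators as well.
const dotAll = /first.second/s.test(text);        // true

console.log(anchoredDefault, anchoredMultiline, dotDefault, dotAll);
```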
Groups and capturing
Capturing groups are numbered by counting their opening parentheses from left to right. In the expression ((A)(B(C))), for example, there are four such groups:
1  ((A)(B(C)))
2  (A)
3  (B(C))
4  (C)
Group zero always stands for the entire expression.
Capturing groups are so named because, during a match, each subsequence of the input sequence that matches such a group is saved. The captured subsequence may be used later in the expression, via a back reference, and may also be retrieved from the matcher once the match operation is complete.
The captured input associated with a group is always the subsequence that the group most recently matched. If a group is evaluated a second time because of quantification, then its previously captured value, if any, will be retained if the second evaluation fails. Matching the string "aba" against the expression (a(b)?)+, for example, leaves group two set to "b". All captured input is discarded at the beginning of each match.
Groups beginning with (? are pure, non-capturing groups that do not capture text and do not count towards the group total.
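Group numbering works the same way in JavaScript, as the sketch below shows. One difference is worth noting: ECMAScript resets the captures inside a quantified group on every repetition, so for the "aba" example JavaScript reports group two as undefined rather than "b" as described for Java above.

```javascript
// Groups are numbered by their opening parentheses, left to right.
const m = /((A)(B(C)))/.exec("ABC");
console.log(m[0]); // ABC  (group zero: the entire match)
console.log(m[1]); // ABC
console.log(m[2]); // A
console.log(m[3]); // BC
console.log(m[4]); // C

// A back reference reuses what a group captured: \1 must repeat group 1.
const doubled = /(\w)\1/.exec("hello");
console.log(doubled[0]); // ll

// Quantified group: in JavaScript, group 2 is cleared at the start of the
// second repetition and stays undefined because (b)? matches nothing there.
const repeated = /(a(b)?)+/.exec("aba");
console.log(repeated[1]); // a
console.log(repeated[2]); // undefined
```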
