\w in regular expressions cannot recognize Chinese


Regular expressions are practical and efficient for string processing, form validation, and similar tasks, but I am often unsure of the exact syntax and end up searching the Internet every time. I have collected some commonly used expressions here as a memo for myself. This post will be updated from time to time.
Regular expression matching Chinese characters: [\u4e00-\u9fa5]
Regular expression matching double-byte characters (including Chinese characters): [^\x00-\xff]
Application: computing the length of a string (a double-byte character counts as 2, an ASCII character as 1)
String.prototype.len = function () { return this.replace(/[^\x00-\xff]/g, "aa").length; }
Regular expression matching blank lines: \n[\s| ]*\r
Regular expression matching HTML tags: /<(.*)>.*<\/\1>|<(.*) \/>/
Regular expression matching leading and trailing whitespace: (^\s*)|(\s*$)
String.prototype.trim = function ()
{
    return this.replace(/(^\s*)|(\s*$)/g, "");
}
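A minimal sketch of the same whitespace pattern as a standalone function, compared with the trim method that current JavaScript engines already provide natively. The name trimBoth is hypothetical.

```javascript
// Hypothetical standalone version of the trim shown above.
function trimBoth(str) {
  return str.replace(/(^\s*)|(\s*$)/g, "");
}

console.log("[" + trimBoth("  hello world \t\n") + "]"); // [hello world]
console.log(trimBoth("  x  ") === "  x  ".trim());       // true
```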
Decomposing and converting an IP address with a regular expression:
The following JavaScript program uses a regular expression to match an IP address and convert it to the corresponding numeric value:
function IP2V(ip)
{
    var re = /(\d+)\.(\d+)\.(\d+)\.(\d+)/; // regular expression matching an IP address
    var m = re.exec(ip);
    if (m)
    {
        // each of the four parts is one base-256 digit
        return m[1] * Math.pow(256, 3) + m[2] * Math.pow(256, 2) + m[3] * 256 + m[4] * 1;
    }
    else
    {
        throw new Error("not a valid IP address!");
    }
}
However, the program above may be simpler without a regular expression, by decomposing the address directly with the split function:
var ip = "10.100.20.168";
ip = ip.split(".");
alert("IP value is: " + (ip[0] * 256 * 256 * 256 + ip[1] * 256 * 256 + ip[2] * 256 + ip[3] * 1));
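The conversion can also be sketched with an anchored pattern and a range check on each part, assuming the usual base-256 weighting in which every part is a value from 0 to 255. The function name ipToNumber is hypothetical.

```javascript
// Hypothetical helper: converts dotted-decimal IPv4 text to its numeric value.
function ipToNumber(ip) {
  var m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(ip);
  if (!m) throw new Error("not a valid IP address!");
  var value = 0;
  for (var i = 1; i <= 4; i++) {
    var part = Number(m[i]);
    if (part > 255) throw new Error("not a valid IP address!");
    value = value * 256 + part; // each part is one base-256 digit
  }
  return value;
}

console.log(ipToNumber("10.100.20.168")); // 174331048
```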
Regular expression matching an email address: \w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*
Regular expression matching a URL: http://([\w-]+\.)+[\w-]+(/[\w-./?%&=]*)?
A program that uses regular expressions to remove repeated characters from a string: [Note: this program is incorrect; see the replies to this post for the reason]
var s = "abacabefgeeii";
var s1 = s.replace(/(.).*\1/g, "$1");
var re = new RegExp("[" + s1 + "]", "g");
var s2 = s.replace(re, "");
alert(s1 + s2); // result: abeicfg (each character once, but not in the original order)
I once posted on CSDN looking for an expression that removes repeated characters and never found one; this is the simplest implementation I could think of. The idea is to use a back reference to pull out the characters that are repeated, build a second expression from those characters to strip them from the original string, and then concatenate the repeated characters with the ones that are not repeated. This method may not be suitable for strings in which character order matters.
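For comparison, here are two alternative sketches that keep a predictable order. Neither is the method from the post: the first uses a Set, the second uses a back reference inside a lookahead and assumes the string contains no line breaks. Both function names are hypothetical.

```javascript
// Keeps the first occurrence of every character, in original order.
function uniqueChars(str) {
  return Array.from(new Set(str)).join("");
}

// Regex-only variant: drops every character that occurs again later,
// so the last occurrence of each character is the one that survives.
function uniqueCharsKeepLast(str) {
  return str.replace(/(.)(?=.*\1)/g, "");
}

console.log(uniqueChars("abacabefgeeii"));         // abcefgi
console.log(uniqueCharsKeepLast("abacabefgeeii")); // cabfgei
```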
A JavaScript program that uses a regular expression to extract the file name from a URL; the result below is page1
s = "http://www.9499.net/page1.htm";
s = s.replace(/(.*\/){0,}([^\.]+).*/ig, "$2");
alert(s);
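The same extraction can be sketched with exec and a pattern anchored at the end of the string, which returns the captured name directly instead of rewriting the whole URL. The helper name fileBaseName is hypothetical, and the sketch assumes the URL carries no query string or fragment.

```javascript
// Hypothetical helper: returns the last path segment without its extension.
function fileBaseName(url) {
  var m = /([^\/.]+)(?:\.[^\/]*)?$/.exec(url);
  return m ? m[1] : "";
}

console.log(fileBaseName("http://www.9499.net/page1.htm")); // page1
```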
Using regular expressions to restrict what can be entered in a text box of a web form:
Allow only Chinese characters: onkeyup="value=value.replace(/[^\u4e00-\u9fa5]/g,'')" onbeforepaste="clipboardData.setData('text',clipboardData.getData('text').replace(/[^\u4e00-\u9fa5]/g,''))"
Allow only full-width characters: onkeyup="value=value.replace(/[^\uFF00-\uFFFF]/g,'')" onbeforepaste="clipboardData.setData('text',clipboardData.getData('text').replace(/[^\uFF00-\uFFFF]/g,''))"
Allow only digits: onkeyup="value=value.replace(/[^\d]/g,'')" onbeforepaste="clipboardData.setData('text',clipboardData.getData('text').replace(/[^\d]/g,''))"
Allow only digits and English letters (\W also lets the underscore through): onkeyup="value=value.replace(/[\W]/g,'')" onbeforepaste="clipboardData.setData('text',clipboardData.getData('text').replace(/[\W]/g,''))"
------------------------------------------
In addition, here is some material found on Baidu Knows; it summarizes the regular-expression syntax of Java's java.util.regex.Pattern class:
Summary of regular-expression constructs
Construct  Matches
Characters
x  The character x
\\  The backslash character
\0n  The character with octal value 0n (0 <= n <= 7)
\0nn  The character with octal value 0nn (0 <= n <= 7)
\0mnn  The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\xhh  The character with hexadecimal value 0xhh
\uhhhh  The character with hexadecimal value 0xhhhh
\t  The tab character ('\u0009')
\n  The newline (line feed) character ('\u000A')
\r  The carriage-return character ('\u000D')
\f  The form-feed character ('\u000C')
\a  The alert (bell) character ('\u0007')
\e  The escape character ('\u001B')
\cx  The control character corresponding to x
Character classes
[abc]  a, b, or c (simple class)
[^abc]  Any character except a, b, or c (negation)
[a-zA-Z]  a through z or A through Z, inclusive (range)
[a-d[m-p]]  a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]  d, e, or f (intersection)
[a-z&&[^bc]]  a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]  a through z, and not m through p: [a-lq-z] (subtraction)
Predefined character classes
.  Any character (may or may not match line terminators)
\d  A digit: [0-9]
\D  A non-digit: [^0-9]
\s  A whitespace character: [ \t\n\x0B\f\r]
\S  A non-whitespace character: [^\s]
\w  A word character: [a-zA-Z_0-9]
\W  A non-word character: [^\w]
POSIX character classes (US-ASCII only)
\p{Lower}  A lower-case alphabetic character: [a-z]
\p{Upper}  An upper-case alphabetic character: [A-Z]
\p{ASCII}  All ASCII: [\x00-\x7F]
\p{Alpha}  An alphabetic character: [\p{Lower}\p{Upper}]
\p{Digit}  A decimal digit: [0-9]
\p{Alnum}  An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Punct}  Punctuation: one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph}  A visible character: [\p{Alnum}\p{Punct}]
\p{Print}  A printable character: [\p{Graph}\x20]
\p{Blank}  A space or a tab: [ \t]
\p{Cntrl}  A control character: [\x00-\x1F\x7F]
\p{XDigit}  A hexadecimal digit: [0-9a-fA-F]
\p{Space}  A whitespace character: [ \t\n\x0B\f\r]
java.lang.Character classes (simple Java character type)
\p{javaLowerCase}  Equivalent to java.lang.Character.isLowerCase()
\p{javaUpperCase}  Equivalent to java.lang.Character.isUpperCase()
\p{javaWhitespace}  Equivalent to java.lang.Character.isWhitespace()
\p{javaMirrored}  Equivalent to java.lang.Character.isMirrored()
Classes for Unicode blocks and categories
\p{InGreek}  A character in the Greek block (simple block)
\p{Lu}  An uppercase letter (simple category)
\p{Sc}  A currency symbol
\P{InGreek}  Any character except one in the Greek block (negation)
[\p{L}&&[^\p{Lu}]]  Any letter except an uppercase letter (subtraction)
Boundary matchers
^  The beginning of a line
$  The end of a line
\b  A word boundary
\B  A non-word boundary
\A  The beginning of the input
\G  The end of the previous match
\Z  The end of the input but for the final terminator, if any
\z  The end of the input
Greedy quantifiers
X?  X, once or not at all
X*  X, zero or more times
X+  X, one or more times
X{n}  X, exactly n times
X{n,}  X, at least n times
X{n,m}  X, at least n but not more than m times
Reluctant quantifiers
X??  X, once or not at all
X*?  X, zero or more times
X+?  X, one or more times
X{n}?  X, exactly n times
X{n,}?  X, at least n times
X{n,m}?  X, at least n but not more than m times
Possessive quantifiers
X?+  X, once or not at all
X*+  X, zero or more times
X++  X, one or more times
X{n}+  X, exactly n times
X{n,}+  X, at least n times
X{n,m}+  X, at least n but not more than m times
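The tables list the same repetition counts three times because the variants differ only in how much text they try to take. A small JavaScript sketch of greedy versus reluctant matching follows; JavaScript's RegExp has no possessive quantifiers, so only the first two kinds are illustrated, and the sample string is made up.

```javascript
var html = "<b>bold</b> and <i>italic</i>";

// Greedy: .+ takes as much as it can, so the match runs to the last ">".
console.log(html.match(/<.+>/)[0]);  // <b>bold</b> and <i>italic</i>

// Reluctant: .+? takes as little as it can, so the match stops at the first ">".
console.log(html.match(/<.+?>/)[0]); // <b>

// With the g flag, the reluctant form finds each of the four tags separately.
console.log(html.match(/<.+?>/g).join(" ")); // <b> </b> <i> </i>
```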
Logical operators
XY  X followed by Y
X|Y  Either X or Y
(X)  X, as a capturing group
Back references
\n  Whatever the nth capturing group matched
Quotation
\  Nothing, but quotes the following character
\Q  Nothing, but quotes all characters until \E
\E  Nothing, but ends quoting started by \Q
Special constructs (non-capturing)
(?:X)  X, as a non-capturing group
(?idmsux-idmsux)  Nothing, but turns match flags on - off
(?idmsux-idmsux:X)  X, as a non-capturing group with the given flags on - off
(?=X)  X, via zero-width positive lookahead
(?!X)  X, via zero-width negative lookahead
(?<=X)  X, via zero-width positive lookbehind
(?<!X)  X, via zero-width negative lookbehind
(?>X)  X, as an independent, non-capturing group
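Several of the special constructs above also exist in JavaScript. The sketch below assumes an engine with lookbehind support (ES2018 or later) and uses a made-up sample string; the inline flag groups and the independent group (?>X) are left out because JavaScript does not offer them in this form.

```javascript
var text = "price: $42, weight: 17kg";

console.log(text.match(/\d+(?=kg)/)[0]);      // 17  (digits followed by "kg")
console.log(text.match(/\d+(?!\d|kg)/)[0]);   // 42  (digits not followed by a digit or "kg")
console.log(text.match(/(?<=\$)\d+/)[0]);     // 42  (digits preceded by "$")
console.log(text.match(/(?<!\$)\b\d+/)[0]);   // 17  (a number not preceded by "$")

// (?:X) groups without capturing: the result holds only the whole match.
var m = "abab".match(/(?:ab)+/);
console.log(m[0], m.length);                  // abab 1
```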
--------------------------------------------------------------------------------
Backslashes, escapes, and quoting
The backslash character ('\') serves to introduce escaped constructs, as defined in the table above, and to quote characters that would otherwise be interpreted as unescaped constructs. Thus the expression \\ matches a single backslash, and \{ matches a left brace.
It is an error to use a backslash before any alphabetic character that does not denote an escaped construct; these are reserved for future extensions of the regular-expression language. A backslash may be used before a non-alphabetic character regardless of whether that character is part of an unescaped construct.
Backslashes within string literals in Java source code are interpreted, as required by the Java Language Specification, as either Unicode escapes or other character escapes. It is therefore necessary to double the backslashes in string literals that represent regular expressions, to protect them from interpretation by the Java bytecode compiler. For example, when interpreted as a regular expression, the string literal "\b" matches a single backspace character, while "\\b" matches a word boundary. The string literal "\(hello\)" is illegal and leads to a compile-time error; to match the string (hello), the string literal "\\(hello\\)" must be used.
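The passage above describes Java, but the same doubling rule applies when a JavaScript string literal is passed to the RegExp constructor, so the effect can be sketched in JavaScript. The variable names are illustrative only.

```javascript
// In a string literal, "\b" is the backspace character (U+0008) ...
var backspace = new RegExp("\b");
// ... while "\\b" reaches the regex engine as \b, a word boundary.
var wordCat = new RegExp("\\bcat\\b");
// "\\(hello\\)" reaches the engine as \(hello\) and matches the parentheses literally.
var literalParens = new RegExp("\\(hello\\)");

console.log(backspace.test("\u0008"));      // true
console.log(backspace.test("a cat"));       // false
console.log(wordCat.test("a cat sat"));     // true
console.log(wordCat.test("concatenate"));   // false
console.log(literalParens.test("(hello)")); // true
console.log(literalParens.test("hello"));   // false
```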
Character classes
Character classes may appear within other character classes, and may be composed by the union operator (implicit) and the intersection operator (&&). The union operator denotes a class that contains every character that is in at least one of its operand classes. The intersection operator denotes a class that contains every character that is in both of its operand classes.
The precedence of the character-class operators is as follows, from highest to lowest:
1  Literal escape  \x
2  Grouping  [...]
3  Range  a-z
4  Union  [a-e][i-u]
5  Intersection  [a-z&&[aeiou]]
Note that a different set of metacharacters is in effect inside a character class than outside one. For instance, the regular expression . loses its special meaning inside a character class, while the expression - becomes a range-forming metacharacter.
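The change of metacharacters inside a class can be checked directly in JavaScript. JavaScript character classes without the v flag have no && operator, so the last part of this sketch emulates the intersection from the precedence table with a lookahead; that emulation is an assumption of this example, not part of the Java syntax described above.

```javascript
// Outside a class "." matches any character; inside a class it is a literal dot.
console.log(/^.$/.test("x"));   // true
console.log(/^[.]$/.test("x")); // false
console.log(/^[.]$/.test(".")); // true

// Inside a class "-" between two characters forms a range; escaped, it is literal.
console.log(/^[a-c]$/.test("b"));  // true
console.log(/^[a\-c]$/.test("b")); // false
console.log(/^[a\-c]$/.test("-")); // true

// Emulating [a-z&&[aeiou]]: the lookahead requires a vowel, the class a lower-case letter.
var lowerVowel = /^(?=[aeiou])[a-z]$/;
console.log(lowerVowel.test("e")); // true
console.log(lowerVowel.test("x")); // false
```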
Line terminators
A line terminator is a one- or two-character sequence that marks the end of a line of the input character sequence. The following are recognized as line terminators:
A newline (line feed) character ('\n'),
A carriage-return character followed immediately by a newline character ("\r\n"),
A standalone carriage-return character ('\r'),
A next-line character ('\u0085'),
A line-separator character ('\u2028'), or
A paragraph-separator character ('\u2029').
If UNIX_LINES mode is activated, the newline character is the only line terminator recognized.
Unless the DOTALL flag is specified, the regular expression . matches any character except a line terminator.
By default, the regular expressions ^ and $ ignore line terminators and match only at the beginning and the end, respectively, of the entire input sequence. If MULTILINE mode is activated, ^ matches at the beginning of the input and after any line terminator except at the end of the input. When in MULTILINE mode, $ matches just before a line terminator or at the end of the input sequence.
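JavaScript has close counterparts of two of these modes: the m flag behaves like MULTILINE and the s flag (ES2018 or later) like DOTALL. It has no UNIX_LINES mode, and its line-terminator list does not include '\u0085'. A minimal sketch with a made-up three-line input:

```javascript
var input = "first\nsecond\nthird";

// Default: ^ and $ match only at the start and end of the whole input.
console.log(input.match(/^\w+$/g));            // null
// m flag: ^ and $ also match at line breaks.
console.log(input.match(/^\w+$/gm).join(",")); // first,second,third

// Default: "." stops at a line terminator.
console.log(/first.second/.test(input));  // false
// s flag: "." matches line terminators too.
console.log(/first.second/s.test(input)); // true
```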
Groups and capturing
Capturing groups are numbered by counting their opening parentheses from left to right. In the expression ((A)(B(C))), for example, there are four such groups:
1  ((A)(B(C)))
2  (A)
3  (B(C))
4  (C)
Group zero always stands for the entire expression.
Capturing groups are so named because, during a match, each subsequence of the input sequence that matches such a group is saved. The captured subsequence may be used later in the expression, via a back reference, and may also be retrieved from the matcher once the match operation is complete.
The captured input associated with a group is always the subsequence that the group most recently matched. If a group is evaluated a second time because of quantification, its previously captured value, if any, is retained if the second evaluation fails. Matching the string "aba" against the expression (a(b)?)+, for example, leaves group two set to "b". All captured input is discarded at the beginning of each match.
Groups beginning with (? are pure, non-capturing groups that do not capture text and do not count towards the group total.
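The numbering rule can be checked in JavaScript, which counts opening parentheses in the same way. One difference is worth knowing: JavaScript resets the captures inside a quantified group on every repetition, so the "aba" example gives a different result for group two than the Java behaviour described above.

```javascript
var m = /((A)(B(C)))/.exec("ABC");
console.log(m[0]); // ABC  (group 0: the whole match)
console.log(m[1]); // ABC  ((A)(B(C)))
console.log(m[2]); // A    (A)
console.log(m[3]); // BC   (B(C))
console.log(m[4]); // C    (C)

// A group starting with (?: is not counted, so (C) becomes group 3 here.
var n = /((A)(?:B(C)))/.exec("ABC");
console.log(n.length - 1); // 3
console.log(n[3]);         // C

// A back reference reuses what a group captured earlier in the same match.
console.log(/(\w)\1/.exec("hello")[0]); // ll

// JavaScript only: group 2 is undefined after the second repetition.
var r = /(a(b)?)+/.exec("aba");
console.log(r[1]); // a
console.log(r[2]); // undefined
```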

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
