Regular Expression Basics: the \b Word Boundary

Source: Internet
Author: User
1 Overview
"\b" matches a word boundary. It matches a position rather than a character, so it is zero-width.
The position that "\b" matches has a word character on one side and, on the other side, either a non-word character or the start or end of the string.
Almost every reference describes "\b" as a word boundary, but few say what counts as a "word". In regular expressions, a "word" is generally a substring made up of the characters matched by "\w".
Because one side of the position must be a word character and the other side must not be one, "\b" is equivalent to:
(?<!\w)(?=\w)|(?<=\w)(?!\w)
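The equivalence can be checked with a small experiment. The following is a minimal sketch, assuming a JavaScript engine that supports lookbehind (ES2018 or later); the helper name markPositions and the sample string are made up for illustration. It inserts "|" at every zero-width match so the matched positions become visible.

```javascript
// Insert "|" at every position where the (zero-width) regex matches.
// No input characters are consumed, so the original text is preserved.
function markPositions(regex, text) {
  return text.replace(regex, "|");
}

const sample = "abc_123 x-y";
const wordBoundary = /\b/g;
const lookaroundForm = /(?<!\w)(?=\w)|(?<=\w)(?!\w)/g;

console.log(markPositions(wordBoundary, sample));   // |abc_123| |x|-|y|
console.log(markPositions(lookaroundForm, sample)); // |abc_123| |x|-|y|
```

Note that "_" and the digits sit inside the first "word": no boundary is reported between "abc", "_" and "123".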
Exercise: why is the following pattern not equivalent to "\b"?
(?<=\W)(?=\w)|(?<=\w)(?=\W)
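One way to explore the exercise is to compare match positions directly. This sketch assumes the exercise is about a variant that uses the negated class "\W" inside the lookarounds, that is (?<=\W)(?=\w)|(?<=\w)(?=\W); the helper matchPositions is hypothetical, and lookbehind support (ES2018 or later) is assumed.

```javascript
// Return the index of every zero-width match of `pattern` in `text`.
function matchPositions(pattern, text) {
  const re = new RegExp(pattern, "g");
  return Array.from(text.matchAll(re), (m) => m.index);
}

const text = "ab cd";
const boundary = "\\b";
// Variant that requires an actual non-word character next to the position.
const nonWordVariant = "(?<=\\W)(?=\\w)|(?<=\\w)(?=\\W)";

console.log(matchPositions(boundary, text));       // [ 0, 2, 3, 5 ]
console.log(matchPositions(nonWordVariant, text)); // [ 2, 3 ]
```

The variant misses positions 0 and 5: "\W" must match a real character, so it can never succeed at the start or end of the string, where no character exists on that side.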
2 The scope of \w
Because "\b" is defined in terms of "\w", the scope of "\w" needs to be examined first.
In languages where "\w" is ASCII-only, such as JavaScript, "\w" is equivalent to [a-zA-Z0-9_].
In languages where "\w" is Unicode-aware, such as .NET, "\w" by default matches [a-zA-Z0-9_] plus other Unicode word characters, such as Chinese characters and full-width digits.
Most common languages fall cleanly into one of these two groups, but Java is an exception. Java supports Unicode, yet by default "\w" in a Java regular expression is equivalent to [a-zA-Z0-9_].
The following examples show what "\w" matches in several languages.
Javascript
Copy Code code as follows:

<script language= "JavaScript" >
var str = "abc_123 Chinese _D3=EFG character%";
var reg =/\w+/g;
var arr = Str.match (reg);
if (arr!= null)
{
for (Var i=0;i<arr.length;i++)
{
document.write (Arr[i] + "<br/>");
}
}
</script>
Output in JavaScript
Abc_123
_d3
Efg

C#
Copy Code code as follows:

String test = "abc_123 Chinese _D3=EFG Chinese characters%";
MatchCollection mc = regex.matches (Test, @ "\w+");
foreach (Match m in MC)
{
Richtextbox2.text + = M.value + "\ n";
}
Output in C #
Abc_123 Chinese _d3
EFG Chinese Characters

Java
Copy Code code as follows:

String test = "abc_123 Chinese _D3=EFG Chinese characters%";
String reg = "\\w+";
Matcher m = Pattern.compile (reg). Matcher (test);
while (M.find ())
{
System.out.println (M.group ());
}
Output in Java
Abc_123
_d3
Efg

As the output shows, "\w" in Java behaves the same as in JavaScript and matches only ASCII word characters.
3 range of \b
The scope of "\w" in common language is determined, so is it possible to think that the matching range of "\b" is consistent with "\w"?
Take another look at the following example:
SOURCE string: abc_123 Chinese _d3= kanji EFG
Regular expression:. \b.
Javascript
Copy Code code as follows:

<script language= "JavaScript" >
var str = "abc_123 Chinese _D3=EFG character%";
var reg =/.\b./g;
var arr = Str.match (reg);
if (arr!= null)
{
for (Var i=0;i<arr.length;i++)
{
document.write (Arr[i] + "<br/>");
}
}
</script>
Output in JavaScript
3 in
Text
3=
G-Han

C#
Copy Code code as follows:

String test = "abc_123 Chinese _D3=EFG Chinese characters%";
MatchCollection mc = regex.matches (Test, @ ". \b.");
foreach (Match m in MC)
{
Richtextbox2.text + = M.value + "\ n";
}
Output in C #
3=
Word

Java
Copy Code code as follows:

String test = "abc_123 Chinese _D3=EFG Chinese characters%";
String reg = ". \\b.";
Matcher m = Pattern.compile (reg). Matcher (test);
while (M.find ())
{
System.out.println (M.group ());
}
Output in Java
3=
Word

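As the output shows, Java produces the same result as .NET here, so "\b" in Java is Unicode-aware.
In short, "\w" in Java is the odd one: it is ASCII-only, while "\b" follows Unicode as in .NET. Keep this mismatch in mind when using the two together. These results reflect the Java version the example was run on; later JDK releases changed the default behavior of "\b", so verify it on the runtime you target.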
4 Application scenarios for \b
4.1 Basic applications
"\b" is generally used when you need to match a substring made up of word characters, but only where that substring is not part of a longer run of word characters.
For example, to replace the English word "to" without touching "today", use "\bto\b" to restrict the match.
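As a minimal sketch of the "to" versus "today" case, the snippet below compares a bounded and an unbounded replacement. The function name replaceWordTo, the sample sentence, and the replacement text are made up for illustration.

```javascript
// Replace "to" only where it stands alone as a word.
function replaceWordTo(text, replacement) {
  return text.replace(/\bto\b/g, replacement);
}

const sentence = "go to town today";

// With \b on both sides, only the standalone word is replaced.
console.log(replaceWordTo(sentence, "2")); // go 2 town today

// Without \b, every occurrence of the letters "to" is replaced.
console.log(sentence.replace(/to/g, "2")); // go 2 2wn 2day
```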
A more common use is matching HTML tags while telling similar tag names apart. For example, to filter out tags such as <b>, </b>, <p...>, and <img...> while keeping <br/>, the pattern can be written as "<(/?b|p|img)\b[^>]*>".
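The tag-filtering pattern can be tried out as follows. This is only a sketch: the function names and the sample markup are hypothetical, the "i" flag is added so that upper-case tags are handled too, and the second pattern is an assumed variation rather than something from the original text.

```javascript
// Pattern from the text: "\b" stops "b" from matching the start of "br".
const tagPattern = /<(\/?b|p|img)\b[^>]*>/gi;

// Assumed variation: the optional "/" applies to every tag name.
const closingTooPattern = /<\/?(?:b|p|img)\b[^>]*>/gi;

function stripTags(html, pattern) {
  return html.replace(pattern, "");
}

const html = '<p class="x"><b>bold</b><br/>line<img src="a.png"></p>';

console.log(stripTags(html, tagPattern));        // bold<br/>line</p>
console.log(stripTags(html, closingTooPattern)); // bold<br/>line
```

In the pattern from the text, "/?" binds only to "b", so "</p>" is left in place. Moving "/?" outside the group, as in the variation, also removes closing tags for "p". In both cases "<br/>" survives because there is no word boundary between "b" and "r".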
Another example: count how many elements of a comma-separated list are exactly "3".
string test = "137,1,33,4,3,6,21,3,35,93,2,98";
int count = Regex.Matches(test, @"\b3\b").Count; // Result: 2
4.2 Advanced applications
More complex uses usually combine "\b" with other regular expression syntax. For an example, see the post "To find a regular expression".
4.3 Special cases
In a regular expression, "\b" normally denotes a word boundary. The one exception is inside a character class, where it denotes the backspace character. For example, in
[a-z\b]
the "\b" represents backspace, not a word boundary.
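The difference can be observed directly in JavaScript, where "\b" inside a string literal is also the backspace character (U+0008). The variable names below are made up for this sketch.

```javascript
const backspace = "\u0008"; // the backspace character, same as "\b" in a string literal

// Inside a character class, \b matches the backspace character.
const inClass = /[a-z\b]/;
console.log(inClass.test(backspace)); // true
console.log(inClass.test("-"));       // false

// Outside a character class, \b is a word boundary. A lone backspace
// contains no word character, so no boundary exists in it.
const asBoundary = /\b/;
console.log(asBoundary.test(backspace)); // false
console.log(asBoundary.test("a"));       // true
```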
