Regular expression notes: extracting Chinese text, text between double quotation marks, and text between angle brackets in C# code

Source: Internet
Author: User

1. Repetition quantifiers

* The preceding character may appear zero, one, or many times. There is no upper limit, and the minimum is zero.

+ The preceding character must appear one or more times. There is no upper limit, but it must appear at least once.

? The preceding character may appear zero times or once. The maximum is one, and the minimum is zero.
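
As a small illustration of the three quantifiers, the sketch below applies each of them to the letter b after an a. The class name QuantifierDemo and the helper FirstMatch are made up for this example; only Regex.Match comes from the .NET library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Hypothetical helper: returns the first match of pattern in input,
    // or "<no match>" when the pattern does not match anywhere.
    public static string FirstMatch(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : "<no match>";
    }

    public static void Main()
    {
        // b* : zero or more b
        Console.WriteLine(FirstMatch("ab*", "ac"));     // a
        Console.WriteLine(FirstMatch("ab*", "abbbc"));  // abbb
        // b+ : at least one b
        Console.WriteLine(FirstMatch("ab+", "ac"));     // <no match>
        Console.WriteLine(FirstMatch("ab+", "abbbc"));  // abbb
        // b? : zero or one b
        Console.WriteLine(FirstMatch("ab?", "ac"));     // a
        Console.WriteLine(FirstMatch("ab?", "abbbc"));  // ab
    }
}
```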

 

2. Other Symbols

\ Escape

. Matches any single character except the newline character \n. (In .NET, RegexOptions.Singleline makes it match \n as well.)

^ Anchors the match to the start of the whole string, for example ^t. The point is that it refers to the whole string: the match can only occur at the beginning.

$ Anchors the match to the end of the whole string, for example f$. Again it refers to the whole string: the match can only occur at the end.

[...] Matches any one of the characters inside the brackets.
| Alternation. "gray|grey" matches gray or grey.

() Groups part of the expression and sets precedence. For example, "gr(a|e)y" matches gray or grey.

{n} matches the preceding character exactly n times, {n,} matches it n or more times, and {n,m} matches it at least n times and at most m times.

\s any whitespace character

\S any non-whitespace character

\w any word character

\W any non-word character

\d any digit

\D any non-digit
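
The sketch below tries several of these symbols on short inputs. SymbolDemo and AllMatches are names invented for this example, and the sample strings are arbitrary.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SymbolDemo
{
    // Hypothetical helper: collects every match of pattern in input.
    public static List<string> AllMatches(string pattern, string input)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            result.Add(m.Value);
        }
        return result;
    }

    public static void Main()
    {
        // Grouping and alternation
        Console.WriteLine(string.Join(",", AllMatches("gr(a|e)y", "gray grey groy"))); // gray,grey
        // {n,m}: two or three digits in a row
        Console.WriteLine(string.Join(",", AllMatches(@"\d{2,3}", "7 42 12345")));     // 42,123,45
        // \w: word characters (letters, digits, underscore)
        Console.WriteLine(string.Join(",", AllMatches(@"\w+", "foo_bar baz!")));       // foo_bar,baz
        // \S: runs of non-whitespace characters
        Console.WriteLine(string.Join(",", AllMatches(@"\S+", "a  b\tc")));            // a,b,c
        // ^ and $: only the start and the end of the whole string
        Console.WriteLine(string.Join(",", AllMatches("^t.", "tic tac")));             // ti
        Console.WriteLine(string.Join(",", AllMatches("[ai]c$", "tic tac")));          // ac
    }
}
```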

 

3. Expression for extracting Chinese text

Regex rx = new Regex("[\u4e00-\u9fa5]+");
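
A minimal sketch of this expression in use is shown below. The class name ChineseExtractor, the method Extract, and the sample line of code are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ChineseExtractor
{
    // One or more consecutive characters in the range U+4E00 to U+9FA5.
    private static readonly Regex ChineseRun = new Regex("[\u4e00-\u9fa5]+");

    // Hypothetical helper: returns every run of Chinese characters in text.
    public static List<string> Extract(string text)
    {
        var runs = new List<string>();
        foreach (Match m in ChineseRun.Matches(text))
        {
            runs.Add(m.Value);
        }
        return runs;
    }

    public static void Main()
    {
        string line = "string title = \"保存文件\"; // 提示 message";
        foreach (string run in Extract(line))
        {
            Console.WriteLine(run); // prints the runs 保存文件 and 提示, one per line
        }
    }
}
```

Full-width punctuation such as ， (U+FF0C) and 。 (U+3002) lies outside this range, so a sentence containing it is split into several runs.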

 

4. Expression for extracting text between double quotation marks

Regex rx = new Regex("\"[^\"]*\"");

Note: Inside [], ^ has a different meaning. It no longer anchors the match to the start of the string; instead it means the class matches any character except the ones listed. This is called negation.
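
The following sketch applies the negated-class expression to one line of code. QuotedStringDemo, Extract, and the sample input are invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuotedStringDemo
{
    // A quotation mark, any run of characters that are not quotation marks,
    // then the closing quotation mark.
    private static readonly Regex Quoted = new Regex("\"[^\"]*\"");

    // Hypothetical helper: returns each quoted piece, quotation marks included.
    public static List<string> Extract(string code)
    {
        var found = new List<string>();
        foreach (Match m in Quoted.Matches(code))
        {
            found.Add(m.Value);
        }
        return found;
    }

    public static void Main()
    {
        foreach (string s in Extract("Show(\"hello\", \"你好\");"))
        {
            Console.WriteLine(s); // prints "hello" and "你好", one per line
        }
    }
}
```

This pattern does not understand escaped quotation marks inside a string literal, so a literal such as "a\"b" would be cut at the escaped quote.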

 

5. Negation

Sometimes you need to find characters that do not belong to a simple character class. For example, to search for any character except a digit, you need negation:

Table 3. Commonly used negation codes

Code/syntax    Description
\W             Matches any character that is not a letter, digit, underscore, or Chinese character
\S             Matches any character that is not a whitespace character
\D             Matches any character that is not a digit
\B             Matches a position that is not the start or end of a word
[^x]           Matches any character except x
[^aeiou]       Matches any character except the letters a, e, i, o, and u

 

6. A second expression for extracting text between double quotation marks

Regex rx = new Regex("\".*?\"");

Note: The ? added after * switches the quantifier from greedy to lazy matching.
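
The effect of the added ? is easiest to see on a line that holds two quoted strings. The sketch below compares the greedy and lazy forms; GreedyQuoteDemo, AllMatches, and the sample line are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GreedyQuoteDemo
{
    // Hypothetical helper: collects every match of pattern in input.
    public static List<string> AllMatches(string pattern, string input)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            result.Add(m.Value);
        }
        return result;
    }

    public static void Main()
    {
        string line = "Log(\"a\", \"b\");";

        // Greedy: runs from the first quotation mark to the last one on the line.
        foreach (string s in AllMatches("\".*\"", line))
        {
            Console.WriteLine("greedy: " + s); // greedy: "a", "b"
        }

        // Lazy: stops at the nearest closing quotation mark.
        foreach (string s in AllMatches("\".*?\"", line))
        {
            Console.WriteLine("lazy: " + s);   // lazy: "a"  then  lazy: "b"
        }
    }
}
```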

 

7. Greedy and lazy matching

When a regular expression contains a quantifier that allows repetition, the default behavior is to match as many characters as possible (provided that the whole expression can still match). Consider the expression a.*b: it matches the longest string that starts with a and ends with b. Applied to aabab, it matches the entire string aabab. This is called greedy matching.

Sometimes we want lazy matching instead, that is, matching as few characters as possible. Any of the quantifiers given above can be made lazy by adding a question mark ? after it. So .*? means: repeat any number of times, but use as few repetitions as possible while still letting the whole match succeed. Now look at the lazy version of the example:

a.*?b matches the shortest string that starts with a and ends with b. Applied to aabab, it matches aab (characters 1 to 3) and ab (characters 4 to 5).

Why is the first match aab (characters 1 to 3) rather than ab (characters 2 to 3)? Simply put, regular expressions follow another rule that takes priority over the lazy/greedy rule: the match that begins earliest wins.
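
The aabab example can be checked directly. In the sketch below, LazyDemo and Positions are made-up names; each result is reported as index:value, where the index is zero-based (so index 0 is character 1 in the text above).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LazyDemo
{
    // Hypothetical helper: lists each match as "zeroBasedIndex:value".
    public static List<string> Positions(string pattern, string input)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            result.Add(m.Index + ":" + m.Value);
        }
        return result;
    }

    public static void Main()
    {
        // Greedy: one match covering the whole string.
        Console.WriteLine(string.Join(" ", Positions("a.*b", "aabab")));  // 0:aabab
        // Lazy: the earliest start still wins, so the first match is aab, not ab.
        Console.WriteLine(string.Join(" ", Positions("a.*?b", "aabab"))); // 0:aab 3:ab
    }
}
```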

 

8. Lanyi trajectory describes another way to match double quotation marks: write the quotation mark as its hexadecimal Unicode escape, \u0022. See: http://www.cnblogs.com/twh/articles/1629752.html

Regex rx = new Regex("\u0022.*?\u0022");
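
The escape can be resolved at two different levels, which the sketch below compares. In a regular C# string literal the compiler turns \u0022 into a quotation mark before the regex engine sees it; in a verbatim string (an assumption added here, not part of the original note) the regex engine itself interprets \u0022. UnicodeEscapeDemo and Extract are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UnicodeEscapeDemo
{
    // Hypothetical helper: collects every match of rx in input.
    public static List<string> Extract(Regex rx, string input)
    {
        var result = new List<string>();
        foreach (Match m in rx.Matches(input))
        {
            result.Add(m.Value);
        }
        return result;
    }

    public static void Main()
    {
        string line = "Log(\"a\", \"b\");";

        // The C# compiler resolves \u0022 to a quotation mark.
        var compilerEscape = new Regex("\u0022.*?\u0022");
        // Verbatim string: the regex engine resolves \u0022.
        var engineEscape = new Regex(@"\u0022.*?\u0022");
        // Plain backslash-escaped quotation marks, as in section 6.
        var plain = new Regex("\".*?\"");

        Console.WriteLine(string.Join(" ", Extract(compilerEscape, line))); // "a" "b"
        Console.WriteLine(string.Join(" ", Extract(engineEscape, line)));   // "a" "b"
        Console.WriteLine(string.Join(" ", Extract(plain, line)));          // "a" "b"
    }
}
```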

 

9. Extracting Chinese text from an XML configuration file for internationalization

Chinese text can appear in two places: attribute values and node values. So first match the file against the double-quotation-mark expression to extract the Chinese text in attribute values, then match it against an angle-bracket expression to extract the Chinese text in node values. The angle-bracket expression is Regex rx = new Regex(">.*<");

If you only need the Chinese characters themselves, use Regex rx = new Regex("[\u4e00-\u9fa5]+"). However, that is not enough when the Chinese text is mixed with punctuation or English characters. In that case, use IsMatch to check whether each extracted string contains Chinese characters and decide whether to keep it. For example:

// Requires: using System.Text.RegularExpressions;
Regex rxChineseCharacter = new Regex("[\u4e00-\u9fa5]+");
Regex rx = new Regex(">.*<");
MatchCollection matches = rx.Matches(input);
if (matches.Count != 0)
{
    foreach (Match m in matches)
    {
        if (rxChineseCharacter.IsMatch(m.Value))
        {
            // The match contains Chinese characters, so keep it.
        }
    }
}
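
Putting the two passes together, a minimal sketch might look like the following. The class name XmlChineseExtractor, the method Extract, the trimming of the delimiters, and the sample XML are all assumptions made for illustration; the three expressions are the ones from these notes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class XmlChineseExtractor
{
    private static readonly Regex ContainsChinese = new Regex("[\u4e00-\u9fa5]+");
    private static readonly Regex Quoted = new Regex("\"[^\"]*\""); // attribute values
    private static readonly Regex NodeValue = new Regex(">.*<");    // node values

    // Hypothetical helper: returns attribute values and node values that
    // contain at least one Chinese character, with the delimiters trimmed.
    public static List<string> Extract(string xml)
    {
        var kept = new List<string>();
        foreach (Match m in Quoted.Matches(xml))
        {
            if (ContainsChinese.IsMatch(m.Value)) kept.Add(m.Value.Trim('"'));
        }
        foreach (Match m in NodeValue.Matches(xml))
        {
            if (ContainsChinese.IsMatch(m.Value)) kept.Add(m.Value.Trim('>', '<'));
        }
        return kept;
    }

    public static void Main()
    {
        // Made-up sample with one element per line.
        string xml =
            "<config>\n" +
            "  <button text=\"保存\" id=\"btnSave\" />\n" +
            "  <label>用户名: admin</label>\n" +
            "  <label>version 1.0</label>\n" +
            "</config>";

        foreach (string s in Extract(xml))
        {
            Console.WriteLine(s); // prints 保存 and 用户名: admin, one per line
        }
    }
}
```

Because .* is greedy, a line that holds several elements produces one long match from the first > to the last < on that line. That is still enough for the IsMatch check, but a stricter pattern would be needed to separate the individual node values.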

10. Summary

^ has two meanings:

1. At the start of a pattern, it anchors the match to the start of the whole string, for example ^t. The match can only occur at the beginning of the string.

2. Inside brackets, as in [^t], it negates the class: the class matches any character except t.

 

? also has two meanings:

1. As a quantifier, it means the preceding character may appear zero times or once. The maximum is one and the minimum is zero, so the character is optional.

2. After another quantifier, it switches that quantifier from greedy to lazy matching.

 

$ anchors the match to the end of the whole string, for example f$. The match can only occur at the end of the string. For example, matching the string assssbtattttb against a[^a]*b$ matches only the second half, attttb. Matching it against ^a[^a]*b matches only the first half, assssb. Matching it against ^a[^a]*b$ matches nothing at all.
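
The three anchor cases above can be verified with a few lines. AnchorDemo and FirstMatch are names invented for this sketch.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // Hypothetical helper: returns the first match, or "<no match>".
    public static string FirstMatch(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : "<no match>";
    }

    public static void Main()
    {
        string input = "assssbtattttb";
        Console.WriteLine(FirstMatch("a[^a]*b$", input));  // attttb
        Console.WriteLine(FirstMatch("^a[^a]*b", input));  // assssb
        // Anchored at both ends, [^a]* would have to cross the second a, which it cannot.
        Console.WriteLine(FirstMatch("^a[^a]*b$", input)); // <no match>
    }
}
```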

 

Recommended materials:

Getting started with regular expressions in 30 minutes
http://manual.phpv.net/regular_expression.html
C# Regular Expression preparation
http://www.cnblogs.com/kissknife/archive/2008/03/23/1118423.html
How does a regular expression match double quotation marks?
http://www.cnblogs.com/twh/articles/1629752.html
