Regular Expression Learning Materials in C#

Source: Internet
Author: User

Regular Expressions in C#

Jeffrey E.F. Friedl wrote a book about regular expressions, Mastering Regular Expressions. The author made up a story to help readers better understand and master regular expressions. The book is mainly based on Perl. As far as I know, regular expressions in C# are also based on Perl 5, so the two should have a lot in common. http://ike.126.com
In fact, I do not intend to translate the book as it is. First, I am not competent to translate it, because it contains too much content. Second, if I translated the book and changed the code inside to C#, I might be suspected of infringement, since I do not have the original author's permission. So please treat this as a set of reading notes.

Skipping the lengthy preface, we can go directly to Chapter 1:

Introduction to Regular Expressions

The author says this chapter is written for regular expression beginners and is intended to lay a solid foundation for later chapters. If you are not a beginner, you can skip this chapter.

Story scenario:
The head of your archive department wants a tool that checks for doubled words (such as "this this"), a problem that comes up often when large numbers of documents are edited. Your job is to create a solution that can:
Accept any number of files to check, report the lines with doubled words in each file, highlight the doubled words, and make sure the original file name and the lines appear in the report.
Check across lines: find a word at the end of one line that is repeated at the beginning of the next line.
Find doubled words regardless of case (such as "The the"), and allow different amounts of whitespace (spaces, tabs, newlines, etc.) between them.
Find doubled words even when they are separated by HTML tags. (For example: ... It is <B>very</B> very important.)

To solve the above problems, we must first write a regular expression that finds the text we want and ignores the text we don't need. Then we use our C# code to process the text it finds.

You may already know what a regular expression is. Even if you don't, you are almost certainly familiar with its basic concepts.
You know that report.txt is a specific file name, but if you have any Unix or DOS/Windows experience, you also know that "*.txt" can be used to select multiple files. This form of file name contains special characters: an asterisk means "match anything", and a question mark means "match one character". For example, "*.txt" selects every file whose name ends in .txt.
File name patterns use only a limited set of matching characters. Web search engines also support some special matching syntax for content searches. Regular expressions use a much richer set of matching characters to handle all kinds of complex problems.

First, we will introduce two location match characters:
^: Start position of a line of text
$: End position of a line of text

For example, the expression "^Cat" matches the word Cat only when it appears at the beginning of a line. Note that ^ matches a position, not a character.
Similarly, the expression "Cat$" matches the word Cat only when it appears at the end of a line.
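The two anchors can be tried out with a minimal C# sketch like the one below. The class and method names are made up for illustration. One thing to keep in mind: in .NET, ^ and $ refer to the start and end of the whole input string unless RegexOptions.Multiline is passed, in which case they also match at each line break.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // True when the input begins with "Cat".
    public static bool StartsWithCat(string input) => Regex.IsMatch(input, "^Cat");

    // True when the input ends with "Cat".
    public static bool EndsWithCat(string input) => Regex.IsMatch(input, "Cat$");

    // Counts the lines that begin with "Cat"; Multiline lets ^ match after each newline.
    public static int CountLinesStartingWithCat(string text) =>
        Regex.Matches(text, "^Cat", RegexOptions.Multiline).Count;

    public static void Main()
    {
        Console.WriteLine(StartsWithCat("Cat on a mat")); // True
        Console.WriteLine(StartsWithCat("A Cat"));        // False
        Console.WriteLine(EndsWithCat("A Cat"));          // True
        Console.WriteLine(CountLinesStartingWithCat("Cat one\nDog\nCat two")); // 2
    }
}
```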

Next, we introduce square brackets "[]", which match exactly one of the characters listed inside the brackets. For example:
The expression "[0123456789]" matches any single digit from 0 to 9.
As another example, if we want to find every grey or gray in the text, the expression can be written as: "gr[ea]y"
[ea] means matching either e or a, not the whole sequence ea.

If we want to match the <H1> <H2> <H3> <H4> <H5> <H6> tags in HTML, we can write the expression:
"<H[123456]>". But what if we want to match any one of a large set of characters? Do we have to write every character inside the square brackets? Fortunately not: we introduce the range symbol "-".
To use the range symbol, we only need to give the boundary characters of the range. The preceding HTML example can be written as: "<H[1-6]>"
The expression "[0-9a-zA-Z]" should be clear now: it matches one digit, one of the 26 lower-case letters, or one of the 26 upper-case letters.

The "^" symbol inside []
If you see the expression "[^0-9]", "^" is no longer the position symbol described above. At the start of a bracket expression it is a negation symbol indicating exclusion: the expression matches any single character that is not a digit from 0 to 9.
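Character classes, ranges, and negation can be sketched in C# as follows. The helper names and sample strings are invented for this illustration and do not come from the book.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharClassDemo
{
    // True for exactly "grey" or "gray": [ea] matches one character, e or a.
    public static bool IsGreyOrGray(string word) => Regex.IsMatch(word, "^gr[ea]y$");

    // Counts opening <H1> .. <H6> tags using a range inside the brackets.
    public static int CountHeadingTags(string html) => Regex.Matches(html, "<H[1-6]>").Count;

    // Removes every character that is not a digit, using a negated class.
    public static string KeepDigits(string text) => Regex.Replace(text, "[^0-9]", "");

    public static void Main()
    {
        Console.WriteLine(IsGreyOrGray("grey"));   // True
        Console.WriteLine(IsGreyOrGray("gray"));   // True
        Console.WriteLine(IsGreyOrGray("greay"));  // False
        Console.WriteLine(CountHeadingTags("<H1>Title</H1><H7>x</H7><H3>y</H3>")); // 2
        Console.WriteLine(KeepDigits("Tel: 555-0100")); // 5550100
    }
}
```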

Exercise 1: given the expression "q[^u]", which of the following words will be matched?
Iraqi
Iraqian
Miqra
Qasida
Qintar
Qoph
Zaqqum
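If you want to check your answer, a small C# sketch along these lines prints the result for each word exactly as it is spelled in the list above. The class and method names are hypothetical. Two details are worth keeping in mind when reading the output: matching is case-sensitive by default, so an upper-case Q is not matched by "q", and "[^u]" must match some character, so a q at the very end of a word does not match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NegatedClassDemo
{
    // True when the word contains a lower-case q followed by any character other than u.
    public static bool HasQNotFollowedByU(string word) => Regex.IsMatch(word, "q[^u]");

    public static void Main()
    {
        string[] words = { "Iraqi", "Iraqian", "Miqra", "Qasida", "Qintar", "Qoph", "Zaqqum" };
        foreach (string word in words)
        {
            Console.WriteLine($"{word}: {HasQNotFollowedByU(word)}");
        }
    }
}
```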


Besides character ranges, there is also the dot character ".". A dot in an expression matches any single character.
For example, the expression "07.04.76" will match
07/04/76, 07-04-76, and 07.04.76.
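A short C# sketch of the dot, with illustrative names that are not from the book. It also shows a side effect of the dot that is easy to overlook: because it matches any character, a string with digits in place of the separators matches too.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotDemo
{
    // Anchored so that the whole input must have the shape 07?04?76,
    // where ? stands for any single character.
    public static bool LooksLikeDate(string input) => Regex.IsMatch(input, "^07.04.76$");

    public static void Main()
    {
        Console.WriteLine(LooksLikeDate("07/04/76")); // True
        Console.WriteLine(LooksLikeDate("07-04-76")); // True
        Console.WriteLine(LooksLikeDate("07.04.76")); // True
        Console.WriteLine(LooksLikeDate("07304576")); // True: the dots matched the digits 3 and 5
        Console.WriteLine(LooksLikeDate("070476"));   // False: no characters where the dots are
    }
}
```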

If you need to choose among several alternatives, you can use the alternation character "|":
The alternation character means "or". For example, the expression "Bob|Robert" matches Bob or Robert. (Do not put it in square brackets: "[Bob|Robert]" would be a character class that matches a single character.)
Now look again at the expression we mentioned above: "gr[ea]y". With alternation we can write "grey|gray", which matches the same words.
Use of parentheses: parentheses are also metacharacters in expressions. For example, the preceding expression can be written as "gr(e|a)y". The parentheses are required: without them, the expression "gre|ay" matches gre or ay, which is not the expected result. If this is not quite clear, take a look at the following example:
Find all lines in an email that start with "From:", "Subject:", or "Date:". Compare the following two expressions:
Expression 1: "^From|Subject|Date:"
Expression 2: "^(From|Subject|Date):"
Which one is what we want?
Obviously, expression 1 does not give the result we want. It matches "From" at the start of a line, or "Subject" anywhere, or "Date:" anywhere. Expression 2 uses parentheses to limit the scope of the alternation, which satisfies our needs.
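The difference between the two expressions can be seen in a small C# sketch. The sample lines and all names here are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AlternationDemo
{
    // Expression 1: three separate alternatives; only "From" is tied to the line start.
    public const string Ungrouped = "^From|Subject|Date:";

    // Expression 2: the parentheses limit the alternation, so ^ and : apply to all three.
    public const string Grouped = "^(From|Subject|Date):";

    public static bool Matches(string line, string pattern) => Regex.IsMatch(line, pattern);

    public static void Main()
    {
        string bodyLine = "Re: the Subject we discussed";
        Console.WriteLine(Matches(bodyLine, Ungrouped));         // True: "Subject" appears somewhere
        Console.WriteLine(Matches(bodyLine, Grouped));           // False: not a header line
        Console.WriteLine(Matches("Subject: hello", Grouped));   // True
        Console.WriteLine(Matches("Date: 07/04/76", Grouped));   // True
    }
}
```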

Word boundary
We can already match text at the beginning and end of a line. What if we want to anchor a match somewhere other than the beginning or end of a line? For that we need the word boundary symbol, "\b". The backslash cannot be omitted, otherwise the expression simply matches the letter b. With the word boundary symbol, we can require that a match position falls at the beginning or end of a word rather than in the middle of one. For example, the expression "\bis\b" matches the word "is" in the string "This is a cat." but not the "is" inside the word "This".
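A minimal C# sketch of the word boundary, assuming the same sample sentence as above (the class and method names are hypothetical). It reports where each match starts so that the effect of \b is visible.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBoundaryDemo
{
    // Returns the start index of every match of the pattern in the text.
    public static int[] MatchPositions(string text, string pattern)
    {
        MatchCollection matches = Regex.Matches(text, pattern);
        int[] positions = new int[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            positions[i] = matches[i].Index;
        }
        return positions;
    }

    public static void Main()
    {
        string text = "This is a cat.";
        // Without boundaries, "is" is found inside "This" (index 2) and as a word (index 5).
        Console.WriteLine(string.Join(", ", MatchPositions(text, "is")));       // 2, 5
        // With boundaries, only the standalone word is found.
        Console.WriteLine(string.Join(", ", MatchPositions(text, @"\bis\b"))); // 5
    }
}
```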

String boundary symbols
In addition to the positional symbols above, if we want to match against the entire string (which may include multiple words), we can use the following two symbols:
\A: the start of the string;
\z: the end of the string.
The expression "\AThis is a cat\z" matches the string "This is a cat".
An important concept when using boundary symbols is the word character: a character that can be part of a word, roughly any character in [a-zA-Z0-9_] (in .NET, \w also covers letters and digits from other scripts). The period is not a word character, so in the sentence "This is a cat." there is a word boundary between "cat" and the period, and a word matched there does not include the period.
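The string boundary symbols can be sketched in C# as below. The names are illustrative only, and the second helper uses a "\bcat\b" pattern that is an assumption added here to show that a matched word excludes the trailing period.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StringBoundaryDemo
{
    // True only when the whole input is exactly "This is a cat".
    public static bool IsExactSentence(string input) =>
        Regex.IsMatch(input, @"\AThis is a cat\z");

    // Returns the text matched by \bcat\b, or an empty string when there is no match.
    public static string FindCatWord(string input) =>
        Regex.Match(input, @"\bcat\b").Value;

    public static void Main()
    {
        Console.WriteLine(IsExactSentence("This is a cat"));     // True
        Console.WriteLine(IsExactSentence("This is a cat."));    // False: the period comes before the end
        Console.WriteLine(IsExactSentence("So: This is a cat")); // False: text before the start
        Console.WriteLine(FindCatWord("This is a cat."));        // cat
    }
}
```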


Repetition counts
Look at the expression "colou?r". It contains a question mark, which we have not met yet in regular expressions (and which does not mean the same thing as the question mark in file name patterns). It specifies how many times the character before it may be repeated: "?" means 0 times or 1 time. The question mark in this expression means that u can appear 0 or 1 time, so the expression matches "color" or "colour".
The other repetition symbols are:
+: one or more times.
*: zero or more times.
For example, to represent one or more spaces, we can write the expression " +" (a space followed by a plus sign).

What if we need a specific number of repetitions? For that we introduce the symbol {}.
{n}: n is a specific number; the preceding item is repeated exactly n times.
{n,m}: the preceding item is repeated at least n times and at most m times.
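The repetition symbols can be tried with a C# sketch such as the following. All helper names are hypothetical, and the digit patterns are extra examples chosen here to show {n} and {n,m}.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // ? makes the u optional: matches "color" and "colour".
    public static bool IsColor(string word) => Regex.IsMatch(word, "^colou?r$");

    // {4} requires exactly four digits.
    public static bool IsFourDigits(string s) => Regex.IsMatch(s, "^[0-9]{4}$");

    // {2,3} requires at least two and at most three digits.
    public static bool IsTwoOrThreeDigits(string s) => Regex.IsMatch(s, "^[0-9]{2,3}$");

    // " +" matches one or more spaces; each run is replaced by a single space.
    public static string CollapseSpaces(string text) => Regex.Replace(text, " +", " ");

    public static void Main()
    {
        Console.WriteLine(IsColor("color"));           // True
        Console.WriteLine(IsColor("colour"));          // True
        Console.WriteLine(IsColor("colouur"));         // False
        Console.WriteLine(IsFourDigits("1234"));       // True
        Console.WriteLine(IsTwoOrThreeDigits("1234")); // False
        Console.WriteLine(CollapseSpaces("a   b  c")); // a b c
    }
}
```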

These symbols apply only to the single character immediately before them. But what if you want to repeat several characters, such as a whole word? We use parentheses again. Earlier we used parentheses to limit the scope of alternation; here is another use, which is to form a group. In the expression "(this)", for example, this is a group. That solves the problem easily: a repetition symbol placed after a group specifies how many times the whole group is repeated.

Now let's go back to the problem of searching for doubled words. If we want to find "the the", we can write the following expression with what we have learned so far:
"\bthe +the\b"
The expression says that the two occurrences of the are separated by one or more spaces.
Similarly, we can also write:
"\b(the +){2}"

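The two "the the" expressions above can be compared in a C# sketch like this one (names invented for illustration). One difference shows up when you run it: in the grouped form the spaces are inside the repeated group, so the second the must also be followed by at least one space.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupRepeatDemo
{
    // "the", one or more spaces, "the", with word boundaries on both sides.
    public static bool HasDoubledThe(string text) =>
        Regex.IsMatch(text, @"\bthe +the\b");

    // The group "the followed by spaces" repeated exactly twice.
    public static bool HasDoubledTheGrouped(string text) =>
        Regex.IsMatch(text, @"\b(the +){2}");

    public static void Main()
    {
        Console.WriteLine(HasDoubledThe("Paris in the the spring"));        // True
        Console.WriteLine(HasDoubledTheGrouped("Paris in the the spring")); // True
        Console.WriteLine(HasDoubledThe("in the theory"));                  // False: "theory" is a longer word
        Console.WriteLine(HasDoubledThe("see the the"));                    // True
        Console.WriteLine(HasDoubledTheGrouped("see the the"));             // False: no space after the second "the"
    }
}
```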
But what if we want to find every possible doubled word? Our knowledge so far is not enough to solve this problem, so next we introduce the concept of the backreference. We have seen that parentheses mark the boundary of a group, and that an expression can contain several groups delimited by parentheses. Groups are numbered by default in the order in which they appear: the first group is number 1, and so on. Later in the expression, you can refer back to a group with a backslash followed by its number, for example "\1" for group 1. A backreference works like a variable in a program. Let's look at a concrete example:
The previous doubled-word expression can be written as:
"\b(the) +\1\b"
Now, if we want to match all doubled words, we can rewrite the expression as:
"\b([a-zA-Z]+) +\1\b"

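As a rough sketch of how the backreference expression could be used from C#, the code below collects the doubled words from a sample sentence. The class name, method name, sample text, and the optional RegexOptions parameter are all assumptions made for this illustration; passing RegexOptions.IgnoreCase is one possible way to approach the "The the" requirement from the story, not necessarily the book's solution.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DoubledWordDemo
{
    // A word, one or more spaces, then the same word again (\1 refers to group 1).
    private const string Pattern = @"\b([a-zA-Z]+) +\1\b";

    // Returns the doubled word (the text captured by group 1) for every match.
    public static string[] FindDoubledWords(string text, RegexOptions options = RegexOptions.None)
    {
        MatchCollection matches = Regex.Matches(text, Pattern, options);
        string[] words = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            words[i] = matches[i].Groups[1].Value;
        }
        return words;
    }

    public static void Main()
    {
        string text = "This is is a test test of the the tool";
        Console.WriteLine(string.Join(", ", FindDoubledWords(text))); // is, test, the

        // Matching is case-sensitive by default, so "The the" is found only with IgnoreCase.
        Console.WriteLine(FindDoubledWords("The the end").Length);                          // 0
        Console.WriteLine(FindDoubledWords("The the end", RegexOptions.IgnoreCase).Length); // 1
    }
}
```

Note that "This is" is not reported: the leading \b prevents the match from starting in the middle of "This".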
The last question is: what if we want to match a character that has a special meaning in regular expressions? The answer is to use the escape character "\". For example, to match a literal period (such as a decimal point), write "\.". Note that when the expression is used in a program, each "\" must be written as "\\" according to the string rules, or the string must be prefixed with @.
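The two ways of writing a backslash in C# source can be compared with a short sketch. The names and the date pattern are illustrative; Regex.Escape is a standard .NET method that escapes the metacharacters in a literal string for you.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Verbatim string: the backslashes are written exactly as in the regular expression.
    public const string VerbatimPattern = @"^07\.04\.76$";

    // Regular string: every backslash in the expression has to be doubled.
    public const string RegularPattern = "^07\\.04\\.76$";

    public static bool IsDottedDate(string input) => Regex.IsMatch(input, VerbatimPattern);

    public static void Main()
    {
        Console.WriteLine(VerbatimPattern == RegularPattern); // True: both spell the same expression
        Console.WriteLine(IsDottedDate("07.04.76"));          // True
        Console.WriteLine(IsDottedDate("07/04/76"));          // False: \. matches only a literal period
        Console.WriteLine(Regex.Escape("1.5"));               // 1\.5
    }
}
```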

This chapter only gives beginners some basic knowledge about regular expressions, and it is only a small part of the subject. There is still a lot to learn, which later chapters will cover. Learning regular expressions is not actually difficult, but mastering them takes patience and practice. Someone may say, "I don't want to know the details of a car. I just want to learn how to drive." If you think the same way, you will never know how to use regular expressions to solve your own problems, and you will never understand their real power.
