Regular Expression Learning Materials in C#: Regular Expressions

Last Update:2017-01-18 Source: Internet

Author: User

Regular Expressions in C#

Jeffrey E. F. Friedl wrote a book on regular expressions, "Mastering Regular Expressions." To help readers understand and master regular expressions, the author invents a story to work through. The book's examples are mainly in Perl. As far as I know, regular expressions in C# are also based on Perl 5, so the two should have a lot in common. http://ike.126.com
I do not intend to translate the book as-is. First, the book is too long, and I am not qualified to translate it. Second, if I translated it and converted the code inside to C# without the original author's permission, it could be considered infringement. So please treat this simply as reading notes.

After that lengthy preface, we can go straight to Chapter One:

Introduction to regular expressions

The author says this chapter is written for absolute beginners and aims to lay a solid foundation for later chapters. If you are not a beginner, you can skip it.

Story scenario:
The head of your documentation department wants a tool that checks for doubled words (such as "this this"), a problem that comes up often when documents are heavily edited. Your job is to create a solution that will:
Accept any number of files to check, report each line of each file that contains doubled words, highlight the doubled words, and make sure the source file name appears with each line in the report.
Work across lines, so that a word at the end of one line that is repeated at the beginning of the next line is found.
Find doubled words regardless of capitalization (e.g. "the The"), and allow any amount of whitespace (spaces, tabs, newlines, and so on) between the words.
Find doubled words even when they are separated by HTML tags (for example: ...it is <B>very</B> very important...).

To solve these practical problems, we first need to write regular expressions that find the text we want and ignore the text we don't, and then use C# code to process the text that was matched.

Even before you have used regular expressions, you probably already have some idea of what one is. Even if you don't, you are almost certainly familiar with the basic concept.
You know that report.txt is a specific file name, but if you have any UNIX or DOS/Windows experience, you also know that "*.txt" can be used to select multiple files. In this kind of file-name pattern, some characters have special meanings: an asterisk matches anything, and a question mark matches exactly one character. For example, "*.txt" selects any file whose name ends in .txt.
File-name patterns are a form of pattern matching, but with only a few special characters. Some web search engines also allow content searches using a handful of matching characters. Regular expressions provide a much richer set of matching characters and can handle a wide variety of complex problems.

First, we introduce two position-matching characters:
^: Represents the start of a line of text
$: Represents the end of a line of text

For example, the expression "^cat" matches cat only when it appears at the beginning of a line. Note that ^ is a position character; it does not match any character itself.
Similarly, the expression "cat$" matches cat only when it appears at the end of a line.
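
As a minimal sketch of the two position characters in C#, the example below uses the .NET Regex class. The class and method names (AnchorDemo, LineStartsWithCat, LineEndsWithCat) are made up for illustration. One .NET detail that the text above does not mention: by default ^ and $ refer to the start and end of the whole string, so the sketch passes RegexOptions.Multiline to get the per-line behavior described here.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // True when some line of the text starts with "cat".
    public static bool LineStartsWithCat(string text) =>
        Regex.IsMatch(text, "^cat", RegexOptions.Multiline);

    // True when some line of the text ends with "cat".
    public static bool LineEndsWithCat(string text) =>
        Regex.IsMatch(text, "cat$", RegexOptions.Multiline);

    public static void Main()
    {
        string text = "cat on the mat\nthe mat has a cat";
        Console.WriteLine(LineStartsWithCat(text));     // True (first line)
        Console.WriteLine(LineEndsWithCat(text));       // True (second line)
        Console.WriteLine(LineStartsWithCat("a cat"));  // False

        // Without RegexOptions.Multiline, ^ only matches at the very
        // start of the string, so the second line is not considered.
        Console.WriteLine(Regex.IsMatch("dog\ncat", "^cat")); // False
    }
}
```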

Next, we describe the square brackets "[]", which match any one of the characters listed inside the brackets. For example:
The expression "[0123456789]" matches any single digit from 0 through 9.
As another example, suppose we are looking for text that contains gray or grey. The expression can be written as: "gr[ea]y"
Here [ea] matches either e or a, not the two-character sequence ea.
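
The bracket expression can be tried out with a short sketch like the one below. The helper name FindGrayGrey and the sample sentence are invented for this illustration; only Regex.Matches comes from .NET.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CharClassDemo
{
    // Returns every occurrence of "gray" or "grey" found in the text.
    public static List<string> FindGrayGrey(string text)
    {
        var found = new List<string>();
        foreach (Match m in Regex.Matches(text, "gr[ea]y"))
        {
            found.Add(m.Value);
        }
        return found;
    }

    public static void Main()
    {
        // "greay" is not matched: [ea] stands for exactly one character.
        var hits = FindGrayGrey("A grey cat on a gray mat; greay is a typo.");
        Console.WriteLine(string.Join(", ", hits)); // grey, gray
    }
}
```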

If we want to match the <H1>, <H2>, <H3>, <H4>, <H5>, and <H6> tags in HTML, we can write the expression:
"<H[123456]>". But what if we want to match any one of a large set of characters? Do we have to write every character inside the square brackets? Luckily not: inside brackets we can use the range symbol "-".
With range notation we only need to give the boundary characters of a range. The HTML example above can be written as: "<H[1-6]>"
Is the meaning of the expression "[0-9a-zA-Z]" clear now? It matches any one character that is a digit, one of the 26 lowercase letters, or one of the 26 uppercase letters.
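
A small sketch of the range notation follows. The method names and sample inputs are assumptions made for the example; matching is case-sensitive by default, so only upper-case H tags are counted, as in the text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RangeDemo
{
    // Counts opening heading tags <H1> through <H6>.
    public static int CountHeadingTags(string html) =>
        Regex.Matches(html, "<H[1-6]>").Count;

    // Counts characters that are ASCII digits or letters.
    public static int CountAlphanumerics(string text) =>
        Regex.Matches(text, "[0-9a-zA-Z]").Count;

    public static void Main()
    {
        // </H1>, </H3> and <H7> do not fit the pattern.
        Console.WriteLine(CountHeadingTags("<H1>Title</H1><H3>Sub</H3><H7>x</H7>")); // 2

        // Matches a, 1, B and 2; the hyphen, space and underscore are skipped.
        Console.WriteLine(CountAlphanumerics("a-1 B_2")); // 4
    }
}
```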

The "^" symbol inside []
In an expression such as "[^0-9]", the "^" is no longer a position symbol. As the first character inside brackets it is a negation symbol, meaning exclusion: the expression above matches any one character that is not a digit from 0 to 9.

Exercise 1: What does the expression "q[^u]" mean? Which of the following words will it match?
Iraqi
Iraqian
miqra
qasida
qintar
qoph
zaqqum
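
One way to check the exercise is to run the words through a sketch like the following, assuming the expression is "q[^u]" and matching is case-sensitive. The last three words (Iraq, Qantas, quit) are not in the list above; they are added here only for contrast. The helper name is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NegatedClassDemo
{
    // True when the word contains a lowercase q followed by
    // some character other than u.
    public static bool HasQNotFollowedByU(string word) =>
        Regex.IsMatch(word, "q[^u]");

    public static void Main()
    {
        string[] words =
        {
            "Iraqi", "Iraqian", "miqra", "qasida", "qintar", "qoph", "zaqqum",
            "Iraq", "Qantas", "quit"
        };
        foreach (string word in words)
        {
            Console.WriteLine(word + ": " + HasQNotFollowedByU(word));
        }
        // The first seven print True. The last three print False:
        //   Iraq   - nothing follows the q, and [^u] must match a character
        //   Qantas - the only Q is uppercase
        //   quit   - the q is followed by u
    }
}
```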

In addition to character classes, there is the dot character ".", which matches any single character (by default, any character except a newline).
For example, the expression "07.04.76" will match strings such as:
07/04/76, 07-04-76, 07.04.76.
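
Because the dot accepts any character, the expression is looser than it may look. The sketch below compares it with a stricter variant that uses a character class for the separators; the stricter pattern and the method names are assumptions added for illustration, not something the text prescribes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotDemo
{
    // The expression from the text: each dot matches any one character.
    public static bool LooseDate(string s) => Regex.IsMatch(s, "07.04.76");

    // Stricter variant: only '-', '.' or '/' may separate the numbers.
    public static bool StrictDate(string s) => Regex.IsMatch(s, "07[-./]04[-./]76");

    public static void Main()
    {
        foreach (string s in new[] { "07/04/76", "07-04-76", "07.04.76", "07104276" })
        {
            Console.WriteLine(s + ": loose=" + LooseDate(s) + ", strict=" + StrictDate(s));
        }
        // The three dates match both patterns.
        // "07104276" matches only the loose one, because the dots
        // happily match the digits 1 and 2.
    }
}
```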

If we need to choose among several alternatives, we can use the alternation character "|":
It has the meaning of "or". For example, the expression "Bob|Robert" matches either Bob or Robert.
Now look at the expression we mentioned earlier, "gr[ea]y". Using alternation we can write it as "grey|gray"; the two are equivalent.
Use of parentheses: parentheses are also metacharacters. The previous expression can be written as "gr(e|a)y". The parentheses are necessary here: without them, the expression "gre|ay" would match gre or ay, which is not the result we want. If this is not quite clear, take a look at the following example:
To find all the lines in an e-mail message that begin with From:, Subject:, or Date:, compare the following two expressions:
Expression 1: "^From|Subject|Date: "
Expression 2: "^(From|Subject|Date): "
Which one is what we want?
Expression 1 is clearly not what we want: it matches From at the beginning of a line, or Subject anywhere, or "Date: " anywhere. Expression 2 uses parentheses to limit the alternation, so it meets our needs.
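
The difference between the two expressions shows up clearly when both are run against a few sample lines. This is a rough sketch; the sample lines and the names Loose and Strict are invented for the comparison.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AlternationDemo
{
    // Expression 1: "^From" OR "Subject" OR "Date: ".
    public static bool Loose(string line) =>
        Regex.IsMatch(line, "^From|Subject|Date: ");

    // Expression 2: the parentheses limit the alternation, so ^ and ": "
    // apply to all three header names.
    public static bool Strict(string line) =>
        Regex.IsMatch(line, "^(From|Subject|Date): ");

    public static void Main()
    {
        string[] lines =
        {
            "From: someone@example.com", // loose=True,  strict=True
            "Subject: hello",            // loose=True,  strict=True
            "Re: the Subject line",      // loose=True,  strict=False
            "Fromage is cheese",         // loose=True,  strict=False
            "Due Date: Friday"           // loose=True,  strict=False
        };
        foreach (string line in lines)
        {
            Console.WriteLine(line + " -> loose=" + Loose(line) + ", strict=" + Strict(line));
        }
    }
}
```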

Word boundaries
We can already match characters at the beginning and end of a line, but what if we want to anchor a match somewhere else? For that we introduce the word boundary symbol "\b". The backslash cannot be omitted; without it the expression would match the letter b. With the word boundary symbol we can require a match to start or end at the edge of a word rather than in the middle of one. For example, in the string "This is a cat." the expression "\bis\b" matches the word "is" but not the "is" inside the word "This".
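
As an illustration, the following sketch counts matches with and without the word boundary symbol. The helper name CountMatches is hypothetical; the pattern is written as a verbatim string (prefixed with @) so that the backslash reaches the regex engine unchanged.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBoundaryDemo
{
    // Counts how many times the pattern matches in the text.
    public static int CountMatches(string text, string pattern) =>
        Regex.Matches(text, pattern).Count;

    public static void Main()
    {
        string text = "This is a cat.";

        // Plain "is" matches inside "This" and as the separate word.
        Console.WriteLine(CountMatches(text, "is"));      // 2

        // With word boundaries only the separate word matches.
        Console.WriteLine(CountMatches(text, @"\bis\b")); // 1
        Console.WriteLine(Regex.Match(text, @"\bis\b").Index); // 5
    }
}
```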

String boundary symbols
In addition to the position symbols above, if we want to match an entire string (including multiple words), we can use the following two symbols:
\A: Represents the beginning of a string;
\z: Represents the end of a string.
The expression "\AThis is a cat\z" matches the string "This is a cat".
An important concept behind word boundaries is the word character: a character that can form part of a word. Word characters are roughly those in [a-zA-Z0-9_] (in .NET, other Unicode letters and digits count as well). A period is not a word character, so the expression "\bThis is a cat\b" also finds a match in the sentence "This is a cat.", and the matched text does not include the period. The expression with \A and \z does not match that sentence, because the period sits between "cat" and the end of the string.
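
The sketch below contrasts string boundaries with word boundaries on the cat sentence. The method names are made up; the last line shows a .NET detail not covered in the text, namely that $ (unlike \z) also matches just before a final newline.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StringBoundaryDemo
{
    // True only when the whole string is exactly "This is a cat".
    public static bool IsExactSentence(string s) =>
        Regex.IsMatch(s, @"\AThis is a cat\z");

    // Returns the text matched between word boundaries, or "" if none.
    public static string FindSentence(string s) =>
        Regex.Match(s, @"\bThis is a cat\b").Value;

    public static void Main()
    {
        Console.WriteLine(IsExactSentence("This is a cat"));   // True
        Console.WriteLine(IsExactSentence("This is a cat."));  // False (period before the end)
        Console.WriteLine(FindSentence("This is a cat."));     // This is a cat

        // \z means the very end; $ tolerates one trailing newline.
        Console.WriteLine(IsExactSentence("This is a cat\n"));                  // False
        Console.WriteLine(Regex.IsMatch("This is a cat\n", "^This is a cat$")); // True
    }
}
```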

Repetition symbols
Look at the expression "colou?r". It contains a question mark, which we have not seen before (and which is different from the question mark used in file-name patterns). It specifies how many times the item just before it may be repeated: "?" means 0 or 1 times. In the expression above, the question mark means u may appear 0 or 1 times, so the expression matches "color" or "colour".
Here are the other repetition symbols:
+: Indicates 1 or more times
*: Indicates 0 or more times
For example, to match one or more spaces we can write the expression " +" (a space followed by a plus sign).

What if you want to specify an exact number of repetitions? For that we introduce braces {}.
{n}: n is a specific number; the preceding item is repeated exactly n times.
{n,m}: the preceding item is repeated at least n times and at most m times.
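
The repetition symbols can be sketched in C# as follows. The three helper methods and their inputs are assumptions chosen for the example; ^ and $ are added so that the whole input has to fit the pattern.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // "?" makes the u optional: accepts color and colour only.
    public static bool IsColorWord(string word) =>
        Regex.IsMatch(word, "^colou?r$");

    // " +" matches a run of one or more spaces; replace each run with one space.
    public static string CollapseSpaces(string text) =>
        Regex.Replace(text, " +", " ");

    // {2,4}: at least two and at most four digits.
    public static bool IsTwoToFourDigits(string s) =>
        Regex.IsMatch(s, "^[0-9]{2,4}$");

    public static void Main()
    {
        Console.WriteLine(IsColorWord("color"));    // True
        Console.WriteLine(IsColorWord("colour"));   // True
        Console.WriteLine(IsColorWord("colouur"));  // False

        Console.WriteLine(CollapseSpaces("a   b  c")); // a b c

        Console.WriteLine(IsTwoToFourDigits("7"));     // False
        Console.WriteLine(IsTwoToFourDigits("2017"));  // True
        Console.WriteLine(IsTwoToFourDigits("20170")); // False
    }
}
```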

These symbols all apply to the single character just before them. But what if you want to repeat several characters, such as a whole word? We use parentheses again. Earlier, parentheses limited the scope of an alternation; here is their other use, which is to form a group. For example, in the expression "(this)" the word is a group. That solves the problem: a repetition symbol placed after a group specifies how many times the whole group is repeated.

Now back to the problem of finding doubled words. If we want to find "the the", then based on what we have learned so far we can write the expression:
"\bthe +the\b"
This expression matches two occurrences of "the" separated by one or more spaces.
Similarly, we can also write:
"\b(the +){2}"
Note that the second form also requires at least one space after the second "the".

But what if we want to find every possible doubled word? What we know so far is not enough, so here we introduce the concept of a backreference. We have seen that parentheses delimit a group, and an expression can contain several groups. Groups are numbered by default in the order in which they appear: the first group is number 1, and so on. A backreference lets a later part of the expression refer to the text matched by a group, written as "\" followed by the group number (for example "\1"). A backreference is like a variable in a program. Let's look at a specific example.
Using a backreference, the earlier doubled-word expression can be written as:
"\b(the) +\1\b"
Now, if we want to match all doubled words, we can rewrite the expression as:
"\b([a-zA-Z]+) +\1\b"

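The doubled-word expression can be tried out with a sketch along these lines. The class and method names are hypothetical. The ignoreCase flag uses RegexOptions.IgnoreCase, which is an addition beyond the expressions in this chapter, included because the original task asks for doubled words that differ in capitalization.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DoubledWordDemo
{
    // Returns every doubled word found on a single line of text.
    // Group 1 captures a word; \1 requires the same text again.
    public static List<string> Find(string text, bool ignoreCase)
    {
        RegexOptions options = ignoreCase ? RegexOptions.IgnoreCase : RegexOptions.None;
        var found = new List<string>();
        foreach (Match m in Regex.Matches(text, @"\b([a-zA-Z]+) +\1\b", options))
        {
            found.Add(m.Value);
        }
        return found;
    }

    public static void Main()
    {
        string text = "This this is is a test of the the parser";

        // Case-sensitive: "This this" is not reported.
        Console.WriteLine(string.Join(" | ", Find(text, false))); // is is | the the

        // Case-insensitive: the backreference also ignores case.
        Console.WriteLine(string.Join(" | ", Find(text, true)));  // This this | is is | the the

        // The trailing \b keeps "the theory" from being reported.
        Console.WriteLine(Find("the theory", true).Count);        // 0
    }
}
```

This sketch only handles spaces between the two words; covering tabs, line breaks, and HTML tags, as the story scenario requires, needs more than the expressions introduced so far.
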
The last question is: what if the character we want to match is itself a regular expression symbol? The answer is to use the escape symbol "\". For example, to match a decimal point, write "\.". Also note that when you use an expression in your program, the C# string rules apply to "\" as well: either write it as "\\" or prefix the string literal with @.
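
The two ways of writing the backslash in C# can be compared in a short sketch. The names EscapedDot, VerbatimDot, and HasDecimalPoint are invented for the example. Regex.Escape is a .NET helper not mentioned above; it is shown here as one possible way to escape a whole piece of literal text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Regular string literal: the backslash itself must be doubled.
    public const string EscapedDot = "\\.";

    // Verbatim string literal: backslashes are taken as written.
    public const string VerbatimDot = @"\.";

    // True when the text contains a digit, a literal dot, and a digit.
    public static bool HasDecimalPoint(string text) =>
        Regex.IsMatch(text, @"[0-9]\.[0-9]");

    public static void Main()
    {
        Console.WriteLine(EscapedDot == VerbatimDot);  // True (same two characters)
        Console.WriteLine(HasDecimalPoint("3.14"));    // True
        Console.WriteLine(HasDecimalPoint("3x14"));    // False

        // Unescaped, the dot matches any character, including 'x'.
        Console.WriteLine(Regex.IsMatch("3x14", "[0-9].[0-9]")); // True

        // Used directly as a pattern, "+" is a repetition symbol.
        string literal = "1+1=2";
        Console.WriteLine(Regex.IsMatch(literal, literal));               // False
        Console.WriteLine(Regex.IsMatch(literal, Regex.Escape(literal))); // True
    }
}
```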

This chapter covers only the basics of regular expressions for beginners. It is just one part of the subject, and there is much more to learn, which later chapters will cover. Regular expressions are not hard to learn, but mastering them takes patience and practice. Someone may say, "I don't want to know the details of the car, I just want to learn how to drive." If you think that way, you will never learn how to use regular expressions to solve your own problems, and you will never understand their true power.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
