About Regular Expressions in C#

Source: Internet
Author: User

1 Preface

Most people have heard of regular expressions, but the usual first impression is that they are hard to learn: at first glance there seems to be no pattern to them, just a pile of special symbols that is completely unintelligible.

In fact, that is only because you do not understand them yet. Once you do, you will find there is not much to it. Few characters are actually involved, and they are not hard to remember or to understand. The only real difficulty is that, once combined, they are hard to read. This article aims to give everyone a basic understanding of regular expressions, so that you can read and write simple ones and cover the needs of everyday development.

0\d{2}-\d{8}|0\d{3}-\d{7} Let's start with a regular expression. If you do not know regular expressions, this string of characters probably means nothing to you. That doesn't matter: this article explains the meaning of each character in detail.

1.1 What is a regular expression

A regular expression is a special string pattern used to match a set of strings. It is like stamping out products with a mold: the regular expression is the mold that defines a rule, and the strings it matches are the ones that fit that rule.

1.2 Common regex matching tools

Online matching tools:

1 http://www.regexpal.com/

2 http://rubular.com/

Regex matching software:

McTracer

After trying several tools, I still find this one the best. It can export a regex to the corresponding language (Java, C#, JS, and so on) with the escaping done for you, so you can copy the result and use it directly. It can also explain a regular expression, for example which parts are capturing groups and which are greedy matches. In short, it is a pleasure to use.

2 A brief introduction to regex characters

2.1 Metacharacters

"^": matches the start of a line or string, and sometimes the start of the entire document.

"$": matches the end of a line or string.

With both anchors in place, the matched string must begin with "This", where not even a leading space is allowed, and must end with "Regex", with no trailing spaces or other characters.
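A minimal C# sketch of the two anchors. The pattern `^This.*Regex$` and the variable names are assumptions chosen for illustration, not taken from the article.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical pattern: the string must start with "This" and end with "Regex".
var anchored = new Regex(@"^This.*Regex$");

bool exact = anchored.IsMatch("This is Regex");          // starts and ends as required
bool leadingSpace = anchored.IsMatch(" This is Regex");  // leading space breaks ^This
bool trailingSpace = anchored.IsMatch("This is Regex "); // trailing space breaks Regex$

Console.WriteLine($"{exact} {leadingSpace} {trailingSpace}"); // True False False
```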

"\b": consumes no characters and matches only a position. It is often used to match word boundaries. For example, to match the separate word "is" in the string "This is Regex", write "\bis\b".

\b does not match the characters on either side of "is"; it only checks whether each side of the word is a boundary.
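The word-boundary behavior can be checked with a short sketch. The variable names are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "This is Regex";

// Without \b, "is" is also found inside "This".
int loose = Regex.Matches(input, "is").Count;

// With \b on both sides, only the standalone word matches.
MatchCollection words = Regex.Matches(input, @"\bis\b");

Console.WriteLine($"{loose} loose, {words.Count} whole word at index {words[0].Index}");
// 2 loose, 1 whole word at index 5
```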

"\d": matches a digit.

For example, to match a fixed-format phone number that starts with 0, has a 4-digit area code, and is followed by 7 digits, such as 0737-5686123, use: ^0\d\d\d-\d\d\d\d\d\d\d$. This is only to introduce the "\d" character; a better notation is described below.

"\w": matches letters, digits, and underscores.

For example, to match "A2345bcd__ttz", use: "\w+". The "+" here is a quantifier that specifies the number of repetitions; it is described in detail later.

"\s": matches whitespace.

For example, for the string "a b c", use: "\w\s\w\s\w", one character followed by one whitespace character each time. If there may be several spaces between the characters, change "\s" to "\s+" so that the whitespace can repeat.

".": matches any character other than a line break.

This is an enhanced version of "\w". "\w" cannot match spaces, so it falls short when the string contains them. Here "." can be used instead: to match the string "A23 4 5 B C D__ttz", use: ".+"
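A small sketch comparing "\w", "\s", and "." on the sample strings above. The variable names and the three-space test input are illustrative assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

// \w+ covers letters, digits, and underscores in one run.
string word = Regex.Match("A2345bcd__ttz", @"\w+").Value;

// \s+ tolerates one or more whitespace characters between the letters.
bool spaced = Regex.IsMatch("a  b   c", @"^\w\s+\w\s+\w$");

// \w+ stops at the first space, while .+ runs to the end of the line.
string wordOnly = Regex.Match("A23 4 5 B C D__ttz", @"\w+").Value;
string anyChar = Regex.Match("A23 4 5 B C D__ttz", @".+").Value;

Console.WriteLine(word);     // A2345bcd__ttz
Console.WriteLine(spaced);   // True
Console.WriteLine(wordOnly); // A23
Console.WriteLine(anyChar);  // A23 4 5 B C D__ttz
```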

"[abc]": a character class; matches any single character listed inside the brackets.

It matches only the characters that appear in the brackets. A range can also be written: for example, [a-z] matches a through z, which is equivalent to restricting input to lowercase English letters.
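A brief sketch of character classes and ranges. The names are illustrative. The last check shows a common pitfall with a mixed-case range such as [A-z], which also covers the ASCII punctuation between 'Z' and 'a'.

```csharp
using System;
using System.Text.RegularExpressions;

// [abc] matches one character from the listed set.
string firstHit = Regex.Match("xyzb", "[abc]").Value;

// A range restricted to lowercase English letters.
var lowercaseOnly = new Regex("^[a-z]+$");
bool lettersOk = lowercaseOnly.IsMatch("regex");
bool digitRejected = !lowercaseOnly.IsMatch("regex1");

// Pitfall: [A-z] is one ASCII range from 'A' to 'z', so '_' is inside it.
bool underscoreSlipsIn = Regex.IsMatch("_", "^[A-z]$");

Console.WriteLine($"{firstHit} {lettersOk} {digitRejected} {underscoreSlipsIn}");
// b True True True
```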

2.2 Negated forms

These are written simply by capitalizing the letter; the meaning is the opposite of the lowercase form. No examples are given here.

"\W" matches any character that is not a letter, digit, or underscore

"\S" matches any character that is not whitespace

"\D" matches any non-digit character

"\B" matches a position that is not the beginning or end of a word

"[^abc]" matches any character except a, b, and c

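The negated forms can be seen at work in a short sketch. The input string and variable names are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "ab 12_c";

string nonWord = Regex.Replace(input, @"\W", "#");   // only the space is a non-word character
string digitsOnly = Regex.Replace(input, @"\D", ""); // strip every non-digit
string masked = Regex.Replace(input, @"\S", "x");    // mask every non-whitespace character
string onlyAbc = Regex.Replace(input, "[^abc]", ""); // strip everything except a, b, c

// \B: "is" that does NOT start at a word boundary, i.e. the one inside "This".
int insideWord = Regex.Match("This is Regex", @"\Bis").Index;

Console.WriteLine(nonWord);    // ab#12_c
Console.WriteLine(digitsOnly); // 12
Console.WriteLine(masked);     // xx xxxx
Console.WriteLine(onlyAbc);    // abc
Console.WriteLine(insideWord); // 2
```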
2.3 Quantifiers

First, three important concepts related to quantifiers:

Greedy, for example "*". A greedy quantifier first tries to match as much of the string as it can. If the rest of the pattern then fails, it gives back one character and tries again. This process is called backtracking: it gives back one character at a time until a match is found or there are no characters left to give back. Because of this backtracking, greedy quantifiers can be the most resource-hungry of the three kinds.

Lazy (reluctant), written by appending "?" to a quantifier, for example "*?". A lazy quantifier works the other way round: it starts from the beginning of the target and takes as little as possible, adding one character at a time and checking whether the rest of the pattern matches, until it succeeds or reaches the end of the string.

Possessive, written by appending "+" to a quantifier, for example "*+", in engines that support it. A possessive quantifier takes as much of the string as it can and then tries to match, but it tries only once and never backtracks, like grabbing a handful of stones and then picking the gold out of it. .NET has no possessive quantifier syntax; the atomic group "(?>exp)" gives the same no-backtracking behavior.
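A rough sketch of the three behaviors in C#. The inputs and names are illustrative assumptions. Since .NET does not accept possessive quantifier syntax, the no-backtracking case is shown with an atomic group, `(?>...)`.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "x=1; y=2;";

// Greedy: .* runs to the end, then backtracks to the last ';'.
string greedy = Regex.Match(input, ".*;").Value;

// Lazy: .*? grows one character at a time and stops at the first ';'.
string lazy = Regex.Match(input, ".*?;").Value;

// Ordinary a* gives one 'a' back so that "ab" can still match.
bool withBacktracking = Regex.IsMatch("aaab", "a*ab");

// The atomic group keeps every 'a' it took, so the trailing "ab" can never match.
bool atomic = Regex.IsMatch("aaab", "(?>a*)ab");

Console.WriteLine(greedy);           // x=1; y=2;
Console.WriteLine(lazy);             // x=1;
Console.WriteLine(withBacktracking); // True
Console.WriteLine(atomic);           // False
```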

"*" (greedy) repeats 0 or more times

For example, for "aaaaaaaa", to match all the a's in the string use: "a*", which takes every "a" character.

"+" (greedy) repeats 1 or more times

For example, for "aaaaaaaa", to match all the a's in the string use: "a+", which also takes every "a" character. "a+" differs from "a*" in that "+" requires at least one occurrence, while "*" also accepts zero.

The difference becomes visible later, when these quantifiers are combined with "?".

"?" (greedy) repeats 0 or 1 times

For example, for "aaaaaaaa", the regex "a?" matches at most one character at a time, so the result is the single character "a".

"{n}" repeats exactly n times

For example, for "aaaaaaaa", to match a repeated 3 times use: "a{3}"; the result is the three characters "aaa".

"{n,m}" repeats n to m times

For example, the regex "a{3,4}" matches a repeated 3 or 4 times, so both "aaa" and "aaaa" can be matched.

"{n,}" repeats n or more times

The difference from {n,m} is that there is no upper limit on the number of repetitions, only a minimum of n. For example, the regex "a{3,}" requires a to occur at least 3 times.
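The quantifiers above can be compared side by side on the same input. This is only a sketch, and the variable names are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "aaaaaaaa"; // eight a's

string star = Regex.Match(input, "a*").Value;            // takes all eight
string plus = Regex.Match(input, "a+").Value;            // takes all eight
string optional = Regex.Match(input, "a?").Value;        // takes one
string exactly3 = Regex.Match(input, "a{3}").Value;      // takes three
string threeToFour = Regex.Match(input, "a{3,4}").Value; // greedy, so it takes four
string atLeast3 = Regex.Match(input, "a{3,}").Value;     // no upper bound: all eight

// With no 'a' in the input, a* still succeeds (empty match) while a+ fails.
bool starOnB = Regex.Match("bbb", "a*").Success;
bool plusOnB = Regex.Match("bbb", "a+").Success;

Console.WriteLine($"{star.Length} {plus.Length} {optional.Length}");            // 8 8 1
Console.WriteLine($"{exactly3.Length} {threeToFour.Length} {atLeast3.Length}"); // 3 4 8
Console.WriteLine($"{starOnB} {plusOnB}");                                      // True False
```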

With quantifiers, the phone number regex can now be simplified: ^0\d\d\d-\d\d\d\d\d\d\d$ can be changed to "^0\d+-\d{7}$".

This is still not perfect: the length of the area code is not limited, so any number of digits would be accepted, while area codes usually have only 3 or 4 digits.

So change it to "^0\d{2,3}-\d{7}$", and the area code part now matches 3 or 4 digits.
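The three versions of the phone number pattern behave as follows in a quick C# check. The test numbers other than 0737-5686123 are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

var verbose = new Regex(@"^0\d\d\d-\d\d\d\d\d\d\d$"); // fixed 4-digit area code
var unbounded = new Regex(@"^0\d+-\d{7}$");           // area code of any length
var bounded = new Regex(@"^0\d{2,3}-\d{7}$");         // area code of 3 or 4 digits

string fourDigitCode = "0737-5686123";
string threeDigitCode = "010-5686123";     // hypothetical number
string tooLongCode = "0123456-5686123";    // hypothetical number

Console.WriteLine(verbose.IsMatch(fourDigitCode));  // True
Console.WriteLine(verbose.IsMatch(threeDigitCode)); // False
Console.WriteLine(unbounded.IsMatch(tooLongCode));  // True, which is too permissive
Console.WriteLine(bounded.IsMatch(threeDigitCode)); // True
Console.WriteLine(bounded.IsMatch(tooLongCode));    // False
```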

2.4 Lazy quantifiers

"*?" repeats any number of times, but as few times as possible

For example, for "acbacb" the regex "a.*?b" takes only the first "acb". Without the "?" it would take the whole string, but with the lazy quantifier it matches as few characters as possible, and the shortest possible result in "acbacb" is "acb".

"+?" repeats 1 or more times, but as few times as possible

Same as above, except that it repeats at least once.

"??" repeats 0 or 1 times, but as few times as possible

For example, for "aaacb" the regex "a.??b" takes only the last three characters, "acb".

"{n,m}?" repeats n to m times, but as few times as possible

For example, for "aaaaaaaa" the regex "a{0,m}?" has a minimum of 0 repetitions, so the result is empty.

"{n,}?" repeats n or more times, but as few times as possible

For example, for "aaaaaaa" the regex "a{1,}?" has a minimum of 1 repetition, so the result is "a".
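The lazy forms can be verified with a short sketch. The upper bound 3 in `a{0,3}?` is an arbitrary stand-in for m, and the variable names are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

string greedyStar = Regex.Match("acbacb", "a.*b").Value;     // whole string
string lazyStar = Regex.Match("acbacb", "a.*?b").Value;      // first "acb" only
string lazyPlus = Regex.Match("aaaaaaaa", "a+?").Value;      // at least one, so "a"
string lazyOptional = Regex.Match("aaacb", "a.??b").Value;   // last three characters
string lazyRange = Regex.Match("aaaaaaaa", "a{0,3}?").Value; // minimum is zero: empty
string lazyAtLeast = Regex.Match("aaaaaaa", "a{1,}?").Value; // minimum is one: "a"

Console.WriteLine(greedyStar);       // acbacb
Console.WriteLine(lazyStar);         // acb
Console.WriteLine(lazyPlus);         // a
Console.WriteLine(lazyOptional);     // acb
Console.WriteLine(lazyRange.Length); // 0
Console.WriteLine(lazyAtLeast);      // a
```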

3 Advanced regex

3.1 Capturing groups

First, the concept of a capturing group: it is simply the content inside a pair of parentheses. For example, in "(\d)\d", the "(\d)" is a capturing group. A capturing group can be back-referenced: if the same content appears again later, you can refer to the previously defined group to simplify the expression. For example, in "(\d)\d\1", the "\1" is a back reference to "(\d)".

What are capturing groups good for? Look at an example.

For "zery zery", the regex is \b(\w+)\b\s\1\b. The "\1" here must match the same text that (\w+) captured, "zery". To make groups more meaningful, a group can be given a custom name:

"\b(?<name>\w+)\b\s\k<name>\b". "?<name>" defines the group name, and a back reference to a named group is written "\k<name>". Once a group is named, the text matched by the capturing group is stored under that name.

The following are the common forms of grouping:

"(exp)" matches exp and captures the text into an automatically numbered group

"(?<name>exp)" matches exp and captures the text into a group named name

"(?:exp)" matches exp, does not capture the matched text, and does not assign a group number to this group

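The grouping forms can be tried out in C# as follows. This is a sketch; the inputs "121", "123", "zery regex", and "abab" are made-up test values.

```csharp
using System;
using System.Text.RegularExpressions;

// Numbered group with a back reference: the third digit must repeat the first.
bool repeats = Regex.IsMatch("121", @"(\d)\d\1");
bool noRepeat = Regex.IsMatch("123", @"(\d)\d\1");

// The same word twice, first with a numbered group, then with a named group.
Match numbered = Regex.Match("zery zery", @"\b(\w+)\b\s\1\b");
Match named = Regex.Match("zery zery", @"\b(?<name>\w+)\b\s\k<name>\b");
bool differentWords = Regex.IsMatch("zery regex", @"\b(\w+)\b\s\1\b");

// A non-capturing group adds no group beyond group 0 (the whole match).
int groupCount = Regex.Match("abab", "(?:ab)+").Groups.Count;

Console.WriteLine($"{repeats} {noRepeat}");      // True False
Console.WriteLine(numbered.Groups[1].Value);     // zery
Console.WriteLine(named.Groups["name"].Value);   // zery
Console.WriteLine(differentWords);               // False
Console.WriteLine(groupCount);                   // 1
```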
The following are zero-width assertions:

"(?=exp)" matches a position that is followed by exp

For example, for "How are you doing" the regex "(?<txt>.+(?=ing))" takes all the characters before "ing" and defines a capturing group named "txt"; the value of "txt" is "How are you do".

"(?<=exp)" matches a position that is preceded by exp

For example, for "How are you doing" the regex "(?<txt>(?<=How).+)" takes all the characters after "How" and defines a capturing group named "txt"; the value of "txt" is " are you doing".

"(?!exp)" matches a position that is not followed by exp

For example, for "123abc" the regex "\d{3}(?!\d)" matches 3 digits that are not followed by another digit.

"(?<!exp)" matches a position that is not preceded by exp

For example, for "abc123" the regex "(?<![0-9])123" matches "123" when it is not preceded by a digit; it can also be written "(?<!\d)123".
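A sketch of the four zero-width assertions in C#, assuming the sample sentence is "How are you doing". The inputs "1234abc" and "0123" are extra made-up cases.

```csharp
using System;
using System.Text.RegularExpressions;

string sentence = "How are you doing";

// Lookahead: everything up to the position that is followed by "ing".
string beforeIng = Regex.Match(sentence, "(?<txt>.+(?=ing))").Groups["txt"].Value;

// Lookbehind: everything from the position that is preceded by "How".
string afterHow = Regex.Match(sentence, "(?<txt>(?<=How).+)").Groups["txt"].Value;

// Negative lookahead: three digits not followed by another digit.
string lastThree = Regex.Match("1234abc", @"\d{3}(?!\d)").Value;

// Negative lookbehind: "123" not preceded by a digit.
bool standalone = Regex.IsMatch("abc123", @"(?<!\d)123");
bool afterDigit = Regex.IsMatch("0123", @"(?<!\d)123");

Console.WriteLine($"[{beforeIng}]");             // [How are you do]
Console.WriteLine($"[{afterHow}]");              // [ are you doing]
Console.WriteLine(lastThree);                    // 234
Console.WriteLine($"{standalone} {afterDigit}"); // True False
```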

4 Regex in practice

Regular expressions are hugely powerful for validation and data filtering, as anyone who has used them knows. Below, we put everything covered so far together in a hands-on data collection exercise: using regular expressions to filter HTML tags and extract the data we want.

Our battlefield is Blog Park (cnblogs). Suppose we want to collect the information for every post on the Blog Park home page: the article title and link, the author's blog address, the article summary, the publication time, and the numbers of reads, comments, and recommendations.

First, look at the HTML of a Blog Park article entry:

    <div class="post_item">
    <div class="digg">
        <div class="diggit" onclick="DiggIt(3439076,120879,1)">
            <span class="diggnum" id="digg_count_3439076">4</span>
        </div>
        <div class="clear"></div>
        <div id="digg_tip_3439076" class="digg_tip"></div>
    </div>
    <div class="post_item_body">
        <h3><a class="titlelnk" href="http://www.cnblogs.com/swq6413/p/3439076.html" target="_blank">Share the complete project Engineering directory structure</a>

We construct an HTTP request to fetch the data, then process it to extract the key information. The power of regular expressions shows when filtering the HTML tags to pull out each article.

Nearly every point covered above is used, such as "\s", "\w+", ".*?", capturing groups, and zero-width assertions. Interested readers can try it themselves first and then look at how the corresponding data is obtained with regular expressions. The code is basic and simple, and the meaning and usage of everything in it are explained above.

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Net;
    using System.Text;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main(string[] args)
        {
            string content = HttpUtility.HttpGetHtml();
            HttpUtility.GetArticles(content);
        }
    }

    internal class HttpUtility
    {
        // By default, fetch the first page of data
        public static string HttpGetHtml()
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.cnblogs.com/");
            request.Accept = "text/plain, */*; q=0.01";
            request.Method = "GET";
            request.Headers.Add("Accept-Language", "zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3");
            request.ContentLength = 0;
            request.Host = "www.cnblogs.com";
            request.UserAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.1 (KHTML, like Gecko) Maxthon/4.1.3.5000 Chrome/26.0.1410.43 Safari/537.1";
            HttpWebResponse response = (HttpWebResponse)request.GetResponse();
            Stream responStream = response.GetResponseStream();
            StreamReader reader = new StreamReader(responStream, Encoding.UTF8);
            string content = reader.ReadToEnd();
            return content;
        }

        public static List<Article> GetArticles(string htmlString)
        {
            List<Article> articleList = new List<Article>();
            Regex regex = null;
            Article article = null;
            // Split the page into one match per post_item block
            regex = new Regex("<div class=\"post_item\">(?<content>.*?)(?=<div class=\"clear\">" + @"</div>\s*</div>)", RegexOptions.Singleline);
            if (regex.IsMatch(htmlString))
            {
                MatchCollection aritcles = regex.Matches(htmlString);
                foreach (Match item in aritcles)
                {
                    article = new Article();
                    // Take the recommendation count
                    regex = new Regex("<div class=\"digg\">.*<span.*>(?<dignum>.*)" + @"</span>" + ".*<div class=\"post_item_body\">", RegexOptions.Singleline);
                    article.DiggNum = regex.Match(item.Value).Groups["dignum"].Value;
                    // Take the article title; escape characters need to be removed
                    regex = new Regex("

The regular expressions here may not be perfect, but at least they match. I am still new to regular expressions, so for now I can only write relatively simple ones like these. Please bear with me.
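To try the extraction idea without any network access, here is a minimal offline sketch that runs against a trimmed copy of the sample markup. The two patterns and all variable names are hypothetical, written for this illustration; they are not the article's own regular expressions.

```csharp
using System;
using System.Text.RegularExpressions;

// A trimmed, closed-off copy of the sample entry, held in memory.
string html =
    "<div class=\"post_item\"><div class=\"digg\">" +
    "<div class=\"diggit\"><span class=\"diggnum\" id=\"digg_count_3439076\">4</span></div>" +
    "<div class=\"clear\"></div></div>" +
    "<div class=\"post_item_body\"><h3><a class=\"titlelnk\" " +
    "href=\"http://www.cnblogs.com/swq6413/p/3439076.html\" target=\"_blank\">" +
    "Share the complete project Engineering directory structure</a></h3></div></div>";

// Hypothetical pattern: the digits inside the diggnum span.
var diggPattern = new Regex("<span class=\"diggnum\"[^>]*>(?<dignum>\\d+)</span>");

// Hypothetical pattern: the link target and the lazily matched link text.
var titlePattern = new Regex(
    "<a class=\"titlelnk\" href=\"(?<url>[^\"]+)\"[^>]*>(?<title>.*?)</a>",
    RegexOptions.Singleline);

string diggNum = diggPattern.Match(html).Groups["dignum"].Value;
Match titleMatch = titlePattern.Match(html);
string url = titleMatch.Groups["url"].Value;
string title = titleMatch.Groups["title"].Value.Trim();

Console.WriteLine(diggNum); // 4
Console.WriteLine(url);     // http://www.cnblogs.com/swq6413/p/3439076.html
Console.WriteLine(title);   // Share the complete project Engineering directory structure
```

Named groups keep the extraction readable: each value is read back by name rather than by position.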

5 Summary

Regular expressions are really not difficult. Once you understand what each symbol means, try writing a few yourself right away and they will come naturally. There are plenty of pitfalls, though: leave out a single dot and the data no longer matches. I have fallen into many of them myself, and that is how the experience builds up.

This article is only a very basic introduction. Many regex characters are not covered; I have only written about the more commonly used ones. If there are any errors, please point them out in the comments and I will correct them right away.

If you feel this article has been of some use to you, please give it a recommendation to support my effort. Thank you!

If you would like more companions on the road of technology, follow me, and let us run that road together.
