Regular Expressions Are That Simple!

Source: Internet
Author: User

Preface

I believe many people have heard of regular expressions, but the first impression for most is that they are hard to learn: at first glance there seem to be no rules to follow, just a pile of special symbols that leaves you completely in the dark.

In fact, this is only because you do not yet know regular expressions. Once you understand them, you will find that there are not many special characters and that they are not hard to remember. The only real difficulty is that, once combined, they become hard to read and understand. This article aims to give you a basic understanding of regular expressions, so that you can read simple ones and write simple ones that meet the needs of daily development.

Look at 0\d{2}-\d{8}|0\d{3}-\d{7} first. If you do not know regular expressions, you probably have no idea what this string of characters means. That does not matter; this article explains the meaning of each character in detail.

 

1.1 What is a regular expression

A regular expression is a special string pattern used to match a set of strings. It is like using a mold to make products: the regular expression is the mold, which defines a rule, and it matches the strings that conform to that rule.

1.2 Common regular expression matching tools

Online matching tools:

1 http://www.regexpal.com/

2 http://rubular.com/

Regular expression matching software:

McTracer

After trying several tools, I still find this one the best to use. It can escape a regular expression for a target language such as Java, C#, or JavaScript so that you can copy and use it directly, and it also explains regular expression usage, such as capture groups and greedy matching. So happy.

 

2 Introduction to regular expression characters

2.1 Metacharacters

"^": matches the start of a line or string, and sometimes the start of the entire document.

"$": matches the end of a line or string.

For example, with both anchors in a pattern such as "^This is Regex$", the matched text must start with This followed by a space and must end with Regex, with no spaces or other characters before or after it.
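As a minimal sketch of how the two anchors behave, the C# snippet below (C# being the language of this article's practical example) checks a few inputs against "^This is Regex$". The class and method names are made up for illustration, and the pattern is assumed from the description above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // Hypothetical helper: true only when the input is exactly "This is Regex".
    public static bool IsExactPhrase(string input)
    {
        return Regex.IsMatch(input, @"^This is Regex$");
    }

    public static void Main()
    {
        Console.WriteLine(IsExactPhrase("This is Regex"));       // True
        Console.WriteLine(IsExactPhrase("Well, This is Regex")); // False: text before "This"
        Console.WriteLine(IsExactPhrase("This is Regex!"));      // False: text after "Regex"
    }
}
```

Without the anchors, the same pattern would also accept the second and third inputs, because it would only need to find the phrase somewhere inside them.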

 

"\b": consumes no characters and matches only a position. It is most often used to match word boundaries. For example, to match the standalone word "is" in the string "This is Regex", the pattern must be written as "\bis\b".

"\b" does not match the characters on either side of "is"; it only checks whether the position is a word boundary.
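The effect of the word boundary can be seen by counting matches with and without it. This is only a sketch; the helper name CountMatches is an assumption made for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBoundaryDemo
{
    // Counts how many times the pattern matches in the input.
    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        string text = "This is Regex";
        Console.WriteLine(CountMatches(text, "is"));           // 2: inside "This" and the word "is"
        Console.WriteLine(CountMatches(text, @"\bis\b"));      // 1: only the standalone word
        Console.WriteLine(Regex.Match(text, @"\bis\b").Index); // 5: where the word "is" starts
    }
}
```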

"\d": matches a digit.

For example, to match a phone number in a fixed format, with a four-digit area code starting with 0 followed by seven digits, such as 0737-5686123, the pattern is "^0\d\d\d-\d\d\d\d\d\d\d$". This is only meant to introduce "\d"; a better way to write it is introduced below.

"\w": matches a letter, digit, or underscore.

For example, to match "a2345BCD_TTz", the pattern is "\w+". Here "+" is a quantifier that repeats the preceding element one or more times; it is described in detail later.

"\s": matches a whitespace character.

For example, for the string "a b c", the pattern "\w\s\w\s\w" matches each character followed by a single space. If there can be several spaces between the characters, write "\s+" instead of "\s" to repeat the whitespace.
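A short sketch of the three shorthands in action. The helper MatchesWhole is hypothetical; it wraps the pattern in "^(?:...)$" so that the whole input has to match, which makes the checks stricter than a plain search.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ShorthandDemo
{
    // True when the pattern matches the entire input (anchors are added here).
    public static bool MatchesWhole(string input, string pattern)
    {
        return Regex.IsMatch(input, "^(?:" + pattern + ")$");
    }

    public static void Main()
    {
        Console.WriteLine(MatchesWhole("0737-5686123", @"0\d\d\d-\d\d\d\d\d\d\d")); // True
        Console.WriteLine(MatchesWhole("a2345BCD_TTz", @"\w+"));                    // True
        Console.WriteLine(MatchesWhole("a b c", @"\w\s\w\s\w"));                    // True
        Console.WriteLine(MatchesWhole("a   b", @"\w\s\w"));                        // False: three spaces, one \s
        Console.WriteLine(MatchesWhole("a   b", @"\w\s+\w"));                       // True
    }
}
```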

".": matches any character except a line break.

This is an enhanced version of "\w". "\w" cannot match spaces, so it falls short as soon as the string contains them. See how "." handles the string "a23 4 5 B C D_TTz": the pattern is ".+".

"[abc]": a character class; matches any one of the characters inside the brackets.

This one is simple: it matches only the characters listed in the brackets. A range such as "[a-z]" matches any letter from a to z, so a character class can be used to restrict input to English letters only.
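The following sketch compares "\w+" with ".+" on a string that contains spaces, and uses a character class to accept lowercase English letters only. The helper name FirstMatch and the sample inputs are assumptions made for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotAndClassDemo
{
    // Returns the text of the first match, or an empty string if there is none.
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        string text = "a23 4 5 B C D_TTz";
        Console.WriteLine(FirstMatch(text, @"\w+"));                 // a23 (stops at the first space)
        Console.WriteLine(FirstMatch(text, ".+"));                   // a23 4 5 B C D_TTz
        Console.WriteLine(FirstMatch("line one\nline two", ".+"));   // line one ("." stops at the line break)
        Console.WriteLine(Regex.IsMatch("regex", "^[a-z]+$"));       // True
        Console.WriteLine(Regex.IsMatch("Regex2", "^[a-z]+$"));      // False: uppercase letter and digit
    }
}
```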

 

2.2 Negation

Writing a shorthand in uppercase, or putting "^" at the start of a character class, gives the opposite of its original meaning. The entries below are listed without individual examples.

"\W": matches any character that is not a letter, digit, or underscore.

"\S": matches any character that is not whitespace.

"\D": matches any character that is not a digit.

"\B": matches a position that is not the start or end of a word.

"[^abc]": matches any character except a, b, and c.

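Since the negated forms above come without examples, here is a small sketch that deletes everything a negated pattern matches, so that what remains shows what the pattern does not match. The helper Strip and the sample string are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NegationDemo
{
    // Removes every character matched by the pattern.
    public static string Strip(string input, string pattern)
    {
        return Regex.Replace(input, pattern, "");
    }

    public static void Main()
    {
        string text = "a1 b2_c3!";
        Console.WriteLine(Strip(text, @"\D"));    // 123 (only digits survive)
        Console.WriteLine(Strip(text, @"\W"));    // a1b2_c3 (space and "!" removed)
        Console.WriteLine(Strip(text, "[^abc]")); // abc
        // "\Bis" finds "is" only where it does not start a word: inside "This".
        Console.WriteLine(Regex.Matches("This is Regex", @"\Bis").Count); // 1
    }
}
```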
 

2.3 Quantifiers

First, three important concepts related to quantifiers.

Greedy, such as "*": a greedy quantifier first tries to match as much of the string as possible. If the rest of the pattern then fails, it gives back one character and tries again; this process is called backtracking. It gives back one character at a time until a match is found or there is nothing left to give back. Of the three kinds, greedy quantifiers can consume the most resources.

Lazy (reluctant), written by appending "?" to a quantifier: a lazy quantifier works the other way around. It starts at the beginning of the target, takes as few characters as possible, and extends the match one character at a time until the rest of the pattern matches or the end of the string is reached.

Possessive, written by appending "+" to a quantifier: a possessive quantifier grabs as much of the target string as it can and then tries to match the rest of the pattern, but it tries only once and never backtracks. It is like grabbing a handful of stones first and then picking the gold out of them. Possessive quantifiers are available in engines such as Java and PCRE; .NET has no possessive syntax, but its atomic group "(?>exp)" has the same no-backtracking effect.
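A rough sketch of the three behaviors in C#. The input string and helper name are made up for the example. Because .NET has no possessive quantifier syntax, the third pattern uses an atomic group, "(?>.*)", as a stand-in for the possessive ".*+" found in other engines.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BacktrackingDemo
{
    // Returns the first match, or null when the pattern does not match at all.
    public static string FirstMatch(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        string text = "a1b2b3";
        // Greedy: ".*" takes everything, then backtracks to the last "b".
        Console.WriteLine(FirstMatch(text, "a.*b"));   // a1b2b
        // Lazy: ".*?" grows one character at a time and stops at the first "b".
        Console.WriteLine(FirstMatch(text, "a.*?b"));  // a1b
        // Atomic (possessive-like): ".*" takes everything and never gives it back,
        // so the trailing "b" can never match.
        Console.WriteLine(FirstMatch(text, "a(?>.*)b") ?? "(no match)"); // (no match)
    }
}
```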

"*" (greedy): repeats zero or more times.

For example, to match every a in "aaaaaaaa", the pattern "a*" returns all of the characters, "aaaaaaaa".

"+" (greedy): repeats one or more times.

For example, to match every a in "aaaaaaaa", the pattern "a+" also returns all of the a characters. The difference between "a+" and "a*" is that "+" requires at least one occurrence while "*" allows zero.

This difference shows up later, in combination with the "?" character.

"?" (greedy): repeats zero or one time.

For example, matching "a?" against "aaaaaaaa" matches a only once, so the result is a single character, "a".

"{n}": repeats exactly n times.

For example, matching "a{3}" against "aaaaaaaa" requires a to repeat three times, so the result is three characters, "aaa".

"{n,m}": repeats n to m times.

For example, "a{3,4}" matches a repeated three or four times, so both "aaa" and "aaaa" are matched by the pattern.

"{n,}": repeats n or more times.

The difference from "{n,m}" is that there is no upper limit on the number of repetitions, but at least n are required. For example, with "a{3,}" the a must repeat at least three times.
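The quantifiers above can be tried side by side on the same input. This is a minimal sketch; the helper FirstMatch is a name chosen for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Returns the text of the first match (empty string if nothing matched).
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        string text = "aaaaaaaa"; // eight a characters
        Console.WriteLine(FirstMatch(text, "a*"));     // aaaaaaaa
        Console.WriteLine(FirstMatch(text, "a+"));     // aaaaaaaa
        Console.WriteLine(FirstMatch(text, "a?"));     // a
        Console.WriteLine(FirstMatch(text, "a{3}"));   // aaa
        Console.WriteLine(FirstMatch(text, "a{3,4}")); // aaaa (greedy, so it prefers four)
        Console.WriteLine(FirstMatch(text, "a{3,}"));  // aaaaaaaa

        // "*" succeeds with zero repetitions, "+" needs at least one.
        Console.WriteLine(Regex.IsMatch("bbb", "a*")); // True (an empty match)
        Console.WriteLine(Regex.IsMatch("bbb", "a+")); // False
    }
}
```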

With quantifiers in hand, the regular expression for the phone number can now be simplified: "^0\d\d\d-\d\d\d\d\d\d\d$" becomes "^0\d+-\d{7}$".

This is still not ideal: with no limit on the length of the area code, any number of digits is accepted, although area codes usually have only three or four digits.

So change it to "^0\d{2,3}-\d{7}$", which lets the area code match three or four digits.
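As a quick check of the final phone pattern, the sketch below validates a few inputs against "^0\d{2,3}-\d{7}$", anchored at both ends. The class name, method name, and every sample number other than 0737-5686123 are assumptions made for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PhoneDemo
{
    // Area code: 0 plus two or three digits; then a hyphen and seven digits.
    private static readonly Regex Phone = new Regex(@"^0\d{2,3}-\d{7}$");

    public static bool IsPhone(string input)
    {
        return Phone.IsMatch(input);
    }

    public static void Main()
    {
        Console.WriteLine(IsPhone("0737-5686123"));  // True: four-digit area code
        Console.WriteLine(IsPhone("010-5686123"));   // True: three-digit area code
        Console.WriteLine(IsPhone("07371-5686123")); // False: area code too long
        Console.WriteLine(IsPhone("0737-568612"));   // False: only six digits after the hyphen
        Console.WriteLine(IsPhone("737-5686123"));   // False: does not start with 0
    }
}
```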

2.4 Lazy quantifiers

"*?": repeats any number of times, but as few as possible.

For example, for "acbacb" the pattern "a.*?b" matches only the first "acb". A greedy quantifier would have taken the whole string, but with the lazy modifier the pattern matches as few characters as possible, and the shortest match in "acbacb" is "acb".

"+?": repeats one or more times, but as few as possible.

Same as above, except that at least one repetition is required.

"??": repeats zero or one time, but as few as possible.

For example, for "aaacb" the pattern "a.??b" matches only the last three characters, "acb".

"{n,m}?": repeats n to m times, but as few as possible.

For example, for "aaaaaaaa" the pattern "a{0,m}?" returns an empty result, because the minimum is zero repetitions.

"{n,}?": repeats n or more times, but as few as possible.

For example, for "aaaaaaa" the pattern "a{1,}?" needs at least one repetition, so the result is "a".
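The lazy forms can be checked the same way. In this sketch the patterns "a.*?b" and "a.??b" carry a leading a, which is what produces the results "acb" stated above, and "a{0,3}?" stands in for "a{0,m}?" with m assumed to be 3. The helper name is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyDemo
{
    // Returns the text of the first match (empty string for an empty match).
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        Console.WriteLine(FirstMatch("acbacb", "a.*b"));   // acbacb (greedy, for comparison)
        Console.WriteLine(FirstMatch("acbacb", "a.*?b"));  // acb
        Console.WriteLine(FirstMatch("aaacb", "a.??b"));   // acb
        Console.WriteLine(FirstMatch("aaaaaaaa", "a+?"));  // a
        Console.WriteLine(FirstMatch("aaaaaaaa", "a{1,}?")); // a
        // Zero repetitions are allowed, so the lazy match is empty.
        Console.WriteLine(FirstMatch("aaaaaaaa", "a{0,3}?").Length); // 0
    }
}
```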

 

3 Advanced regular expressions

3.1 Capture groups

First, understand what a capture group is: it is simply the part of a regular expression enclosed in parentheses. In "(\d)\d", the "(\d)" is a capture group. A capture group can be referenced later in the pattern (a back reference), so when the same content must appear again you can refer to the group defined earlier and keep the expression short. For example, in "(\d)\d\1", the "\1" is a back reference to "(\d)".

So what are capture groups used for? Let's look at an example.

For example, for "zery zery" the pattern is "\b(\w+)\b\s\1\b". The text matched by "\1" here is the same "zery" that "(\w+)" captured. Groups can also be given custom names to make them more meaningful.

In "\b(?<name>\w+)\b\s\k<name>\b", "?<name>" defines the group name, and you must then write "\k<name>" to reference the group later. Once a group is named, the value it matches is stored under that name.

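A small sketch of numbered and named back references, using the two patterns above to find a word that is immediately repeated. The method name FindRepeatedWord and the inputs other than "zery zery" are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackreferenceDemo
{
    // Returns the repeated word, or null if no word is immediately repeated.
    public static string FindRepeatedWord(string input)
    {
        Match m = Regex.Match(input, @"\b(?<name>\w+)\b\s\k<name>\b");
        return m.Success ? m.Groups["name"].Value : null;
    }

    public static void Main()
    {
        // Numbered group: \1 must repeat whatever (\w+) captured.
        Match numbered = Regex.Match("zery zery", @"\b(\w+)\b\s\1\b");
        Console.WriteLine(numbered.Groups[1].Value);                   // zery

        // Named group: same idea, read back by name.
        Console.WriteLine(FindRepeatedWord("zery zery"));              // zery
        Console.WriteLine(FindRepeatedWord("this is is a test"));      // is
        Console.WriteLine(FindRepeatedWord("zery blog") ?? "(no match)"); // (no match)
    }
}
```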
Common capture group syntax is listed below.

"(exp)": matches exp and captures the text into an automatically numbered group.

"(?<name>exp)": matches exp and captures the text into the group named name.

"(?:exp)": matches exp without capturing the matched text and without assigning the group a number.

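The difference between the three group forms shows up in the groups a match exposes. The sketch below uses a made-up input, "555-1234", and a hypothetical helper; note that in .NET, Groups.Count includes group 0, which is the whole match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupKindsDemo
{
    // Number of groups on the match, including group 0 (the whole match).
    public static int GroupCount(string input, string pattern)
    {
        return Regex.Match(input, pattern).Groups.Count;
    }

    public static void Main()
    {
        string text = "555-1234";
        Console.WriteLine(GroupCount(text, @"(\d{3})-(\d{4})"));   // 3: whole match + two groups
        Console.WriteLine(GroupCount(text, @"(?:\d{3})-(\d{4})")); // 2: "(?:...)" gets no number

        // With the first group non-capturing, group 1 is the four-digit part.
        Console.WriteLine(Regex.Match(text, @"(?:\d{3})-(\d{4})").Groups[1].Value); // 1234

        // Named groups are read back by name.
        Match named = Regex.Match(text, @"(?<prefix>\d{3})-(?<line>\d{4})");
        Console.WriteLine(named.Groups["prefix"].Value); // 555
        Console.WriteLine(named.Groups["line"].Value);   // 1234
    }
}
```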
The following are zero-width assertions.

"(?=exp)": matches a position that is followed by exp.

For example, for "How are you doing" the pattern "(?<txt>.+(?=ing))" takes the characters before "ing" and stores them in a capture group named "txt", whose value is "How are you do".

"(?<=exp)": matches a position that is preceded by exp.

For example, for "How are you doing" the pattern "(?<txt>(?<=How).+)" takes all the characters after "How" and stores them in a capture group named "txt", whose value is " are you doing".

"(?!exp)": matches a position that is not followed by exp.

For example, for "123abc" the pattern "\d{3}(?!\d)" matches three digits that are not followed by another digit.

"(?<!exp)": matches a position that is not preceded by exp.

For example, for "abc123" the pattern "(?<![0-9])123" matches a "123" that is not preceded by a digit; it can also be written as "(?<!\d)123".
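The four assertions can be tried with the examples above. This is a sketch only; the helper name and the extra inputs "1234abc" and "0123" are assumptions added to show a case where the assertion changes the result.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Returns the first match, or null when the pattern does not match.
    public static string FirstMatch(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        string text = "How are you doing";

        // Lookahead: everything up to, but not including, "ing".
        Console.WriteLine(Regex.Match(text, @"(?<txt>.+(?=ing))").Groups["txt"].Value);  // How are you do

        // Lookbehind: everything after "How" (the leading space is part of the value).
        Console.WriteLine(Regex.Match(text, @"(?<txt>(?<=How).+)").Groups["txt"].Value); // " are you doing"

        // Negative lookahead: three digits not followed by another digit.
        Console.WriteLine(FirstMatch("123abc", @"\d{3}(?!\d)"));  // 123
        Console.WriteLine(FirstMatch("1234abc", @"\d{3}(?!\d)")); // 234

        // Negative lookbehind: "123" not preceded by a digit.
        Console.WriteLine(FirstMatch("abc123", @"(?<!\d)123"));               // 123
        Console.WriteLine(FirstMatch("0123", @"(?<!\d)123") ?? "(no match)"); // (no match)
    }
}
```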

 

4 Regular expressions in practice

Regular expressions are extremely powerful for validation and data filtering, as anyone who has used them knows. Next, we combine everything covered so far in a practical data collection exercise: using regular expressions to filter HTML tags and extract the data we want.

We choose cnblogs (the "blog garden") as our battlefield. Suppose we want to collect the information for every article on the cnblogs home page: the article title and link, the address of the author's blog, the article summary, the publication time, and the read, comment, and recommendation counts.

 

First, look at the HTML of an article entry on cnblogs:

<div class="post_item">
<div class="digg">
<div class="diggit" onclick="DiggIt(3439076,120879,1)">
<span class="diggnum" id="digg_count_3439076">4</span>
</div>
<div class="clear"></div>
<div id="digg_tip_3439076" class="digg_tip"></div>
<div class="post_item_body">

 

 

The program builds an HTTP request to fetch the page and then processes the response to extract the key information. The power of regular expressions shows when the HTML tags are filtered out to get at the articles.

Nearly all the knowledge points covered above are used: "\s", "\w+", ".*?", capture groups, zero-width assertions, and so on. If you are interested, give it a try and see how to obtain the corresponding data with regular expressions. The regular expressions in the code are very simple, and their meaning and usage are all described in detail in this article.

 

using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        string content = HttpUtility.HttpGetHtml();
        HttpUtility.GetArticles(content);
    }
}

internal class HttpUtility
{
    // Fetch the first page by default
    public static string HttpGetHtml()
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.cnblogs.com/");
        request.Accept = "text/plain, */*; q=0.01";
        request.Method = "GET";
        request.Headers.Add("Accept-Language", "zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3");
        request.ContentLength = 0;
        request.Host = "www.cnblogs.com";
        request.UserAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.1 (KHTML, like Gecko) Maxthon/4.1.3.5000 Chrome/26.0.1410.43 Safari/537.1";
        HttpWebResponse response = (HttpWebResponse)request.GetResponse();
        Stream responStream = response.GetResponseStream();
        StreamReader reader = new StreamReader(responStream, Encoding.UTF8);
        string content = reader.ReadToEnd();
        return content;
    }

    public static List<Article> GetArticles(string htmlString)
    {
        List<Article> articleList = new List<Article>();
        Regex regex = null;
        Article article = null;
        // Each article block: everything after the post_item div, up to the closing clear div
        regex = new Regex("<div class=\"post_item\">(?<content>.*?)(?=<div class=\"clear\">" + @"</div>\s*</div>)", RegexOptions.Singleline);
        if (regex.IsMatch(htmlString))
        {
            MatchCollection aritcles = regex.Matches(htmlString);
            foreach (Match item in aritcles)
            {
                article = new Article();
                // Get the recommendation (digg) count
                regex = new Regex("<div class=\"digg\">.*<span.*>(?<digNum>.*)" + @"</span>" + ".*<div class=\"post_item_body\">", RegexOptions.Singleline);
                article.DiggNum = regex.Match(item.Value).Groups["digNum"].Value;
                // Escape character
                regex = new Regex("

These regular expressions may not be perfect, but at least they match what is needed. Also, since I am new to regular expressions myself, simple ones like these are all I can write for now. Please bear with me.
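To try the idea without making an HTTP request, the sketch below runs a named capture group against a fragment of the HTML shown earlier. It is not the author's code: the class name, method name, and the simplified pattern are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DiggCountDemo
{
    // Hypothetical helper: pulls the number out of the "diggnum" span.
    // "[^>]*" skips the remaining attributes without leaving the opening tag.
    public static string ExtractDiggNum(string html)
    {
        Match m = Regex.Match(html, @"<span class=""diggnum""[^>]*>\s*(?<digNum>\d+)\s*</span>");
        return m.Success ? m.Groups["digNum"].Value : null;
    }

    public static void Main()
    {
        // A fragment of the article HTML, hard-coded for the example.
        string html = @"<div class=""post_item""><div class=""digg"">"
                    + @"<div class=""diggit"" onclick=""DiggIt(3439076,120879,1)"">"
                    + @"<span class=""diggnum"" id=""digg_count_3439076"">4</span></div>";

        Console.WriteLine(ExtractDiggNum(html)); // 4
    }
}
```

Using "[^>]*" rather than ".*" keeps the match inside a single tag, which avoids the long backtracking that a greedy ".*" can cause on a full page.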

 

 

Summary

Regular expressions are actually not difficult. Once you understand what each symbol means, try them out yourself and write a few more, and the rest comes with practice. They do have many pitfalls, though: more than once a match failed because I had left out a single dot. I stepped into plenty of these pits myself, and that is how the experience was earned.

This article is only a basic introduction to regular expressions. Many characters are not covered; only the commonly used ones are listed. If you find any errors, please point them out in the comments and I will correct them right away.

 

 


 

C# basic knowledge label: C# basic https://www.cnblogs.com/zery/p/3438845.html
