Reading regular expressions really is this simple


1 Preface

Many people have heard of regular expressions, but the usual first impression is that they are hard to learn: at first glance there seems to be no pattern at all, just a pile of special symbols that makes no sense.

That impression comes from unfamiliarity; once you look into them, there is not much to it. Regular expressions use only a small set of special characters, which are neither hard to remember nor hard to understand. The only real difficulty is that, once combined, they read poorly and are not easy to follow. This article aims to give you a basic understanding, so that you can read simple regular expressions, write simple ones of your own, and cover the needs of day-to-day development.

Let's start with a regex: 0\d{2}-\d{8}|0\d{3}-\d{7}. If you have never studied regular expressions, this string of characters probably means nothing to you. That doesn't matter; this article explains the meaning of each character in detail.

1.1 What is a regular expression

A regular expression is a special string pattern used to match a set of strings. Think of making products from a mold: the regex is the mold. It defines a rule, and it matches the text that conforms to that rule.

1.2 Common regex matching tools

Online matching tools:

1 http://www.regexpal.com/

2 http://rubular.com/

Regex matching software:

Mctracer

After trying several tools, I still find this one the most useful. It can export a regex to the target language, such as Java, C# or JS, and escapes it for you, so you can copy it and use it directly, which is very convenient. It can also explain a regex, for example which part is a capturing group and which part is a greedy match. In short, it is a pleasure to use.

2 A brief introduction to regex characters

2.1 Metacharacters

"^": matches the start of a line or string, and sometimes the start of the entire document.

"$": matches the end of a line or string.

When both are used together, as in ^regex$, the matched text must start with "regex" and end with "regex", with no spaces or other characters before or after it.

"\b": consumes no characters and matches only a position. It is often used to match word boundaries. For example, to match the standalone word "is" in a string, just write "\bis\b".

\b does not match the characters on either side of "is"; it only checks whether each side of "is" is a word boundary.
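
To see the anchors and the word boundary in action, here is a minimal Python sketch using the standard re module. The sample word "regex" and the sentence "This is an island" are assumptions chosen for illustration, not strings from the original article.

```python
import re

# ^ and $ together force the pattern to cover the whole string.
anchored = re.compile(r"^regex$")
exact = anchored.search("regex") is not None           # nothing before or after
leading_space = anchored.search(" regex") is not None  # a space in front breaks the match
trailing_text = anchored.search("regex!") is not None  # an extra character behind breaks it

# \b matches a position, not a character, so only the standalone word is found.
text = "This is an island"
standalone = re.findall(r"\bis\b", text)  # the separate word "is"
anywhere = re.findall(r"is", text)        # also hits the "is" inside "This" and "island"

print(exact, leading_space, trailing_text)
print(standalone, len(anywhere))
```

Without the \b on both sides, the same two letters are also found inside the longer words, which is usually not what you want.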

"\d": matching numbers,

For example, to match a fixed-format phone number that starts with 0, has a 4-digit area code, and is followed by 7 digits, such as 0737-5686123, the regex is ^0\d\d\d-\d\d\d\d\d\d\d$. This form is used here only to introduce "\d"; a better way of writing it is described below.
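
The phone number pattern can be tried out directly. This is a small Python sketch; the helper name is_phone and the two rejected inputs are made up for the demonstration.

```python
import re

# Same pattern as in the text: 0, three more digits, a hyphen, seven digits.
phone = re.compile(r"^0\d\d\d-\d\d\d\d\d\d\d$")

def is_phone(value):
    """Return True when value has the fixed 0xxx-xxxxxxx shape."""
    return phone.search(value) is not None

print(is_phone("0737-5686123"))  # the example from the text
print(is_phone("0737-568612"))   # only six digits after the hyphen
print(is_phone("1737-5686123"))  # does not start with 0
```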

"\w": matches letters, numbers, underscores.

For example, to match "A2345bcd__ttz", the regex is "\w+". The "+" here is a quantifier that specifies the number of repetitions; it is described in detail later.

"\s": matches a whitespace character.

For example, for the string "a B C" the regex is "\w\s\w\s\w": one character followed by one space, and so on. If there may be several spaces between the characters, change "\s" to "\s+" to allow the space to repeat.

".": matches any character except a line break.

This works like an enhanced version of "\w". "\w" cannot match a space, so it falls short once the string contains spaces. "." handles that: to match the string "A23 4 5 B C D__ttz", the regex is ".+".
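
The examples for "\w", "\s" and "." can be checked with a short Python sketch. re.fullmatch is used so that a match only counts when the whole string fits the pattern; the strings with extra whitespace and with a line break are assumptions added for the demonstration.

```python
import re

word_only = re.fullmatch(r"\w+", "A2345bcd__ttz")           # letters, digits, underscores
spaced = re.fullmatch(r"\w\s\w\s\w", "a B C")               # exactly one space each time
many_spaces = re.fullmatch(r"\w\s+\w\s+\w", "a   B \t C")   # \s+ allows repeated whitespace

# \w+ stops being enough once the string contains spaces; . accepts them.
w_with_spaces = re.fullmatch(r"\w+", "A23 4 5 B C D__ttz")
dot_with_spaces = re.fullmatch(r".+", "A23 4 5 B C D__ttz")

# By default . does not cross a line break.
first_line = re.search(r".+", "line one\nline two").group()

print(word_only is not None, spaced is not None, many_spaces is not None)
print(w_with_spaces is None, dot_with_spaces is not None, first_line)
```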

"[abc]": a character class; matches any one of the characters listed inside the brackets.

This one is simpler: it matches only the characters in the brackets. It can also be written as a range, such as [a-z] to match the letters a to z, which lets you restrict input to English letters.
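
Character classes and ranges can be sketched in Python as follows. The sample strings are assumptions. One detail worth knowing: a range follows character codes, so a range written as [A-z] also covers the punctuation that sits between "Z" and "a" in ASCII; [A-Za-z] is the usual way to accept both cases.

```python
import re

picked = re.findall(r"[abc]", "a big cab")   # only a, b and c are picked out

lowercase = re.compile(r"[a-z]+")
lower_ok = lowercase.fullmatch("hello") is not None
upper_rejected = lowercase.fullmatch("Hello") is None

# The range A..z in ASCII also contains [ \ ] ^ _ and the backtick.
wide_range_takes_underscore = re.fullmatch(r"[A-z]+", "a_b") is not None
both_cases = re.fullmatch(r"[A-Za-z]+", "Hello") is not None
strict_rejects_underscore = re.fullmatch(r"[A-Za-z]+", "a_b") is None

print(picked)
print(lower_ok, upper_rejected, wide_range_takes_underscore, both_cases)
```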

2.2 Negated forms

These are easy to write: just capitalize the letter, and the meaning becomes the opposite of the original. No examples are given here.

"\W": matches any character that is not a letter, digit, or underscore.

"\S": matches any character that is not whitespace.

"\D": matches any character that is not a digit.

"\B": matches a position that is not the start or end of a word.

"[^abc]": matches any character other than a, b, and c.
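
A few quick checks of the negated forms, as a Python sketch. All of the sample strings here are assumptions picked to make the effect visible.

```python
import re

text = "tel: 0737-5686123"
non_digit_runs = re.findall(r"\D+", text)   # everything that is not a digit
digits_only = re.sub(r"\D", "", text)       # strip the non-digits away

non_word = re.findall(r"\W", "a_b-c!")      # underscore counts as a word character
non_space_runs = re.findall(r"\S+", "a  b\tc")
not_abc = re.findall(r"[^abc]", "aXbYc")

# \B requires a non-boundary on both sides, so only the "is" inside "whisper" qualifies.
inside_word = re.findall(r"\Bis\B", "this whisper")

print(non_digit_runs, digits_only)
print(non_word, non_space_runs, not_abc, inside_word)
```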

2.3 Quantifiers

First, three important concepts that quantifiers involve.

Greedy, such as "*": a greedy quantifier first tries to match the entire string, taking as much as possible. If the overall match fails, it gives back one character and tries again; this process is called backtracking. It gives back one character at a time until a match is found or there is nothing left to give back. Compared with the other two kinds, greedy quantifiers tend to consume the most resources.

Lazy (reluctant), written by appending "?" to a quantifier, as in "*?": a lazy quantifier works the other way round. It starts at the beginning of the target and checks one character at a time, taking as little as it can, and keeps extending until it finds a match or reaches the end of the string.

Possessive, written by appending "+" to a quantifier, as in "*+": a possessive quantifier also takes in the whole target string and then tries to find a match, but it tries only once and never backtracks. It is like grabbing a handful of stones and then picking the gold out of them.
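
The difference between the three modes is easiest to see side by side. The following Python sketch uses made-up sample strings. Python's re module accepts possessive quantifiers only from version 3.11 on, so that part is guarded by a version check; other regex engines may or may not support them.

```python
import re
import sys

html = "<b>one</b><b>two</b>"

# Greedy: .* runs to the end and backtracks to the LAST closing tag.
greedy = re.search(r"<b>.*</b>", html).group()

# Lazy: .*? grows one character at a time and stops at the FIRST closing tag.
lazy = re.search(r"<b>.*?</b>", html).group()

# Possessive: .*+ swallows the closing quote and never gives it back,
# so the final quote in the pattern can no longer match.
possessive = None
if sys.version_info >= (3, 11):
    possessive = re.search(r'".*+"', '"abc"')

print(greedy)
print(lazy)
print(possessive)
```

The greedy and lazy patterns both succeed but return different spans; the possessive pattern fails outright, which is the price of never backtracking.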

"*" (greed) repeat 0 times or more

For example, to match all the a's in the string "aaaaaaaa", the regex "a*" picks up every "a".

"+" (greedy): repeat one or more times.

For example, to match all the a's in the string "aaaaaaaa", the regex "a+" also picks up every "a". "a+" differs from "a*" in that "+" requires at least one occurrence, while "*" accepts zero.

This difference becomes visible later, when the quantifiers are combined with "?".

"?" (greedy): repeat zero times or once.

For example, applied to the string "aaaaaaaa", the regex "a?" matches only once, so the result is a single "a".

"{n}": repeat exactly n times.

For example, to take three a's from the string "aaaaaaaa", the regex is "a{3}", and the result is "aaa".

"{n,m}": repeat n to m times.

For example, the regex "a{3,4}" matches "a" three or four times, so both "aaa" and "aaaa" can be matched.

"{n,}": repeat n or more times.

The difference from {n,m} is that there is no upper bound on the number of repetitions, only a minimum of n. For example, the regex "a{3,}" requires "a" to appear at least 3 times.
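
The quantifier examples above can be run against the same "aaaaaaaa" string. This is a Python sketch; the second string "bbb" is an assumption added to show where "*" and "+" differ.

```python
import re

s = "aaaaaaaa"  # eight a's

star = re.match(r"a*", s).group()          # zero or more: takes all eight
plus = re.match(r"a+", s).group()          # one or more: takes all eight
optional = re.match(r"a?", s).group()      # zero or one: a single "a"
exactly3 = re.match(r"a{3}", s).group()    # exactly three
three_to_four = re.match(r"a{3,4}", s).group()  # greedy, so it takes four
at_least3 = re.match(r"a{3,}", s).group()  # no upper bound: all eight

# On a string without any "a", * still succeeds with an empty match; + fails.
star_on_b = re.match(r"a*", "bbb").group()
plus_on_b = re.match(r"a+", "bbb")

print(star, plus, optional, exactly3, three_to_four, at_least3)
print(repr(star_on_b), plus_on_b)
```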

Now that quantifiers are clear, the earlier phone number regex ^0\d\d\d-\d\d\d\d\d\d\d$ can be simplified to "^0\d+-\d{7}$".

That is still not perfect: the area code is unbounded, so any number of digits is accepted, whereas in practice an area code has only 3 or 4 digits.

Changing it to "^0\d{2,3}-\d{7}$" lets the area code match 3 or 4 digits.
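
Here is a small Python comparison of the two phone number patterns. The trailing "$" on the bounded pattern is an assumption made here so that both patterns are anchored the same way, and the over-long number is an invented test input.

```python
import re

loose = re.compile(r"^0\d+-\d{7}$")        # area code of any length
bounded = re.compile(r"^0\d{2,3}-\d{7}$")  # area code of 3 or 4 digits in total

too_long = "0737123456-5686123"

loose_accepts_long = loose.search(too_long) is not None
bounded_rejects_long = bounded.search(too_long) is None
four_digit_area = bounded.search("0737-5686123") is not None
three_digit_area = bounded.search("010-5686123") is not None

print(loose_accepts_long, bounded_rejects_long)
print(four_digit_area, three_digit_area)
```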

2.4 Lazy quantifiers

"*?": repeat any number of times, but as few times as possible.

For example, with the string "acbacb", the regex "a.*?b" takes only the first "acb". Without the "?" it would match the whole string, but the lazy form matches as few characters as possible, and the shortest result in "acbacb" is "acb".

"+?": repeat one or more times, but as few times as possible.

Same as above, except that it repeats at least once.

"??": repeat zero times or once, but as few times as possible.

For example, with the string "aaacb", the regex "a.??b" takes only the last three characters, "acb".

"{n,m}?": repeat n to m times, but as few times as possible.

For example, with the string "aaaaaaaa", the regex "a{0,m}?" allows zero repetitions, so the result is empty.

"{n,}?": repeat n or more times, but as few times as possible.

For example, with the string "aaaaaaaa", the regex "a{1,}?" needs at least one repetition, so the result is "a".
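
The lazy examples can be verified with a few lines of Python. The strings are the ones from the text, written in lowercase; the upper bound 3 in a{0,3}? is an arbitrary stand-in for m.

```python
import re

lazy_star = re.search(r"a.*?b", "acbacb").group()    # stops at the first "b"
greedy_star = re.search(r"a.*b", "acbacb").group()   # runs on to the last "b"

lazy_optional = re.search(r"a.??b", "aaacb").group() # only the last three characters fit

s = "aaaaaaaa"
lazy_range = re.search(r"a{0,3}?", s).group()        # zero repetitions are enough
lazy_at_least_one = re.search(r"a{1,}?", s).group()  # one repetition is enough
lazy_plus = re.search(r"a+?", s).group()             # same idea with +?

print(lazy_star, greedy_star, lazy_optional)
print(repr(lazy_range), lazy_at_least_one, lazy_plus)
```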

3 Advanced regex

3.1 Capturing groups

A capturing group is simply a part of the regex enclosed in parentheses. In "(\d)\d", "(\d)" is a capturing group. A captured group can be referred to later by a back reference: if the same content must appear again further on, you can refer to the earlier group instead of repeating it, which simplifies the expression. In (\d)\d\1, the "\1" is a back reference to "(\d)".

What is the use of capturing groups?

Take "zery zery" and the regex \b(\w+)\b\s\1\b. Here "\1" matches the same text that (\w+) captured, "zery". To make a group more meaningful, you can give it a custom name.

"\b(?<name>\w+)\b\s\k<name>\b" uses "?<name>" to name the group and "\k<name>" to refer back to it. Once the group is named, the text matched by the capturing group is stored under that name.

The common forms of grouping are:

"(exp)": matches exp and captures the text into an automatically numbered group.

"(?<name>exp)": matches exp and captures the text into a group named name.

"(?:exp)": matches exp without capturing the matched text and without assigning a group number to it.

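The grouping forms can be tried in Python with one caveat: Python's re module writes named groups as (?P<name>exp) and named back references as (?P=name), rather than the (?<name>exp) and \k<name> forms shown above. The group names area and number and the string "ababab7" are hypothetical, chosen only for this sketch.

```python
import re

# Numbered group with a back reference: the same word twice.
repeated = re.search(r"\b(\w+)\b\s\1\b", "zery zery")
repeated_word = repeated.group(1)

# The same pattern with a named group (Python spelling).
named = re.search(r"\b(?P<name>\w+)\b\s(?P=name)\b", "zery zery")
named_word = named.group("name")

# (\d)\d\1 needs the first and third digit to be equal.
same_ends = re.search(r"(\d)\d\1", "121") is not None
different_ends = re.search(r"(\d)\d\1", "123") is None

# A non-capturing group repeats "ab" without taking a group number.
capturing = re.search(r"(ab)+(\d)", "ababab7").groups()
non_capturing = re.search(r"(?:ab)+(\d)", "ababab7").groups()

# Named groups make the parts of the phone number easy to pick out.
parts = re.search(r"(?P<area>0\d{2,3})-(?P<number>\d{7})", "0737-5686123").groupdict()

print(repeated_word, named_word, same_ends, different_ends)
print(capturing, non_capturing, parts)
```
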
The following are zero-width assertions.

"(?=exp)": matches the position just before exp.

For example, with "How are you doing", the regex "(?<txt>.+(?=ing))" takes all the characters before "ing" and stores them in a capturing group named "txt"; the value of "txt" is "How are you do".

"(?<=exp)": matches the position just after exp.

For example, with "How are you doing", the regex "(?<txt>(?<=How).+)" takes all the characters after "How" and stores them in a capturing group named "txt"; the value of "txt" is " are you doing".

"(?!exp)": matches a position that is not followed by exp.

For example, with "123abc", the regex "\d{3}(?!\d)" matches three digits that are not followed by another digit.

"(?<!exp)": matches a position that is not preceded by exp.

For example, with "abc123", the regex "(?<![0-9])123" matches a "123" that is not preceded by a digit. It can also be written as "(?<!\d)123".
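
The four assertions can be checked with the following Python sketch. Named groups use Python's (?P<txt>...) spelling, and Python's re module only accepts look-behind patterns of fixed width, which all of these are. The inputs "1234abc" and "90123" are assumptions added to show a rejected position.

```python
import re

s = "How are you doing"

# Look-ahead: everything that is followed by "ing".
before_ing = re.search(r"(?P<txt>.+(?=ing))", s).group("txt")

# Look-behind: everything that comes after "How" (note the leading space).
after_how = re.search(r"(?P<txt>(?<=How).+)", s).group("txt")

# Negative look-ahead: three digits not followed by another digit.
three_digits = re.search(r"\d{3}(?!\d)", "123abc").group()
shifted = re.search(r"\d{3}(?!\d)", "1234abc").group()  # "123" is followed by "4", so the match moves on

# Negative look-behind: "123" not preceded by a digit.
position = re.search(r"(?<![0-9])123", "abc123").start()
same_with_d = re.search(r"(?<!\d)123", "abc123") is not None
preceded_by_digit = re.search(r"(?<!\d)123", "90123") is None

print(repr(before_ing), repr(after_how))
print(three_digits, shifted, position, same_with_d, preceded_by_digit)
```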

4 Putting it into practice

Regular expressions show their real power in validation and data filtering, as anyone who has used them knows. Let's combine everything covered so far in a practical exercise: a data-collection task that uses regexes to filter HTML tags and pull out the data we want.

The battlefield is cnblogs (Blog Park). Suppose we want to collect the information for every article on the cnblogs home page: the article title and link, the author's blog address, the article summary, the publication time, and the read, comment, and recommendation counts.

First, look at the HTML of a blog post entry:

<div class="post_item">
  <div class="digg">
    <div class="diggit" onclick="Diggit(3439076,120879,1)">
      <span class="diggnum" id="digg_count_3439076">4</span>
    </div>
    <div class="clear"></div>
    <div id="digg_tip_3439076" class="digg_tip"></div>
  </div>
  <div class="post_item_body">
    <h3><a class="titlelnk" href="http://www.cnblogs.com/swq6413/p/3439076.html" target="_blank">Share the complete project Engineering directory structure</a>

The program builds an HTTP request to fetch the page and then processes the response. Filtering out the HTML tags to get at the article data is where regexes show what they can do.

Almost all of the regex knowledge covered above is put to use, such as "\s", "\w+" and ".*?", along with capturing groups, zero-width assertions, and so on. If you are interested, try it yourself and see how the regexes extract the data. The code is basic and simple, and the meaning and usage of each element are explained above.

    using System.Collections.Generic;
    using System.IO;
    using System.Net;
    using System.Text;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main(string[] args)
        {
            string content = HttpUtility.HttpGetHtml();
            HttpUtility.GetArticles(content);
        }

        internal class HttpUtility
        {
            // By default, fetch the data of the first page
            public static string HttpGetHtml()
            {
                HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.cnblogs.com/");
                request.Accept = "text/plain, */*; q=0.01";
                request.Method = "GET";
                request.Headers.Add("Accept-Language", "zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3");
                request.ContentLength = 0;
                request.Host = "www.cnblogs.com";
                request.UserAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.1 (KHTML, like Gecko) Maxthon/4.1.3.5000 Chrome/26.0.1410.43 Safari/537.1";
                HttpWebResponse response = (HttpWebResponse)request.GetResponse();
                Stream responseStream = response.GetResponseStream();
                StreamReader reader = new StreamReader(responseStream, Encoding.UTF8);
                string content = reader.ReadToEnd();
                return content;
            }

            public static List<Article> GetArticles(string htmlString)
            {
                List<Article> articleList = new List<Article>();
                Regex regex = null;
                Article article = null;
                // One match per post: everything from the post_item div up to its closing "clear" div
                regex = new Regex("<div class=\"post_item\">(?<content>.*?)(?=<div class=\"clear\">" + @"</div>\s*</div>)", RegexOptions.Singleline);
                if (regex.IsMatch(htmlString))
                {
                    MatchCollection articles = regex.Matches(htmlString);
                    foreach (Match item in articles)
                    {
                        article = new Article();
                        // Get the recommendation count
                        regex = new Regex("<div class=\"digg\">.*<span.*>(?<dignum>.*)" + @"</span>" + ".*<div class=\"post_item_body\">", RegexOptions.Singleline);
                        article.DiggNum = regex.Match(item.Value).Groups["dignum"].Value;
                        // Get the article title and strip the escape characters
                        regex = new Regex("

The regexes here may not be perfect, but at least they match. I have only just started working with regular expressions, so I can only write fairly simple ones. Please bear with me.
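
For readers who want to experiment without sending an HTTP request, here is a minimal offline sketch in Python. It is not the author's code: the SAMPLE markup is modelled on the post_item snippet shown earlier, the lowercase class names and the closing tags after the title link are assumptions, and the two patterns and the parse_item helper are hypothetical illustrations of the same idea (named groups plus a lazy ".*?").

```python
import re

# Assumed sample markup; the real page may differ.
SAMPLE = '''<div class="post_item">
<div class="digg"><div class="diggit">
<span class="diggnum" id="digg_count_3439076">4</span>
</div><div class="clear"></div></div>
<div class="post_item_body">
<h3><a class="titlelnk" href="http://www.cnblogs.com/swq6413/p/3439076.html" target="_blank">Share the complete project Engineering directory structure</a></h3>
</div></div>'''

# Recommendation count: the digits inside the diggnum span.
DIGG = re.compile(r'<span class="diggnum"[^>]*>(?P<dignum>\d+)</span>')

# Title link: the href value and the lazily matched link text.
TITLE = re.compile(
    r'<a class="titlelnk" href="(?P<href>[^"]+)"[^>]*>(?P<title>.*?)</a>',
    re.DOTALL,
)

def parse_item(html):
    """Pull the recommendation count, link and title out of one post block."""
    digg = DIGG.search(html)
    title = TITLE.search(html)
    return {
        "dignum": digg.group("dignum") if digg else None,
        "href": title.group("href") if title else None,
        "title": title.group("title").strip() if title else None,
    }

item = parse_item(SAMPLE)
print(item)
```

Using [^>]* and [^"]+ instead of a bare .* keeps each piece from running past the tag or attribute it belongs to, which is one of the easier pitfalls to fall into when matching HTML with regexes.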

5 Summary

Regular expressions are not hard: once you understand what each symbol means and write a few yourself, they come naturally. There are plenty of pitfalls, though. Leave out a single character and the data no longer matches. I fell into many of them myself, and the experience came from falling in.

This article is only a very basic introduction. Many regex characters were left out, and only the more commonly used ones are covered. If you find a mistake, please point it out in the comments and I will correct it right away.

That is all for this introduction to regular expressions. I hope it helps. If you have any questions, feel free to leave a message and I will reply promptly.
