Using Regular Expressions to Crawl Blog Park List Data


For the ASP.NET MVC 3 system I was trying to complete, I needed test data for the blog module. Entering it by hand was too tiring, so I grabbed part of Blog Park's list data to use instead; I hope Dudu takes no offense.

I used regular expressions to crawl the blog data, so friends who are not yet familiar with regular expressions can consult the relevant material first. They are actually easy to grasp; it is applying them to specific cases that takes some time.

Now I will walk you through the process of capturing the blog data, and if any friends have better suggestions, you are welcome to share them.

To crawl data with regular expressions, we first need to build a regular expression that matches the target content. I recommend Regulator, a regular expression tool; with it we can piece together the expression we want and then use it in the program.
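To make the escaping concrete, here is a small example I have added (it is not from the original article): the "post_item" fragment of the pattern used later can be written in C# in either of two equivalent forms.

using System.Text.RegularExpressions;

class EscapeDemo
{
    static void Main()
    {
        // The same "post_item" fragment used later in this article, in both C# string forms
        string escaped  = "<div\\s*class=\"post_item\">";   // every \ doubled, " escaped
        string verbatim = @"<div\s*class=""post_item"">";   // @-string: only " doubled

        // Both spellings produce the identical regex <div\s*class="post_item">
        System.Console.WriteLine(escaped == verbatim);                                   // True
        System.Console.WriteLine(Regex.IsMatch("<div class=\"post_item\">", verbatim));  // True
    }
}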

I found that the Blog Park homepage list can be accessed directly as http://www.cnblogs.com/p1, http://www.cnblogs.com/p2, and so on. That means we can get the data directly through the URL, rather than simulating click events to virtually press the "next page" button, which is much more convenient. Since my goal is only to crawl some data, keeping it simple is enough.

1. The first step is to write the corresponding SQL helper class. I believe most programmers have this down; it is nothing more than the basic create, read, update, and delete operations. Once the SqlHelper class is in place, we can begin the logic for fetching the data (a minimal sketch of such a class follows).
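The article does not show the SqlHelper class itself, so what follows is only a minimal sketch of what it might look like. The Blogs table, its columns, and the BlogDb connection-string name are all my assumptions; only the Insert signature is fixed, since the BlogRegex class below calls helper.Insert(title, content, categoryId, linkUrl).

using System.Configuration;
using System.Data.SqlClient;

// Hypothetical sketch: the table and connection-string names are assumptions,
// not taken from the original article.
public class SqlHelper
{
    private readonly string _connectionString =
        ConfigurationManager.ConnectionStrings["BlogDb"].ConnectionString;

    public void Insert(string title, string content, int categoryId, string linkUrl)
    {
        using (SqlConnection conn = new SqlConnection(_connectionString))
        using (SqlCommand cmd = new SqlCommand(
            "INSERT INTO Blogs (Title, Content, CategoryId, LinkUrl) " +
            "VALUES (@Title, @Content, @CategoryId, @LinkUrl)", conn))
        {
            // Parameterized values keep crawled content from injecting SQL
            cmd.Parameters.AddWithValue("@Title", title);
            cmd.Parameters.AddWithValue("@Content", content);
            cmd.Parameters.AddWithValue("@CategoryId", categoryId);
            cmd.Parameters.AddWithValue("@LinkUrl", linkUrl);
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}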

2. Create the BlogRegexController

using System.Web.Mvc;

public class BlogRegexController : Controller
{
    public void ExecuteRegex()
    {
        // The base address of the Blog Park list pages
        string strBaseUrl = "http://www.cnblogs.com/p";

        // The homepage list has at most 200 pages, so the loop runs 200 times
        for (int i = 1; i <= 200; i++)
        {
            string strUrl = strBaseUrl + i.ToString();

            // The class that does the actual crawling of the Blog Park pages
            BlogRegex blogRegex = new BlogRegex();
            string result = blogRegex.SendUrl(strUrl);
            blogRegex.AnalysisHtml(result);

            Response.Write("Get Success");
        }
    }

    // GET: /BlogRegex/
    public ActionResult Index()
    {
        ExecuteRegex();
        return View();
    }
}

The ExecuteRegex() method in the controller is what performs the capture of the blog list data.

3. Next is the definition of the BlogRegex class, which is responsible for crawling the Blog Park list data and inserting it into the database.

using System;
using System.IO;
using System.Net;

public class BlogRegex
{
    // Responsible for inserting data into the database through the SqlHelper class
    public void Insert(string title, string content, string linkUrl,
                       int categoryId = 0)  // assumed default; the value is truncated in the source
    {
        SqlHelper helper = new SqlHelper();
        helper.Insert(title, content, categoryId, linkUrl);
    }

    /// <summary>
    /// Initiate a request to the URL address and get back the HTML content of the page
    /// </summary>
    /// <param name="strUrl"></param>
    /// <returns></returns>
    public string SendUrl(string strUrl)
    {
        try
        {
            WebRequest webRequest = WebRequest.Create(strUrl);
            WebResponse webResponse = webRequest.GetResponse();
            StreamReader reader = new StreamReader(webResponse.GetResponseStream());
            string result = reader.ReadToEnd();
            return result;
        }
        catch (Exception)
        {
            throw;  // rethrow without losing the stack trace
        }
    }

    /// <summary>
    /// Parse the HTML and extract the specific data from it
    /// </summary>
    /// <param name="htmlContent"></param>
    public void AnalysisHtml(string htmlContent)
    {
        // The regular expression pieced together in the Regulator tool; also note
        // the escape-character handling. The pattern (and the remainder of this
        // method) is cut off in the source.
        string strPattern = "<div\\s*class=\"post_item\">\\s*.*\\s*.*\\s*.*\\s*.*\\s*.*\\s*.*\\s*.*\\s*<div\\s*class=\"post_item_body\">\\s*";
    }
}
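Because the source cuts AnalysisHtml off at the pattern, the following is only a sketch, under my own assumptions, of how the method could continue: it supposes the full pattern ends in named capture groups for the link, title, and summary, none of which appear in the original article.

// Requires: using System.Text.RegularExpressions;
// A sketch, not the author's code. It assumes the full pattern captures
// named groups "href", "title", and "summary"; those names are mine.
public void AnalysisHtml(string htmlContent)
{
    string strPattern = "...";  // the full pattern, elided in the source

    foreach (Match match in Regex.Matches(htmlContent, strPattern))
    {
        string title = match.Groups["title"].Value;
        string summary = match.Groups["summary"].Value;
        string linkUrl = match.Groups["href"].Value;

        // Store one list entry per match via the Insert method above
        Insert(title, summary, linkUrl);
    }
}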

4. With the code above we can easily pull the data we need for testing from Blog Park: convenient, fast, and real, and much quicker than entering it by hand.

Regular expressions should not be regarded as a language so much as a kind of syntax, because every language, C# and JavaScript included, has good support for regular expressions; only the usage syntax differs slightly. In fact, as long as we can correctly piece the expression together, crawling any content from a website becomes very easy. I once tried to crawl the first category of Taobao data and captured several million records in total; I suspect there was still plenty I failed to reach. You have to admire Taobao: the volume of data is simply enormous.

Back in the C# language we are using, regular expression support is also very good: Regex is the class used to manipulate regular expressions, and all regex operations go through it. A small example follows.
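To show what working with the Regex class looks like, here is a small self-contained example I have added (the HTML and pattern are simplified stand-ins, not the article's real ones):

using System;
using System.Text.RegularExpressions;

class RegexDemo
{
    static void Main()
    {
        string html = "<a href=\"http://www.cnblogs.com/p1\">Page 1</a>"
                    + "<a href=\"http://www.cnblogs.com/p2\">Page 2</a>";

        // Named groups pull out the link and its text in one pass
        string pattern = "<a href=\"(?<href>[^\"]+)\">(?<text>[^<]+)</a>";

        foreach (Match m in Regex.Matches(html, pattern))
        {
            Console.WriteLine("{0} -> {1}", m.Groups["text"].Value, m.Groups["href"].Value);
        }
        // Output:
        // Page 1 -> http://www.cnblogs.com/p1
        // Page 2 -> http://www.cnblogs.com/p2
    }
}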

If you are not too familiar with regular expressions, there is a "30-minute regular expression tutorial" on the Internet that you can refer to; it is very well written. Add a regular expression tool on top of that, and I believe you can crawl any content you want.

Stitching the regular expression together can take quite a long time; after all, you have to analyze the HTML structure and work out what to capture from it. I hope you can take it slowly, because as long as the expression is stitched together correctly, it will capture the correct content.
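One habit that makes the stitching less painful (my own suggestion, not something from the original article) is to grow the pattern one fragment at a time and test each step against a small sample of the page source:

using System;
using System.Text.RegularExpressions;

class StitchDemo
{
    static void Main()
    {
        // A small sample of the page, pasted from "view source"
        string sample = "<div class=\"post_item\">\n  <div class=\"post_item_body\">...</div>\n</div>";

        // Grow the pattern one fragment at a time, checking each step
        string step1 = @"<div\s*class=""post_item"">";
        string step2 = step1 + @"\s*.*\s*<div\s*class=""post_item_body"">";

        Console.WriteLine(Regex.IsMatch(sample, step1));  // True
        Console.WriteLine(Regex.IsMatch(sample, step2));  // True
    }
}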

To avoid being all words and no proof, I will show the content crawled from my Blog Park homepage. Because the homepage data keeps being updated, you can see that these records are stored in exactly the order in which they appear on Blog Park.

Each Blog Park list page holds 20 entries and there are 200 pages in total, so there are 4,000 records altogether. The data was captured correctly.

I have said before that someone who only writes code is not necessarily a qualified programmer; a programmer should reduce their workload as much as possible, because we are supposed to be intelligent people. So we should actively learn the frameworks and methods that help our work, such as IoC containers, Entity Framework, or NHibernate, to ease the burden of developing and maintaining code. After all, when we hear that requirements have changed, the usual reaction is anger, then complaint, and finally making the change anyway. If there are frameworks that can help us and put us in a better mood while maintaining code, why not use them?

Let me say one last thing: because I want to develop a simple blog-imitation site (MVC 3), I will be using a variety of techniques in preparation, and I am writing up the material I plan to use ahead of time to speed up future development.

Next time I intend to write up how to use the KindEditor text editor in MVC. If you have any good ideas or material to offer, I hope you will share them so I can add to my own understanding. Thank you, everybody.
