C# web page content capture: a simple crawler implementation (covering both dynamic and static pages)

Source: Internet
Author: User

I have been sorting out several projects recently, so I am summarizing the key knowledge points and code here for learning and exchange.

1. To crawl static webpage content, you can use System.Net.WebRequest, WebClient, and related classes.
2. For some dynamic webpages, the page content is generated dynamically by JavaScript. You can analyze how values are passed and submit the parameters in a POST request (most websites' parameters follow a pattern), or you can use the WebBrowser control to simulate clicks or pass values.

The example below uses http://www.aslan.com.cn/Code.aspx. Some of the code is as follows:
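For the static case in point 1, a minimal sketch using WebClient and HttpWebRequest might look like this (the URL is the one used later in this article; encoding detection and error handling are simplified):

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

class FetchDemo
{
    static void Main()
    {
        // WebClient: the simplest way to pull down a static page in one call.
        using (var client = new WebClient())
        {
            client.Encoding = Encoding.UTF8;
            string html = client.DownloadString("http://www.aslan.com.cn/Code.aspx");
            Console.WriteLine(html.Length);
        }

        // HttpWebRequest: the lower-level equivalent, useful when you need
        // control over headers, the HTTP method, or timeouts.
        var request = (HttpWebRequest)WebRequest.Create("http://www.aslan.com.cn/Code.aspx");
        request.Method = "GET";
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
        {
            string html = reader.ReadToEnd();
            Console.WriteLine(html.Length);
        }
    }
}
```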

// Use a WebBrowser control to visit a specified web page; address is the page URL.

private void Navigate(WebBrowser web, string address)
{
    if (string.IsNullOrEmpty(address)) return;
    if (address.Equals("about:blank")) return;
    if (!address.StartsWith("http://")) address = "http://" + address;
    try
    {
        web.Navigate(new Uri(address));
    }
    catch (System.UriFormatException)
    {
        return;
    }
}

Because we need to capture the page content and submit parameters after it loads, we must verify that loading has finished, i.e. handle the DocumentCompleted event. In practice, a single page may raise DocumentCompleted several times while loading (for example, once per frame), so here a +1/-1 counter is used to decide whether loading is truly complete.

First, bind the navigation and load-completion events in the form's Load handler.

private void GetCode3WebBrowser_Load(object sender, EventArgs e)
{
    string address = "http://www.aslan.com.cn/Code.aspx";
    this.Navigate(webBrowser1, address);
    webBrowser1.Navigated += new WebBrowserNavigatedEventHandler(webBrowser_Navigated);
    webBrowser1.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(webBrowser_DocumentCompleted);
}

 

Also define a counter:

int count = 0;

Then increment it on each Navigated event:

private void webBrowser_Navigated(object sender, WebBrowserNavigatedEventArgs e)
{
    count++;
}

In each DocumentCompleted event, decrement count. When count reaches 0, the page has finished loading and the page information can be processed.

private void webBrowser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    count = count - 1;
    string eventTarget = "dg_code$ctl24$ctl";
    if (0 == count && isComplete == false && j <= 10)
    {
        eventTarget = eventTarget + GetPage(j);
        if (!IsLastPage(webBrowser1))
        {
            InvokeScriptMethod(webBrowser1, eventTarget, "");
        }
        else
        {
            MessageBox.Show("captured");
        }
        postComplete = true;
        j++;
    }
    else if (postComplete == true)
    {
        DealWithByDom(webBrowser1);
        postComplete = false;
    }
    else if (0 == count && isComplete)
    {
        System.Windows.Forms.HtmlDocument htDoc = webBrowser1.Document;
        for (int i = 0; i  // the remainder of this loop is cut off in the source
    }
}

The rest is HTML analysis. How do we find the information we need in the mass of HTML code? Here, I use the HtmlAgilityPack library to extract content from the HTML.

HtmlAgilityPack is a library published on CodePlex (http://htmlagilitypack.codeplex.com/) and is very convenient for processing HTML files (I personally found it quite helpful).

private void DealWithByDom(WebBrowser webBro)
{
    HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

    htmlDoc.LoadHtml(webBro.DocumentText);
    HtmlNode node1 = htmlDoc.GetElementbyId("dg_code_ctl03_label5");
    HtmlNode node2 = htmlDoc.GetElementbyId("dg_code_ctl03_label6");
    HtmlNode node3 = htmlDoc.GetElementbyId("dg_code_ctl03_label7");
    HtmlNode node4 = htmlDoc.GetElementbyId("dg_code_ctl03_label8");
    HtmlNode node5 = htmlDoc.GetElementbyId("dg_code_ctl03_label9");

    DataRow dr = dt_finallyResult.NewRow();
    dr["three character code"] = node1.InnerText;
    dr["city code"] = node2.InnerText;
    dr["City Chinese name"] = node3.InnerText;
    dr["city English name"] = node4.InnerText;
    dr["country"] = node5.InnerText;
    dt_finallyResult.Rows.Add(dr);
    dataGridView1.DataSource = dt_finallyResult;  // "maid" in the source is garbled; binding the table to a DataGridView is the likely intent
}
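Besides GetElementbyId, HtmlAgilityPack can also select nodes with XPath, which is convenient when many similar cells must be pulled out at once. A small self-contained sketch (the HTML snippet and ids below are invented for illustration, not taken from the real page):

```csharp
using System;
using HtmlAgilityPack;

class XPathDemo
{
    static void Main()
    {
        // A tiny stand-in for webBro.DocumentText; in the real program the
        // HTML would come from the WebBrowser control or a WebClient download.
        string html = "<table><tr><td><span id='label5'>PEK</span></td>" +
                      "<td><span id='label6'>Beijing</span></td></tr></table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes takes an XPath expression and returns every match, so
        // one query can walk all the cells instead of one GetElementbyId
        // call per field.
        foreach (HtmlNode span in doc.DocumentNode.SelectNodes("//span"))
        {
            Console.WriteLine(span.Id + " = " + span.InnerText);
        }
    }
}
```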

That is the main body of the program. One final note: it is better not to use WebBrowser for crawling if you can avoid it. It is too slow; I only needed to capture 286 pages, yet it took nearly 10 minutes.
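If speed matters, the paging postback that the WebBrowser version triggers through InvokeScriptMethod can usually be reproduced with a plain HTTP POST: an ASP.NET page drives its events through hidden __EVENTTARGET and __VIEWSTATE fields, so you can download the page, read those fields back out, and post them yourself. A rough, untested sketch (the regex-based field extraction is a simplification, and pages that also check __EVENTVALIDATION would need that field too):

```csharp
using System;
using System.Collections.Specialized;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

class PostbackSketch
{
    // Pull a hidden input's value out of the page with a regex; a real
    // crawler would use HtmlAgilityPack here instead.
    static string HiddenField(string html, string name)
    {
        var m = Regex.Match(html, "id=\"" + name + "\" value=\"([^\"]*)\"");
        return m.Success ? m.Groups[1].Value : "";
    }

    static void Main()
    {
        string url = "http://www.aslan.com.cn/Code.aspx";
        using (var client = new WebClient())
        {
            client.Encoding = Encoding.UTF8;
            string page = client.DownloadString(url);

            // Re-post the state the server sent us, plus the event target
            // that the WebBrowser version clicked via script.
            var form = new NameValueCollection
            {
                { "__EVENTTARGET", "dg_code$ctl24$ctl" },
                { "__EVENTARGUMENT", "" },
                { "__VIEWSTATE", HiddenField(page, "__VIEWSTATE") },
            };
            byte[] reply = client.UploadValues(url, "POST", form);
            string nextPage = Encoding.UTF8.GetString(reply);
            Console.WriteLine(nextPage.Length);
        }
    }
}
```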

The following shows the program running. (For ease of display, the WebBrowser control on the left shows the result of navigating to the target website, and the DataGridView on the right shows the extracted information.)

 

 

 
