Use C# WebBrowser and Application.DoEvents() to collect dynamic web pages


Author: finallyliuyu

 

It has been more than a year since I began capturing and collecting network data as an undergraduate. I started by using regular expressions to extract information from static web pages. As the work went deeper, however, I found that many pages cannot be captured with regular expressions alone. For example, the next-page links of many pages are generated by JavaScript functions, such as <li><a href="#" onclick="javascript:gotopage('2')">2</a></li>; even if a regular expression extracts the href, it cannot obtain the next-page link. Likewise, when a URL contains a "#" fragment, the source stream obtained with HttpWebRequest and HttpWebResponse differs from the page view you see in a browser. Regular expressions alone are therefore powerless against dynamic web pages driven by JS scripts.
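To make the problem concrete, here is a minimal sketch of the static approach just described, with a placeholder URL: it downloads the raw HTML with HttpWebRequest and extracts every href with a regular expression. For a pager rendered like the <li> above, the captured href is just "#", so the next-page URL never shows up.

using System;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

class StaticFetchDemo
{
    static void Main()
    {
        // Placeholder URL: the raw HTML stream, unlike the browser view,
        // contains no JS-generated links.
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://example.com/forum/list");
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            string html = reader.ReadToEnd();
            // Capture every href attribute value.
            foreach (Match m in Regex.Matches(html, "href\\s*=\\s*\"([^\"]*)\""))
            {
                // For JS-driven pagers this prints only "#".
                Console.WriteLine(m.Groups[1].Value);
            }
        }
    }
}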
What should I do?

DOM + regular expressions + a browser component can solve the problems above.

DOM (Document Object Model) is an interface standard for parsing an HTML page into a tree. For a DOM tutorial, see http://www.w3.org/DOM/. Although that material covers the JavaScript DOM interface, DOM is an interface standard, so the DOM interfaces implemented by other languages are similar.

Regular expressions: they play an indispensable role in text matching; the DOM cannot replace this powerful tool.

Browser component: it can interpret JS statements, and with its help our work becomes much easier. (Some fellow bloggers have also suggested XPath, WebRequest, and the like, which I have never used; if you are familiar with them, feel free to share.)

The program is built on the Visual Studio 2008 C# WinForms platform.

To use regular expressions, add the following statement at the top of the file:

using System.Text.RegularExpressions;

To call the DOM component, add Microsoft.mshtml to the project's references.

The browser component is the WinForms WebBrowser control.

 

First, we build a simple browser into the program: a ComboBox (displaying the URL of the current page) plus Go and Back buttons that drive the browser view. The implementation code is as follows:

 

Go and Back functions of the simple in-program browser

private void btnGo_Click(object sender, EventArgs e)
{
    string url = comboBox1.Text.Trim();
    webBrowser1.Navigate(url);
}

private void btnBack_Click(object sender, EventArgs e)
{
    webBrowser1.GoBack();
}

 

Going forward and back is not enough. We want the URL in the ComboBox to refresh whenever the browser view refreshes, so we handle the browser's Navigated event and update the text displayed in the ComboBox. The code is as follows:

 

private void webBrowser1_Navigated(object sender, WebBrowserNavigatedEventArgs e)
{
    comboBox1.Text = webBrowser1.Url.ToString();
}

 

This is still not enough. If you run the code above and click a link inside the WebBrowser, the new page opens in the local IE browser instead. We therefore also handle the NewWindow event. The code is as follows:

 

WebBrowser NewWindow event handler

private void webBrowser1_NewWindow(object sender, CancelEventArgs e)
{
    e.Cancel = true;   // suppress the external IE window
    if (webBrowser1.Document.ActiveElement != null)
    {
        string href = webBrowser1.Document.ActiveElement.GetAttribute("href");
        webBrowser1.Navigate(href);
        comboBox1.Text = href;
    }
}

 

With the code above in place, the program embeds a simple IE browser. The remaining question is how to design the crawler logic so that it becomes an automatic crawling robot. (A note up front: this post only provides the framework of such a robot; to capture a specific page's blocks of interest, or BOI, you still need to configure templates for each page's particular layout.)

For your convenience, the following are my task requirements:

 

The level-1 index page contains several links to level-2 index pages, and each level-2 index page contains several links to body pages. The goal is to obtain the links to the level-2 index pages from the level-1 index page, traverse all level-2 index pages, and extract and save the links to the body pages. The difficulty is that the level-1 index page points only to the first page of each level-2 index, while each level-2 index also has several subsequent pages; both the first page and the subsequent pages contain links to body pages. For example, a small forum (regarded as the level-1 index page) has three boards (regarded as level-2 index pages), such as food and IT; each board is split across many pages, and every page carries links to specific content (the body pages).

Therefore the crawler must page downward starting from the first page of each level-2 index. (The original post shows screenshots of the next-page links of a level-2 index page here.) Moreover, all of these URLs are generated by JavaScript functions, so there is no way to extract them with regular expressions.

My idea is this: locate the current page, use the DOM to find the next-page anchor of the current page, and simulate a click on it.

The method is to first determine the page number of the current page (this is not difficult, because all of the page-number links sit in one block, and the current page is the only number that is not a link), then collect all of the page numbers shown on the current page and sort them in ascending order. Compare curPageId + 1 with the maximum page number shown. If curPageId + 1 <= maxPageId, the anchor for the next page is visible and can be located directly; otherwise, check whether there is a "next" anchor tag. If there is, it points to the next page of the current page; if there is no "next" anchor, the last page has been reached.
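A minimal sketch of that decision, assuming the page numbers visible in the pager block have already been parsed into a list (all names here are mine, not from the original program):

using System.Collections.Generic;

enum NextPageAction { ClickPageNumber, ClickNextAnchor, LastPageReached }

// Sketch of the paging decision: visiblePageIds are the page numbers shown in
// the pager block; hasNextAnchor says whether a "next" anchor tag is present.
static NextPageAction DecideNextStep(int curPageId, List<int> visiblePageIds, bool hasNextAnchor)
{
    visiblePageIds.Sort();                                  // ascending order
    int maxPageId = visiblePageIds[visiblePageIds.Count - 1];
    if (curPageId + 1 <= maxPageId)
        return NextPageAction.ClickPageNumber;              // anchor for curPageId + 1 is visible
    if (hasNextAnchor)
        return NextPageAction.ClickNextAnchor;              // a page exists beyond the visible block
    return NextPageAction.LastPageReached;                  // no next page
}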

 

The most complicated part of implementing all this is handling the WebBrowser control's asynchronous loading. Of the material I found online, the best is: http://www.hackpig.cn/post/28.html
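The gist of that approach, on which the rest of this post relies, is to navigate and then keep pumping Windows messages until the document finishes loading. A minimal sketch of the pattern (the helper name is mine):

// Navigate "synchronously": block the calling logic, but keep pumping
// messages so the WebBrowser, which lives on this same UI thread, can
// finish loading and raise its events.
private void NavigateAndWait(string url)
{
    webBrowser1.Navigate(url);
    while (webBrowser1.ReadyState != WebBrowserReadyState.Complete)
    {
        Application.DoEvents();
    }
}

Polling ReadyState alone can misfire on pages with frames or redirects, which is why the program below instead waits on boolean flags that are cleared from the DocumentCompleted handler.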

The method in this post builds on the approach in that link, with some improvements. Next, let me describe how my program works.

In actual operation, only two buttons are used: journalMap and buildWorkflow.

Pressing journalMap builds a global data structure that stores the first-page address of every level-2 index page.

Pressing buildWorkflow starts the crawler, which automatically traverses the first-page addresses of all level-2 index pages and crawls the URLs of the body pages.

In use, you must first press the journalMap button and wait for the prompt that the level-2 index first-page addresses have been extracted, then press the buildWorkflow button to let the program run automatically.
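The post does not show the journalMap handler itself. A minimal sketch of what it must do, reusing the NavigateAndWait helper from above (the URL and the link filter are placeholders; the real selection criterion depends on the site):

// Hypothetical sketch: fill issuesMap with the first-page addresses of the
// level-2 index pages found on the level-1 index page.
private void btnJournalMap_Click(object sender, EventArgs e)
{
    issuesMap.Clear();
    NavigateAndWait("http://example.com/level1-index");   // placeholder URL
    foreach (HtmlElement a in webBrowser1.Document.GetElementsByTagName("a"))
    {
        string href = a.GetAttribute("href");
        if (!string.IsNullOrEmpty(href) && href.Contains("/board/"))   // hypothetical site-specific filter
        {
            issuesMap.Add(href);
        }
    }
    MessageBox.Show("Level-2 index first pages extracted: " + issuesMap.Count);
}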

To keep the program's running logic correct, declare four signal variables on the form:

 

public bool mySignal1;   // whether the btnWorkflow button has been clicked
public bool mySignal2;   // whether the paging sub-loop is running
public bool loading;     // flag between the workflow loop and the WebBrowser (index first page)
public bool subLoading;  // flag between the paging sub-loop and the WebBrowser (subsequent pages)

 

 

They receive the following initial assignment:

 

Initial assignment of the signal variables

public Form1()
{
    InitializeComponent();
    mySignal1 = false;
    mySignal2 = false;
    loading = true;
    subLoading = true;
    issuesMap = new List<string>();
}

 

 

The following shows the code of btnWorkflow_Click and webBrowser1_DocumentCompleted, to see how the two interact.

Workflow button code; clicking this button makes the crawler work automatically

private void btnWorkflow_Click(object sender, EventArgs e)
{
    mySignal1 = true;
    List<ArticlePage> arListCurrentPage;
    foreach (string s in issuesMap)
    {
        loading = true;
        string tmpUrl = s;
        webBrowser1.Navigate(tmpUrl);
        while (loading == true)
        {
            Application.DoEvents();   // pump messages until DocumentCompleted clears the flag
        }

        arListCurrentPage = GetArticlePageInfoFromCurrentDirPage();
        if (arListCurrentPage != null)
        {
            InsertTitleUrlToDatabase(arListCurrentPage);
        }

        mySignal2 = true;
        // Click through the next-page links of the current level-2 index.
        while (AnchorNextPage())
        {
            subLoading = true;
            while (subLoading)
            {
                Application.DoEvents();
            }
            arListCurrentPage = GetArticlePageInfoFromCurrentDirPage();
            if (arListCurrentPage != null)
            {
                InsertTitleUrlToDatabase(arListCurrentPage);
            }
        }
        mySignal2 = false;
    }
}

 

 

 

 

The DocumentCompleted handler updates the signals

private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    if (webBrowser1.ReadyState == WebBrowserReadyState.Complete)
    {
        if (mySignal1)
        {
            if (!mySignal2)
            {
                loading = false;      // the first page of a level-2 index finished loading
            }
            else
            {
                subLoading = false;   // a subsequent page finished loading
            }
        }
    }
}

 

Here is how btnWorkflow interacts with the WebBrowser. Pressing btnWorkflow sets mySignal1 to true. From then on, whenever a WebBrowser document finishes loading, loading is set to false, which lets the code after the first wait loop in btnWorkflow_Click execute: it extracts the body URLs on the first page of the level-2 index and saves them to the database. Then mySignal2 is set to true, so once each subsequent document finishes loading it is subLoading that is set to false. This keeps the inner loop of btnWorkflow_Click running continuously: it repeatedly extracts the URL links of the current page and turns to the next page, until no further page can be turned. The inner loop then exits and mySignal2 is set back to false, so the next document load resets loading to false and the first page of the next level-2 index can be processed.

 

While implementing this I consulted many code snippets online. Many of them parse the content, obtain the next-page link, and turn the page inside the webBrowser1_DocumentCompleted handler. Done that way, it is easy to extract duplicate information (that is, the same page gets extracted two or three times).
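One simple guard against such duplicates (my addition, not part of the original program, and assuming an ArticlePage exposes a Url property) is to remember every URL already saved:

// Hypothetical duplicate guard: remember every URL already saved so a page
// extracted twice is not inserted twice.
private readonly HashSet<string> seenUrls = new HashSet<string>();

private void InsertTitleUrlToDatabaseOnce(List<ArticlePage> pages)
{
    foreach (ArticlePage p in pages)
    {
        if (seenUrls.Add(p.Url))   // Add returns false if the URL was seen before
        {
            InsertTitleUrlToDatabase(new List<ArticlePage> { p });
        }
    }
}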

Finally, here is the code that uses the DOM to click the next page in the program:

 

Function that clicks the next-page anchor of the current page

private bool AnchorNextPage()
{
    bool rstStatus = false;
    // ...... the code omitted here uses regular expressions and DOM functions
    // (GetElementsByTagName, GetElementById, etc.) to locate the next-page
    // anchor of the current page and store it in htmlElemNext.
    if (htmlElemNext != null)
    {
        mshtml.IHTMLElement anchor = (mshtml.IHTMLElement)htmlElemNext.DomElement;
        anchor.click();   // simulate a click
        rstStatus = true;
    }
    return rstStatus;
}
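The post leaves that locating code elided. One possible implementation of the step (entirely a sketch of mine; real pagers need site-specific patterns) walks the anchors, prefers the one whose text is curPageId + 1, and falls back to a "next" anchor:

// Hypothetical sketch of the elided locating step: find the anchor for
// page curPageId + 1, falling back to a "next" anchor.
private HtmlElement FindNextPageAnchor(int curPageId)
{
    HtmlElement nextAnchor = null;
    foreach (HtmlElement a in webBrowser1.Document.GetElementsByTagName("a"))
    {
        string text = (a.InnerText ?? "").Trim();
        if (text == (curPageId + 1).ToString())
            return a;                                   // numbered anchor wins
        if (Regex.IsMatch(text, "next", RegexOptions.IgnoreCase))
            nextAnchor = a;                             // remember a "next" anchor as fallback
    }
    return nextAnchor;                                  // null means last page reached
}

The returned element would play the role of htmlElemNext above; a null return means the last page has been reached.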

 

Appendix: I would like to thank several fellow bloggers, such as Naber, QianYu, and bishui hantan, for answering my questions.
