It has been more than a year since I started capturing web data as an undergraduate. I began by using regular expressions on static web pages to extract information. As the work went deeper, however,
I found that many web pages cannot be captured with regular expressions alone. For example, the next-page links of many pages are generated by JavaScript functions, such as
<li><a href="#" onclick="javascript:gotoPage(2)">2</a></li>. Extracting the href with a regular expression yields only "#", so the next page cannot be reached that way.
In addition, if the URL contains a "#", the page source obtained through HttpWebRequest and HttpWebResponse differs from the page you see in the browser, because the part after "#" is never sent to the server and is handled by script on the client. Regular expressions alone therefore seem powerless against dynamic pages driven by JS scripts.
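To make the limitation concrete, here is a minimal C# sketch run against the pager link quoted above. The names `PagerLinkDemo`, `ExtractHrefs`, and `ExtractPageNumbers` are hypothetical, and the second pattern assumes the page always calls a function named `gotoPage` with a single integer argument, as in that snippet.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PagerLinkDemo
{
    // Collects the raw href value of every <a> tag in the markup.
    public static List<string> ExtractHrefs(string html)
    {
        var result = new List<string>();
        string pattern = "<a\\b[^>]*?\\bhref\\s*=\\s*\"([^\"]*)\"";
        foreach (Match m in Regex.Matches(html, pattern, RegexOptions.IgnoreCase))
            result.Add(m.Groups[1].Value);
        return result;
    }

    // Collects the integer argument of every gotoPage(...) call (assumed function name).
    public static List<int> ExtractPageNumbers(string html)
    {
        var result = new List<int>();
        foreach (Match m in Regex.Matches(html, @"gotoPage\s*\(\s*(\d+)\s*\)"))
            result.Add(int.Parse(m.Groups[1].Value));
        return result;
    }

    public static void Main()
    {
        string html = "<li><a href=\"#\" onclick=\"javascript:gotoPage(2)\">2</a></li>";
        foreach (string href in ExtractHrefs(html))
            Console.WriteLine("href: " + href);   // prints "href: #"
        foreach (int page in ExtractPageNumbers(html))
            Console.WriteLine("page: " + page);   // prints "page: 2"
    }
}
```

The href gives nothing to follow, and the page number by itself is not a URL either: something still has to execute `gotoPage` to produce the next page.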
So what can be done?
Combining the DOM, regular expressions, and a browser component solves the problems above.
DOM (Document Object Model): an interface standard that represents an HTML page as a tree. For a DOM tutorial, see http://www.w3.org/DOM/. That material covers the JavaScript DOM interface, but because the DOM is an interface standard, the DOM interfaces implemented in other languages are similar.
Regular expressions: indispensable for text matching. The DOM cannot replace this powerful tool.
Browser component: includes the ability to interpret JS statements, so using one saves a lot of effort. (Some readers in the community have also suggested XPath, WebRequest, and so on. I have never used them; if you are familiar with them, feel free to share.)
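As a rough illustration of how the three pieces can cooperate, the sketch below walks the links through the DOM, uses a regular expression to find the page number, and lets the browser component run the script. It is an assumption-laden sketch, not the method of this article: `DynamicPagerSketch`, `FindPageArgument`, and `GoToNextPage` are hypothetical names, the page is assumed to define a global `gotoPage` function, and the managed `HtmlDocument` wrapper of the WebBrowser control is used here instead of the mshtml interfaces.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Windows.Forms;

public static class DynamicPagerSketch
{
    // Regex part: returns the argument of gotoPage(...) in the markup, or -1 if absent.
    public static int FindPageArgument(string outerHtml)
    {
        Match m = Regex.Match(outerHtml ?? "", @"gotoPage\s*\(\s*(\d+)\s*\)");
        return m.Success ? int.Parse(m.Groups[1].Value) : -1;
    }

    // DOM + browser part: call this once the page has finished loading,
    // for example from the WebBrowser control's DocumentCompleted handler.
    public static bool GoToNextPage(WebBrowser browser, int currentPage)
    {
        if (browser.Document == null)
            return false;

        foreach (HtmlElement link in browser.Document.GetElementsByTagName("a"))
        {
            int page = FindPageArgument(link.OuterHtml);
            if (page == currentPage + 1)
            {
                // Let the browser's script engine run the page's own function.
                browser.Document.InvokeScript("gotoPage", new object[] { page });
                return true;
            }
        }
        return false; // no link pointing at the next page was found
    }
}
```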
The implementation uses C# WinForms in VS2008.
To use regular expressions on this platform, add a using directive at the top of the source file:
using System.Text.RegularExpressions;
To use the DOM component, add Microsoft.mshtml to the project's references.
The browser component is the WebBrowser control.
First, we need to build a simple browser into the program. It needs a ComboBox (showing the URL of the current page) and forward and back buttons that drive the browser and refresh its view. The implementation is as follows:
Forward and back functions of the simple browser:
private void btnGo_Click(object sender, EventArgs e)
{
    // Navigate to the URL typed into the combo box.
    string url = comboBox1.Text.Trim();
    webBrowser1.Navigate(url);
}
private void btnBack_Click(object sender, EventArgs e)
{
    // Return to the previous page in the browsing history.
    webBrowser1.GoBack();
}
Moving forward and back is not enough. After the browser view changes, the URL shown in the ComboBox should change too, so we handle the browser's Navigated event and update the ComboBox text there. The code is as follows:
private void webBrowser1_Navigated(object sender, WebBrowserNavigatedEventArgs e)
{
    // Keep the address box in sync with the page that was just loaded.
    comboBox1.Text = webBrowser1.Url.ToString();
}
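The handlers so far assume the controls and event subscriptions were created in the Visual Studio designer. For readers who want to see the wiring in one place, here is a minimal sketch that builds the same pieces in code. The class name `SimpleBrowserForm`, the layout, and the empty-URL guard are assumptions added for illustration; the control and handler names follow the ones used in this article.

```csharp
using System;
using System.Windows.Forms;

public class SimpleBrowserForm : Form
{
    private readonly ComboBox comboBox1 = new ComboBox();
    private readonly Button btnGo = new Button();
    private readonly Button btnBack = new Button();
    private readonly WebBrowser webBrowser1 = new WebBrowser();

    public SimpleBrowserForm()
    {
        // Toolbar row on top, browser filling the rest of the window.
        var bar = new FlowLayoutPanel { Dock = DockStyle.Top, AutoSize = true };
        comboBox1.Width = 400;
        btnGo.Text = "Go";
        btnBack.Text = "Back";
        bar.Controls.AddRange(new Control[] { comboBox1, btnGo, btnBack });

        webBrowser1.Dock = DockStyle.Fill;
        Controls.Add(webBrowser1); // add the Fill control first so it takes the leftover space
        Controls.Add(bar);

        // Event wiring that the designer would normally generate.
        btnGo.Click += btnGo_Click;
        btnBack.Click += btnBack_Click;
        webBrowser1.Navigated += webBrowser1_Navigated;
    }

    private void btnGo_Click(object sender, EventArgs e)
    {
        string url = comboBox1.Text.Trim();
        if (url.Length == 0) return; // guard added for this sketch
        webBrowser1.Navigate(url);
    }

    private void btnBack_Click(object sender, EventArgs e)
    {
        webBrowser1.GoBack();
    }

    private void webBrowser1_Navigated(object sender, WebBrowserNavigatedEventArgs e)
    {
        comboBox1.Text = webBrowser1.Url.ToString();
    }

    [STAThread]
    public static void Main()
    {
        Application.EnableVisualStyles();
        Application.Run(new SimpleBrowserForm());
    }
}
```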
This is still not enough. With the code above in place, you will find that clicking a link in the WebBrowser control can open the new page in the local IE instead. We therefore also handle the NewWindow event, as follows:
WebBrowser NewWindow event:
private void webBrowser1_NewWindow(object sender, CancelEventArgs e)
{
    e.Cancel = true;
    if (webBrowser1.Document.ActiveElement != null)
    {
        webBrowser1.Navigate(webBrowser1.Document.ActiveElement.GetAttribute("