Two ways to analyze HTML Web pages

Source: Internet
Author: User
Tags: object, .net, regular expression, string, client, web page

Someone asked me how to pull a Web page down and extract its contents. This is actually the most basic work of a search engine: download, extract, then download again. I did a search-engine project years ago, but the code is gone. Since I was asked about this again, I have written down two methods.

Method A: load the Web page into a hidden WebBrowser control in a WinForm, and use IHTMLDocument to analyze the content. This method is relatively simple, but it is slow when analyzing a large number of pages.

The main code used in this method is as follows:
private void button1_Click(object sender, System.EventArgs e)
{
    object url = "http://www.google.com";
    object nothing = null;
    // Attach the handler before navigating so the completion event is not missed.
    this.axWebBrowser1.DownloadComplete += new System.EventHandler(this.button2_Click);
    this.axWebBrowser1.Navigate2(ref url, ref nothing, ref nothing, ref nothing, ref nothing);
}

private void button2_Click(object sender, System.EventArgs e)
{
    this.textBox1.Text = "";
    mshtml.IHTMLDocument2 doc = (mshtml.IHTMLDocument2)this.axWebBrowser1.Document;
    mshtml.IHTMLElementCollection all = doc.all;
    System.Collections.IEnumerator enumerator = all.GetEnumerator();
    while (enumerator.MoveNext() && enumerator.Current != null)
    {
        mshtml.IHTMLElement element = (mshtml.IHTMLElement)enumerator.Current;
        if (this.checkBox1.Checked)
        {
            this.textBox1.Text += "\r\n\r\n" + element.innerHTML;
        }
        else
        {
            this.textBox1.Text += "\r\n\r\n" + element.outerHTML;
        }
    }
}

Method B: use System.Net.WebClient to download the Web page into a local file or a string, then parse it with regular expressions. This method can be used in many applications, such as a Web crawler.

Here is an example that extracts all the hyperlinks from the http://www.google.com home page:

using System;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

namespace httpget
{
    class Class1
    {
        [STAThread]
        static void Main(string[] args)
        {
            WebClient client = new WebClient();
            byte[] page = client.DownloadData("http://www.google.com");
            string content = System.Text.Encoding.UTF8.GetString(page);
            // Match href attributes pointing at absolute or relative URLs.
            string regex = "href=[\\\"\\\'](http:\\/\\/|\\.\\/|\\/)\\w+(\\.\\w+)*(\\/\\w+(\\.\\w+)?)*(\\/|\\?\\w*=\\w*(&\\w*=\\w*)*)?[\\\"\\\']";
            Regex re = new Regex(regex);
            MatchCollection matches = re.Matches(content);

            System.Collections.IEnumerator enu = matches.GetEnumerator();
            while (enu.MoveNext() && enu.Current != null)
            {
                Match match = (Match)enu.Current;
                Console.Write(match.Value + "\r\n");
            }
        }
    }
}

Real crawlers all use regular expressions to do the extraction; you can find open-source crawlers, and their code is similar. A slightly more advanced crawler can also extract URLs from Flash or JavaScript.
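To illustrate the JavaScript case, here is a minimal sketch in JavaScript. The script snippet and the URL pattern are my own illustration, not taken from any particular crawler: the idea is simply to run a URL-shaped regular expression over the text of a downloaded script block.

```javascript
// A hypothetical fragment of page script, as a crawler would see it
// after downloading the HTML. The URLs live inside string literals.
const scriptText =
  'var next = "http://www.google.com/search?q=test";' +
  'window.location.href = "http://www.google.com/intl/en/about/";';

// Match http:// or https:// followed by characters that can appear in
// a URL, stopping at quotes, angle brackets, or whitespace.
const urlPattern = /https?:\/\/[^\s"'<>]+/g;

const urls = scriptText.match(urlPattern) || [];
urls.forEach(u => console.log(u));
```

A real crawler would run a pattern like this over every script block it finds in the page, in addition to the href extraction shown above.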

One more addition: I was asked whether an element created with document.write can be fetched from the DOM. The answer is yes, and the method is the same as usual. The following HTML reads, through the DOM, an HTML tag that was dynamically generated with document.write:

<FORM>
<SCRIPT>
document.write("<input type=button id='btn1' value='button 1'>");
</SCRIPT>
<input onclick=show() type=button value="click me">
</FORM>
<textarea id=allnode rows=29 cols=53></textarea>
<SCRIPT>
function show()
{
    var box = document.all.item("allnode");
    box.innerText = "";
    for (var i = 0; i < document.forms[0].childNodes.length; i++)
    {
        var node = document.forms[0].childNodes[i];
        box.innerText = box.innerText + "\r\n" + node.tagName + " " + node.value;
    }
}
</SCRIPT>

After clicking "click me", "button 1" appears in the printed list of the child elements of forms[0].


