Pulling down Web pages and extracting their contents
Someone asked me how to pull a Web page down and extract its contents. This is actually the most basic work of a search engine: download, extract, then download again. I did a search engine project in my early years, but that code is gone. Now that someone has asked me about this again, I am writing down two methods.
Method A: download the page in a WinForm with a hidden browser control, then use IHTMLDocument to analyze the content. This method is relatively simple, but it is slow when a large number of pages have to be analyzed.
The main code used in this method is as follows:
private void button1_Click(object sender, System.EventArgs e)
{
    object url = "http://www.google.com";
    object nothing = null;
    // Hook the event before navigating so the completion is not missed.
    this.axWebBrowser1.DownloadComplete += new System.EventHandler(this.button2_Click);
    this.axWebBrowser1.Navigate2(ref url, ref nothing, ref nothing, ref nothing, ref nothing);
}

private void button2_Click(object sender, System.EventArgs e)
{
    this.textBox1.Text = "";
    mshtml.IHTMLDocument2 doc = (mshtml.IHTMLDocument2)this.axWebBrowser1.Document;
    mshtml.IHTMLElementCollection all = doc.all;
    System.Collections.IEnumerator enumerator = all.GetEnumerator();
    while (enumerator.MoveNext() && enumerator.Current != null)
    {
        mshtml.IHTMLElement element = (mshtml.IHTMLElement)enumerator.Current;
        if (this.checkBox1.Checked)
        {
            this.textBox1.Text += "\r\n\r\n" + element.innerHTML;
        }
        else
        {
            this.textBox1.Text += "\r\n\r\n" + element.outerHTML;
        }
    }
}
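A note on Method A: DownloadComplete can fire more than once per navigation (for example on pages with frames), and hooking the handler inside a click handler adds another subscription on every click. Subscribing once, for example in the form's constructor, and checking the document's readyState, or keying off DocumentComplete instead, tends to be more reliable.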
Method B: use System.Net.WebClient to download the Web page and save it to a local file or a string, then parse it with regular expressions. This method can be used in many Web applications, such as a web crawler.
Here is an example that extracts all the hyperlinks from the http://www.google.com home page:
using System;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

namespace HttpGet
{
    class Class1
    {
        [STAThread]
        static void Main(string[] args)
        {
            // Download the page as raw bytes and decode it as UTF-8.
            WebClient client = new WebClient();
            byte[] page = client.DownloadData("http://www.google.com");
            string content = Encoding.UTF8.GetString(page);
            // Match href attributes holding absolute or relative URLs.
            string regex = "href=[\\\"\\\'](http:\\/\\/|\\.\\/|\\/)\\w+(\\.\\w+)*(\\/\\w+(\\.\\w+)?)*(\\/|\\?\\w*=\\w*(&\\w*=\\w*)*)?[\\\"\\\']";
            Regex re = new Regex(regex);
            MatchCollection matches = re.Matches(content);
            System.Collections.IEnumerator enu = matches.GetEnumerator();
            while (enu.MoveNext() && enu.Current != null)
            {
                Match match = (Match)enu.Current;
                Console.Write(match.Value + "\r\n");
            }
        }
    }
}
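Method B mentions saving the page to a local file as well; a minimal sketch of that variant, using WebClient.DownloadFile (the file name page.html is just an illustration):

using System.Net;

class SaveDemo
{
    static void Main()
    {
        // Save the page straight to disk instead of parsing it in memory.
        WebClient client = new WebClient();
        client.DownloadFile("http://www.google.com", "page.html");
        // The saved file can then be read back and fed to the same regex.
    }
}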
Real crawlers all use regular expressions for the extraction; you can find some open-source crawlers, and their code is similar. Going a step further, you can also extract URLs from Flash or JavaScript.
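For JavaScript the idea is the same as Method B, just applied to the script text: scan it with a plain URL pattern. The snippet below is only a sketch; the sample script line and the pattern are my own illustrations, not taken from a real crawler:

using System;
using System.Text.RegularExpressions;

class ScriptLinkDemo
{
    static void Main()
    {
        // A fragment like those found inside <script> blocks.
        string script = "window.location = 'http://www.google.com/intl/en/about.html';";
        // Look for bare http:// URLs anywhere in the text, quoted or not.
        Regex urlPattern = new Regex("http://[\\w./?=&%-]+");
        foreach (Match m in urlPattern.Matches(script))
        {
            Console.WriteLine(m.Value);
        }
    }
}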
One more thing: I was asked whether an element generated with document.write can be fetched from the DOM. The answer is yes, and the method is the same as usual. The following HTML dynamically generates an HTML tag with document.write and then lists it through the DOM:
<FORM>
<SCRIPT>
document.write("<input type=button id='btn1' value='button 1'>");
</SCRIPT>
<input onclick=show() type=button value="click me">
</FORM>
<textarea id=allnode rows=29 cols=53></textarea>
<SCRIPT>
function show()
{
    document.all.item("allnode").innerText = "";
    var i = 0;
    for (i = 0; i < document.forms[0].childNodes.length; i++)
    {
        document.all.item("allnode").innerText = document.all.item("allnode").innerText
            + "\r\n" + document.forms[0].childNodes[i].tagName
            + " " + document.forms[0].childNodes[i].value;
    }
}
</SCRIPT>
After clicking "click me", "button 1" shows up in the printed list of forms[0]'s child elements, which confirms that the dynamically written element is part of the DOM.
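The same holds on the Method A side: once the page has loaded, an element written by document.write shows up when walking IHTMLDocument2 just like any static element. A minimal sketch, assuming the axWebBrowser1 control from Method A has already loaded the page above:

// Inside a handler such as button2_Click from Method A.
mshtml.IHTMLDocument2 doc = (mshtml.IHTMLDocument2)this.axWebBrowser1.Document;
foreach (mshtml.IHTMLElement element in doc.all)
{
    // The dynamically written button appears here with id "btn1".
    if (element.id == "btn1")
    {
        this.textBox1.Text += "\r\n" + element.outerHTML;
    }
}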