First, download the entire content of the web page. The data arrives as a byte[] (network transmission works in bytes) and is then converted to a string so that it is easier to work with. For example:
// Requires: using System.Net; using System.Text;
private static string GetPageData(string url)
{
    if (url == null || url.Trim() == "")
        return null;
    WebClient wc = new WebClient();
    wc.Credentials = CredentialCache.DefaultCredentials;
    byte[] pageData = wc.DownloadData(url);
    return Encoding.Default.GetString(pageData); // or Encoding.ASCII.GetString(pageData)
}
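Encoding.Default is the machine's default code page, so a page served in a different character set may decode incorrectly. One possible refinement, sketched here as an assumption rather than as part of the original method, is to read the charset from the response's Content-Type header. PageEncoding and FromContentType are hypothetical names, and the UTF-8 fallback is a choice made for this sketch.

```csharp
using System;
using System.Text;

public static class PageEncoding
{
    // Hypothetical helper: picks an Encoding from a Content-Type header value
    // such as "text/html; charset=utf-8". Falls back to UTF-8 (an assumption)
    // when the header is missing or names a charset the runtime does not know.
    public static Encoding FromContentType(string contentType)
    {
        if (!string.IsNullOrEmpty(contentType))
        {
            foreach (string part in contentType.Split(';'))
            {
                string p = part.Trim();
                if (p.StartsWith("charset=", StringComparison.OrdinalIgnoreCase))
                {
                    string name = p.Substring("charset=".Length).Trim('"', '\'', ' ');
                    try
                    {
                        return Encoding.GetEncoding(name);
                    }
                    catch (ArgumentException)
                    {
                        // Unknown or empty charset name: use the fallback below.
                    }
                }
            }
        }
        return Encoding.UTF8;
    }
}
```

Inside GetPageData this could be used after the download, for example PageEncoding.FromContentType(wc.ResponseHeaders["Content-Type"]).GetString(pageData), since the response headers are only available once DownloadData has returned.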
Second, once you have the data as a string, you can parse the page (in practice this comes down to string operations and regular expressions).
Several common parsing tasks:
1. Get title
Match titleMatch = Regex.Match(strResponse, "<title>([^<]*)</title>", RegexOptions.IgnoreCase | RegexOptions.Multiline);
title = titleMatch.Groups[1].Value;
2. Get descriptive information
Match desc = Regex.Match(strResponse, "<meta name=\"description\" content=\"([^<]*)\">", RegexOptions.IgnoreCase | RegexOptions.Multiline);
strDesc = desc.Groups[1].Value;
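The two snippets above assume that strResponse already holds the page source and that the match succeeds. As a minimal sketch, the same patterns can be wrapped in helpers that return an empty string when nothing matches. MetaExtractor, GetTitle and GetDescription are hypothetical names, and the description pattern is loosened here (it stops at the closing quote rather than at ">") so that self-closing meta tags also match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MetaExtractor
{
    // Returns the text inside <title>...</title>, or "" if there is no title.
    public static string GetTitle(string html)
    {
        Match m = Regex.Match(html, "<title>([^<]*)</title>", RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value.Trim() : string.Empty;
    }

    // Returns the content of <meta name="description" content="...">, or "".
    // Assumes name comes before content and both use double quotes.
    public static string GetDescription(string html)
    {
        Match m = Regex.Match(html, "<meta name=\"description\" content=\"([^\"]*)\"", RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }
}
```

For a sample string containing <TITLE> Sample page </TITLE> and <meta name="description" content="A short summary" />, GetTitle returns "Sample page" and GetDescription returns "A short summary". Pages that order the attributes differently or use single quotes would need a more tolerant pattern or an HTML parser.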
3. Get pictures
// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
public class HtmlHelper
{
    /// <summary>
    /// Extracts all picture addresses in the HTML.
    /// </summary>
    public static List<string> PickupImgUrl(string html)
    {
        // Matches <img ... src="..."> and captures the address in the named group "imgUrl".
        Regex regImg = new Regex(@"<img\b[^<>]*?\bsrc[\s\t\r\n]*=[\s\t\r\n]*[""']?[\s\t\r\n]*(?<imgUrl>[^\s\t\r\n""'<>]*)[^<>]*?/?[\s\t\r\n]*>", RegexOptions.IgnoreCase);
        MatchCollection matches = regImg.Matches(html);
        List<string> lstImg = new List<string>();
        foreach (Match match in matches)
        {
            lstImg.Add(match.Groups["imgUrl"].Value);
        }
        return lstImg;
    }

    /// <summary>
    /// Extracts the first picture address in the HTML, or an empty string if there is none.
    /// </summary>
    public static string PickupImgUrlFirst(string html)
    {
        List<string> lstImg = PickupImgUrl(html);
        return lstImg.Count == 0 ? string.Empty : lstImg[0];
    }
}
4. Remove HTML tags
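private string StripHTML(string strHtml)
{
    // Remove anything that looks like a tag: "<", one or more characters (including newlines), ">".
    Regex objRegExp = new Regex("<(.|\n)+?>");
    string strOutput = objRegExp.Replace(strHtml, "");
    // Escape any angle brackets that are left over.
    strOutput = strOutput.Replace("<", "&lt;");
    strOutput = strOutput.Replace(">", "&gt;");
    return strOutput;
}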
Some edge cases leave residue after a single pass, so it is recommended to run the conversion twice in succession. Stripping the tags also leaves whitespace where they used to be, and long runs of consecutive whitespace get in the way of later string operations. So add a statement like this:
// Collapse every run of whitespace into a single space
Regex r = new Regex(@"\s+");
wordsOnly = r.Replace(strResponse, " ");
wordsOnly = wordsOnly.Trim(); // Trim returns a new string, so assign the result
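The tag removal and whitespace cleanup described above can be put together as one helper. This is a minimal sketch under stated assumptions: TextCleaner and ToPlainText are hypothetical names, tags are replaced with a space rather than an empty string so that neighbouring words do not run together, and the second stripping pass is kept only because the text recommends it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TextCleaner
{
    // Same tag pattern as StripHTML: "<", one or more characters, ">".
    private static readonly Regex TagPattern = new Regex("<(.|\n)+?>");
    private static readonly Regex WhitespacePattern = new Regex(@"\s+");

    // Hypothetical helper: strips tags in two passes, then collapses
    // whitespace runs into single spaces and trims both ends.
    public static string ToPlainText(string html)
    {
        string text = html ?? string.Empty;
        for (int pass = 0; pass < 2; pass++)
        {
            // Replace each tag with a space (an assumption of this sketch).
            text = TagPattern.Replace(text, " ");
        }
        text = WhitespacePattern.Replace(text, " ");
        return text.Trim();
    }
}
```

For example, calling TextCleaner.ToPlainText on "<p>Hello</p>\n\n  <b>world</b>" yields "Hello world". Entities such as &amp;nbsp; are left untouched by this sketch.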