C#: Crawl page data, parse the title, description, and image information, and remove HTML tags

Source: Internet
Author: User

First, download the entire content of the web page into a byte[] (data travels over the network as bytes), then convert it to a string so that it is easier to work with. For example:

The code is as follows:

// Requires: using System.Net; using System.Text;
private static string GetPageData(string url)
{
    if (url == null || url.Trim() == "")
        return null;
    WebClient wc = new WebClient();
    wc.Credentials = CredentialCache.DefaultCredentials;
    byte[] pageData = wc.DownloadData(url);
    // Decode with the system default encoding; Encoding.ASCII.GetString is an alternative
    return Encoding.Default.GetString(pageData);
}

Second, once the data is in string form, you can parse the web page. In practice this comes down to string operations and regular expressions.

Several common parsing tasks follow:

1. Get title

The code is as follows:

// strResponse holds the page HTML as a string
Match titleMatch = Regex.Match(strResponse, "<title>([^<]*)</title>", RegexOptions.IgnoreCase | RegexOptions.Multiline);
title = titleMatch.Groups[1].Value;

2. Get the description

The code is as follows:

// Captures the content attribute of <meta name="description" content="...">
Match desc = Regex.Match(strResponse, "<meta name=\"description\" content=\"([^<]*)\">", RegexOptions.IgnoreCase | RegexOptions.Multiline);
strDesc = desc.Groups[1].Value;
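
The two fragments above assume that strResponse, title, and strDesc are already declared, and they do not check whether the match succeeded. As a minimal sketch, the same idea can be wrapped in small helpers that return an empty string when nothing is found. The class name PageMetaDemo, the helper names, and the sample HTML are hypothetical; the description pattern assumes that the name attribute comes before content and that both use double quotes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PageMetaDemo
{
    // Hypothetical helper: returns the trimmed <title> text, or "" if no title is found.
    public static string GetTitle(string html)
    {
        Match m = Regex.Match(html, @"<title[^>]*>([^<]*)</title>", RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value.Trim() : string.Empty;
    }

    // Hypothetical helper: returns the content of <meta name="description" ...>, or "".
    // Assumes name precedes content and both attributes are double-quoted.
    public static string GetDescription(string html)
    {
        Match m = Regex.Match(html,
            @"<meta\s+name=""description""\s+content=""([^""]*)""[^>]*>",
            RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value.Trim() : string.Empty;
    }

    public static void Main()
    {
        // Sample input made up for illustration
        string html = "<html><head><title> Sample Page </title>" +
                      "<meta name=\"description\" content=\"A short summary.\" />" +
                      "</head><body></body></html>";
        Console.WriteLine(GetTitle(html));        // prints "Sample Page"
        Console.WriteLine(GetDescription(html));  // prints "A short summary."
    }
}
```

Checking Match.Success avoids treating an empty capture and a missing tag as different cases, at the cost of not distinguishing them for the caller.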

3. Get images

The code is as follows:

// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
public class HtmlHelper
{
    /// <summary>
    /// Extract all image URLs from the HTML
    /// </summary>
    public static List<string> PickupImgUrl(string html)
    {
        // The named group imgUrl captures the value of the src attribute of each <img> tag
        Regex regImg = new Regex(@"<img\b[^<>]*?\bsrc[\s\t\r\n]*=[\s\t\r\n]*[""']?[\s\t\r\n]*(?<imgUrl>[^\s\t\r\n""'<>]*)[^<>]*?/?[\s\t\r\n]*>", RegexOptions.IgnoreCase);
        MatchCollection matches = regImg.Matches(html);
        List<string> lstImg = new List<string>();
        foreach (Match match in matches)
        {
            lstImg.Add(match.Groups["imgUrl"].Value);
        }
        return lstImg;
    }

    /// <summary>
    /// Extract the first image URL from the HTML, or an empty string if there is none
    /// </summary>
    public static string PickupImgUrlFirst(string html)
    {
        List<string> lstImg = PickupImgUrl(html);
        return lstImg.Count == 0 ? string.Empty : lstImg[0];
    }
}

4. Remove HTML tags

The code is as follows:

private string StripHTML(string strHtml)
{
    // Remove anything that looks like a tag, including tags that span line breaks
    Regex objRegExp = new Regex("<(.|\n)+?>");
    string strOutput = objRegExp.Replace(strHtml, "");
    // Escape any angle brackets that remain
    strOutput = strOutput.Replace("<", "&lt;");
    strOutput = strOutput.Replace(">", "&gt;");
    return strOutput;
}

Some edge cases leave tag fragments behind after a single pass, so it is recommended to run the removal twice in succession. When the HTML tags are replaced with spaces, the result can contain long runs of contiguous whitespace, which interfere with later string operations. So add statements like these:

The code is as follows:

// Collapse every run of whitespace into a single space
Regex r = new Regex(@"\s+");
wordsOnly = r.Replace(strResponse, " ");
// Trim returns a new string, so the result must be assigned
wordsOnly = wordsOnly.Trim();
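
To illustrate the two-pass removal followed by whitespace collapsing, here is a minimal sketch rather than a definitive implementation. The class name TextCleanupDemo, the helper ToPlainText, and the sample inputs are hypothetical. Note that it swaps in a simpler tag pattern, <[^<>]+>, instead of the one used in StripHTML, and replaces each tag with a space.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TextCleanupDemo
{
    // Assumed pattern: a '<', one or more characters that are not angle brackets, then '>'
    private static readonly Regex TagPattern = new Regex(@"<[^<>]+>");
    private static readonly Regex WhitespacePattern = new Regex(@"\s+");

    // Hypothetical helper: strips tags in two passes, then normalizes whitespace.
    public static string ToPlainText(string html)
    {
        string text = html;
        // Two passes: a malformed input such as "<<b>b>" still contains a tag after the first pass
        for (int pass = 0; pass < 2; pass++)
        {
            text = TagPattern.Replace(text, " ");
        }
        // Collapse runs of whitespace into one space, then trim the ends
        text = WhitespacePattern.Replace(text, " ");
        return text.Trim();
    }

    public static void Main()
    {
        Console.WriteLine(ToPlainText("<p>Hello   <b>world</b></p>\n<p>again</p>")); // prints "Hello world again"
        Console.WriteLine(ToPlainText("<<b>b>nested"));                              // prints "nested"
    }
}
```

Replacing tags with a space rather than an empty string keeps words in adjacent elements from fusing together, which is why the whitespace collapse is needed afterwards.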
