Website collection (extract HTML data based on regular expressions)

Source: Internet
Author: User

The HTML source that any website serves can be retrieved through its URL. The method is as follows:

Namespaces to be used:

using System;
using System.Collections.Generic;
using System.Text;
using System.Diagnostics;
using System.Text.RegularExpressions;
using System.IO;
using System.Net;

/// <summary>
/// Gets the source code of a web page.
/// </summary>
/// <param name="url">Web page address, e.g. "http://www.xxx.com/"</param>
/// <param name="charset">Web page encoding name, e.g. "UTF-8"</param>
/// <returns>The web page source, or an empty string if the request fails</returns>
public static string GetHtmlSource(string url, string charset)
{
    // Encoding: fall back to the system default when no charset name is given
    Encoding nowCharset;
    if (string.IsNullOrEmpty(charset))
    {
        nowCharset = Encoding.Default;
    }
    else
    {
        nowCharset = Encoding.GetEncoding(charset);
    }

    // Process content
    string html = "";
    try
    {
        // Alternative using the base WebRequest/WebResponse types:
        // WebRequest myWebRequest = WebRequest.Create(url);
        // WebResponse myWebResponse = myWebRequest.GetResponse();
        // Stream stream = myWebResponse.GetResponseStream();
        // StreamReader reader = new StreamReader(stream, nowCharset);

        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream, nowCharset))
        {
            html = reader.ReadToEnd();
        }
    }
    catch (Exception)
    {
        // Errors are swallowed; the caller receives an empty string
    }
    return html;
}

 

/// <summary>
/// Gets the source code of a web page.
/// </summary>
/// <param name="url">Web page address, e.g. "http://www.xxx.com/"</param>
/// <param name="charset">Web page encoding, e.g. Encoding.UTF8</param>
/// <returns>The web page source, or an empty string if the request fails</returns>
public static string GetHtmlSource(string url, Encoding charset)
{
    // Process content
    string html = "";
    try
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream, charset))
        {
            html = reader.ReadToEnd();
        }
    }
    catch (Exception)
    {
        // Errors are swallowed; the caller receives an empty string
    }
    return html;
}

 

/// <summary>
/// Gets the source code of a web page.
/// Works for pages that carry a BOM: StreamReader detects the encoding from
/// the byte order mark and uses Encoding.Default only as a fallback.
/// </summary>
/// <param name="url">Web page address, e.g. "http://www.xxx.com/"</param>
/// <returns>The web page source, or an empty string if the request fails</returns>
public static string GetHtmlSource(string url)
{
    // Process content
    string html = "";
    try
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream, Encoding.Default))
        {
            html = reader.ReadToEnd();
        }
    }
    catch (Exception)
    {
        // Errors are swallowed; the caller receives an empty string
    }
    return html;
}
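The first overload throws from Encoding.GetEncoding, before its try block is entered, when the charset name is not recognised. The charset handling could be isolated in a small helper that falls back instead. This is only a sketch: the names CharsetSketch and ResolveEncoding are hypothetical and not part of the original code.

```csharp
using System;
using System.Text;

public static class CharsetSketch
{
    // Hypothetical helper: returns the named encoding, or the fallback when
    // the name is null, empty, or not recognised by the runtime.
    public static Encoding ResolveEncoding(string charsetName, Encoding fallback)
    {
        if (string.IsNullOrEmpty(charsetName))
        {
            return fallback;
        }
        try
        {
            // Trim guards against stray spaces such as "UTF-8 "
            return Encoding.GetEncoding(charsetName.Trim());
        }
        catch (ArgumentException)
        {
            // Encoding.GetEncoding throws ArgumentException for unknown names
            return fallback;
        }
    }
}
```

The result can be passed to the overload that takes an Encoding, for example GetHtmlSource(url, CharsetSketch.ResolveEncoding("UTF-8", Encoding.Default)).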

 

You can call whichever overload fits the situation, for example:

string _html = Collection.GetHtmlSource("http://www.luohx.com/a.html", "UTF-8");

Query parameters can also be included in the URL, for example:

string _html = Collection.GetHtmlSource("http://www.luohx.com/a.aspx?a=1&b=2", "UTF-8");
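When parameter values contain spaces or other reserved characters, they need to be escaped before they are appended to the URL. A minimal sketch of one way to do this, assuming a hypothetical helper named BuildUrl (not part of the original code) and using Uri.EscapeDataString from the framework:

```csharp
using System;
using System.Collections.Generic;

public static class UrlSketch
{
    // Hypothetical helper: appends escaped query parameters to a base URL
    // that does not yet contain a query string.
    public static string BuildUrl(string baseUrl, IEnumerable<KeyValuePair<string, string>> query)
    {
        var parts = new List<string>();
        foreach (KeyValuePair<string, string> pair in query)
        {
            parts.Add(Uri.EscapeDataString(pair.Key) + "=" + Uri.EscapeDataString(pair.Value));
        }
        return parts.Count == 0 ? baseUrl : baseUrl + "?" + string.Join("&", parts);
    }
}
```

The returned string can then be passed as the url argument of any of the overloads above.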

 

After fetching the page source, we usually find that we need only part of it rather than the whole document, for example the HTML contained in the tag <div id="xml" class="wrap"></div>. The source therefore has to be cut down as follows:

#region Get the album page code

public string StrHtml(string url, string charset)
{
    string _html = Collection.GetHtmlSource(url, charset); // get the page HTML from the URL
    string sss = "";

    // Regular expression. Options: s = single line, i = ignore case,
    // x = ignore whitespace in the pattern. The 'div' group is a balancing
    // group that counts nested <div> tags so the match ends at the closing
    // tag that belongs to the outer <div>.
    string pattern = @"(?six)<div\s+id=""xml""\s+class=""wrap"">
        (?'mycont'
            (?>
                (?!<div\b|</div>).
                |
                <div(?:\s+(?:""[^""]*""|'[^']*'|[^""'>])*)?>(?'div')
                |
                </div>(?'-div')
            )*
            (?(div)(?!))
        )
        </div>";

    // If there are several matches, the last one wins
    foreach (Match m in Regex.Matches(_html, pattern))
    {
        sss = m.Groups["mycont"].Value;
    }

    return sss;
}

#endregion

The pattern variable holds a regular expression for the tag <div id="xml" class="wrap"></div>. The tag chosen as the reference must be unique on the page: if two or more <div id="xml" class="wrap"></div> elements exist at the same time, that tag cannot be used as a reference.
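The balancing-group technique and the uniqueness requirement can be tried on an in-memory string without any network access. The sketch below is illustrative only: the class BalancedDivSketch, the method Extract, and the id="demo" container are hypothetical, and the attribute matching for nested tags is simplified to [^>]*.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BalancedDivSketch
{
    // Same idea as the pattern above, written for a hypothetical id="demo" container.
    private const string Pattern = @"(?six)<div\s+id=""demo"">
        (?'mycont'
            (?>
                (?!<div\b|</div>).        # any character that does not start a div tag
                |
                <div\b[^>]*>(?'div')      # nested opening tag: push onto the 'div' stack
                |
                </div>(?'-div')           # nested closing tag: pop from the 'div' stack
            )*
            (?(div)(?!))                  # fail if a nested div is still open
        )
        </div>";

    // Returns the inner HTML of the container, or null unless it occurs exactly once.
    public static string Extract(string html)
    {
        MatchCollection matches = Regex.Matches(html, Pattern);
        if (matches.Count != 1)
        {
            return null;
        }
        return matches[0].Groups["mycont"].Value;
    }
}
```

Returning null when the container is missing or duplicated makes the uniqueness assumption explicit, instead of silently keeping the last match.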

Once the required block has been cut out, it still contains HTML markup. If we need content without any HTML elements, the tags have to be removed, for example:

public static string CheckStr(string html)
{
    // Lazy quantifiers (+?, *?) stop each match at the first possible end
    // instead of running on to the last one in the page
    Regex regex1 = new Regex(@"<script[\s\S]+?</script *>", RegexOptions.IgnoreCase);     // <script> blocks
    Regex regex2 = new Regex(@" href *= *[\s\S]*?script *:", RegexOptions.IgnoreCase);    // href="javascript:" prefixes
    Regex regex3 = new Regex(@" on[\s\S]*?=", RegexOptions.IgnoreCase);                   // on... event attributes
    Regex regex4 = new Regex(@"<iframe[\s\S]+?</iframe *>", RegexOptions.IgnoreCase);     // <iframe> blocks
    Regex regex5 = new Regex(@"<frameset[\s\S]+?</frameset *>", RegexOptions.IgnoreCase); // <frameset> blocks
    Regex regex6 = new Regex(@"\<img[^\>]+\>", RegexOptions.IgnoreCase);                  // <img> tags
    Regex regex7 = new Regex(@"</p>", RegexOptions.IgnoreCase);
    Regex regex8 = new Regex(@"<p>", RegexOptions.IgnoreCase);
    Regex regex9 = new Regex(@"<[^>]*>", RegexOptions.IgnoreCase);                        // any remaining tag

    html = regex1.Replace(html, "");
    html = regex2.Replace(html, "");
    html = regex3.Replace(html, " _disibledevent=");
    html = regex4.Replace(html, "");
    html = regex5.Replace(html, "");
    html = regex6.Replace(html, "");
    html = regex7.Replace(html, "");
    html = regex8.Replace(html, "");
    html = regex9.Replace(html, "");

    // The search string is empty in the source, which String.Replace rejects
    // with an ArgumentException; a single space is assumed here
    html = html.Replace(" ", "");
    html = html.Replace("</strong>", "");
    html = html.Replace("<strong>", "");
    return html;
}

The call is simple: string strHtml = CheckStr(html);. Once you have the required data, you can import it into a database, display it, or process it further.
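For comparison, a shorter variant of the same idea is sketched below. It removes every tag with the catch-all pattern, decodes HTML entities, and collapses runs of whitespace instead of deleting all spaces. The names PlainTextSketch and ToPlainText are hypothetical, and the sketch assumes that WebUtility.HtmlDecode from System.Net is available (.NET Framework 4.0 or later).

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class PlainTextSketch
{
    // Hypothetical helper: strips tags, decodes entities, normalises whitespace.
    public static string ToPlainText(string html)
    {
        string text = Regex.Replace(html, @"<[^>]*>", "");   // remove every tag
        text = WebUtility.HtmlDecode(text);                  // &amp; becomes &, and so on
        return Regex.Replace(text, @"\s+", " ").Trim();      // collapse whitespace runs
    }
}
```

Unlike CheckStr, this sketch leaves the text inside script blocks in place, so it is only suitable once such blocks have already been removed.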

 
