Website collection (extract HTML data based on regular expressions)

Last Update: 2018-12-06 Source: Internet

Author: User

The HTML source code of any reachable web page can be obtained through its URL. The methods are as follows:

Namespaces to be used:

using System;
using System.Collections.Generic;
using System.Text;
using System.Diagnostics;
using System.Text.RegularExpressions;
using System.IO;
using System.Net;

/// <summary>
/// Obtain the web page source code
/// </summary>
/// <param name="url">Web page address, e.g. "http://www.xxx.com/"</param>
/// <param name="charset">Web page encoding, e.g. "UTF-8"</param>
/// <returns>The web page source, or an empty string if the request fails</returns>
public static string GetHtmlSource(string url, string charset)
{
    // Encoding: fall back to the system default when no charset is given
    Encoding nowCharset;
    if (string.IsNullOrEmpty(charset))
    {
        nowCharset = Encoding.Default;
    }
    else
    {
        nowCharset = Encoding.GetEncoding(charset);
    }
    // Process content
    string html = "";
    try
    {
        // The base classes work as well:
        // WebRequest myWebRequest = WebRequest.Create(url);
        // WebResponse myWebResponse = myWebRequest.GetResponse();
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream, nowCharset))
        {
            html = reader.ReadToEnd();
        }
    }
    catch (Exception)
    {
        // Request and read errors are ignored; the caller receives an empty string
    }
    return html;
}

/// <summary>
/// Obtain the web page source code
/// </summary>
/// <param name="url">Web page address, e.g. "http://www.xxx.com/"</param>
/// <param name="charset">Web page encoding, e.g. Encoding.UTF8</param>
/// <returns>The web page source, or an empty string if the request fails</returns>
public static string GetHtmlSource(string url, Encoding charset)
{
    // Process content
    string html = "";
    try
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream, charset))
        {
            html = reader.ReadToEnd();
        }
    }
    catch (Exception)
    {
        // Request and read errors are ignored; the caller receives an empty string
    }
    return html;
}

/// <summary>
/// Obtain the web page source code
/// Works for pages that begin with a BOM: StreamReader detects the encoding
/// from the byte order mark, whatever encoding is passed in
/// </summary>
/// <param name="url">Web page address, e.g. "http://www.xxx.com/"</param>
/// <returns>The web page source, or an empty string if the request fails</returns>
public static string GetHtmlSource(string url)
{
    // Process content
    string html = "";
    try
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream, Encoding.Default))
        {
            html = reader.ReadToEnd();
        }
    }
    catch (Exception)
    {
        // Request and read errors are ignored; the caller receives an empty string
    }
    return html;
}
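
Which overload to call depends on knowing the page's encoding. Servers often name it in the Content-Type response header, which HttpWebResponse exposes through its ContentType property. The following is a minimal sketch of reading the charset out of such a header value; the class and method names (CharsetHint, FromContentType) are hypothetical and not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharsetHint
{
    // Hypothetical helper: returns the charset named in a Content-Type value,
    // or an empty string when the header does not name one.
    public static string FromContentType(string contentType)
    {
        if (string.IsNullOrEmpty(contentType))
        {
            return "";
        }
        // Accepts forms such as: text/html; charset=utf-8  or  charset="GBK"
        Match m = Regex.Match(
            contentType,
            @"charset\s*=\s*[""']?([^\s;""']+)",
            RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : "";
    }

    public static void Main()
    {
        Console.WriteLine(FromContentType("text/html; charset=utf-8")); // prints "utf-8"
        Console.WriteLine(FromContentType("text/html").Length);         // prints 0
    }
}
```

The returned name could be passed as the charset argument of the string overload above; an empty result makes that overload fall back to Encoding.Default.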

You can call whichever overload suits the situation, for example:

string _html = Collection.GetHtmlSource("http://www.luohx.com/a.html", "UTF-8");

You can also pass query-string parameters in the URL, for example:

string _html = Collection.GetHtmlSource("http://www.luohx.com/a.aspx?A=1&B=2", "UTF-8");
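
When parameter values contain spaces, ampersands, or non-ASCII text, they need to be percent-encoded before being appended to the URL. A minimal sketch of one way to do that with Uri.EscapeDataString follows; the class and method names (QueryDemo, BuildUrl) are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class QueryDemo
{
    // Hypothetical helper: appends percent-encoded name=value pairs to a base URL.
    public static string BuildUrl(string baseUrl, IEnumerable<KeyValuePair<string, string>> parameters)
    {
        StringBuilder sb = new StringBuilder(baseUrl);
        // Start with '?' unless the base URL already carries a query string
        char separator = baseUrl.Contains("?") ? '&' : '?';
        foreach (KeyValuePair<string, string> pair in parameters)
        {
            sb.Append(separator)
              .Append(Uri.EscapeDataString(pair.Key))
              .Append('=')
              .Append(Uri.EscapeDataString(pair.Value));
            separator = '&';
        }
        return sb.ToString();
    }

    public static void Main()
    {
        var parameters = new[]
        {
            new KeyValuePair<string, string>("a", "1"),
            new KeyValuePair<string, string>("b", "x y"),
        };
        // prints "http://www.luohx.com/a.aspx?a=1&b=x%20y"
        Console.WriteLine(BuildUrl("http://www.luohx.com/a.aspx", parameters));
    }
}
```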

After collecting the page source, we often find that we do not need all of it, only a part. For example, the HTML we want may sit inside the tag <div id="XML" class="Wrap"></div>, so we need to extract that part of the source as follows:

#region Obtain the album page code
public string StrHtml(string url, string charset)
{
    string _html = Collection.GetHtmlSource(url, charset); // obtain the page HTML from the URL
    string sss = "";
    // Regular expression: balancing groups keep track of nested <div> tags
    string pattern = @"(?six)<div\s+id=""XML""\s+class=""Wrap"">
(?'mycont'
(?>
(?!<div\b|</div>).
|
<div(?:\s+(?:""[^""]*""|'[^']*'|[^""'>])*)?>(?'div')
|
</div>(?'-div')
)*
(?(div)(?!))
)
</div>";
    foreach (Match m in Regex.Matches(_html, pattern))
    {
        sss = m.Groups["mycont"].Value; // if several blocks match, the last one is kept
    }
    return sss;
}
#endregion

The pattern variable holds a regular expression for the tag <div id="XML" class="Wrap"></div>. The tag chosen as the reference must be unique on the page: if two or more <div id="XML" class="Wrap"></div> elements exist at the same time, this tag cannot be used as the reference.
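
To see how such a pattern copes with nested tags, here is a small self-contained sketch that runs the same balancing-group technique against an in-memory string instead of a downloaded page. The id "box", the group names, and the class name BalancedDivDemo are assumptions made for the illustration; the attribute matching is also simplified compared with the pattern above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BalancedDivDemo
{
    // s = dot matches newlines, i = ignore case, x = ignore pattern whitespace
    private const string Pattern = @"(?six)
        <div\s+id=""box"">
        (?'content'
            (?>
                (?!<div\b|</div>).       # any character that does not start a div tag
              |
                <div\b[^>]*>(?'depth')   # nested opening tag: push onto the stack
              |
                </div>(?'-depth')        # nested closing tag: pop from the stack
            )*
            (?(depth)(?!))               # fail if an opening tag is still unclosed
        )
        </div>";

    // Returns the inner HTML of the first <div id="box">, or "" if there is none.
    public static string Extract(string html)
    {
        Match m = Regex.Match(html, Pattern);
        return m.Success ? m.Groups["content"].Value : "";
    }

    public static void Main()
    {
        string html = "<p>x</p><div id=\"box\">a<div>b</div>c</div><div>tail</div>";
        Console.WriteLine(Extract(html)); // prints "a<div>b</div>c"
    }
}
```

The inner <div>b</div> stays inside the captured content because its closing tag pops the entry pushed by its opening tag; only the closing tag that finds the stack empty ends the match.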

After extracting the required HTML fragment, we find that it still contains HTML markup. If we need content without any HTML elements, we can strip the tags, for example:

public static string CheckStr(string html)
{
    // Script blocks, including their content
    Regex regex1 = new Regex(@"<script[\s\S]+?</script *>", RegexOptions.IgnoreCase);
    // href attributes that point at a script: URL
    Regex regex2 = new Regex(@" href *= *[""']? *\w*script *:", RegexOptions.IgnoreCase);
    // Inline event-handler attributes such as onclick=
    Regex regex3 = new Regex(@"\son\w+\s*=", RegexOptions.IgnoreCase);
    // iframe and frameset blocks, including their content
    Regex regex4 = new Regex(@"<iframe[\s\S]+?</iframe *>", RegexOptions.IgnoreCase);
    Regex regex5 = new Regex(@"<frameset[\s\S]+?</frameset *>", RegexOptions.IgnoreCase);
    // Image tags
    Regex regex6 = new Regex(@"<img[^>]+>", RegexOptions.IgnoreCase);
    // Paragraph tags
    Regex regex7 = new Regex(@"</p>", RegexOptions.IgnoreCase);
    Regex regex8 = new Regex(@"<p>", RegexOptions.IgnoreCase);
    // Any remaining tag
    Regex regex9 = new Regex(@"<[^>]*>", RegexOptions.IgnoreCase);
    html = regex1.Replace(html, "");
    html = regex2.Replace(html, "");
    html = regex3.Replace(html, " _disibledevent=");
    html = regex4.Replace(html, "");
    html = regex5.Replace(html, "");
    html = regex6.Replace(html, "");
    html = regex7.Replace(html, "");
    html = regex8.Replace(html, "");
    html = regex9.Replace(html, "");
    // Remove non-breaking-space entities
    html = html.Replace("&nbsp;", "");
    html = html.Replace("</strong>", "");
    html = html.Replace("<strong>", "");
    return html;
}

Calling it is simple: just use string strhtml = CheckStr(html); Once you have the required data, you can save it to a database, display it, or process it further.
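
For a quick check of the tag-stripping idea without fetching a page, the following compact sketch applies the same approach to a literal string. It is a simplified variant rather than the method above: it removes script blocks and all tags, then decodes entities with WebUtility.HtmlDecode. The names StripDemo and ToPlainText are hypothetical.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class StripDemo
{
    // Hypothetical helper: reduces an HTML fragment to plain text.
    public static string ToPlainText(string html)
    {
        // Drop script blocks together with their content first
        string text = Regex.Replace(html, @"<script\b[\s\S]*?</script\s*>", "", RegexOptions.IgnoreCase);
        // Remove every remaining tag, keeping the text between tags
        text = Regex.Replace(text, @"<[^>]*>", "");
        // Turn entities such as &amp; back into characters
        text = WebUtility.HtmlDecode(text);
        return text.Trim();
    }

    public static void Main()
    {
        string html = "<p>Hello <strong>world</strong></p><script>alert(1)</script>";
        Console.WriteLine(ToPlainText(html)); // prints "Hello world"
    }
}
```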

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
