The HTML source of any web page can be retrieved through its URL. The method is as follows:
Namespaces to be used:
using System;
using System.Collections.Generic;
using System.Text;
using System.Diagnostics;
using System.Text.RegularExpressions;
using System.IO;
using System.Net;
/// <summary>
/// Obtain the web page source code
/// </summary>
/// <param name="url">Web page address, e.g. "http://www.xxx.com/"</param>
/// <param name="charset">Web page encoding, e.g. "UTF-8"</param>
/// <returns>The web page source, or an empty string if the request fails</returns>
public static string GetHtmlSource(string url, string charset)
{
    // Encoding: fall back to the system default when no charset is given
    Encoding nowCharset;
    if (string.IsNullOrEmpty(charset))
    {
        nowCharset = Encoding.Default;
    }
    else
    {
        nowCharset = Encoding.GetEncoding(charset);
    }
    // Process content
    string html = "";
    try
    {
        // WebRequest myWebRequest = WebRequest.Create(url);
        // WebResponse myWebResponse = myWebRequest.GetResponse();
        // Stream stream = myWebResponse.GetResponseStream();
        // StreamReader reader = new StreamReader(stream, nowCharset);
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream, nowCharset))
        {
            html = reader.ReadToEnd();
        }
    }
    catch (Exception)
    {
        // Errors are swallowed here; the caller receives an empty string
    }
    return html;
}
/// <summary>
/// Obtain the web page source code
/// </summary>
/// <param name="url">Web page address, e.g. "http://www.xxx.com/"</param>
/// <param name="charset">Web page encoding, e.g. Encoding.UTF8</param>
/// <returns>The web page source, or an empty string if the request fails</returns>
public static string GetHtmlSource(string url, Encoding charset)
{
    // Process content
    string html = "";
    try
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream, charset))
        {
            html = reader.ReadToEnd();
        }
    }
    catch (Exception)
    {
        // Errors are swallowed here; the caller receives an empty string
    }
    return html;
}
/// <summary>
/// Obtain the web page source code
/// Works for pages that begin with a BOM: StreamReader detects the encoding from the BOM, whatever fallback encoding is passed
/// </summary>
/// <param name="url">Web page address, e.g. "http://www.xxx.com/"</param>
/// <returns>The web page source, or an empty string if the request fails</returns>
public static string GetHtmlSource(string url)
{
    // Process content
    string html = "";
    try
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream, Encoding.Default))
        {
            html = reader.ReadToEnd();
        }
    }
    catch (Exception)
    {
        // Errors are swallowed here; the caller receives an empty string
    }
    return html;
}
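The BOM behaviour mentioned for the single-argument overload can be checked without touching the network. What follows is only a minimal sketch and not part of the original code: the class BomDemo and its helper names are made up for illustration, and a MemoryStream stands in for the response stream.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomDemo
{
    // Decodes a response body with StreamReader. The encoding argument is
    // only a fallback: this constructor checks for a byte order mark first.
    public static string Decode(byte[] body, Encoding fallback)
    {
        using (var stream = new MemoryStream(body))
        using (var reader = new StreamReader(stream, fallback))
        {
            return reader.ReadToEnd();
        }
    }

    // Builds a UTF-8 body that starts with the UTF-8 BOM (EF BB BF).
    public static byte[] Utf8WithBom(string text)
    {
        byte[] bom = Encoding.UTF8.GetPreamble();
        byte[] payload = Encoding.UTF8.GetBytes(text);
        byte[] result = new byte[bom.Length + payload.Length];
        Buffer.BlockCopy(bom, 0, result, 0, bom.Length);
        Buffer.BlockCopy(payload, 0, result, bom.Length, payload.Length);
        return result;
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        string text = "<p>caf\u00e9</p>";

        // With a BOM the text round-trips even though Latin-1 was requested.
        Console.WriteLine(Decode(Utf8WithBom(text), latin1) == text);

        // Without a BOM the fallback is used, so the accented character is garbled.
        Console.WriteLine(Decode(Encoding.UTF8.GetBytes(text), latin1) == text);
    }
}
```

For pages without a BOM the fallback encoding decides the result, which is why the overloads that accept a charset exist.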
You can call whichever overload suits the situation, for example:
string _html = Collection.GetHtmlSource("http://www.luohx.com/a.html", "UTF-8");
The URL can also carry query-string parameters, for example:
string _html = Collection.GetHtmlSource("http://www.luohx.com/a.aspx?A=1&B=2", "UTF-8");
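When parameter values come from user input or contain characters such as spaces or "&", they should be escaped before being appended to the URL. A minimal sketch of one way to do this follows; QueryDemo and BuildUrl are hypothetical names that do not appear in the original code, and Uri.EscapeDataString does the escaping.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class QueryDemo
{
    // Appends the parameters to baseUrl as an escaped query string.
    public static string BuildUrl(string baseUrl, IEnumerable<KeyValuePair<string, string>> parameters)
    {
        var sb = new StringBuilder(baseUrl);
        // Start with '?' unless the URL already has a query part.
        char separator = baseUrl.Contains("?") ? '&' : '?';
        foreach (var pair in parameters)
        {
            sb.Append(separator)
              .Append(Uri.EscapeDataString(pair.Key))
              .Append('=')
              .Append(Uri.EscapeDataString(pair.Value));
            separator = '&';
        }
        return sb.ToString();
    }

    public static void Main()
    {
        var parameters = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("A", "1"),
            new KeyValuePair<string, string>("B", "x & y"),
        };
        // The result can be passed to GetHtmlSource as the url argument.
        Console.WriteLine(BuildUrl("http://www.luohx.com/a.aspx", parameters));
    }
}
```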
After collecting the page source, we often find that we do not need all of it, only a part. For example, if the content we want sits inside the tag <div id="XML" class="Wrap"></div>, we need to extract that part of the source as follows:
#region Obtain the album page code
public string StrHtml(string url, string charset)
{
    string _html = Collection.GetHtmlSource(url, charset); // obtain the page HTML from the URL
    string sss = "";
    // Regular expression: (?six) enables single-line, ignore-case and
    // ignore-pattern-whitespace modes. The 'div' group is a counter that is
    // pushed on every nested <div> and popped on every nested </div>.
    string pattern = @"(?six)<div\s+id=""XML""\s+class=""Wrap"">
(?'mycont'
(?>
(?!<div\b|</div>).
|
<div(?:\s+(?:""[^""]*""|'[^']*'|[^""'>])*)?>(?'div')
|
</div>(?'-div')
)*
(?(div)(?!))
)
</div>";
    // Keep the content of the last match
    foreach (Match m in Regex.Matches(_html, pattern))
    {
        sss = m.Groups["mycont"].Value;
    }
    return sss;
}
#endregion
The pattern variable is a regular expression that matches the tag <div id="XML" class="Wrap"></div>. The chosen reference tag must be unique on the page: if two or more <div id="XML" class="Wrap"></div> tags exist at the same time, that tag cannot be used as a reference.
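To see what the pattern captures when the target div contains nested divs, it can be run against a small in-memory string. This is only a sketch under assumptions: DivExtractDemo, Extract and the sample markup are invented for illustration, while the pattern is the same balancing-group expression used above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DivExtractDemo
{
    // Same expression as in the article: the 'div' group counts nested
    // <div> tags so that the match ends at the matching closing </div>.
    private const string Pattern = @"(?six)<div\s+id=""XML""\s+class=""Wrap"">
(?'mycont'
(?>
(?!<div\b|</div>).
|
<div(?:\s+(?:""[^""]*""|'[^']*'|[^""'>])*)?>(?'div')
|
</div>(?'-div')
)*
(?(div)(?!))
)
</div>";

    // Returns the inner HTML of the first matching div, or "" if none is found.
    public static string Extract(string html)
    {
        Match m = Regex.Match(html, Pattern);
        return m.Success ? m.Groups["mycont"].Value : "";
    }

    public static void Main()
    {
        string html = "<body><div id=\"XML\" class=\"Wrap\">"
                    + "<div><p>a</p></div>b</div><div>c</div></body>";
        // The nested <div> is kept whole; the sibling <div>c</div> is not included.
        Console.WriteLine(Extract(html));
    }
}
```

Unlike the method above, this sketch returns the first match instead of the last; with a unique reference tag the two are the same.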
After extracting the required HTML fragment, we find that it still contains HTML markup. If we need content without any HTML elements, strip the HTML tags, for example:
public static string CheckStr(string html)
{
    // Lazy quantifiers ([\s\S]+?) remove each element on its own instead of
    // everything between the first opening tag and the last closing tag
    Regex regex1 = new Regex(@"<script[\s\S]+?</script *>", RegexOptions.IgnoreCase);     // <script> blocks
    Regex regex2 = new Regex(@"href\s*=\s*[""']?\s*\w*script\s*:", RegexOptions.IgnoreCase); // href="javascript:" links
    Regex regex3 = new Regex(@"\son\w+\s*=", RegexOptions.IgnoreCase);                      // on... event attributes
    Regex regex4 = new Regex(@"<iframe[\s\S]+?</iframe *>", RegexOptions.IgnoreCase);     // <iframe> blocks
    Regex regex5 = new Regex(@"<frameset[\s\S]+?</frameset *>", RegexOptions.IgnoreCase); // <frameset> blocks
    Regex regex6 = new Regex(@"<img[^>]+>", RegexOptions.IgnoreCase);                     // <img> tags (assumed; the source pattern is incomplete)
    Regex regex7 = new Regex(@"</p>", RegexOptions.IgnoreCase);
    Regex regex8 = new Regex(@"<p>", RegexOptions.IgnoreCase);
    Regex regex9 = new Regex(@"<[^>]*>", RegexOptions.IgnoreCase);                        // any remaining tag
    html = regex1.Replace(html, "");
    html = regex2.Replace(html, "");
    html = regex3.Replace(html, " _disibledevent=");
    html = regex4.Replace(html, "");
    html = regex5.Replace(html, "");
    html = regex6.Replace(html, "");
    html = regex7.Replace(html, "");
    html = regex8.Replace(html, "");
    html = regex9.Replace(html, "");
    html = html.Replace("&nbsp;", ""); // assumed: the first argument is empty in the source, which would throw
    html = html.Replace("</strong>", "");
    html = html.Replace("<strong>", "");
    return html;
}
Calling it is simple: string strHtml = CheckStr(html). Once you have the required data, you can store it in a database, display it, or process it further.
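For a quick check of the tag-stripping idea on a known input, a reduced version can be run on its own. This is a simplified sketch rather than the method above: StripDemo and StripTags are hypothetical names, only script blocks and generic tags are handled, and WebUtility.HtmlDecode is added here (it is not in the original) to turn entities such as &lt; back into characters.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class StripDemo
{
    // Removes script blocks together with their contents, then every
    // remaining tag, and finally decodes HTML entities.
    public static string StripTags(string html)
    {
        // Lazy match so that each <script>...</script> block is removed separately.
        string text = Regex.Replace(html, @"<script\b[\s\S]*?</script\s*>", "",
                                    RegexOptions.IgnoreCase);
        // Any remaining tag: '<', anything except '>', then '>'.
        text = Regex.Replace(text, @"<[^>]*>", "");
        return WebUtility.HtmlDecode(text);
    }

    public static void Main()
    {
        string html = "<p>Hello <strong>world</strong></p><script>alert(1)</script>";
        Console.WriteLine(StripTags(html));
    }
}
```

Regular expressions of this kind are adequate for simple, well-formed fragments; for arbitrary pages an HTML parser is usually the more robust choice.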