Network Information Collector

Source: Internet
Author: User

The Network Information Collector is used to gather information from other websites on the network that those sites do not otherwise send to us. It is very convenient for anyone who needs to collect data. The following is a tutorial on the Network Information Collector:

When creating the Network Information Collector, we will use the HttpWebRequest and HttpWebResponse classes from the System.Net namespace, along with stream I/O from the System.IO namespace. The basic steps are written out below.

// The address of the website to collect
string url = "http://www.baidu.com/";

// Create the request for the target address
HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);

// Get the response
HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse();

// Obtain the response data stream
Stream stream = webResponse.GetResponseStream();

// Wrap the stream in a reader with the page's encoding so we can read the data correctly
StreamReader streamReader = new StreamReader(stream, Encoding.GetEncoding("gb2312"));

// Read all of the text in the data stream
string content = streamReader.ReadToEnd();

// Close the reader
streamReader.Close();

// Close the network response
webResponse.Close();

With the method above, we get the data returned from the network. Next we select what we need with regular-expression matching.

public class GetRemoteObj
{

#region Read network content based on the URL

/// <summary>
/// Read the network content at the given URL
/// </summary>
/// <param name="url"></param>
/// <returns></returns>
public string GetRemoteHtmlCode(string url)
{
HttpWebRequest wrequest = (HttpWebRequest)WebRequest.Create(url);
HttpWebResponse wresponse = (HttpWebResponse)wrequest.GetResponse();
Stream stream = wresponse.GetResponseStream();
StreamReader reader = new StreamReader(stream, Encoding.GetEncoding("gb2312"));
string HtmlCode = reader.ReadToEnd();
reader.Close();
wresponse.Close();
return HtmlCode;
}

#endregion

#region Replace line breaks and quotation marks in the webpage

/// <summary>
/// Replace the line breaks and quotation marks in the webpage
/// </summary>
/// <param name="HtmlCode"></param>
/// <returns></returns>
public string ReplaceEnter(string HtmlCode)
{
string s = "";
if (HtmlCode == null || HtmlCode == "")
s = "";
else
s = HtmlCode.Replace("\"", "");
s = s.Replace("\r", "");
s = s.Replace("\n", "");
return s;
}

#endregion

#region Execute the regular expression and extract the value

/// <summary>
/// Execute the regular expression and extract the matched value
/// </summary>
/// <param name="RegexString">regular expression</param>
/// <param name="RemoteStr">html source code</param>
/// <returns></returns>
public string GetRegValue(string RegexString, string RemoteStr)
{
string MatchValue = "";
Regex r = new Regex(RegexString);
Match m = r.Match(RemoteStr);
if (m.Success)
{
MatchValue = m.Value;
}
return MatchValue;
}

#endregion

#region Delete HTML tags

/// <summary>
/// Delete HTML tags
/// </summary>
/// <param name="HtmlCode">html source code</param>
/// <returns></returns>
public string RemoveHTML(string HtmlCode)
{
string MatchValue = HtmlCode;
foreach (Match s in Regex.Matches(HtmlCode, "<.+?>"))
{
MatchValue = MatchValue.Replace(s.Value, "");
}
return MatchValue;
}

#endregion

#region Obtain the page links with a regular expression

/// <summary>
/// Obtain the page links with a regular expression
/// </summary>
/// <param name="HtmlCode">html source code</param>
/// <returns></returns>
public string GetHref(string HtmlCode)
{
string MatchValue = "";
string Reg = @"(h|H)(r|R)(e|E)(f|F) *= *('|"")?((\w|\\|\/|\.|:|-|_)+)";
foreach (Match m in Regex.Matches(HtmlCode, Reg))
{
MatchValue += m.Value.ToLower().Replace("href=", "").Trim() + "|";
}
return MatchValue;
}

#endregion

#region Match the actual link of an image path

/// <summary>
/// Match the actual link of an image path
/// </summary>
/// <param name="ImgString">string containing the img tag</param>
/// <param name="imgHttp">http:// prefix to prepend for relative paths</param>
/// <returns></returns>
public string GetImg(string ImgString, string imgHttp)
{
string MatchValue = "";
string Reg = @"src=.+\.(bmp|jpg|gif|png)";
foreach (Match m in Regex.Matches(ImgString.ToLower(), Reg))
{
MatchValue += m.Value.ToLower().Trim().Replace("src=", "");
}
if (MatchValue.IndexOf(".net") != -1 || MatchValue.IndexOf(".com") != -1 || MatchValue.IndexOf(".org") != -1 || MatchValue.IndexOf(".cn") != -1 || MatchValue.IndexOf(".cc") != -1 || MatchValue.IndexOf(".info") != -1 || MatchValue.IndexOf(".biz") != -1 || MatchValue.IndexOf(".tv") != -1)
return MatchValue;
else
return imgHttp + MatchValue;
}

#endregion

#region Match the image addresses on the page

/// <summary>
/// Match the image addresses on the page
/// </summary>
/// <param name="HtmlCode">html source code</param>
/// <param name="imgHttp">http:// path information to be prepended</param>
/// <returns></returns>
public string GetImgSrc(string HtmlCode, string imgHttp)
{
string MatchValue = "";
string Reg = @"<img[^>]+>";

foreach (Match m in Regex.Matches(HtmlCode.ToLower(), Reg))
{
MatchValue += GetImg(m.Value.ToLower().Trim(), imgHttp) + "|";
}

return MatchValue;
}

#endregion

#region Strip the start and end strings from a regex-matched value

/// <summary>
/// Strip the start and end strings from a regex-matched value
/// </summary>
/// <param name="RegValue">value to process</param>
/// <param name="regStart">leading string matched by the regular expression</param>
/// <param name="regEnd">trailing string matched by the regular expression</param>
/// <returns></returns>
public string RegReplace(string RegValue, string regStart, string regEnd)
{
string s = RegValue;
if (RegValue != "" && RegValue != null)
{
if (regStart != "" && regStart != null)
{
s = s.Replace(regStart, "");
}
if (regEnd != "" && regEnd != null)
{
s = s.Replace(regEnd, "");
}
}
return s;
}

#endregion

}

 

I am sorry, everyone. I have been delaying this article for a long time; today I finally found the time and decided to revise it:

I will explain a few important aspects of the methods above (because I use dial-up Internet access at home, I mostly use Baidu as the example).

GetRemoteHtmlCode obtains the source code of a web page; we only need to pass in the URL (note that the URL must include the http:// prefix, so for Baidu we write http://www.baidu.com).

GetRegValue extracts the key information (this method is very important and can be called the heart of the collection system; I give an example below).

RemoveHTML deletes HTML tags.

ReplaceEnter replaces line breaks and quotation marks.

Below is an example:

GetRemoteObj getUrl = new GetRemoteObj(); // instantiate the helper class
string url = "http://www.baidu.com/"; // the URL to collect
string content = getUrl.GetRemoteHtmlCode(url); // obtain the html source code
string Reg = "<title>.+?</title>"; // regular expression for the title tag
string title = getUrl.GetRegValue(Reg, content); // extract the title; it still contains the title tags
title = getUrl.RemoveHTML(title); // delete the html tags
Console.WriteLine(title); // print the output; the title text is displayed here

As the example shows, when we intercept content we may run into line-break problems. In that case we can first replace the other site's line-break characters so the content can be intercepted normally.
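That tactic can be sketched as follows. The HTML snippet and class name here are invented for illustration; note also that RegexOptions.Singleline is a built-in alternative that lets "." match across line breaks without any replacement.

```csharp
using System;
using System.Text.RegularExpressions;

class LineFeedDemo
{
    static void Main()
    {
        // Hypothetical snippet: the title text is split across lines in the raw HTML.
        string html = "<title>Baidu\r\nSearch</title>";

        // Replace the line-break identifiers first, as described above...
        string flat = html.Replace("\r", "").Replace("\n", "");
        string title = Regex.Match(flat, "<title>.+?</title>").Value;
        Console.WriteLine(title); // <title>BaiduSearch</title>

        // ...or let "." span line breaks with RegexOptions.Singleline instead.
        Match m = Regex.Match(html, "<title>.+?</title>", RegexOptions.Singleline);
        Console.WriteLine(m.Success); // True
    }
}
```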

This is actually very simple. I have summarized some of my experience:

First, many people misunderstand what multi-threaded collection means, so let me explain. When we collect web page information, a single page may contain many records that we need to store.

Take Baidu: the information we search for is shown to us page by page. We can collect one such page with multiple threads, using delegates and events. For example, if a page has 10 records, we can start 10 threads and collect them at the same time; once all of the records have been processed, we fetch the second page of data and start the threads again.
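The page-by-page idea might be sketched like this. The CollectOne worker and the item URLs are made up for illustration, and plain Thread/Join is used here in place of the delegate-and-event wiring described above:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

class PageCollector
{
    // Hypothetical worker: in a real collector this would fetch and parse one record.
    static string CollectOne(string itemUrl)
    {
        return "collected:" + itemUrl;
    }

    static void Main()
    {
        // Suppose one result page lists 10 records; start one thread per record.
        var results = new List<string>();
        var threads = new List<Thread>();
        object gate = new object();

        for (int i = 0; i < 10; i++)
        {
            string itemUrl = "http://www.baidu.com/item/" + i; // made-up URLs
            var t = new Thread(() =>
            {
                string r = CollectOne(itemUrl);
                lock (gate) { results.Add(r); } // List<T> is not thread-safe
            });
            threads.Add(t);
            t.Start();
        }

        // Wait for the whole page to finish before moving on to the next page.
        foreach (var t in threads) t.Join();
        Console.WriteLine(results.Count); // 10
    }
}
```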

Second, we need to mark every page we have already collected. I usually store the page's path in the database and compare against it before collecting. I don't know whether there are better methods; if so, please advise.
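The comparison step can be sketched in memory with a HashSet; the database lookup described above works the same way, just persisted. The class and URLs here are invented for illustration:

```csharp
using System;
using System.Collections.Generic;

class VisitedMarker
{
    static HashSet<string> visited = new HashSet<string>();

    // Returns true only the first time a path is seen, mirroring the
    // compare-before-collect check against the database.
    static bool MarkIfNew(string url)
    {
        return visited.Add(url); // Add returns false if already present
    }

    static void Main()
    {
        Console.WriteLine(MarkIfNew("http://www.baidu.com/page1")); // True
        Console.WriteLine(MarkIfNew("http://www.baidu.com/page1")); // False: already collected
    }
}
```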

Third: a server is the best machine for collecting information; its biggest advantages are network and CPU resources. When we enable multithreading, both are heavily consumed. On a weak machine I recommend collecting one item at a time, otherwise it can easily go down.

The above is basically the foundation of an information collection system. The information site I developed a while back uses it, and if you are interested you can try collecting data from a website; I personally think it works well. I don't yet have a good solution for reusability, which is frustrating: a collector can only target one website's format and cannot be reused generally. If you have time, you can dig deeper into this.

 

Again, I am sorry to all of you. This article dragged on for a while because I have been very busy. Some of you asked me for an example; to tell the truth, apart from the collector for that website, I have not built one yet...

The next time I write an article, I will prepare an example first. Sorry!

 

 

 

 

 
