A friend asked:
"I need to get the title of a web page. Because the title is usually near the top, a small amount of content from the beginning of the HTML is enough. Downloading the entire HTML just to get the titles of many pages is a waste of time. .NET does not seem to have a ready-made class for this (fetching only part of the HTML). How should I implement it?"
A relatively "cheap" (low-cost) solution:
Step 1: Retrieve the smallest portion of the page that contains the title. This is the key to being "cheap"!
Step 2: Use a regular expression to obtain the part between <title> and </title>.
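Step 2 can be sketched as follows. This is only a minimal illustration, not the original author's code: the class name TitleExtractor, the method name ExtractTitle, and the exact pattern are assumptions made for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleExtractor
{
    // Hypothetical helper (not from the original post): returns the text
    // between <title> and </title>, or an empty string if no title is found.
    public static string ExtractTitle(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        // <title\b[^>]*> tolerates attributes on the opening tag.
        // IgnoreCase also matches <TITLE>; Singleline lets '.' span line breaks.
        Match m = Regex.Match(
            html,
            @"<title\b[^>]*>(.*?)</title\s*>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        return m.Success ? m.Groups[1].Value.Trim() : string.Empty;
    }

    public static void Main()
    {
        // Only the beginning of a page, as produced by step 1.
        string partial = "<html><head><TITLE>\r\n  Example page </TITLE>";
        Console.WriteLine(ExtractTitle(partial)); // prints "Example page"
    }
}
```

The match is case-insensitive because real pages write the tag as both <title> and <TITLE>, and the captured text is trimmed because titles are often wrapped in whitespace.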
First, look at the result:
The analysis is as follows:
The page title is generally near the beginning, so we read the stream from the start. (What if it is near the end?) Where should we stop reading? The obvious marker is:
</title>
Reading up to it is enough.
How should the data be read? Here I read line by line and stop as soon as the marker is found.
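The line-by-line idea can be tried offline before any HTTP code is involved. The sketch below is an illustration under assumptions, not the author's method: HeadReader, ReadUntilTitleEnd, and the maxLines cap are hypothetical names introduced here. The cap is one possible answer to the "what if it is near the end?" question, since it bounds how much is read when no title turns up early.

```csharp
using System;
using System.IO;
using System.Text;

public static class HeadReader
{
    // Hypothetical helper: reads lines until one contains </title>
    // or until maxLines lines have been read, whichever comes first.
    public static string ReadUntilTitleEnd(TextReader reader, int maxLines)
    {
        StringBuilder sb = new StringBuilder();
        string line;
        int count = 0;
        while (count < maxLines && (line = reader.ReadLine()) != null)
        {
            sb.AppendLine(line);
            count++;
            // Checking only the current line avoids rescanning the whole buffer.
            if (line.IndexOf("</title>", StringComparison.OrdinalIgnoreCase) >= 0)
                break;
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string page = "<html>\n<head>\n<title>Demo</title>\n</head>\n<body>large body...</body>\n</html>";
        using (StringReader reader = new StringReader(page))
        {
            string head = ReadUntilTitleEnd(reader, 200);
            Console.WriteLine(head.Contains("</title>")); // True
            Console.WriteLine(head.Contains("<body>"));   // False
        }
    }
}
```

Because the helper takes a TextReader, the same loop works on a StreamReader wrapped around a response stream.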
The method is as follows:
#region Get the required page content
/// <summary>
/// Gets the required page content. By tony, 2009.9.16
/// downmoon: 3w@live.cn
/// </summary>
/// <param name="strUrl">Address of the remote web page to fetch</param>
/// <param name="timeout">Timeout in milliseconds; 8000 is a typical value</param>
/// <param name="enterType">Whether to output line breaks: 0 means none, 1 means a line break after each line (for display in a text box)</param>
/// <param name="EnCodeType">Encoding used to decode the response</param>
/// <returns></returns>
public static string GetRequestString(string strUrl, int timeout, int enterType, Encoding EnCodeType)
{
    if (strUrl.Equals("about:blank")) return null;
    // Default to http:// when no scheme is given.
    if (!strUrl.StartsWith("http://") && !strUrl.StartsWith("https://")) { strUrl = "http://" + strUrl; }
    string strResult = string.Empty;
    System.IO.StreamReader sr = null;
    string temp = string.Empty;
    try
    {
        HttpWebRequest myReq = (HttpWebRequest)HttpWebRequest.Create(strUrl);
        myReq.Timeout = timeout;
        // The UserAgent property takes only the header value, without a "User-Agent:" prefix.
        myReq.UserAgent = "Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 2.0.40607; .NET CLR 1.1.4322; .NET CLR 3.5.30729)";
        myReq.Accept = "*/*";
        myReq.KeepAlive = true;
        myReq.Headers.Add("Accept-Language", "zh-cn,en-us;q=0.5");
        HttpWebResponse HttpWResp = (HttpWebResponse)myReq.GetResponse();
        if (HttpWResp.StatusCode == System.Net.HttpStatusCode.OK)
        {
            StringBuilder strBuilder = new StringBuilder();
            Stream myStream = HttpWResp.GetResponseStream();
            sr = new StreamReader(myStream, EnCodeType);
            string tmp = string.Empty;
            // Read line by line so the rest of the page is never consumed.
            while ((temp = sr.ReadLine()) != null)
            {
                strBuilder.Append(temp);
                // Stop as soon as </title> has been read. By downmoon (http://blog.csdn.net/downmoon/), 2009.9.16
                tmp = strBuilder.ToString();
                // Case-insensitive so that </TITLE> also ends the loop.
                if (tmp.IndexOf("</title>", StringComparison.OrdinalIgnoreCase) >= 0) { break; }
                // Assumption: the appended literal is empty in the source; "\r\n" matches the enterType description.
                if (enterType == 1) { strBuilder.Append("\r\n"); }
            }
            strResult = strBuilder.ToString();
            return strResult;
        }
        return string.Empty;
    }
    catch (Exception ex)
    {
        // #region Log handling by Tony 2008.11.21
        return strResult;
& N