In a recent project, one of the features was to fetch the source code of a web page from its URL. In ASP.NET (C#) there seem to be many ways to get a page's source; I casually went with WebClient, which is simple and easy to use. But an annoying problem soon appeared: Chinese text came out garbled.
After some study, I found that Chinese web pages mostly use one of two encodings, GB2312 or UTF-8. So here is the code:
// Requires: using System; using System.Net;

/// <summary>
/// Gets the HTML source of the page at the given URL.
/// </summary>
/// <param name="url"></param>
/// <returns></returns>
public static string GetHtmlByUrl(string url)
{
    using (WebClient wc = new WebClient())
    {
        try
        {
            wc.UseDefaultCredentials = true;
            wc.Proxy = new WebProxy();
            wc.Proxy.Credentials = CredentialCache.DefaultCredentials;
            wc.Credentials = System.Net.CredentialCache.DefaultCredentials;
            // Download the raw bytes so the encoding can be chosen afterwards.
            byte[] bt = wc.DownloadData(url);
            // First pass: decode as GB2312, then look for the declared charset.
            string txt = System.Text.Encoding.GetEncoding("GB2312").GetString(bt);
            switch (GetCharset(txt).ToUpper())
            {
                case "UTF-8":
                    txt = System.Text.Encoding.UTF8.GetString(bt);
                    break;
                case "UNICODE":
                    txt = System.Text.Encoding.Unicode.GetString(bt);
                    break;
                default:
                    break;
            }
            return txt;
        }
        catch (Exception)
        {
            // Any failure (network, decoding) is reported as null.
            return null;
        }
    }
}
To explain a little: the code creates a WebClient object named wc (a slightly awkward name). It then calls the DownloadData method on wc, passing in the URL, and gets back a byte array. The byte array is first decoded as GB2312 into a string. The charset declaration is then looked up in that string, for example charset="utf-8", to determine the page's actual encoding; if it turns out to be UTF-8 or Unicode, the same bytes are decoded again with that encoding.
The GetCharset function extracts the encoding declared by the page. The code is as follows:
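The two-pass idea can be tried without any network access. The following is only a minimal sketch under a few assumptions: SniffCharset is a hypothetical, simplified helper (not the article's charset function), and the first pass uses ISO-8859-1 instead of GB2312 so that the sketch also runs on runtimes where GB2312 is not registered by default. It works because the charset declaration is plain ASCII and survives a first decode with the wrong encoding.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class TwoPassDecodeSketch
{
    // Hypothetical helper: a simplified charset sniff for illustration only.
    static string SniffCharset(string html)
    {
        Match m = Regex.Match(html, @"charset\s*=\s*[""']?(?<charset>[\w-]+)",
                              RegexOptions.IgnoreCase);
        return m.Success ? m.Groups["charset"].Value.ToUpperInvariant() : "";
    }

    public static string Decode(byte[] bytes)
    {
        // First pass (assumption: ISO-8859-1 swapped in for the article's GB2312).
        // Every byte maps to one char, so the ASCII markup stays readable.
        string firstPass = Encoding.GetEncoding("ISO-8859-1").GetString(bytes);

        // Second pass: decode the same bytes again with the declared encoding.
        switch (SniffCharset(firstPass))
        {
            case "UTF-8":
                return Encoding.UTF8.GetString(bytes);
            case "UNICODE":
                return Encoding.Unicode.GetString(bytes);
            default:
                // The article keeps its GB2312 first pass here; this sketch keeps its own.
                return firstPass;
        }
    }

    public static void Main()
    {
        // A UTF-8 page containing two Chinese characters, written as escapes.
        string page = "<html><head><meta charset=\"utf-8\"></head><body>\u4E2D\u6587</body></html>";
        byte[] bytes = Encoding.UTF8.GetBytes(page);

        Console.WriteLine(Decode(bytes) == page); // True
    }
}
```

Decoding the raw bytes twice is cheap compared with the download itself, which is why fetching bytes rather than a string is the convenient choice here.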
// Requires: using System.Text.RegularExpressions;

/// <summary>
/// Gets the charset declared in the HTML.
/// </summary>
/// <param name="html"></param>
/// <returns></returns>
public static string GetCharset(string html)
{
    string charset = "";
    // Form 1: <meta http-equiv="Content-Type" content="text/html; charset=...">
    Regex regCharset = new Regex(@"content=[""'].*\s*charset\b\s*=\s*""?(?<charset>[^""']*)", RegexOptions.IgnoreCase);
    if (regCharset.IsMatch(html))
    {
        charset = regCharset.Match(html).Groups["charset"].Value;
    }
    if (charset.Equals(""))
    {
        // Form 2: <meta charset="...">
        regCharset = new Regex(@"<\s*meta\s*charset\s*=\s*[""']?(?<charset>[^""']*)", RegexOptions.IgnoreCase);
        if (regCharset.IsMatch(html))
        {
            charset = regCharset.Match(html).Groups["charset"].Value;
        }
    }
    return charset;
}
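As a quick check, the two patterns can be run against the two common forms of charset declaration. This is a sketch, not the author's test: the class and method names are made up for illustration, and the "(?<charset>" named-group syntax is an assumed reading of the patterns above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharsetPatternCheck
{
    // Pattern for <meta http-equiv="Content-Type" content="...; charset=...">.
    static readonly Regex HttpEquivForm = new Regex(
        @"content=[""'].*\s*charset\b\s*=\s*""?(?<charset>[^""']*)",
        RegexOptions.IgnoreCase);

    // Pattern for the shorter <meta charset="..."> form.
    static readonly Regex ShortForm = new Regex(
        @"<\s*meta\s*charset\s*=\s*[""']?(?<charset>[^""']*)",
        RegexOptions.IgnoreCase);

    // Hypothetical name; tries the first pattern, then falls back to the second.
    public static string Detect(string html)
    {
        Match m = HttpEquivForm.Match(html);
        string charset = m.Success ? m.Groups["charset"].Value : "";
        if (charset == "")
        {
            m = ShortForm.Match(html);
            if (m.Success) charset = m.Groups["charset"].Value;
        }
        return charset;
    }

    public static void Main()
    {
        Console.WriteLine(Detect(
            "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=gb2312\">")); // gb2312
        Console.WriteLine(Detect("<meta charset=\"utf-8\">"));                          // utf-8
        Console.WriteLine(Detect("<p>no declaration</p>") == "");                       // True
    }
}
```

Note that the value is returned exactly as written in the page, so the caller still needs to normalize case before comparing, as the ToUpper() call in the download function does.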