C#: Obtain a web page's source code and automatically detect its character encoding

Source: Internet
Author: User

You can use WebClient to obtain the source code of a web page. However, different websites use different character encodings. How can a program tell these encodings apart automatically and correctly decode the Chinese characters on each page? C# provides a variety of libraries that make transcoding easy, but the character encoding of the site is not obtained automatically, so the downloaded source is decoded with the wrong encoding and comes out garbled. I previously fetched website source code in Java as well, and the Java class libraries I used had the same limitation: they did not decode the page according to its declared character encoding, so the encoding had to be specified manually.

My solution is to first decode the downloaded bytes with the system's default encoding, and then use a regular expression to extract the character encoding declaration from the source code. Most web pages carry this information in the charset declaration of a meta tag, which is what the regular expression in the code looks for. The bytes are then decoded a second time with the declared encoding.
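The two-pass idea can be sketched in isolation, without the network call. This is a minimal sketch under stated assumptions, not the article's code: the class name CharsetSniffer and the methods DetectCharset and Decode are hypothetical, the first pass uses ASCII instead of the system default encoding (ASCII is enough to read the meta tag itself), and UTF-8 is an assumed fallback when no usable declaration is found.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class CharsetSniffer
{
    // Matches charset=... inside a <meta> tag, with or without quotes.
    private static readonly Regex MetaCharset = new Regex(
        @"<meta[^>]*?charset\s*=\s*[""']?([\w-]+)",
        RegexOptions.IgnoreCase);

    // Returns the declared charset name, or null when no declaration is found.
    public static string DetectCharset(byte[] data)
    {
        // First pass: non-ASCII bytes become '?', but the tag stays readable.
        string roughText = Encoding.ASCII.GetString(data);
        Match match = MetaCharset.Match(roughText);
        return match.Success ? match.Groups[1].Value : null;
    }

    // Second pass: decode with the declared charset, falling back to UTF-8
    // (an assumption of this sketch) when the name is missing or unknown.
    public static string Decode(byte[] data)
    {
        string name = DetectCharset(data);
        Encoding encoding = Encoding.UTF8;
        if (name != null)
        {
            try { encoding = Encoding.GetEncoding(name); }
            catch (ArgumentException) { /* unknown charset: keep the fallback */ }
        }
        return encoding.GetString(data);
    }
}
```

Passing the byte array returned by WebClient.DownloadData to Decode would give the decoded page text. Keeping detection separate from downloading makes the step easy to check against fixed byte arrays.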

The source code is as follows. It is a complete routine for obtaining a web page and automatically decoding its Chinese characters:

using System.Net;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

// url is the address of the page to fetch; charSet is the encoding of the target page.
// If charSet is null or "", the encoding is detected from the page itself.
private string GetHtml(string url, string charSet)
{
    // Create a WebClient instance.
    WebClient myWebClient = new WebClient();
    // Note:
    // Some pages may fail to download, for reasons such as cookies or encoding problems.
    // These need case-by-case analysis, for example adding a cookie to the request headers:
    // myWebClient.Headers.Add("Cookie", cookie);
    // Other overloads may be required; write them as needed.

    // Set the network credentials used to authenticate requests to Internet resources.
    myWebClient.Credentials = CredentialCache.DefaultCredentials;
    // If the server needs to verify a user name and password:
    // NetworkCredential myCred = new NetworkCredential(strUser, strPassword);
    // myWebClient.Credentials = myCred;

    // Download the resource and return it as a byte array.
    byte[] myDataBuffer = myWebClient.DownloadData(url);
    // First pass: decode with the system default encoding, enough to read the meta tag.
    string strWebData = Encoding.Default.GetString(myDataBuffer);

    // Obtain the character encoding declared in the page's meta tag.
    Match charSetMatch = Regex.Match(strWebData, "<meta[^>]*charset\\s*=\\s*[\"']?([\\w-]+)", RegexOptions.IgnoreCase);
    string webCharSet = charSetMatch.Groups[1].Value;
    if (string.IsNullOrEmpty(charSet))
        charSet = webCharSet;

    // Second pass: decode again if the requested or detected encoding differs from the default.
    // Encoding.GetEncoding throws ArgumentException for a name it does not recognize.
    if (!string.IsNullOrEmpty(charSet) && !Encoding.GetEncoding(charSet).Equals(Encoding.Default))
        strWebData = Encoding.GetEncoding(charSet).GetString(myDataBuffer);
    return strWebData;
}
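The second decoding step passes whatever name the page declares to Encoding.GetEncoding, and that call throws ArgumentException for a name the runtime does not recognize. One way to guard against this is sketched below. The class EncodingLookup, the method ResolveEncoding, and its fallback parameter are hypothetical names introduced for illustration, not part of the original code.

```csharp
using System;
using System.Text;

public static class EncodingLookup
{
    // Hypothetical helper: returns the named encoding, or the fallback when
    // the name is empty or not recognized by the runtime.
    public static Encoding ResolveEncoding(string name, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(name))
            return fallback;
        try
        {
            return Encoding.GetEncoding(name.Trim());
        }
        catch (ArgumentException)
        {
            // Unknown or unsupported charset name declared by the page.
            return fallback;
        }
    }
}
```

With a helper like this, a page that declares a misspelled charset is decoded with the fallback instead of raising an exception. Note that on .NET Core and later, legacy code pages such as GB2312 are generally only available after calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance), so the fallback may be taken more often there than on .NET Framework.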
