C#: Obtain the webpage source code and automatically determine the webpage character set encoding

Last Update:2018-12-05 Source: Internet

Author: User

You can use WebClient to obtain the source code of a web page. However, different websites use different character encodings, so how can we automatically detect each site's encoding and correctly decode the Chinese characters on its pages? C# provides a variety of libraries that make transcoding easy, but in my experience the downloaded source is not decoded correctly on its own: the page's character encoding is not detected automatically, so the text comes out garbled. I ran into the same problem when fetching page source with Java. The Java class libraries I used could not choose the correct decoding from the page's declared encoding either, so it had to be done by hand.

My solution is to first decode the downloaded bytes using the system's default encoding, and then use a regular expression to extract the character encoding declaration from the source code. Most web pages contain this declaration, typically in a <meta> tag. If the declared encoding differs from the default one, the original bytes are decoded again with the declared encoding.
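The two-pass idea can be tried on an in-memory byte array before any network code is involved. The following is a minimal sketch, not the author's implementation: the class name CharsetSniffer and the helpers DetectCharset and Decode are hypothetical, the first pass uses ISO-8859-1 instead of the system default encoding (an assumption, chosen because it maps every byte to exactly one character so the ASCII markup always survives), and the UTF-8 fallback is also an assumption.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class CharsetSniffer
{
    // Hypothetical helper: returns the charset name declared in a <meta> tag,
    // or an empty string if no declaration is found.
    public static string DetectCharset(byte[] data)
    {
        // First pass: ISO-8859-1 never fails and keeps ASCII markup readable.
        string firstPass = Encoding.GetEncoding("iso-8859-1").GetString(data);
        Match m = Regex.Match(
            firstPass,
            "<meta[^>]*charset\\s*=\\s*[\"']?([\\w-]+)",
            RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : "";
    }

    // Hypothetical helper: second pass, decoding the same bytes with the
    // declared charset. Falls back to UTF-8 (an assumption) when nothing usable is declared.
    public static string Decode(byte[] data)
    {
        string name = DetectCharset(data);
        Encoding encoding = Encoding.UTF8;
        if (name.Length > 0)
        {
            try
            {
                encoding = Encoding.GetEncoding(name);
            }
            catch (ArgumentException)
            {
                // Unknown or unsupported charset name: keep the UTF-8 fallback.
            }
        }
        return encoding.GetString(data);
    }
}
```

For example, passing the UTF-8 bytes of a page that contains <meta charset="utf-8"> to CharsetSniffer.Decode should return the page text with its Chinese characters intact, because the second pass decodes the original bytes rather than the first-pass string.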

The source code is as follows. It is a complete function that downloads a web page and automatically decodes its Chinese characters correctly.

using System;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

// url is the address of the page to fetch; charSet is the encoding of the target page.
// If charSet is null or "", the encoding is detected from the page itself.
private string GetHtml(string url, string charSet)
{
    // Create a WebClient instance.
    WebClient myWebClient = new WebClient();
    // Note:
    // Some pages may fail to download for reasons such as cookies or encoding problems.
    // These need case-by-case handling, for example adding a cookie to the request headers:
    // myWebClient.Headers.Add("Cookie", cookie);
    // Other overloads may be required; write them as needed.

    // Set the network credentials used to authenticate requests to Internet resources.
    myWebClient.Credentials = CredentialCache.DefaultCredentials;
    // If the server requires a user name and password:
    // NetworkCredential myCred = new NetworkCredential(strUser, strPassword);
    // myWebClient.Credentials = myCred;

    // Download the resource and keep it as a byte array.
    byte[] myDataBuffer = myWebClient.DownloadData(url);
    // First pass: decode with the system default encoding, only to read the charset declaration.
    string strWebData = Encoding.Default.GetString(myDataBuffer);

    // Extract the character encoding declared in the page's <meta> tag.
    Match charSetMatch = Regex.Match(strWebData, "<meta[^>]*charset\\s*=\\s*[\"']?([\\w-]+)", RegexOptions.IgnoreCase);
    string webCharSet = charSetMatch.Groups[1].Value;
    if (string.IsNullOrEmpty(charSet))
        charSet = webCharSet;

    // Second pass: decode the original bytes again if the encoding differs from the default.
    if (!string.IsNullOrEmpty(charSet))
    {
        // Encoding.GetEncoding throws ArgumentException if the name is not recognized.
        Encoding pageEncoding = Encoding.GetEncoding(charSet);
        if (!pageEncoding.Equals(Encoding.Default))
            strWebData = pageEncoding.GetString(myDataBuffer);
    }
    return strWebData;
}
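The last step of the function works only because the raw byte array is kept and decoded a second time. Converting the already garbled string would not recover the text. Here is a small sketch of that point, using a hypothetical RedecodeDemo class and two encodings that are built into .NET (UTF-8 and ISO-8859-1); the sample text is made up for illustration.

```csharp
using System;
using System.Text;

public static class RedecodeDemo
{
    // Decodes the raw bytes with the named encoding. The byte array is not modified,
    // so it can be decoded again with a different encoding later.
    public static string DecodeAs(byte[] raw, string encodingName)
    {
        return Encoding.GetEncoding(encodingName).GetString(raw);
    }

    // Returns the bytes a server would send for the given text in UTF-8.
    public static byte[] SampleBytes(string text)
    {
        return Encoding.UTF8.GetBytes(text);
    }
}
```

With raw = RedecodeDemo.SampleBytes("中文网页"), the call DecodeAs(raw, "iso-8859-1") yields a garbled 12-character string (one character per byte), while DecodeAs(raw, "utf-8") on the same array yields the original four characters. On .NET Core and later, legacy code pages such as gb2312 are not available by default and need Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) before Encoding.GetEncoding can find them.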

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
