You can use WebClient to download the source of a web page. However, different websites use different character encodings, so the question is how to detect each site's encoding automatically and decode the Chinese characters on its pages correctly. C# provides a variety of libraries that make transcoding easy, but the page's encoding is not picked up automatically, so the downloaded source cannot be decoded correctly and comes out garbled. I ran into the same problem earlier in Java: the various Java class libraries for fetching page source do not decode it according to the page's character encoding either, so the encoding has to be specified by hand.
My solution is to first decode the downloaded bytes with the system's default encoding, and then use a regular expression to extract the character encoding declaration from the resulting source. Almost every web page carries this declaration, normally as a charset value inside a meta tag. Once the encoding is known, the same bytes are decoded again with it.
The source code is as follows. It is a complete listing that downloads a web page and decodes Chinese characters automatically:
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

// url is the address of the page to fetch; charSet is the encoding of the target page.
// If charSet is null or "", the encoding declared by the page is detected automatically.
private string GetHtml(string url, string charSet)
{
    WebClient myWebClient = new WebClient(); // Create a WebClient instance
    // Note:
    // Some pages may fail to download for reasons such as cookies or encoding problems.
    // These need case-by-case handling, for example adding a cookie to the headers:
    // myWebClient.Headers.Add("Cookie", cookie);
    // Other overloads may be required; write them as needed.
    // Gets or sets the network credentials used to authenticate requests to Internet resources.
    myWebClient.Credentials = CredentialCache.DefaultCredentials;
    // If the server needs to verify a user name and password:
    // NetworkCredential myCred = new NetworkCredential(strUser, strPassword);
    // myWebClient.Credentials = myCred;
    // Download data from the resource and return it as a byte array.
    byte[] myDataBuffer = myWebClient.DownloadData(url);
    // First pass: decode with the system default encoding so the markup can be searched.
    string strWebData = Encoding.Default.GetString(myDataBuffer);
    // Obtain the character encoding declared in the page
    Match charSetMatch = Regex.Match(strWebData, "<meta([^<]*)charset=([^<]*)\"", RegexOptions.IgnoreCase | RegexOptions.Multiline);
    string webCharSet = charSetMatch.Groups[2].Value;
    if (charSet == null || charSet == "")
        charSet = webCharSet;
    // Second pass: decode again only if the declared encoding differs from the default.
    if (charSet != null && charSet != "" && Encoding.GetEncoding(charSet) != Encoding.Default)
        strWebData = Encoding.GetEncoding(charSet).GetString(myDataBuffer);
    return strWebData;
}
Note: GetHtml is declared private.
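The regular expression in the listing expects the charset value to be followed directly by the closing quote of the attribute, as in content="text/html; charset=gb2312". It captures extra text when other attributes follow, or when the page uses the short form meta charset="utf-8". Encoding.GetEncoding also throws an ArgumentException when the name it receives is unknown. The sketch below is one possible way to make the detection step more tolerant, not part of the original listing. The names CharsetSniffer, DetectCharset and Decode are made up for illustration, and it assumes that the declaration itself consists of ASCII characters.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class CharsetSniffer
{
    // Matches charset=NAME inside a <meta ...> tag, with or without quotes.
    private static readonly Regex MetaCharset = new Regex(
        "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9_\\-]+)",
        RegexOptions.IgnoreCase);

    // Returns the charset name declared in the page, or null if there is none.
    public static string DetectCharset(byte[] data)
    {
        // Assumption: the markup around the declaration is plain ASCII,
        // so an ASCII pass is enough to locate it (other bytes become '?').
        string probe = Encoding.ASCII.GetString(data);
        Match m = MetaCharset.Match(probe);
        return m.Success ? m.Groups[1].Value : null;
    }

    // Decodes with the declared charset, or with fallback if none is usable.
    public static string Decode(byte[] data, Encoding fallback)
    {
        string name = DetectCharset(data);
        if (string.IsNullOrEmpty(name))
            return fallback.GetString(data);
        try
        {
            return Encoding.GetEncoding(name).GetString(data);
        }
        catch (ArgumentException)
        {
            // Unknown or unsupported encoding name on this platform.
            return fallback.GetString(data);
        }
    }
}
```

With a helper like this, the two decoding passes in GetHtml could be replaced by a single call such as CharsetSniffer.Decode(myDataBuffer, Encoding.Default). On newer .NET runtimes, legacy code pages such as gb2312 may not be available by default, in which case the sketch falls back to the supplied encoding.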