While building a page scraper, I needed to solve the problem of identifying a web page's character encoding. Many methods have been suggested online, for example inspecting the file's leading bytes (the BOM), or reading the page's charset declaration.
In practice, I found that each of these methods has shortcomings. For example, some pages declare charset=GBK but are actually encoded in UTF-8.
So I came up with what I think is a fresher approach. After downloading the HTML bytes, decode one copy as UTF-8 and one copy as GBK. Then re-encode the UTF-8 string back to bytes and check for the garbled-text signature: three consecutive bytes 239 191 189 (the UTF-8 encoding of the replacement character U+FFFD). If the signature is present, the UTF-8 decoding is garbled, so use the GBK copy; if it is absent, there is no garbling and the UTF-8 copy can be used directly.
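To see why this signature works, here is a minimal sketch of the idea in Python (used here purely for illustration; the article's own code is C#). Decoding GBK bytes as UTF-8 with a replacement fallback turns each invalid sequence into U+FFFD, and re-encoding the result to UTF-8 produces the three-byte marker 239 191 189 (EF BF BD):

```python
# GBK-encoded Chinese text is not valid UTF-8, so decoding it as UTF-8
# with the replacement fallback inserts U+FFFD for each bad sequence.
gbk_bytes = "编码测试".encode("gbk")
decoded = gbk_bytes.decode("utf-8", errors="replace")

# Re-encoding the decoded string to UTF-8 turns every U+FFFD into the
# three-byte signature EF BF BD (239 191 189) that this method checks for.
reencoded = decoded.encode("utf-8")
print(b"\xef\xbf\xbd" in reencoded)  # True: the garbled marker is present
```

A .NET StreamReader constructed with Encoding.UTF8 uses the same replacement fallback by default, which is what makes the byte check below possible.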
Here's a look at my code:
Get HTML
var data = new System.Net.WebClient().DownloadData(this.textBox1.Text); // download the HTML from the URL in textBox1
var r_utf8 = new System.IO.StreamReader(new System.IO.MemoryStream(data), Encoding.UTF8); // wrap the HTML bytes in a UTF-8 StreamReader
var r_gbk = new System.IO.StreamReader(new System.IO.MemoryStream(data), Encoding.Default); // wrap the HTML bytes in a GBK (system default) StreamReader
var t_utf8 = r_utf8.ReadToEnd(); // read the content decoded as UTF-8
var t_gbk = r_gbk.ReadToEnd(); // read the content decoded as GBK
string htm;
if (!Isluan(t_utf8)) // check whether the UTF-8 decoding is garbled
{
    htm = t_utf8;
    this.Text = "UTF8";
}
else
{
    htm = t_gbk;
    this.Text = "GBK";
}
this.textBox2.Text = htm;
Determine if there are garbled characters
bool Isluan(string txt)
{
    var bytes = Encoding.UTF8.GetBytes(txt);
    // Scan for the three-byte signature 239 191 189 (EF BF BD),
    // the UTF-8 encoding of the replacement character U+FFFD.
    for (var i = 0; i < bytes.Length - 2; i++)
    {
        if (bytes[i] == 239 && bytes[i + 1] == 191 && bytes[i + 2] == 189)
        {
            return true;
        }
    }
    return false;
}
C# gets web page source code and automatically determines the encoding format: a new method!