Text Document Encoding Recognition Method
When reading documents, we often decode them with the wrong encoding. Correctly identifying a document's encoding has become a chronic headache for many programmers. Today I will try to cure that headache. The code here is distilled from an analysis of tens of millions of documents and boiled down to a handful of rules.
A friend asked me to write a tool for batch-processing articles. Processing a file means reading and modifying it. It took me a few hours to finish the program and hand it over, and then my friend dropped one sentence on me: many articles opened as garbled characters. It was a bolt from the blue. I immediately thought of the file encoding problem. I had tackled it countless times before and always ended up failing. Each attempt only reduced the error rate and never made encoding detection truly reliable. Now my friend had raised the same problem, and I was at a loss.
If the problem stayed unsolved, the tool would be useless to my friend. He had treated me to a big meal two days earlier, and I could hardly give that back. I was about to lose face badly. In desperation, I asked him how many files there were. The answer: tens of millions. My eyes lit up at once. With that much data, I could analyze it in bulk.
For the bulk analysis I used a crude method: for every file, read the header bytes (the first four bytes), then decode the first 100 characters of the content under each candidate encoding (Unicode, UnicodeBigEndian, UTF8, ANSI, and so on) and check by eye which result is readable. Each file was recorded in a structure like this:
public class Info {
    public int ch0; // the first byte of the file
    public int ch1; // the second byte
    public int ch2; // the third byte
    public int ch3; // the fourth byte
    public string UnicodeStr;          // the first 100 characters decoded as Unicode (UTF-16LE)
    public string UnicodeBigEndianStr; // the first 100 characters decoded as UnicodeBigEndian (UTF-16BE)
    public string UTF8Str;             // the first 100 characters decoded as UTF8
    public string ANSIStr;             // the first 100 characters decoded as ANSI
}
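The article does not show how these records were filled in. A minimal sketch of one way to do it follows; `InfoSampler`, `FromBytes`, `FromFile`, and the 400-byte buffer size are hypothetical names and choices, not the author's code, and the ANSI code page is passed in by the caller because it depends on the machine and runtime.

```csharp
using System;
using System.IO;
using System.Text;

// Same fields as the Info class in the article, repeated so the sketch compiles on its own.
public class Info
{
    public int ch0, ch1, ch2, ch3;
    public string UnicodeStr, UnicodeBigEndianStr, UTF8Str, ANSIStr;
}

// Hypothetical helper (not from the article): builds one Info record per file.
public static class InfoSampler
{
    // Decode the same leading bytes under every candidate encoding.
    public static Info FromBytes(byte[] head, Encoding ansi)
    {
        return new Info
        {
            ch0 = ByteAt(head, 0),
            ch1 = ByteAt(head, 1),
            ch2 = ByteAt(head, 2),
            ch3 = ByteAt(head, 3),
            UnicodeStr = First100(Encoding.Unicode, head),
            UnicodeBigEndianStr = First100(Encoding.BigEndianUnicode, head),
            UTF8Str = First100(Encoding.UTF8, head),
            ANSIStr = First100(ansi, head)
        };
    }

    // Read up to 400 bytes (an assumed size, enough for about 100 characters).
    public static Info FromFile(string path, Encoding ansi)
    {
        byte[] buffer = new byte[400];
        int n = 0;
        using (FileStream fs = File.OpenRead(path))
        {
            int read;
            while (n < buffer.Length && (read = fs.Read(buffer, n, buffer.Length - n)) > 0)
                n += read;
        }
        byte[] head = new byte[n];
        Array.Copy(buffer, head, n);
        return FromBytes(head, ansi);
    }

    // Returns -1 when the file is shorter than the requested position.
    private static int ByteAt(byte[] b, int i) { return i < b.Length ? b[i] : -1; }

    private static string First100(Encoding enc, byte[] bytes)
    {
        string s = enc.GetString(bytes);
        return s.Length <= 100 ? s : s.Substring(0, 100);
    }
}
```

Note that `GetString` does not strip a byte order mark, so a file that starts with one will show it as the first decoded character.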
I sorted the records with lambda expressions. I suggest sorting by UnicodeStr, UnicodeBigEndianStr, UTF8Str, and ANSIStr in turn, because text that decodes correctly falls within a limited character range. After sorting, the entries that decode to recognizable Chinese characters end up grouped together.
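The sorting step could look roughly like the sketch below. `SampleSorting.SortByColumn` is a hypothetical helper, and the choice of ordinal comparison is an assumption: it orders strings by UTF-16 code unit, so ASCII text, CJK text, and strings full of replacement characters (U+FFFD) tend to land in separate runs.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not from the article): sort records by one decoded column.
public static class SampleSorting
{
    public static List<T> SortByColumn<T>(IEnumerable<T> items, Func<T, string> column)
    {
        // Ordinal comparison sorts by raw UTF-16 code units, not by culture rules.
        return items.OrderBy(column, StringComparer.Ordinal).ToList();
    }
}
```

With the `Info` records from the article, a call such as `SampleSorting.SortByColumn(infos, i => i.UTF8Str)` would sort by the UTF8 column; repeating it for the other three columns gives one sorted view per candidate encoding.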
Then we can group the records by ch0, ch1, ch2, and ch3 to see how the leading bytes relate to the encoding. Patterns show up on inspection. After induction and summarization, the text encoding can be identified as follows:
using System;
using System.IO;
using System.Text;

namespace DocumentOperationTool
{
    public class TextHelper
    {
        // Detect the encoding of the file at the given path.
        public static Encoding GetType(string filename)
        {
            using (FileStream fs = new FileStream(filename, FileMode.Open, FileAccess.Read))
            {
                return GetType(fs);
            }
        }

        // Detect the encoding from the first three bytes of the stream.
        // Note: this closes the stream when it is done.
        public static Encoding GetType(FileStream fs)
        {
            /*
             Leading bytes observed for each encoding (decimal):
             Unicode          : 255 254
             UnicodeBigEndian : 254 255
             UTF8             : 34 228, 34 229, 34 230, 34 231, 34 232, 34 233, 239 187
             ANSI             : 34 176, 34 177, 34 179, 34 180, 34 182, 34 185, 34 191, 34 194,
                                34 196, 34 198, 34 205, 34 206, 34 208, 34 209, 34 210, 34 211,
                                213 196 167 202 213
            */
            BinaryReader r = new BinaryReader(fs, Encoding.Default);
            byte[] ss = r.ReadBytes(3);
            r.Close();

            // ReadBytes returns fewer than 3 bytes for very short files; treat missing bytes as 0.
            int lef = ss.Length > 0 ? ss[0] : 0;
            int mid = ss.Length > 1 ? ss[1] : 0;
            int rig = ss.Length > 2 ? ss[2] : 0;

            /*
             Header 255 254     : Unicode (UTF-16LE)
             Header 254 255     : UTF-16BE
             Header 239 187 191 : UTF-8
            */
            if (lef == 255 && mid == 254)
            {
                return Encoding.Unicode;
            }
            else if (lef == 254 && mid == 255)
            {
                // Covers both "254 255 0" and any other third byte.
                return Encoding.BigEndianUnicode;
            }
            else if (lef == 239 && mid == 187)
            {
                // Covers the full mark 239 187 191 (rig == 191) and any other third byte.
                return Encoding.UTF8;
            }
            else if ((lef == 196 && mid == 167) || (lef == 206 && mid == 228) || (lef == 202 && mid == 213))
            {
                // Byte pairs that only appeared in ANSI files.
                // On .NET Core and .NET 5+, Encoding.Default is UTF-8, so an explicit code page may be needed there.
                return Encoding.Default;
            }
            else
            {
                // No header: fall back to the byte ranges found in the data.
                // If the text starts with a double quote (34), judge by the next byte instead.
                int first = (lef == 34) ? mid : lef;
                return first < 220 ? Encoding.Default : Encoding.UTF8;
            }
        }
    }
}
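The header checks can also be written against the byte order marks that .NET itself reports through `Encoding.GetPreamble()`, instead of hard-coded numbers. The sketch below covers only that part and only the three encodings named in this article; `BomSniffer` and `DetectBom` are hypothetical names, and files without a mark still need the byte-range fallback described above.

```csharp
using System;
using System.Text;

// Hypothetical helper (not from the article): match leading bytes against known preambles.
public static class BomSniffer
{
    // UTF-8 (239 187 191), UTF-16LE (255 254), UTF-16BE (254 255).
    private static readonly Encoding[] Candidates =
    {
        Encoding.UTF8, Encoding.Unicode, Encoding.BigEndianUnicode
    };

    // Returns the matching encoding, or null when no byte order mark is present.
    public static Encoding DetectBom(byte[] head)
    {
        foreach (Encoding enc in Candidates)
        {
            byte[] bom = enc.GetPreamble();
            if (bom.Length == 0 || head.Length < bom.Length) continue;

            bool match = true;
            for (int i = 0; i < bom.Length; i++)
            {
                if (head[i] != bom[i]) { match = false; break; }
            }
            if (match) return enc;
        }
        return null;
    }
}
```

A caller would try `DetectBom` first and fall back to the heuristic only when it returns null. The sketch deliberately ignores UTF-32, whose little-endian mark also begins with 255 254.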