Text document encoding recognition method, text document encoding

Source: Internet
Author: User

Text Document Encoding Recognition Method

When reading documents, we often run into files that are decoded with the wrong encoding. Correctly identifying a document's encoding has become a nagging worry for many programmers. Today I will try to cure it. The code in this article is distilled, in very compact form, from a data analysis of tens of millions of documents.

At a friend's request, I agreed to build him a tool for processing articles. Working on the files meant reading and modifying them. A few hours later I handed over the program, and my friend immediately remarked that many articles opened as garbled text. It hit me like a bolt from the blue: the file encoding problem. I had tried to solve it countless times and had always failed. Each attempt only reduced the error rate and never made the detection reliable. Now that my friend had raised it again, I was at a loss.

If the problem stayed unsolved, the tool would be useless to my friend. He had treated me to a big meal two days earlier, and I could hardly give that back; I would lose face. In desperation, I asked how many files there were. The answer: tens of millions. My eyes lit up at once. With that much data, the problem could be attacked through large-scale analysis.

For the analysis I used a crude method. For every file, I read the header bytes (the first four bytes), then decoded the first 100 characters of the content under each candidate encoding (Unicode, UnicodeBigEndian, UTF8, ANSI, and so on) and judged the results by eye. Each file produces a record such as:

public class Info
{
    public int ch0; // first byte of the file
    public int ch1; // second byte of the file
    public int ch2; // third byte of the file
    public int ch3; // fourth byte of the file

    public string UnicodeStr;          // first 100 characters decoded as Unicode (UTF-16LE)
    public string UnicodeBigEndianStr; // first 100 characters decoded as UnicodeBigEndian (UTF-16BE)
    public string UTF8Str;             // first 100 characters decoded as UTF8
    public string ANSIStr;             // first 100 characters decoded as ANSI
}
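The sampling step described above could be sketched roughly as follows. This is not the author's original collection code, which the article does not show. The names `Sample`, `Sampler`, `FromBytes`, `FromFile`, and the `prefixBytes` parameter are hypothetical, and the ANSI code page is passed in by the caller because the right one depends on the machine that produced the files.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical record type mirroring the fields described in the text.
public class Sample
{
    public int Ch0, Ch1, Ch2, Ch3; // first four bytes, -1 where the data is shorter
    public string UnicodeStr, UnicodeBigEndianStr, Utf8Str, AnsiStr;
}

public static class Sampler
{
    // Decode the same bytes under each candidate encoding and keep
    // at most maxChars characters of each result for visual inspection.
    public static Sample FromBytes(byte[] data, Encoding ansi, int maxChars = 100)
    {
        string Decode(Encoding e)
        {
            string s = e.GetString(data);
            return s.Length <= maxChars ? s : s.Substring(0, maxChars);
        }
        int At(int i) => i < data.Length ? data[i] : -1;

        return new Sample
        {
            Ch0 = At(0), Ch1 = At(1), Ch2 = At(2), Ch3 = At(3),
            UnicodeStr = Decode(Encoding.Unicode),
            UnicodeBigEndianStr = Decode(Encoding.BigEndianUnicode),
            Utf8Str = Decode(Encoding.UTF8),
            AnsiStr = Decode(ansi)
        };
    }

    // Read only a prefix of the file; prefixBytes is an assumed budget.
    // A prefix may cut a multi-byte character in half, so the last
    // decoded character of a sample can be a replacement character.
    public static Sample FromFile(string path, Encoding ansi, int prefixBytes = 400)
    {
        byte[] buf = new byte[prefixBytes];
        int total = 0;
        using (FileStream fs = File.OpenRead(path))
        {
            int n;
            while (total < buf.Length && (n = fs.Read(buf, total, buf.Length - total)) > 0)
                total += n;
        }
        Array.Resize(ref buf, total);
        return FromBytes(buf, ansi);
    }
}
```

A call such as `Sampler.FromFile(path, Encoding.GetEncoding("iso-8859-1"))` would produce one record per file. For Chinese text the caller would more likely pass a GBK encoding, which on .NET Core and later requires registering the code pages provider first.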

 

I sorted the records with lambda expressions. I suggest sorting by UnicodeStr, UnicodeBigEndianStr, UTF8Str, and ANSIStr in turn, because recognizable characters fall within a limited range of code points. After sorting, the records whose text decodes to recognizable Chinese characters end up grouped together.
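As an illustration of the sorting idea, the sketch below sorts decoded previews with a lambda comparison in ordinal (code point) order. It also adds a small helper that measures how much of a string falls in the CJK Unified Ideographs block. The helper is an addition of this sketch, not something the article describes, and `SampleSorting`, `SortOrdinal`, and `CjkRatio` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class SampleSorting
{
    // Sort decoded previews by ordinal order using a lambda comparison.
    // Strings that start with common Chinese characters (U+4E00..U+9FFF)
    // land next to each other, away from ASCII and from U+FFFD debris.
    public static void SortOrdinal(List<string> previews)
    {
        previews.Sort((a, b) => string.CompareOrdinal(a, b));
    }

    // Fraction of characters inside the CJK Unified Ideographs block.
    // A high ratio suggests the bytes were decoded with the right encoding.
    public static double CjkRatio(string s)
    {
        if (string.IsNullOrEmpty(s)) return 0.0;
        int hits = 0;
        foreach (char c in s)
        {
            if (c >= '\u4E00' && c <= '\u9FFF') hits++;
        }
        return (double)hits / s.Length;
    }
}
```

Ordinal comparison is used here because culture-aware comparison would not order strings by raw code point, which is the property the grouping relies on.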

Then the four header bytes (ch0 through ch3) can be classified in detail to see how they relate to each encoding. Patterns emerge from observation. After induction and summarization, the text encoding can be identified as follows:
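One way to carry out this classification is to count how often each pair of leading bytes occurs among files already judged to belong to one encoding. The article does not show how the author tallied the bytes, so the following is only a sketch under that assumption. `PrefixStats` and `CountPairs` are hypothetical names, and the tuple syntax assumes C# 7 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PrefixStats
{
    // Count how often each (first byte, second byte) pair occurs,
    // returning the pairs ordered from most to least frequent.
    public static List<KeyValuePair<(int, int), int>> CountPairs(
        IEnumerable<(int Ch0, int Ch1)> prefixes)
    {
        return prefixes
            .GroupBy(p => p) // value tuples compare structurally
            .Select(g => new KeyValuePair<(int, int), int>(g.Key, g.Count()))
            .OrderByDescending(kv => kv.Value)
            .ToList();
    }
}
```

Running this once per encoding group gives a frequency table of leading bytes, which is the kind of evidence the byte thresholds in the detection code would be drawn from.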

 

 

using System;
using System.IO;
using System.Text;

namespace DocumentOperationTool
{
    public class TextHelper
    {
        public static Encoding GetType(string filename)
        {
            using (FileStream fs = new FileStream(filename, FileMode.Open, FileAccess.Read))
            {
                return GetType(fs);
            }
        }

        public static Encoding GetType(FileStream fs)
        {
            /* Leading bytes observed for each encoding:
               Unicode:          255 254
               UnicodeBigEndian: 254 255
               UTF8:             34 228 34 229 34 230 34 231 34 232 34 233 34 239 187
               ANSI:             34 176 34 177 34 179 34 180 34 182 34 185 34 191 34 194
                                 34 196 34 198 34 205 34 206 34 208 34 209 34 210 34 211
                                 213 196 167 202 213
            */
            BinaryReader r = new BinaryReader(fs, Encoding.Default);
            byte[] ss = r.ReadBytes(3);
            r.Close(); // also closes the underlying stream

            // ReadBytes returns fewer than 3 bytes for very short files;
            // treat the missing bytes as 0 instead of indexing out of range.
            int lef = ss.Length > 0 ? ss[0] : 0;
            int mid = ss.Length > 1 ? ss[1] : 0;
            int rig = ss.Length > 2 ? ss[2] : 0;

            /* Header bytes 255 254:     Unicode (UTF-16LE)
               Header bytes 254 255:     UTF-16BE (third byte often 0)
               Header bytes 239 187 191: UTF-8 */
            if (lef == 255 && mid == 254)
            {
                return Encoding.Unicode;
            }
            else if (lef == 254 && mid == 255)
            {
                return Encoding.BigEndianUnicode;
            }
            else if (lef == 239 && mid == 187)
            {
                // The third byte is normally 191; the original accepts any value.
                return Encoding.UTF8;
            }
            else if ((lef == 196 && mid == 167) || (lef == 206 && mid == 228) || (lef == 202 && mid == 213))
            {
                // Encoding.Default is the system ANSI code page on .NET Framework
                // (on .NET Core and later it is UTF-8).
                return Encoding.Default;
            }
            else if (lef == 34)
            {
                // The file starts with a double quote: decide on the second byte.
                return mid < 220 ? Encoding.Default : Encoding.UTF8;
            }
            else
            {
                return lef < 220 ? Encoding.Default : Encoding.UTF8;
            }
        }
    }
}
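The byte thresholds above come from the author's data set and may not carry over to other collections of files. As a cross-check, the sketch below tests the same byte order marks and then, for files without one, uses strict UTF-8 validation in place of the author's first-byte threshold. This is a different technique, named plainly as such. `BomSniffer` and `Guess` are hypothetical names, and the method returns a label rather than an `Encoding` so that it does not depend on any code page being installed.

```csharp
using System;
using System.Text;

public static class BomSniffer
{
    public static string Guess(byte[] data)
    {
        // Byte order marks, checked first.
        if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
            return "UTF-8 (BOM)";
        if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE)
            return "UTF-16LE";
        if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
            return "UTF-16BE";

        // No BOM: accept UTF-8 only if every byte sequence is well formed.
        // The second constructor argument makes the decoder throw on invalid bytes.
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(data);
            return "UTF-8"; // pure ASCII also ends up here
        }
        catch (DecoderFallbackException)
        {
            return "ANSI"; // not valid UTF-8, so assume a legacy code page
        }
    }
}
```

The validation step assumes `data` holds the whole file, or a prefix that does not end in the middle of a multi-byte character. A truncated prefix of a valid UTF-8 file would otherwise be reported as ANSI.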

 
