Text Document Encoding Recognition Method

Last Update: 2016-11-21 Source: Internet

Author: User



When reading documents, we often run into files whose encoding is identified incorrectly. Correctly identifying a document's encoding has become a nagging worry for many programmers. Today I will try to cure it. The code below is distilled from an analysis of tens of millions of documents and is heavily condensed.

A friend asked me to help him build an article-processing tool. Working on files means reading and modifying them. A few hours later I handed the program over, and my friend promptly reported that many articles opened as garbled characters. It was a bolt from the blue. I immediately thought of the file encoding problem, which I had tackled countless times before and always ended up failing at. Each attempt only lowered the error rate; none identified the encoding reliably. Now that my friend had raised the problem again, I was at a loss.

If I could not solve this problem, the tool would be useless to my friend. He had treated me to a big meal at his home two days earlier, and I could hardly give that back, so failing now would be a serious loss of face. In desperation, I asked him how many files there were. The answer: tens of millions. My eyes lit up at once. That was enough for a large-scale data analysis.

For the analysis I used a brute-force method: for each file, read the header bytes (the first four bytes), then decode the first 100 characters of the content under each candidate encoding (Unicode, UnicodeBigEndian, UTF8, ANSI, and so on) and check by eye which result is readable. Each file was recorded in a structure like this:

public class Info
{
    public int ch0; // the first byte of the file
    public int ch1; // the second byte
    public int ch2; // the third byte
    public int ch3; // the fourth byte
    public string UnicodeStr;          // the first 100 characters, decoded as Unicode (UTF-16LE)
    public string UnicodeBigEndianStr; // the first 100 characters, decoded as UTF-16BE
    public string UTF8Str;             // the first 100 characters, decoded as UTF-8
    public string ANSIStr;             // the first 100 characters, decoded as ANSI
}
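As a rough illustration of how one such record might be filled in, here is a minimal sketch. The names `SampleInfo`, `Sampler`, `FromBytes`, and `FromFile` are hypothetical and do not come from the original tool, and the 400-byte read size is an assumption chosen as a generous bound for 100 characters.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical stand-in for the record above; all names are illustrative.
public class SampleInfo
{
    public int[] Head = new int[4];    // first four bytes, -1 where the file is shorter
    public string UnicodeStr;          // preview decoded as UTF-16LE
    public string UnicodeBigEndianStr; // preview decoded as UTF-16BE
    public string UTF8Str;             // preview decoded as UTF-8
    public string ANSIStr;             // preview decoded with the supplied ANSI encoding
}

public static class Sampler
{
    // Decode the bytes and keep at most maxChars characters.
    private static string Preview(Encoding enc, byte[] data, int maxChars)
    {
        string s = enc.GetString(data);
        return s.Length <= maxChars ? s : s.Substring(0, maxChars);
    }

    // Build one record from raw bytes. The ANSI encoding is passed in
    // because it depends on the machine (Encoding.Default in the article).
    public static SampleInfo FromBytes(byte[] data, Encoding ansi, int maxChars = 100)
    {
        var info = new SampleInfo();
        for (int i = 0; i < 4; i++)
            info.Head[i] = i < data.Length ? data[i] : -1;
        info.UnicodeStr = Preview(Encoding.Unicode, data, maxChars);
        info.UnicodeBigEndianStr = Preview(Encoding.BigEndianUnicode, data, maxChars);
        info.UTF8Str = Preview(Encoding.UTF8, data, maxChars);
        info.ANSIStr = Preview(ansi, data, maxChars);
        return info;
    }

    // Read only the start of the file (assumed: 400 bytes) and sample it.
    public static SampleInfo FromFile(string path, Encoding ansi)
    {
        byte[] buffer = new byte[400];
        int total = 0;
        using (FileStream fs = File.OpenRead(path))
        {
            int n;
            while (total < buffer.Length &&
                   (n = fs.Read(buffer, total, buffer.Length - total)) > 0)
            {
                total += n;
            }
        }
        byte[] data = new byte[total];
        Array.Copy(buffer, data, total);
        return FromBytes(data, ansi);
    }
}
```

Calling `Sampler.FromFile(path, Encoding.Default)` for every file would produce the list of records that the rest of the analysis works on.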

I sorted the records with lambda expressions. I suggest sorting by UnicodeStr, UnicodeBigEndianStr, UTF8Str, and ANSIStr in turn: readable characters fall within a limited range, so after sorting, the records whose text decodes to recognizable Chinese characters end up grouped together.
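The sketch below shows one way such a sort could look, together with a simple tally of leading byte pairs of the kind that helps when classifying header bytes. The `Sample` and `SampleAnalysis` names are hypothetical, and ordinal comparison is an assumption; the article does not say which comparison was used.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, trimmed-down record: two header bytes and one decoded preview.
public class Sample
{
    public int Ch0;
    public int Ch1;
    public string UTF8Str;
}

public static class SampleAnalysis
{
    // Sort in place by the UTF-8 preview using a lambda comparison.
    // Ordinal comparison orders strings by UTF-16 code unit, so previews
    // that start with characters from the same range end up adjacent.
    public static void SortByUtf8(List<Sample> samples)
    {
        samples.Sort((a, b) => string.CompareOrdinal(a.UTF8Str, b.UTF8Str));
    }

    // Count how often each (Ch0, Ch1) pair occurs, most frequent first.
    public static List<KeyValuePair<string, int>> CountHeaderPairs(IEnumerable<Sample> samples)
    {
        return samples
            .GroupBy(s => s.Ch0 + " " + s.Ch1)
            .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
            .OrderByDescending(p => p.Value)
            .ThenBy(p => p.Key, StringComparer.Ordinal)
            .ToList();
    }
}
```

Sorting by each of the other preview fields works the same way with a different lambda.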

Then the four header bytes, ch0 through ch3, can be classified in detail to see how they relate to the encodings. Observation revealed clear patterns, and after induction and summarization, the text encoding can be identified as follows:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace DocumentOperationTool
{
    public class TextHelper
    {
        // Detects the encoding of the file at the given path.
        public static System.Text.Encoding GetType(string filename)
        {
            FileStream fs = new FileStream(filename, FileMode.Open, FileAccess.Read);
            System.Text.Encoding r = GetType(fs);
            fs.Close();
            return r;
        }

        // Detects the encoding from the first three bytes of the stream.
        // Note: closing the BinaryReader also closes fs.
        public static System.Text.Encoding GetType(FileStream fs)
        {
            /*
             Leading byte values observed in the sample files:
             Unicode          ---- 255 254
             UnicodeBigEndian ---- 254 255
             UTF8             ---- 34 228 34 229 34 230 34 231 34 232 34 233 34 239 187
             ANSI             ---- 34 176 34 177 34 179 34 180 34 182 34 185 34 191 34 194
                                   34 196 34 198 34 205 34 206 34 208 34 209 34 210 34 211
                                   213 196 167 202 213
            */
            BinaryReader r = new BinaryReader(fs, System.Text.Encoding.Default);
            byte[] ss = r.ReadBytes(3);
            r.Close();

            // ReadBytes returns fewer than 3 bytes for very short files;
            // treat the missing bytes as 0 instead of indexing past the end.
            int lef = ss.Length > 0 ? ss[0] : 0;
            int mid = ss.Length > 1 ? ss[1] : 0;
            int rig = ss.Length > 2 ? ss[2] : 0;

            /*
             Header bytes 255 254:     Unicode (UTF-16LE);
             header bytes 254 255 0:   UTF-16BE;
             header bytes 239 187 191: UTF-8.
            */
            if (lef == 255 && mid == 254)
            {
                return Encoding.Unicode;
            }
            else if (lef == 254 && mid == 255 && rig == 0)
            {
                return Encoding.BigEndianUnicode;
            }
            else if (lef == 254 && mid == 255)
            {
                return Encoding.BigEndianUnicode;
            }
            else if (lef == 239 && mid == 187 && rig == 191)
            {
                return Encoding.UTF8;
            }
            else if (lef == 239 && mid == 187)
            {
                return Encoding.UTF8;
            }
            else if ((lef == 196 && mid == 167) || (lef == 206 && mid == 228) || (lef == 202 && mid == 213))
            {
                // Byte pairs observed only in ANSI files
                return Encoding.Default;
            }
            else
            {
                // No known header: guess from the first byte, or from the
                // second byte when the file starts with a double quote (34).
                if (lef == 34)
                {
                    if (mid < 220) return Encoding.Default;
                    else return Encoding.UTF8;
                }
                else
                {
                    if (lef < 220) return Encoding.Default;
                    else return Encoding.UTF8;
                }
            }
        }
    }
}
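The three header checks at the top of that method (255 254, 254 255, and 239 187 191) correspond to the byte order marks that .NET reports through `Encoding.GetPreamble()`. As a cross-check, the sketch below compares a buffer against those preambles instead of hard-coded numbers. This is an alternative formulation, not the author's method; `BomSniffer` and `FromBom` are hypothetical names, and the heuristic fallback for files without a byte order mark is deliberately left out.

```csharp
using System;
using System.Text;

public static class BomSniffer
{
    // Returns the encoding whose byte order mark starts the buffer,
    // or null when none of the three candidates matches.
    public static Encoding FromBom(byte[] head)
    {
        Encoding[] candidates =
        {
            Encoding.UTF8,             // preamble 239 187 191
            Encoding.Unicode,          // preamble 255 254
            Encoding.BigEndianUnicode  // preamble 254 255
        };

        foreach (Encoding enc in candidates)
        {
            byte[] bom = enc.GetPreamble();
            if (bom.Length == 0 || head.Length < bom.Length)
                continue;

            bool match = true;
            for (int i = 0; i < bom.Length; i++)
            {
                if (head[i] != bom[i]) { match = false; break; }
            }
            if (match) return enc;
        }
        return null;
    }
}
```

A caller could try `FromBom` on the first few bytes of a file and fall back to the byte-range heuristics only when it returns null.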

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
