How to determine whether a text file's content is encoded as UTF-8 or ANSI (GBK)

Last Update:2015-01-13 Source: Internet

Author: User



Some UTF-8 encoded text documents begin with a BOM (byte order mark), the three bytes 0xEF, 0xBB, 0xBF, and some do not. The Windows text editor (Notepad) automatically adds a BOM at the start of the file when it saves a text document as UTF-8. Such a document can be identified by its first 3 bytes. However, the BOM is not required and is not recommended. It causes compatibility problems with programs that do not expect a UTF-8 document to start with a BOM, for example when the Java compiler compiles a UTF-8 source file that has one. The BOM also removes an expected property of UTF-8: when the text consists entirely of ASCII characters, its UTF-8 form is identical to ASCII, that is, UTF-8 is backward compatible with ASCII.
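The 3-byte check described above can be sketched as follows. This is a minimal illustration, not code from the original article; the function name `HasUtf8Bom` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (name not from the article): returns true when the
// buffer begins with the UTF-8 BOM bytes 0xEF 0xBB 0xBF.
bool HasUtf8Bom(const void* buffer, std::size_t size) {
    assert(buffer != nullptr || size == 0);
    const unsigned char* p = static_cast<const unsigned char*>(buffer);
    return size >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF;
}
```

A result of `false` only means that no BOM is present; the file may still be UTF-8.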

In practice, if the document has no BOM, the BOM cannot be used for detection, and the IsTextUnicode API cannot recognize UTF-8 encoded Unicode strings. The program therefore has to decide based on the encoding rules of UTF-8 itself.

UTF-8 is a variable-length encoding: each Unicode character is represented by 1 to 4 bytes that follow a regular pattern:

1 byte: 0xxxxxxx
2 bytes: 110xxxxx 10xxxxxx
3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
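To make the bit patterns above concrete, here is a small sketch of an encoder that packs the bits of a code point into the `x` positions. The function name `EncodeUtf8` is an assumption for illustration, and the sketch does not reject surrogate code points.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Illustrative encoder (hypothetical name): writes a code point using the
// 1- to 4-byte patterns listed above.
std::string EncodeUtf8(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);  // Unicode defines no code points above U+10FFFF
    std::string out;
    if (cp < 0x80) {                 // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, U+4E2D (中) becomes the three bytes E4 B8 AD, while U+0041 (A) stays the single byte 41.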

A string can therefore be traversed using these patterns to determine whether it is UTF-8 encoded. Note that each byte of a UTF-8 sequence is restricted to certain value ranges, so not every byte sequence that matches the patterns is valid UTF-8. Still, for sufficiently long strings this kind of check is fairly accurate in typical applications, and it is relatively simple to implement. The exact byte value ranges are given in section 6.4.3 of the book "Unicode Explained". In addition, the BOM itself follows the 3-byte UTF-8 pattern, so this method also works for UTF-8 strings that have a BOM.
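For readers who want the byte value ranges enforced as well, the following is a hedged sketch of a stricter check. It is not the article's method: the ranges follow the table of well-formed UTF-8 byte sequences in the Unicode Standard (Table 3-7), and the names `InRange` and `IsWellFormedUtf8` are hypothetical. Unlike a pattern-only check, it treats a sequence cut off at the end of the buffer as invalid.

```cpp
#include <cassert>
#include <cstddef>

// Returns true if b lies in the inclusive range [lo, hi].
static bool InRange(unsigned char b, unsigned char lo, unsigned char hi) {
    return b >= lo && b <= hi;
}

// Hypothetical strict validator: checks patterns and per-byte value ranges.
bool IsWellFormedUtf8(const void* buffer, std::size_t size) {
    assert(buffer != nullptr || size == 0);
    const unsigned char* p = static_cast<const unsigned char*>(buffer);
    const unsigned char* end = p + size;
    while (p < end) {
        const unsigned char lead = *p;
        if (lead < 0x80) { ++p; continue; }      // ASCII
        std::size_t len = 0;                      // total bytes in the sequence
        unsigned char lo = 0x80, hi = 0xBF;       // allowed range of byte 2
        if (InRange(lead, 0xC2, 0xDF)) { len = 2; }
        else if (lead == 0xE0) { len = 3; lo = 0xA0; }   // excludes overlong forms
        else if (lead == 0xED) { len = 3; hi = 0x9F; }   // excludes surrogates
        else if (InRange(lead, 0xE1, 0xEF)) { len = 3; }
        else if (lead == 0xF0) { len = 4; lo = 0x90; }   // excludes overlong forms
        else if (InRange(lead, 0xF1, 0xF3)) { len = 4; }
        else if (lead == 0xF4) { len = 4; hi = 0x8F; }   // caps at U+10FFFF
        else { return false; }  // 0x80..0xC1 and 0xF5..0xFF never start a sequence
        if (static_cast<std::size_t>(end - p) < len) return false;  // truncated
        if (!InRange(p[1], lo, hi)) return false;
        for (std::size_t i = 2; i < len; ++i) {
            if (!InRange(p[i], 0x80, 0xBF)) return false;
        }
        p += len;
    }
    return true;
}
```

The extra range checks reject inputs such as C0 80 (an overlong encoding) and ED A0 80 (a surrogate), which satisfy the bit patterns but are not valid UTF-8.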

    // Determine whether the buffer is UTF-8 encoded.
    bool IsUTF8(const void* pBuffer, long size)
    {
        bool isUtf8 = true;
        const unsigned char* start = (const unsigned char*)pBuffer;
        const unsigned char* end = start + size;
        while (start < end)
        {
            if (*start < 0x80) // (10000000): values below 0x80 are ASCII characters
            {
                start++;
            }
            else if (*start < 0xC0) // (11000000): values from 0x80 to 0xBF cannot start a UTF-8 character
            {
                isUtf8 = false;
                break;
            }
            else if (*start < 0xE0) // (11100000): this range starts a 2-byte UTF-8 character
            {
                if (start >= end - 1) // sequence is cut off at the end of the buffer
                {
                    break;
                }
                if ((start[1] & 0xC0) != 0x80) // the second byte must look like 10xxxxxx
                {
                    isUtf8 = false;
                    break;
                }
                start += 2;
            }
            else if (*start < 0xF0) // (11110000): this range starts a 3-byte UTF-8 character
            {
                if (start >= end - 2) // sequence is cut off at the end of the buffer
                {
                    break;
                }
                if ((start[1] & 0xC0) != 0x80 || (start[2] & 0xC0) != 0x80)
                {
                    isUtf8 = false;
                    break;
                }
                start += 3;
            }
            else // 4-byte sequences and invalid lead bytes are rejected
            {
                isUtf8 = false;
                break;
            }
        }
        return isUtf8;
    }

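The program only checks UTF-8 characters up to 3 bytes long. This covers the Basic Multilingual Plane, which contains nearly all commonly used characters, including common Chinese characters. Characters outside it, such as emoji and rare CJK ideographs, take 4 bytes and are rejected by this function.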
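As a sketch of how the same traversal could be extended to 4-byte characters and used to answer the title question, the following classifier separates pure ASCII, UTF-8, and everything else. The enum `TextEncoding` and the function `GuessEncoding` are hypothetical names, and the logic is a pattern-only heuristic under the assumption that a non-UTF-8 file on a Chinese Windows system is most likely ANSI (GBK).

```cpp
#include <cassert>
#include <cstddef>

// Ascii: only bytes below 0x80, so the text reads the same as UTF-8 or GBK.
// Utf8: at least one well-patterned multibyte sequence and no violations.
// NotUtf8: some byte breaks the UTF-8 patterns (for example GBK text).
enum class TextEncoding { Ascii, Utf8, NotUtf8 };

// Hypothetical classifier built on the lead-byte patterns (1 to 4 bytes).
TextEncoding GuessEncoding(const void* buffer, std::size_t size) {
    assert(buffer != nullptr || size == 0);
    const unsigned char* p = static_cast<const unsigned char*>(buffer);
    const unsigned char* end = p + size;
    bool sawMultibyte = false;
    while (p < end) {
        std::size_t extra = 0;  // number of continuation bytes expected
        if (*p < 0x80) extra = 0;
        else if (*p < 0xC0) return TextEncoding::NotUtf8;  // stray 10xxxxxx
        else if (*p < 0xE0) extra = 1;
        else if (*p < 0xF0) extra = 2;
        else if (*p < 0xF8) extra = 3;
        else return TextEncoding::NotUtf8;                  // 11111xxx
        // A sequence cut off by the end of the buffer counts as a violation.
        if (static_cast<std::size_t>(end - p) <= extra) return TextEncoding::NotUtf8;
        for (std::size_t i = 1; i <= extra; ++i) {
            if ((p[i] & 0xC0) != 0x80) return TextEncoding::NotUtf8;
        }
        if (extra > 0) sawMultibyte = true;
        p += extra + 1;
    }
    return sawMultibyte ? TextEncoding::Utf8 : TextEncoding::Ascii;
}
```

For instance, the GBK bytes D6 D0 CE C4 (the word 中文) fail at the second byte, because D0 does not match 10xxxxxx, whereas the UTF-8 bytes E4 B8 AD E6 96 87 for the same word pass.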


This article is an English translation of an article originally published in Chinese on aliyun.com.
