How to determine whether a text file's content is encoded as UTF-8 or ANSI (GBK)

Source: Internet
Author: User

Some UTF-8 encoded text documents begin with a BOM (byte order mark), the three bytes 0xEF, 0xBB, 0xBF, and some do not. The text editor under Windows automatically adds a BOM to the start of the file when saving a text document in the UTF-8 format. Such a document can be identified from its first 3 bytes. However, the BOM is not required and is not recommended. It causes compatibility problems with programs that do not expect a UTF-8 document to carry a BOM, for example when the Java compiler compiles a UTF-8 source file that has a BOM. A BOM also removes one expected feature of UTF-8: when the text consists entirely of ASCII characters, its UTF-8 encoding is identical to ASCII, that is, UTF-8 is backward compatible with ASCII.
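The BOM test described above can be sketched as follows. This is only a minimal illustration; the helper name HasUtf8Bom is hypothetical and does not come from the original article.

```cpp
#include <cstddef>

// Hypothetical helper (name not from the article): returns true when the
// buffer starts with the UTF-8 BOM bytes 0xEF, 0xBB, 0xBF.
bool HasUtf8Bom(const void* buffer, std::size_t size) {
    const unsigned char* p = static_cast<const unsigned char*>(buffer);
    return size >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF;
}
```

A true result strongly suggests UTF-8, while a false result says nothing either way, because a BOM is optional.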

In practice, a document without a BOM cannot be identified by the BOM, and the IsTextUnicode API cannot recognize UTF-8 encoded Unicode strings. The program therefore has to decide according to the encoding rules of UTF-8 itself.

UTF-8 is a multibyte encoding in which one Unicode character is represented by one to several bytes, following a regular pattern:

1 byte: 0xxxxxxx
2 bytes: 110xxxxx 10xxxxxx
3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
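As a small illustration of the patterns above, the lead byte alone tells how long the sequence should be. The function name Utf8SequenceLength is an assumption made for this sketch. It checks bit patterns only, so it does not reject lead bytes that match a pattern but are still never valid (such as 0xC0 or 0xF5).

```cpp
// Hypothetical helper (name not from the article): returns the sequence
// length implied by the lead byte's bit pattern, or 0 if the byte cannot
// start a sequence. Pattern check only, not a full validity check.
int Utf8SequenceLength(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // 10xxxxxx or 11111xxx
}
```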

A string can therefore be traversed according to the patterns above to determine whether it is UTF-8 encoded. Note that each byte of a UTF-8 string is restricted to certain value ranges, so not every byte sequence that matches these patterns is a valid UTF-8 character. For typical applications, however, the pattern check is fairly accurate when the string is long enough, and it is relatively simple to implement. The specific byte value ranges can be found in section 6.4.3 of the book "Unicode Explained". In addition, the BOM itself follows the 3-byte UTF-8 pattern, so this method also works for UTF-8 strings with a BOM.
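To see why the BOM passes the 3-byte pattern, it helps to decode it by hand: the payload bits of 0xEF, 0xBB, 0xBF combine into the code point U+FEFF. The sketch below does that arithmetic; DecodeThreeByteUtf8 is a hypothetical name used only for this example.

```cpp
#include <cstdint>

// Hypothetical helper (name not from the article): decodes one 3-byte
// sequence of the form 1110xxxx 10xxxxxx 10xxxxxx. Returns the code point,
// or -1 if the bytes do not match that pattern.
std::int32_t DecodeThreeByteUtf8(unsigned char b0, unsigned char b1,
                                 unsigned char b2) {
    if ((b0 & 0xF0) != 0xE0 || (b1 & 0xC0) != 0x80 || (b2 & 0xC0) != 0x80) {
        return -1;
    }
    // 4 payload bits from the lead byte, 6 from each continuation byte.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
}
```

For the BOM bytes this gives 0xF000 + 0x0EC0 + 0x3F = 0xFEFF.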

    // Determine whether the buffer is UTF-8 encoded (sequences of up to 3 bytes)
    bool IsUTF8(const void* pbuffer, long size)
    {
        bool isUtf8 = true;
        const unsigned char* start = (const unsigned char*)pbuffer;
        const unsigned char* end = start + size;
        while (start < end)
        {
            if (*start < 0x80) // (10000000): values below 0x80 are ASCII characters
            {
                start++;
            }
            else if (*start < 0xC0) // (11000000): values from 0x80 to 0xBF cannot start a UTF-8 character
            {
                isUtf8 = false;
                break;
            }
            else if (*start < 0xE0) // (11100000): lead byte of a 2-byte UTF-8 character
            {
                if (start >= end - 1) // sequence cut off at the end of the buffer: stop checking
                {
                    break;
                }
                if ((start[1] & 0xC0) != 0x80) // second byte must look like 10xxxxxx
                {
                    isUtf8 = false;
                    break;
                }
                start += 2;
            }
            else if (*start < 0xF0) // (11110000): lead byte of a 3-byte UTF-8 character
            {
                if (start >= end - 2) // sequence cut off at the end of the buffer: stop checking
                {
                    break;
                }
                if ((start[1] & 0xC0) != 0x80 || (start[2] & 0xC0) != 0x80)
                {
                    isUtf8 = false;
                    break;
                }
                start += 3;
            }
            else // 0xF0 and above: 4-byte sequences are not handled and count as invalid
            {
                isUtf8 = false;
                break;
            }
        }
        return isUtf8;
    }

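The program only checks UTF-8 characters up to 3 bytes long; a lead byte of 0xF0 or above is treated as invalid. In practice, most commonly used characters, including common Chinese characters, need at most 3 bytes. Characters outside the Basic Multilingual Plane, such as emoji, need 4 bytes and are rejected by this function.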
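If 4-byte characters and the byte value ranges mentioned earlier matter, a stricter check is possible. The sketch below is not the article's routine. It is a variant that follows the well-formed byte ranges defined by the Unicode Standard (no overlong forms, no surrogates, nothing above U+10FFFF). The names IsWellFormedUtf8 and InRange are hypothetical. Unlike the function above, it treats a sequence cut off at the end of the buffer as invalid.

```cpp
#include <cstddef>

// Hypothetical helper: true if b lies within [lo, hi].
static bool InRange(unsigned char b, unsigned char lo, unsigned char hi) {
    return b >= lo && b <= hi;
}

// Hypothetical stricter variant (not the article's function): accepts only
// well-formed UTF-8 sequences of 1 to 4 bytes.
bool IsWellFormedUtf8(const void* buffer, std::size_t size) {
    const unsigned char* p = static_cast<const unsigned char*>(buffer);
    const unsigned char* end = p + size;
    while (p < end) {
        unsigned char b0 = *p;
        if (b0 <= 0x7F) { ++p; continue; }   // ASCII
        std::size_t len = 0;
        unsigned char lo = 0x80, hi = 0xBF;  // allowed range of the 2nd byte
        if (InRange(b0, 0xC2, 0xDF)) len = 2;
        else if (b0 == 0xE0) { len = 3; lo = 0xA0; }  // excludes overlong forms
        else if (b0 == 0xED) { len = 3; hi = 0x9F; }  // excludes surrogates
        else if (InRange(b0, 0xE1, 0xEF)) len = 3;
        else if (b0 == 0xF0) { len = 4; lo = 0x90; }  // excludes overlong forms
        else if (InRange(b0, 0xF1, 0xF3)) len = 4;
        else if (b0 == 0xF4) { len = 4; hi = 0x8F; }  // stays at or below U+10FFFF
        else return false;  // 0x80..0xC1 and 0xF5..0xFF never start a sequence
        if (static_cast<std::size_t>(end - p) < len) return false;  // truncated
        if (!InRange(p[1], lo, hi)) return false;
        for (std::size_t i = 2; i < len; ++i) {
            if (!InRange(p[i], 0x80, 0xBF)) return false;
        }
        p += len;
    }
    return true;
}
```

For example, the GBK bytes 0xD6, 0xD0 fail this check because 0xD0 is not a continuation byte, while the UTF-8 bytes 0xE4, 0xB8, 0xAD pass. A buffer that is pure ASCII passes too, and in that case the check cannot tell UTF-8 from ANSI because the two encodings coincide.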

