Determine whether a file is encoded in UTF-8 or GB2312

Source: Internet
Author: User

For text that contains only Chinese and English characters, determining the encoding is fairly easy. The most common encoding for Chinese characters is GB2312. Larger character sets such as GBK are backward compatible with GB2312, but many of their additional characters are rarely used in daily life. Therefore, we generally only need to distinguish between GB2312 and UTF-8. Here I provide only one feasible method; to detect GBK, a similar approach can be used.

First, consider how GB2312 encodes Chinese characters. GB2312 uses two bytes per Chinese character: the first byte is in the range 161 to 247 (0xA1 to 0xF7) and the second byte is in the range 161 to 254 (0xA1 to 0xFE), with both boundaries included. The UTF-8 encoding can be described as follows:

 

In the patterns below, the letters w, x, y and z stand for bits of the code point.

Code point range (hex): 000000-00007F (128 codes)
  Scalar value (binary): 00000000 00000000 0zzzzzzz (seven z)
  UTF-8 (binary/hex):    0zzzzzzz (00-7F) (seven z)
  Note: ASCII character range; the byte starts with 0.

Code point range (hex): 000080-0007FF (1920 codes)
  Scalar value (binary): 00000000 00000yyy yyzzzzzz (three y; two y; six z)
  UTF-8 (binary/hex):    110yyyyy (C0-DF) 10zzzzzz (80-BF) (five y; six z)
  Note: The first byte starts with 110, and the next byte starts with 10.

Code point range (hex): 000800-00D7FF and 00E000-00FFFF (61440 codes [NOTE 1])
  Scalar value (binary): 00000000 xxxxyyyy yyzzzzzz (four x; four y; two y; six z)
  UTF-8 (binary/hex):    1110xxxx (E0-EF) 10yyyyyy 10zzzzzz (four x; six y; six z)
  Note: The first byte starts with 1110, and each following byte starts with 10.

Code point range (hex): 010000-10FFFF (1048576 codes)
  Scalar value (binary): 000wwwxx xxxxyyyy yyzzzzzz
  UTF-8 (binary/hex):    11110www (F0-F7) 10xxxxxx 10yyyyyy 10zzzzzz
  Note: The first byte starts with 11110, and each following byte starts with 10.
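To make the bit patterns above concrete, here is a minimal sketch that packs a code point into UTF-8 bytes according to the four ranges. The function name encode_utf8 is hypothetical and is not part of the original code; it is only meant to illustrate the table, not to serve as a complete encoder.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not from the article): encode one Unicode scalar
// value as UTF-8, following the four bit patterns described above.
// Returns an empty string for surrogates (D800-DFFF) and for values
// above 10FFFF, which are not valid scalar values.
std::string encode_utf8(std::uint32_t cp)
{
    std::string out;
    if (cp <= 0x7F) {
        out += static_cast<char>(cp);                          // 0zzzzzzz
    } else if (cp <= 0x7FF) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 110yyyyy
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10zzzzzz
    } else if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return out;                                        // surrogate gap
        out += static_cast<char>(0xE0 | (cp >> 12));           // 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));   // 10yyyyyy
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10zzzzzz
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 11110www
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));  // 10xxxxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));   // 10yyyyyy
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10zzzzzz
    }
    return out;
}
```

For example, the Chinese character U+4E2D falls in the third range, so it is encoded as the three bytes E4 B8 AD: a lead byte starting with 1110 followed by two bytes starting with 10.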

Based on these differences in byte patterns, we can distinguish GB2312 from UTF-8. The following code is provided:

    #include <cassert>
    #include <cstring>

    // Count the bytes that form valid GB2312 double-byte sequences.
    unsigned int countgbk(const char *str)
    {
        assert(str != NULL);
        unsigned int len = (unsigned int)strlen(str);
        unsigned int counter = 0;
        unsigned char head = 0x80;
        unsigned char firstchar, secondchar;

        // "i + 1 < len" avoids unsigned underflow when len is 0.
        for (unsigned int i = 0; i + 1 < len; ++i) {
            firstchar = (unsigned char)str[i];
            if (!(firstchar & head))
                continue;                       // ASCII byte
            secondchar = (unsigned char)str[i + 1];
            if (firstchar >= 161 && firstchar <= 247 &&
                secondchar >= 161 && secondchar <= 254) {
                counter += 2;
                ++i;                            // skip the second byte
            }
        }
        return counter;
    }

    // Count the bytes that form UTF-8 multi-byte sequences.
    unsigned int countutf8(const char *str)
    {
        assert(str != NULL);
        unsigned int len = (unsigned int)strlen(str);
        unsigned int counter = 0;
        unsigned char head = 0x80;
        unsigned char firstchar;

        for (unsigned int i = 0; i < len; ++i) {
            firstchar = (unsigned char)str[i];
            if (!(firstchar & head))
                continue;                       // ASCII byte
            unsigned char tmphead = head;
            unsigned int wordlen = 0, tpos = 0;
            // Count the leading 1 bits of the first byte.
            while (firstchar & tmphead) {
                ++wordlen;
                tmphead >>= 1;
            }
            if (wordlen <= 1)
                continue;                       // The minimum UTF-8 multi-byte length is 2
            wordlen--;                          // number of following bytes
            if (wordlen + i >= len)
                break;                          // sequence is truncated
            // Each following byte must have its high bit set.
            for (tpos = 1; tpos <= wordlen; ++tpos) {
                unsigned char secondchar = (unsigned char)str[i + tpos];
                if (!(secondchar & head))
                    break;
            }
            if (tpos > wordlen) {
                counter += wordlen + 1;
                i += wordlen;
            }
        }
        return counter;
    }

    // Return true if UTF-8 covers more bytes of the text than GB2312.
    bool beutf8(const char *str)
    {
        unsigned int igbk = countgbk(str);
        unsigned int iutf8 = countutf8(str);
        if (iutf8 > igbk)
            return true;
        return false;
    }
countutf8 and countgbk count how many bytes of the text conform to the UTF-8 and GB2312 encodings respectively. beutf8 compares the two counts: whichever encoding covers more of the text is taken to be the character set of the text. Note that some GB2312 byte sequences also look like valid UTF-8. For example, Chinese characters whose first byte is C0 or C1 overlap with the UTF-8 two-byte pattern. For such text the two counts can be equal, so whether the comparison in if (iutf8 > igbk) return true; should include an equal sign depends on which encoding is more common in your texts. If an equal sign is included, then, for example, the word "image" is incorrectly recognized as UTF-8.
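The overlap can be checked byte pair by byte pair. The sketch below is an illustration only: the helper names is_gb2312_pair, is_utf8_two_byte_shape and is_ambiguous_pair are hypothetical and do not appear in the code above. It uses the GB2312 ranges given earlier and the two-byte UTF-8 pattern 110yyyyy 10zzzzzz.

```cpp
// Hypothetical helper: true when (b1, b2) lies in the GB2312 double-byte
// ranges described in the article (A1-F7 for the first byte, A1-FE for
// the second byte).
bool is_gb2312_pair(unsigned char b1, unsigned char b2)
{
    return b1 >= 0xA1 && b1 <= 0xF7 && b2 >= 0xA1 && b2 <= 0xFE;
}

// Hypothetical helper: true when (b1, b2) has the shape of a two-byte
// UTF-8 sequence, 110yyyyy followed by 10zzzzzz.
bool is_utf8_two_byte_shape(unsigned char b1, unsigned char b2)
{
    return (b1 & 0xE0) == 0xC0 && (b2 & 0xC0) == 0x80;
}

// A pair is ambiguous when both interpretations accept it.
bool is_ambiguous_pair(unsigned char b1, unsigned char b2)
{
    return is_gb2312_pair(b1, b2) && is_utf8_two_byte_shape(b1, b2);
}
```

With these helpers, the pair C1 A1 is ambiguous, while D6 D0 (the GB2312 encoding of the character U+4E2D) is not, because its second byte does not start with 10. One possible refinement, which the article's code does not apply, is to reject C0 and C1 as UTF-8 lead bytes: strict UTF-8 decoders do not accept them because they could only encode overlong forms of ASCII characters.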
