Determine whether a file is encoded in UTF-8 or gb2312.

Last Update:2018-12-04 Source: Internet

Author: User

For text that contains only Chinese and English characters, the encoding is fairly easy to determine. GBK is a common encoding for Chinese text, but it is a larger character set that is backward compatible with gb2312, and many of its additional characters are rarely used in daily life. In practice, therefore, it is usually enough to distinguish between gb2312 and UTF-8. Only one feasible method is given here; a similar approach can be used to detect GBK.

First consider how gb2312 encodes Chinese characters. gb2312 uses two bytes per Chinese character: the first byte is in the range 161 to 247 (0xA1 to 0xF7) and the second byte is in the range 161 to 254 (0xA1 to 0xFE), boundaries included. The UTF-8 encoding method can be described as follows:

Code point range (hex)	Scalar value (binary)	UTF-8 (binary / hex)	Notes
000000-00007F (128 codes)	00000000 00000000 0zzzzzzz	0zzzzzzz (00-7F)	ASCII range; the byte starts with 0.
000080-0007FF (1920 codes)	00000000 00000yyy yyzzzzzz	110yyyyy (C0-DF) 10zzzzzz (80-BF)	The first byte starts with 110; the following byte starts with 10.
000800-00D7FF, 00E000-00FFFF (61440 codes)	00000000 xxxxyyyy yyzzzzzz	1110xxxx (E0-EF) 10yyyyyy 10zzzzzz	The first byte starts with 1110; each following byte starts with 10.
010000-10FFFF (1048576 codes)	000wwwxx xxxxyyyy yyzzzzzz	11110www (F0-F7) 10xxxxxx 10yyyyyy 10zzzzzz	The first byte starts with 11110; each following byte starts with 10.
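To make the bit layouts in the table concrete, here is a minimal sketch that encodes a single code point into UTF-8 by following the four rows above. The function name encode_utf8 is a hypothetical helper introduced for illustration; it is not part of the original article's code.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not from the article): encode one Unicode scalar
// value into UTF-8 using the bit layouts from the table.
// Returns an empty string for surrogates and for values above U+10FFFF.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp <= 0x7F) {                        // 0zzzzzzz
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {                // 110yyyyy 10zzzzzz
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {               // 1110xxxx 10yyyyyy 10zzzzzz
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return out;                      // surrogates are excluded from the range
        }
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {             // 11110www 10xxxxxx 10yyyyyy 10zzzzzz
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, U+4E2D falls in the third row: its bits split into xxxx = 0100, yyyyyy = 111000 and zzzzzz = 101101, which gives the three bytes E4 B8 AD.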

These differences in byte patterns make it possible to tell gb2312 and UTF-8 apart. The following code is provided:

#include <cassert>
#include <cstring>

// Count the bytes that belong to byte pairs in the gb2312 double-byte range.
unsigned int countgbk(const char *str) {
    assert(str != NULL);
    unsigned int len = (unsigned int)strlen(str);
    unsigned int counter = 0;
    const unsigned char head = 0x80;
    // i + 1 < len avoids unsigned underflow when the string is empty.
    for (unsigned int i = 0; i + 1 < len; ++i) {
        unsigned char firstchar = (unsigned char)str[i];
        if (!(firstchar & head)) continue;  // ASCII byte
        unsigned char secondchar = (unsigned char)str[i + 1];
        if (firstchar >= 161 && firstchar <= 247 &&
            secondchar >= 161 && secondchar <= 254) {
            counter += 2;
            ++i;  // skip the second byte of the pair
        }
    }
    return counter;
}

// Count the bytes that belong to multi-byte sequences shaped like UTF-8.
unsigned int countutf8(const char *str) {
    assert(str != NULL);
    unsigned int len = (unsigned int)strlen(str);
    unsigned int counter = 0;
    const unsigned char head = 0x80;
    for (unsigned int i = 0; i < len; ++i) {
        unsigned char firstchar = (unsigned char)str[i];
        if (!(firstchar & head)) continue;  // ASCII byte
        // The number of leading 1 bits in the first byte is the sequence length.
        unsigned char tmphead = head;
        unsigned int wordlen = 0, tpos = 0;
        while (firstchar & tmphead) {
            ++wordlen;
            tmphead >>= 1;
        }
        if (wordlen <= 1 || wordlen > 4) continue;  // multi-byte UTF-8 is 2 to 4 bytes long
        wordlen--;  // now the number of following bytes
        if (wordlen + i >= len) break;
        for (tpos = 1; tpos <= wordlen; ++tpos) {
            unsigned char secondchar = (unsigned char)str[i + tpos];
            if ((secondchar & 0xC0) != 0x80) break;  // following bytes must start with 10
        }
        if (tpos > wordlen) {
            counter += wordlen + 1;
            i += wordlen;
        }
    }
    return counter;
}

// Return true when UTF-8 accounts for more bytes than gb2312.
bool beutf8(const char *str) {
    unsigned int igbk = countgbk(str);
    unsigned int iutf8 = countutf8(str);
    if (iutf8 > igbk) return true;
    return false;
}

countutf8 and countgbk count how many bytes of the text belong to sequences that conform to UTF-8 and to gb2312 respectively. beutf8 then compares the two counts: whichever encoding covers more of the text is taken to be the encoding of the text. Note that some gb2312 byte sequences conflict with UTF-8. For example, Chinese characters whose first byte is C0 or C1 overlap with the UTF-8 byte patterns. For this reason, whether the comparison in if (iutf8 > igbk) return true; should include an equal sign depends on which encoding is more common in the texts you handle. If an equal sign is included, then, for example, the gb2312-encoded word rendered here as "image" is incorrectly recognized as UTF-8.
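The overlap described above can be checked directly. The following is a minimal sketch, assuming only the byte ranges and bit patterns stated earlier in this article; the three function names are hypothetical helpers introduced for illustration and are not part of the original code.

```cpp
#include <cassert>

// gb2312 double-byte range as stated in the article:
// first byte 161-247 (0xA1-0xF7), second byte 161-254 (0xA1-0xFE).
bool is_gb2312_pair(unsigned char b1, unsigned char b2) {
    return b1 >= 0xA1 && b1 <= 0xF7 && b2 >= 0xA1 && b2 <= 0xFE;
}

// Two-byte UTF-8 bit pattern from the table: 110yyyyy 10zzzzzz.
// This checks only the shape of the bytes; it does not reject the
// lead bytes C0 and C1, which strict UTF-8 decoders treat as invalid.
bool is_utf8_two_byte(unsigned char b1, unsigned char b2) {
    return (b1 & 0xE0) == 0xC0 && (b2 & 0xC0) == 0x80;
}

// A pair is ambiguous when it fits both encodings, so both counters
// in a counting heuristic would claim the same two bytes.
bool is_ambiguous_pair(unsigned char b1, unsigned char b2) {
    return is_gb2312_pair(b1, b2) && is_utf8_two_byte(b1, b2);
}
```

For instance, the pair C3 A9 fits both patterns, so it contributes equally to both counts, while the pair D6 D0 fits only the gb2312 range because its second byte does not start with the bits 10. Texts made up mostly of ambiguous pairs are the ones where the choice between > and >= decides the result.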

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
