For text that contains only Chinese and English characters, determining the encoding is fairly easy. The most common encoding for Chinese characters is GB2312. Larger character sets such as GBK are backward compatible with GB2312, but many of the characters they add are rarely used in daily life, so in general we only need to distinguish GB2312 from UTF-8. I only provide one feasible method here; if you need to detect GBK, a similar approach can be used.

GB2312 encodes each Chinese character in two bytes. The first byte lies in the range 161 to 247 (0xA1 to 0xF7) and the second byte in the range 161 to 254 (0xA1 to 0xFE), boundaries included. The UTF-8 encoding can be described as follows:
| Code range (hexadecimal) | Scalar value (binary) | UTF-8 (binary / hexadecimal) | Note |
| --- | --- | --- | --- |
| 000000-00007F (128 codes) | 00000000 00000000 0zzzzzzz | 0zzzzzzz (00-7F) | ASCII range; the byte starts with 0. Seven z bits. |
| 000080-0007FF (1920 codes) | 00000000 00000yyy yyzzzzzz | 110yyyyy (C0-DF) 10zzzzzz (80-BF) | The first byte starts with 110 and the next byte starts with 10. Five y bits, six z bits. |
| 000800-00D7FF, 00E000-00FFFF (61440 codes) [Note 1] | 00000000 xxxxyyyy yyzzzzzz | 1110xxxx (E0-EF) 10yyyyyy 10zzzzzz | The first byte starts with 1110 and each following byte starts with 10. Four x bits, six y bits, six z bits. |
| 010000-10FFFF (1048576 codes) | 000wwwxx xxxxyyyy yyzzzzzz | 11110www (F0-F7) 10xxxxxx 10yyyyyy 10zzzzzz | The first byte starts with 11110 and each following byte starts with 10. |
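To make the bit layout in the table concrete, here is a minimal sketch of an encoder that turns one Unicode scalar value into UTF-8 bytes. The function name `encodeUtf8` is my own choice for illustration and is not part of the detection code in this article; it simply applies the four rows of the table one by one.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode scalar value as UTF-8 using the
// four bit layouts from the table. Returns an empty string for values that
// are not scalar values (surrogates D800-DFFF or anything above 10FFFF).
std::string encodeUtf8(std::uint32_t cp) {
    std::string out;
    if (cp <= 0x7F) {
        // 0zzzzzzz
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        // 110yyyyy 10zzzzzz
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogate range
        // 1110xxxx 10yyyyyy 10zzzzzz
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 11110www 10xxxxxx 10yyyyyy 10zzzzzz
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, U+4E2D falls in the third row, so `encodeUtf8(0x4E2D)` produces the three bytes E4 B8 AD.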
These differences in byte structure make it possible to tell GB2312 from UTF-8. The following code shows one way to do it:
    #include <cassert>
    #include <cstring>

    // Count the bytes that belong to byte pairs in the GB2312 double-byte ranges.
    unsigned int countgbk(const char *str) {
        assert(str != NULL);
        unsigned int len = (unsigned int)strlen(str);
        unsigned int counter = 0;
        const unsigned char head = 0x80;
        unsigned char firstchar, secondchar;
        // "i + 1 < len" avoids unsigned underflow when len is 0.
        for (unsigned int i = 0; i + 1 < len; ++i) {
            firstchar = (unsigned char)str[i];
            if (!(firstchar & head)) continue;       // ASCII byte
            secondchar = (unsigned char)str[i + 1];  // the byte after the lead byte
            if (firstchar >= 161 && firstchar <= 247 &&
                secondchar >= 161 && secondchar <= 254) {
                counter += 2;
                ++i;                                 // skip the second byte
            }
        }
        return counter;
    }

    // Count the bytes that belong to sequences shaped like multi-byte UTF-8.
    unsigned int countutf8(const char *str) {
        assert(str != NULL);
        unsigned int len = (unsigned int)strlen(str);
        unsigned int counter = 0;
        const unsigned char head = 0x80;
        unsigned char firstchar;
        for (unsigned int i = 0; i < len; ++i) {
            firstchar = (unsigned char)str[i];
            if (!(firstchar & head)) continue;       // ASCII byte
            unsigned char tmphead = head;
            unsigned int wordlen = 0, tpos = 0;
            // The number of leading 1 bits gives the sequence length.
            while (firstchar & tmphead) { ++wordlen; tmphead >>= 1; }
            if (wordlen <= 1) continue;  // a multi-byte UTF-8 sequence has at least 2 bytes
            wordlen--;                   // number of following bytes
            if (wordlen + i >= len) break;
            for (tpos = 1; tpos <= wordlen; ++tpos) {
                unsigned char secondchar = (unsigned char)str[i + tpos];
                // Only the high bit is tested here; a strict check would be
                // (secondchar & 0xC0) == 0x80.
                if (!(secondchar & head)) break;
            }
            if (tpos > wordlen) {        // every following byte passed the test
                counter += wordlen + 1;
                i += wordlen;
            }
        }
        return counter;
    }

    // Return true if the text looks more like UTF-8 than GB2312.
    bool beutf8(const char *str) {
        unsigned int igbk = countgbk(str);
        unsigned int iutf8 = countutf8(str);
        if (iutf8 > igbk) return true;
        return false;
    }
countutf8 and countgbk count the number of bytes in the text that fit the UTF-8 and GB2312 encoding patterns, respectively. beutf8 then checks which encoding covers more bytes and treats the text as belonging to that encoding. Note that some GB2312 byte sequences conflict with UTF-8; for example, Chinese characters whose first byte is C0 or C1 also match the UTF-8 byte pattern. Whether the comparison in the statement if (iutf8 > igbk) return true; should include an equal sign therefore depends on which encoding is more common in your texts. If the equal sign is included, a tie is resolved in favor of UTF-8; for example, the word "image" is then incorrectly recognized as UTF-8.
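The overlap between the two encodings can be checked directly on byte pairs. The sketch below is separate from the counting functions in this article, and the helper names (`isGb2312Pair`, `isUtf8TwoBytePattern`, `countAmbiguousPairs`) are hypothetical. It uses the strict 10xxxxxx test for the second UTF-8 byte and, as a simplifying assumption, only looks at pairs starting at even offsets, which fits input made up purely of double-byte characters.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: true if (a, b) lies in the GB2312 double-byte ranges.
bool isGb2312Pair(unsigned char a, unsigned char b) {
    return a >= 0xA1 && a <= 0xF7 && b >= 0xA1 && b <= 0xFE;
}

// Hypothetical helper: true if (a, b) matches the two-byte UTF-8 bit
// pattern 110xxxxx 10xxxxxx.
bool isUtf8TwoBytePattern(unsigned char a, unsigned char b) {
    return (a & 0xE0) == 0xC0 && (b & 0xC0) == 0x80;
}

// Count the byte pairs (taken at even offsets) that satisfy both tests,
// i.e. pairs that neither encoding pattern can rule out.
std::size_t countAmbiguousPairs(const unsigned char *data, std::size_t len) {
    std::size_t n = 0;
    for (std::size_t i = 0; i + 1 < len; i += 2) {
        if (isGb2312Pair(data[i], data[i + 1]) &&
            isUtf8TwoBytePattern(data[i], data[i + 1])) {
            ++n;
        }
    }
    return n;
}
```

A well-known case is the byte sequence C1 AA CD A8, which to my knowledge is the GB2312 encoding of the word 联通: both pairs pass both tests, so the two byte counts tie and the choice between > and >= decides the result. By contrast, the pair D6 D0 (中 in GB2312) fails the strict UTF-8 test because its second byte does not start with 10.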