Character
Characters are older than computers: from the beginning of human civilization, people have used symbols to represent the world, such as A, B, C, or "one, two, three".
Character Set
A character set is a collection of all characters.
Coded Character Set
A coded character set assigns an ordinal number to each character in a character set. Common coded character sets include the ASCII character set and the Unicode character set; different character sets number their characters differently and contain different numbers of characters.
GBK, UTF-8
GBK and UTF-8 are character encoding formats. Of course, you could also call Unicode an encoding, since it does assign a code to each character; but those Unicode codes follow no byte-level pattern and can only serve as a mapping table.
We know that computers only recognize 0s and 1s, so if a computer stores the character "字" on the hard disk, what it actually stores must be a string of binary digits.
So here is the question: the ordinal of the Chinese character "字" in the Unicode character set is 23383, and 23383 in binary is 101101101010111. Why not store 101101101010111 directly, read the string back when needed, turn it back into 23383, and then look up "字" in the Unicode mapping table?
The answer is no. If we did that, how would the computer know how many 0s and 1s make up one character? We need an encoding format that turns 23383 into a regular pattern of 0s and 1s that the computer can read unambiguously.
GBK and UTF-8 are two such encoding formats.
Take UTF-8 as an example. If our environment uses the Unicode character set, the ordinal of "字" in Unicode is 23383, which is 101101101010111 in binary. Encoding it with UTF-8 applies a specific algorithm (more on this algorithm below) that converts 101101101010111 into the three-byte binary string 11100101 10101101 10010111, which is then stored to the hard disk. When the computer reads it back, if we tell it to read and decode in the UTF-8 encoding format, it takes those three bytes and reverses the steps to recover the Chinese character "字".
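As a quick check of this walkthrough (a minimal sketch, assuming nothing beyond the standard JDK; the expected values come from the example above):

import java.nio.charset.StandardCharsets;

public class StoreAndRead {
    public static void main(String[] args) {
        // Encode "字" with UTF-8: this yields the 3 bytes that would be stored on disk.
        byte[] onDisk = "字".getBytes(StandardCharsets.UTF_8);
        System.out.println(onDisk.length); // 3
        // Decoding the same bytes with UTF-8 reverses the steps and recovers the character.
        System.out.println(new String(onDisk, StandardCharsets.UTF_8)); // 字
    }
}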
The source of garbled characters:
If we encode with GBK when storing to the hard disk, but use UTF-8 in the "reverse back" step when reading from the hard disk, the algorithms differ, and garbled characters may appear.
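A minimal sketch of this mismatch in Java (the class name is illustrative; the garbled output is simply whatever the UTF-8 decoder makes of GBK bytes):

public class MojibakeDemo {
    public static void main(String[] args) throws Exception {
        byte[] stored = "字".getBytes("GBK");            // store with GBK (two bytes)
        System.out.println(new String(stored, "UTF-8")); // decode with UTF-8: garbled
        System.out.println(new String(stored, "GBK"));   // decode with GBK: 字
    }
}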
How to avoid garbled characters:
Decode with the same encoding format that was used to store.
However, suppose user A encodes "字" with GBK, and user B neither knows this, nor has a way to contact A, nor has any convention to learn what format the data on the hard disk was encoded in. What then?
The idea of solving garbled characters:
1. Decode with one encoding format chosen arbitrarily and check whether the result is garbled; if it is, decode with the other format. But this method can misjudge.
2. The UTF-8 encoding format follows a definite pattern, so we can use a regular expression to verify whether a string is the result of UTF-8 encoding.
Detecting garbled text with Java itself
boolean b = java.nio.charset.Charset.forName("GBK").newEncoder().canEncode(str);
When I first came across this method, I thought Java could judge garbled text for us and we could rest easy; later I found that its success rate is not high.
Still, we can use this method as a first-pass check (see the sketch below), and fall back to the second method when it cannot decide.
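A sketch of that first-pass check (the helper name looksLikeGbk is an assumption for illustration; canEncode is a real CharsetEncoder method):

import java.nio.charset.Charset;

public class FirstPassCheck {
    // True if every character of the decoded string can be mapped back to GBK;
    // a wrong decode often produces characters that cannot.
    static boolean looksLikeGbk(String str) {
        return Charset.forName("GBK").newEncoder().canEncode(str);
    }

    public static void main(String[] args) {
        System.out.println(looksLikeGbk("字"));     // true
        System.out.println(looksLikeGbk("\uFFFD")); // false: the replacement character
    }
}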
The encoding pattern of UTF-8
In binary, UTF-8 sequences of one, two, three, four, five, or six bytes each follow a fixed format:
1 byte:  0xxxxxxx
2 bytes: 110xxxxx 10xxxxxx
3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
5 bytes: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
6 bytes: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
Clearly the first byte differs for each length, so the first byte indicates how many bytes the character occupies.
When the computer reads a byte beginning with 0, that byte by itself represents a character, and the computer decodes it alone.
When the computer reads a byte beginning with 110, two bytes represent one character, so the computer takes that byte together with the byte after it and decodes them as one character.
......
In addition to the first byte, the subsequent bytes are in a uniform 10xxxxxx format.
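The lead-byte rule can be written down directly (a sketch following the six-byte scheme above; modern UTF-8 actually stops at four bytes):

public class Utf8LeadByte {
    // Sequence length announced by a lead byte, or -1 for a 10xxxxxx continuation byte.
    static int sequenceLength(int b) {
        if ((b & 0b1000_0000) == 0)           return 1; // 0xxxxxxx
        if ((b & 0b1110_0000) == 0b1100_0000) return 2; // 110xxxxx
        if ((b & 0b1111_0000) == 0b1110_0000) return 3; // 1110xxxx
        if ((b & 0b1111_1000) == 0b1111_0000) return 4; // 11110xxx
        if ((b & 0b1111_1100) == 0b1111_1000) return 5; // 111110xx
        if ((b & 0b1111_1110) == 0b1111_1100) return 6; // 1111110x
        return -1; // 10xxxxxx never starts a character
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength(0xE5)); // 3: the lead byte of "字"
    }
}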
Given the regular format above, we can use a regular expression to detect whether a binary string is UTF-8 encoded. But manipulating raw binary in code is inconvenient, and URL escapes conveniently use hexadecimal, so we can apply the regular expression to the hexadecimal string instead.
How to construct a regular expression
Let's first look at the range of the leading byte for each sequence length in this encoding format:
         Binary              Hexadecimal
1 byte:  00000000~01111111   00~7F
2 bytes: 11000000~11011111   C0~DF
3 bytes: 11100000~11101111   E0~EF
4 bytes: 11110000~11110111   F0~F7
5 bytes: 11111000~11111011   F8~FB
6 bytes: 11111100~11111101   FC~FD
You can verify the ranges above with a calculator.
The continuation byte 10xxxxxx ranges from 10000000 to 10111111, i.e. 80~BF, derived the same way.
Under this scheme, a six-byte UTF-8 sequence can carry at most 1 + 5*6 = 31 bits of payload.
Following this rule, let's first try converting "字" into its UTF-8 hexadecimal by hand:
The character set used by Java is Unicode, so we use Unicode as an example.
1. Find the ordinal number of "字" in the Unicode character set:
public static void main(String[] args) {
    System.out.println((int) '字');
}
The result is: 23383
2. Convert 23383 into binary: 101101101010111.
As you can see, the binary has 15 bits in total, so under the UTF-8 encoding format it must be represented with 3 bytes.
We split 101101101010111 from the back into three groups: 101, 101101, 010111.
Filling these into the 3-byte UTF-8 template (the first group padded with a leading 0):
1110xxxx 10xxxxxx 10xxxxxx
11100101 10101101 10010111
3. Use a calculator to convert the binary into hexadecimal:
0xE5 0xAD 0x97
4. Verify with an online conversion tool: the results match, which shows the rule is correct.
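The four manual steps can also be reproduced programmatically (a sketch; it just re-derives the same numbers):

import java.nio.charset.StandardCharsets;

public class VerifyZi {
    public static void main(String[] args) {
        int ordinal = '字';
        System.out.println(ordinal);                         // 23383
        System.out.println(Integer.toBinaryString(ordinal)); // 101101101010111
        for (byte b : "字".getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("%02X ", b);                   // E5 AD 97
        }
    }
}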
Now that UTF-8's pattern has been introduced, we can rely on regular expressions to determine which format a URL string was encoded with.
Referring back to the leading-byte table above, each byte-length case maps to a character class:
1 byte:  [\\x00-\\x7f]
2 bytes: [\\xc0-\\xdf][\\x80-\\xbf]
3 bytes: [\\xe0-\\xef][\\x80-\\xbf]{2}
4 bytes: [\\xf0-\\xf7][\\x80-\\xbf]{3}
5 bytes: [\\xf8-\\xfb][\\x80-\\xbf]{4}
6 bytes: [\\xfc-\\xfd][\\x80-\\xbf]{5}
Joined together with alternation, the full expression is: ^([\\x00-\\x7f]|[\\xc0-\\xdf][\\x80-\\xbf]|[\\xe0-\\xef][\\x80-\\xbf]{2}|[\\xf0-\\xf7][\\x80-\\xbf]{3}|[\\xf8-\\xfb][\\x80-\\xbf]{4}|[\\xfc-\\xfd][\\x80-\\xbf]{5})+$
The judgment works like this: for example, "字" encoded as UTF-8 and URL-escaped is %E5%AD%97, three bytes in total, which fits the 3-byte case: the first byte E5 falls within [\\xe0-\\xef], and the two following bytes AD and 97 both fall within [\\x80-\\xbf].
So we can say that this character is UTF-8 encoded. We can decode it using the UTF-8 encoding format.
The Java code is as follows:
protected static final Pattern utf8Pattern = Pattern.compile(
        "^([\\x00-\\x7f]|[\\xc0-\\xdf][\\x80-\\xbf]|[\\xe0-\\xef][\\x80-\\xbf]{2}|[\\xf0-\\xf7][\\x80-\\xbf]{3}|[\\xf8-\\xfb][\\x80-\\xbf]{4}|[\\xfc-\\xfd][\\x80-\\xbf]{5})+$");

Matcher matcher = utf8Pattern.matcher(pureValue);
if (matcher.matches()) {
    return "UTF-8";
} else {
    return "GBK";
}
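A minimal end-to-end sketch of how this check could drive a URL decoder (the decode method and the ISO-8859-1 step are assumptions for illustration, not part of the original code: decoding the %XX escapes with ISO-8859-1 turns each raw byte into one char in the \x00-\xff range, which is exactly what the pattern inspects):

import java.net.URLDecoder;
import java.util.regex.Pattern;

public class SmartUrlDecoder {
    static final Pattern UTF8_PATTERN = Pattern.compile(
            "^([\\x00-\\x7f]|[\\xc0-\\xdf][\\x80-\\xbf]|[\\xe0-\\xef][\\x80-\\xbf]{2}"
            + "|[\\xf0-\\xf7][\\x80-\\xbf]{3}|[\\xf8-\\xfb][\\x80-\\xbf]{4}"
            + "|[\\xfc-\\xfd][\\x80-\\xbf]{5})+$");

    public static String decode(String url) throws Exception {
        // Expose the raw bytes as chars, then pick the charset by pattern match.
        String raw = URLDecoder.decode(url, "ISO-8859-1");
        String charset = UTF8_PATTERN.matcher(raw).matches() ? "UTF-8" : "GBK";
        return URLDecoder.decode(url, charset);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(decode("%E5%AD%97")); // 字, via the UTF-8 branch
    }
}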
Defects
Using the method above seems fine at first, but GBK also renders characters as two bytes, and UTF-8 has a two-byte case as well. When a character is GBK-encoded and its hexadecimal happens to fall within the range of UTF-8's two-byte pattern, it will be misjudged as UTF-8, causing a decoding error. Can this really happen?
The answer is yes. Looking at the GBK Simplified Chinese code table,
we find that part of its range falls inside UTF-8's two-byte range.
(Screenshots of the relevant rows of the GBK code table are omitted here.)
In other words, part of GBK's two-byte range in hexadecimal falls within UTF-8's two-byte pattern [\\xc0-\\xdf][\\x80-\\xbf].
For example, the second character in that portion of the table has GBK hexadecimal C0 A0, which fully satisfies the two-byte branch [\\xc0-\\xdf][\\x80-\\xbf] of the UTF-8 regular expression, so it will be mistaken for UTF-8.
Note: I first learned of this flaw from the first blog under "References" below; I tried it myself, and the flaw is real.
Trying to fix the flaw
Following the first blog under "References" below, the idea is to treat the overlapping ranges as GBK encoded.
We pull the first two branches of the regular expression (the one-byte and two-byte cases) out into a separate pattern: ^([\\x01-\\x7f]|[\\xc0-\\xdf][\\x80-\\xbf])+$
If a hexadecimal string matches this pattern, it is treated as GBK encoded.
The modified code is:
protected static final Pattern utf8Pattern = Pattern.compile(
        "^([\\x01-\\x7f]|[\\xc0-\\xdf][\\x80-\\xbf]|[\\xe0-\\xef][\\x80-\\xbf]{2}|[\\xf0-\\xf7][\\x80-\\xbf]{3}|[\\xf8-\\xfb][\\x80-\\xbf]{4}|[\\xfc-\\xfd][\\x80-\\xbf]{5})+$");
protected static final Pattern publicPattern = Pattern.compile(
        "^([\\x01-\\x7f]|[\\xc0-\\xdf][\\x80-\\xbf])+$");

// Sequences in the overlapping one- and two-byte range are claimed as GBK first.
Matcher publicMatcher = publicPattern.matcher(str);
if (publicMatcher.matches()) {
    return "GBK";
}

Matcher matcher = utf8Pattern.matcher(str);
if (matcher.matches()) {
    return "UTF-8";
} else {
    return "GBK";
}
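A quick check of the repaired logic against the overlapping range (a sketch: %C0%A0 stands in for any GBK two-byte code that also fits UTF-8's two-byte pattern, and ISO-8859-1 is again used to expose the raw bytes as chars):

import java.net.URLDecoder;
import java.util.regex.Pattern;

public class RepairTest {
    static final Pattern ONE_OR_TWO = Pattern.compile(
            "^([\\x01-\\x7f]|[\\xc0-\\xdf][\\x80-\\xbf])+$");

    public static void main(String[] args) throws Exception {
        String raw = URLDecoder.decode("%C0%A0", "ISO-8859-1");
        // Before the fix these two bytes matched the UTF-8 two-byte branch;
        // now the dedicated pattern claims them for GBK first.
        System.out.println(ONE_OR_TWO.matcher(raw).matches() ? "GBK" : "possibly UTF-8");
    }
}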
Another flaw
However, text that genuinely consists of one- or two-byte UTF-8 sequences will now be misjudged as GBK...
Still, this is better than misjudging in the other direction. Looking at the Unicode encoding table,
we find that the first Chinese character, "一", already needs 3 bytes in UTF-8, so no Chinese character is ever a 2-byte UTF-8 sequence.
But in GBK, Chinese characters are two bytes.
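This is easy to confirm (a sketch; "GBK" is available on common JDKs even though only the UTF charsets are strictly guaranteed):

public class ByteLengths {
    public static void main(String[] args) throws Exception {
        System.out.println("一".getBytes("UTF-8").length); // 3
        System.out.println("一".getBytes("GBK").length);   // 2
    }
}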
Therefore, with the fix above, we can at least guarantee that Chinese text will not be garbled. For some sites that is enough, such as domestic Chinese shopping sites: product titles are generally Chinese, and users generally search in Chinese, so ensuring Chinese is never garbled covers the important cases.
So the technique still has practical value.
References
1. http://www.cnblogs.com/chengmo/archive/2011/02/19/1958657.html
2. http://www.cnblogs.com/chengmo/archive/2010/10/30/1864004.html
3. Unicode encoding table
4. GBK Simplified Chinese code table
If there are any mistakes, please point them out.