Background Basics:
1, Character encoding background (reposted from http://blog.csdn.net/llwan/article/details/7567906)
1.1, A "character" is represented by a number
First, let us revisit how a computer handles a "character". Everyone should keep this principle in mind, and when writing Java programs it must never be fuzzy. Computers represent everything with numbers, and characters are no exception. For example, to display the digit "3", a PC does not store the number 3 itself; it stores the hexadecimal value 0x33. Whether in memory or in a file, what is written is 0x33. If you do not believe it, create a text file, type a "3", and open the file in a hex editor such as UltraEdit to see the raw byte.
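A minimal sketch of this point in Java. The class and method names are made up for illustration; the only library calls are String.getBytes and Integer.toHexString.

```java
import java.nio.charset.StandardCharsets;

class CharAsNumber {
    // Returns the numeric value of the single byte that ASCII uses for c.
    static int asciiByte(char c) {
        byte[] bytes = String.valueOf(c).getBytes(StandardCharsets.US_ASCII);
        return bytes[0] & 0xFF;
    }

    public static void main(String[] args) {
        // The character '3' is stored as 0x33 (decimal 51), not as the number 3.
        System.out.println("0x" + Integer.toHexString(asciiByte('3'))); // prints 0x33
    }
}
```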
1.2, Every "character" is represented by a number plus an encoding table
This raises a question: why must 0x33 represent "3"? Why not 0x43, or simply 0x03? In fact any value would do, but everyone has agreed to use the ASCII table (American Standard Code for Information Interchange) to decide which number stands for which character. Similarly, to represent Chinese characters, China defined its own encoding tables, the most widely used being GB2312. For example, the Chinese character "当" is represented by the two 8-bit numbers 0xB5, 0xB1. So if the program displaying the characters does not know which encoding table the numbers were encoded with, it cannot tell what the characters are, and if it uses the wrong table, the result will most likely be completely wrong. For example, an English system that has no GB2312 table, given 0xB5, 0xB1, will naively treat them as single-byte characters in its own default table (the operating system usually has a default encoding table). The result is two strange symbols (µ and ± under ISO-8859-1, for instance), because that is what these two numbers mean in that table. Likewise, on a Traditional Chinese system whose table is BIG5, the bytes show up as some other, unrelated Chinese character, not "当".
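The effect of picking different tables for the same two bytes can be sketched as follows. This assumes the JDK in use ships the GB2312 and Big5 charsets (standard full JDKs do); the class and helper names are illustrative only. Characters are printed as U+XXXX so the output does not depend on the console's own encoding.

```java
import java.nio.charset.Charset;

class SameBytesDifferentTables {
    static final byte[] BYTES = {(byte) 0xB5, (byte) 0xB1};

    // Decodes the same two bytes with the named encoding table.
    static String decode(String charsetName) {
        return new String(BYTES, Charset.forName(charsetName));
    }

    // Formats each char of s as U+XXXX so the result is readable on any console.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("U+%04X ", (int) c));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println("GB2312:     " + describe(decode("GB2312")));     // one character
        System.out.println("ISO-8859-1: " + describe(decode("ISO-8859-1"))); // two symbols
        System.out.println("Big5:       " + describe(decode("Big5")));       // a different character
    }
}
```

Under GB2312 the bytes decode to the single character U+5F53; under ISO-8859-1 they become the two symbols U+00B5 and U+00B1.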
1.3, Unicode lets the whole world speak one language
After reading the above, don't you find it troublesome that the world has so many languages, each with its own encoding table? Even for Chinese alone there are two popular tables, GB2312 and BIG5, and converting text between them is a real nuisance. Worse, if you want to write an article that mixes text from many countries, the program handling the article must know which encoding standard each word uses. If you want to search for a word in the article, you must also say which encoding you are searching in. Otherwise, a search for the Chinese character "当" (0xB5, 0xB1) may well turn up unrelated Japanese or Polish text that happens to use the same numbers. What a hassle!
So people thought: why don't we all use one coding standard, in which every script has its place, so that text-processing programs only need to follow that single table? But a table that contains all the world's text has to be large. English letters plus digits need fewer than 128 entries, but adding Chinese suddenly brings in tens of thousands more, so each character needs more room. Unicode therefore specified that a character is represented by two 8-bit numbers (16 bits, that is, four hexadecimal digits): 16 x 16 x 16 x 16 = 65536, a huge number, enough (it was thought) to include every script in the world. Of course, some say that Chinese characters alone may exceed 60,000, before counting any other script, but the designers decided that the characters in common use are far fewer, and that was that. (Later versions of Unicode did grow beyond 65536 code points; Java stores such a character as a pair of char values.) Note that although GB2312 and Unicode both use two 8-bit numbers for a Chinese character, the assignments differ: 0xB5, 0xB1 in Unicode is not "当" but a character from another country's script (U+B5B1 is a Korean Hangul syllable).
1.4, How Java handles characters
The world keeps progressing, and Java is an example. Java has the String class, which is the best tool for solving character problems. A basic point in Java is: a String object does not need an encoding table to be specified! How does it know which characters a bunch of numbers stand for? Because the character data inside a String is always stored in Unicode. To represent a single character, Java has the char data type, whose size is fixed at two 8-bit numbers (16 bits), that is, a value from 0 to 65535, each corresponding to one entry in Unicode. To get the Unicode numbers out of a String, call getChars(int srcBegin, int srcEnd, char[] dst, int dstBegin) to fill a char[]; that char[] holds the numbers that represent the string's characters in the Unicode table.
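A small sketch of reading the Unicode numbers out of a String with getChars. The wrapper method codeUnits is a made-up name; the sample string contains "A" and the character U+5F53, written as an escape so the source file stays ASCII.

```java
class UnicodeNumbers {
    // Copies the 16-bit char values of s into a new char[] using String.getChars.
    static char[] codeUnits(String s) {
        char[] dst = new char[s.length()];
        s.getChars(0, s.length(), dst, 0);
        return dst;
    }

    public static void main(String[] args) {
        for (char c : codeUnits("A\u5F53")) {
            System.out.printf("U+%04X%n", (int) c); // prints U+0041, then U+5F53
        }
    }
}
```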
Unfortunately, most systems and programs do not process characters as Unicode, and Java programs constantly exchange data with other programs and systems. So whenever you receive or send characters, you must be aware of the relationship between the other side's encoding and Unicode. Suppose you receive the numbers 0xB5, 0xB1 from the network or a file. The Java program cannot know whether they are Chinese, Japanese, or English. If you do not specify the encoding table for these two numbers, Java falls back to the current system's default table. If the two numbers were sent from a Chinese Windows 98 machine and the Java program runs on an English Linux machine, you get the so-called garbled-text problem: Java decodes the two numbers with the English system's table, so the String obtained from new String(new byte[]{(byte) 0xB5, (byte) 0xB1}) is not the Chinese character "当" but two strange Western symbols. However, if you know the two numbers are Chinese, you can write new String(new byte[]{(byte) 0xB5, (byte) 0xB1}, "GB2312"), and the resulting String really is "当".
Conversely, to display a Java String containing "当" on Chinese Windows 98, the character must be output as the two 8-bit numbers 0xB5, 0xB1, whether it is written to a file or sent to a browser. How do you output "当" in GB2312? String.getBytes("GB2312") does it! So remember one thing: every exchange of information with the outside world is done through byte[]. Look at the Java I/O classes: most of them take byte[] as parameters and return values. However, some carelessly designed APIs offer no byte[]-based way to exchange information, which gives programmers on non-English platforms a real headache. The servlet API's HttpServletRequest.getParameter() is one such case.
Fortunately, JSP/servlet containers generally offer a way to specify the encoding table up front (for example, calling request.setCharacterEncoding(...) before reading parameters), so this problem is relatively easy to solve.
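The rule "decode with the sender's table on the way in, encode with the receiver's table on the way out" can be sketched like this. The transcode helper and its parameter names are illustrative, and the GB2312 charset is assumed to be available in the JDK.

```java
import java.nio.charset.Charset;

class BoundaryConversion {
    // Decode incoming bytes with the sender's table, then encode for the receiver.
    static byte[] transcode(byte[] input, String fromCharset, String toCharset) {
        String text = new String(input, Charset.forName(fromCharset)); // bytes -> Unicode String
        return text.getBytes(Charset.forName(toCharset));              // Unicode String -> bytes
    }

    public static void main(String[] args) {
        byte[] gb = {(byte) 0xB5, (byte) 0xB1};
        byte[] utf8 = transcode(gb, "GB2312", "UTF-8");
        for (byte b : utf8) {
            System.out.printf("0x%02X ", b & 0xFF); // prints 0xE5 0xBD 0x93
        }
        System.out.println();
    }
}
```

The String in the middle never carries an encoding of its own; only the byte[] at each boundary does.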
1.5, Some incorrect approaches found online for Java's Chinese-text problem
The first and most common one: whatever the content, build the string with new String(..., "ISO-8859-1"), and later output it using the default encoding (on servers this is usually an English system). The String you get this way does not hold the real characters in Unicode; the byte array has simply been copied, byte by byte, into the String's char[]. As soon as your environment changes, you are forced to modify a lot of code, and you cannot handle text from several different encodings in the same String.
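A sketch of why the ISO-8859-1 workaround does not produce real characters. The names smuggled and decoded are made up for this example, and the GB2312 charset is assumed to be available.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Latin1Hack {
    static final byte[] GB_BYTES = {(byte) 0xB5, (byte) 0xB1};

    // The workaround: every byte becomes one char with the same numeric value.
    static String smuggled() {
        return new String(GB_BYTES, StandardCharsets.ISO_8859_1);
    }

    // The proper decode: the two bytes become the one character they encode.
    static String decoded() {
        return new String(GB_BYTES, Charset.forName("GB2312"));
    }

    public static void main(String[] args) {
        System.out.println(smuggled().length());          // 2
        System.out.println(decoded().length());           // 1
        System.out.println(smuggled().equals(decoded())); // false
    }
}
```

The smuggled String has the wrong length and the wrong contents, so searching, comparing, or measuring text with it gives wrong answers.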
The other one: convert a string from one encoding, such as GB2312, into the bytes of another, such as UTF-8, and then, instead of specifying UTF-8, build the string directly with new String(...). The characters inside such a String are not deterministic; they come out differently on different systems. If you ask other people to exchange information with you using such a "UTF-8 format" String, you are in fact breaking the very rule that lets Java work with all languages. The root of this mistake is writing Java as if it were C: treating the string as raw memory that can hold any encoding, and ignoring that a Java String has exactly one internal encoding. If you really need free-form encoded data, use byte[] or char[] and the problem goes away.
1.6, Summary (my own summary, not reposted)
1.6.1, Understanding character encoding
Character encoding is the set of rules that turns binary data into the characters of the languages we use every day. For example, 0100 0001 (binary) represents the character "A" under ASCII rules, while under another rule set (for example 16-bit Unicode, where a character takes two bytes) this single byte on its own does not represent that character.
1.6.2, Causes of garbled text
1.6.2.1, The bytes are wrong (str.getBytes("GB2312"), str.getBytes("Unicode"), ...); str.getBytes() uses the system default encoding
Suppose you want the Unicode bytes of the Chinese character "我" but encode it with GB2312 rules instead. The bytes you get are not the ones you wanted, and when those wrong bytes are later decoded with Unicode rules the result is not "我" but garbled text.
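This mismatch can be reproduced with a short sketch. The roundTrip helper is a made-up name; the test character is U+6211, written as an escape, and UTF-16 stands in for "Unicode rules".

```java
import java.nio.charset.Charset;

class EncodeMismatch {
    // Encodes text with one table and decodes the resulting bytes with another.
    static String roundTrip(String text, String encodeWith, String decodeWith) {
        byte[] bytes = text.getBytes(Charset.forName(encodeWith));
        return new String(bytes, Charset.forName(decodeWith));
    }

    public static void main(String[] args) {
        String wo = "\u6211";
        System.out.println(roundTrip(wo, "GB2312", "GB2312").equals(wo)); // true
        System.out.println(roundTrip(wo, "GB2312", "UTF-16").equals(wo)); // false
    }
}
```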
1.6.2.2, The decoding is wrong (new String(bytes, "GB2312"), new String(bytes, "Unicode"), ...); new String(bytes) uses the system default encoding
If you write new String(new byte[]{(byte) 0xB5, (byte) 0xB1}) without specifying an encoding, Java decodes the bytes with the current system's default encoding. 0xB5, 0xB1 yields the correct character only under the encoding you had in mind; the default encoding is not necessarily that one, so the result may be garbled.
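To see which default a given machine applies, a sketch like the following may help. Charset.defaultCharset() is the standard way to query it; the class and method names here are illustrative, and the printed default varies by machine.

```java
import java.nio.charset.Charset;

class DefaultCharsetDemo {
    // No charset argument: the platform default decides the result.
    static String decodeWithDefault(byte[] bytes) {
        return new String(bytes);
    }

    // Explicit charset: the result is the same on every machine.
    static String decodeExplicit(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xB5, (byte) 0xB1};
        System.out.println("default charset: " + Charset.defaultCharset());
        boolean same = decodeWithDefault(bytes).equals(decodeExplicit(bytes, "GB2312"));
        System.out.println("default matches GB2312 here: " + same);
    }
}
```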
1.6.3, Some tests
1.6.3.1, A Java String is made up of chars, as the String method charAt(int index) shows
1.6.3.2, The String class stores Unicode. Each char is 16 bits; a char may hold a traditional ASCII character or a Chinese character, and either way it occupies two bytes in memory
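A quick check of these two statements, as a sketch. Character.SIZE is the standard constant for the width of a char in bits; the charsIn helper is a made-up name.

```java
class CharWidth {
    // Number of 16-bit char values the String holds.
    static int charsIn(String s) {
        return s.length();
    }

    public static void main(String[] args) {
        System.out.println(Character.SIZE);       // 16
        System.out.println(charsIn("A"));         // 1
        System.out.println(charsIn("\u5F53"));    // 1
        System.out.println("A\u5F53".charAt(1) == '\u5F53'); // true
    }
}
```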
1.6.3.3, The relationship between the String class and byte[]
String s = "霜之哀伤"; // "Sorrow of Frost"
byte[] array1 = s.getBytes("UTF-8");
byte[] array2 = s.getBytes("GBK");
byte[] array3 = s.getBytes("Unicode"); // Java's alias for UTF-16, written with a byte-order mark
byte[] array4 = s.getBytes(); // system default encoding (GBK on the author's machine)
// printBytes is a helper (not shown) that prints each byte as 0xNN
printBytes(array1);
printBytes(array2);
printBytes(array3);
printBytes(array4);
Program output:
0xE9 0x9C 0x9C 0xE4 0xB9 0x8B 0xE5 0x93 0x80 0xE4 0xBC 0xA4
0xCB 0xAA 0xD6 0xAE 0xB0 0xA7 0xC9 0xCB
0xFE 0xFF 0x97 0x1C 0x4E 0x4B 0x54 0xC0 0x4F 0x24
0xCB 0xAA 0xD6 0xAE 0xB0 0xA7 0xC9 0xCB
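For readers who want to rerun the test, here is a self-contained sketch. The original printBytes helper is not shown in the source, so toHex below is a hypothetical stand-in for it. The test string is written with Unicode escapes (U+971C U+4E4B U+54C0 U+4F24) so the source file stays ASCII, and the GBK charset is assumed to be available in the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class StringBytesDemo {
    static final String TEXT = "\u971C\u4E4B\u54C0\u4F24";

    // Hypothetical stand-in for printBytes: formats each byte as 0xNN.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(TEXT.getBytes(StandardCharsets.UTF_8)));
        System.out.println(toHex(TEXT.getBytes(Charset.forName("GBK"))));
        // Java's UTF-16 encoder writes the 0xFE 0xFF byte-order mark first.
        System.out.println(toHex(TEXT.getBytes(StandardCharsets.UTF_16)));
        // Platform default: this line varies from machine to machine.
        System.out.println(toHex(TEXT.getBytes()));
    }
}
```

The last line matches the GBK line only on a system whose default encoding is GBK, which is the point of the test.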
2, Hexadecimal (see https://my.oschina.net/xinxingegeya/blog/287476)
Hexadecimal (abbreviated HEX, or written with a subscript 16) is the base-16 positional numeral system in mathematics. It is usually written with the digits 0 to 9 and the letters A to F (where A~F stand for 10~15).
For example, the decimal number 57 is written 111001 in binary and 39 in hexadecimal.
To distinguish hexadecimal from decimal values, languages such as Java and C prefix hexadecimal numbers with 0x. For example, 0x20 is decimal 32, not decimal 20.
Each hexadecimal digit is represented by 4 bits.
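The numbers in this section can be checked with the standard Integer methods. The wrapper names below are made up for illustration.

```java
class HexBasics {
    static String toBinary(int n) {
        return Integer.toBinaryString(n);
    }

    static String toHex(int n) {
        return Integer.toHexString(n);
    }

    static int fromHex(String hex) {
        return Integer.parseInt(hex, 16);
    }

    public static void main(String[] args) {
        System.out.println(toBinary(57));  // 111001
        System.out.println(toHex(57));     // 39
        System.out.println(fromHex("39")); // 57
        System.out.println(0x20);          // 32
    }
}
```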
3, Converting between byte[] and hexadecimal strings in Java
In Java, a byte is 8 bits, and as noted above each hexadecimal digit needs 4 bits.
So each byte can be converted into two hexadecimal digits: the high 4 bits and the low 4 bits of the byte are converted to the hexadecimal characters H and L respectively, and joined to give the hexadecimal string for that byte:
String.valueOf(H) + String.valueOf(L)
The reverse conversion works on the same principle: every two hexadecimal characters are turned back into one byte.
Based on this, we can convert a byte[] array to a hexadecimal string, and of course convert a hexadecimal string back to a byte[] array.
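The high/low split described above can be sketched for a single byte as follows. This is an illustrative variant using shifts and a lookup table, not the author's code; all names are made up.

```java
class NibbleSplit {
    private static final char[] HEX_DIGITS = "0123456789ABCDEF".toCharArray();

    // Splits one byte into its high and low 4 bits and maps each to a hex character.
    static String byteToHex(byte b) {
        char h = HEX_DIGITS[(b >> 4) & 0x0F]; // high 4 bits
        char l = HEX_DIGITS[b & 0x0F];        // low 4 bits
        return String.valueOf(h) + String.valueOf(l);
    }

    // Combines two hex characters back into one byte.
    static byte hexToByte(char h, char l) {
        return (byte) ((Character.digit(h, 16) << 4) | Character.digit(l, 16));
    }

    public static void main(String[] args) {
        System.out.println(byteToHex((byte) 0xB5));             // B5
        System.out.println(hexToByte('B', '5') == (byte) 0xB5); // true
    }
}
```

The mask with 0x0F matters because a Java byte is signed: without it, shifting a negative byte would produce an index outside the table.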
/**
 * Convert byte[] to a hex string. Each byte is converted to an int, and
 * Integer.toHexString(int) then turns it into a hexadecimal string.
 * @param src byte[] data
 * @return hex string, or null if src is null or empty
 */
public static String bytesToHexString(byte[] src) {
    StringBuilder stringBuilder = new StringBuilder("");
    if (src == null || src.length <= 0) {
        return null;
    }
    for (int i = 0; i < src.length; i++) {
        int v = src[i] & 0xFF; // mask so the byte is treated as unsigned
        String hv = Integer.toHexString(v);
        if (hv.length() < 2) {
            stringBuilder.append(0); // pad single-digit values with a leading 0
        }
        stringBuilder.append(hv);
    }
    return stringBuilder.toString();
}
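/**
 * Convert a hexadecimal string to a byte array.
 * The input is upper-cased first, so lower-case digits (which
 * Integer.toHexString produces) are accepted as well.
 */
public static byte[] conver16HexToByte(String hex16Str) {
    char[] c = hex16Str.toUpperCase().toCharArray();
    byte[] b = new byte[c.length / 2];
    for (int i = 0; i < b.length; i++) {
        int pos = i * 2;
        // high digit goes into the upper 4 bits, low digit into the lower 4 bits
        b[i] = (byte) ("0123456789ABCDEF".indexOf(c[pos]) << 4
                | "0123456789ABCDEF".indexOf(c[pos + 1]));
    }
    return b;
}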
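The hex-to-byte conversion above does not check its input: an odd-length string loses its last digit, and a non-hex character silently produces a wrong byte. A stricter parser could look like the sketch below. It is an assumption about what validation might be wanted, not part of the original code, and parseHex is a made-up name.

```java
class StrictHexParser {
    // Parses a hex string into bytes, rejecting odd lengths and non-hex characters.
    static byte[] parseHex(String hex) {
        if (hex == null || hex.length() % 2 != 0) {
            throw new IllegalArgumentException("hex string must have an even number of digits");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int high = Character.digit(hex.charAt(2 * i), 16);
            int low = Character.digit(hex.charAt(2 * i + 1), 16);
            if (high < 0 || low < 0) {
                throw new IllegalArgumentException("not a hex digit near index " + (2 * i));
            }
            out[i] = (byte) ((high << 4) | low);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] bytes = parseHex("b5B1"); // mixed case is accepted
        System.out.printf("0x%02X 0x%02X%n", bytes[0] & 0xFF, bytes[1] & 0xFF); // 0xB5 0xB1
    }
}
```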