What is the difference between Unicode, UTF-8, and ISO-8859-1?
Take the two characters "中文" ("Chinese") as an example. Looking them up in the code tables gives the GB2312 encoding "d6d0 cec4", the Unicode code points "4e2d 6587", and the UTF-8 encoding "e4b8ad e69687". Note that these two characters have no ISO-8859-1 encoding, but their bytes can still be "represented" using ISO-8859-1.
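The values quoted above can be checked with a few lines of Java. This is a minimal sketch: the class name EncodingTable and the helper hex() are made up for illustration, and it assumes the JDK in use ships the GB2312 charset (standard desktop and server JDKs do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingTable {
    // Formats bytes as lowercase hex pairs separated by spaces (illustrative helper).
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Unicode escapes keep the example independent of the source file's encoding.
        String s = "\u4e2d\u6587"; // the two characters from the text
        System.out.println("GB2312:   " + hex(s.getBytes(Charset.forName("GB2312"))));   // d6 d0 ce c4
        System.out.println("UTF-16BE: " + hex(s.getBytes(StandardCharsets.UTF_16BE)));   // 4e 2d 65 87
        System.out.println("UTF-8:    " + hex(s.getBytes(StandardCharsets.UTF_8)));      // e4 b8 ad e6 96 87
    }
}
```

UTF-16BE is used for the "Unicode" row because it writes each code point as the two bytes shown in the text, without a byte order mark.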
2. Encoding basics
One of the earliest encodings is ISO-8859-1, which is similar to ASCII. To support more languages, a number of other standard encodings appeared; the important ones are described below.
2.1. ISO-8859-1
Usually called Latin-1, ISO-8859-1 is a single-byte encoding that can represent at most 256 characters (code values 0-255) and is used for English and other Western European languages. For example, the letter a is encoded as 0x61 = 97.
Clearly, the range of characters ISO-8859-1 can represent is narrow, and it cannot represent Chinese characters. However, because it is a single-byte encoding, which matches the computer's most basic unit of representation, data is still often carried as ISO-8859-1, and many protocols use it by default. For example, although the two characters "中文" have no ISO-8859-1 encoding, take their GB2312 encoding: it is the two double-byte values "d6d0 cec4". When handled as ISO-8859-1, they are split into 4 bytes: "d6 d0 ce c4" (in fact, storage is also processed byte by byte). With UTF-8 encoding, it is the 6 bytes "e4 b8 ad e6 96 87". Obviously, this kind of representation only makes sense on top of another encoding: the bytes must later be decoded with the encoding that produced them.
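The idea of ISO-8859-1 as a byte-for-byte carrier can be sketched as follows. The class name Latin1Carrier and the method carry() are hypothetical, and the sketch assumes the GBK charset is available in the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1Carrier {
    // Decodes raw bytes as ISO-8859-1: every byte becomes exactly one char (illustrative helper).
    static String carry(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] gbk = "\u4e2d\u6587".getBytes(Charset.forName("GBK")); // d6 d0 ce c4
        String carried = carry(gbk);
        System.out.println(carried.length()); // 4: one char per byte, none of them Chinese

        // Encoding with the same charset returns the original bytes untouched.
        byte[] back = carried.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(gbk, back)); // true
    }
}
```

The four chars are meaningless as text; they only become "中文" again once the recovered bytes are decoded as GBK.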
2.2. GB2312/GBK
This is the Chinese national standard (guobiao, "GB") encoding, designed specifically for Chinese characters. Chinese characters take two bytes each, while English letters are encoded in one byte exactly as in ISO-8859-1 (that is, the ASCII range is compatible). GBK can represent both traditional and simplified characters, whereas GB2312 can only represent simplified characters; GBK is backward compatible with GB2312.
2.3. Unicode
This is the most unified encoding and can represent the characters of all languages. The form described here is the fixed-length two-byte one (UCS-2; characters outside the two-byte range need four bytes), and it uses two bytes even for English letters. It is therefore not byte-compatible with ISO-8859-1 or with any other encoding. However, for the characters ISO-8859-1 covers, the Unicode value simply adds a 0 byte in front; for example, the letter a is "00 61".
Note that a fixed-length encoding is convenient for computers to process (GB2312/GBK is not fixed-length), and Unicode can represent all characters, so many programs use Unicode internally. Java is one example: its strings are sequences of 16-bit UTF-16 code units.
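A small sketch of what "Unicode inside Java" looks like in practice. The class name Utf16Units and the helper codeUnits() are illustrative only; the code point U+1F600 is just one convenient example of a character outside the two-byte range.

```java
public class Utf16Units {
    // Number of 16-bit char values Java needs to store one code point (illustrative helper).
    static int codeUnits(int codePoint) {
        return new String(Character.toChars(codePoint)).length();
    }

    public static void main(String[] args) {
        String s = "a\u4e2d";
        // Each char is a 16-bit value: the letter a is 0061, the Chinese character is 4e2d.
        System.out.printf("%04x %04x%n", (int) s.charAt(0), (int) s.charAt(1)); // 0061 4e2d

        // A character beyond the two-byte range occupies two chars (a surrogate pair).
        System.out.println(codeUnits(0x4e2d));  // 1
        System.out.println(codeUnits(0x1F600)); // 2
    }
}
```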
2.4. UTF
Unicode as described above is not byte-compatible with ISO-8859-1 and tends to take more space, because even English letters need two bytes, so it is inconvenient for transmission and storage. This led to the UTF encodings, of which UTF-8 is the most common. UTF-8 is compatible with the ASCII range that ISO-8859-1 shares (it is not compatible with ISO-8859-1 bytes 0x80-0xFF) and can also represent the characters of all languages. It is a variable-length encoding: the original design allowed 1 to 6 bytes per character, and the current standard (RFC 3629) limits this to 1 to 4 bytes. The byte layout of UTF-8 also provides a simple self-check. In general, English letters take one byte and Chinese characters take three bytes.
Note that UTF-8 saves space only relative to two-byte Unicode. If the text is known to be Chinese, GB2312/GBK is the most economical choice. On the other hand, even for a Chinese web page, UTF-8 often ends up smaller than two-byte Unicode although it uses 3 bytes per Chinese character, because the page also contains many English characters (markup, for instance).
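The size trade-off can be measured directly. In this sketch the class name EncodedSize, the helper size(), and the sample string "title=中文" are all made up for illustration; GBK availability in the JDK is assumed.

```java
import java.nio.charset.Charset;

public class EncodedSize {
    // Number of bytes the text occupies in the named charset (illustrative helper).
    static int size(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        String[] charsets = {"UTF-8", "UTF-16BE", "GBK"};
        // One English letter, one Chinese character, and a mixed string.
        String[] samples = {"a", "\u4e2d", "title=\u4e2d\u6587"};
        for (String sample : samples) {
            for (String cs : charsets) {
                System.out.println(sample + " in " + cs + ": " + size(sample, cs) + " bytes");
            }
        }
    }
}
```

For the mixed sample the sizes are 12 bytes in UTF-8, 16 in UTF-16BE, and 10 in GBK, which matches the argument above: GBK is smallest for Chinese-heavy text, and UTF-8 beats two-byte Unicode once enough ASCII is mixed in. UTF-16BE is used rather than UTF-16 so that no byte order mark is counted.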
3. Java's handling of characters
A Java application deals with several character encodings. In some places the encoding just needs to be configured correctly; in others the data needs some explicit conversion.
3.1. getBytes(charset)
This is a standard method of the Java String class. It encodes the characters of the string with the given charset and returns the result as bytes. Note that strings in Java memory are always stored as Unicode. For example, "中文" is normally (that is, when nothing has gone wrong) stored as "4e2d 6587". If charset is "GBK", it is encoded as "d6d0 cec4" and the bytes "d6 d0 ce c4" are returned. If charset is "UTF8", the result is "e4 b8 ad e6 96 87". If it is "ISO8859-1", the characters cannot be encoded, so "3f 3f" (two question marks) is returned.
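Because getBytes() silently substitutes question marks, it can help to test beforehand whether a string fits the target charset. The following is a sketch only; the class name UnencodableDemo and the helper fitsLatin1() are hypothetical, while CharsetEncoder.canEncode() is the standard JDK method.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class UnencodableDemo {
    // True if every character of s has an ISO-8859-1 encoding (illustrative helper).
    static boolean fitsLatin1(String s) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        String s = "\u4e2d\u6587";
        // Unencodable characters are replaced by '?' (0x3f = 63) without any error.
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.toString(latin1)); // [63, 63]

        System.out.println(fitsLatin1(s));           // false
        System.out.println(fitsLatin1("caf\u00e9")); // true: e-acute is within Latin-1
    }
}
```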
3.2. new String(bytes, charset)
This is another standard String operation and the inverse of the previous one: it interprets the byte array using the given charset and converts it to Unicode for storage. Using the results of the getBytes example above, "GBK" and "UTF8" both yield the correct result "4e2d 6587", but the ISO8859-1 bytes end up as "003f 003f" (two question marks).
Because UTF-8 can represent (encode) all characters, new String(str.getBytes("UTF8"), "UTF8").equals(str) holds; the conversion is completely reversible.
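The decoding direction can be sketched the same way, starting from the byte values given in the text. The class name DecodeDemo and the helper bytes() are illustrative, and GBK availability is assumed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // Builds a byte array from int literals so hex values read naturally (illustrative helper).
    static byte[] bytes(int... values) {
        byte[] out = new byte[values.length];
        for (int i = 0; i < values.length; i++) out[i] = (byte) values[i];
        return out;
    }

    public static void main(String[] args) {
        String expected = "\u4e2d\u6587";
        String fromGbk = new String(bytes(0xd6, 0xd0, 0xce, 0xc4), Charset.forName("GBK"));
        String fromUtf8 = new String(bytes(0xe4, 0xb8, 0xad, 0xe6, 0x96, 0x87), StandardCharsets.UTF_8);
        String fromLatin1 = new String(bytes(0x3f, 0x3f), StandardCharsets.ISO_8859_1);

        System.out.println(fromGbk.equals(expected));  // true
        System.out.println(fromUtf8.equals(expected)); // true
        System.out.println(fromLatin1);                // ??  (the original characters are gone)
    }
}
```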
3.3. setCharacterEncoding()
This method sets the encoding of the HTTP request or response.
For a request, it specifies the encoding of the submitted content, so that getParameter() returns the correct string directly. If no encoding is specified, ISO8859-1 is used by default and the value needs further processing; see "Form input" below.
Note that getParameter() must not be called before setCharacterEncoding(). The Javadoc says: "This method must be called prior to reading request parameters or reading input using getReader()." Furthermore, the setting takes effect only for the POST method, not for GET. The reason is that the first call to getParameter() makes the container parse all submitted content using the encoding in effect at that moment; later calls do not parse again, so a subsequent setCharacterEncoding() has no effect. For a form submitted with GET, the content is carried in the URL and is parsed with the container's own encoding from the start, so setCharacterEncoding() naturally has no effect.
4. ISO-8859-1 is the default character set used for network transport in Java web applications, and GB2312 is the standard Chinese character set. When data travels over the network, for example when a form is submitted, the ISO-8859-1 string has to be converted to GB2312 before it is displayed. Otherwise the browser interprets ISO-8859-1 data as GB2312, and because the two encodings are incompatible the text comes out garbled.
Finally, consider an example of encoding:
import java.util.Arrays;

public class EncodingExample {
    public static void main(String[] args) throws Exception {
        String s = "你好";
        // Encoding
        byte[] utf = s.getBytes("UTF-8");
        byte[] gbk = s.getBytes("GBK");
        System.out.println("UTF-8 encoding: " + Arrays.toString(utf)); // [-28, -67, -96, -27, -91, -67], 6 bytes
        System.out.println("GBK encoding: " + Arrays.toString(gbk));   // [-60, -29, -70, -61], 4 bytes
        // Decoding
        String s1 = new String(utf, "UTF-8"); // 你好
        String s2 = new String(utf, "GBK");   // 浣犲ソ: GBK decodes 2 bytes per character, so there is one extra character
        String s3 = new String(gbk, "UTF-8"); // replacement characters only: the 4 GBK bytes are not valid UTF-8, which needs 6 bytes here
        System.out.println("--------------------");
        System.out.println("UTF-8 decoding: " + s1);
        System.out.println("GBK decoding: " + s2);
        System.out.println("GBK bytes decoded with UTF-8: " + s3);
        System.out.println("---------------------");
        System.out.println("Encode back with UTF-8");
        // Still garbled (the familiar "锟斤拷" pattern): GBK bytes cannot be recovered after decoding them as UTF-8
        s3 = new String(s3.getBytes("UTF-8"), "GBK");
        System.out.println(s3);
    }
}
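Rules:
UTF-8 bytes decoded as ISO8859-1 can always be encoded back to the original bytes; decoding them as GBK and encoding back works in this example, but only because every byte pair happens to be valid GBK
GBK bytes can be recovered only after decoding with ISO8859-1; once decoded as UTF-8, the original bytes are lost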
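The rules can be checked mechanically by decoding bytes with the "wrong" charset, encoding again with that same charset, and comparing with the original bytes. This is a sketch under assumptions: the class name RoundTripRule and the helper survives() are made up, and GBK must be available in the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTripRule {
    // True if decoding with the wrong charset and encoding back returns the same bytes.
    static boolean survives(byte[] original, Charset wrong) {
        String misread = new String(original, wrong);
        return Arrays.equals(original, misread.getBytes(wrong));
    }

    public static void main(String[] args) {
        Charset gbkCharset = Charset.forName("GBK");
        String s = "\u4f60\u597d"; // the two-character string used in the example above
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        byte[] gbk = s.getBytes(gbkCharset);

        System.out.println(survives(utf8, StandardCharsets.ISO_8859_1)); // true
        System.out.println(survives(gbk, StandardCharsets.ISO_8859_1));  // true
        System.out.println(survives(gbk, StandardCharsets.UTF_8));       // false

        // UTF-8 bytes through GBK: depends on whether the bytes happen to form valid GBK pairs.
        System.out.println(survives(utf8, gbkCharset));
        // A single 3-byte character leaves one unpaired byte, so this is expected to fail.
        System.out.println(survives("\u4f60".getBytes(StandardCharsets.UTF_8), gbkCharset));
    }
}
```

Only ISO-8859-1 is safe as an intermediate step in general, because it maps every possible byte value to a character and back.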
There are two workarounds when form values read in a JSP page come out garbled:
One is to set the character encoding with request.setCharacterEncoding() before calling getParameter(); the other is to re-encode and decode the value with new String(str.getBytes("iso8859-1"), "UTF-8"). Both methods give the correct result.
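The second workaround can be tried without a servlet container by simulating the faulty decoding. In this sketch the class name FormRepair and both methods are hypothetical; misdecode() only stands in for a container that reads UTF-8 form bytes as ISO-8859-1, and it assumes the browser really submitted UTF-8.

```java
import java.nio.charset.StandardCharsets;

public class FormRepair {
    // Stand-in for a container that decodes UTF-8 form bytes as ISO-8859-1 (assumption for the demo).
    static String misdecode(String submitted) {
        return new String(submitted.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // The workaround from the text: recover the raw bytes, then decode them correctly.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String garbled = misdecode("\u4e2d\u6587");
        System.out.println(garbled.length());                         // 6: one char per UTF-8 byte
        System.out.println(repair(garbled).equals("\u4e2d\u6587"));   // true
    }
}
```

The repair works because ISO-8859-1 keeps every byte intact; if the form had been submitted as GBK, the second charset in repair() would have to be GBK instead.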