Java Character Set

Last Update:2017-08-12 Source: Internet

Author: User

1. The number of bytes a single character occupies once it is encoded depends on the encoding used, and the default encoding is determined by the platform.

2. For a single Chinese character: ISO-8859-1 is a single-byte encoding, GBK uses two bytes, and UTF-8 uses three bytes. So on a Chinese platform (default character set GBK) the character "中" occupies 2 bytes, while on an English platform (default character set Cp1252, similar to ISO-8859-1) it is encoded as a single byte.

3. getBytes() and getBytes(encoding) encode a string into a byte array using the platform default character set or the specified character set, respectively.

The encoding determines the length of the byte array. On a Chinese platform the default character set is GBK, so with getBytes() or getBytes("GBK") each Chinese character is encoded as 2 bytes under the GBK rules. The GBK encoding of "中文" is therefore: -42 -48 -50 -60. The bytes -42 and -48 represent "中", while -50 and -60 represent "文".

On a Chinese platform, if the specified character set is UTF-8, each Chinese character is encoded as 3 bytes, so the two characters of "中文" are encoded as two groups: -28 -72 -83 and -26 -106 -121. Each group of 3 bytes represents one Chinese character.

On a Chinese platform, if the specified character set is ISO-8859-1, which is a single-byte encoding, getBytes("ISO-8859-1") produces only one byte per character. ISO-8859-1 has no code for Chinese characters, so the encoder substitutes byte 63, that is "?", for each of them, and the original character is lost.
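The three cases above can be checked on any platform by naming the character set explicitly instead of relying on the default. The following is a minimal sketch; the class and method names are illustrative only, and it assumes the running JDK ships the GBK charset.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class EncodeDemo {
    // The two characters U+4E2D U+6587, written as escapes so the
    // encoding of the source file itself does not matter.
    static final String TEXT = "\u4e2d\u6587";

    // Encode TEXT with the named charset and return the signed byte values as text.
    static String encode(String charsetName) {
        byte[] bytes = TEXT.getBytes(Charset.forName(charsetName));
        return Arrays.toString(bytes);
    }

    public static void main(String[] args) {
        System.out.println("GBK        " + encode("GBK"));        // [-42, -48, -50, -60]
        System.out.println("UTF-8      " + encode("UTF-8"));      // [-28, -72, -83, -26, -106, -121]
        System.out.println("ISO-8859-1 " + encode("ISO-8859-1")); // [63, 63]
    }
}
```

Because the charset is passed explicitly, the output does not depend on whether the program runs on a Chinese or an English platform.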

On an English platform the default character set is Cp1252 (similar to ISO-8859-1). Encoding with GBK or UTF-8 still gives a correct byte array (4 bytes for GBK, 6 bytes for UTF-8), because strings are stored as Unicode inside the JVM and getBytes(encoding) converts directly from Unicode to the specified encoding. For ISO-8859-1, however, the result is wrong: a Chinese character cannot be represented in a single byte, so each one is replaced by "?".

On a Chinese platform the default character set is GBK, so what does content.getBytes() return? The following 4 bytes:

byte[0] = -42 hex string = ffffffd6

byte[1] = -48 hex string = ffffffd0

byte[2] = -50 hex string = ffffffce

byte[3] = -60 hex string = ffffffc4
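The eight-digit hex strings in this listing are what Java prints when a negative byte is widened to an int before formatting. The sketch below assumes the listing was produced with Integer.toHexString (the article does not say) and shows how masking with 0xFF gives the plain two-digit byte value; the class and method names are illustrative.

```java
class HexDemo {
    // Sign-extended form: the byte is widened to int before formatting.
    static String signExtendedHex(byte b) {
        return Integer.toHexString(b);
    }

    // Mask to the low 8 bits to get the plain two-digit byte value.
    static String byteHex(byte b) {
        return String.format("%02x", b & 0xFF);
    }

    public static void main(String[] args) {
        byte[] gbk = {-42, -48, -50, -60};
        for (byte b : gbk) {
            System.out.println(b + " -> " + signExtendedHex(b) + " -> " + byteHex(b));
        }
    }
}
```

For -42 this prints ffffffd6 and d6: the same byte, with and without sign extension.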

If these bytes are decoded as GBK, every 2 bytes form one character, so the result is:

char[0] = '中' ---- byte[0] + byte[1]

char[1] = '文' ---- byte[2] + byte[3]

If the bytes are decoded as ISO-8859-1, every byte is treated as one character, so bytes that belong together in pairs are parsed one at a time and each byte carries only half of an original character. ISO-8859-1 assigns a character to every byte value, so decoding does not fail, but the result is four unrelated Latin-1 letters, which a console that cannot display them typically shows as "?":

char[0] = 'Ö' ---- byte[0]

char[1] = 'Ð' ---- byte[1]

char[2] = 'Î' ---- byte[2]

char[3] = 'Ä' ---- byte[3]

If the bytes are decoded as UTF-8, a Chinese character is expected to occupy 3 bytes in a specific pattern. These 4 bytes are not valid UTF-8 data, so each byte is replaced by the replacement character (U+FFFD), which is usually displayed as "?":

char[0] = '?' ---- byte[0]

char[1] = '?' ---- byte[1]

char[2] = '?' ---- byte[2]

char[3] = '?' ---- byte[3]
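The decoding experiment can be reproduced with the sketch below. It prints code points rather than the characters themselves, because what a console shows for garbled text depends on the console. The class and method names are illustrative, and the JDK is assumed to include GBK.

```java
import java.nio.charset.Charset;

class DecodeDemo {
    // GBK bytes of the two Chinese characters (0xD6 0xD0 0xCE 0xC4), as listed in the text.
    static final byte[] GBK_BYTES = {-42, -48, -50, -60};

    // Decode the same four bytes with the named charset.
    static String decode(String charsetName) {
        return new String(GBK_BYTES, Charset.forName(charsetName));
    }

    // Render each char as a code point such as U+4E2D, readable on any console.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("U+%04X ", (int) c));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"GBK", "ISO-8859-1", "UTF-8"}) {
            String decoded = decode(name);
            System.out.println(name + ": " + decoded.length() + " chars, " + codePoints(decoded));
        }
    }
}
```

Only the GBK decoding returns the original two characters. ISO-8859-1 yields four Latin-1 characters (U+00D6, U+00D0, U+00CE, U+00C4), and UTF-8 yields four U+FFFD replacement characters, both of which many consoles render as "?".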

On an English platform the default encoding is Cp1252, so content.getBytes() has already replaced each Chinese character with the single byte for "?". Whether those bytes are then decoded as GBK, UTF-8 or ISO-8859-1, the result is the same: "??".
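The English-platform case can be reproduced on any machine by naming the charset explicitly. This is a minimal sketch, assuming windows-1252 is an acceptable stand-in for the Cp1252 platform default; the class and method names are illustrative.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class LossyEncodeDemo {
    static final String TEXT = "\u4e2d\u6587"; // the two Chinese characters

    // Stand-in for content.getBytes() on a platform whose default charset is Cp1252.
    static byte[] encodeAsCp1252() {
        return TEXT.getBytes(Charset.forName("windows-1252"));
    }

    // Decode the given bytes with the named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] bytes = encodeAsCp1252();
        System.out.println(Arrays.toString(bytes)); // [63, 63]
        for (String name : new String[] {"GBK", "UTF-8", "ISO-8859-1"}) {
            System.out.println(name + ": " + decode(bytes, name)); // ?? each time
        }
    }
}
```

Once the encoder has substituted "?", the original characters are gone from the byte array, so no choice of decoder can bring them back.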

Remember: this again shows the danger of String's getBytes() method. If we re-encode a string with new String(str.getBytes(), encoding), we must know exactly how long the byte array returned by str.getBytes() is and what it contains, because Java does not adapt the byte array to the new encoding; it simply parses the existing bytes according to that encoding. So, as in the example above, the same 4 original bytes are parsed in groups of 2, one at a time, or in groups of 3, depending on the encoding, and only the decoding that matches the original encoding gives the right result.

Conclusion: on the same platform, the same Chinese characters yield completely different byte arrays under different encodings. These byte arrays may be correct (if the character set supports Chinese) or completely wrong (if it does not).
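Whether a character set supports a given piece of text can be tested before encoding. The sketch below uses CharsetEncoder.canEncode from the standard library; the supports helper and class name are illustrative, and the JDK is assumed to include GBK.

```java
import java.nio.charset.Charset;

class CanEncodeDemo {
    // True if every character of text has a representation in the named charset.
    static boolean supports(String charsetName, String text) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // the two Chinese characters
        for (String name : new String[] {"GBK", "UTF-8", "ISO-8859-1"}) {
            System.out.println(name + " can encode: " + supports(name, text));
        }
    }
}
```

A check like this makes the failure explicit, instead of letting getBytes silently substitute "?".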

Remember: do not use String's getBytes(encoding) casually, and try to avoid the no-argument getBytes() altogether. The latter is platform-dependent, so the result is unpredictable when the platform is not known in advance. If encoding to bytes is unavoidable, make sure the encoding you specify is the one in which the string's input was encoded.

———————————————————————————————————————————————————

Unlike getBytes(encoding), toCharArray() returns the "natural" characters of the string. However, the number and content of these characters are determined by the encoding with which the string was originally decoded.
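As a small illustration, the sketch below decodes the same six UTF-8 bytes in two ways and counts the characters toCharArray() returns. The class and method names are illustrative.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharArrayDemo {
    // UTF-8 bytes of the two Chinese characters (6 bytes in total).
    static final byte[] UTF8_BYTES = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);

    // Number of chars obtained when the same bytes are decoded with the named charset.
    static int charCount(String charsetName) {
        return new String(UTF8_BYTES, Charset.forName(charsetName)).toCharArray().length;
    }

    public static void main(String[] args) {
        System.out.println("UTF-8:      " + charCount("UTF-8"));      // 2
        System.out.println("ISO-8859-1: " + charCount("ISO-8859-1")); // 6
    }
}
```

toCharArray() itself performs no encoding work; it only exposes whatever characters the earlier decoding step produced.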

FileWriter is a character output stream, and OutputStreamWriter is a bridge from a character stream to a byte stream. On a Chinese platform, if you use FileWriter, no character set setting takes effect, because it always uses the default system character set. In this test, neither calling System.setProperty("file.encoding", "ISO-8859-1") nor passing the parameter -Dfile.encoding=UTF-8 at run time made a difference: the file was still saved as "GB2312" or "GBK".

On a Chinese platform, if you use OutputStreamWriter, the characters are converted to bytes when they are written, and the specified character set takes effect. With GBK or UTF-8, Chinese text is saved and read back correctly, and the file is stored in the encoding we specified. With ISO-8859-1 the text becomes "?", which again shows that ISO-8859-1 cannot store Chinese: the Chinese characters have no corresponding code in ISO-8859-1 and are converted to "?" by default.
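The effect of the charset passed to OutputStreamWriter can be observed without touching the file system. The sketch below writes into an in-memory ByteArrayOutputStream, used here only as a stand-in for a file; the class and method names are illustrative, and the JDK is assumed to include GBK.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.util.Arrays;

class WriterCharsetDemo {
    // Write the two Chinese characters through an OutputStreamWriter
    // and return the bytes that reach the underlying stream.
    static byte[] write(String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(sink, Charset.forName(charsetName))) {
            writer.write("\u4e2d\u6587");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"GBK", "UTF-8", "ISO-8859-1"}) {
            System.out.println(name + ": " + Arrays.toString(write(name)));
        }
    }
}
```

GBK produces 4 bytes and UTF-8 produces 6, while ISO-8859-1 produces two bytes of value 63, the "?" substitution described above.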

On an English platform, FileWriter likewise ignores any character set setting. All files are saved in ISO-8859-1 encoding, and the Chinese text inevitably becomes "?". With OutputStreamWriter, Chinese is saved and displayed correctly only if the character and file encoding are set to GBK or UTF-8.

The experiments above show how to make sure that Chinese client input is parsed, saved and read correctly on different platforms: the best approach is OutputStreamWriter with UTF-8 encoding. If you do not want to use UTF-8, consider GB2312 rather than GBK or GB18030, because some old text editors do not support GBK or GB18030 at all but do support GB2312, the oldest of the three encodings.
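The recommended combination can be sketched as a write-then-read round trip through a temporary file. The file name prefix, class name and helper are illustrative choices, not part of the original experiment.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8FileDemo {
    // Write text to a temporary file as UTF-8, read the first line back as UTF-8.
    static String roundTrip(String text) {
        try {
            Path file = Files.createTempFile("charset-demo", ".txt");
            try (Writer out = new OutputStreamWriter(
                    new FileOutputStream(file.toFile()), StandardCharsets.UTF_8)) {
                out.write(text);
            }
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    new FileInputStream(file.toFile()), StandardCharsets.UTF_8))) {
                return in.readLine();
            } finally {
                Files.deleteIfExists(file); // runs after the reader has been closed
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // the two Chinese characters
        System.out.println(text.equals(roundTrip(text))); // true
    }
}
```

Both the writer and the reader name UTF-8 explicitly, so the result is the same regardless of the platform default.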

Three String methods deserve attention: getBytes(), getBytes(encoding) and new String(bytes, encoding).

A. getBytes(): converts the string to byte[] using the platform default encoding (obtained from the file.encoding property). The result is the string's raw byte values in that encoding.

B. getBytes(name_of_charset): converts the string to byte[] using the specified encoding. To get the correct byte array, the programmer must supply the correct name_of_charset; otherwise the result will be wrong.

C. new String(bytes, encoding): if the client submits a request from a UTF-8 encoded JSP page, the UTF-8 bytes produced by the browser arrive on the server side decoded as ISO-8859-1. To recover the raw bytes transmitted over HTTP, we call getBytes("ISO-8859-1"). Because the client originally encoded them as UTF-8, leaving them decoded as ISO-8859-1 gives not one Chinese character but 3 garbled characters. So we then call new String(bytes, "UTF-8") to decode the byte array as UTF-8, 3 bytes per character, and restore the client's original text.

D. String's getBytes() and getBytes(name_of_charset) need care. The principle is: whatever encoding was used for transmission is the encoding we must use to get the bytes back. new String(bytes, name_of_charset) needs even more care: name_of_charset must be the encoding the client used. For example, if the JSP page is GBK, then on receiving a parameter from the page we must use new String(parameter.getBytes("ISO-8859-1"), "GBK"). With the wrong decoding, such as UTF-8, the result will most likely be garbled. In other words, the conversions GBK ---> ISO-8859-1 ---> GBK and UTF-8 ---> ISO-8859-1 ---> UTF-8 are not a problem. However, converting bytes directly along GBK ---> ISO-8859-1 ---> UTF-8 or UTF-8 ---> ISO-8859-1 ---> GBK may produce garbled text and requires an additional conversion step.
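The round trips described in C and D can be simulated without a servlet container. In the sketch below, asSeenByServer is a hypothetical stand-in for what the server does to an incoming parameter (decoding the wire bytes as ISO-8859-1), and recover applies the getBytes("ISO-8859-1") plus new String(bytes, charset) idiom. All names are illustrative, and the JDK is assumed to include GBK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ParameterRecoveryDemo {
    // Simulate the server side: bytes sent by the browser are decoded as ISO-8859-1.
    static String asSeenByServer(String original, Charset pageCharset) {
        byte[] wire = original.getBytes(pageCharset);
        return new String(wire, StandardCharsets.ISO_8859_1);
    }

    // Undo the ISO-8859-1 decoding, then decode with the charset we believe the page used.
    static String recover(String parameter, Charset assumedPageCharset) {
        byte[] wire = parameter.getBytes(StandardCharsets.ISO_8859_1);
        return new String(wire, assumedPageCharset);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // the two Chinese characters
        Charset gbk = Charset.forName("GBK");
        String fromUtf8Page = asSeenByServer(original, StandardCharsets.UTF_8);
        String fromGbkPage = asSeenByServer(original, gbk);

        System.out.println(original.equals(recover(fromUtf8Page, StandardCharsets.UTF_8))); // true
        System.out.println(original.equals(recover(fromGbkPage, gbk)));                     // true
        System.out.println(original.equals(recover(fromGbkPage, StandardCharsets.UTF_8)));  // false
    }
}
```

The trick works because ISO-8859-1 maps every byte value to exactly one character and back, so the original wire bytes survive the intermediate string unchanged.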

Remember: use getBytes(name_of_charset) and new String(bytes, name_of_charset) sparingly unless you know exactly the original character encoding and the encoding used by the transport protocol. It is better to rely on server configuration or a filter that sets the request/response characterEncoding and the Content-Type, together with the pageEncoding attribute of the JSP page and the content type in the HTML meta element. Try to avoid frequent string transcoding in your code, which reduces efficiency and increases risk.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
