A study of Java and related character set encoding problems

Last Update:2017-01-19 Source: Internet

Author: User

This article discusses character set encoding issues in Java, using the two-character word 中文 ("Chinese") as the running example. According to the reference tables, its GB2312 encoding is "d6d0 cec4", its Unicode encoding is "4e2d 6587", and its UTF-8 encoding is "e4b8ad e69687". (Note that the word has no ISO8859-1 encoding, although its bytes can still be "represented" using ISO8859-1.)

First, encoding basics:

One of the earliest encodings was ISO8859-1, an extension of ASCII. To make it easier to represent a variety of languages, a number of standard encodings gradually emerged. The important ones are the following:

1. ISO8859-1

This is a single-byte encoding that covers at most the character range 0-255 and is used for English and other Western European languages. For example, the encoding of the letter a is 0x61 = 97.

Clearly, ISO8859-1 covers a narrow range of characters and cannot represent Chinese characters. However, because it is a single-byte encoding, and the byte is the computer's most basic unit of representation, ISO8859-1 is still often used to carry data, and many protocols use it by default. For example, although the two characters of "Chinese" have no ISO8859-1 encoding, take their GB2312 encoding "d6d0 cec4": the two characters can be split into 4 bytes, "d6 d0 ce c4", and each byte treated as one ISO8859-1 character (in fact, storage and processing also work byte by byte). With UTF-8 the result is 6 bytes, "e4 b8 ad e6 96 87". Clearly, this kind of representation only makes sense on top of another encoding.
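The "byte carrier" role described above can be sketched as follows. This is a minimal illustration, not code from the original article: the class name `Latin1CarrierDemo` and its helper methods are made up for the example, and the word is written with Unicode escapes so the source stays pure ASCII.

```java
import java.nio.charset.StandardCharsets;

final class Latin1CarrierDemo {
    // Formats bytes as lowercase hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Treats each byte as one ISO8859-1 character, then encodes the
    // resulting string back into bytes.
    static byte[] roundTripThroughLatin1(byte[] original) {
        String carrier = new String(original, StandardCharsets.ISO_8859_1);
        return carrier.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is the two-character example word.
        byte[] utf8 = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        String carrier = new String(utf8, StandardCharsets.ISO_8859_1);

        System.out.println(hex(utf8));                          // e4 b8 ad e6 96 87
        System.out.println(carrier.length());                   // 6
        System.out.println(hex(roundTripThroughLatin1(utf8)));  // e4 b8 ad e6 96 87
    }
}
```

The carrier string is meaningless as text (six unrelated Latin-1 characters), but no byte is lost, which is why the original bytes can be recovered and then decoded with the real encoding.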

2. GB2312/GBK

This is the Chinese national standard (GB) encoding, designed specifically for Chinese characters. Chinese characters take two bytes, while English letters are encoded the same way as in ISO8859-1 (strictly speaking, the compatibility covers the ASCII range, not all of ISO8859-1). GBK can represent both traditional and simplified characters, while GB2312 can only represent simplified characters; GBK is backward compatible with GB2312.
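A short sketch of these properties, assuming the JDK in use ships the extended "GBK" and "GB2312" charsets (standard JDK distributions do, but a trimmed runtime image may not). The class and method names are hypothetical, and U+500B is simply one traditional character chosen for illustration.

```java
import java.nio.charset.Charset;

final class GbkWidthDemo {
    // Number of bytes the text occupies in the named charset.
    static int encodedLength(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    // Whether the named charset has an encoding for every character in text.
    static boolean canEncode(String text, String charsetName) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(encodedLength("a", "GBK"));       // 1
        System.out.println(encodedLength("\u4e2d", "GBK"));  // 2

        // U+500B is a traditional character: covered by GBK, not by GB2312.
        System.out.println(canEncode("\u500b", "GBK"));      // true
        System.out.println(canEncode("\u500b", "GB2312"));   // false
    }
}
```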

3. Unicode

This is the most unified encoding and can represent characters in all languages. In the form discussed here it is a fixed-length double-byte encoding (a four-byte form also exists) that applies to every character, English letters included. It is therefore incompatible with ISO8859-1 and with every other encoding. Compared with ISO8859-1, however, Unicode simply adds a 0 byte in front: the letter a is "00 61".

Note that fixed-length encodings are convenient for computers to process (GB2312/GBK is not fixed-length), and Unicode can represent all characters, so much software, Java included, uses Unicode internally. (Java's internal form is UTF-16, in which characters outside the Basic Multilingual Plane take two 16-bit units.)
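The "00 61" layout can be checked with a small sketch. It assumes that the two-byte form meant here corresponds to big-endian UTF-16 without a byte order mark, which is what `StandardCharsets.UTF_16BE` produces; the class name is invented for the example.

```java
import java.nio.charset.StandardCharsets;

final class FixedWidthUnicodeDemo {
    // Formats bytes as lowercase hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Encodes text as big-endian 16-bit units without a byte order mark.
    static String utf16Hex(String text) {
        return hex(text.getBytes(StandardCharsets.UTF_16BE));
    }

    public static void main(String[] args) {
        System.out.println(utf16Hex("a"));             // 00 61
        System.out.println(utf16Hex("\u4e2d\u6587"));  // 4e 2d 65 87
    }
}
```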

4. UTF-8

Unicode is incompatible with ISO8859-1 and tends to take up more space, because even English letters need two bytes each. This makes Unicode inconvenient for transmission and storage, and UTF-8 was created as a result. UTF-8 is compatible with the ASCII range of ISO8859-1 and can represent characters in all languages. However, it is a variable-length encoding: each character takes from 1 to 4 bytes under the current standard (the original design allowed up to 6). In addition, the byte structure of UTF-8 provides a simple self-check, because lead bytes and continuation bytes have distinct bit patterns and invalid sequences can be detected. Generally speaking, English letters take one byte, while Chinese characters take three bytes.

Note that UTF-8 saves space only relative to Unicode; if the text is known to be Chinese, GB2312/GBK is undoubtedly the most economical. On the other hand, although UTF-8 uses 3 bytes per Chinese character, it is still more economical than Unicode even for Chinese-language web pages, because a web page contains a lot of English characters.
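The size trade-off can be made concrete with a tiny sketch. The "page" below is a hypothetical fragment, two Chinese characters wrapped in an HTML tag, and the class name is made up; the two-byte Unicode form is again assumed to be UTF-16 without a byte order mark.

```java
import java.nio.charset.StandardCharsets;

final class EncodingSizeDemo {
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Length(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("a"));        // 1
        System.out.println(utf8Length("\u4e2d"));   // 3

        // 7 markup characters plus 2 Chinese characters.
        String page = "<p>\u4e2d\u6587</p>";
        System.out.println(utf8Length(page));       // 13 (7 * 1 + 2 * 3)
        System.out.println(utf16Length(page));      // 18 (9 * 2)
    }
}
```

Even in this very short fragment the markup outweighs the extra byte per Chinese character, which is the effect the paragraph above describes for whole pages.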

Second, how Java handles characters

Many places in a Java application involve character set encoding. Some only need to be configured correctly, while others require some extra processing.

1. getBytes(charset)

This is a standard method of Java's String class. It encodes the characters of the string according to charset and returns the result as bytes. Note that Java always stores strings in memory in Unicode. For example, "Chinese" is normally (that is, when nothing has gone wrong) stored as "4e2d 6587". If charset is "GBK", it is encoded as "d6d0 cec4" and the bytes "d6 d0 ce c4" are returned. If charset is "UTF8", the result is "e4 b8 ad e6 96 87". If it is "ISO8859-1", the characters cannot be encoded, so the method returns "3f 3f" (two question marks).
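The three cases above can be reproduced with a short sketch. It assumes the "GBK" charset is available in the running JDK; the class name and the `hex` helper are only for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class GetBytesDemo {
    // Formats bytes as lowercase hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    static String encodedHex(String text, Charset charset) {
        return hex(text.getBytes(charset));
    }

    public static void main(String[] args) {
        String word = "\u4e2d\u6587"; // stored in memory as 4e2d 6587

        System.out.println(encodedHex(word, Charset.forName("GBK")));      // d6 d0 ce c4
        System.out.println(encodedHex(word, StandardCharsets.UTF_8));      // e4 b8 ad e6 96 87
        // Unmappable characters are replaced with '?' (0x3f).
        System.out.println(encodedHex(word, StandardCharsets.ISO_8859_1)); // 3f 3f
    }
}
```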

2. new String(bytes, charset)

This is another standard operation of Java's String class. In contrast to the previous method, it interprets a byte array according to charset and converts it to Unicode for storage. Referring to the getBytes example above, the "GBK" and "UTF8" bytes both produce the correct result "4e2d 6587", but the ISO8859-1 bytes "3f 3f" become "003f 003f" (two question marks).

Because UTF-8 can represent/encode all characters, new String(str.getBytes("UTF8"), "UTF8").equals(str) holds; that is, the conversion is completely reversible.
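A minimal sketch of the decoding direction, again assuming "GBK" is available in the running JDK. The class and method names are hypothetical. It also shows why the ISO8859-1 path is lossy while the UTF-8 path is reversible.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class NewStringDemo {
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Encodes to UTF-8 and decodes again, then compares by content.
    static boolean utf8RoundTrips(String text) {
        byte[] encoded = text.getBytes(StandardCharsets.UTF_8);
        return new String(encoded, StandardCharsets.UTF_8).equals(text);
    }

    public static void main(String[] args) {
        String word = "\u4e2d\u6587";
        byte[] gbk = {(byte) 0xd6, (byte) 0xd0, (byte) 0xce, (byte) 0xc4};
        byte[] utf8 = {(byte) 0xe4, (byte) 0xb8, (byte) 0xad,
                       (byte) 0xe6, (byte) 0x96, (byte) 0x87};

        System.out.println(decode(gbk, Charset.forName("GBK")).equals(word));  // true
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(word)); // true

        // The ISO8859-1 bytes are already "3f 3f", so decoding yields "??".
        byte[] lossy = word.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decode(lossy, StandardCharsets.ISO_8859_1));        // ??

        System.out.println(utf8RoundTrips(word));                              // true
    }
}
```

The comparison uses equals() rather than ==, since == on Java strings compares object references, not content.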

3. setCharacterEncoding()

This method sets the encoding of the HTTP request or response.

For a request, it specifies the encoding of the request content. Once it is set, getParameter() returns the correct string directly. If it is not set, ISO8859-1 is used by default and further processing is needed; see "Form input" below. Note that setCharacterEncoding() must be called before any call to getParameter(). The Javadoc states: This method must be called prior to reading request parameters or reading input using getReader().
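The "further processing" needed when no encoding is set can be sketched without a servlet container. The code below only simulates the situation with plain strings: `decodeAsContainerDefault` stands in for a container that decodes the request body as ISO8859-1, and `recover` is a hypothetical helper; neither is part of the Servlet API, and the client is assumed to have sent UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class RequestParamRecoveryDemo {
    // Simulation only: what getParameter() would effectively return if the
    // raw bytes were decoded as ISO8859-1 because no encoding was set.
    static String decodeAsContainerDefault(byte[] rawBytes) {
        return new String(rawBytes, StandardCharsets.ISO_8859_1);
    }

    // Recovers the intended text: get the original bytes back through
    // ISO8859-1, then decode them with the encoding the client really used.
    static String recover(String misdecoded, Charset actual) {
        return new String(misdecoded.getBytes(StandardCharsets.ISO_8859_1), actual);
    }

    public static void main(String[] args) {
        String sent = "\u4e2d\u6587";
        byte[] raw = sent.getBytes(StandardCharsets.UTF_8); // bytes on the wire

        String misdecoded = decodeAsContainerDefault(raw);
        System.out.println(misdecoded.equals(sent));        // false
        System.out.println(recover(misdecoded, StandardCharsets.UTF_8).equals(sent)); // true
    }
}
```

In a real servlet, calling request.setCharacterEncoding("UTF-8") before the first getParameter() makes this manual step unnecessary for request bodies.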
