Research on Java and related character set encoding

Source: Internet
Author: User
1. Overview

This article covers the following topics: basic coding knowledge, Java, system software, URLs, tool software, and so on.

In the following description, we take the word "中文" ("Chinese") as the example. Its GB2312 encoding is "d6d0 cec4", its Unicode encoding is "4e2d 6587", and its UTF-8 encoding is "e4b8ad e69687". Note that these two characters have no iso8859-1 encoding, but their bytes can still be "carried" as iso8859-1.

2. Basic coding knowledge

Among the earliest encodings were ASCII and iso8859-1, its single-byte extension. Many other standard encodings have since emerged to represent various languages. The important ones are described below.

2.1. iso8859-1

It is a single-byte encoding that can represent at most 256 characters (byte values 0-255). It is used for English and other Western European languages. For example, the letter a is encoded as 0x61 = 97.

Clearly, iso8859-1 covers a narrow range of characters and cannot represent Chinese characters. However, because it is a single-byte encoding, and the byte is the computer's most basic unit of representation, iso8859-1 is still often used to carry data, and many protocols use it by default. For example, the example word "中文" has no iso8859-1 encoding. In gb2312 it is the two characters "d6d0 cec4"; treated as iso8859-1, it is split into 4 bytes, "d6 d0 ce c4" (in fact, storage is also byte by byte). In UTF-8 it is the 6 bytes "e4 b8 ad e6 96 87". Obviously, this representation is meaningful only together with another encoding that says how to interpret the bytes.
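The "carrier" role of iso8859-1 described above can be sketched in a few lines of Java. This is a minimal illustration, not part of the original article; the class name Latin1Carrier and its helper methods are hypothetical. It relies only on the fact that iso8859-1 maps every byte value to exactly one character, so decoding and re-encoding loses nothing.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin1Carrier {
    // Decode raw bytes as iso8859-1: each byte 0x00-0xFF becomes exactly one char.
    static String carry(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Recover the original bytes from the iso8859-1 "carrier" string.
    static byte[] unwrap(String carrier) {
        return carrier.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The gb2312 bytes of the two-character example word.
        byte[] gbk = {(byte) 0xd6, (byte) 0xd0, (byte) 0xce, (byte) 0xc4};
        String carrier = carry(gbk);
        System.out.println(carrier.length());                    // 4 chars, one per byte
        System.out.println(Arrays.equals(unwrap(carrier), gbk)); // true, nothing was lost
    }
}
```

The carrier string itself is meaningless as text; it only becomes readable once the bytes are decoded again with the encoding that produced them.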

2.2. GB2312/GBK

This is the Chinese national standard encoding (Guobiao), designed specifically to represent Chinese characters. It is a double-byte encoding, while English letters are encoded the same as in iso8859-1 (that is, it is compatible with the ASCII range of iso8859-1). GBK can represent both traditional and simplified Chinese characters, while GB2312 can represent only simplified Chinese characters. GBK is compatible with GB2312.
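The difference in coverage can be checked with CharsetEncoder.canEncode. The sketch below is illustrative only: the class name GbCoverage is hypothetical, and it assumes the runtime ships the "GB2312" and "GBK" charsets, which standard JDK distributions do but which the Java specification does not guarantee.

```java
import java.nio.charset.Charset;

class GbCoverage {
    // True if the named charset has a mapping for the given character.
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char common = '\u4e2d';      // U+4E2D, first character of the example word
        char traditional = '\u9ad4'; // U+9AD4, a traditional-only character

        System.out.println(canEncode("GB2312", common));      // true
        System.out.println(canEncode("GB2312", traditional)); // false
        System.out.println(canEncode("GBK", traditional));    // true
    }
}
```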

2.3. unicode

This is the most unified encoding and can represent characters in all languages. In the form discussed here it is a fixed-length double-byte encoding (there is also a four-byte form), and this applies to English letters too. Strictly speaking, Unicode is a character set, and the two-byte form corresponds to UCS-2/UTF-16; characters outside the Basic Multilingual Plane need four bytes. It is therefore not compatible with iso8859-1 or with any other encoding. However, for the characters that iso8859-1 covers, the Unicode encoding simply adds a 0 byte in front; for example, the letter a is "00 61".

Note that fixed-length encodings are easy for computers to process (GB2312/GBK are not fixed-length), and Unicode can represent all characters. For these reasons, many software systems, including Java, use Unicode internally.
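The byte layout described in this section can be observed with Java's UTF-16BE charset, used here as a stand-in for the two-byte "unicode" form the article talks about. The class name Utf16Bytes and the hex helper are hypothetical names for this sketch.

```java
import java.nio.charset.StandardCharsets;

class Utf16Bytes {
    // Format bytes as lowercase hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Encode a string as big-endian UTF-16 without a byte order mark.
    static String utf16Hex(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_16BE));
    }

    public static void main(String[] args) {
        System.out.println(utf16Hex("a"));            // 00 61
        System.out.println(utf16Hex("\u4e2d\u6587")); // 4e 2d 65 87
        // U+1F600 lies outside the Basic Multilingual Plane and takes four bytes.
        System.out.println(utf16Hex("\ud83d\ude00")); // d8 3d de 00
    }
}
```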

2.4. UTF

Unicode is not compatible with iso8859-1 and is not economical: even English letters need two bytes, which makes Unicode inconvenient for transmission and storage. UTF encoding (here meaning UTF-8) was created for this reason. UTF-8 is compatible with ASCII, the 7-bit subset of iso8859-1, and can also represent characters in all languages. However, it is a variable-length encoding: the original design allowed 1 to 6 bytes per character, and the current standard (RFC 3629) limits this to 1 to 4 bytes. Its byte structure also provides a simple validity check. Generally, English letters take one byte and Chinese characters take three bytes.

Note: although UTF-8 was designed to use less space, that is only relative to Unicode. If the text is known to be Chinese, GB2312/GBK is undoubtedly the most economical. On the other hand, even though UTF-8 uses three bytes per Chinese character, it still saves space compared with Unicode even for Chinese web pages, because web pages contain many English characters.
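The size comparison can be made concrete with a small measurement. In this sketch the class name EncodedSize and the sample markup string are illustrative assumptions, and UTF-16BE again stands in for the two-byte "unicode" form.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedSize {
    // Number of bytes the string occupies in the given charset.
    static int size(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        System.out.println(size("a", StandardCharsets.UTF_8));      // 1
        System.out.println(size("\u4e2d", StandardCharsets.UTF_8)); // 3

        // A tiny "web page": 7 ASCII characters of markup around 2 Chinese characters.
        String page = "<p>\u4e2d\u6587</p>";
        System.out.println(size(page, StandardCharsets.UTF_8));     // 13
        System.out.println(size(page, StandardCharsets.UTF_16BE));  // 18
    }
}
```

The more markup and English text a page contains relative to Chinese text, the larger the saving from UTF-8.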

3. Java processing of characters

In Java applications, character set encoding comes up in many places. Some of them require correct settings, and others require some processing.

3.1. getBytes(charset)

This is a standard method of Java's String class. It encodes the characters of the string according to charset and returns the result as bytes. Note that strings are always held in memory in Unicode. For example, the example word "中文" is normally (that is, when nothing has gone wrong) stored as "4e2d 6587". If charset is "gbk", it is encoded as "d6d0 cec4" and the bytes "d6 d0 ce c4" are returned. If charset is "utf8", the result is "e4 b8 ad e6 96 87". If charset is "iso8859-1", the characters cannot be encoded, so "3f 3f" (two question marks) is returned.
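The three cases above can be reproduced as follows. This is a minimal sketch: the class name GetBytesDemo and its helpers are hypothetical, the string literal uses Unicode escapes so the result does not depend on the source file's encoding, and the "GBK" charset is assumed to be available in the runtime.

```java
import java.nio.charset.Charset;

class GetBytesDemo {
    // Format bytes as lowercase hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Encode the string with the named charset and show the resulting bytes.
    static String encodeHex(String s, String charsetName) {
        return hex(s.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        String word = "\u4e2d\u6587"; // the two-character example word
        System.out.println(encodeHex(word, "GBK"));        // d6 d0 ce c4
        System.out.println(encodeHex(word, "UTF-8"));      // e4 b8 ad e6 96 87
        // Unmappable characters are replaced by the charset's replacement byte, '?'.
        System.out.println(encodeHex(word, "ISO-8859-1")); // 3f 3f
    }
}
```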

3.2. new String(bytes, charset)

This is another standard operation of Java's String class. It is the reverse of the previous method: it interprets a byte array according to charset and converts the result to Unicode for storage. Continuing the getBytes example, the "gbk" and "utf8" bytes both decode to the correct result "4e2d 6587", but the iso8859-1 bytes become "003f 003f" (two question marks).

Because utf8 can represent every character, new String(str.getBytes("utf8"), "utf8") equals str; that is, the conversion is completely reversible.
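Decoding and the round trip can be sketched like this. The class name DecodeDemo and its methods are hypothetical, and the "GBK" charset is assumed to be available in the runtime.

```java
import java.nio.charset.Charset;

class DecodeDemo {
    // Interpret the bytes according to the named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    // Encode and immediately decode with the same charset.
    static String roundTrip(String s, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String word = "\u4e2d\u6587"; // the two-character example word
        byte[] gbk = {(byte) 0xd6, (byte) 0xd0, (byte) 0xce, (byte) 0xc4};

        System.out.println(decode(gbk, "GBK").equals(word));           // true
        System.out.println(roundTrip(word, "UTF-8").equals(word));     // true, reversible
        System.out.println(roundTrip(word, "ISO-8859-1").equals("??")); // true, data was lost
    }
}
```

The last line shows why a round trip through a charset that cannot represent the text is destructive: the question marks are real characters, and the original text cannot be recovered from them.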

3.3. setCharacterEncoding()

This method sets the encoding of the HTTP request or response.

For a request, it specifies the encoding of the submitted content. Once it is set, getParameter() returns the correct string directly. If it is not set, iso8859-1 is used by default and further processing is needed; see "form input" below. Note that getParameter() must not be called before setCharacterEncoding(). The Javadoc states: "This method must be called prior to reading request parameters or reading input using getReader()". The setting is effective only for the POST method and not for the GET method. The likely reason is that on the first call to getParameter(), Java parses all submitted content using the current encoding, and later getParameter() calls do not parse it again, so a later setCharacterEncoding() has no effect. For a form submitted with GET, the submitted content is in the URL and has already been parsed with some encoding at the outset, so setCharacterEncoding() naturally has no effect.

For a response, it specifies the encoding of the output content. The setting is also sent to the browser, telling it how the output is encoded.
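The effect of the request encoding can be imitated without a servlet container. The sketch below is only a simulation: it uses java.net.URLDecoder to decode a percent-encoded form value the way a container might, and the class name FormDecodeSketch and the method decodeParam are hypothetical. It does not call the servlet API itself, and it assumes the "GBK" charset is available.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class FormDecodeSketch {
    // Decode a percent-encoded form value using the given request encoding.
    static String decodeParam(String raw, String charsetName) {
        try {
            return URLDecoder.decode(raw, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unsupported charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // The example word as a browser would submit it from a gbk page.
        String submitted = "%D6%D0%CE%C4";

        // With the right encoding the value comes out correctly.
        System.out.println(decodeParam(submitted, "GBK").equals("\u4e2d\u6587")); // true

        // With the iso8859-1 default, each byte becomes its own character.
        String wrong = decodeParam(submitted, "ISO-8859-1");
        System.out.println(wrong.length()); // 4
    }
}
```

Because the bytes are decoded once and the result is cached as a string, changing the encoding afterwards cannot change what has already been decoded, which matches the ordering rule described above.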

3.4. handling process

The following two representative examples illustrate how Java handles encoding problems.

3.4.1. form input

User input -> * (gbk: d6d0 cec4) -> browser -> * (gbk: d6d0 cec4) -> web server -> iso8859-1 (00d6 00d0 00ce 00c4) -> class. The value must then be processed in the class: getBytes("iso8859-1") gives d6 d0 ce c4, and new String(bytes, "gbk") decodes these as d6d0 cec4, which is held in memory in Unicode as 4e2d 6587.

• The encoding of the user's input depends on the encoding specified by the page and on the user's operating system, so it cannot be known in advance. The example above uses gbk.

• From browser to web server, the form can specify the character set used to submit its content; otherwise the encoding specified by the page is used. If parameters are entered directly in the URL after "?", the encoding is usually that of the operating system, because the input is unrelated to any page. The example above uses gbk.

• The web server receives a byte stream. By default, getParameter() decodes it as iso8859-1, and the result is incorrect, so it needs further processing. However, if the encoding is set in advance with request.setCharacterEncoding(), the correct result is obtained directly.

• It is a good habit to specify the encoding in the page; otherwise the encoding is out of your control and cannot be specified correctly.
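The repair step for a value that was wrongly decoded as iso8859-1 can be sketched as below. The class name ParamRecovery and the method recover are hypothetical, the misdecoded input strings are written with Unicode escapes, and the "GBK" charset is assumed to be available. The approach works only because iso8859-1 decoding preserves every byte.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ParamRecovery {
    // Undo an iso8859-1 misdecoding: get the original bytes back,
    // then decode them with the encoding the browser actually used.
    static String recover(String misdecoded, Charset actual) {
        byte[] original = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, actual);
    }

    public static void main(String[] args) {
        // What getParameter() would hand over for gbk bytes d6 d0 ce c4.
        String fromGbkPage = "\u00d6\u00d0\u00ce\u00c4";
        System.out.println(
            recover(fromGbkPage, Charset.forName("GBK")).equals("\u4e2d\u6587")); // true

        // The same repair for a page submitted as UTF-8 (bytes e4 b8 ad e6 96 87).
        String fromUtf8Page = "\u00e4\u00b8\u00ad\u00e6\u0096\u0087";
        System.out.println(
            recover(fromUtf8Page, StandardCharsets.UTF_8).equals("\u4e2d\u6587")); // true
    }
}
```

The repair requires knowing which encoding the browser used, which is one more reason to specify the encoding in the page.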

3.4.2. File Compilation

Assume that the file is saved in gbk. At compile time there are two likely encodings: gbk or iso8859-1. The former is the default encoding of Chinese Windows, and the latter is the default encoding of linux; the encoding can also be specified explicitly at compile time.

JSP -> * (gbk: d6d0 cec4) -> java file -> * (gbk: d6d0 cec4) -> compiler read -> unicode (gbk: 4e2d 6587; iso8859-1: 00d6 00d0 00ce 00c4) -> compiler write -> utf (gbk: e4b8ad e69687; iso8859-1: *) -> compiled file -> unicode (gbk: 4e2d 6587; iso8859-1: 00d6 00d0 00ce 00c4) -> class. Therefore, saving the file in gbk and compiling it as iso8859-1 gives an incorrect result.

Class -> unicode (4e2d 6587) -> System.out / jsp out -> gbk (d6d0 cec4) -> OS console / browser.

• Files can be saved in various encodings. On Chinese Windows, the default is ansi/gbk.

• When the compiler reads a file, it needs to know the file's encoding. If none is specified, the system default encoding is used. Source files for classes are generally saved in the system default encoding, so compilation causes no problems. JSP files, however, may be edited and saved on Chinese Windows and then deployed, compiled, and run on English linux, where problems can occur. Therefore, specify the encoding in the JSP file with pageEncoding.

• During compilation, Java converts the source to a uniform Unicode representation for processing and converts it to UTF encoding when it is stored.

• When the system outputs characters, it encodes them in the specified encoding. For System.out this is the system default (for example, gbk on Chinese Windows). For the response (browser), the contentType specified in the JSP file header is used, or the encoding can be set directly on the response; the browser is told the encoding of the page at the same time. If nothing is specified, iso8859-1 is used. For Chinese text, the encoding of the output sent to the browser should therefore be specified.

• When the browser displays the page, it first uses the encoding specified in the response (the contentType in the JSP file header is also reflected in the response). If none is specified, it uses the contentType given by the meta element in the page.
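The compile-time mismatch described in this section can be imitated with plain string conversions. This is a simulation under stated assumptions, not what javac literally does: the class name CompileSketch and the method readThenStore are hypothetical, standard UTF-8 stands in for the encoding used inside class files, and the "GBK" charset is assumed to be available.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CompileSketch {
    // Format bytes as lowercase hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // "Compiler read": decode the source bytes with the assumed file encoding.
    // "Compiler write": store the resulting characters as UTF-8.
    static String readThenStore(byte[] sourceBytes, String readCharset) {
        String inMemory = new String(sourceBytes, Charset.forName(readCharset));
        return hex(inMemory.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // The example word as saved in a gbk source file.
        byte[] saved = {(byte) 0xd6, (byte) 0xd0, (byte) 0xce, (byte) 0xc4};

        System.out.println(readThenStore(saved, "GBK"));        // e4 b8 ad e6 96 87
        // Read with the wrong encoding, four unrelated characters get stored instead.
        System.out.println(readThenStore(saved, "ISO-8859-1")); // c3 96 c3 90 c3 8e c3 84
    }
}
```

With javac, the file encoding can be given explicitly with the -encoding option so that the result does not depend on the platform default.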
