Java Character Set encoding

Last Update:2014-10-17 Source: Internet

Author: User

1. Overview

The description below uses the two characters "中文" ("Chinese") as the running example. Looking them up in the code tables gives the GB2312 encoding "D6D0 CEC4", the Unicode encoding "4E2D 6587", and the UTF-8 encoding "E4B8AD E69687". Note that these two characters have no ISO8859-1 encoding, but their bytes can still be "represented" in ISO8859-1.

2. Basics of coding

The earliest encoding discussed here is ISO8859-1, which is similar to ASCII. To make it easier to represent many different languages, a number of standard encodings were introduced later. The important ones are described below.

2.1 ISO8859-1

ISO8859-1 is a single-byte encoding that can represent the character codes 0-255 and is used for English and other Western European languages. For example, the letter 'a' is encoded as 0x61 = 97.

Clearly, ISO8859-1 covers a narrow range of characters and cannot represent Chinese characters. However, because it is a single-byte encoding that matches the computer's most basic unit of representation, data is still often carried as ISO8859-1, and many protocols use it by default. For example, although the two characters "中文" have no ISO8859-1 encoding, their GB2312 encoding is the two characters "D6D0 CEC4"; treated as ISO8859-1, this is split into 4 bytes: "D6 D0 CE C4" (in fact, storage is also handled byte by byte). With UTF-8, it is the 6 bytes "E4 B8 AD E6 96 87". Obviously, this kind of representation only makes sense when the other, underlying encoding is known.

2.2 GB2312/GBK

This is the Chinese national standard encoding, designed specifically for Chinese characters. It is a double-byte encoding, and English letters are encoded exactly as in ISO8859-1 (that is, it is compatible with the ASCII range of ISO8859-1). GBK can represent both traditional and simplified characters, while GB2312 can represent only simplified characters. GBK is compatible with GB2312.

2.3 Unicode

This is the most unified encoding and can represent the characters of all languages. In the form discussed here it is a fixed-length double-byte encoding (there is also a four-byte form), and the fixed length applies to English letters as well. It is therefore not compatible with ISO8859-1 at the byte level, nor with any other encoding. However, compared with ISO8859-1, the Unicode encoding merely adds a 0 byte in front; for example, the letter 'a' is "00 61".

2.4 UTF-8

Because Unicode is not byte-compatible with ISO8859-1, and because it takes up more space (even English letters need two bytes), Unicode is inconvenient to transmit and store. This led to the UTF encodings. UTF-8 is compatible with the ASCII range of ISO8859-1 and can also represent the characters of all languages. However, it is a variable-length encoding: the original design allowed 1 to 6 bytes per character, and the current standard uses 1 to 4. In addition, the byte structure of UTF-8 provides a simple built-in validity check. In general, English letters take one byte and Chinese characters take three bytes.

Note that although UTF-8 is designed to use less space, that is only relative to the Unicode encoding. If the text is known to be Chinese, GB2312/GBK is undoubtedly the most economical. On the other hand, even though UTF-8 uses 3 bytes per Chinese character, a Chinese web page encoded in UTF-8 is still smaller than one encoded in Unicode, because the page contains a large number of English characters.
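The space trade-off described in this section can be checked with a small sketch. It assumes the JDK in use ships the GBK charset (true for standard full JDKs, but an assumption nonetheless), and it uses UTF-16BE as a stand-in for the fixed two-byte "Unicode" form discussed above. The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;

final class EncodingSizes {
    /** Number of bytes needed to encode text in the named charset. */
    static int encodedLength(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file independent of its own encoding.
        String chinese = "\u4e2d\u6587";
        // A tiny "web page": ASCII markup around the two Chinese characters.
        String mixed = "<p>" + chinese + "</p>";
        for (String cs : new String[] {"GBK", "UTF-16BE", "UTF-8"}) {
            System.out.println(cs
                    + ": a=" + encodedLength("a", cs)
                    + " chinese=" + encodedLength(chinese, cs)
                    + " mixed=" + encodedLength(mixed, cs));
        }
    }
}
```

For pure Chinese text GBK and UTF-16BE both need 4 bytes against 6 for UTF-8, but as soon as ASCII markup is mixed in, UTF-8 becomes smaller than UTF-16BE.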

3. Java handling of characters

A Java application involves character set encodings in several places. Some of them need to be configured correctly, and others need some explicit processing.

3.1. getBytes(charset)

This is a standard method of Java's String class. It encodes the characters of the string in the given charset and returns the result as bytes. Note that strings are always held in Java memory in Unicode. For example, "中文" is normally (that is, when nothing has gone wrong) stored as "4E2D 6587". If charset is "GBK", it is encoded as "D6D0 CEC4" and the bytes "D6 D0 CE C4" are returned. If charset is "UTF8", the result is "E4 B8 AD E6 96 87". If charset is "ISO8859-1", the characters cannot be encoded, so "3F 3F" (two question marks) is returned.
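The three cases above can be reproduced with a short sketch. The helper name hex is made up for illustration, the string is written with Unicode escapes (U+4E2D U+6587) so the result does not depend on the source file encoding, and the GBK charset is assumed to be available in the JDK.

```java
import java.nio.charset.Charset;

final class GetBytesDemo {
    /** Encodes text in the named charset and renders the bytes as lowercase hex. */
    static String hex(String text, String charsetName) {
        byte[] bytes = text.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587";
        System.out.println("GBK:        " + hex(text, "GBK"));
        System.out.println("UTF-8:      " + hex(text, "UTF-8"));
        // Unmappable characters are replaced by '?' (0x3f), not reported as an error.
        System.out.println("ISO-8859-1: " + hex(text, "ISO-8859-1"));
    }
}
```

The silent replacement with 0x3f is what makes this kind of error hard to spot: no exception is thrown, and the original characters cannot be recovered afterwards.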

3.2. new String(bytes, charset)

This is another standard String operation. In contrast to the previous method, it interprets a byte array according to charset and converts it to the Unicode form held in memory. Taking the getBytes example above, the bytes produced with "GBK" and "UTF8" both decode to the correct result "4E2D 6587", but the ISO8859-1 bytes end up as "003F 003F" (two question marks).

Because UTF8 can represent every character, new String(str.getBytes("UTF8"), "UTF8") is equal to str, that is, the conversion is completely reversible.
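A minimal sketch of the round trip described in this section follows. The method names decode and roundTrip are hypothetical, and GBK is again assumed to be available in the JDK.

```java
import java.nio.charset.Charset;

final class NewStringDemo {
    /** Decodes raw bytes using the named charset. */
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    /** Encodes text in the named charset and decodes it again with the same charset. */
    static String roundTrip(String text, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587";
        byte[] gbkBytes = {(byte) 0xd6, (byte) 0xd0, (byte) 0xce, (byte) 0xc4};
        System.out.println(decode(gbkBytes, "GBK").equals(text));   // GBK bytes decode to the original text
        System.out.println(roundTrip(text, "UTF-8").equals(text));  // reversible
        System.out.println(roundTrip(text, "ISO-8859-1"));          // lossy: the characters were replaced by '?'
    }
}
```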

3.3. setCharacterEncoding()

This method sets the encoding of the HTTP request or response.

For the request, it specifies the encoding of the submitted content. Once it is set, getParameter() returns the correct string directly. If it is not set, ISO8859-1 is used by default and the value needs further processing; see "Form input" below. It is important that setCharacterEncoding() is executed before any call to getParameter(). The Java documentation says: "This method must be called prior to reading request parameters or reading input using getReader()". Furthermore, the setting takes effect only for the POST method and not for GET. The likely reason is that on the first call to getParameter(), Java parses all submitted content according to the current encoding and does not parse it again on later calls, so a later setCharacterEncoding() has no effect. When a form is submitted with GET, the content is carried in the URL and has already been parsed according to the encoding at the very beginning, so setCharacterEncoding() naturally has no effect.

For the response, it specifies the encoding of the output. The setting is also passed to the browser, telling it which encoding to use to interpret the output.

3.4. Process

The following two representative examples show how Java handles encoding-related problems.

3.4.1. Form input

User input -> (GBK: D6D0 CEC4) -> browser -> (GBK: D6D0 CEC4) -> web server -> ISO8859-1 (00D6 00D0 00CE 00C4) -> class. The class needs to process the value: getBytes("ISO8859-1") gives D6 D0 CE C4, and new String(bytes, "GBK") gives D6D0 CEC4, which is held in memory in Unicode as 4E2D 6587.

The encoding of user input depends on the encoding specified by the page and also on the user's operating system, so it cannot be known for certain. The example above assumes GBK.

From the browser to the web server, the form can specify the character set used to submit its content; otherwise the encoding specified by the page is used. If parameters are typed directly into the URL, the encoding is often that of the operating system itself, because no page is involved at that point. GBK is again used as the example.

The web server receives a byte stream. By default, getParameter() decodes it as ISO8859-1, and the result is incorrect, so it needs further processing. If the encoding has been set beforehand (via request.setCharacterEncoding()), the correct result is obtained directly.

It is good practice to specify the encoding in the page; otherwise you lose control over which encoding is used and cannot specify the correct one on the server.
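The repair step in the flow above (re-encode as ISO8859-1 to get the raw bytes back, then decode with the real charset) can be sketched in plain Java without a servlet container. The method name recover is made up, and the input string simulates what getParameter() would return for GBK bytes that were decoded as ISO8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class FormInputRecovery {
    /**
     * Repairs a string that was decoded as ISO-8859-1 although the bytes
     * were really in actualCharset. ISO-8859-1 maps every byte to exactly
     * one char, so the original bytes can be recovered without loss.
     */
    static String recover(String misdecoded, String actualCharset) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, Charset.forName(actualCharset));
    }

    public static void main(String[] args) {
        // Simulated parameter value: the GBK bytes D6 D0 CE C4 read as four ISO-8859-1 chars.
        String fromServer = "\u00d6\u00d0\u00ce\u00c4";
        String fixed = recover(fromServer, "GBK");
        System.out.println(fixed.equals("\u4e2d\u6587"));
    }
}
```

This only works because the mis-decoding step used ISO8859-1. If the bytes had been decoded with a charset that rejects or replaces some byte sequences, the original bytes would already be lost.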

3.4.2. File compilation

Suppose the source file is saved in GBK and is compiled with one of two encoding options, GBK or ISO8859-1. The former is the default encoding on Chinese Windows and the latter is the default on Linux; the encoding can of course also be specified at compile time.

JSP -> (GBK: D6D0 CEC4) -> Java file -> (GBK: D6D0 CEC4) -> compiler read -> Unicode (GBK: 4E2D 6587; ISO8859-1: 00D6 00D0 00CE 00C4) -> compiler write -> UTF-8 (GBK: E4B8AD E69687; ISO8859-1: *) -> compiled file -> Unicode (GBK: 4E2D 6587; ISO8859-1: 00D6 00D0 00CE 00C4) -> class. So for a file saved in GBK, the result of compiling with ISO8859-1 is incorrect.

Class -> Unicode (4E2D 6587) -> System.out / JSP out -> GBK (D6D0 CEC4) -> OS console / browser.

Files can be saved in a variety of encodings. Under Chinese Windows, the default is ANSI/GBK.

When the compiler reads a file, it needs to know the file's encoding and, if none is specified, uses the system default encoding. Ordinary source files are usually saved in the system default encoding, so compiling them causes no problem. For JSP files, however, a file edited and saved under Chinese Windows but deployed and compiled under English Linux will have problems. Therefore, the encoding should be specified in the JSP file with pageEncoding.

When Java compiles the source, it converts the text to Unicode, and when the compiled file is finally saved, the strings are converted to UTF-8.

When the system outputs characters, it encodes them in the specified encoding. On Chinese Windows, System.out uses GBK. For the response (browser), the contentType specified in the JSP file header is used, or the encoding can be set directly on the response; the same setting also tells the browser the encoding of the page. If nothing is specified, ISO8859-1 is used. For Chinese text, the encoding of the output sent to the browser should always be specified.

When the browser displays the page, it first uses the encoding specified in the response (the contentType in the JSP header ends up on the response). If none is specified there, it uses the Content-Type given in the page's meta element.
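Both ends of the compile-and-output flow can be imitated in a few lines. This is only a simulation under stated assumptions: codeUnits plays the role of the compiler reading a GBK-saved file with a given encoding option, and printed plays the role of System.out writing with a given encoding. Both helper names are hypothetical, and GBK must be available in the JDK.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

final class CompileAndOutputDemo {
    /** Decodes file bytes with the named charset and lists the resulting UTF-16 code units in hex. */
    static String codeUnits(byte[] fileBytes, String charsetName) {
        String decoded = new String(fileBytes, Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (char c : decoded.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04x", (int) c));
        }
        return sb.toString();
    }

    /** Prints text through a PrintStream that uses the named charset and returns the bytes written. */
    static byte[] printed(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(sink, true, charsetName);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        byte[] gbkFile = {(byte) 0xd6, (byte) 0xd0, (byte) 0xce, (byte) 0xc4};
        System.out.println("read as GBK:        " + codeUnits(gbkFile, "GBK"));
        System.out.println("read as ISO-8859-1: " + codeUnits(gbkFile, "ISO-8859-1"));
        System.out.println("bytes written as GBK: " + printed("\u4e2d\u6587", "GBK").length);
    }
}
```

Reading the file with the wrong option yields four unrelated characters instead of two, which is the incorrect result the flow above describes.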

3.5. Settings in several places

For Web applications, the encoding-related settings or functions are as follows.

3.5.1. JSP compilation

This specifies the encoding in which the file is stored, so the setting should obviously be placed at the beginning of the file. For example: <%@ page pageEncoding="GBK" %>. In addition, for ordinary source files, the encoding can be specified at compile time.

3.5.2. JSP output

This specifies the encoding used for output sent to the browser, and the setting should also be placed at the beginning of the file. For example: <%@ page contentType="text/html; charset=GBK" %>. This setting is equivalent to response.setCharacterEncoding("GBK").

3.5.3. Meta settings

This specifies the encoding used by the web page and is especially useful for static pages, because a static page cannot use the JSP settings and cannot execute response.setCharacterEncoding(). For example: <meta http-equiv="Content-Type" content="text/html; charset=GBK" />

If both the JSP output setting and the meta setting are present and specify different encodings, the JSP setting takes precedence, because it is reflected directly in the response.
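The precedence rule (response header first, meta setting only as a fallback) can be written down as a small sketch. This is not how any particular browser is implemented; charsetOf and effectiveCharset are hypothetical helpers that only illustrate the order of lookup.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ContentTypeCharset {
    // Matches "charset=GBK", "charset = GBK" and quoted values, case-insensitively.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    /** Extracts the charset from a Content-Type value, or returns null if there is none. */
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    /** The response header wins; the meta value is used only when the header names no charset. */
    static String effectiveCharset(String responseHeader, String metaContent) {
        String fromHeader = charsetOf(responseHeader);
        return fromHeader != null ? fromHeader : charsetOf(metaContent);
    }

    public static void main(String[] args) {
        System.out.println(effectiveCharset("text/html; charset=GBK", "text/html; charset=UTF-8"));
        System.out.println(effectiveCharset("text/html", "text/html; charset=UTF-8"));
    }
}
```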

Note that Apache has a setting that assigns an encoding to pages that do not specify one. It is equivalent to the JSP setting, so it overrides the meta setting in a static page. For this reason, it is sometimes suggested that this Apache setting be turned off.

3.5.4. Form settings

When the browser submits a form, the encoding to use can be specified. For example: <form accept-charset="gb2312">. In general this setting is unnecessary, because the browser uses the page encoding directly.

4. System software

Several related pieces of system software are discussed below.

4.1. MySQL database

Obviously, to support multiple languages, the database encoding should be set to UTF-8 or Unicode, and UTF-8 is better suited to storage. However, if the Chinese data contains very few English letters, Unicode is actually more suitable.

The database encoding can be set in the MySQL configuration file, for example default-character-set=utf8. It can also be set in the database connection URL, for example: useUnicode=true&characterEncoding=utf-8. Note that the two should be consistent. With newer MySQL versions, the setting may be omitted from the connection URL, but it must not be set incorrectly.

4.2. Apache

In Apache, the encoding-related configuration is in httpd.conf, for example AddDefaultCharset UTF-8. As mentioned earlier, this directive sets the encoding of all static pages to UTF-8 and is best turned off.

In addition, Apache has a separate module for handling response headers, which may also set the encoding.

4.3. Linux default encoding

The Linux default encoding described here refers to the environment variables in effect at run time. The two important environment variables are LC_ALL and LANG. The default encoding affects the behavior of Java's URLEncoder, as described below.

The recommended setting is "zh_CN.UTF-8".
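To see which default encoding a Java process actually picked up from its environment, a sketch like the following can be run in the same environment as the server. The class name is made up. Note as an assumption worth verifying for your JDK: newer releases (JDK 18 and later) default to UTF-8 regardless of the locale variables, so the dependency on LANG and LC_ALL mainly concerns older JDKs.

```java
import java.nio.charset.Charset;

final class DefaultCharsetDemo {
    /** Summarizes the default charset and the settings that commonly influence it. */
    static String describe() {
        return "defaultCharset=" + Charset.defaultCharset().name()
                + " file.encoding=" + System.getProperty("file.encoding")
                + " LANG=" + System.getenv("LANG")
                + " LC_ALL=" + System.getenv("LC_ALL");
    }

    public static void main(String[] args) {
        // Values depend entirely on the machine and JDK, so none are shown here.
        System.out.println(describe());
    }
}
```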

4.4. Other

To support Chinese file names, Linux should be given a character set when mounting a disk, for example: mount /dev/hda5 /mnt/hda5/ -t ntfs -o iocharset=gb2312.

Also, as mentioned earlier, content submitted with the GET method is not affected by request.setCharacterEncoding(), but the character set can be specified in Tomcat's server.xml file, in the form <Connector ... URIEncoding="GBK"/>. This applies to all requests uniformly rather than to specific pages, and the value is not necessarily the encoding the browser actually used, so it is sometimes not what is wanted.

5. URL Address

Chinese characters in a URL are troublesome. The case of submitting a form with the GET method was described earlier; with GET, the parameters are included in the URL.

5.1. URL encoding

The browser automatically encodes some special characters in the URL. Besides characters such as "/?&", these include Unicode characters such as Chinese characters. The encoding in this case is rather peculiar.

IE has an option "Always use UTF-8 to send URLs". When this option is enabled, IE encodes special characters in UTF-8 and then URL-encodes them. When it is disabled, the default encoding "GBK" is used and no URL encoding is done. The parameters after the "?" in the URL, however, are never encoded, as if the UTF-8 option were disabled. For example, for "中文.html?a=中文", the link "%e4%b8%ad%e6%96%87.html?a=\x4e\x2d\x65\x87" is sent when the UTF-8 option is enabled, and the link "\x4e\x2d\x65\x87.html?a=\x4e\x2d\x65\x87" is sent when it is disabled. Note that in the latter the leading "中文" takes only 4 bytes, while in the former it takes 18 bytes, mainly because of the URL encoding.

When the web server (Tomcat) receives the link, it URL-decodes it, that is, removes the "%" escapes, and interprets the bytes as ISO8859-1 (as described above, URIEncoding can be used to select another encoding). For the examples above, the results are "\ue4\ub8\uad\ue6\u96\u87.html?a=\u4e\u2d\u65\u87" and "\u4e\u2d\u65\u87.html?a=\u4e\u2d\u65\u87" respectively. Note that in the former the two characters "中文" have become 6 characters. Here "\u" denotes a Unicode character.

Therefore, depending on the client settings, the same link produces different results on the server. Many people have run into this problem, and there is no really good solution, so some websites advise users to turn off the UTF-8 option. A better approach is described below.
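The server-side effect can be reproduced with java.net.URLDecoder, which here stands in for the container's own decoding (an assumption; Tomcat does not literally call this class). Decoding the same percent-encoded path as ISO8859-1 and as UTF-8 shows where the 6 characters come from.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class UrlDecodeDemo {
    /** URL-decodes text, interpreting the unescaped bytes in the named charset. */
    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String path = "%e4%b8%ad%e6%96%87.html";
        String asLatin1 = decode(path, "ISO-8859-1"); // six unrelated characters plus ".html"
        String asUtf8 = decode(path, "UTF-8");        // the two intended characters plus ".html"
        System.out.println(asLatin1.length() + " vs " + asUtf8.length());
    }
}
```

This is why a container-wide setting such as URIEncoding only helps when every client really sends UTF-8: the decoder has no way to tell from the bytes alone which charset the browser used.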

5.2. Rewrite

Those familiar with Apache know that it has a powerful rewrite module; its functionality is not described here. What needs to be noted is that the module automatically URL-decodes the URL (removes the "%" escapes), that is, it performs part of the work of the web server (Tomcat) described above. Some documentation says the [NE] flag can turn this behavior off, but I did not succeed with it, possibly because of the version I am using (Apache 2.0.54). In addition, when a parameter contains symbols such as "?&", this behavior prevents the system from obtaining correct results.

The rewrite module itself appears to work purely on bytes and does not consider the encoding of the string, so it does not introduce encoding problems of its own.

5.3. URLEncoder.encode()

This is the URL encoding function provided by Java itself, and what it does is similar to what the browser does when the UTF-8 option above is enabled. Note that Java has deprecated the form of this method that does not specify an encoding, so the encoding should always be given when calling it.

When no encoding is specified, the method uses the system default encoding, which makes the result depend on where the software runs. For example, for "中文" the result is "%4e%2d%65%87" when the system default encoding is "gb2312" and "%e4%b8%ad%e6%96%87" when it is "UTF-8", which is hard for later processing to deal with. In addition, the system default encoding is determined by environment variables such as LC_ALL and LANG at the time Tomcat is started. I once had text turn into garbage after a Tomcat restart, and it eventually turned out, frustratingly, that these two environment variables had been modified.

It is recommended to specify "UTF-8" uniformly; existing programs may need to be modified accordingly.
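A minimal sketch of calling URLEncoder.encode with an explicit encoding, as recommended above. The wrapper method is hypothetical and exists only to turn the checked exception into an unchecked one; GBK is assumed to be available in the JDK.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class UrlEncodeDemo {
    /** URL-encodes text after converting it to bytes in the named charset. */
    static String encode(String text, String charsetName) {
        try {
            return URLEncoder.encode(text, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587";
        // The same input gives different output depending on the charset,
        // which is why relying on the platform default is fragile.
        System.out.println(encode(text, "UTF-8"));
        System.out.println(encode(text, "GBK"));
    }
}
```

With the charset fixed in code, the result no longer changes when LC_ALL or LANG changes between restarts.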

6. Other

Some other issues related to encoding are described below.

6.1. SecureCRT

Besides browsers and consoles, some client programs are also affected by encoding. For example, when using SecureCRT to connect to Linux, the SecureCRT display encoding (each session can have its own encoding setting) should match the Linux encoding environment variables. Otherwise some help messages may appear garbled.

In addition, MySQL has its own encoding settings, which should also match the SecureCRT display encoding. Otherwise, when SQL statements are executed through SecureCRT, Chinese characters may not be handled correctly and query results will be garbled.

For UTF-8 files, many editors (such as Notepad) add three invisible marker bytes at the beginning of the file. If the file is to be used as MySQL input, these three bytes must be removed (this can be done by saving the file with vi on Linux). An interesting phenomenon: on Chinese Windows, create a new TXT file, open it with Notepad, type the two characters "联通", save, and open it again. The two characters are gone, leaving only a small black dot.
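The three marker bytes are the UTF-8 byte order mark, EF BB BF. If an editor is not at hand, they can also be stripped in code; the following is a minimal sketch with a made-up class and method name, operating on bytes that have already been read into memory.

```java
import java.util.Arrays;

final class BomStripper {
    /** Returns the data without a leading UTF-8 byte order mark (EF BB BF), if one is present. */
    static byte[] stripUtf8Bom(byte[] data) {
        boolean hasBom = data.length >= 3
                && (data[0] & 0xff) == 0xEF
                && (data[1] & 0xff) == 0xBB
                && (data[2] & 0xff) == 0xBF;
        return hasBom ? Arrays.copyOfRange(data, 3, data.length) : data;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a'};
        System.out.println(stripUtf8Bom(withBom).length);          // the marker is gone, one byte remains
        System.out.println(stripUtf8Bom(new byte[] {'a'}).length); // unchanged
    }
}
```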

6.2. Filter

If the encoding needs to be set uniformly, using a filter is a good choice. In the filter class, the encoding can be set uniformly for the relevant requests or responses; see setCharacterEncoding() above. Apache already provides an example of such a class, SetCharacterEncodingFilter, which can be used directly.

6.3. POST and GET

Obviously, when information is submitted with POST, the URL is cleaner, and setCharacterEncoding() can easily be used to handle character set problems. A URL formed by the GET method, however, expresses the actual content of the page more readily and can also be bookmarked.

From the standpoint of uniformity, the GET method is recommended. It requires special handling when reading parameters in the program, since the convenience of setCharacterEncoding() is not available. If rewrite is not involved and IE's UTF-8 problem does not arise, consider setting URIEncoding to make it easy to read the parameters in the URL.

6.4. Simplified and traditional conversion

GBK contains both simplified and traditional characters, which means that the same word, written in its simplified and its traditional form, has different encodings and counts as two different words under GBK. Sometimes, in order to get complete results, the traditional and simplified forms should be unified. One option is to convert all traditional characters in UTF-8 and GBK data to the corresponding simplified characters, and to convert BIG5-encoded data to the corresponding simplified characters as well. The result is, of course, still stored in UTF-8.

For example, "语言 語言" in UTF-8 is "\xe8\xaf\xad\xe8\xa8\x80 \xe8\xaa\x9e\xe8\xa8\x80". After simplified/traditional conversion, it should become two identical copies of "\xe8\xaf\xad\xe8\xa8\x80".
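The idea of the conversion can be sketched with a lookup table. The table below is purely illustrative and contains only the single pair needed for the example (traditional U+8A9E to simplified U+8BED); a real conversion needs a complete mapping, which is not provided here, and the class and method names are made up.

```java
import java.util.HashMap;
import java.util.Map;

final class SimplifiedNormalizer {
    // Illustrative table only: one traditional-to-simplified pair.
    private static final Map<Character, Character> TO_SIMPLIFIED = new HashMap<>();
    static {
        TO_SIMPLIFIED.put('\u8a9e', '\u8bed');
    }

    /** Replaces every character that has a table entry; all other characters pass through unchanged. */
    static String toSimplified(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            Character mapped = TO_SIMPLIFIED.get(c);
            sb.append(mapped != null ? mapped.charValue() : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String simplified = "\u8bed\u8a00";   // simplified spelling of the example word
        String traditional = "\u8a9e\u8a00";  // traditional spelling of the same word
        // After normalization both spellings compare equal, so a search finds both.
        System.out.println(toSimplified(traditional).equals(toSimplified(simplified)));
    }
}
```

Normalizing before storing and before searching keeps the comparison at the character level, so the stored bytes can stay in UTF-8 as the text above suggests.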


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
