Java Character Set Encoding Research

Source: Internet
Author: User

1. Overview

This article covers the following topics: encoding fundamentals, Java, system software, URLs, tool software, and so on.

Garbled text is a frequent problem in Java development. When it shows up the situation tends to become absurd, with nobody willing to admit that their own code is at fault. In fact encoding problems are neither mysterious nor unpredictable: once you understand how Java actually handles encodings, the cause becomes clear.

[Figure: overview diagram, not reproduced here.]

In fact, there are two aspects of coding problems: within the JVM and outside the JVM.

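① Compiling a .java file into a class file

A .java file may be saved in any of several encodings. The compiler decodes the source using the encoding given by the -encoding option or, if none is given, the system default encoding; as long as that matches the encoding the file was saved in, the text is read correctly. The string constants are then written to the class file in a Unicode form (modified UTF-8 on disk, UTF-16 once loaded into memory).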

So, given a string defined in Java code:

    String s = "汉字";

Regardless of the encoding of the .java file before compilation, once it is compiled into a class file the string is the same: Unicode.

② Encoding inside the JVM

When the JVM loads the class file it reads it as Unicode, so the string s defined above is represented in memory in Unicode.

Calling String.getBytes() with no argument already plants the seed of garbled text, because this method uses the platform default character set to produce the byte array for the string. On the Chinese edition of Windows XP the default encoding is GBK, as running the following shows:

    import java.nio.charset.Charset;

    public class Test {
        public static void main(String[] args) {
            // Print the JRE version and the charset that getBytes() uses by default.
            System.out.println("Current JRE: " + System.getProperty("java.version"));
            System.out.println("Default character set of the current JVM: " + Charset.defaultCharset());
        }
    }

    Current JRE: 1.6.0_16
    Default character set of the current JVM: GBK

When text passes through several systems and databases and is re-encoded several times, it is easy to end up with garbled text if you do not understand the principles involved. A system should therefore standardise its string encoding at its boundaries, that is, for everything that crosses to the outside, such as method string parameters and I/O streams. A Chinese-language system can standardise on GBK, GB18030, UTF-8, UTF-16 and so on, but it should choose a character set large enough that every character that might be used can be displayed. If too small a character set is chosen (in the extreme case, treating every file as ASCII), a lossless round-trip conversion is not possible.

Some care is needed when mixing UTF-8 with the GB encodings. GB2312 and GBK cover only part of Unicode, so UTF-8 text converted to them can come out garbled, whereas GB18030, like UTF-8, can represent every Unicode character. The more common mistake is inconsistency: on a project with many contributors, some source files get saved as GBK, some as UTF-8 and some as GB18030. Pick a single encoding for all source files (the author's preference for a Chinese-only project is GB18030), so that the Ant build does not complain about unrecognisable characters.

For a Chinese-language system the author therefore recommends GBK or GB18030 (GBK is in fact a subset of GB18030) to avoid garbled text as far as possible.

③ Encoding of strings in memory

Strings in memory are not limited to those loaded from class files. Some are read from text files, some from databases, and some are built from byte arrays; most of these sources are not Unicode-encoded, for the simple reason of saving storage space.

Whenever such data is handled, you must first be clear about the encoding of the source and then read it into memory with that encoding specified. For a method parameter, the encoding of the string data must be stated explicitly, because the data may come from, say, a Japanese system. Once the encoding is known, the string can be handled correctly and garbled text avoided.

When encoding or decoding a string, call the following methods:

    getBytes(String charsetName)
    String(byte[] bytes, String charsetName)

rather than the overloads without a character set name. These two let you control which encoding is used to convert between bytes and the characters in memory.

The examples below use the two characters "中文". From the code tables, its GB2312 encoding is "d6d0 cec4", its Unicode encoding is "4e2d 6587", and its UTF-8 encoding is "e4b8ad e69687". Note that these two characters have no ISO8859-1 encoding, but their bytes can be "represented" using ISO8859-1.
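The code values quoted above can be checked with a few lines of Java. The following is only a sketch: the class name EncodingTableDemo and its helper methods are made up for illustration, the two characters are written as \u escapes so that the result does not depend on the encoding of the source file, and UTF-16BE is used to show the raw Unicode code units without a byte order mark.

```java
import java.nio.charset.Charset;

class EncodingTableDemo {
    /** Renders a byte array as lowercase hex, e.g. {0xd6, 0xd0} becomes "d6d0". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Encodes the text with the named charset and returns the bytes as hex. */
    static String encodeHex(String text, String charsetName) {
        return hex(text.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        String s = "\u4e2d\u6587"; // the two characters used throughout the article
        for (String cs : new String[] {"GB2312", "UTF-16BE", "UTF-8", "ISO-8859-1"}) {
            System.out.println(cs + ": " + encodeHex(s, cs));
        }
    }
}
```

The ISO-8859-1 line prints 3f3f: characters that cannot be encoded are replaced by question marks.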

2. Encoding fundamentals

The earliest of these encodings is ASCII (American Standard Code for Information Interchange), and ISO8859-1 is a single-byte extension of it. To represent other languages, a number of further standard encodings gradually emerged; the important ones are described below.

2.1. ISO8859-1

ISO8859-1 is a single-byte encoding that can represent characters in the range 0-255 and is used for English and other Western European languages. For example, the letter 'a' is encoded as 0x61 = 97.

Clearly ISO8859-1 covers a narrow range of characters and cannot represent Chinese characters. However, because it is a single-byte encoding, matching the computer's most basic unit of representation, it is still often used as a carrier, and many protocols use it by default. For example, although the two characters "中文" have no ISO8859-1 encoding, their GB2312 encoding is the two characters "d6d0 cec4"; treated as ISO8859-1 this is split into 4 bytes, "d6 d0 ce c4" (storage is in fact also byte-oriented). With UTF-8 it is the 6 bytes "e4 b8 ad e6 96 87". Obviously this kind of representation relies on knowing the other, underlying encoding.

2.2. GB2312/GBK

This is the Chinese national standard (GB) encoding, designed for Chinese characters. It is a double-byte encoding in which English letters are encoded exactly as in ASCII, so it is ASCII-compatible. GBK can represent both traditional and simplified characters, whereas GB2312 can represent only simplified characters; GBK is backward compatible with GB2312.

2.3. Unicode

This is the most universal encoding: it can represent the characters of every language. In the form discussed here it is a fixed-length double-byte encoding (there is also a four-byte form), and this applies to English letters too. It is therefore not byte-compatible with ISO8859-1, or with any other encoding. Relative to ISO8859-1, however, the Unicode encoding simply adds a zero byte in front: the letter 'a' becomes "00 61".

Note that a fixed-length encoding is convenient for a computer to process (GB2312/GBK are not fixed-length), and that Unicode can represent all characters. For these reasons much software, Java included, uses Unicode internally.

2.4. UTF

Unicode as described above is incompatible with ISO8859-1 and tends to take more space, since even an English letter needs two bytes, which makes it inconvenient for transmission and storage. This led to the UTF encodings, of which UTF-8 is the one meant in this article. UTF-8 is compatible with ASCII (the first 128 characters of ISO8859-1) and can represent the characters of every language. It is, however, a variable-length encoding: the original design allowed 1 to 6 bytes per character, and the current standard limits it to 4. The byte patterns also provide a simple self-checking property. In general an English letter takes one byte and a Chinese character takes three.

Note that although UTF-8 saves space, this is only relative to the two-byte Unicode encoding; if the text is known to be Chinese, GB2312/GBK is undoubtedly the most economical. On the other hand, even though UTF-8 uses 3 bytes per Chinese character, it is still more economical than Unicode for Chinese web pages, because a web page contains a great many English characters.
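The space trade-off described above can be measured directly. The sketch below is illustrative only: EncodedSizeDemo, encodedLength and the small HTML fragment are invented for this example, and UTF-16BE stands in for the two-byte Unicode form.

```java
import java.nio.charset.Charset;

class EncodedSizeDemo {
    /** Returns how many bytes the text occupies in the named charset. */
    static int encodedLength(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        // A tiny "web page": 7 ASCII characters of markup around 2 Chinese characters.
        String page = "<p>\u4e2d\u6587</p>";
        for (String cs : new String[] {"GBK", "UTF-8", "UTF-16BE"}) {
            System.out.println(cs + ": " + encodedLength(page, cs) + " bytes");
        }
    }
}
```

For this fragment GBK needs 11 bytes, UTF-8 13 and UTF-16BE 18, which matches the ordering given in the text: GBK is smallest for Chinese text, and UTF-8 beats the two-byte form as soon as the markup outweighs the Chinese characters.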

3. Java processing of characters

In Java applications many places involve character set encoding: some need to be configured correctly, and others need some processing in code.

3.1. getBytes(charset)

This is a standard String method: it encodes the characters of the string according to charset and returns the result as bytes. Note that in Java memory a string is always stored as Unicode. For example "中文" is normally (that is, when nothing has gone wrong) stored as "4e2d 6587". If charset is "GBK" it is encoded as "d6d0 cec4" and the bytes "d6 d0 ce c4" are returned. If charset is "UTF8" the result is "e4 b8 ad e6 96 87". If it is "ISO8859-1" the characters cannot be encoded, and "3f 3f" (two question marks) is returned.

3.2. new String(bytes, charset)

This is another standard String operation. It is the inverse of the previous one: it decodes a byte array according to charset and stores the result as Unicode. Continuing the getBytes example above, decoding the "GBK" and "UTF8" bytes with the same charset yields the correct result "4e2d 6587", whereas the ISO8859-1 bytes yield "003f 003f" (two question marks).

Because UTF8 can represent every character, new String(str.getBytes("UTF8"), "UTF8").equals(str) always holds; the conversion is completely reversible.
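The reversibility claim is easy to try out. This is a minimal sketch with an invented helper, roundTrips, which encodes a string and decodes it again with the same charset:

```java
import java.nio.charset.Charset;

class RoundTripDemo {
    /** True if encoding and then decoding with the same charset gives back the original text. */
    static boolean roundTrips(String text, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        return new String(text.getBytes(cs), cs).equals(text);
    }

    public static void main(String[] args) {
        String s = "\u4e2d\u6587";
        System.out.println("UTF-8:      " + roundTrips(s, "UTF-8"));
        System.out.println("GBK:        " + roundTrips(s, "GBK"));
        System.out.println("ISO-8859-1: " + roundTrips(s, "ISO-8859-1"));
    }
}
```

UTF-8 and GBK both restore the two characters; ISO-8859-1 does not, because the characters were already replaced by question marks at the encoding step.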

3.3. setCharacterEncoding()

This method sets the encoding of an HTTP request or response.

For a request, it specifies the encoding of the submitted content; once it is set, getParameter() returns the correct string directly. If it is not set, ISO8859-1 is used by default and further processing is needed (see "Form input" below). Note that getParameter() must not be called before setCharacterEncoding(). The Javadoc says: "This method must be called prior to reading request parameters or reading input using getReader()". Also, the setting takes effect only for the POST method, not for GET. The reason is that on the first call to getParameter() the container parses all submitted content using the current encoding, and later calls do not parse again, so a subsequent setCharacterEncoding() has no effect. With a GET form the submitted content is in the URL and is parsed right at the start, so setCharacterEncoding() naturally has no effect.

For a response, it specifies the encoding of the output, and this value is also sent to the browser to tell it which encoding the output uses.

3.4. Processing flow

Two representative examples of how Java handles encoding are analysed below.

3.4.1. Form input

User input *(GBK: d6d0 cec4) → browser *(GBK: d6d0 cec4) → web server ISO8859-1 (00d6 00d0 00ce 00c4) → class. The class has to handle this: getBytes("ISO8859-1") yields d6 d0 ce c4, new String(bytes, "GBK") yields d6d0 cec4, which in memory is 4e2d 6587 in Unicode.

• The encoding of the user's input depends on the encoding specified by the page and also on the user's operating system, so it is not fixed; the example above assumes GBK.

• From browser to web server: the form can specify the character set to use when submitting content; otherwise the encoding specified by the page is used. If parameters are typed directly into the URL, their encoding is usually the operating system's own, because no page is involved. The example above still assumes GBK.

• The web server receives a byte stream. By default getParameter() decodes it as ISO8859-1, which gives the wrong result, so further processing is needed. If the encoding is set beforehand with request.setCharacterEncoding(), the correct result is obtained directly.

• Specifying the encoding in the page is good practice; otherwise you may lose control and be unable to determine the correct encoding.
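The recovery step in the flow above can be sketched without a servlet container. In the example below, decodeAsLatin1 merely simulates what the web server does by default, and both method names (and the class name) are hypothetical; only the String and Charset calls are real JDK API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class FormParamRecoveryDemo {
    /** Simulates a container that decodes the raw request bytes as ISO-8859-1. */
    static String decodeAsLatin1(byte[] rawBytes) {
        return new String(rawBytes, StandardCharsets.ISO_8859_1);
    }

    /** Gets the original bytes back and decodes them with the charset the browser really used. */
    static String recover(String misdecoded, String actualCharset) {
        byte[] original = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, Charset.forName(actualCharset));
    }

    public static void main(String[] args) {
        // GBK bytes for the two example characters, as sent by the browser.
        byte[] fromBrowser = {(byte) 0xd6, (byte) 0xd0, (byte) 0xce, (byte) 0xc4};
        String wrong = decodeAsLatin1(fromBrowser);   // 4 chars: 00d6 00d0 00ce 00c4
        String right = recover(wrong, "GBK");         // 2 chars: 4e2d 6587
        System.out.println(wrong.length() + " -> " + right.length());
    }
}
```

This works because ISO-8859-1 maps every byte value to exactly one character, so the mis-decoding loses no information and can be undone.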

3.4.2. File compilation

Suppose the file is saved in GBK, and that there are two candidate encodings at compile time: GBK, the default on Chinese Windows, and ISO8859-1, the default on Linux. The encoding can of course also be specified explicitly when compiling.

JSP *(GBK: d6d0 cec4) → Java file *(GBK: d6d0 cec4) → compiler reads as Unicode (GBK: 4e2d 6587; ISO8859-1: 00d6 00d0 00ce 00c4) → compiler writes UTF (GBK: e4b8ad e69687; ISO8859-1: *) → compiled file → Unicode (GBK: 4e2d 6587; ISO8859-1: 00d6 00d0 00ce 00c4) → class. So a file saved in GBK but compiled as ISO8859-1 gives an incorrect result.

class Unicode (4e2d 6587) → System.out / jsp.out GBK (d6d0 cec4) → OS console / browser.

• Files can be stored in a variety of encodings; on Chinese Windows the default is ANSI/GBK.

• When the compiler reads a file it needs to know the file's encoding; if none is specified it uses the system default. Ordinary Java source files are usually saved in the system default encoding, so compiling them causes no problem. JSP files are different: if they are edited and saved on Chinese Windows but deployed and run/compiled on English Linux, problems arise. The encoding should therefore be declared in the JSP file with pageEncoding.

• When Java compiles, the text is converted to Unicode for processing, and then converted to UTF encoding when it is finally saved.

• When the system outputs characters it uses the specified encoding. On Chinese Windows, System.out uses GBK. For a response (to the browser), the contentType declared at the top of the JSP file is used, or the encoding can be set directly on the response; the browser is told the encoding of the page at the same time. If nothing is specified, ISO8859-1 is used. For Chinese, the output encoding should always be specified for the browser.

• When displaying a page, the browser first uses the encoding given in the response (the contentType declared at the top of a JSP ends up in the response); if none is given, it uses the Content-Type in the page's meta element.

3.5. Several settings

For Web applications, the settings or functions associated with encoding are as follows.

3.5.1. JSP compilation

Specifies the encoding in which the file is stored; clearly, this setting should be placed at the beginning of the file. For example: <%@ page pageEncoding="GBK" %>. For ordinary Java source files, the encoding can be specified at compile time.

3.5.2. JSP output

Specifies the encoding used for the output sent to the browser; this setting should also be placed at the beginning of the file. For example: <%@ page contentType="text/html; charset=GBK" %>. This setting is equivalent to response.setCharacterEncoding("GBK").

3.5.3. Meta settings

Specifies the encoding used by the web page, which is especially useful for static pages, because a static page cannot use the JSP settings and cannot call response.setCharacterEncoding(). For example: <meta http-equiv="Content-Type" content="text/html; charset=GBK" />

If the JSP output setting and the meta setting are both present and specify different encodings, the JSP setting takes precedence, because it is set directly on the response.

Note that Apache has a setting that assigns an encoding to pages that do not declare one. It has the same effect as the JSP setting and therefore overrides the meta declaration in static pages. It is recommended to turn this setting off.

3.5.4. Form settings

When the browser submits a form, the encoding to use can be specified. For example: <form accept-charset="gb2312">. Usually this setting is not needed, and the browser simply uses the page's encoding.

4. System software

Several related pieces of system software are discussed below.

4.1. MySQL database

Obviously, to support multiple languages the database encoding should be set to UTF or Unicode, and UTF is the better fit for storage. If the Chinese data contains very few English letters, however, Unicode is more appropriate.

The database encoding can be set in the MySQL configuration file, for example default-character-set=utf8. It can also be set in the database connection URL, for example useUnicode=true&characterEncoding=utf-8. Note that the two should agree. With newer MySQL versions the connection URL setting may be omitted, but it must not be set incorrectly.

4.2. Apache

Apache's encoding-related configuration is in httpd.conf, for example AddDefaultCharset UTF-8. As mentioned earlier, this sets the encoding of all static pages to UTF-8; it is best turned off.

In addition, Apache also has a separate module to handle the Web page response header, which may also set the encoding.

4.3. Linux default encoding

The Linux default encoding here means the run-time environment variables. The two important ones are LC_ALL and LANG. The default encoding affects the behaviour of Java's URLEncoder, as described below.

It is recommended to set both to "zh_CN.UTF-8".

4.4. Other

To support Chinese file names, Linux should be given a character set when mounting a disk, for example: mount /dev/hda5 /mnt/hda5/ -t ntfs -o iocharset=gb2312.

Also, as mentioned earlier, data submitted with the GET method is not affected by request.setCharacterEncoding(), but the character set can be specified in Tomcat's configuration file server.xml, for example: <Connector ... URIEncoding="GBK"/>. This applies to all requests uniformly and cannot be set per page, and it is not necessarily the encoding the browser actually used, so the result is sometimes not what was expected.

5. URL address

Chinese characters in a URL are troublesome. As described earlier, this happens when a form is submitted with the GET method, because the parameters are then included in the URL.

5.1. URL encoding

The browser automatically encodes certain special characters in a URL. Besides "/?&" and the like, these include Unicode characters such as Chinese characters, and the encoding used in this case is rather particular.

IE has an option "Always send URLs as UTF-8". When it is enabled, IE encodes special characters as UTF-8 and then URL-encodes them. When it is disabled, the default encoding (GBK) is used and no URL encoding is performed. The parameters after the "?" are never encoded in either case, as if the UTF-8 option were off. For example, for "中文.html?a=中文": with the UTF-8 option enabled the link sent is "%e4%b8%ad%e6%96%87.html?a=\xd6\xd0\xce\xc4", and with it disabled the link sent is "\xd6\xd0\xce\xc4.html?a=\xd6\xd0\xce\xc4". Note that the leading "中文" takes only 4 bytes in the latter but 18 bytes in the former, mainly because of the URL encoding.

When the web server (Tomcat) receives the link it URL-decodes it, that is, it removes the "%" escapes, and interprets the bytes as ISO8859-1 (as described above, URIEncoding can be used to select another encoding). For the examples above the results are "\u00e4\u00b8\u00ad\u00e6\u0096\u0087.html?a=\u00d6\u00d0\u00ce\u00c4" and "\u00d6\u00d0\u00ce\u00c4.html?a=\u00d6\u00d0\u00ce\u00c4"; note that the leading "中文" in the former has become 6 characters. The "\u" notation here denotes a Unicode character.

As a result, the same link produces different results on the server depending on the client's settings. Many people have run into this problem without finding a good solution, and some sites simply advise users to turn the UTF-8 option off. A better approach is described below.
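The server-side decoding step can be reproduced with java.net.URLDecoder. This is only an approximation of what a container does internally, and the wrapper method decode (and the class name) is invented here to turn the checked exception into an unchecked one:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class UrlDecodeDemo {
    /** URL-decodes the text, interpreting the %XX bytes in the named charset. */
    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String sent = "%e4%b8%ad%e6%96%87"; // what the browser sends with the UTF-8 option on
        System.out.println(decode(sent, "UTF-8").length());      // the 2 original characters
        System.out.println(decode(sent, "ISO-8859-1").length()); // 6 characters, one per byte
    }
}
```

Decoding the same 18 bytes as ISO-8859-1 gives six unrelated characters, while decoding them as UTF-8 gives back the two Chinese characters.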

5.2. Rewrite

As is well known, Apache has a powerful rewrite module, whose features are not described here. What needs to be pointed out is that the module automatically URL-decodes the request (removes the "%" escapes), thereby doing part of the work of the web server (Tomcat) described above. Some documentation says the [NE] flag turns this off, but it did not work in my tests, possibly because of the version (I was using Apache 2.0.54). In addition, when a parameter contains characters such as "?" or "&", this behaviour prevents the system from obtaining the correct result.

Rewrite itself seems to be purely byte-processing, regardless of string encoding, so it does not cause coding problems.

5.3. URLEncoder.encode()

This is the URL-encoding function provided by Java itself; what it does is similar to what the browser does when the UTF-8 option above is enabled. Note that the form of this method that takes no encoding is deprecated; an encoding should always be supplied.

When no encoding is specified, the method uses the system default encoding, which makes the program's result nondeterministic. For example, for "中文", when the system default is "gb2312" the result is "%D6%D0%CE%C4", but when the default is "UTF-8" the result is "%E4%B8%AD%E6%96%87", which makes things hard for downstream code. Moreover, the system default encoding is determined at run time by the environment variables LC_ALL, LANG and so on; there was a case where garbled text appeared after Tomcat was restarted, and the eventual, frustrating discovery was that these two environment variables had been changed.

It is recommended to specify "UTF-8" everywhere; existing programs may need to be modified accordingly.
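A short sketch of the recommended form, with the encoding always passed explicitly. The wrapper method encode and the class name are invented for this example; URLEncoder.encode(String, String) is the real JDK method.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class UrlEncodeDemo {
    /** URL-encodes the text using the named charset instead of the platform default. */
    static String encode(String text, String charsetName) {
        try {
            return URLEncoder.encode(text, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String s = "\u4e2d\u6587";
        System.out.println(encode(s, "UTF-8")); // %E4%B8%AD%E6%96%87
        System.out.println(encode(s, "GBK"));   // %D6%D0%CE%C4
    }
}
```

Because the charset is an argument, the output no longer changes when LC_ALL or LANG is modified on the server.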

5.4. One solution

As mentioned above, because browser settings differ, the web server receives different content for the same link, and the software has no way of knowing which case it is dealing with; in this respect the protocol is flawed.

In practice, you should neither hope that every client's IE has the UTF-8 option enabled nor bluntly ask users to change their IE settings; users cannot be expected to remember what each web server wants. The remaining solution is to make the program a little smarter: inspect the content to decide whether it is UTF-8.

Fortunately UTF-8 is quite regular, so the transmitted link content can be analysed to decide whether it consists of valid UTF-8 characters. If it does, treat it as UTF-8; if not, use the client's default encoding (such as "GBK"). The following is an example of such a check; it is easy to follow if you know the rules.

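    public static boolean isValidUtf8(byte[] b, int aMaxCount) {
        int lLen = b.length, lCharCount = 0;
        for (int i = 0; i < lLen && lCharCount < aMaxCount; ++lCharCount) {
            byte lByte = b[i++]; // advance now so that i points at the first continuation byte
            if (lByte >= 0) continue; // >= 0 is plain ASCII
            // Java bytes are signed, so these comparisons are between negative values:
            // 0x80-0xBF cannot start a sequence, and 0xFE/0xFF never occur in UTF-8.
            if (lByte < (byte) 0xc0 || lByte > (byte) 0xfd) return false;
            // Number of continuation bytes implied by the lead byte (original 1-6 byte scheme).
            int lCount = lByte >= (byte) 0xfc ? 5 : lByte >= (byte) 0xf8 ? 4
                    : lByte >= (byte) 0xf0 ? 3 : lByte >= (byte) 0xe0 ? 2 : 1;
            if (i + lCount > lLen) return false;
            // Every continuation byte must lie in 0x80-0xBF.
            for (int j = 0; j < lCount; ++j, ++i) if (b[i] >= (byte) 0xc0) return false;
        }
        return true;
    }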

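Accordingly, an example of using the method above is as follows (it needs import java.io.UnsupportedEncodingException, and assumes isValidUtf8 lives in a class named StringUtil):

    public static String getUrlParam(String aStr, String aDefaultCharset)
            throws UnsupportedEncodingException {
        if (aStr == null) return null;
        // Recover the raw bytes that the container decoded as ISO-8859-1.
        byte[] lBytes = aStr.getBytes("ISO-8859-1");
        return new String(lBytes,
                StringUtil.isValidUtf8(lBytes, lBytes.length) ? "UTF-8" : aDefaultCharset);
    }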

However, this approach also has shortcomings, in the following two respects:

• It does not identify the user's default encoding. This could be guessed from the language information in the request, but not reliably, since we sometimes type Korean or other scripts.

• It may wrongly judge text to be UTF-8. One example is the two characters "学习", whose GBK encoding is "\xd1\xa7\xcf\xb0"; the isValidUtf8 method above returns true for it. A stricter check could be considered, but it is unlikely to help much.
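One way to try a stricter check is to let the JDK's own UTF-8 decoder do the validation. The sketch below is an alternative technique, not the method the article uses: it relies on CharsetDecoder configured to report errors, and the class and method names are invented.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8Check {
    /** True if the bytes decode as UTF-8 without any malformed or unmappable input. */
    static boolean isStrictUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] gbkZhongWen = {(byte) 0xd6, (byte) 0xd0, (byte) 0xce, (byte) 0xc4};
        byte[] gbkXueXi = {(byte) 0xd1, (byte) 0xa7, (byte) 0xcf, (byte) 0xb0};
        System.out.println(isStrictUtf8(gbkZhongWen)); // false: 0xd0 is not a continuation byte
        System.out.println(isStrictUtf8(gbkXueXi));    // true: still a false positive
    }
}
```

The strict decoder rejects overlong forms and other malformed input, but the GBK bytes of "学习" happen to be two well-formed two-byte UTF-8 sequences, so no byte-level check can tell the two readings apart. This supports the remark above that a stricter test brings little.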

As an example, Google has run into the same problem and has evidently adopted a similar approach: if you type "http://www.google.com/search?hl=zh-cn&newwindow=1&q=学习" into the address bar, Google fails to recognise the query correctly, whereas other Chinese characters are generally recognised.

Finally, if you do not use rewrite rules, or you submit data through a form, you will not necessarily run into this problem, because the desired encoding can be specified when the data is submitted. Chinese file names, however, do cause problems and should be used with caution.

6. Other

Some other encoding-related issues are described below.

6.1. SecureCRT

Besides the browser and the console, some client programs are also affected by encoding. For example, when connecting to Linux with SecureCRT, the SecureCRT display encoding (each session can have its own setting) should be kept consistent with the Linux encoding environment variables; otherwise some help text may appear garbled.

MySQL also has its own encoding settings, which should likewise be kept consistent with the SecureCRT display encoding. Otherwise, SQL statements executed through SecureCRT may not handle Chinese characters correctly, and query results may appear garbled.

For UTF-8 files, many editors (Notepad, for example) add three invisible marker bytes at the start of the file. If the file is to be used as MySQL input, these three bytes must be removed (saving the file with vi on Linux removes them). An interesting phenomenon: on Chinese Windows, create a new .txt file, open it in Notepad, type the two characters "联通", save, and reopen it; the two characters are gone, and only a small black dot remains.
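The three marker bytes are the UTF-8 byte order mark, EF BB BF. If a file has to be cleaned up in code rather than in an editor, something along these lines could be used; BomStripper and stripUtf8Bom are hypothetical names, and the sketch works on a byte array that the caller has already read from the file.

```java
import java.util.Arrays;

class BomStripper {
    /** Returns the data without a leading UTF-8 byte order mark (EF BB BF), if one is present. */
    static byte[] stripUtf8Bom(byte[] data) {
        boolean hasBom = data.length >= 3
                && (data[0] & 0xff) == 0xEF
                && (data[1] & 0xff) == 0xBB
                && (data[2] & 0xff) == 0xBF;
        return hasBom ? Arrays.copyOfRange(data, 3, data.length) : data;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'S', 'E', 'T'};
        System.out.println(stripUtf8Bom(withBom).length); // 3
    }
}
```

Data without the marker is returned unchanged, so the method is safe to apply to every input file.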

6.2. Filter

If the encoding needs to be set uniformly, doing it in a filter is a good choice. In the filter class you can set the encoding of the request or response as required; see setCharacterEncoding() above. Apache provides an example of such a class, SetCharacterEncodingFilter, which can be used directly.

6.3. POST and GET

Clearly, when information is submitted by POST the URL stays more readable, and setCharacterEncoding() makes it easy to handle the character set. A URL formed by the GET method, on the other hand, expresses the actual content of the page more directly and can be bookmarked.

For the sake of uniformity GET is recommended. This means parameters need special handling in the program and the convenience of setCharacterEncoding() is lost. If rewrite is not involved and IE's UTF-8 issue does not arise, setting URIEncoding is a convenient way to obtain URL parameters correctly.

6.4. Simplified/traditional conversion

GBK contains both simplified and traditional characters, so the same word may exist as two different strings under GBK, one in each form. Sometimes, in order to obtain complete and correct results, traditional and simplified forms should be unified. One option is to convert all traditional characters in UTF and GBK data to the corresponding simplified characters; BIG5 data should likewise be converted to simplified characters. The result is of course still stored in UTF encoding.

For example, for "语言 語言", the UTF representation is "\xe8\xaf\xad\xe8\xa8\x80 \xe8\xaa\x9e\xe8\xa8\x80"; after simplified/traditional conversion it should become two identical copies of "\xe8\xaf\xad\xe8\xa8\x80".
