Classic: Research on Java and related character set encoding

Source: Internet
Author: User

1. Overview

This article mainly includes the following aspects: Basic coding knowledge, Java, system software, URL, tool software, etc.

In the following description, we take the two-character word "中文" ("Chinese") as the example. Its gb2312 encoding is "d6d0 cec4", its Unicode encoding is "4e2d 6587", and its UTF-8 encoding is "e4b8ad e69687". Note that these two characters have no iso8859-1 encoding, although their bytes can still be carried in an iso8859-1 string, as explained below.

2. Basic coding knowledge

One of the earliest encodings was iso8859-1, a single-byte extension of ASCII. Many other standard encodings have since emerged to represent various languages. The important ones are described below.

2.1. iso8859-1

It is a single-byte encoding that covers the character values 0-255 and is used for English and other Western European languages. For example, the letter a is encoded as 0x61 = 97.

Clearly, iso8859-1 covers a narrow range of characters and cannot represent Chinese characters. However, because it is a single-byte encoding, and the byte is the computer's most basic unit of representation, iso8859-1 is still often used as a carrier for other data, and many protocols use it by default. For example, although "中文" has no iso8859-1 encoding, its gb2312 encoding "d6d0 cec4" (two characters) can be treated as iso8859-1 by splitting it into 4 bytes: "d6 d0 ce c4" (in fact, storage is also byte-oriented). For UTF-8 it becomes 6 bytes: "e4 b8 ad e6 96 87". Obviously, this representation only makes sense on top of another encoding: the receiver must know which one in order to turn the bytes back into characters.

2.2. gb2312/GBK

This is the Chinese national standard encoding, designed to represent Chinese characters. Chinese characters are encoded in two bytes, while English letters are encoded in one byte exactly as in iso8859-1 (that is, it is compatible with the ASCII range that iso8859-1 shares). GBK can represent both traditional and simplified Chinese characters, while gb2312 can only represent simplified ones. GBK is backward compatible with gb2312.

2.3. Unicode

This is the most unified encoding and can represent characters in all languages. In the form discussed here it is a fixed-length double-byte encoding (a four-byte form also exists), and the fixed length applies to English letters as well. It is therefore not byte-compatible with iso8859-1 or with any other encoding. However, for characters that exist in iso8859-1, the Unicode encoding simply adds a 0 byte in front; for example, the letter a is "00 61".

Fixed-length encodings are easy for computers to process (note that gb2312/GBK is not fixed-length), and Unicode can represent all characters. For these reasons, many software systems, such as Java, use Unicode internally.

2.4. UTF

Unicode is not compatible with iso8859-1 and wastes space, since even English letters need two bytes, which makes it inconvenient for transmission and storage. UTF encoding (UTF-8) was created to address this. It is compatible with the single-byte encoding of English letters and can still represent characters in all languages. However, UTF is not a fixed-length encoding: each character takes from 1 to 6 bytes in the original design (the current standard limits this to 4). UTF also has a simple built-in validation property, because lead bytes and continuation bytes follow fixed bit patterns. Generally, English letters take one byte and Chinese characters take three bytes.

Note: Although UTF was designed to save space, this is only relative to Unicode. If the text is known to consist of Chinese characters, gb2312/GBK is undoubtedly the most economical, at two bytes per character. On the other hand, even though UTF uses three bytes per Chinese character, UTF still saves space over Unicode for Chinese webpages, because a webpage contains many English characters.
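The size comparison can be checked with a small Java sketch. It assumes that UTF-16BE is an acceptable stand-in for the two-byte "Unicode" discussed here and that the JDK in use ships the GBK charset; the class and method names are illustrative only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedSizeDemo {
    // Number of bytes needed to store text in the given charset.
    public static int encodedSize(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");   // assumption: GBK is available in this JDK
        String chinese = "\u4e2d\u6587";        // the two characters of the running example
        String mixed = "<p>\u4e2d\u6587</p>";   // markup plus Chinese text, as in a webpage

        for (String s : new String[] {chinese, mixed}) {
            System.out.println("GBK=" + encodedSize(s, gbk)
                    + " UTF-8=" + encodedSize(s, StandardCharsets.UTF_8)
                    + " UTF-16BE=" + encodedSize(s, StandardCharsets.UTF_16BE));
        }
    }
}
```

For the pure Chinese string, GBK and the two-byte form tie at 4 bytes while UTF-8 needs 6; once ASCII markup is mixed in, UTF-8 (13 bytes) beats the two-byte form (18 bytes), which is the trade-off described in the note.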

3. Java processing of Characters

In Java application software, character set encoding is involved in many cases. In some cases, correct settings are required, and in some cases, certain processing is required.

3.1. getBytes(charset)

This is a standard Java String method. It encodes the characters of the string according to charset and returns the result as a byte array. Note that strings are always stored in memory in Unicode. For example, "中文" is normally (that is, when nothing has gone wrong) stored as "4e2d 6587". If charset is "GBK", it is encoded as "d6d0 cec4" and the bytes "d6 d0 ce c4" are returned. If charset is "utf8", the result is "e4 b8 ad e6 96 87". If charset is "iso8859-1", the characters cannot be encoded, so "3f 3f" (two question marks) is returned.
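The three cases can be reproduced with a short sketch. The helper name hex is an illustrative choice, and the example assumes the JDK provides the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class GetBytesDemo {
    // Format bytes as space-separated lowercase hex, e.g. "d6 d0 ce c4".
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u4e2d\u6587"; // held in memory as Unicode 4e2d 6587
        System.out.println(hex(s.getBytes(Charset.forName("GBK"))));      // d6 d0 ce c4
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));      // e4 b8 ad e6 96 87
        System.out.println(hex(s.getBytes(StandardCharsets.ISO_8859_1))); // 3f 3f
    }
}
```

Writing the string with \u escapes keeps the example independent of the encoding in which the source file itself is saved.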

3.2. new String(bytes, charset)

This is another standard Java String operation. It is the inverse of the previous method: it interprets a byte array according to charset and converts it to the in-memory Unicode form. Continuing the getBytes example, the "GBK" and "utf8" byte arrays both decode to the correct result "4e2d 6587", but the iso8859-1 bytes decode to "003f 003f" (two question marks).

Because utf8 can represent every character, new String(str.getBytes("utf8"), "utf8") equals str; that is, the conversion is completely reversible.
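A minimal sketch of the decoding direction, again assuming a JDK with GBK support; the decode helper is a hypothetical convenience wrapper, not part of the Java API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class NewStringDemo {
    // Decode raw bytes into a String using the named charset.
    public static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] gbkBytes = {(byte) 0xd6, (byte) 0xd0, (byte) 0xce, (byte) 0xc4};
        byte[] utf8Bytes = {(byte) 0xe4, (byte) 0xb8, (byte) 0xad,
                            (byte) 0xe6, (byte) 0x96, (byte) 0x87};
        byte[] lossyBytes = {0x3f, 0x3f}; // what getBytes("iso8859-1") produced

        String expected = "\u4e2d\u6587";
        System.out.println(decode(gbkBytes, "GBK").equals(expected));    // true
        System.out.println(decode(utf8Bytes, "UTF-8").equals(expected)); // true
        System.out.println(decode(lossyBytes, "ISO-8859-1"));            // ??

        // The UTF-8 round trip is lossless for this string.
        String roundTrip = new String(expected.getBytes(StandardCharsets.UTF_8),
                                      StandardCharsets.UTF_8);
        System.out.println(roundTrip.equals(expected));                  // true
    }
}
```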

3.3. setCharacterEncoding()

This method sets the encoding of the HTTP request or response.

For a request, it specifies the encoding of the submitted content. Once it is set, getParameter() returns correct strings directly. If it is not set, iso8859-1 is used by default and the result needs further processing; see "form input" below. Note that getParameter() must not be called before setCharacterEncoding(). The Javadoc says: "This method must be called prior to reading request parameters or reading input using getReader()". The setting is effective only for the POST method and has no effect on GET. The likely reason is that the first call to getParameter() parses all submitted content using the encoding in effect at that moment, and later calls do not parse again, so a later setCharacterEncoding() has no effect. For a form submitted with GET, the content is part of the URL and has already been parsed with some encoding at the start, so setCharacterEncoding() is naturally ineffective.

For response, the encoding of the output content is specified. At the same time, this setting is passed to the browser to tell the browser the encoding of the output content.

3.4. Handling process

The following two representative examples illustrate how Java handles coding problems.

3.4.1. form input

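User input -> [*, e.g. GBK: d6d0 cec4] -> browser -> [*, e.g. GBK: d6d0 cec4] -> web server -> [iso8859-1: 00d6 00d0 00ce 00c4] -> class. The class must then repair the string: getBytes("iso8859-1") yields the bytes d6 d0 ce c4, new String(bytes, "GBK") interprets them as d6d0 cec4, and the result is stored in memory in Unicode as 4e2d 6587. Here * marks an encoding that is not fixed in advance.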

- The encoding of the user's input depends on the encoding specified by the page and on the user's operating system, so it is not fixed. The example above uses GBK.

- From browser to web server, the form can specify the character set used to submit the content; otherwise the encoding specified by the page is used. If parameters are typed directly into the URL after "?", the encoding is usually that of the operating system, because no page is involved. The example above uses GBK.

- The web server receives a byte stream. By default, getParameter() decodes it as iso8859-1, which gives an incorrect result that must be repaired. If the encoding is set in advance with request.setCharacterEncoding(), the correct result is obtained directly.

- It is good practice to specify the encoding in the page; otherwise the encoding is out of your control and cannot be determined reliably.
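The repair step in the form-input flow can be sketched without a servlet container. The string below simulates what getParameter() would return after decoding GBK bytes as iso8859-1; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class FormParamRepair {
    // Re-interpret a string that was decoded as ISO-8859-1 although the
    // browser actually sent bytes in realCharset.
    public static String repair(String misdecoded, String realCharset) {
        // ISO-8859-1 maps chars 0x00-0xFF one-to-one onto bytes, so the
        // original byte stream is recovered exactly.
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, Charset.forName(realCharset));
    }

    public static void main(String[] args) {
        // Simulated parameter value: 00d6 00d0 00ce 00c4
        String received = "\u00d6\u00d0\u00ce\u00c4";
        String fixed = repair(received, "GBK");
        System.out.println(fixed.equals("\u4e2d\u6587")); // true
    }
}
```

The repair only works because iso8859-1 decoding never loses information; it also depends on knowing the encoding the browser really used, which is why specifying the page encoding matters.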

3.4.2. File Compilation

Assume the file is saved in GBK. There are two likely encodings at compile time: GBK, the default on Chinese Windows, or iso8859-1, the default on Linux. The encoding can also be specified explicitly during compilation.

JSP -> [*, GBK: d6d0 cec4] -> Java file -> [*, GBK: d6d0 cec4] -> compiler reads into Unicode (as GBK: 4e2d 6587; as iso8859-1: 00d6 00d0 00ce 00c4) -> compiler writes UTF (GBK: e4b8ad e69687; iso8859-1: *) -> compiled file -> loaded as Unicode (GBK: 4e2d 6587; iso8859-1: 00d6 00d0 00ce 00c4) -> class. Therefore, saving a file in GBK and compiling it as iso8859-1 gives an incorrect result.

class -> [Unicode: 4e2d 6587] -> System.out / JSP out -> [GBK: d6d0 cec4] -> OS console / browser.

- Files can be saved in several encodings. On Chinese Windows, the default is ANSI/GBK.

- When the compiler reads a file, it needs to know the file's encoding. If none is specified, the system default encoding is used. Java source files are generally saved in the system default encoding, so compilation causes no problem. JSP files, however, may be edited and saved on Chinese Windows and then deployed, compiled, and run on English Linux, where problems can occur. Therefore, always use pageEncoding to specify the encoding in a JSP file.

- During compilation, Java converts the source to Unicode for processing, and converts it to UTF when writing the compiled file.

- When the system outputs characters, it encodes them according to the specified encoding. For the console, this is the system default (GBK on Chinese Windows). For the response (browser), the contentType specified in the JSP file header is used, or the encoding can be set directly on the response; the browser is told the page encoding at the same time. If nothing is specified, iso8859-1 is used. For Chinese characters, the output encoding for the browser should therefore always be specified.

- When the browser displays a page, it first uses the encoding specified in the response (the contentType from the JSP file header is also reflected in the response). If none is specified, it uses the contentType given by the meta element in the page.
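The output step can be illustrated with a PrintStream that writes into a byte buffer instead of a real console or response. This is only a sketch of how the chosen output encoding determines the bytes that leave the program; the names are illustrative.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class OutputEncodingDemo {
    // Print text through a PrintStream that uses the given encoding and
    // return the bytes that were actually written.
    public static byte[] printWith(String text, String encoding) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(sink, true, encoding)) {
            out.print(text);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unsupported encoding: " + encoding, e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String s = "\u4e2d\u6587";
        for (String enc : new String[] {"GBK", "UTF-8", "ISO-8859-1"}) {
            StringBuilder hex = new StringBuilder();
            for (byte b : printWith(s, enc)) {
                hex.append(String.format("%02x ", b & 0xff));
            }
            System.out.println(enc + ": " + hex.toString().trim());
        }
    }
}
```

With iso8859-1 as the output encoding, the two characters degrade to question marks (3f 3f), which matches the garbled output seen when no encoding is specified for Chinese text.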

3.5. Several settings

For Web applications, encoding-related settings or functions are as follows.

3.5.1. JSP compilation

Specifies the encoding in which the file is stored. This setting should be placed at the beginning of the file. For example: <%@ page pageEncoding="GBK" %>. For ordinary Java source files, the encoding can be specified at compile time (javac -encoding).

3.5.2. JSP output

Specifies the encoding used when the file is output to the browser. This setting should also be placed at the beginning of the file. For example: <%@ page contentType="text/html; charset=GBK" %>. This setting is equivalent to response.setCharacterEncoding("GBK").

3.5.3. Meta settings

Specifies the encoding used by the webpage. This setting is particularly useful for static webpages, because they cannot use JSP settings and cannot call response.setCharacterEncoding(). For example: <meta http-equiv="Content-Type" content="text/html; charset=GBK">

If both JSP output and meta encoding are used, the JSP encoding takes precedence. Because the content specified by JSP is directly reflected in response.

Note that Apache has a setting that assigns an encoding to any page served without one. It works like the JSP output setting, so it overrides the meta encoding specified in static pages. For this reason, it is recommended that you disable it.

3.5.4. Form settings

When the browser submits a form, the form can specify the encoding to use. For example: <form accept-charset="gb2312">. Generally you do not need this setting, because the browser uses the page encoding directly.

4. System Software

The following describes some related system software.

4.1. MySQL database

To support multiple languages, the database encoding should be set to UTF or Unicode; UTF is generally better suited for storage. However, Unicode is the more compact choice if the Chinese data contains few English letters.

The database encoding can be set in the MySQL configuration file, for example default-character-set=utf8. It can also be set in the database connection URL, for example useUnicode=true&characterEncoding=UTF-8. The two settings should be consistent. In newer MySQL versions the connection URL setting can be omitted, but it must not be set incorrectly.

4.2. Apache

Apache's encoding-related configuration is in httpd.conf, for example AddDefaultCharset UTF-8. As mentioned above, this directive sets the encoding of all static pages to UTF-8; it is best to disable it.

In addition, Apache has a separate module to process the webpage Response Header, which may also be used to set the encoding.

4.3. Default Linux Encoding

The default Linux encoding is determined by environment variables at run time. The two important ones are LC_ALL and LANG. The default encoding affects the behavior of Java's URLEncoder, as described below.

We recommend that you set it to "zh_CN.UTF-8 ".

4.4. Others

To support Chinese file names, Linux should specify the character set when mounting disks, for example: mount /dev/hda5 /mnt/hda5/ -t ntfs -o iocharset=gb2312.

In addition, as mentioned above, information submitted with the GET method is not affected by request.setCharacterEncoding(), but its character set can be specified in the Tomcat configuration file server.xml, through the URIEncoding attribute of the Connector element. This applies one setting to all requests rather than to a specific page, and it is not necessarily the encoding the browser actually used, so the result is sometimes not what is expected.

5. URL address

It is very troublesome to include Chinese characters in the URL address. We have previously described how to submit a form using the get method. When using the get method, the parameter is included in the URL.

5.1. URL encoding

The browser automatically encodes special characters in the URL, except for characters such as "/", "?", and "&". The characters that are encoded include non-ASCII characters such as "中", and the way they are encoded needs special attention.

IE has an option "Always send URLs as UTF-8". When the option is enabled, IE encodes special characters as UTF-8 and then applies URL encoding. When it is disabled, IE uses the default encoding "GBK" and applies no URL encoding. The parameters after the "?" are never encoded, as if the UTF-8 option were disabled. For example, for "中文.html?a=中文", IE sends "%e4%b8%ad%e6%96%87.html?a=\xd6\xd0\xce\xc4" when the option is enabled and "\xd6\xd0\xce\xc4.html?a=\xd6\xd0\xce\xc4" when it is disabled. Note that "中文" in the path takes only 4 bytes in the latter but 18 bytes in the former, mainly because of URL encoding.

When the web server (Tomcat) receives the link, it performs URL decoding, that is, it removes the "%" escapes and decodes the bytes as ISO8859-1 (as described above, Tomcat's URIEncoding setting can select a different encoding). For the example above the results are "\u00e4\u00b8\u00ad\u00e6\u0096\u0087.html?a=\u00d6\u00d0\u00ce\u00c4" and "\u00d6\u00d0\u00ce\u00c4.html?a=\u00d6\u00d0\u00ce\u00c4". Note that "中文" in the path of the former has become 6 characters. Here "\u" denotes a Unicode character.
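The effect of the server-side decoding charset can be imitated with java.net.URLDecoder. This is only a simulation of the behavior described above, not Tomcat's actual code path, and decodeAs is a hypothetical helper.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class UrlDecodeDemo {
    // Remove the %XX escapes and interpret the resulting bytes in the
    // named charset.
    public static String decodeAs(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unsupported charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String path = "%e4%b8%ad%e6%96%87.html";

        // Decoded as ISO-8859-1: six Latin-1 characters followed by ".html".
        String latin1 = decodeAs(path, "ISO-8859-1");
        System.out.println(latin1.length()); // 11

        // Decoded as UTF-8: the two intended characters followed by ".html".
        String utf8 = decodeAs(path, "UTF-8");
        System.out.println(utf8.equals("\u4e2d\u6587.html")); // true
    }
}
```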

As a result, the same link produces different results on the server depending on client settings. Many people have run into this problem, and there is no really good solution, so some websites simply advise users to disable the UTF-8 option. A better approach is described below.

5.2. Rewrite

As is well known, Apache has a powerful rewrite module, which is not described in detail here. Note that this module automatically decodes the URL (removes the "%" escapes), which does part of the work described above for the web server (Tomcat). The [NE] flag is reported to disable this behavior, but I could not get it to work in testing, possibly because of the version (Apache 2.0.54 was used). In addition, when a parameter contains "?" or similar symbols, this decoding prevents the system from getting correct results.

Rewrite itself seems to adopt the byte processing method completely without considering the character string encoding, so it will not bring about Encoding Problems.

5.3. URLEncoder.encode()

This is the URL encoding function provided by Java itself. Its work is similar to what the browser does when the UTF-8 option above is enabled. Note that the form of this method that takes no encoding argument is deprecated; an encoding should always be passed.

If no encoding is specified, the method uses the system default encoding, which makes the program's results unpredictable. For example, for "中文" the result is "%D6%D0%CE%C4" when the system default encoding is "gb2312" and "%E4%B8%AD%E6%96%87" when it is "UTF-8", which is hard to handle later. In addition, the default encoding is determined by the environment variables LC_ALL and LANG in effect when Tomcat starts. I once had garbled characters appear right after a Tomcat restart, and the frustrating cause turned out to be that these two environment variables had been modified.

It is recommended that you specify it as a "UTF-8" encoding in a unified manner, and you may need to modify the corresponding program.
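A small sketch of encoding with an explicit charset, so that the result no longer depends on LC_ALL or LANG. The wrapper class is illustrative and assumes a JDK with GBK support.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class UrlEncodeDemo {
    // URL-encode text with an explicit charset instead of the platform default.
    public static String encode(String text, String charsetName) {
        try {
            return URLEncoder.encode(text, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unsupported charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String s = "\u4e2d\u6587";
        System.out.println(encode(s, "UTF-8")); // %E4%B8%AD%E6%96%87
        System.out.println(encode(s, "GBK"));   // %D6%D0%CE%C4
    }
}
```

The two outputs differ for the same input, which is exactly why leaving the charset to the system default makes results unpredictable across machines.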

5.4. One solution

As mentioned above, the web server receives different content for the same link depending on browser settings, and the software cannot tell the difference. In this respect the protocol is still flawed.

In practice, you should not hope that every customer has IE's UTF-8 option enabled, and you should not bluntly ask users to change their IE settings; users cannot be expected to remember the requirements of every web server. The remaining option is to make your program a little smarter: analyze the content to decide whether it is encoded in UTF-8.

Fortunately, UTF-8 encoding is quite regular, so you can analyze the bytes of the received link to determine whether they form valid UTF-8. If they do, handle them as UTF-8; if not, use the customer's default encoding (such as "GBK"). The following example checks whether a byte array is UTF-8; it is easy to follow once you know the encoding rules.

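    // Returns true if the first aMaxCount characters of b form valid UTF-8.
    public static boolean isValidUtf8(byte[] b, int aMaxCount) {
        int lLen = b.length, lCharCount = 0;
        for (int i = 0; i < lLen && lCharCount < aMaxCount; ++lCharCount) {
            byte lByte = b[i++]; // advance now, so i points at the first continuation byte
            if (lByte >= 0) continue; // >= 0 is plain ASCII
            // Java bytes are signed: 0x80..0xBF cannot start a character, 0xFE/0xFF never occur
            if (lByte < (byte) 0xc0 || lByte > (byte) 0xfd) return false;
            // number of continuation bytes announced by the lead byte
            int lCount = lByte >= (byte) 0xfc ? 5 : lByte >= (byte) 0xf8 ? 4
                    : lByte >= (byte) 0xf0 ? 3 : lByte >= (byte) 0xe0 ? 2 : 1;
            if (i + lCount > lLen) return false;
            // every continuation byte must lie in 0x80..0xBF
            for (int j = 0; j < lCount; ++j, ++i) if (b[i] >= (byte) 0xc0) return false;
        }
        return true;
    }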

Correspondingly, an example of using the above method is as follows:

    // Requires: import java.io.UnsupportedEncodingException;
    public static String getUrlParam(String aStr, String aDefaultCharset)
            throws UnsupportedEncodingException {
        if (aStr == null) return null;
        // recover the raw bytes that the container decoded as ISO-8859-1
        byte[] lBytes = aStr.getBytes("ISO-8859-1");
        return new String(lBytes,
                StringUtil.isValidUtf8(lBytes, lBytes.length) ? "utf8" : aDefaultCharset);
    }

However, this method also has defects in the following two aspects:

- It does not identify the user's default encoding. This can be guessed from the language of the request, but the guess is not necessarily right, because users sometimes also enter Korean or other text.

- It may misjudge non-UTF-8 text as UTF-8. One example is "学习": its GBK encoding is "\xd1\xa7\xcf\xb0", and isValidUtf8 returns true for it. A more rigorous check could be considered, but it is unlikely to help much.
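One candidate for a more rigorous check is the JDK's own UTF-8 decoder configured to report errors. The sketch below is an alternative to the hand-written check, not the author's method, and the class and method names are hypothetical. It also shows that strictness does not remove the second defect: the GBK bytes of "学习" happen to be well-formed UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Guess {
    // True if the bytes are well-formed UTF-8 according to the JDK decoder.
    public static boolean isStrictUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Decode as UTF-8 when the bytes are well formed, otherwise fall back
    // to the caller's default charset.
    public static String decodeGuessing(byte[] bytes, String fallbackCharset) {
        Charset cs = isStrictUtf8(bytes)
                ? StandardCharsets.UTF_8 : Charset.forName(fallbackCharset);
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        byte[] gbkChinese = {(byte) 0xd6, (byte) 0xd0, (byte) 0xce, (byte) 0xc4};
        byte[] gbkLearning = {(byte) 0xd1, (byte) 0xa7, (byte) 0xcf, (byte) 0xb0};
        System.out.println(isStrictUtf8(gbkChinese));  // false, falls back to GBK
        System.out.println(isStrictUtf8(gbkLearning)); // true, a false positive
        System.out.println(decodeGuessing(gbkChinese, "GBK").equals("\u4e2d\u6587")); // true
    }
}
```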

There is evidence that Google has encountered the same problem and uses a similar approach: if you enter "http://www.google.com/search?hl=zh-CN&newwindow=1&q=学习", Google cannot recognize the query correctly, while other Chinese words are generally recognized normally.

Finally, if the rewrite rule is not used, or if data is submitted through a form, the problem above may not occur, because the desired encoding can be specified when submitting data. Chinese file names, on the other hand, do cause problems and should be used with caution.

6. Others

The following describes some encoding-related issues.

6.1. SecureCRT

Besides browsers and consoles, some client programs are also involved in encoding. For example, when SecureCRT is used to connect to Linux, its display encoding should match the encoding set by the Linux environment variables (different sessions can have different encoding settings). Otherwise, some help messages may appear garbled.

In addition, MySQL has its own encoding settings, which should also match SecureCRT's display encoding. Otherwise, Chinese characters may not be handled correctly when SQL statements are executed through SecureCRT, and query results may be garbled.

For UTF-8 files, many editors (such as Notepad) add three invisible marker bytes at the beginning of the file. If the file is used as MySQL input, these three bytes must be removed (for example, by saving the file with vi on Linux). An interesting phenomenon: on Chinese Windows, create a new TXT file, open it in Notepad, type the word "联通", save, and open it again. The two characters are gone and only a small black dot remains.
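The three marker bytes are the UTF-8 byte order mark, EF BB BF. As an alternative to re-saving the file in an editor, they can be dropped programmatically; the following is a minimal sketch with illustrative names.

```java
import java.util.Arrays;

class BomStripper {
    // Return the data without a leading UTF-8 byte order mark (EF BB BF).
    // Input that has no mark is returned unchanged.
    public static byte[] stripUtf8Bom(byte[] data) {
        boolean hasBom = data.length >= 3
                && (data[0] & 0xff) == 0xef
                && (data[1] & 0xff) == 0xbb
                && (data[2] & 0xff) == 0xbf;
        return hasBom ? Arrays.copyOfRange(data, 3, data.length) : data;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xef, (byte) 0xbb, (byte) 0xbf, 'S', 'E', 'L'};
        System.out.println(stripUtf8Bom(withBom).length); // 3
    }
}
```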

6.2. Filter

If the encoding needs to be set uniformly, a filter is a good choice. In a filter class, the encoding can be set for all requests or responses; see setCharacterEncoding() above. Apache provides an example of such a class that can be used directly: SetCharacterEncodingFilter.

6.3. Post and get

When information is submitted with POST, the URL stays cleaner, and setCharacterEncoding() makes character set problems easy to handle. However, a URL produced by the GET method expresses the actual content of the page and can be bookmarked.

For the sake of uniformity, we recommend the GET method, with special processing in the program to obtain parameters instead of relying on setCharacterEncoding(). If rewrite is not involved and IE's UTF-8 problem does not apply, consider setting URIEncoding so that URL parameters can be obtained easily.

6.4. Simplified and Traditional Chinese encoding conversion

GBK contains both simplified and traditional Chinese characters. That is, the simplified and traditional forms of the same word are two different character sequences under GBK because their codes differ. To get complete results, the two forms should sometimes be unified. Consider converting all traditional characters in UTF and GBK data to simplified characters; big5 data should be converted to simplified characters as well. The data is, of course, still stored in UTF.

For example, "语言 語言" is represented in UTF as "\xe8\xaf\xad\xe8\xa8\x80 \xe8\xaa\x9e\xe8\xa8\x80". After simplified/traditional conversion, it becomes two identical "\xe8\xaf\xad\xe8\xa8\x80" sequences.
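The conversion itself can be sketched as a character-by-character table lookup. The table below holds a single illustrative entry, just enough for the example above; a usable table would need thousands of entries, and all names here are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class TraditionalToSimplified {
    // Illustrative one-entry table; a real table would cover every
    // traditional character that has a simplified counterpart.
    private static final Map<Character, Character> TABLE = new HashMap<>();
    static {
        TABLE.put('\u8a9e', '\u8bed'); // traditional form -> simplified form
    }

    // Replace each character that has a table entry; leave the rest as is.
    public static String simplify(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            char mapped = TABLE.getOrDefault(c, c);
            sb.append(mapped);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String mixed = "\u8bed\u8a00 \u8a9e\u8a00"; // simplified word, then traditional word
        System.out.println(simplify(mixed).equals("\u8bed\u8a00 \u8bed\u8a00")); // true
    }
}
```

Because the string is already Unicode in memory, the same routine works regardless of whether the data arrived as UTF, GBK, or big5; only the decoding step differs.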
 
