Java character encoding and conversion

JVM
After the JVM starts, it sets a number of system properties that indicate its default locale:
user.language, user.region, file.encoding, and so on. You can use System.getProperties() to inspect all system properties in detail.
For example, on an English operating system (such as UNIX), you can force the JVM into a Chinese environment with the following property definitions: -Dclient.encoding.override=GBK -Dfile.encoding=GBK -Duser.language=zh -Duser.region=CN
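For example, a minimal sketch (class name is illustrative) that prints these locale-related properties of the current JVM:

public class ShowLocaleProps {
    public static void main(String[] args) {
        // Locale- and encoding-related system properties set at JVM startup.
        String[] keys = { "user.language", "user.region", "user.country", "file.encoding" };
        for (int i = 0; i < keys.length; i++) {
            System.out.println(keys[i] + " = " + System.getProperty(keys[i]));
        }
    }
}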
.java --> .class compilation
Note: javac normally determines the source file encoding from the current OS locale settings. You can use -encoding to specify the source file encoding explicitly.
Possible errors:
1. A GBK-encoded source file is compiled in an English environment, and javac cannot convert it correctly. I have seen this happen with Java/JSP on English UNIX. Test method: write the Chinese character in \u4e00 form, which bypasses javac's encoding conversion, then print the character as an int inside the JVM and compare the values; or open the .class file with a UTF-8-aware editor and check whether the constant strings store the Chinese characters correctly.
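A minimal sketch of the \u4e00 test described above (class name is illustrative): the escaped form bypasses javac's source-encoding conversion, so comparing the two values shows whether the literal was decoded correctly.

public class EncodingCheck {
    public static void main(String[] args) {
        char escaped = '\u4e00'; // "一" written as a Unicode escape; immune to source-encoding problems
        char literal = '一';      // decoded by javac from the source file bytes
        System.out.println((int) escaped); // always 19968 (0x4E00)
        System.out.println((int) literal); // also 19968 only if javac used the correct source encoding
    }
}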
File read/write
Converting external data into the characters used inside the JVM involves two steps: reading/writing and conversion. InputStream/OutputStream read and write the raw external bytes; Reader/Writer perform both the read/write and the conversion.
1. File read/write conversion is done by java.io.Reader/Writer. The byte streams InputStream/OutputStream are not suitable for handling Chinese characters; prefer Reader/Writer, such as FileReader/FileWriter.
2. FileReader/FileWriter read and write files using the current JVM default encoding. For other encodings, use InputStreamReader/OutputStreamWriter.
3. PrintStream is a bit special: it automatically converts using the JVM default encoding.
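A minimal sketch of point 2 (file names are illustrative): reading a GBK-encoded file and writing it back out as UTF-8 using explicit-encoding wrappers.

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class GbkToUtf8 {
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream("input-gbk.txt"), "GBK"));
        Writer out = new OutputStreamWriter(new FileOutputStream("output-utf8.txt"), "UTF-8");
        String line;
        while ((line = in.readLine()) != null) {
            out.write(line);
            out.write(System.getProperty("line.separator"));
        }
        out.close();
        in.close();
    }
}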
Read the .properties file
The .properties file is read by the Properties class using ISO8859-1 encoding, so you cannot write Chinese characters into it directly; you need the JDK native2ascii tool to convert the Chinese characters into \uXXXX form. Command line: native2ascii -encoding GBK inputfile outputfile
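A minimal sketch (file and key names are illustrative): after the native2ascii conversion, Properties.load() decodes the \uXXXX escapes back into the original Chinese characters.

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

public class LoadProps {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        FileInputStream in = new FileInputStream("messages.properties"); // the native2ascii output
        props.load(in); // \uXXXX escapes are decoded back into the original characters
        in.close();
        System.out.println(props.getProperty("greeting"));
    }
}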
Read XML files
1. Reading and writing XML files is the same as reading and writing ordinary files, but make sure the encoding declared in the XML header, e.g. <?xml version="1.0" encoding="gb2312"?>, is consistent with the actual file encoding.
2. The javax.xml.parsers.SAXParser class accepts an InputStream as its input parameter. For a Reader, you need to wrap it in an org.xml.sax.InputSource before handing it to the SAXParser.
3. For UTF-8-encoded XML, be sure to prevent the editor from automatically adding a \uFEFF BOM header, or the XML parser will report "Content is not allowed in prolog".
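A minimal sketch of point 2 (file name is illustrative): wrapping an explicit-encoding Reader in an InputSource before handing it to the SAXParser.

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ParseGb2312Xml {
    public static void main(String[] args) throws Exception {
        Reader reader = new InputStreamReader(new FileInputStream("config.xml"), "GB2312");
        InputSource source = new InputSource(reader); // carries already-decoded characters
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(source, new DefaultHandler());   // replace DefaultHandler with a real handler
        reader.close();
    }
}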
Byte array
1. Use new String(byteArray, encoding) and String.getBytes(encoding) to convert between byte arrays and strings.
2. You can also wrap the byte array in a ByteArrayInputStream/ByteArrayOutputStream to get a stream, and then use InputStreamReader/OutputStreamWriter for the conversion.
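A minimal sketch of point 1: converting between a GBK byte array and a String.

import java.io.UnsupportedEncodingException;

public class ByteArrayConversion {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String text = "中文";
        byte[] gbkBytes = text.getBytes("GBK");       // String -> GBK bytes (2 bytes per Chinese character)
        String decoded = new String(gbkBytes, "GBK"); // GBK bytes -> String
        System.out.println(gbkBytes.length);          // 4
        System.out.println(decoded.equals(text));     // true
    }
}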
Incorrectly transcoded strings (GBK decoded as ISO8859-1)
If the string we have was produced by wrong transcoding, for example GBK Chinese text decoded as ISO8859-1, the string seen in the debugger usually looks like garbled Latin characters, and its length is the byte length of the text rather than the number of Chinese characters.
You can use the following method to convert it back to correct Chinese:
text = new String(text.getBytes("ISO8859-1"), "GBK");
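A minimal sketch showing the symptom and the repair above:

import java.io.UnsupportedEncodingException;

public class FixMisdecodedString {
    public static void main(String[] args) throws UnsupportedEncodingException {
        byte[] gbkBytes = "中文".getBytes("GBK");
        String wrong = new String(gbkBytes, "ISO8859-1");              // garbled; one char per byte
        String fixed = new String(wrong.getBytes("ISO8859-1"), "GBK"); // repaired
        System.out.println(wrong.length()); // 4, the byte length, not the character count
        System.out.println(fixed);          // 中文
    }
}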

Web/servlet/jsp
1. For JSP, make sure the page header contains the <%@ page contentType="text/html; charset=gb2312" %> directive.
2. For servlets, make sure to call setContentType("text/html; charset=gb2312"). These two ensure that Chinese characters in the output display correctly.
3. Add <meta http-equiv="Content-Type" content="text/html; charset=gb2312"> to the <head> of the output HTML so the browser can determine the HTML encoding correctly.
4. Add a filter to the web application that explicitly calls setCharacterEncoding() on every request, so that Chinese characters in the input are parsed correctly. An example filter follows.
import java.io.IOException;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

/**
 * Example filter that sets the character encoding to be used in parsing
 * the incoming request.
 */
public class SetCharacterEncodingFilter implements Filter {

    protected boolean debug = false;
    protected String encoding = null;
    protected FilterConfig filterConfig = null;

    public SetCharacterEncodingFilter() {
    }

    public void init(FilterConfig filterConfig) throws ServletException {
        this.filterConfig = filterConfig;
        this.encoding = filterConfig.getInitParameter("encoding");
        this.debug = "true".equalsIgnoreCase(filterConfig.getInitParameter("debug"));
    }

    public void doFilter(ServletRequest request, ServletResponse response,
                         FilterChain chain) throws IOException, ServletException {
        // Force the configured encoding on every request. Alternatively, set it only
        // when the request does not already specify one:
        // if (request.getCharacterEncoding() == null) { request.setCharacterEncoding(encoding); }
        request.setCharacterEncoding(encoding);
        if (debug) {
            System.out.println(((HttpServletRequest) request).getRequestURI()
                    + " set to " + encoding);
        }
        chain.doFilter(request, response);
    }

    public void destroy() {
        this.encoding = null;
        this.filterConfig = null;
    }

    protected String getEncoding() {
        return this.encoding;
    }
}
Add the following to web.xml:
<filter>
  <filter-name>localEncodingFilter</filter-name>
  <display-name>localEncodingFilter</display-name>
  <filter-class>com.ccb.ectipmanager.request.SetCharacterEncodingFilter</filter-class>
  <init-param>
    <param-name>encoding</param-name>
    <param-value>gb2312</param-value>
  </init-param>
  <init-param>
    <param-name>debug</param-name>
    <param-value>false</param-value>
  </init-param>
</filter>
<filter-mapping>
  <filter-name>localEncodingFilter</filter-name>
  <url-pattern>/*</url-pattern>
</filter-mapping>
5. For WebLogic (vendor-specific):
1. Add the following to web.xml:
<context-param>
  <param-name>weblogic.httpd.inputCharset./*</param-name>
  <param-value>GBK</param-value>
</context-param>
2. (Optional) Add the following to weblogic.xml:
<charset-params>
  <input-charset>
    <resource-path>/*</resource-path>
    <java-charset-name>GBK</java-charset-name>
  </input-charset>
</charset-params>
Swing/AWT/SWT
For Swing/AWT, Java has default logical fonts such as Dialog/SansSerif. The mapping of these fonts to the system's real fonts is specified in the $JRE_HOME/lib/font.properties.XXX files. To rule out font problems, first make sure the JVM locale is zh_CN, so that the font.properties.zh_CN file takes effect; then check in font.properties.zh_CN whether the default fonts are mapped to Chinese fonts.
In Swing, Java interprets TTF fonts itself and renders the glyphs; for AWT and SWT, display is handed over to the operating system, so first make sure the system has Chinese fonts installed.
1. A Chinese character is displayed as "□": usually the display font is not a Chinese font. Unlike Windows, Java does not fall back to a default font for characters the current font cannot display (the sketch after this list checks which installed fonts can display Chinese).
2. Some uncommon Chinese characters cannot be displayed: usually the font's coverage of Chinese characters is incomplete. Try another Chinese font.
3. For AWT/SWT, first set the JVM runtime locale to Chinese, because conversions between the JVM and operating system API calls are involved here, and then investigate any remaining problems.
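A minimal sketch for the font checks above: list the fonts available to the JVM and report which of them can display a sample Chinese character.

import java.awt.Font;
import java.awt.GraphicsEnvironment;

public class ListChineseFonts {
    public static void main(String[] args) {
        GraphicsEnvironment ge = GraphicsEnvironment.getLocalGraphicsEnvironment();
        String[] families = ge.getAvailableFontFamilyNames();
        char sample = '\u4e2d'; // the character "中"
        for (int i = 0; i < families.length; i++) {
            Font font = new Font(families[i], Font.PLAIN, 12);
            if (font.canDisplay(sample)) {
                System.out.println(families[i]);
            }
        }
    }
}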
JNI
A jstring in JNI is handed to us in (modified) UTF-8 encoding, and we need to convert it to the local encoding. On Windows, you can use the WideCharToMultiByte/MultiByteToWideChar functions for the conversion; on UNIX, you can use the iconv library.
In the Sun JDK 1.4 source code, we can find a conversion approach that uses the JVM String object's getBytes(). It is relatively simple and cross-platform and needs no third-party library, but it is a little slow. The function prototypes are as follows:
/* Convert between Java strings and i18n C strings */
JNIEXPORT jstring
NewStringPlatform(JNIEnv *env, const char *str);
JNIEXPORT const char *
GetStringPlatformChars(JNIEnv *env, jstring jstr, jboolean *isCopy);
JNIEXPORT jstring JNICALL
JNU_NewStringPlatform(JNIEnv *env, const char *str);
JNIEXPORT const char * JNICALL
JNU_GetStringPlatformChars(JNIEnv *env, jstring jstr, jboolean *isCopy);
JNIEXPORT void JNICALL
JNU_ReleaseStringPlatformChars(JNIEnv *env, jstring jstr, const char *str);
Attachments: jni_util.h, jni_util.c
Additions in JDK 1.4/1.5
Character-set classes (Charset/CharsetEncoder/CharsetDecoder)
JDK 1.4 and later support character sets through the java.nio.charset package.
Common functions:
1. List the character sets supported by the JVM: Charset.availableCharsets()
2. Check whether a Unicode character can be encoded: CharsetEncoder.canEncode()
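A minimal sketch of these two calls:

import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.util.Iterator;

public class CharsetDemo {
    public static void main(String[] args) {
        // 1. List all character sets this JVM supports.
        Iterator it = Charset.availableCharsets().keySet().iterator();
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        // 2. Check whether GBK can encode a particular character.
        CharsetEncoder encoder = Charset.forName("GBK").newEncoder();
        System.out.println(encoder.canEncode('\u4e2d')); // "中" -> true
        System.out.println(encoder.canEncode('\u0f00')); // a Tibetan character -> false
    }
}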
FAQs
In the JVM, System.out.println cannot print Chinese characters correctly
System.out.println is a PrintStream, which transcodes using the JVM default character set. If the JVM default character set is ISO8859-1, Chinese characters will not display correctly. This problem is common on UNIX systems where the JVM locale has not been explicitly specified.
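One common workaround, as a sketch (assuming GBK output is wanted): replace System.out with a PrintStream that uses an explicit encoding instead of the JVM default.

import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ChinesePrintln {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // autoFlush = true, explicit charset instead of the JVM default
        PrintStream out = new PrintStream(new FileOutputStream(FileDescriptor.out), true, "GBK");
        System.setOut(out);
        System.out.println("中文");
    }
}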
In an English UNIX environment, System.out.println can print Chinese characters correctly, but internal processing is wrong
The Chinese characters were probably not transcoded correctly during input conversion:
That is: GBK text -> (decoded as ISO8859-1) -> JVM chars (Chinese bytes held as ISO8859-1 code units) -> (encoded as ISO8859-1) -> output.
The GBK Chinese characters are wrongly transcoded twice, and the two errors cancel out on the way to the output. Inside the JVM, however, the text is not correct Unicode: each byte of a Chinese character occupies its own char, and that is what causes this kind of error.
GB2312-80, GBK, GB18030-2000 Chinese Character Set
GB2312-80 was created in the early stage of the development of Chinese character information technology in China. It contains most of the commonly used level-1 and level-2 Chinese characters plus 9 areas of symbols. This character set is supported by almost all Chinese systems and international software, and it is the most basic Chinese character set. Its encoding range is: high byte 0xA1-0xFE, low byte 0xA1-0xFE; the Chinese characters run from 0xB0A1 to 0xF7FE.
GBK is an extension of GB2312-80 and is upward compatible with it. It contains 20,902 Chinese characters, and its encoding range is 0x8140-0xFEFE, excluding code positions whose high byte is 0x80. All of its characters map one-to-one to Unicode 2.0; in other words, Java effectively supports the GBK character set. GBK is currently the default character set of Windows and some other Chinese operating systems, but not all international software supports it (some of it does not seem to fully understand what GBK is). Note that GBK is not a national standard, only a specification; with the release of the GB18030-2000 national standard, it will complete its historical mission in the near future.
GB18030-2000 (GBK2K) further extends the Chinese characters on the basis of GBK and adds the glyphs of minority scripts such as Tibetan and Mongolian. GBK2K fundamentally solves the problems of insufficient code positions and insufficient glyphs. Its main features are:
It does not define all of the glyphs; it only defines the encoding range, to be filled in later.
The encoding is variable-length; the two-byte part is compatible with GBK. The four-byte part covers the extended glyphs and code positions; its encoding range is: first byte 0x81-0xFE, second byte 0x30-0x39, third byte 0x81-0xFE, fourth byte 0x30-0x39.
UTF-8/UTF-16/UTF-32
UTF, the Unicode Transformation Format, is the actual representation of Unicode code points, divided into UTF-8/16/32 according to the size of its basic code unit. It can also be regarded as a special external data encoding, but one that corresponds one-to-one with Unicode code points.
UTF-8 is a variable-length encoding; each Unicode code point occupies 1 to 3 bytes within the Basic Multilingual Plane (4 bytes for supplementary characters), depending on its range.
UTF-16 is relatively fixed-length: as long as no characters outside the Basic Multilingual Plane (above U+FFFF) are involved, each Unicode code point is represented in 16 bits, i.e. 2 bytes; code points beyond that are represented by two 16-bit units, i.e. 4 bytes. Depending on byte order, it is divided into UTF-16BE and UTF-16LE.
UTF-32 is always fixed-length: every Unicode code point is represented in 32 bits, i.e. 4 bytes. Depending on byte order, it is divided into UTF-32BE and UTF-32LE.
UTF encodings have the following advantage: although the number of bytes per character varies, with GB2312/GBK you must scan from the beginning of the text to locate a Chinese character reliably, whereas with UTF a relatively simple algorithm tells you, from any position, whether the current byte is the start or a continuation of a code point, so characters can be located easily. UTF-32 has the simplest positioning of all, since no scanning is needed, but its size is also considerably larger.
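A minimal sketch comparing the byte lengths of the same text under GBK, UTF-8 and UTF-16:

import java.io.UnsupportedEncodingException;

public class UtfLengths {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String text = "A中"; // one ASCII letter plus one Chinese character
        System.out.println(text.getBytes("GBK").length);      // 3 (1 + 2)
        System.out.println(text.getBytes("UTF-8").length);    // 4 (1 + 3)
        System.out.println(text.getBytes("UTF-16BE").length); // 4 (2 + 2)
    }
}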
About the gcj JVM
gcj does not follow Sun JDK's behavior completely, and its handling of locale and encoding issues is not well thought out. When gcj starts, the locale is always set to en_US and the default encoding is ISO8859-1. However, you can still use Reader/Writer to perform correct encoding conversions.

 
