Java character encoding conversion

JVM
After the JVM starts, it sets a number of system properties that describe its default locale: user.language, user.region, file.encoding, and so on. You can use System.getProperties() to inspect all system properties in detail.
For example, on an English operating system such as UNIX, you can force the JVM into a Chinese environment with property definitions like: java -Dclient.encoding.override=GBK -Dfile.encoding=GBK -Duser.language=zh -Duser.region=CN
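As a quick check, these properties can be inspected from code. A minimal sketch (note that user.region is a legacy property and may be absent on newer JDKs, which use user.country instead):

```java
public class ShowLocaleProps {
    public static void main(String[] args) {
        // Print the locale-related system properties the JVM set at startup
        String[] keys = {"user.language", "user.region", "user.country", "file.encoding"};
        for (String key : keys) {
            System.out.println(key + " = " + System.getProperty(key));
        }
    }
}
```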
.java --> .class Compilation
Note: javac normally determines the encoding of the source file from the current OS locale. You can force a specific encoding with -encoding.
Possible errors:
1 A GBK-encoded source file compiled in an English environment cannot be converted correctly by javac. This has been observed with Java/JSP under English UNIX. Detection: write a Chinese character in \u4e00 escape form, which bypasses javac's source decoding, then print it inside the JVM as an int and check whether the value is what you expect; or open the .class file directly as UTF-8 and see whether the constant string preserves the Chinese characters correctly.
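The \uXXXX detection trick can be sketched as follows; \u4e00 is the character 一, and its int value must come out as 0x4E00 (19968) if the constant survived compilation intact:

```java
public class UnicodeEscapeCheck {
    public static void main(String[] args) {
        // Written as an escape, this bypasses javac's guess at the source encoding
        char fromEscape = '\u4e00';
        System.out.println((int) fromEscape); // 19968 (0x4E00) if the constant is intact
    }
}
```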
File Read and Write
Reading and writing external data such as files takes two steps: raw byte I/O, and conversion to the characters the JVM uses. InputStream/OutputStream read and write the raw external bytes; Reader/Writer perform both steps, reading/writing and converting.
1 File read/write conversion is performed by java.io.Reader/Writer. InputStream/OutputStream are not appropriate for handling Chinese characters; prefer Reader/Writer, such as FileReader/FileWriter.
2 FileReader/FileWriter read and write files using the JVM's current default encoding. For any other encoding, use InputStreamReader/OutputStreamWriter instead.
3 PrintStream is somewhat special: it converts automatically using the JVM default encoding.
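To illustrate point 2, here is a minimal sketch of writing and reading a file with an explicitly chosen charset via OutputStreamWriter/InputStreamReader instead of the JVM default (the temp-file name is arbitrary, and the sketch assumes the JDK's GBK charset is installed):

```java
import java.io.*;

public class ExplicitCharsetIo {
    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("encoding-demo", ".txt");
        f.deleteOnExit();
        // Write with an explicit charset rather than the JVM default
        try (Writer w = new OutputStreamWriter(new FileOutputStream(f), "GBK")) {
            w.write("中文");
        }
        // Read back declaring the same charset; a mismatched charset gives mojibake
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(new FileInputStream(f), "GBK"))) {
            System.out.println(r.readLine()); // 中文
        }
    }
}
```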
Reading .properties Files
A .properties file is read by the Properties class using ISO-8859-1 encoding, so Chinese characters cannot be written into it directly; use the JDK's native2ascii tool to convert them to \uXXXX form. Command line: native2ascii -encoding GBK inputfile outputfile
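What native2ascii produces is plain ISO-8859-1 text containing \uXXXX escapes, which Properties decodes on load. A small sketch (the key name greeting is arbitrary; \u4e2d\u6587 is 中文):

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class PropsUnicodeEscape {
    public static void main(String[] args) throws IOException {
        // What native2ascii would emit for "greeting=中文": pure ISO-8859-1 text
        String escaped = "greeting=\\u4e2d\\u6587";
        Properties p = new Properties();
        p.load(new StringReader(escaped));
        System.out.println(p.getProperty("greeting")); // 中文
    }
}
```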
Reading XML Files
1 Reading and writing XML files works like ordinary file I/O, but take care that the encoding declared in the XML prolog, such as <?xml version="1.0" encoding="gb2312"?>, is consistent with the actual file encoding.
2 The javax.xml.parsers.SAXParser class accepts an InputStream as its input parameter; for a Reader, wrap it in an org.xml.sax.InputSource and pass that to the SAXParser.
3 For UTF-8-encoded XML, make sure the editor does not automatically prepend a \uFEFF BOM, or the XML parser will report "Content is not allowed in prolog".
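A sketch of point 2: wrapping a Reader in org.xml.sax.InputSource before handing it to SAXParser. A Reader already delivers decoded chars, so the prolog's encoding pseudo-attribute no longer matters; the handler here simply collects character data:

```java
import java.io.StringReader;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxFromReader {
    static final StringBuilder text = new StringBuilder();

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?><root><item>中文</item></root>";
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // Wrap the Reader in an InputSource, as SAXParser has no parse(Reader, ...)
        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        System.out.println(text); // 中文
    }
}
```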
Byte Arrays
1 Convert between byte arrays and strings with new String(byteArray, encoding) and String.getBytes(encoding).
You can also convert via streams, using ByteArrayInputStream/ByteArrayOutputStream together with InputStreamReader/OutputStreamWriter.
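A short sketch of both directions; note how the byte counts differ per charset (2 bytes per CJK character under GBK, 3 under UTF-8):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteStringRoundTrip {
    public static void main(String[] args) {
        String s = "中文abc";
        byte[] gbk = s.getBytes(Charset.forName("GBK"));
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(gbk.length);  // 7: 2 bytes per CJK char + 3 ASCII bytes
        System.out.println(utf8.length); // 9: 3 bytes per CJK char + 3 ASCII bytes
        // Decoding with the charset that produced the bytes round-trips exactly
        System.out.println(new String(gbk, Charset.forName("GBK")).equals(s)); // true
    }
}
```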
Incorrectly Decoded Strings (ISO-8859-1 instead of GBK)
If a string was produced by wrong transcoding, for example GBK Chinese text decoded as ISO-8859-1, the string you see in a debugger typically has a length equal to the byte count of the text rather than the number of characters.
It can be converted back to correct Chinese as follows:
text = new String(text.getBytes("ISO-8859-1"), "GBK");
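A round-trip sketch of this repair; it works because ISO-8859-1 maps bytes 0x00-0xFF one-to-one onto chars, so the original GBK bytes survive the wrong decoding unchanged:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FixMojibake {
    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        String original = "中文";
        // Simulate the bug: GBK bytes decoded as ISO-8859-1, one char per byte
        String wrong = new String(original.getBytes(gbk), StandardCharsets.ISO_8859_1);
        System.out.println(wrong.length()); // 4: the byte count, not the character count
        // Repair: ISO-8859-1 re-encoding is lossless for 0x00-0xFF, then decode as GBK
        String fixed = new String(wrong.getBytes(StandardCharsets.ISO_8859_1), gbk);
        System.out.println(fixed.equals(original)); // true
    }
}
```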

Web/Servlet/JSP
1 For JSP, make sure the page header carries a directive such as <%@ page contentType="text/html;charset=gb2312" %>.
2 For servlets, call response.setContentType("text/html; charset=gb2312"). These two measures keep Chinese output free of problems.
3 Add <meta http-equiv="Content-Type" content="text/html; charset=gb2312"> to the <head> of the generated HTML so the browser can determine the encoding correctly.
4 Add a filter to the web application that explicitly calls setCharacterEncoding on every request, so input characters are parsed correctly:
import java.io.IOException;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

/**
 * Example filter that sets the character encoding to be used in parsing the
 * incoming request.
 */
public class SetCharacterEncodingFilter implements Filter {

    protected boolean debug = false;
    protected String encoding = null;
    protected FilterConfig filterConfig = null;

    public void init(FilterConfig filterConfig) throws ServletException {
        this.filterConfig = filterConfig;
        this.encoding = filterConfig.getInitParameter("encoding");
        this.debug = "true".equalsIgnoreCase(filterConfig.getInitParameter("debug"));
    }

    public void doFilter(ServletRequest request, ServletResponse response,
                         FilterChain chain) throws IOException, ServletException {
        // Only override when the request did not declare its own encoding
        if (request.getCharacterEncoding() == null) {
            String encoding = getEncoding();
            if (encoding != null) {
                request.setCharacterEncoding(encoding);
            }
        }
        if (debug) {
            System.out.println(((HttpServletRequest) request).getRequestURI()
                    + " set to " + encoding);
        }
        chain.doFilter(request, response);
    }

    public void destroy() {
        this.encoding = null;
        this.filterConfig = null;
    }

    protected String getEncoding() {
        return this.encoding;
    }
}
Add to web.xml:
<filter>
<filter-name>LocalEncodingFilter</filter-name>
<display-name>LocalEncodingFilter</display-name>
<filter-class>com.ccb.ectipmanager.request.SetCharacterEncodingFilter</filter-class>
<init-param>
<param-name>encoding</param-name>
<param-value>gb2312</param-value>
</init-param>
<init-param>
<param-name>debug</param-name>
<param-value>false</param-value>
</init-param>
</filter>
<filter-mapping>
<filter-name>LocalEncodingFilter</filter-name>
<url-pattern>/*</url-pattern>
</filter-mapping>
5 For WebLogic (vendor-specific):
First, add the following to web.xml:
<context-param>
<param-name>weblogic.httpd.inputCharset./*</param-name>
<param-value>GBK</param-value>
</context-param>
Second (optional), add the following to weblogic.xml:
<charset-params>
<input-charset>
<resource-path>/*</resource-path>
<java-charset-name>GBK</java-charset-name>
</input-charset>
</charset-params>
Swing/AWT/SWT
For Swing/AWT, Java has logical default fonts such as Dialog/SansSerif, which are mapped to actual system fonts in the $JRE_HOME/lib/font.properties.XXX files. When troubleshooting font display problems, first confirm that the JVM's locale is zh_CN; otherwise the font.properties.zh_CN file does not take effect. In font.properties.zh_CN, check whether the default fonts are mapped to Chinese fonts such as XXFarEastFont-Arial.
In Swing, Java interprets TTF fonts and renders the display itself; AWT and SWT delegate display to the operating system. In either case, first make sure the system contains Chinese fonts.
1 Chinese characters displayed as "-" generally mean the display font is not a Chinese font; unlike Windows, Java will not fall back to a default font for characters the current font cannot display.
2 If only some less common Chinese characters cannot be displayed, the font's coverage of Chinese characters is probably incomplete; try another Chinese font.
3 For AWT/SWT, first make sure the locale the JVM runs under is Chinese, because the locale is involved in translating between JVM and operating-system API calls; then check for other issues.
JNI
A jstring in JNI is handed to us in UTF-8 encoding, and we must convert it to the local encoding ourselves. On Windows you can use the WideCharToMultiByte/MultiByteToWideChar functions for the conversion; on UNIX you can use the iconv library.
Below is an approach found in the Sun JDK 1.4 source code that performs the conversion via the JVM String object's getBytes. It is relatively simple and cross-platform and needs no third-party library, but it is slightly slower. The function prototypes are as follows:
/* Convert between Java strings and i18n C strings */
JNIEXPORT jstring
NewStringPlatform(JNIEnv *env, const char *str);

JNIEXPORT const char *
GetStringPlatformChars(JNIEnv *env, jstring jstr, jboolean *isCopy);

JNIEXPORT jstring JNICALL
JNU_NewStringPlatform(JNIEnv *env, const char *str);

JNIEXPORT const char * JNICALL
JNU_GetStringPlatformChars(JNIEnv *env, jstring jstr, jboolean *isCopy);

JNIEXPORT void JNICALL
JNU_ReleaseStringPlatformChars(JNIEnv *env, jstring jstr, const char *str);
Attachment: jni_util.h, jni_util.c
New in JDK 1.4/1.5
Character-set classes (Charset/CharsetEncoder/CharsetDecoder)
Starting with JDK 1.4, character-set support is implemented in the java.nio.charset package.
Common operations:
1 List the charsets the JVM supports: Charset.availableCharsets()
2 Test whether a Unicode character can be encoded: CharsetEncoder.canEncode()
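Both operations can be sketched briefly; US-ASCII is used here only as an example of an encoder that cannot represent Chinese characters:

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class CharsetProbe {
    public static void main(String[] args) {
        // 1. The JVM's supported charsets, keyed by canonical name
        System.out.println(Charset.availableCharsets().containsKey("UTF-8")); // true
        // 2. Ask an encoder whether it can represent a given character
        CharsetEncoder ascii = Charset.forName("US-ASCII").newEncoder();
        System.out.println(ascii.canEncode('A'));  // true
        System.out.println(ascii.canEncode('中')); // false: outside US-ASCII
    }
}
```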
Common Problems
Chinese cannot be printed correctly with System.out.println under the JVM; it shows as ???
System.out.println is a PrintStream, which transcodes using the JVM default charset; if that default charset is ISO-8859-1, Chinese output will be garbled. This problem is common on UNIX when the JVM's locale is not explicitly specified.
In an English UNIX environment, System.out.println prints Chinese correctly, but internal processing is wrong
It is likely that the Chinese characters were not transcoded correctly on input:
That is: GBK text -> (ISO-8859-1 decoding) -> JVM chars (GBK bytes stored one per char) -> (ISO-8859-1 encoding) -> output.
The GBK characters pass through two mirrored mis-transcodings and reach the output unchanged, but inside the JVM they are not correct Unicode: each byte occupies one char, which is what causes errors of this kind.
GB2312-80, GBK, GB18030-2000 Chinese Character Sets
GB2312-80 was developed in the early days of Chinese computing. It contains the most commonly used Chinese characters plus symbols in 9 areas, and it is the Chinese character set supported by almost all Chinese systems and internationalized software, the most basic Chinese character set. Its encoding range is high byte 0xA1-0xFE and low byte 0xA1-0xFE; the Chinese characters run from 0xB0A1 to 0xF7FE.
GBK is an upward-compatible extension of GB2312-80. It contains 20,902 Chinese characters, and its encoding range is 0x8140-0xFEFE (high byte 0x80 is excluded). All of its characters map one-to-one onto Unicode 2.0, which means Java effectively provides support for the GBK character set. It is currently the default character set of Windows and some other Chinese operating systems, but not all internationalized software supports it; some software seems not to fully understand how to handle GBK. Note that GBK is not a national standard, only a specification; with the release of the GB18030-2000 national standard, it will complete its historical mission in the near future.
GB18030-2000 (GBK2K) further extends the Chinese characters on the basis of GBK and adds glyphs for Tibetan, Mongolian, and other minority scripts. GBK2K fundamentally solves the problems of insufficient characters and glyphs. Its features:
It does not define all glyphs; it only defines the encoding range, which is left for later extension.
The encoding is variable-length. The two-byte part is compatible with GBK; four-byte sequences encode the extended glyphs. The four-byte range is first byte 0x81-0xFE, second byte 0x30-0x39, third byte 0x81-0xFE, fourth byte 0x30-0x39.
UTF-8/UTF-16/UTF-32
UTF, the Unicode Transformation Format, is the concrete byte representation of Unicode code points, divided into UTF-8/16/32 by the bit width of its code unit. It can also be viewed as a special kind of external data encoding, one that maps one-to-one onto Unicode code points.
UTF-8 is a variable-length encoding: each Unicode code point takes 1-3 bytes within the BMP (and 4 bytes beyond it), depending on its range.
UTF-16 is comparatively fixed-length: each BMP code point uses one 16-bit unit (2 bytes); code points beyond U+FFFF take two UTF-16 units (4 bytes, a surrogate pair). By byte order it is further divided into UTF-16BE/UTF-16LE.
UTF-32 is always fixed-length: every Unicode code point uses 32 bits (4 bytes). By byte order it is divided into UTF-32BE/UTF-32LE.
The advantage of the UTF encodings is that, although the byte count per code point varies, you do not need to scan from the start of the text to locate a character correctly, unlike GB2312/GBK. Under UTF, a relatively simple algorithm tells from the current byte whether it is the start or a continuation of a code point, so character positioning is straightforward. UTF-32 has the simplest positioning of all, requiring no scanning, but its size is correspondingly much larger.
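The widths described above can be checked with String.getBytes; a small sketch (UTF-32 is not in StandardCharsets, but the JDK supports it by name):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class UtfWidths {
    public static void main(String[] args) {
        String s = "A中"; // one ASCII char, one BMP CJK char
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);      // 4 = 1 + 3
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length);   // 4 = 2 + 2
        System.out.println(s.getBytes(Charset.forName("UTF-32BE")).length); // 8 = 4 + 4
    }
}
```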
About the GCJ JVM
GCJ does not follow the Sun JDK in every respect; its handling of locale and encoding is less thorough. When GCJ starts, the locale is always set to en_US and the encoding defaults to ISO-8859-1. You can, however, use Reader/Writer to perform correct encoding conversions.


Content from: http://blog.csdn.net/yuanyuan110_l/archive/2008/01/21/2057658.aspx
