Java Character encoding Summary

Last Update:2014-08-25 Source: Internet

Author: User

String newStr = new String(oldStr.getBytes(), "UTF-8");

The String class in Java stores text as Unicode. When a string is constructed with String(byte[] bytes, String encoding), the encoding argument describes how the data in bytes is encoded, not how the resulting string is encoded. In other words, it asks the system to convert the data in bytes from that encoding to Unicode. If no encoding is specified, the JDK chooses one for the bytes based on the operating system.
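As a minimal sketch of that point, the same three bytes can be decoded with two different charsets. The class name DecodeDemo and the sample bytes (the UTF-8 form of U+82F1) are illustrative assumptions, not part of the original article.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    // The charset argument describes the bytes, not the resulting String.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // UTF-8 bytes of the single character U+82F1 (illustrative sample).
        byte[] utf8 = {(byte) 0xE8, (byte) 0x8B, (byte) 0xB1};

        String right = decode(utf8, StandardCharsets.UTF_8);
        String wrong = decode(utf8, StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // one char when decoded as UTF-8
        System.out.println(wrong.length()); // three chars when each byte is read as Latin-1
    }
}
```

Naming the charset explicitly, rather than relying on the no-argument forms, keeps the result independent of the machine the code runs on.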

When we read data from a file, it is best to read the bytes through an InputStream and then use String(byte[] bytes, String encoding) to state how the file is encoded. Avoid a Reader that is not given an explicit encoding, because it automatically converts the file content to Unicode using the encoding the JDK chooses by default.
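The following is a sketch of that approach under stated assumptions: the class name ReadFileDemo, the helper roundTrip, and the temporary file are hypothetical and exist only to make the example self-contained. InputStream.readAllBytes requires Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ReadFileDemo {
    // Reads raw bytes through an InputStream, then decodes them with the stated charset.
    static String readAll(Path file, Charset charset) {
        try (InputStream in = Files.newInputStream(file)) {
            return new String(in.readAllBytes(), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: writes text to a temporary file and reads it back.
    static String roundTrip(String text, Charset charset) {
        try {
            Path tmp = Files.createTempFile("encoding-demo", ".txt");
            try {
                Files.write(tmp, text.getBytes(charset));
                return readAll(tmp, charset);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // contains one non-ASCII character
        System.out.println(text.equals(roundTrip(text, StandardCharsets.UTF_8)));
    }
}
```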

When we read text data from the database, we call the ResultSet.getBytes() method to get the byte array, and then use the same String constructor with an explicit encoding.

ResultSet rs;
byte[] bytes = rs.getBytes();
String str = new String(bytes, "gb2312");

Do not take the following steps.

ResultSet rs;
String str = rs.getString();
str = new String(str.getBytes("iso8859-1"), "gb2312");

This kind of transcoding is inefficient. The reason is that when ResultSet executes getString(), it assumes by default that the data in the database is encoded as iso8859-1, so the system converts the data to Unicode according to iso8859-1. str.getBytes("iso8859-1") then restores the original bytes, and new String(bytes, "gb2312") converts them from gb2312 to Unicode, which adds several steps in between.

When reading parameters from an HttpRequest, call request.setCharacterEncoding() to set the encoding first, and the content read will be correct.

Let's talk about Java first.
Every string inside the JVM is Unicode, meaning that any data of type String is Unicode encoded, with no exceptions. Since there is only one encoding, we can say that a string inside the JVM has no encoding of its own. A String is equivalent to a char[].
byte[] data inside the JVM, by contrast, does have an encoding, such as Big5, GBK, GB2312, or UTF-8.
Converting a GBK-encoded byte[] to a String is in fact a conversion from GBK to Unicode.
Converting a String to a Big5-encoded byte[] is in fact a conversion from Unicode to Big5.
Therefore Unicode is the intermediary for all encoding conversions. Every encoding has a converter to Unicode, and Unicode can be converted to every other encoding. This forms a bus structure.
For example, with 10 encodings in total, 10 + 10 = 20 converters are enough. If every pair of encodings were converted directly, one converter would be needed per ordered pair, which is 10 × 9 = 90 converters.
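The counting argument and the bus structure can be sketched as follows. The class and method names are hypothetical, and the GBK sample bytes D3 A2 are the character U+82F1; the GBK charset is assumed to be available in the running JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class HubDemo {
    // One converter into Unicode and one out of Unicode per encoding.
    static int convertersViaUnicode(int encodings) {
        return encodings + encodings;
    }

    // One converter per ordered pair of distinct encodings.
    static int convertersPairwise(int encodings) {
        return encodings * (encodings - 1);
    }

    // Any-to-any conversion built from two hops through a Unicode String.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        String unicode = new String(input, from); // from -> Unicode
        return unicode.getBytes(to);              // Unicode -> to
    }

    public static void main(String[] args) {
        System.out.println(convertersViaUnicode(10)); // 20
        System.out.println(convertersPairwise(10));   // 90

        byte[] gbk = {(byte) 0xD3, (byte) 0xA2};
        byte[] utf8 = transcode(gbk, Charset.forName("GBK"), StandardCharsets.UTF_8);
        System.out.println(utf8.length); // 3
    }
}
```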

Each part of a system has its own encoding, for example these 4 parts: database, file, JVM, and browser.
Wherever data is exchanged between these parts, an encoding problem can arise: between the database and the JVM, between a file and the JVM, between the browser and the JVM. The principles behind these problems are the same.

The easiest place to handle encoding problems is between a file and the JVM. The file I/O APIs take encoding parameters; please look them up yourself.
The place least likely to have encoding problems is between the database and the JVM. Handling this should be basic functionality of the database's JDBC connection, so this article does not discuss it.
The place most prone to problems is between the browser and the server JVM (in fact, string literals inside the code are even more prone to problems, but this article does not discuss the encoding of strings in code). The following discusses the encoding problem between the browser and the server JVM.

We call the browser's encoding browser_charset and the JVM's encoding jvm_charset (usually equal to the server's system encoding).
When data arrives from the browser, it is a byte[] encoded in browser_charset.
If a user handler asks for the data as a String, the JVM will helpfully convert this byte[] to a String. The converter it uses is jvm_charset -> Unicode.
Note that if browser_charset and jvm_charset are not equal at this point, this automatic conversion is wrong.
To make up for this mistake, we need two steps.
(1) Unicode -> jvm_charset: convert the String back to the original byte[].
(2) browser_charset -> Unicode: convert the restored byte[] to a String.

The effect is the same as getting the byte[] directly from the HTTP request and then performing (2) browser_charset -> Unicode.
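The two repair steps can be sketched as below. This is an illustration under assumptions, not a servlet implementation: the class RepairDemo is hypothetical, browser_charset is assumed to be UTF-8, and jvm_charset is assumed to be ISO-8859-1. Step (1) only recovers the original bytes when jvm_charset maps every byte value to a character, as ISO-8859-1 does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RepairDemo {
    // Step (1): Unicode -> jvm_charset recovers the bytes.
    // Step (2): browser_charset -> Unicode decodes them correctly.
    static String repair(String wrong, Charset jvmCharset, Charset browserCharset) {
        byte[] original = wrong.getBytes(jvmCharset);
        return new String(original, browserCharset);
    }

    public static void main(String[] args) {
        // Assumed values for illustration only.
        Charset browserCharset = StandardCharsets.UTF_8;
        Charset jvmCharset = StandardCharsets.ISO_8859_1;

        byte[] sent = "\u82f1".getBytes(browserCharset);
        String wrong = new String(sent, jvmCharset); // the automatic, mistaken conversion
        String fixed = repair(wrong, jvmCharset, browserCharset);

        // Same result as decoding the request bytes directly with browser_charset.
        System.out.println(fixed.equals(new String(sent, browserCharset)));
    }
}
```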

If the character encoding is set on the request, the POST parameters do not need to be converted manually, and the web server's automatic conversion is correct. Parameters in the URL also involve URL encoding, which raises more issues to consider and is not so simple.

When the JVM sends data to the browser, encoding also needs to be considered. It can be set on the response. In addition, an HTML meta tag can declare the encoding, reminding the browser to choose the correct one.

In other languages the VM or interpreter may use a different string encoding, for example Ruby. However, the principle of encoding conversion is the same.

That's all.

Java character encoding

I. Summary
In Java applications, especially web-based programs, character encoding problems are frequently encountered. To prevent garbled characters, you first need to understand how Java handles characters, so that you can add the necessary transcoding at the input and output points. Second, because different servers handle encoding differently, you also need to test thoroughly to make sure no garbled text appears in use.
II. Basic Concepts
2.1 Representation of characters in Java
Java has three related concepts: char, byte, and String. A char is a Unicode character, stored as a 16-bit integer. A byte is a single byte; a string must be converted to a byte array before it is sent over the network or stored, and a byte array must be converted to a string after it is received from the network or read from a storage device. A String can be viewed as an array of char. String and char are the in-memory forms, and byte is the serialized form for network transport or storage.
Example:
// the character 英
String ying = "英";
char yingChar = ying.charAt(0);
String yingHex = Integer.toHexString(yingChar);
// 82f1
byte[] yingGBBytes = ying.getBytes("GBK");
// GB-encoded byte values
// D3 A2
2.2 Introduction to encoding methods
Serializing a string into a byte array, or deserializing it, requires choosing the correct encoding. If the encoding is not correct, you will get some 0x3f values (the '?' character). The commonly used character encodings are ISO8859_1, GB2312, GBK, and UTF-8/UTF-16/UTF-32.
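A small sketch of where the 0x3f values come from: when String.getBytes meets a character the target charset cannot represent, it substitutes the charset's replacement byte, which for ISO-8859-1 is '?' (0x3F). The class name UnmappableDemo and the sample text are illustrative assumptions.

```java
import java.nio.charset.StandardCharsets;

class UnmappableDemo {
    // Encodes text as ISO-8859-1; characters outside Latin-1 cannot be represented.
    static byte[] toLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // 'A' is representable in Latin-1, U+82F1 is not.
        byte[] bytes = toLatin1("A\u82f1");
        for (byte b : bytes) {
            System.out.printf("%02X ", b & 0xFF); // prints 41 3F
        }
        System.out.println();
    }
}
```

Once a character has been replaced by 0x3F the original value is gone, so no later transcoding can recover it.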
ISO8859_1 is used to encode Latin characters and uses a single byte (0-255).
GB2312 and GBK are used to encode Simplified Chinese, mixing single-byte and double-byte sequences. A byte whose highest bit is 1 forms a Chinese character together with the next byte, and a byte whose highest bit is 0 is an ASCII code.
UTF-8/UTF-16/UTF-32 are the encoding forms of the international Unicode standard. UTF-8 is the most widely used, mainly because it saves space when encoding Latin text.
Unicode value               UTF-8 encoding
U-00000000 - U-0000007F:    0xxxxxxx
U-00000080 - U-000007FF:    110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF:    1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF:    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF:    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF:    1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
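To make the bit patterns concrete, here is a hand-written encoder for the first four rows of the table, checked against Java's built-in UTF-8 encoder. The class name Utf8ByHand is hypothetical. The sketch stops at four bytes because current UTF-8 is limited to code points up to U+10FFFF, so the five- and six-byte rows are never produced today.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8ByHand {
    // Encodes one code point using the first four rows of the table.
    // Surrogate code points are not rejected here, to keep the sketch short.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("not a Unicode code point: " + cp);
        }
        if (cp <= 0x7F) {
            return new byte[] {(byte) cp};
        }
        if (cp <= 0x7FF) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        if (cp <= 0xFFFF) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))};
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xE9, 0x82F1, 0x1F600};
        for (int cp : samples) {
            byte[] expected = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            // Prints the code point and whether the hand-written result matches the JDK.
            System.out.println(Integer.toHexString(cp) + " " + Arrays.equals(expected, encode(cp)));
        }
    }
}
```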
III. Related functions in J2SE
String str = "英";
// Get the GB2312-encoded bytes
byte[] bytesGB2312 = str.getBytes("GB2312");
// Get the bytes in the platform default encoding (ISO8859_1 on Solaris, GB2312 on Windows)
byte[] bytesDefault = str.getBytes();
// Convert bytes into a string using the specified encoding
String newStrGB = new String(bytesGB2312, "GB2312");

// Convert bytes into a string using the platform default encoding (ISO8859_1 on Solaris, GB2312 on Windows)
String newStrDefault = new String(bytesDefault);
// Read a character from a byte stream using the specified encoding
InputStream in = xxx;
InputStreamReader reader = new InputStreamReader(in, "GB2312");
char aChar = (char) reader.read();
IV. Encoding in JSP and the database
4.1 Encoding in JSP
(1) Static declaration:
The declared charset has two functions:
Encoding of the JSP file: the encoding used for Chinese characters in the source JSP file when the file is read and the Java class is generated
Encoding of the JSP output stream: the encoding of the data written to the response stream when the JSP is executed
(2) Dynamic change: before writing data to the response stream, call response.setContentType() to set the correct encoding.
(3) In Tomcat, the parameters returned by request.getParameter() are decoded as ISO8859_1. So if you enter the Chinese character "英" in a browser input box, the server side gets a string decoded as ISO8859_1 (0x00, 0xD3, 0x00, 0xA2). Therefore the parameter is usually transcoded when it is received:
String wrongStr = request.getParameter("name");
String correctStr = new String(wrongStr.getBytes("ISO8859_1"), "GB2312");
In the latest servlet specification, you can instead execute the following code before getting the parameters:
request.setCharacterEncoding("GB2312");
4.2 Encoding of the database
(1) The database uses UTF-16
If the string is Unicode, no transcoding is needed when writing or reading
(2) The database uses ISO8859_1
If the string is Unicode, transcoding is needed when writing and reading
Write: String newStr = new String(oldStr.getBytes("GB2312"), "ISO8859_1");
Read: String newStr = new String(oldStr.getBytes("ISO8859_1"), "GB2312");
V. Source file encoding
5.1 Resource files
The encoding of a resource file depends on the editing platform. A resource file written on a Simplified Chinese Windows platform is encoded in GB2312. It must be transcoded at build time to ensure correctness on every platform:
native2ascii -encoding GB2312 source.properties
This way the correct Unicode string is read from the resource file.
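The effect of that transcoding can be imitated in a few lines: every non-ASCII character is rewritten as a \uXXXX escape, which java.util.Properties turns back into the original character when it loads the file. This is a sketch of the idea rather than the tool's actual implementation; the class EscapeDemo and its methods are hypothetical, and the sample key and value are the ones used in the notes of this article.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class EscapeDemo {
    // Replaces every non-ASCII UTF-16 code unit with a backslash-u escape of four hex digits.
    static String toAsciiEscapes(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    // Parses one escaped line the way a .properties file is loaded and returns the value of key.
    static String loadValue(String escapedLine, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(escapedLine));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        String line = toAsciiEscapes("company.name=\u82f1\u65af\u514b");
        System.out.println(line); // pure ASCII text
        System.out.println("\u82f1\u65af\u514b".equals(loadValue(line, "company.name")));
    }
}
```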
5.2 Source files
The encoding of a source file depends on the editing platform. Source files developed on a Simplified Chinese Windows platform are encoded in GB2312. At compile time, you need to specify how the source files are encoded:
javac -encoding GB2312
String constants in the class files generated by compilation are stored in UTF-8 (a modified form of it).

① The latest version, Tomcat 4.1.18, supports request.setCharacterEncoding(String enc)
② The resource file is transcoded into company.name=\u82f1\u65af\u514b
③ This transcoding is not required if the database uses UTF-16
④ The page should have
Transcoding:
String s = new String(request.getParameter("name").getBytes("ISO8859_1"), "GB2312");
Transcoding:
String s = new String(name.getBytes("GB2312"), "ISO8859_1");
Transcoding:
String s = new String(name.getBytes("ISO8859_1"), "GB2312");

======================================================

What character encoding does Java use internally? I looked for the answer for a long time, and later saw an example that used UTF-16BE in chapter 12 of Thinking in Java, 3rd edition. Is that it?
byte[] utf_16be = name.getBytes("UTF-16BE");

printByte(utf_16be);

The result: length = 2
So that is it: exactly two bytes, no more, and the content is the same as the char value. I also noticed that the Unicode encodings include an LE variant alongside BE; I take BE and LE to mean big-endian and little-endian.
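The observation can be reproduced with a short sketch. The original does not show what name contains, so the single character U+82F1 is assumed here, and the class Utf16Demo with its hex helper is hypothetical (it stands in for the printByte helper, whose code is not shown).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf16Demo {
    // Formats bytes as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    static String encodeHex(String text, Charset charset) {
        return hex(text.getBytes(charset));
    }

    public static void main(String[] args) {
        String name = "\u82f1"; // assumed sample value
        System.out.println(encodeHex(name, StandardCharsets.UTF_16BE)); // 82 F1
        System.out.println(encodeHex(name, StandardCharsets.UTF_16LE)); // F1 82
        System.out.println(encodeHex(name, StandardCharsets.UTF_16));   // FE FF 82 F1
    }
}
```

The plain UTF-16 charset prepends a byte order mark when encoding, which is why it yields four bytes where the BE and LE variants yield two.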

=====================================================

import java.io.*;

public class TestCodeIO {
    public static void main(String[] args) throws Exception {
        InputStreamReader isr = new InputStreamReader(System.in, "iso8859-1");
        // Create an InputStreamReader that uses the given charset decoder
        BufferedReader br = new BufferedReader(isr);
        String strLine = br.readLine();
        br.close();
        isr.close();
        System.out.println(strLine);
        System.out.println(new String(strLine.getBytes(), "iso8859-1")); // wrong way to fix it
        // Encodes this String (strLine) into a sequence of bytes using the platform's
        // default charset (gb2312), then constructs a new String by decoding the
        // specified array of bytes using the specified charset (iso8859-1).
        // Because this String (strLine) was produced by the charset decoder "iso8859-1",
        // it can only be encoded by "iso8859-1", not by the platform's default
        // charset "gb2312", so this is wrong.
        System.out.println(new String(strLine.getBytes("iso8859-1"))); // correct way to fix it
        // Encodes this String (strLine) into a sequence of bytes using the named
        // charset (iso8859-1), then constructs a new String by decoding the
        // specified array of bytes using the platform's default charset (gb2312).
        // This is right.
    }
}

The English comments above are already quite clear; here is my explanation:

The first, wrong method is System.out.println(new String(strLine.getBytes(), "iso8859-1"));
This code encodes the string in strLine into a byte sequence using the system default encoding (gb2312 here), then constructs a new String object from it with the specified encoding (iso8859-1 here) and prints it to the screen.
Where is the error?

Please note this section of code:
InputStreamReader isr = new InputStreamReader(System.in, "iso8859-1");
BufferedReader br = new BufferedReader(isr);
String strLine = br.readLine();
The content of strLine was decoded with the specified encoding (iso8859-1), but converting it to bytes with strLine.getBytes() uses the system default gb2312 encoding, so of course the output is garbled. That gb2312-encoded byte sequence is then decoded as iso8859-1 to construct the new String object, so the garbled output differs from that of System.out.println(strLine).

The correct method needs little further explanation: strLine is first converted to a byte sequence with the iso8859-1 encoding, and then a new String object is built from it with the system default encoding (gb2312) and printed.
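The walkthrough can be replayed deterministically with a sketch that does not depend on the machine's settings. Two assumptions are made: GB2312 is named explicitly in place of the platform default, and an in-memory stream holding the GB2312 bytes D3 A2 (the character U+82F1) plus a newline stands in for System.in. The class ConsoleFixDemo and its methods are hypothetical.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ConsoleFixDemo {
    // Assumption: GB2312 stands in for the platform default charset of the walkthrough.
    static final Charset PLATFORM = Charset.forName("GB2312");

    // Same reader setup as the program discussed above, fed from memory instead of System.in.
    static String readLineAsLatin1(byte[] input) {
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(input), StandardCharsets.ISO_8859_1))) {
            return br.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Encodes with the platform charset, which cannot reproduce the original bytes.
    static String wrongFix(String strLine) {
        return new String(strLine.getBytes(PLATFORM), StandardCharsets.ISO_8859_1);
    }

    // Encodes with iso8859-1 to recover the bytes, then decodes with the platform charset.
    static String rightFix(String strLine) {
        return new String(strLine.getBytes(StandardCharsets.ISO_8859_1), PLATFORM);
    }

    public static void main(String[] args) {
        byte[] typed = {(byte) 0xD3, (byte) 0xA2, '\n'};
        String strLine = readLineAsLatin1(typed);
        System.out.println("\u82f1".equals(rightFix(strLine))); // true
        System.out.println("\u82f1".equals(wrongFix(strLine))); // false
    }
}
```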
