Java character encoding

Source: Internet
Author: User
I. Summary
Character encoding problems come up often in Java applications, especially web-based ones. To prevent garbled characters, first understand how Java processes characters, so that you can add the necessary transcoding at the input and output steps deliberately. Second, because different servers handle encoding differently, test on each target server to confirm that no garbled characters appear.
II. Basic Concepts
2.1 Representation of characters in Java
Java has the concepts of char, byte, and String. A char is a 16-bit unsigned integer that holds one UTF-16 code unit, which for most characters is the Unicode value itself. A byte is 8 bits of raw data. A String is a sequence of chars and can be viewed as a char array. String and char are in-memory forms; a byte array is the serialized form used for network transmission or storage. Before sending a String over the network or storing it, convert it to a byte array. After receiving data from the network or reading it from a storage device, convert the byte array back to a String.
Example, using the character 英:
String ying = "英";
char yingChar = ying.charAt(0);
String yingHex = Integer.toHexString(yingChar);
// yingHex is "82f1", the Unicode value: 82 F1
byte[] yingGbBytes = ying.getBytes("GBK");
// Byte values in GBK encoding: D3 A2
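The example above can be turned into a small runnable program. This is a minimal sketch, not the author's code: the class name CharBytesDemo and the helper hex() are made up for illustration, the character is written as a Unicode escape so the result does not depend on the source file encoding, and it assumes the JDK in use ships the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharBytesDemo {
    // Hypothetical helper: format bytes as upper-case hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String ying = "\u82f1"; // the character from the example above
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available in this JDK

        // In memory: one char whose value is the Unicode code point.
        System.out.println(Integer.toHexString(ying.charAt(0))); // 82f1

        // Serialized: the byte sequence depends on the chosen encoding.
        byte[] gbkBytes = ying.getBytes(gbk);
        byte[] utf8Bytes = ying.getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(gbkBytes));  // D3 A2
        System.out.println(hex(utf8Bytes)); // E8 8B B1

        // Decoding with the same charset restores the original String.
        System.out.println(new String(gbkBytes, gbk).equals(ying)); // true
    }
}
```

The same String produces different byte arrays under GBK and UTF-8, which is why the encoding has to be agreed on at both ends.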

2.2 Introduction to encoding methods
When serializing a String to a byte array, or deserializing a byte array back to a String, you must select the correct encoding. If the chosen encoding cannot represent a character, the encoder writes 0x3F (the question mark "?") in its place. Commonly used character encodings include ISO8859_1, GB2312, GBK, and UTF-8/UTF-16/UTF-32.
ISO8859_1 encodes Latin characters. Each character is a single byte (0-255).
GB2312 and GBK encode Simplified Chinese characters using a mix of single-byte and double-byte sequences. A byte whose highest bit is 1, together with the byte that follows it, forms one Chinese character. A byte whose highest bit is 0 is an ASCII character.
UTF-8, UTF-16, and UTF-32 are the encoding forms of the international Unicode standard. UTF-8 is the most widely used, mainly because it saves space when encoding Latin characters.
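The 0x3F behaviour is easy to reproduce. The following is a minimal sketch (the class name UnmappableDemo and the method toLatin1() are illustrative, not from the original article) that encodes a Chinese character with ISO8859_1, an encoding that cannot represent it.

```java
import java.nio.charset.StandardCharsets;

class UnmappableDemo {
    // Encode text as ISO-8859-1; String.getBytes substitutes '?' (0x3F)
    // for every character the charset cannot represent.
    static byte[] toLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] bytes = toLatin1("A\u82f1"); // 'A' followed by a Chinese character
        for (byte b : bytes) {
            System.out.printf("%02X ", b & 0xFF); // 41 3F
        }
        System.out.println();

        // The original character is lost: decoding gives "A?".
        System.out.println(new String(bytes, StandardCharsets.ISO_8859_1));
    }
}
```

Once a character has been replaced by 0x3F, no later transcoding can recover it, so the fix has to be applied at the point where the String is encoded.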

Unicode value              UTF-8 encoding
U-00000000 - U-0000007F:   0xxxxxxx
U-00000080 - U-000007FF:   110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF:   1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF:   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF:   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF:   1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
The 5-byte and 6-byte forms belong to the original UTF-8 definition. RFC 3629 limits UTF-8 to U+10FFFF, so current encoders emit at most 4 bytes per character.
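To make the bit layout in the table concrete, here is a hand-written encoder for the first four rows, checked against the JDK's own UTF-8 encoder. It is a sketch for illustration only: the class name Utf8TableDemo and the method encode() are invented here, and the method does not reject surrogate values or out-of-range input as a production encoder would.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8TableDemo {
    // Encode one code point (U+0000..U+10FFFF) by hand, one branch per table row.
    static byte[] encode(int cp) {
        if (cp < 0x80) {
            return new byte[] { (byte) cp };
        } else if (cp < 0x800) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else if (cp < 0x10000) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else {
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
    }

    public static void main(String[] args) {
        // One sample per row: 1, 2, 3, and 4 bytes.
        int[] samples = { 0x41, 0xE9, 0x82F1, 0x1F600 };
        for (int cp : samples) {
            byte[] expected = new String(Character.toChars(cp))
                    .getBytes(StandardCharsets.UTF_8);
            System.out.println(Integer.toHexString(cp) + " matches library: "
                    + Arrays.equals(encode(cp), expected));
        }
    }
}
```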

III. Related functions in J2SE
import java.io.InputStream;
import java.io.InputStreamReader;

String str = "英";
// Obtain the GB2312-encoded bytes
byte[] bytesGb2312 = str.getBytes("gb2312");

// Obtain the bytes in the platform default encoding (for example ISO8859_1 on Solaris, GB2312 on Chinese Windows)
byte[] bytesDefault = str.getBytes();

// Convert bytes into a String using the specified encoding
String newStrGb = new String(bytesGb2312, "gb2312");

// Convert bytes into a String using the platform default encoding (for example ISO8859_1 on Solaris, GB2312 on Chinese Windows)
String newStrDefault = new String(bytesDefault);

// Read characters from a byte stream with the specified encoding
InputStream in = xxx; // xxx stands for any InputStream
InputStreamReader reader = new InputStreamReader(in, "gb2312");
int c = reader.read(); // read() returns an int, or -1 at end of stream
char aChar = (char) c;
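The stream-reading fragment above leaves the input stream unspecified. A self-contained sketch could use an in-memory stream in its place. The class name ReaderDemo and the method readAll() are hypothetical, and the example assumes the JDK provides the GB2312 charset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;

class ReaderDemo {
    // Decode every byte of the stream with the named charset and return the text.
    static String readAll(InputStream in, String charsetName) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charsetName)) {
            int c;
            while ((c = reader.read()) != -1) { // -1 signals end of stream
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // D3 A2 is the GB2312 byte sequence of U+82F1 from section 2.1.
        byte[] gbBytes = { (byte) 0xD3, (byte) 0xA2 };
        String text = readAll(new ByteArrayInputStream(gbBytes), "gb2312");
        System.out.println(text.length());                        // 1
        System.out.println(Integer.toHexString(text.charAt(0)));  // 82f1
    }
}
```

Two bytes go in and a single char comes out, because the reader, not the caller, applies the decoding.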
IV. JSP and database encoding
4.1 Encoding in JSP
(1) Static declaration: set the charset in the page directive, for example <%@ page contentType="text/html; charset=gb2312" %>
The charset has two functions:
JSP file encoding: the encoding used to read Chinese characters in the JSP source file when the file is read and translated into a Java class
JSP output stream encoding: the encoding used to write data to the response stream while the JSP executes
(2) Dynamic change: call response.setContentType() to set the correct charset before writing data to the response stream.
(3) In Tomcat, the parameters returned by request.getParameter() are all decoded as ISO8859_1. So if you enter the Chinese character "英" in a browser input box, the server side gets an ISO8859_1-decoded String of two chars (0x00D3, 0x00A2). Parameters are therefore usually transcoded when they are received:
String wrongStr = request.getParameter("name");
String correctStr = new String(wrongStr.getBytes("iso8859_1"), "gb2312");
In newer versions of the servlet specification, you can instead execute the following code before obtaining any parameters:
request.setCharacterEncoding("gb2312");
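The transcoding trick can be checked without a servlet container. The sketch below simulates what the container does (decoding GB2312 bytes as ISO8859_1) and then undoes it. The class ParamRecoveryDemo and its methods misdecode() and recover() are invented for this illustration and are not part of the servlet API. It also assumes the JDK provides the GB2312 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ParamRecoveryDemo {
    // Simulate a container that decodes the browser's bytes as ISO-8859-1
    // although the browser actually sent them in another charset.
    static String misdecode(String original, Charset real) {
        return new String(original.getBytes(real), StandardCharsets.ISO_8859_1);
    }

    // Undo the wrong decoding: get the raw bytes back, then decode them correctly.
    static String recover(String wrong, Charset real) {
        return new String(wrong.getBytes(StandardCharsets.ISO_8859_1), real);
    }

    public static void main(String[] args) {
        Charset gb2312 = Charset.forName("GB2312"); // assumption: available in this JDK
        String typed = "\u82f1"; // what the user entered in the browser

        String wrong = misdecode(typed, gb2312);
        for (char c : wrong.toCharArray()) {
            System.out.print(Integer.toHexString(c) + " "); // d3 a2
        }
        System.out.println();

        System.out.println(recover(wrong, gb2312).equals(typed)); // true
    }
}
```

The recovery works because ISO8859_1 maps every byte value 0-255 to exactly one char, so the wrong decoding loses no information.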

4.2 Encoding of the database
(1) The database uses UTF-16
If the string contains characters outside the ISO8859_1 range, such as Chinese, no transcoding is required during write and read operations.
(2) The database uses ISO8859_1
If the string contains such characters, transcoding is required during write and read operations.
Write: String newStr = new String(oldStr.getBytes("gb2312"), "iso8859_1");
Read: String newStr = new String(oldStr.getBytes("iso8859_1"), "gb2312");
V. Source file encoding
5.1 Resource files
The encoding of a resource file depends on the platform it was edited on. Resource files written on Windows platforms are encoded in GB2312. They must be transcoded at build time to be read correctly on every platform:
native2ascii -encoding gb2312 source.properties
After this step, the correct Unicode string is read from the resource file.
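The effect of this transcoding can be illustrated in Java itself. The sketch below is not a replacement for the native2ascii tool: the class EscapedPropertiesDemo and its methods escape() and loadValue() are hypothetical helpers that imitate the escaping and then load the result with java.util.Properties, whose load(InputStream) method reads ISO8859_1 and interprets Unicode escapes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class EscapedPropertiesDemo {
    // Replace every non-ASCII char with a backslash-u escape of four hex digits.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    // Load properties text the way a .properties file is read from disk.
    static String loadValue(String fileContent, String key) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(
                    fileContent.getBytes(StandardCharsets.ISO_8859_1)));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        String name = "\u82f1\u65af\u514b"; // three Chinese characters
        String line = "company.name=" + escape(name);
        System.out.println(line); // the escaped, pure-ASCII form of the entry
        System.out.println(loadValue(line, "company.name").equals(name)); // true
    }
}
```

Because the escaped file contains only ASCII, it reads back the same way regardless of the platform's default encoding.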
5.2 Source files
The encoding of a source file also depends on the platform it was edited on. Source files developed on Windows platforms are encoded in GB2312. Specify the source file encoding when compiling:
javac -encoding gb2312
In the class files generated by the Java compiler, string constants are stored in a modified UTF-8 format.
① Tomcat 4.1.18, the latest version at the time of writing, supports request.setCharacterEncoding(String enc)
② After transcoding, the resource file contains company.name=\u82f1\u65af\u514b
③ This transcoding is not required if the database uses UTF-16
④ The transcoding should be:
Transcoding I:
String s = new String(request.getParameter("name").getBytes("iso8859_1"), "gb2312");
Transcoding II:
String s = new String(name.getBytes("gb2312"), "iso8859_1");
Transcoding III:
String s = new String(name.getBytes("iso8859_1"), "gb2312");

 
