Java Basics: Explanation of Chinese Characters in Java

Source: Internet
Author: User
Author: posting date: 2006.06.05 Source: jspcn
Let me explain how Tomcat implements JSP.

Prerequisites:

1. Bytes and Unicode

Java is Unicode at its core, and even class files are. Many media, however, including files and streams, store data as byte streams, so Java has to convert these bytes. A char is Unicode, while a byte is just a byte.

In Java, the byte/char conversion functions live in the sun.io package. The ByteToCharConverter class does the dispatching and can tell you which converter you are using. Its two most commonly used static functions are:

    public static ByteToCharConverter getDefault();

    public static ByteToCharConverter getConverter(String encoding);

If you do not specify a converter, the system automatically uses the current platform encoding: GBK on a GB (Chinese) platform and 8859_1 on an English platform.
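The sun.io converter classes are internal JDK classes rather than a public API, and they were removed from later JDKs. As a minimal sketch of the same two lookups using the public java.nio.charset API (the class name CharsetLookup and the method canonicalName are illustrative and not part of the original article):

```java
import java.nio.charset.Charset;

class CharsetLookup {
    /** Resolves an encoding name or alias (for example "8859_1") to its canonical charset name. */
    static String canonicalName(String encoding) {
        return Charset.forName(encoding).name();
    }

    public static void main(String[] args) {
        // Platform default: roughly the counterpart of getDefault() in the text.
        System.out.println("default: " + Charset.defaultCharset().name());
        // Lookup by name: roughly the counterpart of getConverter(encoding).
        System.out.println("8859_1 -> " + canonicalName("8859_1"));
        System.out.println("gb2312 -> " + canonicalName("gb2312"));
    }
}
```

The default charset printed by main depends on the machine it runs on, which is exactly the portability concern this section raises.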

  

Let's take a simple example:

The GB code of "你" ("you") is 0xC4E3, and its Unicode value is 0x4F60.

You use:

    String encoding = "gb2312";
    byte[] b = {(byte) 0xC4, (byte) 0xE3};
    ByteToCharConverter converter = ByteToCharConverter.getConverter(encoding);
    char[] c = converter.convertAll(b);
    for (int i = 0; i < c.length; i++)
    {
        System.out.println(Integer.toHexString(c[i]));
    }

This prints 4f60, that is, 0x4F60. If 8859_1 encoding is used instead, the output is 0x00C4, 0x00E3.

(Example 1)
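The byte-to-char conversion of Example 1 can also be sketched with the public API, assuming the running JDK provides the GB2312 charset. The class DecodeExample and its helper decodeToHex are hypothetical names used only for illustration:

```java
import java.nio.charset.Charset;

class DecodeExample {
    /** Decodes the bytes with the named encoding and returns each resulting char as hex. */
    static String decodeToHex(byte[] bytes, String encoding) {
        String decoded = new String(bytes, Charset.forName(encoding));
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < decoded.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(Integer.toHexString(decoded.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] b = {(byte) 0xC4, (byte) 0xE3};
        System.out.println(decodeToHex(b, "GB2312"));     // 4f60
        System.out.println(decodeToHex(b, "ISO-8859-1")); // c4 e3
    }
}
```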

The reverse direction:

    String encoding = "gb2312";
    char[] c = {'\u4f60'};
    CharToByteConverter converter = CharToByteConverter.getConverter(encoding);
    byte[] b = converter.convertAll(c);
    for (int i = 0; i < b.length; i++)
    {
        // Mask with 0xFF so that a negative byte prints as two hex digits.
        System.out.println(Integer.toHexString(b[i] & 0xFF));
    }

The output is 0xC4, 0xE3.

(Example 2)

If 8859_1 is used, the output is 0x3F, that is "?", which indicates that the character cannot be converted.
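The char-to-byte direction of Example 2 can be sketched the same way with String.getBytes, again assuming GB2312 is available in the JDK. EncodeExample and encodeToHex are illustrative names, not part of the original article:

```java
import java.nio.charset.Charset;

class EncodeExample {
    /** Encodes the text with the named encoding and returns each byte as hex. */
    static String encodeToHex(String text, String encoding) {
        byte[] bytes = text.getBytes(Charset.forName(encoding));
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            // Mask with 0xFF: Java bytes are signed, so 0xC4 would otherwise print as ffffffc4.
            sb.append(Integer.toHexString(bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodeToHex("\u4f60", "GB2312"));     // c4 e3
        System.out.println(encodeToHex("\u4f60", "ISO-8859-1")); // 3f
    }
}
```

String.getBytes silently substitutes the charset's replacement byte for characters it cannot encode, which is why the second call yields 0x3F ("?") instead of failing.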

Many Chinese-character problems derive from these two simple classes. Yet many classes do not directly accept an encoding as input, which is a real inconvenience. Many programs also rarely specify an encoding and simply use the default one, which makes porting difficult.

2. UTF-8

UTF-8 corresponds one-to-one with Unicode, and its implementation is very simple:

    7-bit Unicode:  0xxxxxxx
    11-bit Unicode: 110xxxxx 10xxxxxx
    16-bit Unicode: 1110xxxx 10xxxxxx 10xxxxxx
    21-bit Unicode: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

In most cases, only Unicode values of 16 bits or fewer are used.

We continue with the example above: the GB code of "你" is 0xC4E3 and its Unicode value is 0x4F60.

Example 1: 0xC4E3 in binary:

    11000100 11100011

Since there are only two bytes, we try to read them as a two-byte sequence (110xxxxx 10xxxxxx). This is not feasible: the second byte, 11100011, does not have the form 10xxxxxx because its 7th bit is not 0, so "?" is returned.
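The rejection described above can be checked with a strict decoder. This is a sketch only: StrictUtf8 and isValidUtf8 are hypothetical names, and the decoder is configured to report malformed input instead of substituting a replacement character:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8 {
    /** Returns true only if the bytes are well-formed UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown when a byte does not fit the expected lead/continuation pattern.
            return false;
        }
    }

    public static void main(String[] args) {
        // GB bytes of the example character: the second byte is not of the form 10xxxxxx.
        System.out.println(isValidUtf8(new byte[] {(byte) 0xC4, (byte) 0xE3})); // false
    }
}
```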

Example 2: 0x4F60 in binary:

    01001111 01100000

We fill these 16 bits into the three-byte UTF-8 pattern:

    11100100 10111101 10100000

    E4       BD       A0

This returns 0xE4, 0xBD, 0xA0.
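The bit-filling step can be written out by hand. The following is a minimal sketch of the four UTF-8 patterns, with the illustrative name Utf8Encoder; it assumes a valid code point and does not check for surrogates or values above 0x10FFFF:

```java
class Utf8Encoder {
    /** Encodes one Unicode code point into UTF-8 bytes by filling the bit patterns. */
    static byte[] encode(int cp) {
        if (cp < 0x80) {            // up to 7 bits: 0xxxxxxx
            return new byte[] {(byte) cp};
        } else if (cp < 0x800) {    // up to 11 bits: 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        } else if (cp < 0x10000) {  // up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        } else {                    // up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
    }

    public static void main(String[] args) {
        for (byte value : encode(0x4F60)) {
            System.out.printf("%02X ", value & 0xFF); // E4 BD A0
        }
        System.out.println();
    }
}
```

In real code, "\u4f60".getBytes(StandardCharsets.UTF_8) produces the same three bytes; the hand-written version is only meant to make the patterns visible.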


3. String and byte[]

At its core a String is really a char[], but bytes must be decoded with an encoding before they can become a String. String.length() is actually the length of the char array. If a different encoding is used, characters may be split in the wrong places, producing scattered and garbled characters.

Example:

    byte[] b = {(byte) 0xC4, (byte) 0xE3};
    String str = new String(b, encoding);

If encoding = 8859_1, str contains two characters, but with encoding = gb2312 it contains only one.

This problem occurs frequently when processing pages.
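To make the splitting problem concrete, here is a small sketch (SplitDemo, charCount and decodePrefix are hypothetical names, and GB2312 is assumed to be available). It counts chars under two encodings and then cuts a byte array in the middle of a two-byte character:

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class SplitDemo {
    /** Number of chars obtained when the bytes are decoded with the named encoding. */
    static int charCount(byte[] bytes, String encoding) {
        return new String(bytes, Charset.forName(encoding)).length();
    }

    /** Decodes only the first byteCount bytes, as a naive byte-based split would. */
    static String decodePrefix(byte[] bytes, int byteCount, String encoding) {
        return new String(Arrays.copyOf(bytes, byteCount), Charset.forName(encoding));
    }

    public static void main(String[] args) {
        // Two Chinese characters occupy four bytes in GB2312.
        byte[] bytes = "\u4f60\u597d".getBytes(Charset.forName("GB2312"));
        System.out.println(charCount(bytes, "GB2312"));     // 2
        System.out.println(charCount(bytes, "ISO-8859-1")); // 4
        // Cutting after 3 bytes splits the second character, so the text no longer round-trips.
        System.out.println(decodePrefix(bytes, 3, "GB2312").equals("\u4f60\u597d")); // false
    }
}
```

Splitting by char index on an already decoded String avoids cutting a character in half.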

4. Reader, Writer / InputStream, OutputStream

The core of Reader and Writer is char, while the core of InputStream and OutputStream is byte. The main purpose of Reader and Writer is to read chars from an InputStream and write chars to an OutputStream.

Example of a reader:

The file test.txt contains only the single character "你", stored as 0xC4, 0xE3.

    String encoding = "gb2312"; // or "8859_1"
    InputStreamReader reader = new InputStreamReader(
        new FileInputStream("test.txt"), encoding);
    char[] c = new char[10];
    int length = reader.read(c);
    for (int i = 0; i < length; i++)
        System.out.println(c[i]);

If encoding is gb2312, only one character is read. If encoding = 8859_1, two characters are read.
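A self-contained variant of the reader example is sketched below. It creates its own temporary file instead of relying on an existing test.txt; ReaderDemo, writeSample and countChars are illustrative names, and GB2312 is assumed to be available:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

class ReaderDemo {
    /** Creates a temporary file that holds only the two bytes 0xC4 0xE3. */
    static Path writeSample() {
        try {
            Path file = Files.createTempFile("test", ".txt");
            Files.write(file, new byte[] {(byte) 0xC4, (byte) 0xE3});
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the file through an InputStreamReader and returns how many chars one read yields. */
    static int countChars(Path file, String encoding) {
        try (InputStreamReader reader =
                 new InputStreamReader(new FileInputStream(file.toFile()), encoding)) {
            char[] c = new char[10];
            // For a file this small, a single read call returns all decoded chars.
            return reader.read(c);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = writeSample();
        System.out.println(countChars(file, "GB2312"));     // 1
        System.out.println(countChars(file, "ISO-8859-1")); // 2
    }
}
```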


2. We need to understand the Java compiler:

    javac -encoding

We often omit the -encoding parameter, yet it is very important for cross-platform work. If no encoding is specified, the system default encoding is used: gb2312 on a GB (Chinese) platform and ISO8859_1 on an English platform.

The Java compiler actually calls the sun.tools.javac.Main class to compile the file. The compile function of this class has an encoding variable, and the -encoding parameter is passed directly to it. The compiler reads the Java source file according to this variable and then compiles it into a class file stored in UTF-8 form.

Example:

    public void test() throws IOException
    {
        String str = "你";
        FileWriter write = new FileWriter("test.txt");
        write.write(str);
        write.close();
    }

(Example 3)

If you compile with gb2312, you will find the bytes E4 BD A0 in the class file.

If you compile with 8859_1, the two source bytes are read as the characters 00C4 00E3, in binary:

    00000000 11000100 00000000 11100011

Because each character needs more than 7 bits, the 11-bit encoding is used:

    11000011 10000100 11000011 10100011

    C3       84       C3       A3

You will find C3 84 C3 A3 in the class file.
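The effect of the source encoding can be imitated without running the compiler. The sketch below decodes the same two source bytes under each assumed encoding and re-encodes the result as UTF-8. SourceEncodingDemo and storedUtf8Hex are hypothetical names, and standard UTF-8 stands in for the class-file string format, which gives the same bytes for these particular characters:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SourceEncodingDemo {
    /** Decodes source bytes with the assumed source encoding, then re-encodes the text as UTF-8. */
    static String storedUtf8Hex(byte[] sourceBytes, String sourceEncoding) {
        String literal = new String(sourceBytes, Charset.forName(sourceEncoding));
        byte[] stored = literal.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte value : stored) {
            sb.append(String.format("%02X ", value & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // The literal as saved in a GB2312 source file.
        byte[] source = {(byte) 0xC4, (byte) 0xE3};
        System.out.println(storedUtf8Hex(source, "GB2312"));     // E4 BD A0
        System.out.println(storedUtf8Hex(source, "ISO-8859-1")); // C3 84 C3 A3
    }
}
```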

    

However, we often ignore this parameter, so cross-platform problems are common:

    Compiling Example 3 on the Chinese platform generates ZhClass.
    Compiling Example 3 on the English platform generates EnClass.

    1. ZhClass runs OK on the Chinese platform, but not on the English platform.
    2. EnClass runs OK on the English platform, but not on the Chinese platform.

Cause:

    1. After compiling on the Chinese platform, the char[] of str at run time is 0x4F60. When run on the Chinese platform, the default FileWriter encoding is gb2312, so CharToByteConverter automatically uses the gb2312 converter.
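One way to remove the dependence on the platform default when writing is to name the encoding explicitly. This is a sketch under that assumption, not the article's own fix; ExplicitWriter and writeWithEncoding are illustrative names:

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitWriter {
    /** Writes the text to a temporary file with an explicit encoding and returns the bytes written. */
    static byte[] writeWithEncoding(String text, String encoding) {
        try {
            Path file = Files.createTempFile("test", ".txt");
            // OutputStreamWriter takes the encoding as a parameter instead of using the default.
            try (Writer writer =
                     new OutputStreamWriter(new FileOutputStream(file.toFile()), encoding)) {
                writer.write(text);
            }
            byte[] written = Files.readAllBytes(file);
            Files.delete(file);
            return written;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (byte value : writeWithEncoding("\u4f60", "GB2312")) {
            System.out.printf("%02X ", value & 0xFF); // C4 E3
        }
        System.out.println();
    }
}
```

With the encoding fixed in code, the bytes written are the same on a Chinese and an English platform, provided the string itself was compiled from correctly decoded source.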
