Java basics-Explanation of Chinese Characters in Java

Last Update:2018-12-05 Source: Internet

Author: User


Author: posting date: 2006.06.05 Source: jspcn


Let me explain how Tomcat implements JSP.

Prerequisites:

1. bytes and Unicode

Java is Unicode internally, and so are class files. Many media, however, including files and streams, store data as byte streams, so Java has to convert those bytes. A char is a Unicode value, while a byte is just a byte.

The byte/char conversion functions live in the sun.io package. The ByteToCharConverter class acts as the dispatcher and can tell you which converter is in use. Its two most commonly used static functions are:

    public static ByteToCharConverter getDefault();
    public static ByteToCharConverter getConverter(String encoding);

If you do not specify a converter, the system uses the current default encoding: GBK on a GB (Chinese) platform and 8859_1 on an English platform.

Let's take a simple example. The GB code of "你" ("you") is 0xc4e3, and its Unicode value is 0x4f60. You use:

    String encoding = "gb2312";
    byte[] b = {(byte) 0xc4, (byte) 0xe3};
    ByteToCharConverter converter = ByteToCharConverter.getConverter(encoding);
    char[] c = converter.convertAll(b);
    for (int i = 0; i < c.length; i++)
    {
        System.out.println(Integer.toHexString(c[i]));
    }

This prints 4f60, that is, 0x4f60. If 8859_1 encoding is used instead, the output is c4 and e3, that is, 0x00c4 and 0x00e3.

(Example 1)
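The same byte-to-char conversion can be sketched with the public java.nio.charset API instead of the internal sun.io converters. This is a minimal sketch, not the article's original code: the class name DecodeDemo and the helper decodeToHex are made up for illustration, and it assumes the running JVM provides the GB2312 charset, which is not one of the charsets every JVM is required to support.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DecodeDemo {
    /** Decodes the bytes with the given charset and returns each char as hex, space-separated. */
    static String decodeToHex(byte[] bytes, Charset charset) {
        String decoded = new String(bytes, charset);
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < decoded.length(); i++) {
            if (i > 0) {
                hex.append(' ');
            }
            hex.append(Integer.toHexString(decoded.charAt(i)));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        byte[] b = {(byte) 0xc4, (byte) 0xe3};
        // Assumption: GB2312 is available on this JVM.
        System.out.println(decodeToHex(b, Charset.forName("GB2312")));   // 4f60
        System.out.println(decodeToHex(b, StandardCharsets.ISO_8859_1)); // c4 e3
    }
}
```

The two bytes become one char under GB2312 and two chars under ISO-8859-1, which matches the output described for Example 1.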

The reverse direction:

    String encoding = "gb2312";
    char[] c = {'\u4f60'};
    CharToByteConverter converter = CharToByteConverter.getConverter(encoding);
    byte[] b = converter.convertAll(c);
    for (int i = 0; i < b.length; i++)
    {
        // Mask with 0xff so that negative bytes do not print as ffffffc4
        System.out.println(Integer.toHexString(b[i] & 0xff));
    }

The output is c4 and e3, that is, 0xc4, 0xe3.

(Example 2)

If 8859_1 is used, the output is 0x3f, that is, "?", which indicates that the character cannot be converted.
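The char-to-byte direction can be sketched the same way with String.getBytes(Charset). The names EncodeDemo and encodeToHex are hypothetical, and GB2312 availability on the JVM is again an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodeDemo {
    /** Encodes the text with the given charset and returns each byte as hex, space-separated. */
    static String encodeToHex(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                hex.append(' ');
            }
            // Mask with 0xff: Java bytes are signed, so 0xc4 would otherwise print as ffffffc4.
            hex.append(Integer.toHexString(bytes[i] & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // Assumption: GB2312 is available on this JVM.
        System.out.println(encodeToHex("\u4f60", Charset.forName("GB2312")));   // c4 e3
        System.out.println(encodeToHex("\u4f60", StandardCharsets.ISO_8859_1)); // 3f
    }
}
```

ISO-8859-1 has no mapping for U+4F60, so getBytes substitutes the charset's replacement byte, 0x3f ("?").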

Many Chinese-character problems stem from these two simple classes. Many other classes, however, do not let you pass in an encoding directly, which is inconvenient. Many programs rarely specify an encoding at all and simply use the default one, which makes porting difficult.

2. UTF-8

UTF-8 corresponds one-to-one with Unicode, and its implementation is very simple:

    7-bit Unicode:  0xxxxxxx
    11-bit Unicode: 110xxxxx 10xxxxxx
    16-bit Unicode: 1110xxxx 10xxxxxx 10xxxxxx
    21-bit Unicode: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

In most cases only Unicode values of 16 bits or fewer are used.

We reuse the example above: the GB code of "你" ("you") is 0xc4e3 and its Unicode value is 0x4f60.

Example 1: 0xc4e3 in binary:

    11000100 11100011

Since there are only two bytes, we try to lay them out as the two-byte encoding, but this does not work: the 7th bit of the second byte is not 0, so "?" is returned.
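Why the GB bytes fail as UTF-8 can be checked with a strict decoder. The sketch below is an illustration with hypothetical names (Utf8Check, isValidUtf8); it asks a CharsetDecoder to report malformed input instead of silently substituting a replacement character.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Check {
    /** Returns true only if the bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // The decoder rejected the input, e.g. a lead byte without a 10xxxxxx continuation.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidUtf8(new byte[]{(byte) 0xc4, (byte) 0xe3}));              // false
        System.out.println(isValidUtf8(new byte[]{(byte) 0xe4, (byte) 0xbd, (byte) 0xa0})); // true
    }
}
```

The byte 0xc4 announces a two-byte sequence, but 0xe3 does not start with the bits 10, so the pair is rejected.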

Example 2: 0x4f60 in binary:

    01001111 01100000

We fill these bits into the three-byte UTF-8 layout:

    11100100 10111101 10100000
    E4       BD       A0

The result is 0xe4, 0xbd, 0xa0.
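The bit-filling step for UTF-8 can be written out by hand. This is a minimal sketch under the assumption that the input is a valid Unicode code point that is not a surrogate; Utf8Sketch and encode are hypothetical names, and in practice String.getBytes(StandardCharsets.UTF_8) does this work.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8Sketch {
    /** Encodes one code point (0..0x10FFFF, surrogates not checked) into UTF-8 by hand. */
    static byte[] encode(int cp) {
        if (cp < 0x80) {            // up to 7 bits: 0xxxxxxx
            return new byte[]{(byte) cp};
        }
        if (cp < 0x800) {           // up to 11 bits: 110xxxxx 10xxxxxx
            return new byte[]{
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        if (cp < 0x10000) {         // up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[]{
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        return new byte[]{          // up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))};
    }

    public static void main(String[] args) {
        byte[] manual = encode(0x4f60);
        byte[] library = "\u4f60".getBytes(StandardCharsets.UTF_8);
        // Both arrays hold 0xe4, 0xbd, 0xa0.
        System.out.println(Arrays.equals(manual, library)); // true
    }
}
```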

3. String and byte []

A String is essentially a char[] at its core, but bytes must go through an encoding to be converted into a String. String.length() is really the length of that char array. If a different encoding is used, the bytes may be split in the wrong places, producing stray characters and garbled text.

Example:

    byte[] b = {(byte) 0xc4, (byte) 0xe3};
    String str = new String(b, encoding);

If encoding is 8859_1, str contains two characters; if encoding is gb2312, it contains only one.

This problem occurs frequently when handling pagination.
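The effect on String.length() is easy to reproduce. A minimal sketch, assuming GB2312 is available on the JVM; LengthDemo and decodedLength are hypothetical names.

```java
import java.nio.charset.Charset;

final class LengthDemo {
    /** Returns the number of chars produced when the bytes are decoded with the named charset. */
    static int decodedLength(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName)).length();
    }

    public static void main(String[] args) {
        byte[] b = {(byte) 0xc4, (byte) 0xe3};
        // Assumption: GB2312 is available on this JVM.
        System.out.println(decodedLength(b, "GB2312"));     // 1
        System.out.println(decodedLength(b, "ISO-8859-1")); // 2
    }
}
```

Code that cuts text at a byte or character count computed under one encoding and then decodes it under another can split a two-byte character in half, which is the pagination problem mentioned above.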

4. Reader, writer/inputstream, outputstream

Reader and Writer work with chars at their core, while InputStream and OutputStream work with bytes. The main purpose of Reader and Writer, however, is to read chars from an InputStream and write chars to an OutputStream.

Example of a reader. The file test.txt contains only the single character "你" ("you"), that is, the bytes 0xc4, 0xe3:

    String encoding = ...; // "gb2312" or "8859_1"
    InputStreamReader reader = new InputStreamReader(
        new FileInputStream("test.txt"), encoding);
    char[] c = new char[10];
    int length = reader.read(c);
    for (int i = 0; i < length; i++)
        System.out.println(c[i]);

If encoding is gb2312, there is only one character. If encoding is 8859_1, there are two characters.
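The reader example can be tried without a file by wrapping the two bytes in a ByteArrayInputStream. This is a sketch with hypothetical names (ReaderDemo, countChars), and it assumes GB2312 is available on the JVM.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ReaderDemo {
    /** Reads up to 10 chars from the bytes using the given charset and returns how many were read. */
    static int countChars(byte[] content, Charset charset) {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(content), charset)) {
            char[] c = new char[10];
            return reader.read(c); // -1 if the input is empty
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] fileContent = {(byte) 0xc4, (byte) 0xe3}; // stands in for test.txt
        // Assumption: GB2312 is available on this JVM.
        System.out.println(countChars(fileContent, Charset.forName("GB2312")));   // 1
        System.out.println(countChars(fileContent, StandardCharsets.ISO_8859_1)); // 2
    }
}
```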

2. We need to understand the Java compiler:

javac -encoding

We often leave out the encoding parameter, but it is very important for cross-platform work. If encoding is not specified, the system default encoding is used: gb2312 on a GB (Chinese) platform and iso8859_1 on an English platform.

The Java compiler actually calls the sun.tools.javac.Main class to compile the file. The compile function of this class has an encoding variable, and the -encoding parameter is passed directly to it. The compiler uses this variable to read the Java source file and then writes the class file with its strings stored in UTF-8 form.

Example:

    public void test() throws IOException
    {
        String str = "你";
        FileWriter write = new FileWriter("test.txt");
        write.write(str);
        write.close();
    }

(Example 3)

If you compile with gb2312, you will find the bytes E4 BD A0 in the class file.

If you compile with 8859_1, the two source bytes are read as the two characters 00c4 00e3, in binary:

    00000000 11000100 00000000 11100011

Because each character needs more than 7 bits, the 11-bit encoding is used:

    11000011 10000100 11000011 10100011
    C3       84       C3       A3

You will find C3 84 C3 A3 in the class file.
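The two compilation outcomes can be simulated without running javac: decode the source bytes with the chosen encoding, then encode the resulting chars as UTF-8. This sketch only imitates those two steps and is not how the compiler is actually invoked. The names CompileEncodingSketch and storedBytesHex are hypothetical, GB2312 availability is assumed, and standard UTF-8 is used because it gives the same bytes as the class-file string format for these particular characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CompileEncodingSketch {
    /** Reads source bytes with the given encoding, then returns the UTF-8 bytes of the result as hex. */
    static String storedBytesHex(byte[] sourceBytes, Charset sourceEncoding) {
        String literal = new String(sourceBytes, sourceEncoding); // what the compiler "sees"
        byte[] stored = literal.getBytes(StandardCharsets.UTF_8); // what ends up being stored
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < stored.length; i++) {
            if (i > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02x", stored[i] & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        byte[] source = {(byte) 0xc4, (byte) 0xe3}; // the character in a GB-encoded source file
        // Assumption: GB2312 is available on this JVM.
        System.out.println(storedBytesHex(source, Charset.forName("GB2312")));   // e4 bd a0
        System.out.println(storedBytesHex(source, StandardCharsets.ISO_8859_1)); // c3 84 c3 a3
    }
}
```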

However, we often ignore this parameter, so cross-platform problems are common:

Compile Example 3 on the Chinese platform to produce zhclass.

Compile Example 3 on the English platform to produce enclass.

1. zhclass runs correctly on the Chinese platform, but not on the English platform.

2. enclass runs correctly on the English platform, but not on the Chinese platform.

Cause:

1. After compiling on the Chinese platform, the char[] of str at run time is 0x4f60. When the program runs on the Chinese platform, the default FileWriter encoding is gb2312, so CharToByteConverter automatically uses the gb2312 converter.
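One way to avoid depending on the platform default when writing is to pass the charset explicitly. The sketch below uses an OutputStreamWriter over an in-memory stream so it can run anywhere. ExplicitWriterDemo and write are hypothetical names, and GB2312 availability is assumed.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ExplicitWriterDemo {
    /** Writes the text through a Writer with an explicit charset and returns the bytes produced. */
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer has been closed (and flushed) at this point.
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Assumption: GB2312 is available on this JVM.
        byte[] gb = write("\u4f60", Charset.forName("GB2312"));
        byte[] utf8 = write("\u4f60", StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(gb, new byte[]{(byte) 0xc4, (byte) 0xe3}));                // true
        System.out.println(Arrays.equals(utf8, new byte[]{(byte) 0xe4, (byte) 0xbd, (byte) 0xa0})); // true
    }
}
```

For a real file, the same idea applies with new OutputStreamWriter(new FileOutputStream("test.txt"), charset) in place of the in-memory stream, so the bytes written no longer depend on which platform runs the class.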


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
