Q: What is the "ISO 8859-1" character set? Are there other character sets, and how do they differ?

Last Update:2018-12-07 Source: Internet

Author: User


Today I checked the java.util.Properties API documentation and found that the load() method of Properties requires the InputStream it reads to be encoded in the ISO 8859-1 character set.
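The behavior described above can be checked with a small sketch. It is only an illustration: the class name, the helper names, and the "greeting" key are made up for this example. It relies on two things the standard library does provide: Properties.load(InputStream) decodes bytes as ISO 8859-1 and understands backslash-u Unicode escapes, and Properties.load(Reader) lets the caller choose the charset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class PropertiesEncodingDemo {
    // File content that is pure ASCII: the value is written with Unicode escapes.
    static byte[] escapedFile() {
        return "greeting=\\u4e2d\\u6587\n".getBytes(StandardCharsets.ISO_8859_1);
    }

    // The same value written as raw UTF-8 bytes.
    static byte[] utf8File() {
        return "greeting=\u4e2d\u6587\n".getBytes(StandardCharsets.UTF_8);
    }

    // load(InputStream) always decodes the bytes as ISO 8859-1.
    static String viaStream(byte[] data) {
        try {
            Properties p = new Properties();
            p.load(new ByteArrayInputStream(data));
            return p.getProperty("greeting");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // load(Reader) uses whatever charset the Reader was built with (UTF-8 here).
    static String viaUtf8Reader(byte[] data) {
        try {
            Properties p = new Properties();
            p.load(new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8));
            return p.getProperty("greeting");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String expected = "\u4e2d\u6587";
        System.out.println(expected.equals(viaStream(escapedFile())));   // true
        System.out.println(expected.equals(viaStream(utf8File())));      // false: UTF-8 bytes read as ISO 8859-1
        System.out.println(expected.equals(viaUtf8Reader(utf8File())));  // true
    }
}
```

So a properties file read through an InputStream should either stay within ISO 8859-1 or use escapes for everything else; a Reader is the alternative when the file is stored in another encoding.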

When I open Notepad and choose Save, I also find that on Windows 2000 and later an "Encoding" option appears, offering "ANSI", "Unicode", "Unicode big endian", and "UTF-8".
Can you explain these character set encodings and their impact on java.io usage?
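To make the difference between these options concrete, here is a minimal sketch that dumps the bytes one Chinese character becomes under several charsets. It assumes that Notepad's "Unicode" corresponds to little-endian UTF-16 and "Unicode big endian" to big-endian UTF-16, and it leaves out "ANSI" because that means whatever the system code page is. The class and method names are invented for the example, and the byte order marks Notepad may write at the start of a file are not shown.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class NotepadEncodings {
    // Returns the bytes of text under the given charset as space-separated hex.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String zhong = "\u4e2d"; // the character U+4E2D
        System.out.println("UTF-8:      " + hex(zhong, StandardCharsets.UTF_8));      // E4 B8 AD
        System.out.println("UTF-16LE:   " + hex(zhong, StandardCharsets.UTF_16LE));   // 2D 4E
        System.out.println("UTF-16BE:   " + hex(zhong, StandardCharsets.UTF_16BE));   // 4E 2D
        System.out.println("ISO-8859-1: " + hex(zhong, StandardCharsets.ISO_8859_1)); // 3F, i.e. "?"
    }
}
```

The same character occupies three, two, or two bytes depending on the encoding, and ISO 8859-1 cannot represent it at all.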

_________________________________________________________________________________________

In-depth discussion on Chinese character issues

I. Topic: Chinese character problems in Java
Java has well-known problems with Chinese text, which show up mainly in console output, JSP page output, and database access.
This article sets font problems aside and deals only with encoding. It explains where Java's Chinese character problems come from and,
taking database access through JDBC as an example, how to deal with them.

II. Problem description:
1) Compile and run in the Chinese console window of Chinese Windows 2000, using the international version of the JDK and connecting
to a Cp936-encoded SQL Server database on Chinese Windows 2000:

J:\exercise\demo\encode\HelloWorld>make
Created by XCompiler. PhiloSoft All Rights Reserved.
Wed May 30 02:54:45 CST 2001

J:\exercise\demo\encode\HelloWorld>run
Created by XRunner. PhiloSoft All Rights Reserved.
Wed May 30 02:51:33 CST 2001
中文
[B@7bc8b569
[B@7b08b569
[B@7860b569
中文
中文
????
中文
中文
????
??
??
??

2) If the program is compiled in the Western-language console window (code page 437) of Chinese Windows 2000, it cannot display correctly when run there, because that window has no Chinese font.
If it is then run in the Chinese console window of Chinese Windows 2000 as above, the output is:

J:\exercise\demo\encode\HelloWorld>run
Created by XRunner. PhiloSoft All Rights Reserved.
Wed May 30 02:51:33 CST 2001
????
[B@7bc0b66a
[B@7b04b66a
[B@7818b66a
????
????
????
????
????
????
中文
中文
????

III. Analysis

1) Garbled characters (that is, "?") appear. Because the output shows "?" rather than small boxes, the problem lies in the encoding
and not in the font. When text is converted from one character set to another, in particular
to ISO8859_1 (a single-byte Western character set that extends ASCII), a Chinese character, or each half of one, has no Western character to map to.
In that case the system substitutes "?". Similarly, a smaller character set cannot represent everything in a larger one.
The specific reasons are not detailed here.
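The substitution of "?" is easy to observe directly. The following sketch is only an illustration (the class and method names are made up); it encodes a string to ISO 8859-1 and decodes it again, using Unicode escapes for the two Chinese characters so that the result does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

class UnmappableDemo {
    // Encodes text to ISO 8859-1 and decodes it again; unmappable characters become "?".
    static String throughLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Reports whether every character of text exists in ISO 8859-1.
    static boolean fitsLatin1(String text) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u6587";
        System.out.println(throughLatin1(chinese)); // ??
        System.out.println(throughLatin1("abc"));   // abc
        System.out.println(fitsLatin1(chinese));    // false
    }
}
```

Once a character has been replaced by "?", the original cannot be recovered from the bytes, which is why this kind of damage is an encoding problem and not a font problem.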

2) A program compiled in the Chinese environment displays Chinese characters sometimes correctly and sometimes incorrectly when run in the Chinese environment, and the same
is true of a program compiled in the Western-language environment. This is the result of automatic (default) or manual transcoding, that is,
of new String(bytes[, encode]) and byte[] getBytes([encode]).

2.1) In the chain Java source file --> javac --> class file --> java --> getBytes() --> new String(),
every step involves a conversion. The conversion always takes place, although sometimes it is carried out with default
parameters. Next we analyze the cause of the situations above step by step.

2.2) Here is the source code:

HelloWorld.java:
------------------------
public class HelloWorld
{
    public static void main(String[] argv) {
        try {
            // The literal 中文 means "Chinese".
            System.out.println("中文"); // 1
            System.out.println("中文".getBytes()); // 2
            System.out.println("中文".getBytes("GB2312")); // 3
            System.out.println("中文".getBytes("ISO8859_1")); // 4

            System.out.println(new String("中文".getBytes())); // 5
            System.out.println(new String("中文".getBytes(), "GB2312")); // 6
            System.out.println(new String("中文".getBytes(), "ISO8859_1")); // 7

            System.out.println(new String("中文".getBytes("GB2312"))); // 8
            System.out.println(new String("中文".getBytes("GB2312"), "GB2312")); // 9
            System.out.println(new String("中文".getBytes("GB2312"), "ISO8859_1")); // 10

            System.out.println(new String("中文".getBytes("ISO8859_1"))); // 11
            System.out.println(new String("中文".getBytes("ISO8859_1"), "GB2312")); // 12
            System.out.println(new String("中文".getBytes("ISO8859_1"), "ISO8859_1")); // 13
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
}

For convenience, each conversion is followed by a sequence number: 1, 2, ..., 13.

2.3) Note that javac reads the source file in the system default encoding and then converts it to Unicode.
At run time Java works in Unicode internally, and by default input and output use the operating system's
encoding. In new String(bytes[, encode]), the system treats bytes as a byte stream encoded in encode.
In other words, the result is correct only if bytes really was produced by encode. To be stored
inside Java, the text must then be converted from encode to Unicode, that is, bytes --> encode characters --> Unicode characters.
In String.getBytes([encode]), the system performs the reverse conversion:
Unicode characters --> encode characters --> bytes.

In this example, except in the Western-language window, the default encoding is GBK (for the purposes of
this example we treat GBK and GB2312 as the same).
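The two directions of conversion described in 2.3 can be written out as a short sketch. It assumes the running JVM provides the GBK charset, which standard JDK builds normally include; the class and method names are invented for the illustration. The four bytes are the GBK encoding of the two characters U+4E2D and U+6587.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class GbkRoundTrip {
    // Assumption: the GBK charset is available in this JVM.
    static final Charset GBK = Charset.forName("GBK");

    // bytes --> GBK characters --> Unicode characters
    static String decode(byte[] bytes) {
        return new String(bytes, GBK);
    }

    // Unicode characters --> GBK characters --> bytes
    static byte[] encode(String text) {
        return text.getBytes(GBK);
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};
        String text = decode(raw);
        System.out.println(text.length());                     // 2
        System.out.println(text.equals("\u4e2d\u6587"));       // true
        System.out.println(Arrays.equals(encode(text), raw));  // true
    }
}
```

When the charset passed to new String() matches the one that produced the bytes, the round trip is lossless; the problems analyzed below all come from the two not matching.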

2.4) When encode is not specified in either of the two conversions above, the system uses the default
encoding (here GBK). Cases 5, 6, 7 are therefore the same as 8, 9, 10; likewise 8 is the same as 9, and 11 the same as 12.
So we discuss only 1, 9, 10, 12, and 13. Cases 2, 3, and 4 are for testing only and fall
outside the scope of the discussion.
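The claim that the argument-free forms simply use the default encoding can be checked with a small sketch (class and method names are made up for the example). Which charset the default actually is depends on the JDK version and the platform, so the sketch prints it instead of assuming GBK.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class DefaultCharsetDemo {
    // Checks that omitting the charset gives the same result as passing the default explicitly.
    static boolean defaultMatchesExplicit(String text) {
        Charset def = Charset.defaultCharset();
        byte[] explicit = text.getBytes(def);
        boolean sameBytes = Arrays.equals(text.getBytes(), explicit);
        boolean sameString = new String(explicit).equals(new String(explicit, def));
        return sameBytes && sameString;
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println(defaultMatchesExplicit("\u4e2d\u6587")); // true
    }
}
```

This is why code that never names an encoding still behaves differently from one machine to another: the conversion is always there, only its parameter is hidden.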

2.5) Next we trace how the character 中 (the first character of 中文, "Chinese") is converted in the program. We start with compiling and running
in the Chinese window. Note that in the labels below, the numeric subscripts are chosen deliberately to show which values are the same, different, or
related.

2.5.1) Take code segment 9 of the 13 above as an example:

Step  Value   Location              Description
01    C1      HelloWorld.java       C1 denotes a GBK character
02    U1      javac reading         U1 denotes a Unicode character
03    C1      getBytes(), step 1    Java first communicates with the operating system
04    B1,B2   getBytes(), step 2    and then returns the byte array
05    C1      new String(), step 1  Java first communicates with the operating system
06    U1      new String(), step 2  and then returns the character
07    C1      println(String)       中 is displayed; the content is the same as the original

2.5.2) Now take code segment 10 as an example:

Step  Value   Location              Description
01    C1      HelloWorld.java       C1 denotes a GBK character
02    U1      javac reading         U1 denotes a Unicode character
03    C1      getBytes(), step 1    Java first communicates with the operating system
04    B1,B2   getBytes(), step 2    and then returns the byte array
05    C3,C4   new String(), step 1  Java first communicates with the operating system; a parsing error occurs
06    U5,U6   new String(), step 2  and then returns the characters
07    C3,C4   println(String)       see the note below

Note on step 07: because the Chinese character was split into two halves that were read as two ISO8859_1 characters,
nothing can be mapped back for display, so each half is shown as "?". In the example above,
中文 ("Chinese") is displayed as "????".

2.5.3) The other cases in the all-Chinese environment are similar and are not discussed further.
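The split described for code segment 10 can be reproduced in isolation. The sketch below is an illustration under the assumption that the JVM provides the GBK charset; the class and method names are invented. It encodes two Chinese characters as GBK and decodes the four resulting bytes as ISO 8859-1, which turns every byte into its own character.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MisdecodeDemo {
    // Encodes text as GBK, then decodes the bytes as ISO 8859-1 (the mistake in segment 10).
    static String misdecode(String text) {
        byte[] gbkBytes = text.getBytes(Charset.forName("GBK"));
        return new String(gbkBytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String wrong = misdecode("\u4e2d\u6587");
        System.out.println(wrong.length()); // 4
        for (char c : wrong.toCharArray()) {
            // Each byte became one character in the range U+0080..U+00FF.
            System.out.printf("U+%04X%n", (int) c); // U+00D6, U+00D0, U+00CE, U+00C4
        }
    }
}
```

Two characters have become four, which matches the four question marks in the output listing.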

2.6) Now let's see why a class compiled in the Western-language DOS window does not behave the same way when run in the Chinese window,
and why it can nevertheless display Chinese characters correctly in some cases.

2.6.1) Take code segment 9 as an example:

Step  Value     Location              Description
01    C1C2      HelloWorld.java       C1C2 denotes two ISO8859_1 characters; 中 has been split in two
02    U3U4      javac reading         U3U4 denotes two Unicode characters
03    C5C6      getBytes(), step 1    Java first communicates with the operating system; a parsing error occurs
04    B5B6B7B8  getBytes(), step 2    and then returns the byte array
05    C5C6      new String(), step 1  Java first communicates with the operating system
06    U3U4      new String(), step 2  and then returns the characters
07    C5C6      println(String)       see the note below

Note on step 07: although there are still two characters, they are no longer the original "two ISO8859_1 characters"
but "two GBK characters". 中 is displayed as "??" and
中文 ("Chinese") is displayed as "????".

2.6.2) Next take code segment 12 as an example, because it displays the Chinese characters correctly:

Step  Value   Location              Description
01    C1C2    HelloWorld.java       C1C2 denotes two ISO8859_1 characters; 中 has been split in two
02    U3U4    javac reading         U3U4 denotes two Unicode characters
03    C1C2    getBytes(), step 1    Java first communicates with the operating system (note that this is correct!)
04    B5B6    getBytes(), step 2    and then returns the byte array (this is a key step!)
05    C12     new String(), step 1  Java first communicates with the operating system (an even more critical step: Java now knows that B5B6 must be parsed as one Chinese character!)
06    U7      new String(), step 2  and then returns the character (two have truly become one: U7 carries the information of U3U4)
07    C12     println(String)       see the note below

Note on step 07: this is the original 中. It was wronged once by javac, but the
programmer set things right, so 中文 ("Chinese") is displayed correctly.
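The repair performed by code segment 12 can be shown on its own. This is a minimal sketch with invented names, assuming the GBK charset is available: it starts from the four ISO 8859-1 characters that a wrongly decoded 中文 turns into, takes their bytes back out unchanged, and decodes those bytes as GBK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RecoverDemo {
    // Undoes a wrong ISO 8859-1 decoding: recover the original bytes, then decode them as GBK.
    static String recover(String misread) {
        byte[] originalBytes = misread.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, Charset.forName("GBK"));
    }

    public static void main(String[] args) {
        // The GBK bytes D6 D0 CE C4, each wrongly read as one ISO 8859-1 character.
        String misread = "\u00d6\u00d0\u00ce\u00c4";
        String fixed = recover(misread);
        System.out.println(fixed.length());                  // 2
        System.out.println(fixed.equals("\u4e2d\u6587"));    // true
    }
}
```

The trick works only because ISO 8859-1 maps every byte value to exactly one character, so the wrong decoding loses no information. With most other wrong charsets the bytes could not be recovered this way.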

3) Why does JDBC code that uses
new String(Recordset.getBytes(int)[, encode])
Recordset.getString(int)
Recordset.setBytes(String.getBytes([encode]))
and
Recordset.setString(String)
sometimes produce garbled characters?

In fact, the problem arises because whoever wrote the JDBC driver also took encoding into account. After reading data from the database, the driver may,
on its own initiative, already have converted it from GB2312 (the default encoding) to Unicode. My WebLogic JDBC driver for SQL Server
behaves this way: when I read a string and transcode it myself, the Chinese characters come out wrong, which is infuriating,
yet Chinese strings can be written directly, which is hard to accept!
That is to say, transcoding has to happen somewhere during reading or writing, even though it is sometimes not obvious
because the default encoding is used. Exactly what a JDBC driver does can only be determined by reading its
source code.
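Because a driver may or may not have converted the text already, transcoding every string blindly can damage values that were correct. One possible approach, sketched below, is to transcode only when a string looks as if it was decoded byte by byte. Everything here is hypothetical: the class, the helper names, the heuristic, and the choice of GBK as the real encoding are assumptions for illustration and do not describe any particular driver. Only ResultSet.getString(int) is a real JDBC call.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.sql.ResultSet;
import java.sql.SQLException;

class JdbcTextHelper {
    // Assumption for this sketch: the database really stores GBK bytes.
    static final Charset ACTUAL = Charset.forName("GBK");

    // Heuristic: every char fits in one byte and at least one is non-ASCII,
    // which is what a byte-by-byte ISO 8859-1 decoding of GBK text looks like.
    static boolean looksMisdecoded(String value) {
        boolean sawHighByte = false;
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c > 0xFF) return false;       // already real Unicode text
            if (c >= 0x80) sawHighByte = true;
        }
        return sawHighByte;
    }

    // Transcodes only when the heuristic says the driver did not convert the text.
    static String fixIfNeeded(String value) {
        if (value == null || !looksMisdecoded(value)) return value;
        return new String(value.getBytes(StandardCharsets.ISO_8859_1), ACTUAL);
    }

    // Reads a text column and repairs it when necessary.
    static String readText(ResultSet rs, int column) throws SQLException {
        return fixIfNeeded(rs.getString(column));
    }
}
```

A heuristic like this can misfire on genuine Western European text, so checking how the specific driver behaves remains the reliable way to decide whether any transcoding is needed.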

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
