Understanding Java Encoding

Last Update:2018-12-05 Source: Internet

Author: User

Keywords: Java encoding, Unicode, getBytes()

From my notes (Youdao Note: Understanding Java encoding).

The encodings involved in any program come down to two things: the encoding of the source files and the encoding of the strings inside the running program. Java is often called an international language because its class files store string constants in (modified) UTF-8 while the JVM represents strings as UTF-16 at run time (from "About Unicode and encoding in Java", http://blog.csdn.net/soleghost/article/details/959832). Encoding conversions take place both when a Java program is compiled and when it runs.

1. javac reads the source file using the system's default encoding and generates a class file in which text is stored as Unicode (from "The Java getBytes encoding method", http://nopainnogain.iteye.com/blog/970628).

On Windows, the default encoding is typically GBK (on Chinese-language systems); on Linux it can be set through the LANG environment variable.
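To see which default a given machine actually uses, a minimal sketch along the following lines can help. The class name DefaultEncodingCheck is hypothetical; the calls are standard JDK APIs. Note that on JDK 18 and later the default charset is UTF-8 regardless of locale unless it is overridden, so the output may differ from what this article describes for older JDKs.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, for illustration only.
class DefaultEncodingCheck {

    // Name of the charset the JVM uses when no encoding is specified.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + defaultCharsetName());
        // The system property and the environment variable that usually influence it.
        System.out.println("file.encoding  : " + System.getProperty("file.encoding"));
        System.out.println("LANG           : " + System.getenv("LANG"));
    }
}
```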

You can change the encoding javac uses to read source files with the -encoding option:

javac -encoding GBK test.java

Suppose a source file is encoded in GBK and contains a Chinese string literal. (Chinese characters typed into the file are stored in the file's own encoding, here GBK.)

If LANG is then set to a UTF-8 locale, compiling with javac test3.java produces the following warning:

This happens because javac reads the file as UTF-8, the system default, but the Chinese characters inside are actually GBK bytes that do not form valid UTF-8 sequences, so javac cannot map them to characters and issues a warning.

Compiling with javac -encoding GBK test3.java produces no warning.
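The mismatch javac runs into can be reproduced in a few lines. The sketch below is not javac's own logic; it only shows, with a strict decoder, that GBK bytes for Chinese characters are not valid UTF-8. The class name StrictDecodeDemo and the sample text (the two characters U+4E2D U+6587) are assumptions for illustration, and the JDK is assumed to ship the GBK charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class StrictDecodeDemo {

    // Returns true only if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Unicode escapes keep this example independent of the source file's encoding.
        String text = "\u4e2d\u6587";
        byte[] gbkBytes = text.getBytes(Charset.forName("GBK"));
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        System.out.println("GBK bytes valid as UTF-8:   " + isValidUtf8(gbkBytes));
        System.out.println("UTF-8 bytes valid as UTF-8: " + isValidUtf8(utf8Bytes));
    }
}
```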

2. At run time, conversions take place whenever getBytes() or new String() is called, and when println prints a string for display.

Java strings are stored as Unicode. Every encoding conversion goes from the source encoding to Unicode, and then from Unicode to the target encoding.

Consider the following program:

If getBytes() and new String() are called without an encoding, they use the system's default encoding.

For example, gbkstr.getBytes() returns the bytes in the system's default encoding, while new String(bytes) assumes the byte array is in the system's default encoding.

In the program above, gbkstr.getBytes().length and utfstr.getBytes().length therefore give different results at run time depending on LANG.
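The dependence on the default encoding disappears once the charset is passed explicitly. A minimal sketch, assuming the JDK ships the GBK charset; the class name ByteLengthDemo and the sample string are hypothetical and are not the string used in the original program:

```java
import java.nio.charset.Charset;

// Hypothetical helper class, for illustration only.
class ByteLengthDemo {

    // Number of bytes the string occupies in the named encoding.
    static int encodedLength(String s, String charsetName) {
        return s.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        // Two Chinese characters (U+4E2D U+6587) followed by three ASCII letters.
        String s = "\u4e2d\u6587abc";
        System.out.println("chars : " + s.length());                  // 5
        System.out.println("GBK   : " + encodedLength(s, "GBK"));     // 2 + 2 + 3 = 7
        System.out.println("UTF-8 : " + encodedLength(s, "UTF-8"));   // 3 + 3 + 3 = 9
        // Without an argument the result depends on the platform default.
        System.out.println("default: " + s.getBytes().length);
    }
}
```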

With LANG=zh_CN.GB18030 and a GBK terminal, the output is:

With LANG=zh_CN.UTF-8 and a UTF-8 terminal, the output is:

These results raise a question: why is the first string displayed correctly while the second is not?

Look at this line of code: new String(gbkstr.getBytes("GBK"), "UTF-8")

It is tempting to read this as converting a GBK string into a UTF-8 string, but that is a serious mistake. The line is simply wrong: the characters stored in utfstr have not been decoded correctly.

As stated repeatedly, Java strings are Unicode, so there is no such thing as "a GBK-encoded string" or "a UTF-8-encoded string"; speaking of one is naive. By the same token, the variable names gbkstr and utfstr in the program above are misleading.

What the line actually does is encode the gbkstr object into a byte[] using GBK, and then interpret that byte[] as UTF-8 to build a String. (This sentence is excerpted from "About Unicode and encoding in Java", http://blog.csdn.net/soleghost/article/details/959832.)

Step by step: gbkstr (Unicode characters) is converted to GBK characters, and the GBK characters are stored as a byte[]; that byte[] is then treated as the bytes of UTF-8 characters and converted back to Unicode.

Decoding a GBK-encoded byte[] as UTF-8 obviously goes wrong. The correct code encodes gbkstr into a byte[] using UTF-8 and then decodes that byte[] as UTF-8 into utfstr.
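The difference between the broken line and the corrected one can be sketched as follows. This is an illustration under assumptions, not the original program: the class name WrongConversionDemo, the method names, and the sample string are hypothetical, and the JDK is assumed to ship the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class WrongConversionDemo {
    static final Charset GBK = Charset.forName("GBK");

    // The mistaken pattern: encode with GBK, then decode the same bytes as UTF-8.
    static String broken(String s) {
        return new String(s.getBytes(GBK), StandardCharsets.UTF_8);
    }

    // The consistent pattern: encode and decode with the same charset.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587abc"; // U+4E2D U+6587 plus ASCII letters
        String bad = broken(original);
        String good = roundTrip(original);
        System.out.println("broken equals original:    " + bad.equals(original));
        System.out.println("roundTrip equals original: " + good.equals(original));
        // Undecodable bytes are replaced by U+FFFD, the replacement character.
        System.out.println("broken contains U+FFFD:    " + (bad.indexOf('\uFFFD') >= 0));
    }
}
```

ASCII-only text survives the mistaken pattern because GBK and UTF-8 encode ASCII identically, which is why this kind of bug often goes unnoticed until non-ASCII text appears.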

With LANG=zh_CN.GB18030 and a GBK terminal, the output is:

With LANG=zh_CN.UTF-8 and a UTF-8 terminal, the output is:

Summary:

Correct code

Running results in different environments:

Note:

The source file here happens to be GBK-encoded, but the results would be the same if it were UTF-8. Provided javac reads the file with the right encoding, the output does not depend on the source file's encoding, because the conversion to Unicode is already complete by the time the literal is assigned to gbkstr. What gbkstr holds is plain Unicode (gbkstr.length() counts the Unicode characters in the string; more precisely, its UTF-16 code units). If you read a string from a file rather than assigning a literal, specify the file's encoding when reading it to get the correct string. To produce an output file in a particular encoding, simply specify that encoding when writing the string to the file.
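Reading and writing files with an explicit encoding might look like the sketch below. The class name FileEncodingDemo, the temporary file, and the sample text are assumptions for illustration; the JDK is assumed to ship the GBK charset.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, for illustration only.
class FileEncodingDemo {

    // Writes the text in the given charset, then reads it back with the same charset.
    static String writeThenRead(String text, Charset charset) {
        try {
            Path tmp = Files.createTempFile("encoding-demo", ".txt");
            try {
                Files.write(tmp, text.getBytes(charset));            // Unicode -> bytes
                return new String(Files.readAllBytes(tmp), charset); // bytes -> Unicode
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587abc"; // U+4E2D U+6587 plus ASCII letters
        System.out.println(text.equals(writeThenRead(text, Charset.forName("GBK"))));
        System.out.println(text.equals(writeThenRead(text, StandardCharsets.UTF_8)));
    }
}
```

The file contents differ between the two calls, but the String read back is the same in both cases because the same charset is used in each direction.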

Therefore:

To get a string's bytes in UTF-8 or GBK, call string.getBytes(encoding) directly; no elaborate conversion is needed.

The programs above print correctly under different LANG settings and terminals because a conversion from Unicode to the target encoding takes place when the string is written to the terminal.
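The conversion that happens on output can be made visible by printing into an in-memory buffer with an explicitly chosen encoding. A minimal sketch; the class name PrintEncodingDemo and the sample text are hypothetical, and System.out itself normally picks its encoding from the environment rather than from an argument like this.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical helper class, for illustration only.
class PrintEncodingDemo {

    // Prints the text through a PrintStream using the named encoding
    // and returns the bytes that would have reached the terminal.
    static byte[] printed(String text, String encoding) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(sink, true, encoding)) {
            out.print(text);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // U+4E2D U+6587
        System.out.println("GBK bytes written:   " + printed(text, "GBK").length);   // 4
        System.out.println("UTF-8 bytes written: " + printed(text, "UTF-8").length); // 6
    }
}
```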

This shows how convenient it is that Java (like Python) represents all strings uniformly in Unicode. In C/C++, by contrast, output is correct only when the string's encoding, LANG, and the terminal's encoding all agree.

We recommend always specifying the encoding when calling getBytes and new String.

In the code above, gbkstr and utfstr hold the same (Unicode) content: 28 characters, where each English letter or Chinese character counts as one character. Here is that content:

Unicode Value
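A listing of this kind can be produced with a few lines of code. The sketch below uses a short hypothetical sample string, not the 28-character string of the original program; the class name CodePointDump is also an assumption.

```java
import java.util.stream.Collectors;

// Hypothetical helper class, for illustration only.
class CodePointDump {

    // Formats every code point of the string as U+XXXX, separated by spaces.
    static String dump(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String s = "\u4e2d\u6587a"; // two Chinese characters and one ASCII letter
        System.out.println(dump(s));                       // U+4E2D U+6587 U+0061
        System.out.println("length(): " + s.length());     // 3
    }
}
```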

In addition, we can also take a look at the GBK byte stream and UTF-8 byte stream content.

GBK byte stream (GBK encodes each Chinese character in two bytes and each English letter in one byte)

UTF-8 byte stream (UTF-8 is a variable-length encoding; each of the Chinese characters here takes 3 bytes)
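The two byte streams can be compared side by side with a small hex dump. Again a sketch with a hypothetical class name (ByteDump) and a short sample string in place of the original one; the JDK is assumed to ship the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class ByteDump {

    // Formats each byte as two uppercase hex digits, separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u4e2d\u6587a"; // two Chinese characters and one ASCII letter
        System.out.println("GBK  : " + hex(s.getBytes(Charset.forName("GBK"))));   // D6 D0 CE C4 61
        System.out.println("UTF-8: " + hex(s.getBytes(StandardCharsets.UTF_8)));   // E4 B8 AD E6 96 87 61
    }
}
```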

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
