Character encoding and the number of bytes occupied by strings in Java

Last Update:2018-12-04 Source: Internet

Author: User


The number of bytes occupied by strings in Java is closely related to character encoding.

Encoding in Java development involves several aspects: the IDE (source file) encoding, the operating system's default encoding, and Java's own character encoding.

For example, when you write a Java program in Eclipse, you can set the encoding in the project properties. If it is not set, the default is the operating system encoding. The encoding set here is the encoding of the source code file. When you write a Java program in Vim, you can set the system environment variable LANG, for example to zh_CN.UTF-8 or zh_CN.gb18030, and the source file is then saved in the encoding specified by LANG. This is the IDE encoding, and it matters: if a Java source file is encoded in UTF-8 and your IDE is set to gb18030, the file is displayed as garbled text.
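The mismatch described above can be reproduced in a few lines. This is a minimal sketch, not the article's own code: the class name `EditorViewDemo` and the method `view` are hypothetical, and the "file" is just a byte array standing in for what is stored on disk. It assumes the JDK in use ships the GB18030 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original program.
class EditorViewDemo {

    // Decodes the bytes stored on disk with the charset the editor assumes.
    static String view(byte[] fileBytes, Charset editorCharset) {
        return new String(fileBytes, editorCharset);
    }

    public static void main(String[] args) {
        // Unicode escapes for the two Chinese characters keep this example
        // independent of the encoding of its own source file.
        String line = "String msg = \"\u4e2d\u56fdABC\";";

        // Assumption: the file was saved as UTF-8.
        byte[] onDisk = line.getBytes(StandardCharsets.UTF_8);

        // Editor configured for UTF-8: the text round-trips unchanged.
        System.out.println(view(onDisk, StandardCharsets.UTF_8));
        // Editor configured for GB18030: the same bytes decode to other characters.
        System.out.println(view(onDisk, Charset.forName("GB18030")));
    }
}
```

Whether the first line is displayed correctly still depends on the terminal, which is the same problem one level further out.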

The character encoding in Java refers to the encoding used when a Java string is converted to bytes. For example, the following program calculates the number of bytes occupied by a string. It was run on Windows 7:


Public class charset {
Public static void main (string [] ARGs ){
// Todo auto-generated method stub
String MSG = "China ABC ";
System. Out. println (MSG );
Int Len = MSG. getbytes (). length; // encoding by default Operating System Encoding
System. Out. println (LEN );
Try {
Len = msg. getbytes ("gb2312"). length; // output 7
System. Out. println ("gb2312:" + Len );
Len = msg. getbytes ("GBK"). length; // output 7
System. Out. println ("GBK:" + Len );
Len = msg. getbytes ("gb18030"). length; // output 7, 2*2 + 3. One Chinese Character occupies 2 bytes, one English letter and one byte
System. Out. println ("gb18030:" + Len );
Len = msg. getbytes ("UTF-8"). length; // output 9, 2*3 + 3 = 9, a Chinese character occupies 3 bytes, an English letter a byte.
System. Out. println ("UTF-8:" + Len );
Len = msg. getbytes ("UTF-16"). length; // OUTPUT 12
System. Out. println ("UTF-16:" + Len );
Len = msg. getbytes ("UTF-32"). length; // output 20
System. Out. println ("UTF-32:" + Len );
Len = msg. getbytes ("Unicode"). length; // OUTPUT 12
System. Out. println ("UNICODE:" + Len );
} Catch (Java. Io. unsupportedencodingexception E)
{
System. Out. println (E. getmessage (). tostring ());
}
}
}


Program output:

China ABC
7
Gb2312: 7
GBK: 7
Gb18030: 7
UTF-8: 9
UTF-16: 12
UTF-32: 20
UNICODE: 12
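The byte counts above follow from per-character sizes plus, for UTF-16, a 2-byte byte order mark that Java's "UTF-16" encoder writes at the start: 12 = 2 + 5*2. The sketch below makes that arithmetic visible by measuring one Chinese character, one ASCII letter, and the whole string. The class name `ByteBreakdown` and the method `byteLength` are hypothetical, and "UTF-16BE" is added here only to show the count without the byte order mark.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of the original program.
class ByteBreakdown {

    // Number of bytes the string occupies in the named charset.
    static int byteLength(String s, String charsetName) {
        return s.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        String han = "\u4e2d";            // one Chinese character
        String ascii = "A";               // one ASCII letter
        String msg = "\u4e2d\u56fdABC";   // the string used in the article

        String[] names = {"GBK", "UTF-8", "UTF-16BE", "UTF-16", "UTF-32"};
        for (String name : names) {
            System.out.println(name
                    + ": han=" + byteLength(han, name)
                    + " ascii=" + byteLength(ascii, name)
                    + " msg=" + byteLength(msg, name));
        }
    }
}
```

With "UTF-16BE" the string occupies 10 bytes; the extra 2 bytes under "UTF-16" (and its alias "Unicode") are the byte order mark, which is why the single-character measurements for "UTF-16" come out as 4 rather than 2.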

Analysis:
The value of msg.getBytes().length is 7 because the character encoding of this Windows 7 system is GBK (gb2312, GBK and gb18030 all give the same result here). When the program runs, Java encodes the string with the operating system's default encoding, so the string occupies 7 bytes.
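To see which encoding the no-argument `getBytes()` will use on a given machine, the default charset can be printed directly. This is a small sketch with a hypothetical class name, `DefaultCharsetDemo`. Note that the result may differ from the article's: JDK 18 and later use UTF-8 as the default charset regardless of the operating system setting, so the 7-byte result is what to expect on older Java versions under a GBK locale.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original program.
class DefaultCharsetDemo {

    // Length using the JVM's default charset (what getBytes() with no argument does).
    static int defaultLength(String s) {
        return s.getBytes().length;
    }

    // Length using an explicit charset, independent of the machine's settings.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String msg = "\u4e2d\u56fdABC";
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("default length:  " + defaultLength(msg));
        System.out.println("UTF-8 length:    " + utf8Length(msg));
    }
}
```

Passing a `Charset` constant such as `StandardCharsets.UTF_8` also avoids the checked `UnsupportedEncodingException` that the string-named overload requires, and makes the byte count the same on every machine.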

If the same program is run on Linux:


[Zhankunlin @ icthtc javatest] $ export lang = zh_cn.gb18030
[Zhankunlin @ icthtc javatest] $ Vim charset. Java (when writing a Java code file, the code used is zh_cn.gb18030, that is, the code file encoding is gb18030)
[Zhankunlin @ icthtc javatest] $ javac charset. Java
[Zhankunlin @ icthtc javatest] $ Java charset (lang = zh_cn.gb18030, that is, the default encoding is gb18030)
China ABC
7 (the default encoding is gb18030, so it occupies 7 bytes)
Gb2312: 7
GBK: 7
Gb18030: 7
UTF-8: 9
UTF-16: 12
UTF-32: 20
UNICODE: 12
[Zhankunlin @ icthtc javatest] $ export lang = zh_CN.UTF-8 (Change System code to UTF-8)
[Zhankunlin @ icthtc javatest] $ Java charset
Juan .. ABC (because the xshell Terminal code is not set to UTF-8, so the print garbled)
9 (the operating system encoding is UTF-8, so it occupies 9 bytes)
Gb2312: 7
GBK: 7
Gb18030: 7
UTF-8: 9
UTF-16: 12
UTF-32: 20
UNICODE: 12
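The garbled first line comes from the bytes the JVM writes to the terminal, not from the string itself. The following sketch captures those bytes in memory instead of sending them to a terminal. The class name `TerminalBytesDemo` and its methods are hypothetical; `PrintStream(OutputStream, boolean, String)` is the standard constructor that takes an encoding name.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical demo class, not part of the original program.
class TerminalBytesDemo {

    // Returns the bytes a PrintStream using the named encoding would write.
    static byte[] printedBytes(String s, String charsetName) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(buffer, true, charsetName)) {
            out.print(s);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unsupported encoding: " + charsetName, e);
        }
        return buffer.toByteArray();
    }

    // Formats bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String msg = "\u4e2d\u56fdABC";
        System.out.println("gb18030: " + hex(printedBytes(msg, "GB18030")));
        System.out.println("UTF-8:   " + hex(printedBytes(msg, "UTF-8")));
    }
}
```

Only the last three bytes (41 42 43, the letters ABC) are the same in both encodings, which is consistent with the session above: a terminal expecting one encoding shows ABC correctly and garbles the rest when it receives the other.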


{Set the Xshell terminal encoding to UTF-8}


[Zhankunlin @ icthtc javatest] $ Java charset
China ABC (print normal)
9
Gb2312: 7
GBK: 7
Gb18030: 7
UTF-8: 9
UTF-16: 12
UTF-32: 20
UNICODE: 12
[Zhankunlin @ icthtc javatest] $ Vim charset. Java


[zhankunlin@icthtc javatest]$ javac CharSet.java    (the source file encoding is gb18030, but the system encoding at compile time is UTF-8; when no encoding is specified, the compiler reads the source file using the operating system encoding, so it reports warnings)
CharSet.java:6: (warning text garbled in the terminal; only "utf8" is legible)
        String msg = "ABC";
                      ^
CharSet.java:6: (warning text garbled in the terminal; only "utf8" is legible)
        String msg = "ABC";


[Zhankunlin @ icthtc javatest] $ javac-encoding gb18030 charset. Java (if you use the-encoding option to specify the encoding format of the program file, compilation will not fail)
[Zhankunlin @ icthtc javatest] $ Java charset {print normal, because the xshell terminal encoding has been set to UTF-8 }}
China ABC
9
Gb2312: 7
GBK: 7
Gb18030: 7
UTF-8: 9
UTF-16: 12
UTF-32: 20
UNICODE: 12
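What goes wrong at compile time can be approximated outside the compiler: reading gb18030 bytes as UTF-8 yields byte sequences that are not valid UTF-8, and Java's decoder substitutes the replacement character U+FFFD for them. The sketch below is an illustration under that assumption, not a description of javac's internals; the class name `CompilerDecodeDemo` and its methods are hypothetical.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of the original program.
class CompilerDecodeDemo {

    // Decodes source file bytes the way a tool would when told this encoding.
    static String decode(byte[] sourceBytes, String encoding) {
        return new String(sourceBytes, Charset.forName(encoding));
    }

    // True if decoding had to substitute the Unicode replacement character.
    static boolean hasReplacementChar(String s) {
        return s.indexOf('\uFFFD') >= 0;
    }

    public static void main(String[] args) {
        String literal = "\u4e2d\u56fdABC";
        // Assumption: the source file was saved as gb18030, as in the session above.
        byte[] sourceBytes = literal.getBytes(Charset.forName("GB18030"));

        String asUtf8 = decode(sourceBytes, "UTF-8");
        String asGb18030 = decode(sourceBytes, "GB18030");

        System.out.println("read as UTF-8, damaged:   " + hasReplacementChar(asUtf8));
        System.out.println("read as gb18030, intact:  " + asGb18030.equals(literal));
    }
}
```

Telling the reader the real encoding, which is what `javac -encoding gb18030` does for the compiler, is the only step that recovers the original characters; the ASCII letters survive either way.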


This article is an English version of an article originally published in Chinese on aliyun.com.
