The number of bytes a string occupies in Java depends on the character encoding used.
Encoding comes into play at three levels when working with Java: the IDE (source file) encoding, the operating system's default encoding, and the encoding Java uses for strings at run time.
For example, when writing a Java program in Eclipse, you can set the encoding in the project properties. This setting is the encoding of the source files; if it is not set, it defaults to the operating system's encoding. When writing a Java program in Vim, the source file encoding by default follows the system's LANG environment variable, for example zh_CN.UTF-8 or zh_CN.GB18030. This is the IDE encoding, and it matters: if a Java source file is saved as UTF-8 but the IDE reads it as GB18030, the text is displayed garbled.
The character encoding in Java refers to the encoding used when a Java string is converted to bytes. For example, the following program counts the bytes a string occupies under several encodings; it was run on Windows 7:
[Java]
View plaincopyprint?
- Public class charset {
- Public static void main (string [] ARGs ){
- // Todo auto-generated method stub
- String MSG = "China ABC ";
- System. Out. println (MSG );
- Int Len = MSG. getbytes (). length; // encoding by default Operating System Encoding
- System. Out. println (LEN );
- Try {
- Len = msg. getbytes ("gb2312"). length; // output 7
- System. Out. println ("gb2312:" + Len );
- Len = msg. getbytes ("GBK"). length; // output 7
- System. Out. println ("GBK:" + Len );
- Len = msg. getbytes ("gb18030"). length; // output 7, 2*2 + 3. One Chinese Character occupies 2 bytes, one English letter and one byte
- System. Out. println ("gb18030:" + Len );
- Len = msg. getbytes ("UTF-8"). length; // output 9, 2*3 + 3 = 9, a Chinese character occupies 3 bytes, an English letter a byte.
- System. Out. println ("UTF-8:" + Len );
- Len = msg. getbytes ("UTF-16"). length; // OUTPUT 12
- System. Out. println ("UTF-16:" + Len );
- Len = msg. getbytes ("UTF-32"). length; // output 20
- System. Out. println ("UTF-32:" + Len );
- Len = msg. getbytes ("Unicode"). length; // OUTPUT 12
- System. Out. println ("UNICODE:" + Len );
- } Catch (Java. Io. unsupportedencodingexception E)
- {
- System. Out. println (E. getmessage (). tostring ());
- }
- }
- }
Program output:
China ABC
7
Gb2312: 7
GBK: 7
Gb18030: 7
UTF-8: 9
UTF-16: 12
UTF-32: 20
UNICODE: 12
Analysis:
msg.getBytes().length is 7 because the default character encoding of this Windows 7 system is GBK, which encodes this string in the same way as GB2312 and GB18030. When no encoding is given, Java encodes the string with the operating system's default encoding, so the string occupies 7 bytes.
If the same program is run on Linux, the result depends on the LANG environment variable:
[Plain]
View plaincopyprint?
- [Zhankunlin @ icthtc javatest] $ export lang = zh_cn.gb18030
- [Zhankunlin @ icthtc javatest] $ Vim charset. Java (when writing a Java code file, the code used is zh_cn.gb18030, that is, the code file encoding is gb18030)
- [Zhankunlin @ icthtc javatest] $ javac charset. Java
- [Zhankunlin @ icthtc javatest] $ Java charset (lang = zh_cn.gb18030, that is, the default encoding is gb18030)
- China ABC
- 7 (the default encoding is gb18030, so it occupies 7 bytes)
- Gb2312: 7
- GBK: 7
- Gb18030: 7
- UTF-8: 9
- UTF-16: 12
- UTF-32: 20
- UNICODE: 12
- [Zhankunlin @ icthtc javatest] $ export lang = zh_CN.UTF-8 (Change System code to UTF-8)
- [Zhankunlin @ icthtc javatest] $ Java charset
- Juan .. ABC (because the xshell Terminal code is not set to UTF-8, so the print garbled)
- 9 (the operating system encoding is UTF-8, so it occupies 9 bytes)
- Gb2312: 7
- GBK: 7
- Gb18030: 7
- UTF-8: 9
- UTF-16: 12
- UTF-32: 20
- UNICODE: 12
Now set the Xshell terminal encoding to UTF-8 and run the program again:
View plaincopyprint?
- [Zhankunlin @ icthtc javatest] $ Java charset
- China ABC (print normal)
- 9
- Gb2312: 7
- GBK: 7
- Gb18030: 7
- UTF-8: 9
- UTF-16: 12
- UTF-32: 20
- UNICODE: 12
- [Zhankunlin @ icthtc javatest] $ Vim charset. Java
View plaincopyprint?
- [Zhankunlin @ icthtc javatest] $ javac charset. java (the program code file encoding is gb18030, while the system encoding is UTF-8 at the time of compilation, the compiler will read the code file for compilation in the way of operating system encoding if there is no specified, so there is a warning)
- Charset. Java: 6: bytes ?. Forbidden .??. Utf8 ?.??.. Rejected... Enabled
- String MSG = "ABC ";
- ^
- Charset. Java: 6: bytes ?. Forbidden .??. Utf8 ?.??.. Rejected... Enabled
- String MSG = "ABC ";
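The warning can be understood as a failed strict decode: the GB18030 bytes of the Chinese characters are not valid UTF-8. The following sketch performs the same kind of check with the standard `CharsetDecoder` API. It is only an illustration and makes no claim about how javac is implemented; `StrictDecodeDemo` and `decodesCleanly` are invented names, and the GB18030 charset is assumed to be available.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecodeDemo {
    // True if every byte sequence is valid in the given charset.
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Raised at the first byte sequence the charset cannot decode
            return false;
        }
    }

    public static void main(String[] args) {
        Charset gb18030 = Charset.forName("GB18030");
        // Stand-in for the bytes of a source file saved as GB18030
        byte[] source = "\u4e2d\u56fdABC".getBytes(gb18030);

        System.out.println(decodesCleanly(source, StandardCharsets.UTF_8)); // false
        System.out.println(decodesCleanly(source, gb18030));                // true
    }
}
```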
View plaincopyprint?
- [Zhankunlin @ icthtc javatest] $ javac-encoding gb18030 charset. Java (if you use the-encoding option to specify the encoding format of the program file, compilation will not fail)
- [Zhankunlin @ icthtc javatest] $ Java charset {print normal, because the xshell terminal encoding has been set to UTF-8 }}
- China ABC
- 9
- Gb2312: 7
- GBK: 7
- Gb18030: 7
- UTF-8: 9
- UTF-16: 12
- UTF-32: 20
- UNICODE: 12