Summary: Chinese-language Windows defaults to the GBK character set, so files that are actually UTF-8 can be decoded incorrectly. This article first describes the character set used on Windows, then analyzes the character set relationships among .java files, .class files, and javac, as well as the relationships among the source files, binaries, and compiler of C/C++ projects in VS. The conclusion is that it is best to use the -encoding parameter to tell javac which character set the .java file uses, so that javac does not turn Chinese characters into unrecoverable garbage.
"Reproducing the Problem"
In a Java project, the program's output is garbled depending on which character set the source file is stored in. When the source file is stored as GBK, the program prints the Chinese string ("中文", meaning "Chinese") correctly; when the source file is stored as UTF-8, the output is garbled.
With the source file stored as UTF-8, the output is garbled:
With the source file stored as GBK, the output is normal:
"I. The Windows system default character set"
In 1980, China issued GB2312-80, which contains 7,445 characters in total. Unicode 1.1, released in 1993, included 20,902 Chinese characters, and China issued GB 13000.1-93 (GB13000 for short) as its equivalent. Microsoft then extended GB2312-80 to cover the Chinese characters in GB13000 and Unicode 1.1, producing the GBK encoding, which Windows identifies as code page CP936. For example, running the chcp command in the console shows the code page that Windows is using.
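The same check can be made from inside a JVM. The following is a minimal sketch, not part of the original article: the class name `DefaultCharsetDemo` is made up for illustration, and the `native.encoding` property is assumed to be present only on newer JDKs (it prints as `null` where it is missing).

```java
import java.nio.charset.Charset;

class DefaultCharsetDemo {
    /** Collects the charset-related settings the running JVM reports. */
    static String describe() {
        return "default charset = " + Charset.defaultCharset().name()
                + ", file.encoding = " + System.getProperty("file.encoding")
                // Assumption: only newer JDKs define native.encoding; older ones yield null.
                + ", native.encoding = " + System.getProperty("native.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

On a Chinese-language Windows machine one would expect GBK to show up in at least one of these values, but the exact output depends on the JDK version and on any -Dfile.encoding setting.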
GBK is the default character set on Chinese-language Windows.
"II. VS C++ projects: the relationship between source files, binaries, and the compiler"
The author used VC compiler version 19.00.24210 with VS2015.
Every file is saved in some character set. On Windows, a source file can be saved as either GBK or UTF-8; on Chinese Windows 7, VS saves source files as GBK by default.
After the VC compiler builds the executable, the character set used inside the binary follows the table below. "UTF-8 (with BOM)" means the file begins with a three-byte BOM header, which marks the file as using the UTF-8 character set.
Source file character set | Compiled binary character set
GBK | GBK
UTF-8 (with BOM) | GBK
UTF-8 | UTF-8
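The BOM that separates the second and third rows is the byte sequence EF BB BF at the very start of the file. As a rough sketch of how a tool could tell the two UTF-8 variants apart (the class and method names here are hypothetical, and this is not how the VC compiler itself is implemented):

```java
import java.nio.charset.StandardCharsets;

class BomDemo {
    /** Returns true if the data starts with the UTF-8 BOM bytes EF BB BF. */
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    /** Returns a copy of the data with the three BOM bytes prepended. */
    static byte[] withBom(byte[] data) {
        byte[] out = new byte[data.length + 3];
        out[0] = (byte) 0xEF;
        out[1] = (byte) 0xBB;
        out[2] = (byte) 0xBF;
        System.arraycopy(data, 0, out, 3, data.length);
        return out;
    }

    public static void main(String[] args) {
        byte[] plain = "int main() { return 0; }".getBytes(StandardCharsets.UTF_8);
        System.out.println(hasUtf8Bom(plain));          // false
        System.out.println(hasUtf8Bom(withBom(plain))); // true
    }
}
```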
"III. Java: the relationship between .java, .class, the JVM, and the output console"
At Alibaba, many people use IntelliJ IDEA as the IDE for developing Java applications. IntelliJ IDEA uses the UTF-8 character set by default: the IDE Encoding setting means that the IDE as a whole uses UTF-8, and the Project Encoding setting means that the current project uses UTF-8.
In Java, the character sets of .java files and .class files are related as shown in the table below. Take a string str in the .java file whose value is "中文" ("Chinese"), and consider what str looks like in the .class file in three cases: ① the .java file is saved as GBK, so str holds "中文" encoded as GBK; after javac compiles it, str in the .class file is "中文" stored as UTF-8. ② the .java file is saved as UTF-8 (no BOM), so str holds "中文" encoded as UTF-8; after javac compiles it, str in the .class file is garbled text (such as "涓枃") stored as UTF-8. ③ the .java file is saved as UTF-8 (with BOM) and cannot be compiled.
.java file character set | .class file character set
GBK | UTF-8
UTF-8 (no BOM) | UTF-8 (but the Chinese text is garbled)
UTF-8 (with BOM) | Compilation fails; no .class file is generated
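The first two rows can be imitated without running javac at all, by encoding a string with one character set and decoding the bytes with another. The sketch below is only an illustration under assumptions: it takes "中文" as the example string, writes it with Unicode escapes so the demo itself does not depend on how its own source file is saved, and assumes the JDK in use ships the GBK charset. The names `MojibakeDemo` and `reinterpret` are made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // The example string, written as escapes so this file compiles the same under any -encoding.
    static final String TEXT = "\u4e2d\u6587";

    /** Encodes text with one charset, then decodes the resulting bytes with another. */
    static String reinterpret(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available in this JDK

        // Row 1: file saved as GBK, read as GBK. The text survives.
        System.out.println(reinterpret(TEXT, gbk, gbk).equals(TEXT));                     // true

        // Row 2: file saved as UTF-8, read as GBK. The text is corrupted.
        System.out.println(reinterpret(TEXT, StandardCharsets.UTF_8, gbk).equals(TEXT)); // false

        // The two encodings do not even agree on the byte count for the same two characters.
        System.out.println(TEXT.getBytes(StandardCharsets.UTF_8).length);                // 6
        System.out.println(TEXT.getBytes(gbk).length);                                   // 4
    }
}
```

Once the wrongly decoded string has been written into the .class file, nothing in the file records that the decoding was wrong, so the fix has to be applied at compile time.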
For the second case above, why does the .class file end up holding garbled UTF-8? The reason is that a .class file always stores strings as Unicode, in a UTF-8-compatible encoding, whereas a .java file may use any character set. The character sets used during a Java build are: .java (any encoding) -> .class (Unicode) -> JVM (Unicode). The garbling happens because javac reads the UTF-8 .java file as if it were GBK. The character set of the .java file can be passed to javac with -encoding; when it is not specified, javac assumes the .java file uses the system's default character set. Because -encoding was not used, javac treated the UTF-8 .java file as a GBK file, which produced the garbled characters. For details, see: https://www.zhihu.com/question/30977092
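To illustrate the -encoding fix, here is a hedged sketch that drives the compiler through the standard javax.tools API rather than the command line. It assumes the program runs on a full JDK (on a runtime without the compiler, `ToolProvider.getSystemJavaCompiler()` returns null and the sketch gives up). The class `Hello`, its field `STR`, and the temporary directory are hypothetical and exist only for this demo.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

class EncodingFlagDemo {
    // Hypothetical one-line source file. The escapes are resolved when this demo is
    // compiled, so the generated Hello.java contains the raw Chinese characters.
    static final String SOURCE =
            "public class Hello { public static final String STR = \"\u4e2d\u6587\"; }";

    /** Saves SOURCE as UTF-8, compiles it with the given -encoding, and returns Hello.STR. */
    static String compileAndRead(String declaredEncoding) {
        JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
        if (javac == null) {
            return null; // no compiler available in this runtime
        }
        try {
            Path dir = Files.createTempDirectory("encoding-demo");
            Path src = dir.resolve("Hello.java");
            Files.write(src, SOURCE.getBytes(StandardCharsets.UTF_8));

            // Equivalent in spirit to: javac -encoding <declaredEncoding> -d <dir> Hello.java
            int status = javac.run(null, null, null,
                    "-encoding", declaredEncoding, "-d", dir.toString(), src.toString());
            if (status != 0) {
                throw new IllegalStateException("javac exited with status " + status);
            }

            // Load the freshly compiled class and read the string it actually contains.
            try (URLClassLoader loader = new URLClassLoader(new URL[] {dir.toUri().toURL()})) {
                return (String) loader.loadClass("Hello").getField("STR").get(null);
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String str = compileAndRead("UTF-8");
        System.out.println(str == null ? "no compiler available" : str.equals("\u4e2d\u6587"));
    }
}
```

Because the declared encoding matches the bytes on disk, the string read back from the compiled class should equal the original regardless of the system default character set.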
Character set encoding relationships between .java and .class files on Windows, including a similar analysis for C++