1. What is encoding and why is it encoded?
Previously did not think so deep in the question, feel it all take for granted, until one day Java garbled let me kneel, he is not listening to my words, everywhere is garbled, this time I am not going to let it go, I want to clean it up.
As we all know, the text files, stored on the hard disk, are a bunch of binary, 01 combination, it is not carry whatever, even a tiny little, information to tell the text editor hi Buddy son, I am GBK code, I am UTF-8
Code, I am ...
It is very easy, is 01 of the combination, he does not know what he is, so we just have to know what he is and then the right editor appropriate to use it, first look at a sample.
This is a text file whose binary form is a string of 01
When I am using the encoding of the terminal is GBK display (first time)
When I use UTF-8 (second time)
He's just so simple. There are so fickle, very fortunate, this string 01 whether in GBK encoding, or in UTF-8 are valid characters that is, by UTF-8 encoding rules it represents
"Han" and assume according to GBK coding rules to parse it is also "Cha", this is the root, encoding and decoding is not a code. So in most cases you don't understand how the document is going to be messed up. This is required
The reason to encode. Binary files A string of 01 computers recognize but for him it is 01 meaningless. So we have to translate it into something that everyone can recognize. A kind of coding is a kind of mapping to me, that is, to what I just said
The GBK ring
Under the environment it mapped into a thing, UTF-8, under the circumstances it mapped into a thing, but the essence of them there is a thing, that Sir again asked, clearly again UTF-8 and GBK environment are the same, for example A, this is said to,
Compatibility problem between code and code,
The friends who want to explore themselves can explore the rules and compatibility of each code, there is not much to say.
2. Getting Started code
To write a Java file, we'll start with a HelloWorld.
This is the same file I see in different coding environments, see things that are not the same as my system is Ubuntu 12.04 default encoding UTF-8 But in advance declare that I this file but GBK encoded so in GBK form of the codec
Code view only makes sense, Zhang San and work, the bottom to compile my HelloWorld file, see what miracle?
I go, not live, HelloWorld can not make it all wrong, this day can't live. Just don't worry, see the error, this character (of course, this is not Zhang San representative of the character can not be found, but again GBK represents Zhang San 0101
Binary string in the UTF-8 cannot find the map so it can't parse, we can understand the code rules UTF-8 UTF-8, yes, you can't find it right, you do not compile everywhere to execute it? Compiling all over the way how
Come on, don't worry. Javac has a parameter,-encoding <encoding> specify character encoding used by source files that is to say you tell Javac you this Java source File is what kind of code you don't give
Make a mistake, if you make a mistake, fortunately, compile just to tell you where there is a problem, if the unfortunate Han becomes the cha to think about is terrible so do not error does not mean that the program no problem , do not believe you try, I just said
, my system by default is UTF-8 so did not compile in the past this is GBK encoded source files, most people use the Windows, and installed the Chinese language pack, so the default encoding is mostly GBK so it is OK to compile
GBK encoded source files, so generally no problem, but most of the program is finally on-line execution of the environment but Linux Ah, all of you set the virtual machine is-DFILE.ENCODING=GBK This is who do not dare to lazy.
(It's just a little early to say this)
When I add this reference, the compilation goes through.
Feel very sacred and execute my Helloword
UTF-8 Environment for
I'm so excited, I finally got out.
GBK Environment (This is the reason most garbled)
It's not science, it's a mess, God.
We certainly did not forget what Java class file is encoded AH Unicode yes so Zhang San this string so compiled, he is not GBK all the source files, no matter what the encoding is the same as the class file encoding, you did not tell the JVM what encoding to use, So it defaults to UTF-8, so Zhang San is parsed into UTF-8 encoding corresponding to a string of binary, but your output environment is GBK so ....
So I honestly added (this command line environment is still GBK)
What is this out in the System.out? PrintStream is an output stream AH (the default is to flow to the console of course you can go through System.setout (PrintStream out) to let him flow elsewhere, such as text, and then use EditPlus hit
Take a look, by setting different decoding, to see different display, of course, it is not necessary to print to the console, this stream you intend to use what code to show it, GBK environment is certainly in the form of GBK, but the first time is garbled for
What, the JVM is thinking like this
I have a Unicode encoded Zhang San here I'm talking about my system UTF-8 when I'm not telling the JVM what encoding rules to encode, it's sure to use UTF-8 to compile and you get an input stream that is a string of
1112 binary represents UTF-8 under the Zhang San, and the output to the screen here screen (this is where the output flow direction) Do not eat this set of 00001111 does not represent Zhang San, representing the top three I also do not know the word
(embarrassed), so we have to tell the virtual
Machine, I want the Zhang San is GBK encoded Zhang San, I want a string of binary corresponding GBK Zhang San, so add a this is good. Let's write it down here! It's too late to catch the second bus.
The next time I write a code about IO, I feel a little bit simpler than that. (with the learning of some Java and coding-related class effects more)
First write here, I wish you all good play in National day. Do not forget to masturbate two lines of code for a pleasant feeling, because a few days, she could not have forgotten you.
In-depth parsing of Java garbled