Recently, while parsing a batch of files, some encoded in GBK and some in UTF-8, I ran into garbled text. I believe many people who have done similar work have had the same experience.
Chinese encoding comes into play in several places:
1. encoding of the original file (input encoding)
2. encoding of the output file (output encoding)
3. Eclipse's default character set encoding (right-click the project -> Properties -> Text file encoding)
A file is essentially a byte stream:
File: byte1 byte2 byte3 byte4 byte5 ....
However, the basic unit of a String in Java is the char:
String: char1 char2 char3 char4 ....
Therefore, reading a file in Java involves converting bytes into chars, where one char may come from one or more bytes:
(byte1 byte2) (byte3) (byte4 byte5) ....
char1 char2 char3 ....
A character set encoding is required to convert bytes into chars.
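To make the byte-to-char mapping concrete, here is a minimal sketch showing that the same single char occupies a different number of bytes depending on the encoding. The class name ByteCharDemo and the helper encode are hypothetical names used only for illustration, and the character is written as a Unicode escape so the example does not depend on the source file's own encoding.

```java
import java.nio.charset.Charset;

class ByteCharDemo {
    // Returns the bytes that represent `text` in the named character set.
    static byte[] encode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String s = "\u4e2d"; // the Chinese character "zhong" (U+4E2D)
        System.out.println(s.length());                // 1 (one char)
        System.out.println(encode(s, "GBK").length);   // 2 (two bytes in GBK)
        System.out.println(encode(s, "UTF-8").length); // 3 (three bytes in UTF-8)
    }
}
```

Going the other way, from bytes to chars, only works if the reader knows which of these encodings produced the bytes.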
By default, bytes are converted to chars using the JVM's default character set, which for a program launched from Eclipse is normally taken from the Eclipse text file encoding setting.
If the original file is encoded in GBK but the Eclipse default character set is UTF-8, the bytes are decoded with the wrong character set when the file is read. This is why the Chinese characters come out garbled.
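The mismatch can be reproduced in memory, without any files. The following is a small sketch, not a diagnostic tool: it encodes two Chinese characters as GBK and then decodes the same bytes once as GBK and once as UTF-8. GarbledDemo and decode are hypothetical names introduced for this example.

```java
import java.nio.charset.Charset;

class GarbledDemo {
    // Decodes `bytes` into a String using the named character set.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // two Chinese characters, written as escapes
        byte[] gbkBytes = original.getBytes(Charset.forName("GBK"));

        String right = decode(gbkBytes, "GBK");   // same charset as the bytes
        String wrong = decode(gbkBytes, "UTF-8"); // mismatched charset

        System.out.println(right.equals(original)); // true
        System.out.println(wrong.equals(original)); // false
        // Byte sequences that are invalid in UTF-8 become U+FFFD, the replacement character.
        System.out.println(wrong.contains("\uFFFD")); // true
    }
}
```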
The solution is to specify the character set encoding when creating the reader, for example:
Assume that the file being read is encoded in GBK:
InputStreamReader isr = new InputStreamReader(new FileInputStream(file), "GBK");
BufferedReader reader = new BufferedReader(isr);
String line = reader.readLine();
In this way, line is a correct string, decoded from the file's bytes as GBK.
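A fuller version of the reading step might look like the sketch below, which reads every line and closes the reader automatically. The class GbkReadDemo and the method readLines are hypothetical names; the method takes an InputStream so the example can run on in-memory bytes, and for a real file you would pass new FileInputStream(file) instead.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

class GbkReadDemo {
    // Reads all lines from `in`, decoding its bytes with the named character set.
    static List<String> readLines(InputStream in, String charsetName) {
        List<String> lines = new ArrayList<>();
        // try-with-resources closes the reader (and the underlying stream) when done
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, Charset.forName(charsetName)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Stand-in for a GBK-encoded file: two lines of Chinese text as GBK bytes
        byte[] gbk = "\u4e2d\u6587\n\u7f16\u7801\n".getBytes(Charset.forName("GBK"));
        List<String> lines = readLines(new ByteArrayInputStream(gbk), "GBK");
        System.out.println(lines.size()); // 2
    }
}
```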
If you want to write the line read from the GBK file into another file as UTF-8, you can:
OutputStreamWriter osw = new OutputStreamWriter(new FileOutputStream(outFile), "UTF-8");
BufferedWriter writer = new BufferedWriter(osw);
// ...
writer.write(line);
writer.close(); // flushes the buffered output and releases the file
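Putting the reading and writing steps together, a GBK-to-UTF-8 conversion could be sketched as follows. TranscodeDemo and transcode are hypothetical names, and the method works on streams so it can be tried on in-memory bytes; for files you would pass a FileInputStream and a FileOutputStream. It copies chars in blocks rather than line by line, which is one way to keep the original line endings unchanged.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class TranscodeDemo {
    // Decodes bytes from `in` using `from` and re-encodes them to `out` using `to`.
    static void transcode(InputStream in, String from, OutputStream out, String to) {
        try (Reader reader = new BufferedReader(
                 new InputStreamReader(in, Charset.forName(from)));
             Writer writer = new BufferedWriter(
                 new OutputStreamWriter(out, Charset.forName(to)))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                writer.write(buf, 0, n); // chars are encoding-neutral at this point
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587\n";
        byte[] gbk = text.getBytes(Charset.forName("GBK"));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transcode(new ByteArrayInputStream(gbk), "GBK", out, "UTF-8");
        // The result should be identical to encoding the text directly as UTF-8
        System.out.println(Arrays.equals(out.toByteArray(),
                text.getBytes(StandardCharsets.UTF_8))); // true
    }
}
```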
******** ******************************
A good article on Chinese encoding in Java:
http://www.ibm.com/developerworks/cn/java/j-lo-chinesecoding/index.html