Garbled Chinese characters when reading files in Java

Last Update:2018-12-03 Source: Internet

Author: User


Recently, while parsing a batch of files, some encoded in GBK and some in UTF-8, I ran into garbled text. I suspect many people who have done similar work have had the same experience.

Three encodings are involved when handling Chinese text:
1. The encoding of the original file (input encoding)
2. The encoding of the output file (output encoding)
3. The Eclipse default character set (Project -> right-click -> Properties -> Text file encoding)

A file is essentially a byte stream:
File: byte1 byte2 byte3 byte4 byte5 ...
However, the basic unit of a String in Java is the char:
String: char1 char2 char3 char4 ...
Therefore, reading a file in Java involves converting bytes into chars, where one char may be built from one or more bytes:
(byte1 byte2) (byte3) (byte4 byte5) ...
char1 char2 char3 ...
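
To make the byte-to-char grouping concrete, here is a minimal sketch that counts how many bytes the two-character string 中文 occupies in each encoding. The class name ByteCharDemo and the helper byteLength are made up for illustration, and the sketch assumes the running JVM provides the GBK charset (standard JDK builds normally do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteCharDemo {
    // Hypothetical helper: how many bytes the text occupies in a given charset.
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as Unicode escapes so that the
        // encoding of this source file itself does not matter.
        String text = "\u4e2d\u6587";
        Charset gbk = Charset.forName("GBK"); // assumes the JVM ships GBK

        System.out.println(text.length());                            // 2 chars
        System.out.println(byteLength(text, gbk));                    // 4 bytes
        System.out.println(byteLength(text, StandardCharsets.UTF_8)); // 6 bytes
    }
}
```

The same two chars map to two bytes each in GBK and three bytes each in UTF-8, so the reader can only group the bytes correctly if it knows which encoding was used.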

Converting bytes to chars requires a character set (charset).
By default, Java converts bytes to chars using the JVM's default charset, which a program launched from Eclipse typically inherits from the Eclipse default character set.
If the original file is encoded in GBK but that default charset is UTF-8, the bytes are decoded with the wrong charset when the file is read. This is why the Chinese characters come out garbled.
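
The mismatch can be reproduced in memory without any file. The following sketch is an illustration only: the class name WrongCharsetDemo and the helper recode are hypothetical, and GBK is assumed to be available in the JVM. It encodes a string as GBK and then decodes the bytes once with GBK and once with UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WrongCharsetDemo {
    // Hypothetical helper: encode text with one charset, decode with another.
    static String recode(String text, Charset writtenAs, Charset readAs) {
        byte[] bytes = text.getBytes(writtenAs);
        return new String(bytes, readAs);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // two Chinese characters
        Charset gbk = Charset.forName("GBK"); // assumes the JVM ships GBK

        // The charset a Reader uses when none is specified explicitly.
        System.out.println("JVM default charset: " + Charset.defaultCharset());

        // Decoding with the charset the bytes were written in restores the text.
        System.out.println(original.equals(recode(original, gbk, gbk))); // true

        // Decoding GBK bytes as UTF-8 yields different, garbled text.
        System.out.println(original.equals(
                recode(original, gbk, StandardCharsets.UTF_8))); // false
    }
}
```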

The solution is to specify the character set explicitly when creating the reader.
For example, assuming the file being read is encoded in GBK:
InputStreamReader isr = new InputStreamReader(new FileInputStream(file), "GBK");
BufferedReader reader = new BufferedReader(isr);

String line = reader.readLine();
This way, line is a correct string, decoded from the file's bytes as GBK.
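
For reference, a self-contained version of this reading step might look like the sketch below. The class name GbkFileReader and the method readLines are hypothetical, the input path is assumed to come from the first command-line argument, and try-with-resources is used so the reader is closed even if reading fails.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

public class GbkFileReader {
    // Hypothetical helper: read every line of a file using an explicit charset.
    static List<String> readLines(File file, Charset charset) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is assumed to be the path of a GBK-encoded text file.
        for (String line : readLines(new File(args[0]), Charset.forName("GBK"))) {
            System.out.println(line);
        }
    }
}
```

Passing a Charset object rather than the string "GBK" is a small design choice: it avoids the checked UnsupportedEncodingException that the string-based constructor declares.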

To write line, which was read from the GBK-encoded file, to a file encoded in UTF-8:
OutputStreamWriter osw = new OutputStreamWriter(new FileOutputStream(outFile), "UTF-8");
BufferedWriter writer = new BufferedWriter(osw);
// ...
writer.write(line);
// Close the writer when done so that buffered output is flushed to the file.
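
Putting the read and write steps together, a whole file could be converted from GBK to UTF-8 roughly as follows. This is a sketch under assumptions: the class name GbkToUtf8, the method convert, and the use of command-line arguments for the two paths are all made up for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GbkToUtf8 {
    // Hypothetical helper: copy a text file line by line, decoding with
    // inCharset and encoding with outCharset.
    static void convert(File in, Charset inCharset, File out, Charset outCharset)
            throws IOException {
        try (BufferedReader reader = new BufferedReader(
                     new InputStreamReader(new FileInputStream(in), inCharset));
             BufferedWriter writer = new BufferedWriter(
                     new OutputStreamWriter(new FileOutputStream(out), outCharset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.newLine(); // readLine() strips the line terminator
            }
        } // both streams are flushed and closed here
    }

    public static void main(String[] args) throws IOException {
        // args[0]: GBK input path, args[1]: UTF-8 output path (assumed).
        convert(new File(args[0]), Charset.forName("GBK"),
                new File(args[1]), StandardCharsets.UTF_8);
    }
}
```

Note that newLine() writes the platform's line separator, so the output may use different line endings than the input.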

******** ******************************

A good reference on Chinese encoding in Java:

http://www.ibm.com/developerworks/cn/java/j-lo-chinesecoding/index.html

