Java encoding and decoding: garbled-text problems and solutions in various environments


Last Update:2018-07-28 Source: Internet

Author: User


I often run into Java encoding problems at work, but for lack of research I could never give a definitive answer. This weekend I looked up some material online and summarize it here.

Question one: What encoding should be used to read files in Java?

Java can read files in two ways: by byte or by character. Reading by byte means calling InputStream.read() to fill a byte[] array, which is then often converted to a String with new String(byte[]). That last step hides an encoding detail: new String(byte[]) decodes the byte array with the platform default charset, which is GBK on a Chinese-language Windows system (since JDK 18 the default is UTF-8 unless it is overridden). The bytes read from the input stream are not necessarily GBK-encoded, because they depend on the encoding of the file being read.

For example, create a file named demo.txt on the D: drive, write the Chinese word "我们" ("we") in it, and save it. The file's encoding is then ANSI, which on a Chinese-language system means GBK, so the bytes read through the input stream are GBK bytes. Calling new String(byte[]) decodes them with the platform default GBK, and because the encoding of the bytes matches the default decoding, the resulting string is correct.

Now suppose we choose UTF-8 when saving demo.txt, so the file is no longer ANSI but UTF-8. Reading it through a byte input stream now yields UTF-8 bytes, which differ from the earlier GBK bytes. One obvious difference is that GBK uses two bytes for each Chinese character, whereas UTF-8 uses three. If we still construct the String with new String(byte[]), the text is garbled, for the simple reason that construction decodes with the default GBK while the bytes are UTF-8. The correct approach is to construct the String with new String(byte[], "UTF-8"). The encoding of the bytes and the decoding used at construction then match, and nothing is garbled.
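The mismatch described above can be reproduced with a small sketch. It is an illustration under stated assumptions, not the author's test program: the class and method names are hypothetical, a temporary file stands in for D:\demo.txt, the Chinese text is written with Unicode escapes so the example does not depend on the encoding of its own source file, and the JDK is assumed to ship the GBK charset.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ByteReadDemo {
    // "我们" as Unicode escapes, so the source file's own encoding is irrelevant.
    static final String TEXT = "\u6211\u4eec";

    // Reads every byte from the stream; no decoding happens at this stage.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[1024];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }

    // Saves TEXT as UTF-8 in a temp file (a stand-in for demo.txt), reads the
    // raw bytes back, and decodes them with the given charset.
    static String roundTrip(Charset decodeWith) {
        try {
            Path file = Files.createTempFile("demo", ".txt");
            try {
                Files.write(file, TEXT.getBytes(StandardCharsets.UTF_8));
                try (InputStream in = Files.newInputStream(file)) {
                    return new String(readAll(in), decodeWith);
                }
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Decoding with the file's real encoding recovers the text.
        System.out.println(TEXT.equals(roundTrip(StandardCharsets.UTF_8))); // true
        // Decoding UTF-8 bytes as GBK yields garbled text. Assumes GBK is available.
        System.out.println(TEXT.equals(roundTrip(Charset.forName("GBK")))); // false
    }
}
```

Passing the charset explicitly, rather than relying on new String(byte[]), keeps the result independent of the machine's default.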

Having covered the byte input stream, let us turn to the byte output stream.

When a byte output stream writes bytes to a file, we cannot specify the encoding of the generated file (assuming the file did not exist before). So what encoding does the generated file have? Testing shows that it depends on the encoding of the bytes being written. Consider the following code:

OutputStream out = new FileOutputStream("D:\\demo.txt");

out.write("我们".getBytes());

getBytes() encodes the string with the platform default charset, which is GBK here, so the bytes written to demo.txt are GBK bytes and the file is GBK-encoded. If you modify the program slightly to call getBytes("UTF-8") instead, the bytes written are UTF-8 and demo.txt is UTF-8-encoded. One more point: if you replace the Chinese text "我们" ("we") with ASCII characters such as 123 or ABC, getBytes() and getBytes("UTF-8") produce identical bytes, because both charsets encode ASCII characters the same way. The generated file is therefore reported as ANSI (GBK) in either case.

To sum up, the encoding of the bytes in an InputStream depends on the encoding of the file itself, and the encoding of a file generated through an OutputStream depends on the encoding of the bytes written to it.

Next, reading a file with a character input stream.
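The byte-level difference can be checked directly. The following is a minimal sketch with hypothetical class and method names; it passes charsets explicitly instead of relying on the platform default, uses Unicode escapes for "我们", and assumes the JDK provides the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class GetBytesDemo {
    // Number of bytes the text occupies in the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    // True when both charsets produce exactly the same bytes for the text.
    static boolean sameBytes(String text, Charset a, Charset b) {
        return Arrays.equals(text.getBytes(a), text.getBytes(b));
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available in this JDK
        String chinese = "\u6211\u4eec";      // "我们"

        System.out.println(encodedLength(chinese, gbk));                      // 4
        System.out.println(encodedLength(chinese, StandardCharsets.UTF_8));   // 6
        // Chinese text: the two encodings differ, so the files differ.
        System.out.println(sameBytes(chinese, gbk, StandardCharsets.UTF_8));  // false
        // Pure ASCII: identical bytes, so the files are indistinguishable.
        System.out.println(sameBytes("ABC123", gbk, StandardCharsets.UTF_8)); // true
    }
}
```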

First, we need to understand character streams. A character stream can be viewed as a wrapper stream: underneath, it still reads bytes through a byte stream, and then decodes those bytes into characters using the specified encoding. The key class here is InputStreamReader. The Java API describes it as follows: an InputStreamReader is a bridge from byte streams to character streams; it reads bytes and decodes them into characters using a specified charset. The charset that it uses may be specified by name or may be given explicitly, or the platform's default charset may be accepted. In other words, InputStreamReader still reads bytes through a byte stream, and it needs a charset to decode them. If no charset is passed to the constructor, the platform default (GBK here) is used. Taking demo.txt again, and assuming it is GBK-encoded, we read the file with the following code:

InputStreamReader in = new InputStreamReader(new FileInputStream("demo.txt"));

Reading this way produces no garbled text: the file is GBK-encoded, so the bytes read are GBK bytes, and InputStreamReader's default decoding is also GBK. If demo.txt is re-saved as UTF-8, reading it this way produces garbled characters, because the encoding of the bytes (UTF-8) no longer matches the charset used to decode them (GBK). The solution is as follows:

InputStreamReader in = new InputStreamReader(new FileInputStream("demo.txt"), "UTF-8");

This specifies the charset that InputStreamReader decodes with. The encoding and the decoding now match, so no garbled text appears.

Now for the character output stream.
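A short sketch of reading through InputStreamReader with an explicit charset follows. To stay self-contained it wraps a ByteArrayInputStream holding UTF-8 bytes, which is a stand-in for the FileInputStream on a UTF-8 demo.txt; the class and method names are hypothetical, and the GBK charset is assumed to be available.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ReaderDecodeDemo {
    static final String TEXT = "\u6211\u4eec"; // "我们"

    // Decodes the given bytes through an InputStreamReader with an explicit charset.
    // The byte array plays the role of the file contents.
    static String readText(byte[] fileBytes, Charset charset) {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(fileBytes), charset)) {
            StringBuilder text = new StringBuilder();
            char[] chunk = new char[256];
            int n;
            while ((n = reader.read(chunk)) != -1) {
                text.append(chunk, 0, n);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] utf8File = TEXT.getBytes(StandardCharsets.UTF_8);
        // Matching charset: the original text comes back.
        System.out.println(TEXT.equals(readText(utf8File, StandardCharsets.UTF_8))); // true
        // Mismatched charset (assumes GBK is available): garbled text.
        System.out.println(TEXT.equals(readText(utf8File, Charset.forName("GBK")))); // false
    }
}
```

Passing a Charset object rather than a charset name avoids the checked UnsupportedEncodingException that the String-name constructor declares.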

The character output stream works on the same principle as the character input stream. It can also be viewed as a wrapper stream that writes the file through a byte output stream underneath; the difference is that it converts characters to bytes according to the specified encoding. The main class is OutputStreamWriter. The Java API describes it as follows: an OutputStreamWriter is a bridge from character streams to byte streams; characters written to it are encoded into bytes using a specified charset. The charset that it uses may be specified by name or may be given explicitly, or the platform's default charset may be accepted. So it needs a charset to convert the characters written into bytes. If none is specified, it uses GBK here, the output bytes are GBK-encoded, and the resulting file is GBK-encoded as well. Suppose instead you construct the OutputStreamWriter as follows:

OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream("Dd.txt"), "UTF-8");

The characters written are then encoded as UTF-8 bytes, and the resulting file is UTF-8-encoded.

Question two: If a file must be read with the same encoding it was saved in, and javac also has to read source files in order to compile them, what encoding does javac use?

This question rarely comes to mind because it rarely causes trouble, but it is worth digging into. Three scenarios are discussed below, each a common way of compiling Java source files.

1. Compiling Java source files with javac on the console.
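The effect of the charset passed to OutputStreamWriter can be seen by inspecting the bytes it emits. This is a sketch with hypothetical names: a ByteArrayOutputStream stands in for the FileOutputStream so that the bytes can be examined without touching the disk, and the GBK charset is assumed to be available.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WriterEncodeDemo {
    // Writes the text through an OutputStreamWriter and returns the bytes that
    // would have ended up in the file.
    static byte[] writeText(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // Closing the writer flushes its internal buffer into the sink.
        try (Writer writer = new OutputStreamWriter(sink, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String chinese = "\u6211\u4eec"; // "我们"
        byte[] utf8 = writeText(chinese, StandardCharsets.UTF_8);
        byte[] gbk = writeText(chinese, Charset.forName("GBK")); // assumes GBK is available

        System.out.println(utf8.length); // 6
        System.out.println(gbk.length);  // 4
        // Decoding with the same charset that was used for writing restores the text.
        System.out.println(chinese.equals(new String(utf8, StandardCharsets.UTF_8))); // true
    }
}
```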

Usually we create a Java file Demo.java by hand and save it. The file's encoding is then ANSI, which on a Chinese-language system is GBK. We then compile the source file with the command "javac Demo.java". javac also has to read the source file, so what encoding does it use to decode the bytes it reads? javac decodes them with the operating system default, GBK, which is also the encoding of Demo.java. The two match, so nothing is garbled. Now change one thing: when saving Demo.java, choose UTF-8, so that the file's encoding is UTF-8. We then compile with "javac Demo.java"; if Demo.java

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
