How Java output streams handle UTF-8 encoding

Source: Internet
Author: User
Tags: 0xc0

Recently I was fixing a problem with uploading files that have Chinese paths from Windows to an FTP server, and it gave me a headache for a long time.

When the number of Chinese characters read from Windows is odd, part of the path is recognized as garbled characters. When the number of Chinese characters is even, the path is recognized correctly.

Below is an analysis by an expert that I found online, which explains why. After reading it, I realized that the culprit was the Java output stream all along.

Reference:

UTF-8 has caused me all sorts of headaches lately; I have nearly had to learn to read binary encodings by eye. Here is a small problem I ran into today.
When Java writes a file in GBK format, it seems to automatically replace the byte 0x80, which it cannot recognize in GBK encoding, with "?", that is, 0x3F. That alone is harmless, but when the program reads UTF-8 text and writes it out to a new file, the output is corrupted.
As we all know, English letters and symbols in UTF-8 are compatible with ASCII. At the same time, the byte 0x80 rarely appears in most Chinese text. So if you write UTF-8 encoded mixed Chinese-English text through an ASCII-style writer and then open the file as UTF-8, with luck you get away with it and see no errors. But once the text contains a character whose UTF-8 encoding includes a 0x80 byte, the luck runs out.
For example, the Unicode code point of the character 性 is 0x6027. Converted to UTF-8, it looks like this:

Character:       性
Unicode (hex):   60 27
Unicode (bits):  01100000 00100111
UTF-8 (bits):    [1110]0110 [10]000000 [10]100111
UTF-8 (hex):     E6 80 A7
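The bit shuffling above can be reproduced with a few shifts and masks. The following is only a minimal sketch: the class name Utf8ThreeByte and the helpers encode3 and hex are made up for illustration, and encode3 assumes a code point in the three-byte range U+0800 to U+FFFF (excluding surrogates). It compares the hand-built bytes with what the JDK's own UTF-8 encoder produces.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class Utf8ThreeByte {
    // Encode a code point in U+0800..U+FFFF into three UTF-8 bytes by hand.
    static byte[] encode3(int cp) {
        return new byte[] {
            (byte) (0xE0 | (cp >> 12)),          // header 1110 + top 4 bits
            (byte) (0x80 | ((cp >> 6) & 0x3F)),  // header 10 + middle 6 bits
            (byte) (0x80 | (cp & 0x3F))          // header 10 + low 6 bits
        };
    }

    // Format bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+6027 is the character from the example above.
        System.out.println(hex(encode3(0x6027)));                             // E6 80 A7
        System.out.println(hex("\u6027".getBytes(StandardCharsets.UTF_8)));  // E6 80 A7
    }
}
```

The middle six bits of 0x6027 are all zero, which is why the second byte comes out as a bare 0x80.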

In the breakdown above, the brackets mark the UTF-8 header bits of each byte. Windows is little-endian, so the actual sequence is as follows (each byte is written low nibble first):

0110 1110 0000 1000 0111 1010

Here the second byte is 0x80, and that is where the tragedy happens. The Java output stream "intelligently" replaces it with "?":

0110 1110 1111 0011 ...
6    E    F    3

The result is garbled. The content is a mess. It becomes "<E6> ?".
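The same replacement mechanism can be reproduced in a few lines. This is a sketch under assumptions, not the code from the original analysis: it uses US-ASCII rather than GBK (so every non-ASCII byte is replaced, not only 0x80), and the class name QuestionMarkDemo and its methods are invented for illustration. It also shows that passing the bytes through untouched, here via ISO-8859-1, leaves the UTF-8 data intact.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class QuestionMarkDemo {
    // Format bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Decode with the wrong charset, then encode again: data is lost.
    static byte[] lossyRoundTrip(byte[] utf8) {
        // Bytes >= 0x80 are not valid US-ASCII and decode to U+FFFD.
        String wrong = new String(utf8, StandardCharsets.US_ASCII);
        // U+FFFD cannot be encoded as US-ASCII, so each one becomes '?' (0x3F).
        return wrong.getBytes(StandardCharsets.US_ASCII);
    }

    // ISO-8859-1 maps every byte to exactly one char and back, so nothing is lost.
    static byte[] losslessRoundTrip(byte[] utf8) {
        String latin1 = new String(utf8, StandardCharsets.ISO_8859_1);
        return latin1.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] utf8 = "a\u6027".getBytes(StandardCharsets.UTF_8); // 61 E6 80 A7
        System.out.println(hex(lossyRoundTrip(utf8)));    // 61 3F 3F 3F
        System.out.println(hex(losslessRoundTrip(utf8))); // 61 E6 80 A7
    }
}
```

In real code the more direct fix is usually to name the correct charset explicitly on both the reader and the writer, for example by passing StandardCharsets.UTF_8 to InputStreamReader and OutputStreamWriter, instead of relying on the platform default.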

Another problem that once cost me a whole day: according to the UTF-8 encoding rules, the two-byte sequence 0xC0 0x80 should never occur. If you decode it by the rules:

 0xC0           0x80
 [110]0 0000    [10]00 0000
 0000  0000

This is actually 0x0! It must not be encoded this way. On the wiki, I found that this is a proprietary scheme on Windows, adopted purely for the Windows system's own convenience so that the character can be told apart from ordinary ASCII text. As a result, if you try to insert such text into a MySQL database, the database judges it to be invalid UTF-8 and coldly rejects it!
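Whatever produced the bytes in the case above, one place where this overlong form can be observed from Java itself is DataOutputStream.writeUTF, whose documented "modified UTF-8" format writes U+0000 as C0 80. The sketch below is illustrative only (the class name OverlongNulDemo and its helper methods are invented here); it shows those bytes and checks that the JDK's strict UTF-8 decoder rejects them, much as the database did.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, for illustration only.
class OverlongNulDemo {
    // Bytes that writeUTF emits for s, without its 2-byte length prefix.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // True if a strict UTF-8 decoder accepts the bytes.
    // A fresh decoder reports malformed input by throwing.
    static boolean isStrictUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] modified = modifiedUtf8("\u0000");                        // C0 80
        byte[] standard = "\u0000".getBytes(StandardCharsets.UTF_8);     // 00
        System.out.println(Arrays.toString(modified) + " strict? " + isStrictUtf8(modified));
        System.out.println(Arrays.toString(standard) + " strict? " + isStrictUtf8(standard));
    }
}
```

A check like isStrictUtf8 can serve as a guard before sending text to a database, so that invalid sequences are caught in the application rather than rejected on insert.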
So that sorts out these UTF-8 encoding problems. Sometimes you still have to go down to the binary level to look at a problem; that makes it much clearer where things go wrong. Solving it is then no longer shotgun programming, where you try one thing after another without knowing what is actually wrong.
