Easy to Understand Chinese Garbled Text (1): Cross-Platform Garbling

Source: Internet
Author: User
Tags: mysql, insert

I originally intended to write a single blog post about garbled Chinese text, but the topic turned out to span too much to fit comfortably in one article, so I split it in two.

The other article is "Easy to Understand Chinese Garbled Text (2): Analyzing and resolving the MySQL error 'Incorrect string value: '\xF0 ...' when inserting emoticons from mobile clients".

This article focuses on the theory of character encoding; the other focuses on solving the problem and the reasoning behind the fix.

I. How the problem arises

Garbled Chinese text crops up often in real projects, especially when an inexperienced team underestimates the problem. Web applications, social chat, and other applications that do not control what users type are usually hit hardest. With the mobile internet booming, new characters and emoticons are also in frequent use, and without adequate support for them the user experience is a disaster. So character encoding must be strictly controlled from the very start of system design. In general, it is enough to be clear about how to avoid errors (controlling everything strictly by the rules is hard and tends to leave loopholes, so "no errors" is the practical starting point).

II. Unicode

Unicode is the focus of this article.

Unicode is a universal character set, that is, a definition of characters. Other character sets exist as well, such as ISO 8859-1. But Unicode is widely used and is the industry standard, so we can think of Unicode as the definition of a character inside a computer; in memory, a character is represented as a string of 0s and 1s.

Unicode is also very clean: it defines a unique code for each character rather than for each glyph. In other words, Unicode handles characters in an abstract way (as numbers) and leaves the visual rendering (font size, appearance, shape, style, and so on) to other software, such as a web browser or word processor. Examples of variant written forms cited there include "ɑ/a", "強/强", and "戶/户/戸". (Quoted from Wikipedia.)

So the task of Unicode itself is clear: assign codes to characters, and extend the set as needed.
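To make "a character is a number" concrete, here is a minimal Java sketch. The class name `CodePointSketch` and the helper `describe` are hypothetical names chosen for illustration; `String.codePointAt` is the standard library call that returns the Unicode code point.

```java
final class CodePointSketch {
    // Returns the Unicode code point of the first character of s,
    // formatted in the conventional "U+XXXX" notation.
    static String describe(String s) {
        return String.format("U+%04X", s.codePointAt(0));
    }

    public static void main(String[] args) {
        // Latin capital letter A
        System.out.println(describe("A"));       // prints U+0041
        // The Chinese character for "country", written as an escape so the
        // source file itself does not depend on any particular encoding.
        System.out.println(describe("\u56FD"));  // prints U+56FD
    }
}
```

The number is the same no matter which font draws the character or which bytes are later used to store it.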

III. UTF-8

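UTF-8 is one way of implementing Unicode: a variable-length character encoding. Its counterparts include GBK (one or two bytes per character), Latin-1 (one byte per character), UTF-16 (two or four bytes per character), and so on. Strictly speaking, GBK and Latin-1 encode their own smaller character sets, but their characters all map to Unicode.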

Why is an implementation needed at all? Isn't Unicode already defined? In fact, this separation is common in computer science: the different encodings are like different techniques for realizing the same Unicode definition.

As a simple example, take an array a[]. The definition of a[] plays the role of Unicode. Whether you sort it with quicksort, heapsort, or merge sort, and whether the result is in ascending or descending order, those choices play the role of the encoding format. They are all concrete implementations of the same definition; only the methods differ.

Therefore, whether the bytes are UTF-8, GBK, Latin-1, or something else, decoding them correctly restores the same Unicode code points.
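A small Java sketch of this point: the same character yields different bytes under different encodings, yet decoding each with the matching charset gives back the same code point. The class and method names are hypothetical, and the sketch assumes the Java runtime ships the GBK charset (standard JDK builds normally do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class SameCodePointSketch {
    // Formats bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encodes text with cs, decodes the bytes with the same cs,
    // and returns the code point of the first decoded character.
    static int roundTripCodePoint(String text, Charset cs) {
        byte[] encoded = text.getBytes(cs);
        return new String(encoded, cs).codePointAt(0);
    }

    public static void main(String[] args) {
        String guo = "\u56FD"; // the Chinese character for "country"
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available
        Charset[] all = {StandardCharsets.UTF_8, gbk, StandardCharsets.UTF_16BE};
        for (Charset cs : all) {
            System.out.printf("%-9s bytes=%-9s codePoint=U+%04X%n",
                    cs.name(), hex(guo.getBytes(cs)), roundTripCodePoint(guo, cs));
        }
    }
}
```

The byte columns differ from row to row, while the code point column is identical in every row.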

IV. Relationship between Unicode and UTF-8, and the conversion process

The relationship between Unicode and UTF-8 can be understood through the example above. The real situation, though, looks more like the following.

For example, suppose the Unicode code of the character 国 ("country") is defined as 00000000 00000000 00110100 11000000 (a hypothetical value).

Since its high 16 bits are all 0, only the low 16 bits are kept, to avoid wasting bytes in storage and transmission. Also, because UTF-8 is variable-length, marker bits are needed to indicate how many bytes the encoding uses.

So the UTF-8 encoding of 国 would be 11100011 10010011 10000000 (the bits that follow the 1110 and 10 markers are the original low 16 bits).

Conversion formula: 1110xxxx 10yyyyyy 10zzzzzz. The first byte falls in the range E0-EF; 1110 and 10 are the marker bits.

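The UTF-8 byte layouts by code point range are as follows (from Wikipedia):

Code point range      Byte 1     Byte 2     Byte 3     Byte 4
U+0000 - U+007F       0xxxxxxx
U+0080 - U+07FF       110xxxxx   10xxxxxx
U+0800 - U+FFFF       1110xxxx   10xxxxxx   10xxxxxx
U+10000 - U+10FFFF    11110xxx   10xxxxxx   10xxxxxx   10xxxxxx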
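The three-byte conversion described above can be sketched by hand with bit operations and then checked against the library encoder. This is a minimal illustration, not a full UTF-8 encoder: the class `Utf8ThreeByteSketch` and method `encodeThreeByte` are hypothetical names, and only the three-byte range is handled.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8ThreeByteSketch {
    // Encodes a code point in U+0800..U+FFFF (surrogates excluded) into the
    // three-byte pattern 1110xxxx 10yyyyyy 10zzzzzz.
    static byte[] encodeThreeByte(int codePoint) {
        boolean inRange = codePoint >= 0x0800 && codePoint <= 0xFFFF;
        boolean surrogate = codePoint >= 0xD800 && codePoint <= 0xDFFF;
        if (!inRange || surrogate) {
            throw new IllegalArgumentException("not a three-byte code point");
        }
        return new byte[] {
            (byte) (0xE0 | (codePoint >> 12)),         // 1110xxxx: top 4 bits
            (byte) (0x80 | ((codePoint >> 6) & 0x3F)), // 10yyyyyy: middle 6 bits
            (byte) (0x80 | (codePoint & 0x3F))         // 10zzzzzz: low 6 bits
        };
    }

    public static void main(String[] args) {
        // 0x34C0 is the hypothetical value used in the text:
        // 0011 0100 1100 0000 -> 11100011 10010011 10000000 (E3 93 80).
        byte[] manual = encodeThreeByte(0x34C0);
        byte[] library = new String(Character.toChars(0x34C0))
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(manual, library)); // prints true
    }
}
```

The real code point of 国 is U+56FD, which the same method encodes as E5 9B BD.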

V. Cross-platform garbled Chinese and its solutions

With the knowledge above, we can analyze why text gets garbled across platforms. If Unicode is so well defined, how does it get messed up?

The intuitive answer is that the encoding formats are incompatible.

On Windows (Chinese-language installations), the default encoding is GBK, in which a Chinese character takes two bytes. On Linux, the default encoding is usually UTF-8, in which a Chinese character usually takes three bytes. (These are the defaults.)

Let us explain this with the ordering example from above.

Suppose the Unicode definition is {1,2,3}, GBK stands for forward order, and UTF-8 stands for reverse order. Then the encoding of this Unicode value under GBK is {1,2,3}, and its encoding under UTF-8 is {3,2,1}.

To restore the Unicode value from the GBK encoding, parse the GBK bytes in forward order. To restore it from the UTF-8 encoding, parse the UTF-8 bytes in reverse order. Each decoder matches its own encoder.

But if a GBK encoding is mistaken for UTF-8, the reverse parse produces {3,2,1}. First, that is not the original Unicode value, so the converted result is not what we need. Second, the result may not even be defined in Unicode, and it may show up as blank characters that look like spaces.

This is the cause of cross-platform garbled Chinese: the encoding method and the decoding method do not match.
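The mismatch can be reproduced in a few lines of Java. In this sketch the class `MojibakeSketch` and the helper `misread` are hypothetical names, and GBK is assumed to be available in the runtime. It prints only booleans, because how the garbled strings look depends on the console.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeSketch {
    // Encodes text with one charset and decodes the bytes with another,
    // which is what happens when two platforms disagree about the encoding.
    static String misread(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available
        Charset utf8 = StandardCharsets.UTF_8;
        String original = "\u4E2D\u56FD"; // the two characters of "China"

        // Written as GBK, read as UTF-8: garbled.
        System.out.println(original.equals(misread(original, gbk, utf8)));  // prints false
        // Written as UTF-8, read as GBK: garbled in a different way.
        System.out.println(original.equals(misread(original, utf8, gbk)));  // prints false
        // Written and read with the same charset: intact.
        System.out.println(original.equals(misread(original, gbk, gbk)));   // prints true
    }
}
```

Byte sequences that are invalid in the reading charset are replaced by the decoder with the replacement character U+FFFD, which is why some garbled output cannot be converted back.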

Workaround:

There are many solutions, and the right one depends on your application. You can convert in code (in Java, for example, byte[] utf8 = new String(gbkBytes, "GBK").getBytes("UTF-8"); where gbkBytes holds the raw GBK input), or you can adjust the encoding format at the input side.

But it all boils down to one point. If the current input is GBK-encoded and you need UTF-8, then:

1. Decode it as GBK to obtain the Unicode characters.

2. Encode those characters as UTF-8.
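The two steps above can be sketched in Java as follows. The class `TranscodeSketch` and method `gbkToUtf8` are hypothetical names, and the sketch assumes the runtime provides the GBK charset. Note that the input must be the raw bytes; once bytes have been decoded into a String with the wrong charset, the damage may not be reversible.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class TranscodeSketch {
    // Converts GBK-encoded bytes into UTF-8-encoded bytes.
    static byte[] gbkToUtf8(byte[] gbkBytes) {
        // Step 1: decode the bytes as GBK to get Unicode characters.
        String text = new String(gbkBytes, Charset.forName("GBK"));
        // Step 2: encode those characters as UTF-8.
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // B9 FA is the GBK encoding of the Chinese character for "country".
        byte[] gbkInput = {(byte) 0xB9, (byte) 0xFA};
        StringBuilder hex = new StringBuilder();
        for (byte b : gbkToUtf8(gbkInput)) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim()); // prints E5 9B BD
    }
}
```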

VI. Some interesting uses of character encoding

The only example I can think of right now is some of the funny player names on the 11 game battle platform (I seem to have accidentally revealed something, though I have not played on 11 for a long time). I will update this section when I think of others.

For example, see the following figure:

The player name inside the blue circle is the normal case and is shown in green. The name inside the red circle is shown in blue, which amounts to bypassing the 11 client's restrictions to display a special color.

This is done by adding |r to the name. It is an escape sequence, and it cleverly exploits the character set that the 11 platform leaves open to users in order to produce a special effect. This is why I said at the start that strictly controlling the character set is very hard; keeping things under control well enough to avoid errors (such as garbled text in the system) is already a good result.

Please credit the source when reposting. Thank you!
