Solving garbled Chinese text in Java (Part 1): understanding character sets

Last Update:2015-01-04 Source: Internet

Author: User

After a long silence (about three months), I "can't hold back" any longer and have started blogging again!

Garbled Chinese text in Java is a perennial problem. Every time I run into it, I either fall back on past experience or search Baidu for a fix. After reading many blog posts on the subject, I found that most of us (myself very much included) do not understand the problem clearly, so in this series (probably only a few posts) I want to analyze and solve Java's garbled-Chinese problem thoroughly. If you spot mistakes, I hope fellow developers will point them out! Of course, this series is not entirely original: it summarizes and organizes what others have written before me, so any resemblance is because I used their work as reference ...

Problem Origin

A computer knows only two things, 0 and 1. Whether in memory or on an external storage device, the text, pictures, videos and other "data" we see are all stored in binary form. The rule that maps each character to a binary number is that character's encoding, and a collection of such character encodings is called a character set.

Early computer systems used very few characters: the 26 English letters, the digits, and some common symbols. One byte was enough to encode them. As computers spread and had to handle the languages of other countries around the world, this tiny set of character codes was no longer enough. Unicode was therefore proposed. It was originally designed as a double-byte (16-bit) encoding, compatible with English characters and with the double-byte character encodings of other languages; it has since grown beyond 16 bits, with code points up to U+10FFFF.

To standardize encoding, each country or region defined its own character set encoding for computer information interchange, so that local text could be processed by computers. This produced many localized versions and introduced concepts such as LANG and codepage. Today, most internationalized software does its core character processing in Unicode. At run time it determines the local character encoding from the locale/LANG/codepage settings and handles local characters accordingly, converting between Unicode and the local character set as needed.

Java is the same: it uses Unicode internally, so a running Java program inevitably converts between Unicode and the encoding supported by the operating system or browser. This conversion involves a series of steps, and if any one step goes wrong, the output text is garbled.

So garbled text in Java comes down to an incorrect encoding conversion between the JVM and the operating system or browser.
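A minimal sketch of such a failed conversion, assuming the mismatch is between UTF-8 and ISO-8859-1 (the class and method names are made up for illustration, and the Chinese text is written with Unicode escapes so the example does not depend on the source file's encoding):

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encode with UTF-8 but decode with ISO-8859-1: a deliberate mismatch.
    static String garble(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // Undo the mismatch. ISO-8859-1 maps every byte to exactly one char,
    // so the original UTF-8 bytes can be recovered without loss.
    static String repair(String garbled) {
        byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // the two characters of "Chinese (language)"
        String garbled = garble(original);
        System.out.println(garbled.equals(original));         // false
        System.out.println(garbled.length());                 // 6, one char per UTF-8 byte
        System.out.println(repair(garbled).equals(original)); // true
    }
}
```

The repair step only works here because ISO-8859-1 never discards a byte. With other wrong decoders, bytes that have no mapping may be replaced and the original text cannot be recovered.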

In fact, fixing garbled Chinese text in a Java program is often very simple, but understanding the cause and locating the problem requires understanding the existing Chinese character encodings and how conversion between them works.

Common character encodings

To handle text in many character sets accurately, a computer needs character encodings so that it can recognize and store the text. Common character encodings include ASCII, the GB* family, and Unicode. I introduce each of them only briefly here. (Why only briefly? While researching character encodings online I found the topic more complex than I had imagined, so it needs a separate, detailed post. For now, just skim along!)

1. ASCII encoding

ASCII (American Standard Code for Information Interchange) is a computer encoding system based on the Latin alphabet, used mainly to display modern English and other Western European languages. It is the most widely used single-byte encoding system today.

ASCII uses 7-bit or 8-bit binary combinations to represent 128 or 256 possible characters. Standard ASCII uses 7 bits (2^7 = 128) to represent all uppercase and lowercase letters, the digits, punctuation marks, and some special control characters; when a code is stored in a byte, the highest bit is always 0. Codes 0~31 and 127 (33 in total) are control characters or communication-specific characters. Codes 32~126 (95 in total) are printable characters (32 is the space): 48~57 are the ten digits 0 to 9, 65~90 are the 26 uppercase English letters, 97~122 are the 26 lowercase English letters, and the rest are punctuation marks, arithmetic symbols, and so on.
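The ranges above can be sketched as a small lookup. This is only an illustration: the class name, method name, and category labels are made up for the example.

```java
class AsciiRanges {
    // Classify a code according to the standard 7-bit ASCII ranges.
    static String classify(int code) {
        if (code < 0 || code > 127) return "not standard ASCII";
        if (code <= 31 || code == 127) return "control";
        if (code == 32) return "space";
        if (code >= 48 && code <= 57) return "digit";
        if (code >= 65 && code <= 90) return "uppercase letter";
        if (code >= 97 && code <= 122) return "lowercase letter";
        return "punctuation or symbol";
    }

    public static void main(String[] args) {
        for (char c : new char[] {'A', 'z', '7', ' ', '+', '\n'}) {
            System.out.println((int) c + " -> " + classify(c));
        }
        // Standard ASCII fits in 7 bits, so the top bit of the byte is 0.
        System.out.println(('z' & 0x80) == 0); // true
    }
}
```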

2. GB* encodings

The biggest drawback of ASCII is the limited number of characters it can represent. It handles some Western European languages, but it is helpless with most others. As computers spread, this shortcoming became more and more obvious, and other countries and regions that wanted to use computers had to design their own national or regional encoding rules. To display Chinese, for example, a set of rules was needed to map Chinese characters to numbers the computer can work with.

GB2312 is used in mainland China for Chinese character processing and for information interchange between Chinese communication systems. Its rules are: a byte below 127 keeps its original ASCII meaning, but two consecutive bytes above 127 together represent one Chinese character. The first byte (the high byte) ranges from 0xA1 to 0xF7 and the second byte (the low byte) from 0xA1 to 0xFE, which gives room for about 7,000 simplified Chinese characters. GB2312 covers about 99% of everyday usage, but it cannot handle rare characters such as those in personal names, place names, and classical Chinese, which is why GBK and GB 18030 followed.
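A short sketch of the byte layout just described, assuming the running JRE ships the "GB2312" charset (standard JDK builds do, but it is not one of the charsets every Java platform is required to support). The class and helper names are illustrative only.

```java
import java.nio.charset.Charset;

class Gb2312Bytes {
    // Return the GB2312 bytes of the text as unsigned values (0..255).
    static int[] encode(String text) {
        byte[] raw = text.getBytes(Charset.forName("GB2312"));
        int[] unsigned = new int[raw.length];
        for (int i = 0; i < raw.length; i++) {
            unsigned[i] = raw[i] & 0xFF;
        }
        return unsigned;
    }

    // Check a byte pair against the high-byte and low-byte ranges of GB2312.
    static boolean inDoubleByteRange(int high, int low) {
        return high >= 0xA1 && high <= 0xF7 && low >= 0xA1 && low <= 0xFE;
    }

    public static void main(String[] args) {
        int[] ascii = encode("A");      // stays a single ASCII byte
        int[] hanzi = encode("\u4e2d"); // the character for "middle"
        System.out.printf("%d byte: 0x%X%n", ascii.length, ascii[0]);
        System.out.printf("%d bytes: 0x%X 0x%X%n", hanzi.length, hanzi[0], hanzi[1]);
        System.out.println(inDoubleByteRange(hanzi[0], hanzi[1])); // true
    }
}
```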

GB 18030, in full the national standard GB 18030-2005 "Information Technology: Chinese Coded Character Set", is one of the basic standards that computer systems in China must follow. It has two versions, GB 18030-2000 and GB 18030-2005. GB 18030-2000 is the replacement for GBK; its main feature is that it adds the CJK Unified Ideographs Extension A characters on top of GBK.

GB 18030 has the following main characteristics:

Like UTF-8, it is a multi-byte encoding: each character takes 1, 2, or 4 bytes.

The encoding space is large, with room for up to about 1.61 million characters.

It supports the scripts of China's ethnic minorities without resorting to the user-defined character area.

Its Chinese characters include traditional characters as well as characters used in Japanese and Korean.
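The 1, 2, or 4 byte lengths can be observed with a small sketch, assuming the running JRE provides the "GB18030" charset (`Charset.forName` throws if it does not). The class name and sample characters are chosen for illustration.

```java
import java.nio.charset.Charset;

class Gb18030Lengths {
    // Number of bytes GB18030 needs for the given text.
    static int byteLength(String text) {
        return text.getBytes(Charset.forName("GB18030")).length;
    }

    public static void main(String[] args) {
        String ascii = "A";
        String commonHanzi = "\u4e2d";
        // U+20000 lies outside the 16-bit range, so build it from its code point.
        String rareHanzi = new String(Character.toChars(0x20000));

        System.out.println(byteLength(ascii));       // 1
        System.out.println(byteLength(commonHanzi)); // 2
        System.out.println(byteLength(rareHanzi));   // 4
    }
}
```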

GBK is one of the Chinese character encoding standards; its full name is the "Chinese Internal Code Extension Specification". It is backward compatible with GB 2312 and forward compatible with the international standard ISO 10646.1, serving as a bridge in the transition from the former to the latter. Its overall code range is 0x8140 to 0xFEFE: the first byte runs from 0x81 to 0xFE and the second from 0x40 to 0xFE, excluding 0x7F.

3. Unicode encoding

As mentioned earlier, the world has many countries and many encodings, such as China's GB2312, GBK, and GB18030. Such a jumble causes no trouble locally, but once text travels over a network, the incompatible encodings produce garbled text. To solve this problem, the great Unicode encoding was born.

Unicode lets computers convert and process text across platforms and across languages. It contains almost every symbol in the world, and each symbol is unique. In its coded world, each number represents one symbol and each symbol is represented by one number, with no ambiguity.

Unicode, also known as the unified code, universal code, or single code, is an industry standard created to overcome the limitations of traditional character encoding schemes. It assigns a unified, unique binary code to every character in every language, to meet the needs of cross-language, cross-platform text conversion and processing. Unicode itself is a character set; it has several implementations, such as UTF-8 and UTF-16.
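The difference between the character set (one number per character) and its implementations (different byte sequences for that number) can be sketched as follows. The class and the `hex` helper are hypothetical names for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CodePointVsEncoding {
    // Render the encoded bytes of the text as space-separated hex pairs.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "\u4e2d";
        // One character, one Unicode number...
        System.out.printf("U+%04X%n", s.codePointAt(0));       // U+4E2D
        // ...but different bytes depending on the implementation chosen.
        System.out.println(hex(s, StandardCharsets.UTF_8));    // E4 B8 AD
        System.out.println(hex(s, StandardCharsets.UTF_16BE)); // 4E 2D
    }
}
```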

UTF-8

The spread of the Internet created strong demand for a unified encoding. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 and UTF-32, but they are rarely used on the Internet. To repeat: UTF-8 is one of the ways Unicode is implemented.

One of UTF-8's most important features is that it is a variable-length encoding. It uses 1~4 bytes per symbol, with the byte length varying by symbol.
The encoding rules for UTF-8 are simple; there are only two:
1) For a single-byte symbol, the first bit of the byte is 0 and the remaining 7 bits hold the symbol's Unicode code point. For English letters, UTF-8 and ASCII are therefore identical.
2) For a symbol that needs n bytes (n > 1), the first n bits of the first byte are set to 1 and bit n+1 is set to 0; the first two bits of every following byte are set to 10. All remaining bits are filled with the symbol's Unicode code point.
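The two rules can be turned into a hand-written encoder and checked against the JDK's own UTF-8 encoder. This is a sketch, not how the JDK implements it: the class and method names are made up, and the thresholds 0x80, 0x800, and 0x10000 are not stated in the text above but follow from counting the payload bits left in each layout (7, 11, 16, and 21 bits).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8ByHand {
    // Encode one Unicode code point by applying the two rules directly.
    // Surrogate code points (U+D800..U+DFFF) are not handled specially here.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("not a Unicode code point: " + cp);
        }
        if (cp < 0x80) {            // rule 1: 0xxxxxxx
            return new byte[] {(byte) cp};
        }
        if (cp < 0x800) {           // rule 2, n = 2: 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        if (cp < 0x10000) {         // rule 2, n = 3: 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        return new byte[] {         // rule 2, n = 4: 11110xxx then three 10xxxxxx
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))};
    }

    // Compare the hand-written result with the JDK's UTF-8 encoder.
    static boolean matchesJdk(int cp) {
        byte[] expected = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(encode(cp), expected);
    }

    public static void main(String[] args) {
        // One sample per byte length: 'A', e-acute, a Chinese character, an emoji.
        for (int cp : new int[] {0x41, 0xE9, 0x4E2D, 0x1F600}) {
            System.out.printf("U+%04X -> %d byte(s), matches JDK: %b%n",
                    cp, encode(cp).length, matchesJdk(cp));
        }
    }
}
```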
