Solving garbled Chinese text in Java (1) ----- understanding character sets



After a long silence (a little over three months), I couldn't hold back any longer and have started blogging again!

Chinese text is a common source of trouble in Java encoding. Every time I ran into garbled Chinese, I either patched it based on past experience or searched Baidu for a fix. After reading many blog posts on solving garbled Chinese text, I found that most of us (myself included) do not understand the problem clearly. So in this series of posts (only a few are planned) I want to analyze garbled Chinese text in Java thoroughly and solve it. If you spot any errors, please point them out! This series is not entirely original; it summarizes the work of those who came before me, so any similarities are simply because I drew on their work.

Problem Origin

A computer recognizes only 0 and 1. Whether in memory or on external storage, the text, images, videos, and other "data" we see all exist in binary form. The rules that map characters to binary numbers are called a character encoding, and a set of encoded characters is called a character set.

Early computer systems used very few characters: the 26 English letters, the digits, and some common symbols. One byte was enough to encode them. As computers spread, these few codes were no longer enough to cover the languages of other nations. Unicode was proposed to address this. It was originally designed as a double-byte (16-bit) encoding, has since been extended beyond that range, and remains compatible with the encoding of English characters while also covering the characters of the double-byte encodings used in other countries.

For the sake of unified encoding, each country or region defined a character set standard for information interchange on its own computers. To handle local text, localized versions of software appeared, along with the LANG and codepage concepts. Today, the core text processing of most internationalized software is Unicode-based. At run time, the software determines the local character encoding from the current Locale/LANG/codepage settings and handles local characters accordingly, so text must be converted between Unicode and the local character set during processing.

Java also uses Unicode internally (a String is a sequence of UTF-16 code units). While a Java program runs, text must therefore be converted between this internal form and the encodings supported by the operating system or the browser. The conversion involves a series of steps, and if any step goes wrong, the output is garbled.
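To see what "Unicode internally" means in practice, here is a minimal sketch. It relies on the standard behavior of String, whose length() counts UTF-16 code units rather than whole characters. The class and method names are illustrative only and do not come from the original post.

```java
class InternalUtf16Demo {
    // Number of UTF-16 code units (Java chars) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points (whole characters) in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+4E2D, written as an escape so the source file encoding does not matter.
        String common = "\u4e2d";
        // U+20000 lies outside the 16-bit range and needs two code units.
        String rare = new String(Character.toChars(0x20000));

        System.out.println(codeUnits(common) + " code unit(s), " + codePoints(common) + " code point(s)");
        System.out.println(codeUnits(rare) + " code unit(s), " + codePoints(rare) + " code point(s)");
    }
}
```

The second string is one character to a reader but two chars to Java, which is worth remembering whenever text is counted or cut by index.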

In short, garbled text in Java means that an encoding conversion between the JVM and the operating system or browser went wrong.

In fact, the fix for garbled Chinese text in a Java program is often very simple. To understand the reason behind it and to locate the problem, however, you also need to understand the existing Chinese character encodings and how conversion between them works.
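The kind of conversion error described in this section can be reproduced in a few lines. The sketch below is only an illustration under one assumed scenario: text is encoded as UTF-8 but decoded as ISO-8859-1. The class name and the transcode helper are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encode the text with one charset, then decode the bytes with another.
    static String transcode(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        // Two Chinese characters (U+4E2D U+6587), written as escapes so the
        // source file encoding cannot interfere with the demonstration.
        String original = "\u4e2d\u6587";

        // Mismatch: UTF-8 bytes read back as ISO-8859-1 give garbled text.
        String garbled = transcode(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println("garbled equals original? " + garbled.equals(original));

        // ISO-8859-1 maps every byte to exactly one char, so in this
        // particular case the damage can be undone by reversing the step.
        String recovered = transcode(garbled, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println("recovered equals original? " + recovered.equals(original));
    }
}
```

Not every mismatch is reversible like this one. If the wrong decoder replaces bytes it cannot map, the original bytes are lost.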

Common character encoding

To process characters from different character sets accurately, a computer must encode them so that it can recognize and store all kinds of text. Common character encodings include ASCII, the GB family (GB2312, GBK, GB18030), and Unicode. I will introduce each of them briefly below. (Why only briefly? When I went looking online for material on character encoding, it turned out to be far more complicated than I had thought, so a detailed introduction is beyond me for now. Let's just take a quick look!)

1. ASCII Encoding

ASCII (American Standard Code for Information Interchange) is a character encoding system based on the Latin alphabet. It is mainly used to display modern English and other Western European languages, and it is currently the most widely used single-byte encoding system.

ASCII uses 7 or 8 binary digits to represent 128 or 256 possible characters. Standard ASCII uses 7 bits (2^7 = 128) to represent all uppercase and lowercase letters, the digits, punctuation marks, and some special control characters; the highest bit of the byte is set to 0. Codes 0~31 and 127 (33 in total) are control characters or special communication characters. Codes 32~126 (95 in total) are printable characters (32 is the space): 48~57 are the ten Arabic numerals 0 to 9, 65~90 are the 26 uppercase English letters, 97~122 are the 26 lowercase English letters, and the rest are punctuation marks and operators.
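The ranges listed above can be written down directly as a small lookup. This is a minimal sketch for illustration; the class name, method name, and category labels are made up for this example.

```java
class AsciiRanges {
    // Classify a code according to the standard 7-bit ASCII ranges.
    static String classify(int code) {
        if (code < 0 || code > 127) return "not ASCII";
        if (code <= 31 || code == 127) return "control";
        if (code == 32) return "space";
        if (code >= 48 && code <= 57) return "digit";
        if (code >= 65 && code <= 90) return "uppercase";
        if (code >= 97 && code <= 122) return "lowercase";
        return "punctuation";
    }

    public static void main(String[] args) {
        // A Java char widens to int, giving its numeric code.
        char[] samples = {'A', 'z', '7', ' ', '+', '\n'};
        for (char c : samples) {
            System.out.println((int) c + " -> " + classify(c));
        }
    }
}
```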



2. GB Family Encodings (GB2312, GBK, GB18030)

The biggest shortcoming of ASCII is that it can represent only a limited number of characters. It handles some Western European languages but is helpless for others. As computers came into wider use, this defect became more and more obvious. To use computers in other countries and regions, each had to design encoding rules for its own language. To display Chinese, for example, a set of rules is needed that converts Chinese characters into numbers a computer can handle.

GB2312 is used for information interchange in Chinese character processing and Chinese communication systems and is widely used in mainland China. Its encoding rules are as follows: a byte smaller than 127 keeps its original (ASCII) meaning, but two consecutive bytes larger than 127 together represent one Chinese character. The first byte (the high byte) ranges from 0xA1 to 0xF7 and the second byte (the low byte) ranges from 0xA1 to 0xFE, which is enough to encode more than 7,000 characters (6,763 of them simplified Chinese characters). These characters cover 99% of everyday usage, but GB2312 cannot handle the rare characters that appear in personal names, place names, and classical Chinese, which is why GBK and GB18030 appeared later.
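The byte ranges above can be checked with a short sketch. The range test itself needs no library support. The main method additionally assumes the JVM ships the optional "GB2312" charset, which most full JDKs do but which is not guaranteed, so it is guarded with Charset.isSupported. The class and method names are illustrative.

```java
import java.nio.charset.Charset;

class Gb2312Ranges {
    // True if the two byte values fall inside the GB2312 two-byte area:
    // high byte 0xA1..0xF7, low byte 0xA1..0xFE.
    static boolean isTwoByteCharacter(int high, int low) {
        return high >= 0xA1 && high <= 0xF7 && low >= 0xA1 && low <= 0xFE;
    }

    public static void main(String[] args) {
        if (!Charset.isSupported("GB2312")) {
            System.out.println("GB2312 charset is not available on this JVM");
            return;
        }
        // U+4E2D, written as an escape so the source file encoding does not matter.
        byte[] bytes = "\u4e2d".getBytes(Charset.forName("GB2312"));
        // Java bytes are signed, so mask with 0xFF to get values 0..255.
        int high = bytes[0] & 0xFF;
        int low = bytes[1] & 0xFF;
        System.out.printf("%02X %02X -> in range: %b%n", high, low, isTwoByteCharacter(high, low));
    }
}
```

The mask with 0xFF matters: without it, a byte such as 0xD6 would appear as a negative number and every range comparison would fail.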

GB18030, in full the national standard GB 18030-2005 "Information technology - Chinese coded character set", is one of the basic standards that computer systems in China must follow. It has two versions, GB18030-2000 and GB18030-2005. GB18030-2000 is the replacement for GBK; its main change is to add the characters of CJK Unified Ideographs Extension A on top of GBK.

GB 18030 has the following features:

Like UTF-8, it is a multi-byte encoding; each character takes 1, 2, or 4 bytes.

The encoding space is huge. It can hold up to 1.61 million characters.

It supports the scripts of China's ethnic minorities without resorting to the user-defined character area.

Its Chinese characters cover traditional characters as well as the Han characters used in Japanese and Korean.

GBK is one of the Chinese character encoding standards. Its full name is "Chinese Internal Code Extension Specification". It is backward compatible with GB2312 and supports the ISO 10646.1 international standard, serving as a transitional standard on the way from the former to the latter. Its overall encoding range is 0x8140 to 0xFEFE: the first byte ranges from 0x81 to 0xFE and the second byte from 0x40 to 0xFE, excluding 0x7F.

3. Unicode Encoding

As mentioned above, the world has many countries and therefore many encodings, such as China's GB2312, GBK, and GB18030. Each works fine locally, but once text travels over a network, incompatible encodings turn it into garbage. Unicode was created to solve this problem.

Unicode lets computers convert and process text across languages. It contains almost every symbol in the world, each with a unique code: one number represents exactly one symbol and one symbol corresponds to exactly one number, with no ambiguity.

Unicode, also known as the unified code, universal code, or single code, is an industry standard created to overcome the limitations of traditional character encoding schemes. It assigns a unified, unique binary code to every character of every language so that text can be converted and processed across languages and platforms. Unicode itself is a character set; it has several implementations (encodings), such as UTF-8 and UTF-16.
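The difference between the character set (one number per character) and its implementations (how that number becomes bytes) can be seen with a short sketch. It uses only charsets that every Java platform is required to support. UTF-16BE is chosen over plain UTF-16 so that no byte order mark is added to the output. The class and method names are illustrative.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CodePointDemo {
    // The Unicode number of the first character, in the usual U+XXXX notation.
    static String codePointLabel(String s) {
        return String.format("U+%04X", s.codePointAt(0));
    }

    // How many bytes the given encoding needs for the string.
    static int byteLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // One Chinese character, written as an escape so the source file
        // encoding does not matter.
        String zhong = "\u4e2d";
        System.out.println(codePointLabel(zhong));                               // U+4E2D
        System.out.println(byteLength(zhong, StandardCharsets.UTF_8));          // 3
        System.out.println(byteLength(zhong, StandardCharsets.UTF_16BE));       // 2
    }
}
```

The code point stays the same in every case; only the byte representation changes with the encoding.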

UTF-8

With the spread of the Internet, a unified encoding became a pressing need. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 and UTF-32, but they are rarely used on the Internet. To repeat: UTF-8 is one of the implementations of Unicode.

The biggest feature of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes to represent a symbol, and the number of bytes depends on the symbol.
The encoding rules of UTF-8 are simple. There are only two:
1) For a single-byte symbol, the first bit of the byte is set to 0 and the remaining seven bits are the Unicode code of the symbol. For English letters, therefore, the UTF-8 encoding is identical to the ASCII code.
2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1 and bit n + 1 is set to 0; the first two bits of each following byte are set to 10. All the remaining bits hold the Unicode code of the symbol.
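The two rules above can be turned into a small hand-written encoder and compared with the JDK's own result. This is a sketch for illustration, not how the JDK implements UTF-8. The class and method names are made up, and the byte-count thresholds (0x80, 0x800, 0x10000) come from the UTF-8 definition rather than from this post.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Rules {
    // Encode one Unicode code point by applying the two rules by hand.
    static byte[] encode(int codePoint) {
        boolean surrogate = codePoint >= 0xD800 && codePoint <= 0xDFFF;
        if (!Character.isValidCodePoint(codePoint) || surrogate) {
            throw new IllegalArgumentException("not an encodable code point: " + codePoint);
        }
        // Rule 1: a single byte whose first bit is 0, followed by 7 data bits.
        if (codePoint < 0x80) {
            return new byte[] {(byte) codePoint};
        }
        // Rule 2: choose n from the size of the code point.
        int n = codePoint < 0x800 ? 2 : codePoint < 0x10000 ? 3 : 4;
        byte[] out = new byte[n];
        // Following bytes start with bits 10 and carry 6 data bits each,
        // filled from the last byte backwards.
        for (int i = n - 1; i >= 1; i--) {
            out[i] = (byte) (0x80 | (codePoint & 0x3F));
            codePoint >>= 6;
        }
        // First byte: n one-bits, then a zero bit, then the bits left over.
        int prefix = (0xFF << (8 - n)) & 0xFF;
        out[0] = (byte) (prefix | codePoint);
        return out;
    }

    public static void main(String[] args) {
        int[] samples = {'A', 0xE9, 0x4E2D, 0x1F600};
        for (int cp : samples) {
            byte[] expected = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            boolean same = Arrays.equals(encode(cp), expected);
            System.out.printf("U+%04X: %d byte(s), matches JDK: %b%n", cp, expected.length, same);
        }
    }
}
```

For U+4E2D the encoder produces the three bytes E4 B8 AD: the first byte begins with 1110 (n = 3) and the other two begin with 10.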


This post is only an opener that introduces character sets. It is simple and light on detail because, while researching character sets, I found the subject far more complex than expected and a bit more than I could handle. I need to study it carefully before writing a more detailed post. Coming soon!

References:

Character Set and character encoding: http://www.cnblogs.com/skynet/archive/2011/05/03/2035105.html

Baidu Encyclopedia: ASCII: http://baike.baidu.com/view/15482.htm

Baidu Encyclopedia: GB2312: http://baike.baidu.com/view/443268.htm?fromtitle=GB2312&fromid=483170&type=syn

Baidu Encyclopedia: GB18030: http://baike.baidu.com/view/889058.htm

Baidu Encyclopedia: GBK: http://baike.baidu.com/view/931619.htm?fromtitle=GBK&fromid=481954&type=search

Baidu Encyclopedia: Unicode: http://baike.baidu.com/view/40801.htm

Baidu Encyclopedia: UTF-8: http://baike.baidu.com/view/25412.htm

If there are any errors, please point them out! Thank you!

----- Original from: http://cmsblogs.com/?p=1395 Please respect the author's hard work and credit the source when reposting.

----- Personal site: http://cmsblogs.com
