About encoding formats

Source: Internet
Author: User
Tags: character set
UTF stands for Unicode Transformation Format: a way of turning Unicode code points into a concrete byte format. In UTF-16, characters defined in the Unicode Basic Multilingual Plane (whether Latin letters, Chinese characters, or other scripts and symbols) are stored in 2 bytes. Characters defined in the supplementary planes are stored as a surrogate pair, that is, two 2-byte values.
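As a rough illustration of the surrogate-pair rule, the Python sketch below splits a supplementary-plane code point into its two 2-byte values and compares the result with Python's built-in UTF-16 codec. The helper name utf16_surrogates and the sample characters are chosen here for illustration only.

```python
def utf16_surrogates(code_point):
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary-plane code points need a surrogate pair")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits go into the low surrogate
    return high, low


bmp_char = "\u4e2d"          # U+4E2D, a Chinese character in the basic plane
supp_char = "\U0001F600"     # U+1F600, a character in a supplementary plane

bmp_bytes = bmp_char.encode("utf-16-be")    # one 2-byte unit
supp_bytes = supp_char.encode("utf-16-be")  # two 2-byte units
high, low = utf16_surrogates(ord(supp_char))

print(len(bmp_bytes), len(supp_bytes), hex(high), hex(low))
```

The big-endian codec ("utf-16-be") is used so that no BOM is prepended and the byte counts reflect the characters alone.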

Unicode is a character encoding scheme, the same kind of concept as ASCII: it assigns a number to each character. UTF is a storage form (format) for those numbers.

Inside the JVM, when the virtual machine manages data in memory, characters (strings) are encoded in Unicode format (UTF-16).
In the JVM, characters are stored as char, and a char occupies 2 bytes (for example, you can define char c = '字'), so both "字" and "Z" take 2 bytes. After an object is serialized, however, its strings are stored in UTF-8 (Java uses a modified UTF-8 variant): a Chinese character takes 3 bytes, while English letters, digits, and the like take only one byte.

Therefore, for English text the serialized object occupies only about half the usual space, since UTF-16 takes twice as much space as UTF-8 for English. For Chinese text the serialized form is not smaller: each character takes 3 bytes in UTF-8 against 2 bytes in memory.
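The size comparison can be checked with a small Python sketch. It uses Python's standard "utf-16-be" and "utf-8" codecs as stand-ins for the JVM's in-memory and serialized forms; that substitution is an assumption made here (Java's modified UTF-8 produces the same bytes for these sample characters), and the sample strings are arbitrary.

```python
def sizes(text):
    """Return (bytes as UTF-16, bytes as UTF-8) for the given text."""
    return len(text.encode("utf-16-be")), len(text.encode("utf-8"))


english = "Hello123"                    # 8 ASCII characters
chinese = "\u4e2d\u6587\u7f16\u7801"    # 4 Chinese characters from the basic plane

english_sizes = sizes(english)
chinese_sizes = sizes(chinese)

print("english:", english_sizes)   # UTF-16 is twice the size of UTF-8 here
print("chinese:", chinese_sizes)   # UTF-8 is the larger of the two here
```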

The advantage of UTF-16 over UTF-8 is that most characters are stored in a fixed length of 2 bytes, but UTF-16 is not compatible with ASCII encoding.
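To make the ASCII compatibility point concrete, the following sketch encodes the same two letters three ways. The sample text is arbitrary, and the big-endian UTF-16 variant is picked here only to keep a BOM out of the output.

```python
text = "Az"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-be")

# UTF-8 reproduces the ASCII bytes exactly; UTF-16 pads each letter to 2 bytes,
# so software that expects plain ASCII cannot read it unchanged.
print(ascii_bytes, utf8_bytes, utf16_bytes)
```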

UTF-8 (Unicode Transformation Format, 8-bit) allows a BOM but usually does not include one. It is a variable-length, multi-byte encoding designed for international text: English characters take 8 bits (one byte), and Chinese characters take 24 bits (three bytes). UTF-8 can represent the characters used by every country in the world, so it is an international encoding with strong versatility. UTF-8 encoded text displays correctly in any browser that supports the UTF-8 character set, whatever the country. For example, a page encoded in UTF-8 can display Chinese in an English-language copy of IE without the user downloading IE's Chinese language support package.
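A short Python sketch of the two points above: the per-character byte counts, and the optional BOM. Python exposes the BOM-carrying variant under the codec name "utf-8-sig"; the sample text is an arbitrary choice for this example.

```python
import codecs

text = "A\u4e2d"    # one English letter followed by one Chinese character

letter_len = len("A".encode("utf-8"))        # English: one byte
hanzi_len = len("\u4e2d".encode("utf-8"))    # Chinese: three bytes

plain = text.encode("utf-8")          # no BOM, the usual form
with_bom = text.encode("utf-8-sig")   # same bytes with the 3-byte BOM in front

# Decoding with "utf-8-sig" strips the BOM again if one is present.
decoded = with_bom.decode("utf-8-sig")

print(letter_len, hanzi_len, plain, with_bom)
```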

GBK is a Chinese national-standard encoding that extends GB2312 and remains compatible with it. GBK encodes each Chinese character in two bytes, and the highest bit of the first byte is set to 1 to distinguish it from ASCII; ASCII characters such as English letters keep their single-byte form (full-width forms of Latin letters occupy two bytes). GBK covers the Chinese characters in common use and is the national encoding, so it is less general than UTF-8, but UTF-8 text takes up more database space than GBK for Chinese content.
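The byte layout and the storage difference can be inspected with Python's built-in "gbk" codec. The sketch below is only an illustration; describe_gbk is a helper name invented for this example.

```python
def describe_gbk(text):
    """For each character, report its GBK length and whether the first byte has its high bit set."""
    rows = []
    for ch in text:
        encoded = ch.encode("gbk")
        high_bit_set = bool(encoded[0] & 0x80)
        rows.append((ch, len(encoded), high_bit_set))
    return rows


rows = describe_gbk("A\u4e2d")    # an ASCII letter and a Chinese character

chinese = "\u4e2d\u6587"          # two Chinese characters
gbk_size = len(chinese.encode("gbk"))
utf8_size = len(chinese.encode("utf-8"))

print(rows, gbk_size, utf8_size)
```

In Python's codec the ASCII letter comes out as a single byte with the high bit clear, while the Chinese character takes two bytes and starts with a byte whose high bit is set.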

Converting between GBK, GB2312, and UTF-8 must go through Unicode: GBK or GB2312 -> Unicode -> UTF-8, and UTF-8 -> Unicode -> GBK or GB2312.
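In Python this two-step route is explicit: decoding bytes yields a Unicode string, and encoding that string yields bytes in the target format. A minimal sketch, where transcode is a helper name made up for this example:

```python
def transcode(data, source, target):
    """Convert bytes from one encoding to another by way of a Unicode string."""
    unicode_text = data.decode(source)    # source bytes -> Unicode
    return unicode_text.encode(target)    # Unicode -> target bytes


gbk_bytes = b"\xd6\xd0\xce\xc4"           # two Chinese characters encoded in GBK

utf8_bytes = transcode(gbk_bytes, "gbk", "utf-8")
round_trip = transcode(utf8_bytes, "utf-8", "gbk")

print(utf8_bytes, round_trip)
```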

DIVCSS5 puts the functional differences simply:

1, GBK is often used to mean GB2312 encoding, which supports only Simplified Chinese text (GBK itself also covers Traditional Chinese characters)

2, UTF usually refers to UTF-8, which supports Simplified Chinese, Traditional Chinese, English, Japanese, Korean, and other languages (wider text support)

3, Sites in China usually use UTF-8 or GB2312; choose according to your own needs
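The coverage differences in the list above can be probed with Python's built-in codecs. This is a sketch under the assumption that Python's "gb2312" and "gbk" codecs match the character sets being discussed; can_encode and the three sample characters are chosen here for illustration.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


simplified = "\u6c49"    # Simplified Chinese character
traditional = "\u6f22"   # its Traditional Chinese form
korean = "\ud55c"        # a Korean Hangul syllable

coverage = {
    name: [can_encode(ch, name) for ch in (simplified, traditional, korean)]
    for name in ("gb2312", "gbk", "utf-8")
}

print(coverage)
```

Each list holds the results for the simplified, traditional, and Korean samples in that order.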

Specific details are as follows:

For a website or forum with mostly English characters, UTF-8 is recommended to save space. However, many forum plug-ins currently support only GBK.

The differences between the encodings, explained simply: Unicode, GBK, and Big5 are sets of code values, while UTF-8 and UTF-16 are ways of expressing a Unicode value. The first three are mutually incompatible: for the same Chinese character, the three code values are completely different. For example, the Unicode value of "汉" differs from its GBK value; suppose the Unicode value is A040 and the GBK value is B030 (illustrative values only). UTF-8 is then a form of expressing the Unicode value. UTF-8 is defined only in terms of Unicode, so to convert GBK to UTF-8 you must first convert to Unicode and then to UTF-8.
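The real values for one character can be looked up in Python. The sketch below takes the character U+6C49 (the simplified "Han" character, assumed here to be the one the example refers to) and shows its Unicode code point, its unrelated GBK value, and the UTF-8 and UTF-16 forms derived from the code point. The helper utf8_three_bytes is written for this example and handles only the 3-byte range.

```python
def utf8_three_bytes(code_point):
    """Build the 3-byte UTF-8 form of a code point in the range U+0800..U+FFFF."""
    if not 0x0800 <= code_point <= 0xFFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("this sketch only handles non-surrogate 3-byte code points")
    return bytes([
        0xE0 | (code_point >> 12),           # 1110xxxx: top 4 bits
        0x80 | ((code_point >> 6) & 0x3F),   # 10xxxxxx: middle 6 bits
        0x80 | (code_point & 0x3F),          # 10xxxxxx: low 6 bits
    ])


han = "\u6c49"

code_point = ord(han)                    # the Unicode code value
gbk_value = han.encode("gbk")            # a different, unrelated code value
utf8_value = han.encode("utf-8")         # one expression of the code point
utf16_value = han.encode("utf-16-be")    # another expression of the code point
manual_utf8 = utf8_three_bytes(code_point)

print(hex(code_point), gbk_value.hex(), utf8_value.hex(), utf16_value.hex())
```

The UTF-8 bytes are computed from the Unicode code point alone, which is why a GBK value has to be mapped to Unicode before it can be written as UTF-8.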
