Unicode (UTF-8, UTF-16, GB18030, etc.)

Source: Internet
Author: User


Unicode is known in Chinese by names that translate as "unified code", "universal code", "single code", or "standard universal code".

Unicode is developed by the Unicode Consortium, a non-profit organization committed to replacing existing character encoding schemes with Unicode, because those schemes often have only limited code space and are not suitable for multilingual environments.

Unicode is recognized and widely used in the internationalization and localization of computer software. Many newer technologies, such as the Extensible Markup Language (XML), the Java programming language, and modern operating systems, have adopted Unicode.

Unicode encoding and implementation

Generally speaking, the Unicode system can be divided into two levels: the encoding method and the implementation method.

Encoding method

Unicode's encoding method corresponds to the Universal Character Set (UCS) defined by ISO 10646. The Unicode version in practical use at the time of writing corresponds to UCS-2, which uses a 16-bit code space, so each character occupies 2 bytes. In theory this allows 2^16 (65,536) characters, which basically meets the needs of the various languages. In fact, the current version of Unicode does not use the whole 16-bit space; a large part is reserved for special use or future expansion.

These 16-bit Unicode characters make up the Basic Multilingual Plane (BMP). Newer versions of Unicode define 16 supplementary planes; together with the BMP they need a code space of at least 21 bits, slightly less than 3 bytes. In practice, however, a supplementary-plane character still occupies 4 bytes of encoding space, consistent with UCS-4. According to the source, future versions were to be extended to ISO 10646-1 implementation level 3, covering all characters of UCS-4. UCS-4 is a larger, 31-bit character set that is far from fully assigned; with a leading bit that is always 0, each character takes 32 bits, that is, 4 bytes. In theory it can hold up to 2^31 characters, enough to cover the symbols of every language. Note that the Unicode Standard and current editions of ISO/IEC 10646 restrict the code space to U+0000 through U+10FFFF, which is exactly the BMP plus the 16 supplementary planes.
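
The relationship between planes, the 21-bit code space, and 16-bit code units can be checked with a short Java sketch. The class and method names here (SupplementaryPlaneDemo, plane, utf16Units) are made up for illustration; only the java.lang.Character and Integer calls are standard library.

```java
class SupplementaryPlaneDemo {
    // Plane index: each plane holds 0x10000 (65,536) code points.
    static int plane(int codePoint) {
        return codePoint >>> 16;
    }

    // Number of 16-bit UTF-16 code units (Java chars) needed for the code point.
    static int utf16Units(int codePoint) {
        return Character.charCount(codePoint);
    }

    // Number of significant bits in the code point value.
    static int bitsNeeded(int codePoint) {
        return 32 - Integer.numberOfLeadingZeros(codePoint);
    }

    public static void main(String[] args) {
        // A BMP character, the first code point of plane 2, and the last code point.
        int[] samples = {0x4E59, 0x20000, Character.MAX_CODE_POINT};
        for (int cp : samples) {
            System.out.printf("U+%04X plane=%d utf16Units=%d bitsNeeded=%d%n",
                    cp, plane(cp), utf16Units(cp), bitsNeeded(cp));
        }
    }
}
```

A BMP code point fits in one 16-bit unit, while anything in planes 1 to 16 needs two units (a surrogate pair), which is why such characters end up occupying 4 bytes.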

Characters in the Basic Multilingual Plane are written as U+hhhh, where each h represents a hexadecimal digit; this is exactly the same as the UCS-2 code. In the corresponding 4-byte UCS-4 code, the last two bytes are identical to the UCS-2 code and the first two bytes are all 0.
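
As a rough illustration of the notation, the following Java sketch formats a code point as U+hhhh and builds its 4-byte big-endian form by hand. CodePointNotation, notation, and ucs4 are hypothetical names chosen for this example, and the byte layout is assumed to be big-endian.

```java
class CodePointNotation {
    // Formats a code point in the usual U+hhhh notation (at least four hex digits).
    static String notation(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    // Builds the 4-byte big-endian value of a code point (the UCS-4 form).
    static byte[] ucs4(int codePoint) {
        return new byte[] {
            (byte) (codePoint >>> 24),
            (byte) (codePoint >>> 16),
            (byte) (codePoint >>> 8),
            (byte) codePoint
        };
    }

    public static void main(String[] args) {
        int cp = 0x4E59;
        StringBuilder hex = new StringBuilder();
        for (byte b : ucs4(cp)) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        // For a BMP character the first two bytes are 00 00.
        System.out.println(notation(cp) + " -> " + hex.toString().trim());
    }
}
```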

See Universal Character Set.

Implementation method

The way Unicode is implemented differs from the way it is encoded. The Unicode code of a character is fixed. In actual transmission, however, different platforms are not necessarily designed the same way, and there is also a need to save space, so Unicode codes are implemented in different ways. These implementations are called Unicode Transformation Formats (UTF).

For example, if a Unicode file contains only 7-bit ASCII characters and each character is transmitted as its raw 2-byte Unicode code, the first byte of every character is always 0, which is very wasteful. In this case UTF-8 can be used instead. UTF-8 is a variable-length encoding that keeps each basic 7-bit ASCII character as a 7-bit code in a single byte whose leading bit is 0. Other Unicode characters are converted by a fixed algorithm: each BMP character is encoded in 1 to 3 bytes (supplementary-plane characters take 4), and the leading bits of each byte (0 or 1) identify its role in the sequence. This greatly reduces the size of documents that are mostly 7-bit ASCII (for details, see UTF-8). Similarly, supplementary-plane characters and other UCS-4 extension characters, which need 4 bytes, cannot be stored directly in 2-byte UTF-16 and must also be converted by an algorithm.
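
The size difference described above can be observed directly. This is a minimal Java sketch, assuming the standard StandardCharsets constants; the class name EncodedLengths and its helper methods are invented for the example.

```java
import java.nio.charset.StandardCharsets;

class EncodedLengths {
    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the text occupies when encoded as UTF-16 (big-endian, no BOM).
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String ascii = "hello";                                    // 7-bit ASCII only
        String bmp = "\u4E59";                                     // one BMP character
        String supplementary = new String(Character.toChars(0x20000)); // one plane-2 character
        for (String s : new String[] {ascii, bmp, supplementary}) {
            System.out.printf("codePoints=%d utf8=%d bytes utf16=%d bytes%n",
                    s.codePointCount(0, s.length()), utf8Length(s), utf16Length(s));
        }
    }
}
```

ASCII text takes one byte per character in UTF-8 but two in UTF-16, a BMP ideograph takes three bytes in UTF-8 and two in UTF-16, and a supplementary-plane character takes four bytes in both.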

As another example, suppose UTF-16, whose code units match the Unicode code directly (for BMP characters only), is used as is. Each character occupies two bytes, and Macintosh (Mac) machines and PCs do not agree on byte order, so the same byte stream may be interpreted as different content. Take the character with hexadecimal code 4E59, which splits into the two bytes 4E and 59. A system that reads the bytes in the opposite order treats the code as 594E and finds the character "奎", while a system that reads them in the intended order gets U+4E59, the character "乙". In the source's example, a "乙" saved as UTF-16 on Windows is displayed as "奎" when opened on Mac OS. This shows that the byte order of UTF-16 can be confused unless it is explicitly defined. UTF-16 therefore introduces the concepts of big-endian (UTF-16 BE) and little-endian (UTF-16 LE), together with an optional byte order mark. Windows and Linux systems on PCs currently default to UTF-16 LE for UTF-16 encoding. (For details, see UTF-16.)
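
The byte-order mix-up can be reproduced in a few lines of Java by encoding with one byte order and decoding with the other. ByteOrderDemo and its method names are hypothetical; the charsets are the standard UTF_16BE and UTF_16LE constants.

```java
import java.nio.charset.StandardCharsets;

class ByteOrderDemo {
    // Encodes as big-endian UTF-16, then decodes the same bytes as little-endian.
    static String writeBigReadLittle(String s) {
        byte[] bigEndianBytes = s.getBytes(StandardCharsets.UTF_16BE);
        return new String(bigEndianBytes, StandardCharsets.UTF_16LE);
    }

    // Encodes and decodes with the same byte order, so the text survives.
    static String writeLittleReadLittle(String s) {
        byte[] littleEndianBytes = s.getBytes(StandardCharsets.UTF_16LE);
        return new String(littleEndianBytes, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        String original = "\u4E59"; // bytes 4E 59 in big-endian order
        String swapped = writeBigReadLittle(original);
        System.out.printf("original U+%04X, misread as U+%04X%n",
                (int) original.charAt(0), (int) swapped.charAt(0));
    }
}
```

Decoding with the wrong byte order turns U+4E59 into U+594E without any error being reported, which is why the byte order has to be agreed on or marked explicitly.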

In addition, Unicode implementations also include UTF-7, Punycode, CESU-8, SCSU, UTF-32, GB18030, and others. Some of these are used only in certain countries and regions, and some are planned for the future. The implementations in general use today are UTF-16 little-endian (LE), UTF-16 big-endian (BE), and UTF-8. In the Notepad program that ships with Microsoft Windows XP, the "Save As" dialog box offers four encodings. Apart from the non-Unicode ANSI encoding (ASCII on an English system, GB2312 or Big5 on a Chinese system), the other three are "Unicode" (corresponding to UTF-16 LE), "Unicode big endian" (corresponding to UTF-16 BE), and "UTF-8".
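
One way a program can tell these three Unicode formats apart is to look for a byte order mark at the start of the file. The sketch below assumes the file begins with a BOM (EF BB BF for UTF-8, FF FE for UTF-16 LE, FE FF for UTF-16 BE); files without one cannot be identified this way. BomSniffer and detect are hypothetical names.

```java
class BomSniffer {
    // Returns the encoding suggested by a leading byte order mark, or "unknown".
    // UTF-32 byte order marks are deliberately not handled in this sketch.
    static String detect(byte[] data) {
        if (data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFF && (data[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFE && (data[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        // "乙" (U+4E59) preceded by a little-endian byte order mark.
        byte[] sample = {(byte) 0xFF, (byte) 0xFE, 0x59, 0x4E};
        System.out.println(detect(sample));
    }
}
```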

At present, work on the supplementary planes is concentrated mainly on the CJK (Chinese, Japanese, Korean) unified ideographs in the second and third planes. As a result, coordinating Unicode with encodings such as GBK, GB18030, and Big5 for simplified Chinese, traditional Chinese, Japanese, Korean, and Vietnamese characters has become a prominent issue. Since Unicode is ultimately meant to cover all characters, these encodings can in a sense be regarded as de facto implementations that existed before Unicode, much like ASCII and its extension Latin-1. For ASCII and Latin-1 characters, the first byte of the 16-bit Unicode code is all 0 and the second byte is exactly the same as the original code. The correspondence between the East Asian encodings mentioned above and Unicode, however, is much more complex.
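
The simple Latin-1 case can be sketched in Java as follows: decode a single Latin-1 byte and re-encode it as big-endian UTF-16. Latin1Mapping and toUtf16be are hypothetical names for this illustration.

```java
import java.nio.charset.StandardCharsets;

class Latin1Mapping {
    // Decodes one Latin-1 byte and returns its 16-bit Unicode code as two bytes.
    static byte[] toUtf16be(int latin1Byte) {
        String s = new String(new byte[] {(byte) latin1Byte}, StandardCharsets.ISO_8859_1);
        return s.getBytes(StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        for (int b : new int[] {0x41, 0xE9, 0xFF}) {
            byte[] code = toUtf16be(b);
            // The first byte is 00 and the second byte repeats the Latin-1 value.
            System.out.printf("Latin-1 %02X -> UTF-16BE %02X %02X%n",
                    b, code[0] & 0xFF, code[1] & 0xFF);
        }
    }
}
```

No comparable one-line rule exists for GBK, GB18030, or Big5; those conversions rely on mapping tables.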

In everyday use, the Java classes InputStreamReader and OutputStreamWriter support standard UTF-8 for reading and writing strings.
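
A minimal sketch of that usage, writing a string to an in-memory stream as UTF-8 and reading it back. The class Utf8Streams and its methods are made up for the example; in-memory streams stand in for files or sockets.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class Utf8Streams {
    // Encodes the text as UTF-8 through an OutputStreamWriter.
    static byte[] write(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Decodes UTF-8 bytes back into a String through an InputStreamReader.
    static String read(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "A\u4E59";
        byte[] encoded = write(text);
        System.out.println(encoded.length + " bytes, round trip ok: "
                + read(encoded).equals(text));
    }
}
```

Passing the charset explicitly matters: without it, both classes fall back to the platform default encoding, which varies between systems.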


Source: Wikipedia Unicode, link: http://zh.wikipedia.org/wiki/Unicode


Note: ISO-8859-1 is the default encoding in much of Java network programming (for example, servlet containers decode request data as ISO-8859-1 when no encoding is specified). This is why, when garbled text appears in Java Web programming, code sometimes transcodes with

String str = new String(filename.getBytes(), "ISO8859-1");
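
The reason such transcoding works is that ISO-8859-1 maps every byte value to exactly one character, so bytes that pass through it are not lost. The following Java sketch simulates UTF-8 text being wrongly decoded as ISO-8859-1 and then repaired. It is an illustration only: the class and method names are hypothetical, and the UTF-8 source encoding is an assumption (the one-argument getBytes() call in the line above uses the platform default charset instead).

```java
import java.nio.charset.StandardCharsets;

class Latin1Roundtrip {
    // Simulates the problem: UTF-8 bytes decoded as if they were ISO-8859-1.
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Repairs it: recover the raw bytes via ISO-8859-1, then decode them as UTF-8.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String filename = "\u4E2D\u6587.txt";
        String garbled = misdecode(filename);
        System.out.println("garbled length: " + garbled.length());
        System.out.println("repaired equals original: " + repair(garbled).equals(filename));
    }
}
```

The repair only works when the wrong decoding step really was ISO-8859-1; a lossy decoder that replaces unknown bytes would destroy the information needed to recover the text.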


Other articles that help deepen understanding:

1. Wikipedia, UTF-8: http://zh.wikipedia.org/wiki/utf-8

2. Wikipedia, Unicode character plane mapping: http://zh.wikipedia.org/wiki/%E5%9F%BA%E6%9C%AC%E5%A4%9A%E6%96%87%E7%A8%AE%E5%B9%B3%E9%9D%A2#.E5.9F.BA.E6.9C.AC.E5.A4.9A.E6.96.87.E7.A7.8D.E5.B9.B3.E9.9D.A2

3. Unicode character encoding table | Chinese unicode encoding range: 0x4e00 → 0x9fa5 http://467411.blog.163.com/blog/static/3353960920104122221346/

4. Blog Java coding analysis http://www.iteye.com/topic/311583

The following are not Unicode encodings:

5. Wikipedia non-international code ISO8859-1 (ISO/IEC 8859-1: 1998, also known as Latin-1 or "Western European language"): http://zh.wikipedia.org/zh/ISO/IEC_8859-1

6. ANSI Encoding

To let computers support more languages, 2 bytes in the range 0x80 to 0xFF are usually used to represent 1 character. For example, on a Chinese operating system the character "中" is stored as the two bytes [0xD6, 0xD0].

Different countries and regions have developed different standards, resulting in GB2312, Big5, JIS, and other encoding standards. These extended encodings, which use 2 bytes to represent a single character, are collectively called ANSI encodings. On a Simplified Chinese system, ANSI encoding means GB2312; on a Japanese operating system, ANSI encoding means JIS.

Different ANSI encodings are incompatible with one another. When information is exchanged internationally, text in two different languages cannot be stored in the same ANSI-encoded text.
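
The incompatibility is easy to see by decoding the same two bytes under different encodings. This Java sketch assumes the runtime ships the GB2312 charset, which is common but not guaranteed on every JVM, so the code checks for it first. AnsiDecodeDemo and its methods are hypothetical names.

```java
import java.nio.charset.Charset;

class AnsiDecodeDemo {
    // Reports whether this JVM provides the named charset.
    static boolean supported(String charsetName) {
        return Charset.isSupported(charsetName);
    }

    // Decodes the bytes using the named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xD6, (byte) 0xD0};
        // The same bytes mean different things under different encodings.
        for (String name : new String[] {"GB2312", "ISO-8859-1"}) {
            if (supported(name)) {
                String text = decode(bytes, name);
                System.out.println(name + ": " + text.length() + " character(s), first is U+"
                        + String.format("%04X", (int) text.charAt(0)));
            } else {
                System.out.println(name + ": not available on this JVM");
            }
        }
    }
}
```

Under GB2312 the two bytes form one Chinese character, while under ISO-8859-1 they are two unrelated Western European letters; nothing in the bytes themselves says which reading is intended.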
