Four new things I learned about UCS-2 and UTF-8

Source: Internet
Author: User

The initial Unicode encoding was a fixed-length, 16-bit (2-byte) representation of each character, which can represent a total of 65,536 characters. Obviously, that is not enough to represent all the characters of the world's languages. The Unicode 4.0 specification takes this into account and defines a set of additional (supplementary) characters, each represented by two 16-bit units, so that up to 1,048,576 additional characters can be defined. Unicode 4.0 currently defines only 45,960 additional characters.
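The figure of 1,048,576 follows from the arithmetic of the two 16-bit units: each unit contributes 10 bits of payload, giving 2^10 × 2^10 combinations. The sketch below shows that split, assuming the standard UTF-16 surrogate ranges (0xD800 to 0xDBFF for the first unit, 0xDC00 to 0xDFFF for the second); the function name `to_surrogate_pair` is made up for illustration.

```python
def to_surrogate_pair(code_point):
    """Split an additional character (U+10000..U+10FFFF) into two 16-bit units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not an additional (supplementary) code point")
    offset = code_point - 0x10000        # 20-bit value, 0 .. 0xFFFFF
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the first unit
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the second unit
    return high, low


# Each unit carries 10 payload bits, so the pair can address 2**10 * 2**10 values.
ADDITIONAL_CAPACITY = (1 << 10) * (1 << 10)

if __name__ == "__main__":
    high, low = to_surrogate_pair(0x20000)
    print(f"U+20000 -> {high:04X} {low:04X}")
    print(f"capacity: {ADDITIONAL_CAPACITY:,}")
```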

Unicode is just an encoding specification. Three Unicode encodings are actually implemented at present: UTF-8, UCS-2 and UTF-16, and text can be converted among the three according to the specification.
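As a rough illustration of converting between these encodings, the sketch below decodes bytes to code points and re-encodes them. It relies on Python's built-in `utf-8` and `utf-16-le` codecs; the helper name `transcode` is hypothetical. Python has no separate UCS-2 codec, so UTF-16 stands in for it here, which is only equivalent for text without additional characters.

```python
def transcode(data, source, target):
    """Convert encoded bytes from one Unicode encoding to another."""
    # Decoding recovers the abstract code points; encoding serializes them again.
    return data.decode(source).encode(target)


if __name__ == "__main__":
    utf8_bytes = "Unicode \u7edf\u4e00\u7801".encode("utf-8")
    utf16_bytes = transcode(utf8_bytes, "utf-8", "utf-16-le")
    round_trip = transcode(utf16_bytes, "utf-16-le", "utf-8")
    print(len(utf8_bytes), len(utf16_bytes), round_trip == utf8_bytes)
```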

UTF-8

UTF-8 is an 8-bit, variable-length Unicode encoding and a strict superset of the ASCII character set, meaning that every ASCII character has exactly the same encoding in UTF-8. In UTF-8, a character may be 1, 2, 3, or 4 bytes long. In general, European alphabetic characters are 1 to 2 bytes long, most Asian characters are 3 bytes long, and additional characters are 4 bytes long.
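The byte lengths described above can be checked with Python's built-in `utf-8` codec. This is a small sketch; the sample characters and the helper name `utf8_length` are chosen only for illustration.

```python
def utf8_length(ch):
    """Return the number of bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


SAMPLES = [
    "A",           # ASCII letter
    "\u00e9",      # Latin small letter e with acute
    "\u4e2d",      # CJK ideograph
    "\U00020000",  # additional (supplementary) CJK ideograph
]

if __name__ == "__main__":
    for ch in SAMPLES:
        print(f"U+{ord(ch):04X}: {utf8_length(ch)} byte(s)")
    # ASCII text is byte-for-byte identical in UTF-8.
    print("hello".encode("ascii") == "hello".encode("utf-8"))
```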

UTF-8 is universally supported on UNIX platforms, and HTML and most browsers also support it, while Windows and Java historically support UCS-2.

Key Benefits of UTF-8:

    • Less storage space is required for European alphabetic characters.
    • Easy migration from ASCII character set to UTF-8.

UCS-2

UCS-2 is a fixed-length, 16-bit Unicode encoding. Each character is 2 bytes. UCS-2 supports only Unicode 3.0, so additional characters are not supported.

Advantages of UCS-2:

    • Asian characters need less storage space than in UTF-8, because each character is 2 bytes.
    • Characters are processed faster than in UTF-8, because the encoding is fixed-length.
    • It is better supported on Windows and in Java.
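The storage trade-off between the two encodings can be seen by counting bytes. In this sketch the `utf-16-le` codec stands in for UCS-2, an assumption that holds only for characters up to U+FFFF, so the hypothetical helper `storage_bytes` rejects anything beyond that range.

```python
def storage_bytes(text):
    """Compare the byte cost of text in UTF-8 and in UCS-2."""
    if any(ord(ch) > 0xFFFF for ch in text):
        raise ValueError("UCS-2 cannot represent additional characters")
    return {
        "utf-8": len(text.encode("utf-8")),
        # For characters up to U+FFFF, UTF-16 and UCS-2 produce the same bytes.
        "ucs-2": len(text.encode("utf-16-le")),
    }


if __name__ == "__main__":
    print(storage_bytes("\u6570\u636e\u5e93"))  # three CJK ideographs
    print(storage_bytes("database"))            # eight ASCII letters
```

Asian text comes out smaller in UCS-2, while ASCII text doubles in size, which matches the benefits listed for each encoding.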

UTF-16

UTF-16 is also a 16-bit encoding. In effect, UTF-16 is UCS-2 plus support for additional characters, that is, a UCS-2 that conforms to the Unicode 4.0 specification. So UTF-16 is a strict superset of UCS-2.

A character in UTF-16 is either 2 bytes or 4 bytes long. UTF-16 is mainly used in Windows 2000 and later versions.

The advantages of UTF-16 over UTF-8 are the same as those of UCS-2.
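To make the 2-byte and 4-byte widths of UTF-16 concrete, the sketch below counts 16-bit units with Python's built-in `utf-16-le` codec. The helper name `utf16_units` is made up for illustration.

```python
def utf16_units(text):
    """Count the 16-bit units UTF-16 needs for the given text."""
    return len(text.encode("utf-16-le")) // 2


if __name__ == "__main__":
    bmp_char = "\u4e2d"             # fits in one 16-bit unit
    additional_char = "\U00020000"  # needs two 16-bit units
    for ch in (bmp_char, additional_char):
        units = utf16_units(ch)
        print(f"U+{ord(ch):04X}: {units} unit(s), {units * 2} bytes")
```

Code that assumes one 16-bit unit per character, as UCS-2 does, will miscount any text that contains additional characters.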

Oracle has supported Unicode since version 7.0. Its main Unicode character sets are:

AL32UTF8

A UTF-8 encoded character set that supports the latest Unicode 4.0 standard. Characters are 1 to 3 bytes long, and additional characters are 4 bytes long.

UTF8

A UTF-8 encoded character set that supports Unicode 3.0. Because additional characters were introduced in Unicode 3.1, UTF8 does not support them. However, Unicode 3.0 already reserved the encoding space for additional characters, so it is still possible to insert them into a UTF8 database. The database then splits each such character into two parts, which together take up 6 bytes. Therefore, if you need to support additional characters, it is recommended that you switch the database character set to the newer AL32UTF8.

UTF8 can be used as the database character set and also as the national character set.
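The 6-byte figure comes from encoding each of the two 16-bit halves as its own 3-byte sequence, a layout often described as CESU-8. The sketch below reproduces that byte layout in Python for comparison with standard UTF-8. It is an illustration under that assumption, not Oracle's actual implementation, and the function name `split_encode` is hypothetical.

```python
def split_encode(ch):
    """Encode one character by UTF-8-encoding each of its UTF-16 units separately."""
    utf16 = ch.encode("utf-16-be")
    out = b""
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # "surrogatepass" lets Python emit the 3-byte form of a lone surrogate,
        # which strict UTF-8 would reject.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return out


if __name__ == "__main__":
    ch = "\U00020000"
    print("split form:", split_encode(ch).hex(" "), len(split_encode(ch)), "bytes")
    print("UTF-8 form:", ch.encode("utf-8").hex(" "), len(ch.encode("utf-8")), "bytes")
```

For characters up to U+FFFF the two forms are identical; only additional characters differ, taking 6 bytes in the split form and 4 bytes in standard UTF-8.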

UTFE

UTFE is a Unicode character set for EBCDIC platforms, playing the same role that UTF8 plays on ASCII platforms. The difference is that in UTFE each character may take 1 to 4 bytes, while an additional character requires two 4-byte sequences, or 8 bytes.

AL16UTF16

AL16UTF16 is a UTF-16 encoded Unicode character set that is used in Oracle for the national character set.

AL24UTFFSS

This character set supports only the Unicode 1.1 specification. It was used in Oracle 7.2 through 8i and is now obsolete.

Reference: http://www.ningoo.net/html/2007/unicode_encode_in_oracle.html

---------------------------------------------------------------------------

Summary:

1. Windows uses UCS-2, but surprisingly UCS-2 supports only Unicode 3.0. How does Windows handle additional characters? (Question: what exactly are additional characters?)
2. I did not expect that UTF8 could handle additional characters, albeit with limitations. Has UTF-8 also evolved so that it can represent additional characters directly?
3. I don't understand how UTF-16 can use a 4-byte representation. Isn't the name misleading? Isn't there a dedicated UTF-32 for that?
4. Unicode 4.0 uses 4 bytes (in fact, two 16-bit units) so that 1,048,576 additional characters can be defined. In other words, Unicode is not limited to 65,536 characters, but only 45,960 additional characters are currently defined. (And how many characters are there besides the additional ones? How many in total?) Why use a 4-byte representation when so little of the space is used?
