Four new knowledge points for UCS-2 and UTF8

Last Update:2015-06-20 Source: Internet

Author: User

The initial Unicode encoding was fixed-length: each character was represented by 16 bits, or 2 bytes, for a total of 65,536 characters. That is clearly not enough to cover all the characters of every language. The Unicode 4.0 specification takes this into account and defines a set of additional (supplementary) characters, each represented by two 16-bit units, so that up to 1,048,576 additional characters can be defined. Unicode 4.0 currently defines only 45,960 of them.
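The figures above follow from simple arithmetic. The sketch below assumes the standard UTF-16 surrogate ranges (0xD800 to 0xDBFF and 0xDC00 to 0xDFFF); the variable names are illustrative only.

```python
# Each half of a surrogate pair is drawn from a block of 1,024 values,
# so each half carries 10 bits of payload.
HIGH_SURROGATES = range(0xD800, 0xDC00)  # first 16-bit unit of a pair
LOW_SURROGATES = range(0xDC00, 0xE000)   # second 16-bit unit of a pair

# A single 16-bit unit can take 2**16 distinct values.
bmp_size = 2 ** 16

# Every (high, low) combination names one additional character.
supplementary_size = len(HIGH_SURROGATES) * len(LOW_SURROGATES)

print(bmp_size)            # 65536
print(supplementary_size)  # 1048576
```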

Unicode is just an encoding specification. Three Unicode encodings are actually implemented at present: UTF-8, UCS-2, and UTF-16. Text can be converted among the three according to the specification.
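As a rough illustration of such a conversion, the sketch below moves the same text between UTF-8 and UTF-16 with Python's built-in codecs. Python ships no separate UCS-2 codec, so this assumes that for text without additional characters the UTF-16 bytes are the same as the UCS-2 bytes. The sample string is arbitrary.

```python
# 'A', 'é' and '中': none of them is an additional character.
text = "A\u00e9\u4e2d"

# Encode to UTF-8, then convert those bytes to big-endian UTF-16.
utf8_bytes = text.encode("utf-8")
utf16_bytes = utf8_bytes.decode("utf-8").encode("utf-16-be")

# Convert back again; nothing is lost in either direction.
round_trip = utf16_bytes.decode("utf-16-be").encode("utf-8")

print(utf8_bytes.hex(" "))   # 41 c3 a9 e4 b8 ad
print(utf16_bytes.hex(" "))  # 00 41 00 e9 4e 2d
print(round_trip == utf8_bytes)  # True
```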

UTF-8

UTF-8 is an 8-bit, variable-length Unicode encoding and a strict superset of ASCII, meaning that every ASCII character has exactly the same encoding in UTF-8. A UTF-8 character may be 1, 2, 3, or 4 bytes long. In general, European alphabetic characters take 1 to 2 bytes, most Asian characters take 3 bytes, and additional characters take 4 bytes.
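The byte lengths described above can be checked directly. This is a minimal sketch; the four sample characters are arbitrary picks, one for each length.

```python
# One sample character per UTF-8 length class (arbitrary choices).
samples = {
    "A": "ASCII letter",
    "\u00e9": "European letter (é)",
    "\u4e2d": "Asian character (中)",
    "\U0001F600": "additional character (U+1F600)",
}

# Number of bytes each character occupies once encoded as UTF-8.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, label in samples.items():
    print(f"U+{ord(ch):04X} {label}: {lengths[ch]} byte(s)")

# ASCII text is byte-for-byte identical in UTF-8.
print("Hello".encode("utf-8") == "Hello".encode("ascii"))  # True
```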

UTF-8 is universally supported on UNIX platforms, and HTML and most browsers support it as well, while Windows and Java support UCS-2.

Key Benefits of UTF-8:

Less storage space is required for European alphabetic characters.
Easy migration from ASCII character set to UTF-8.

UCS-2

UCS-2 is a fixed-length, 16-bit Unicode encoding. Each character is 2 bytes. UCS-2 supports only Unicode 3.0, so additional characters are not supported.

Advantages of UCS-2:

Asian characters need less storage space than in UTF-8, because each character is 2 bytes.
Characters are processed faster than in UTF-8, because the encoding is fixed-length.
Support on Windows and in Java is better.
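The fixed-length property and the storage comparison can be sketched as follows. Python has no built-in UCS-2 codec, so `encode_ucs2` is a hypothetical helper written for this illustration; it packs one big-endian 16-bit unit per character and rejects anything that does not fit.

```python
import struct


def encode_ucs2(text: str) -> bytes:
    """Hypothetical helper: big-endian UCS-2, exactly 2 bytes per character."""
    units = []
    for ch in text:
        code_point = ord(ch)
        if code_point > 0xFFFF:
            # Additional characters do not fit in a single 16-bit unit.
            raise ValueError(f"U+{code_point:04X} cannot be represented in UCS-2")
        units.append(code_point)
    return struct.pack(f">{len(units)}H", *units)


asian_text = "\u4e2d\u6587"  # two Asian characters
print(len(encode_ucs2(asian_text)))       # 4 bytes in UCS-2
print(len(asian_text.encode("utf-8")))    # 6 bytes in UTF-8

# Fixed length means the n-th character always starts at byte offset 2 * n.
second = encode_ucs2(asian_text)[2:4]
print(second.hex())                        # 6587
```

Passing an additional character such as U+1F600 to this helper raises `ValueError`, which mirrors the limitation described above.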

UTF-16

UTF-16 is also a 16-bit encoding. In effect, UTF-16 is UCS-2 plus support for additional characters, that is, UCS-2 brought into line with the Unicode 4.0 specification. UTF-16 is therefore a strict superset of UCS-2.

A character in UTF-16 is either 2 bytes or 4 bytes. UTF-16 is mainly used in Windows 2000 and later versions.
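The 4-byte case is a pair of 16-bit units. The sketch below shows the standard surrogate-pair calculation; `to_surrogate_pair` is an illustrative name, and U+1F600 is an arbitrary additional character.

```python
def to_surrogate_pair(code_point):
    """Split an additional character (above U+FFFF) into two 16-bit units."""
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00

# Python's own UTF-16 codec produces the same two units (4 bytes).
print("\U0001F600".encode("utf-16-be").hex(" "))  # d8 3d de 00

# An ordinary character still takes a single 2-byte unit.
print(len("A".encode("utf-16-be")))  # 2
```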

The advantages of UTF-16 over UTF-8 are the same as those of UCS-2.

Oracle has supported Unicode since version 7.0. The main Unicode character sets in Oracle are:

AL32UTF8

A UTF-8 encoded character set that supports the latest Unicode 4.0 standard. Ordinary characters are 1 to 3 bytes long, and additional characters are 4 bytes long.

UTF8

A UTF-8 encoded character set that supports Unicode 3.0. Because additional characters were introduced in Unicode 3.1, UTF8 does not support them. However, Unicode 3.0 already reserved the encoding space for additional characters, so it is possible to insert them into a UTF8 database. The database then splits each such character into two parts, which together take up 6 bytes. If you need to support additional characters, it is therefore recommended that you switch the database character set to the newer AL32UTF8.
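The 6-byte figure can be reproduced by encoding each half of the surrogate pair as its own 3-byte UTF-8 sequence. This is only a sketch of that idea and is not Oracle's code; the function name is hypothetical, and it relies on Python's `surrogatepass` error handler to let a lone surrogate through the UTF-8 codec.

```python
def encode_halves_separately(ch: str) -> bytes:
    """Hypothetical helper: UTF-8-encode each 16-bit UTF-16 unit on its own."""
    utf16 = ch.encode("utf-16-be")
    out = b""
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # 'surrogatepass' allows a lone surrogate to be encoded as 3 bytes.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return out


additional = "\U0001F600"  # arbitrary additional character

split_form = encode_halves_separately(additional)
standard_form = additional.encode("utf-8")

print(split_form.hex(" "), len(split_form))        # ed a0 bd ed b8 80 6
print(standard_form.hex(" "), len(standard_form))  # f0 9f 98 80 4
```

For characters that are not additional characters the helper returns ordinary UTF-8, so the two forms differ only for additional characters.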

UTF8 can be used as the database character set and also as the national character set.

UTFE

UTFE is a Unicode character set for EBCDIC platforms, playing the role that UTF8 plays on ASCII platforms. The difference is that in UTFE a character may take up to 4 bytes, while an additional character is stored as two 4-byte sequences, or 8 bytes.

AL16UTF16

AL16UTF16 is a UTF-16 encoded Unicode character set that is used in Oracle for the national character set.

AL24UTFFSS

This character set supports only the Unicode 1.1 specification. It was used in Oracle 7.2 through 8i and is now obsolete.

Reference: http://www.ningoo.net/html/2007/unicode_encode_in_oracle.html

---------------------------------------------------------------------------

Summary:

1. Windows uses UCS-2, but surprisingly UCS-2 supports only Unicode 3.0. What does Windows do with additional characters? (Question: what exactly are additional characters?)
2. I did not expect UTF8 to be able to handle additional characters at all, even with limits. Has UTF-8 also evolved so that it can represent additional characters directly?
3. I don't understand how a UTF-16 character can take 4 bytes. Isn't the name misleading? Isn't there a separate UTF-32 for that?
4. Unicode 4.0 uses 4 bytes (in fact, two 16-bit units) for additional characters, so that 1,048,576 of them can be defined. In other words, Unicode is not limited to 65,536 characters, yet only 45,960 additional characters are currently defined. (How many characters are there in total, not counting the additional ones?) Why use a 4-byte representation when so little of the space is used?

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.