Unicode, UTF-8, and ASCII

Source: Internet
Author: User
I never really understood how the Unicode character set and the UTF-8 encoding relate to the ASCII character set. After reading this article, I understand a little more.
Unicode
(Computing) The most common standard character set after ASCII. ASCII is still the foundation of computer operation, but it has too few characters to keep pace with the development of computer applications. Unicode is far larger. The first 128 Unicode code points are identical to ASCII, and the first 256 match the ISO 8859-1 (Latin-1) table.
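The overlap between ASCII and the low end of Unicode can be checked directly. The following is a minimal sketch using only Python's built-in `ascii` and `latin-1` codecs; the variable names are illustrative and not taken from the original article.

```python
# Decode every possible ASCII byte and compare with the Unicode code point.
ascii_text = bytes(range(128)).decode("ascii")
ascii_code_points = [ord(ch) for ch in ascii_text]

# Latin-1 (ISO 8859-1) extends the same idea to all 256 byte values.
latin1_text = bytes(range(256)).decode("latin-1")
latin1_code_points = [ord(ch) for ch in latin1_text]

# Each byte value decodes to the code point with the same number.
print(ascii_code_points == list(range(128)))
print(latin1_code_points == list(range(256)))
```

Both comparisons hold because Unicode deliberately reused those tables for its lowest code points.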
In April 1984, the ISO/IEC JTC1/SC2/WG2 working group was established to unify the encoding of the scripts and symbols of different countries. In 1991, several U.S. multinational companies founded the Unicode Consortium, which reached an agreement with WG2 in October 1991 to use the same coded character set. Unicode at this stage uses a 16-bit encoding system, and its character set is the same as the BMP (Basic Multilingual Plane) of ISO/IEC 10646. It passed DIS (Draft International Standard) in June 1992. Version 2.0, the version described here, was released in 1996. It contains 6,811 symbols, 20,902 Chinese characters, 11,172 Korean Hangul syllables, a 6,400-position private use area, and 20,249 reserved positions, for a total of 65,534.
With the rapid development of the Internet, the demand for data exchange is growing, and incompatible coding systems are increasingly an obstacle to information exchange. Documents that mix several languages are also becoming more common, and independent code pages handle them poorly. Unicode came into being to address this.
The word Unicode has two meanings. First, it refers to the international encoding standard ISO/IEC 10646, also called the large character set, an important international standard published by ISO in 1993 whose purpose is to unify the encoding of all the world's languages. Second, it is the name of a consortium of large U.S. companies such as HP, Microsoft, IBM, and Apple, whose purpose is to promote a uniform multilingual encoding.
The most significant difference between Unicode and the popular code pages is that Unicode, in its original 16-bit form, encodes every character in two bytes, ASCII characters included. A code page instead uses the value range of a byte to decide whether it is an ASCII character or the high byte of a Chinese character. If the data is corrupted and some bytes are lost, the Chinese characters that follow can be misread. Because Unicode uses two bytes for every character, its most obvious advantage is that it simplifies the processing of Chinese characters.
Unicode describes its code space in planes. Each plane is divided into 256 rows of 256 columns, giving 65,536 positions per plane. The plane number occupies the bytes above the two bytes that select the row and column.
The first plane of Unicode is called the Basic Multilingual Plane, or BMP for short. Because every BMP character can be represented in just two bytes, it is the most widely used.
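The plane, row, and column layout described above can be sketched with a few bit operations. `split_code_point` is a hypothetical helper written for this illustration, assuming a code point is viewed as a plane number followed by one row byte and one column byte.

```python
def split_code_point(cp: int) -> tuple:
    """Split a code point into (plane, row, column)."""
    plane = cp >> 16          # everything above the low two bytes
    row = (cp >> 8) & 0xFF    # high byte of the two-byte position
    column = cp & 0xFF        # low byte of the two-byte position
    return plane, row, column

# U+4E2D (the Chinese character "中") lies in plane 0, the BMP.
print(split_code_point(0x4E2D))   # (0, 78, 45), i.e. row 0x4E, column 0x2D
# U+1F600 lies outside the BMP, in plane 1.
print(split_code_point(0x1F600))  # (1, 246, 0)
```

A character is in the BMP exactly when the plane value is 0, which is why two bytes are enough for it.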
The initial objective of Unicode was to use a 16-bit encoding to provide mappings for over 65,000 characters. But this is not enough: it cannot cover all historical scripts, and it does not solve transmission problems (implementation headaches), especially for network-based applications. Unicode therefore defines three encoding methods, making use of some reserved code values: UTF-8, UTF-16, and UTF-32. As the name suggests, UTF-8 encodes characters as sequences of 8-bit units, representing a character in one or several bytes. The biggest benefit of this method is that UTF-8 retains the ASCII encoding as a subset. For example, in both UTF-8 and ASCII, "A" is encoded as 0x41. UTF-16 and UTF-32 are the 16-bit and 32-bit Unicode encoding methods, respectively. Given its original purpose, "Unicode" is often used loosely to mean UTF-16.
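The three encoding methods can be compared side by side with Python's built-in codecs. This is a minimal sketch; `encoded_forms` is a hypothetical helper, and the big-endian codec variants are chosen here only so that no byte order mark is added to the output.

```python
def encoded_forms(ch: str) -> dict:
    """Return the hex bytes of one character in UTF-8, UTF-16, and UTF-32."""
    return {
        "utf-8": ch.encode("utf-8").hex(),
        "utf-16": ch.encode("utf-16-be").hex(),
        "utf-32": ch.encode("utf-32-be").hex(),
    }

# "A" keeps its ASCII value 0x41 in UTF-8 and is zero-padded in the others.
print(encoded_forms("A"))   # {'utf-8': '41', 'utf-16': '0041', 'utf-32': '00000041'}
# A Chinese character takes three bytes in UTF-8 and two in UTF-16.
print(encoded_forms("中"))  # {'utf-8': 'e4b8ad', 'utf-16': '4e2d', 'utf-32': '00004e2d'}
```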
For years, computers have generally used the American Standard Code for Information Interchange (ASCII) to represent characters. These characters can be letters, digits, punctuation marks, and control codes. This encoding is fine for English text, but to represent other languages (Arabic, Chinese, Japanese, Wei Wen, Havin...) it must be extended. In May 1987, Joe Becker and Lee Collins at the Xerox Palo Alto Research Center, together with Apple's Mark Davis, set out to study a character encoding suitable for multilingual text processing. The encoding was quickly supported by many large companies, which all sent representatives to the Unicode research group, and the work progressed rapidly. The group's members include the world's leading system and software manufacturers, so Unicode soon became the de facto industry standard.
Unicode-based systems can use about 65,000 different characters, which was intended to be enough to cover the letters of all the world's languages, plus thousands of symbols.
The general scripts area contains 19 scripts, including ASCII, Latin-1, Greek, Cyrillic, Armenian, Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Tibetan, and Georgian, among others. Unicode also includes a large number of Chinese, Japanese, and Korean characters.
Unicode is a fixed-length, two-byte, multilingual character set encoding. It builds on the existing standards of the relevant countries and regions, including GB 2312, CNS 11643, JIS 0208, and KSC 5601. Unicode can represent mixed-language text, and it also stays consistent with the earlier ISO 10646.

UTF-8 = Unicode Transformation Format, 8-bit
It is a Unicode transfer format that converts Unicode text into a stream of bytes for transmission.

UTF-8 stream conversion program:
input: unsigned integer c, the code point of the character to be encoded (a Unicode value)
Unicode is an encoding table: for example, it assigns a code to each Chinese character. It is similar to GB 2312-1980, GB 18030, and so on, but the character repertoire is different.
A Unicode code point is converted to a UTF-8 sequence of one, two, three, or four bytes, depending on its value. Because the code points of English characters are below 0x80, each takes a single byte in UTF-8, which is more compact to send than two-byte Unicode.
UTF-8 is a "re-encoding" method for transferring Unicode.
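The conversion program outlined above, taking an unsigned integer c and producing one to four bytes, could look like the following. This is a sketch under the standard UTF-8 bit layout, not the original author's program; the name `utf8_encode` is hypothetical, and the rejection of surrogate values and of values above 0x10FFFF follows the current UTF-8 definition rather than anything stated in the article.

```python
def utf8_encode(c: int) -> bytes:
    """Encode a single Unicode code point c as a UTF-8 byte sequence."""
    if c < 0 or c > 0x10FFFF or 0xD800 <= c <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {c:#x}")
    if c < 0x80:
        # 0xxxxxxx: one byte, identical to ASCII
        return bytes([c])
    if c < 0x800:
        # 110xxxxx 10xxxxxx: two bytes
        return bytes([0xC0 | (c >> 6), 0x80 | (c & 0x3F)])
    if c < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx: three bytes
        return bytes([0xE0 | (c >> 12),
                      0x80 | ((c >> 6) & 0x3F),
                      0x80 | (c & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: four bytes
    return bytes([0xF0 | (c >> 18),
                  0x80 | ((c >> 12) & 0x3F),
                  0x80 | ((c >> 6) & 0x3F),
                  0x80 | (c & 0x3F)])

print(utf8_encode(0x41).hex())    # 41
print(utf8_encode(0x4E2D).hex())  # e4b8ad
```

The results can be cross-checked against Python's own encoder, for example `chr(0x4E2D).encode("utf-8")`.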
