Character Sets and Character Encodings (Charset & Encoding)


You have probably had the experience of opening a web page only to see a pile of gibberish such as "бїяазъся" or "????????". And do you remember the HTTP message header fields Accept-Charset, Accept-Encoding, Accept-Language, Content-Encoding and Content-Language? That is what we are going to discuss next.


    1. Basic knowledge
    2. Common character sets and character encodings
        2.1. ASCII character set & encoding
        2.2. GBxxxx character sets & encodings
        2.3. BIG5 character set & encoding
    3. The great invention of Unicode
        3.1. UCS & Unicode
        3.2. UTF-32
        3.3. UTF-16
        3.4. UTF-8
    4. Accept-Charset/Accept-Encoding/Accept-Language/Content-Type/Content-Encoding/Content-Language
    References & Further Reading

    1. Basic knowledge

    Information stored in a computer is represented as binary numbers, and the characters we see on screen are the result of converting those numbers. In plain terms, the rule by which a character such as 'a' is stored in the computer is called "encoding"; conversely, turning the binary numbers stored in the computer back into displayed characters is called "decoding", much like encryption and decryption in cryptography. If the wrong decoding rule is used, 'a' may be parsed as 'b', or into gibberish.

    Character set (charset): the collection of all abstract characters supported by a system. "Character" is the general name for all kinds of letters and symbols, including national scripts, punctuation marks, graphic symbols, digits and so on.

    Character encoding: a set of rules that pairs the characters of a natural language (such as an alphabet or a syllabary) with some other set of objects, such as numbers or electrical pulses. It is a basic technique of information processing that establishes a correspondence between a set of symbols and a system of numbers. People usually express information with collections of symbols (typically text), while computer-based information processing systems store and process information as combinations of hardware component states. Those combinations of states can represent numbers in a numeral system, so character encoding converts characters into numbers that a computer can accept, called digital codes.

    2. Common character set and character encoding

    Common character sets include the ASCII character set, the GB2312 character set, the BIG5 character set, the GB18030 character set and the Unicode character set. To handle the characters of these sets accurately, a computer needs an encoding so that it can recognize and store all kinds of text.

    2.1. ASCII Character Set & encoding

    ASCII (American Standard Code for Information Interchange) is a computer coding system based on the Latin alphabet. It is mainly used to display modern English, and its extended version EASCII can barely cover other Western European languages. It is the most widespread single-byte coding system today (though there are signs of it being overtaken by Unicode) and is equivalent to the international standard ISO/IEC 646.

    ASCII character set: mainly consists of control characters (carriage return, backspace, line feed, etc.) and printable characters (upper- and lower-case English letters, Arabic numerals, and Western punctuation).

    ASCII encoding: the rule that converts the ASCII character set into numbers a computer can accept. It uses 7 bits to represent one character, for 128 characters in total. Since a 7-bit code can support only 128 characters, ASCII was extended to represent more characters common in European languages: the extended ASCII character set uses 8 bits per character, for 256 characters in total. The mapping from the ASCII character set to numeric codes is shown in the following tables:

    Figure 1 ASCII encoding table

    Figure 2 Extended ASCII encoding table
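    The mapping in the tables above can be explored with Python's built-ins (a minimal sketch; `ord`/`chr` work on Unicode code points, which coincide with ASCII for the first 128 characters):

```python
# ord() gives the numeric code of a character; chr() is the inverse.
# For characters 0-127 the Unicode code point equals the ASCII code.
print(ord('A'))        # 65
print(chr(97))         # 'a'

# A 7-bit code has 2**7 = 128 possible values; extended ASCII has 2**8 = 256.
print(2 ** 7, 2 ** 8)  # 128 256

# Encoding as ASCII fails for any character outside the 128-value range:
try:
    'café'.encode('ascii')
except UnicodeEncodeError:
    print('not representable in ASCII')
```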

    ASCII's biggest drawback is that it can cover only the 26 basic Latin letters, Arabic numerals and English punctuation, so it can display only modern American English (and when handling loanwords such as naïve, café or élite, all the accents have to be stripped, even though doing so violates their spelling). EASCII solves part of the display problem for Western European languages, but it is still powerless for many other languages. Apple computers, for example, have since abandoned ASCII in favour of Unicode.

    2.2. Gbxxxx Character Set & encoding

    For a long time after the computer was invented, it was used only in the United States and some developed Western countries, and ASCII met users' needs well. But to display Chinese, a set of encoding rules had to be designed to convert Chinese characters into numbers acceptable to the computer.

    Chinese experts discarded the odd symbols above code 127 (i.e. EASCII) and made the following rule: a byte below 128 keeps its original meaning, but two bytes both greater than 127, taken together, represent one Chinese character. The leading byte (called the high byte) runs from 0xA1 to 0xF7, and the following byte (the low byte) from 0xA1 to 0xFE, which allows about 7,000 simplified Chinese characters to be combined. These codes also incorporate mathematical symbols, the Roman and Greek alphabets, and Japanese kana; even the digits, punctuation and letters already present in ASCII were all given new two-byte codes, the so-called "full-width" characters, while the original characters below 128 are called "half-width" characters.

    The encoding rule above is GB2312. GB2312, or GB2312-80, is the Chinese national standard simplified Chinese character set, with the full name "Code of Chinese Graphic Character Set for Information Interchange, Primary Set", also known as GB0. It was issued by the Standardization Administration of China and came into effect on May 1, 1981. GB2312 is used in mainland China, and also in Singapore and elsewhere. Almost all Chinese systems and internationalized software in mainland China support GB2312. Its appearance basically satisfied the needs of computerized Chinese character processing: the characters it includes cover 99.75% of usage frequency in mainland China. But GB2312 cannot handle rare characters that appear in personal names, classical Chinese and other areas, which led to the later GBK and GB18030 character sets. The following figure shows the beginning of the GB2312 code table (only the beginning is listed because the table is very large; the full GB2312 simplified Chinese code table can be consulted separately):

    Figure 3 GB2312 the beginning of the encoded table
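    The two-byte rule described above can be checked with Python's `gb2312` codec (a sketch; the byte values shown are from the standard code table):

```python
# '中' ("middle") encodes to two bytes in GB2312.
data = '中'.encode('gb2312')
print(data)                      # b'\xd6\xd0'

high, low = data
# The high byte falls in 0xA1-0xF7, the low byte in 0xA1-0xFE:
assert 0xA1 <= high <= 0xF7
assert 0xA1 <= low <= 0xFE

# ASCII characters keep their single-byte codes ("half-width"), while
# their full-width counterparts take two bytes:
print('A'.encode('gb2312'))              # b'A'
print(len('Ａ'.encode('gb2312')))        # 2 (full-width A)
```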

    Since GB 2312-80 contains only 6,763 Chinese characters, many characters are not included: simplified characters introduced after GB 2312-80, some characters used in personal names (such as the "Rong" in former Chinese premier Zhu Rongji's name), traditional characters used in Taiwan and Hong Kong, and Japanese and Korean characters. So Microsoft used the unused code space of GB 2312-80 to develop the GBK encoding, which includes all the characters of GB 13000.1-93. According to Microsoft, GBK is an extension of GB 2312-80, i.e. an extension of code page CP936 (which before this was identical to GB 2312-80), first implemented in the simplified Chinese edition of Windows 95. Although GBK includes all the characters of GB 13000.1-93, the encodings are not the same. GBK itself is not a national standard; it was published jointly by the standardization authorities under the State Bureau of Technical Supervision and the Ministry of Electronics Industry as a "technical guidance document". The original GB13000 was never adopted by industry; the subsequent national standard GB18030 is technically compatible with GBK rather than with GB13000.

    GB 18030, in full GB 18030-2005 "Information Technology - Chinese Coded Character Set", is the latest coded character set standard of the People's Republic of China, a revision of GB 18030-2000 "Information Technology - Chinese Coded Character Set for Information Interchange - Extension for the Basic Set". It is fully compatible with GB 2312-1980, basically compatible with GBK, supports all the unified CJK characters of GB 13000 and Unicode, and contains 70,244 Chinese characters in total. GB 18030 has the following main characteristics:

    - Like UTF-8, it is a multi-byte encoding: each character may consist of 1, 2 or 4 bytes.
    - The coding space is large: up to about 1.61 million characters can be defined.
    - It supports the scripts of China's ethnic minorities without resorting to the user-defined (private-use) area.
    - The Chinese characters included cover traditional Chinese characters as well as Japanese and Korean characters.
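    Python's `gb18030` codec illustrates the 1/2/4-byte structure described above (a sketch):

```python
# GB 18030 is variable length: 1, 2 or 4 bytes per character.
print(len('A'.encode('gb18030')))    # 1  (ASCII range)
print(len('中'.encode('gb18030')))   # 2  (same bytes as GB2312/GBK)
print(len('𠀀'.encode('gb18030')))   # 4  (U+20000, outside the BMP)

# Because GB 18030 covers all of Unicode, every Unicode string
# round-trips through it:
s = 'naïve café 中文 😀'
assert s.encode('gb18030').decode('gb18030') == s
```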

    Figure 4 GB18030 Encoding overall structure

    The first edition of this standard was drafted by the Electronics Standardization Institute of the Ministry of Information Industry and issued by the State Bureau of Quality and Technical Supervision on March 17, 2000. The current edition was issued by the General Administration of Quality Supervision, Inspection and Quarantine and the Standardization Administration of China on November 8, 2005, and took effect on May 1, 2006. The standard is mandatory for all software products sold and supported in China.

    2.3. BIG5 Character Set & encoding

    Big5, also known as the Big Five code, is the most widely used character set standard for computerized Chinese in the traditional Chinese community, containing 13,060 Chinese characters in total. Chinese encodings divide into internal codes and interchange codes; Big5 is an internal code, while well-known Chinese interchange codes include CCCII and CNS11643. Although Big5 is popular in Taiwan, Hong Kong, Macao and other regions that use traditional Chinese, for a long time it was not a local national standard but only an industry standard. The character sets of the Yitian Chinese system, Windows and other major systems are based on Big5, but vendors each added their own extensions and user-defined areas, giving rise to many different variants. In 2003, Big5 was included in an appendix of CNS11643, the Chinese Standard Interchange Code, gaining a more official status; this latest version is called Big5-2003.

    Big5 is a double-byte character set using a dual-octet storage scheme: each character occupies two bytes. The first byte is called the "high byte" and the second the "low byte". The high byte uses 0x81-0xFE, and the low byte uses 0x40-0x7E and 0xA1-0xFE. Within Big5, the code space is partitioned as follows:


    - 0x8140-0xA0FE: reserved for user-defined characters (private-use area)
    - 0xA140-0xA3BF: punctuation, Greek letters and special symbols, including the nine measurement characters 兙兛兞兝兡兣嗧瓩糎 at 0xA259-0xA261
    - 0xA3C0-0xA3FE: reserved; this area is not opened for user-defined characters
    - 0xA440-0xC67E: frequently used characters, sorted first by stroke count and then by radical
    - 0xC6A1-0xC8FE: reserved for user-defined characters (private-use area)
    - 0xC940-0xF9D5: less frequently used characters, likewise sorted by stroke count and then by radical
    - 0xF9D6-0xFEFE: reserved for user-defined characters (private-use area)
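    The high/low byte structure can be checked with Python's `big5` codec (a sketch):

```python
data = '中'.encode('big5')
print(data)                      # b'\xa4\xa4'

high, low = data
# High byte 0x81-0xFE; low byte 0x40-0x7E or 0xA1-0xFE:
assert 0x81 <= high <= 0xFE
assert 0x40 <= low <= 0x7E or 0xA1 <= low <= 0xFE

# Decoding Big5 bytes with the wrong codec produces mojibake:
print(data.decode('latin-1'))    # '¤¤', garbage rather than '中'
```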


    3. The great creation of Unicode

    -- Unicode deserves to be discussed on its own

    As in China, when computers spread to other countries, encoding schemes like GB2312/GBK/GB18030/BIG5 were designed and implemented to suit the local languages and scripts. Each such scheme works fine locally, but once texts appear on the network, mutual access turns into gibberish because the encodings are incompatible.

    To solve this problem, a great invention appeared: Unicode. The Unicode encoding system was designed to express any character of any language. It uses a 4-byte number to express each letter, symbol or ideograph; each number represents a unique character used in at least one language. (Not every number is assigned, but more than 65,535 are, so 2-byte numbers are not enough.) Characters shared by several languages are usually encoded with the same number, unless there is a sound etymological reason not to. In any case, each character corresponds to exactly one number, and each number to exactly one character: there is no ambiguity, and no need to keep track of a "mode". U+0041 always stands for 'A', even in a language that has no 'A'.

    In computer science, Unicode (also rendered as "unified code", "universal code" or "single code") is an industry standard that enables computers to represent dozens of the world's writing systems. Unicode is developed on the basis of the Universal Character Set standard and is also published in book form [1]. It is continually being enlarged, with each new version adding more characters. As of the sixth edition, Unicode contains more than 100,000 characters (in 2005 its 100,000th character was adopted and accepted into the standard), a set of code charts for visual reference, a set of encoding methods and standard character encodings, and an enumeration of character properties such as superscripts and subscripts. The Unicode Consortium, a non-profit organization, leads the further development of Unicode, with the goal of replacing existing character encoding schemes, which offer only limited space and are incompatible with one another in multilingual environments.

    (This can be understood as: Unicode is a character set, while UTF-32, UTF-16 and UTF-8 are three character encoding schemes for it.)

    3.1. UCS & Unicode

    The Universal Character Set (UCS) is the standard character set defined by ISO 10646 (also known as ISO/IEC 10646). Historically, two independent organizations attempted to create a single unified character set: the International Organization for Standardization (ISO), and the Unicode Consortium, an alliance of multilingual software vendors. The former developed the ISO/IEC 10646 project and the latter the Unicode project, so at first two different standards were being developed.

    Around 1991, participants in the two projects realized that the world did not need two incompatible character sets, so they began merging their work and cooperating on a single code table. Starting with Unicode 2.0, Unicode has used the same repertoire and code points as ISO 10646-1, and ISO 10646 does not assign values to UCS-4 code positions beyond U+10FFFF, to keep the two consistent. Both projects still exist and publish their standards independently, but the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep their code tables compatible and to coordinate closely on any future extensions. Unicode generally uses the most common glyph for each code point at the time of publication, while ISO 10646-1 uses the Century typeface wherever possible.


    3.2. UTF-32

    The scheme mentioned above, which expresses each letter, symbol or ideograph with a 4-byte number so that each number uniquely represents a character used in at least one language, is called UTF-32. UTF-32, also called UCS-4, is a protocol for encoding Unicode characters that uses exactly 4 bytes per character. In terms of space, it is very inefficient.

    The method has its advantages, above all that the nth character of a string can be located in constant time, because the nth character starts at the (4×n)th byte. But while a fixed number of bytes per code point looks convenient, UTF-32 is not as widely used as the other Unicode encodings.
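    A sketch with Python's `utf-32-be` codec, showing the fixed four-byte width and constant-time indexing:

```python
s = 'A中😀'
data = s.encode('utf-32-be')     # big-endian, no byte order mark

# Every character occupies exactly 4 bytes, regardless of the script:
assert len(data) == 4 * len(s)
print(data.hex(' ', 4))          # 00000041 00004e2d 0001f600

# The nth character starts at byte offset 4*n, so random access is O(1):
n = 1
print(data[4 * n : 4 * n + 4].decode('utf-32-be'))   # 中
```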


    3.3. UTF-16

    Although Unicode has many characters, in practice most people never use more than the first 65,535. Hence another Unicode encoding, UTF-16 (because 16 bits = 2 bytes). UTF-16 encodes the characters 0-65535 as 2 bytes; to express the rarely used Unicode characters beyond that range, in the so-called "astral planes", a somewhat tricky technique is needed. UTF-16's most obvious advantage is that it is twice as space-efficient as UTF-32, since every character needs only 2 bytes instead of 4 (astral-plane characters excepted). And if we assume a string contains no astral-plane characters, we can still find the nth character in constant time; that is a convenient assumption right up until the moment it fails. The encoding method is:

    If the code point U is less than 0x10000 (decimal 0 to 65,535), it is represented directly with two bytes.
    If the code point U is 0x10000 or greater: since Unicode code points run up to 0x10FFFF, there are 0x100000 code points from 0x10000 through 0x10FFFF, so 20 bits are needed to number them. Let U' = U - 0x10000, a value in 0x00000-0xFFFFF. Its top 10 bits are added to 0xD800 to form the leading (high) surrogate, and its bottom 10 bits are added to 0xDC00 to form the trailing (low) surrogate; together the two 16-bit units, 4 bytes in all, encode U.
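    The surrogate-pair rule can be sketched directly (this implements the standard UTF-16 algorithm just described):

```python
def utf16_surrogates(cp: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    u = cp - 0x10000                 # 20-bit value, 0x00000-0xFFFFF
    high = 0xD800 + (u >> 10)        # top 10 bits -> high surrogate
    low = 0xDC00 + (u & 0x3FF)       # bottom 10 bits -> low surrogate
    return high, low

high, low = utf16_surrogates(0x1F600)        # U+1F600, the 😀 emoji
print(hex(high), hex(low))                   # 0xd83d 0xde00

# Cross-check against Python's utf-16-be codec:
assert '😀'.encode('utf-16-be') == bytes(
    [high >> 8, high & 0xFF, low >> 8, low & 0xFF])
```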

    The UTF-32 and UTF-16 encodings have another, less obvious drawback. Different computer systems store bytes in different orders. This means that the character U+4E2D may be saved as 4E 2D or as 2D 4E under UTF-16, depending on whether the system is big-endian or little-endian. (For UTF-32 there are even more possible byte orderings.) As long as a document never leaves your computer you are safe: different programs on the same machine use the same byte order. But when the document is transferred between systems, perhaps over the World Wide Web, we need some way of indicating which order the bytes are in; otherwise the receiving computer cannot know whether the two bytes 4E 2D mean U+4E2D or U+2D4E.

    To solve this, the multi-byte Unicode encodings define a "byte order mark" (BOM), a special non-printing character placed at the beginning of a document to indicate the byte order in use. For UTF-16 the byte order mark is U+FEFF. If a UTF-16 encoded document begins with the bytes FF FE, the byte order is little-endian; if it begins with FE FF, the byte order is big-endian.
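    A sketch of the byte order mark using Python's codecs module:

```python
import codecs

# The BOM is the code point U+FEFF serialized in each byte order:
print(codecs.BOM_UTF16_LE)           # b'\xff\xfe'
print(codecs.BOM_UTF16_BE)           # b'\xfe\xff'

# U+4E2D ('中') has its bytes swapped depending on endianness:
print('中'.encode('utf-16-be'))      # b'N-'  (0x4E 0x2D)
print('中'.encode('utf-16-le'))      # b'-N'  (0x2D 0x4E)

# The plain 'utf-16' codec prepends a BOM so the receiver can tell:
data = '中'.encode('utf-16')
assert data[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
```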


    3.4. UTF-8

    UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode and a prefix code. It can represent any character in the Unicode standard, and the single-byte part of its encoding remains compatible with ASCII, so software originally written to handle ASCII can often be used with little or no modification. It has therefore gradually become the preferred encoding for e-mail, web pages and other applications that store or transmit text, and the Internet Engineering Task Force (IETF) requires all Internet protocols to support UTF-8.

    UTF-8 encodes each character in one to four bytes:

        - The 128 US-ASCII characters need only one byte (Unicode range U+0000 to U+007F).
        - Latin letters with diacritics and the Greek, Cyrillic, Armenian, Hebrew, Arabic and Syriac scripts need two bytes (Unicode range U+0080 to U+07FF).
        - The remaining characters of the Basic Multilingual Plane (BMP), which contains virtually all commonly used characters, need three bytes.
        - The rarely used characters of the other (supplementary) Unicode planes need four bytes.
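    The four length classes above, sketched in Python:

```python
samples = {
    'A':  1,   # US-ASCII, U+0041
    'é':  2,   # Latin with diacritic, U+00E9
    '中': 3,   # BMP CJK character, U+4E2D
    '😀': 4,   # supplementary plane, U+1F600
}
for ch, expected in samples.items():
    encoded = ch.encode('utf-8')
    print(f'U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}')
    assert len(encoded) == expected

# The single-byte range is bit-for-bit identical to ASCII:
assert 'A'.encode('utf-8') == 'A'.encode('ascii')
```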

        UTF-8 is very efficient for the frequently used ASCII characters, no worse than UTF-16 for the extended Latin range, and better than UTF-32 for Chinese characters. Moreover (take this on trust, since the bit-level details are omitted here), by the nature of its bit-level design the byte-order problem does not exist in UTF-8: a document encoded in UTF-8 is the same byte stream on every computer.

        In general, the number of code points in a Unicode string determines neither the display width a string needs nor where the cursor should be placed in the text buffer after it is shown; combining characters, double-width fonts, non-printing characters and right-to-left text all play a part. So although counting the characters of a UTF-8 string takes more work than counting the code units of a UTF-32 string, the situations where the difference actually matters are rare in practice.


    UTF-8 is a superset of ASCII: a pure ASCII string is also a valid UTF-8 string, so existing ASCII text needs no conversion, and software designed for the traditional extended-ASCII character sets can often be used with UTF-8 with little or no modification. Sorting UTF-8 with a standard byte-oriented sort routine produces the same result as sorting by Unicode code point. (This is of limited usefulness, since neither order is likely to be an acceptable collation in any particular language or culture.) UTF-8 and UTF-16 are the standard encodings for Extensible Markup Language documents; all other encodings must be specified by an explicit declaration. Any byte-oriented string search algorithm can be used on UTF-8 data (as long as the input consists only of complete UTF-8 characters), but care is needed with regular expressions or other constructs that count characters.
    A UTF-8 string can be identified reliably by a simple algorithm: the probability that a string in any other encoding happens to be valid UTF-8 is low, and decreases as the string grows longer. For example, the byte values 0xC0, 0xC1 and 0xF5 through 0xFF can never appear in well-formed UTF-8. For better reliability, a regular expression can be used to reject over-long and surrogate-range sequences (see the regular expression for validating UTF-8 strings in the W3C FAQ on multilingual forms).
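    The detection idea can be sketched with a strict decode attempt (Python's decoder already enforces well-formedness, so no regular expression is needed here):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_utf8('中文'.encode('utf-8')))   # True
# 0xC0, 0xC1 and 0xF5-0xFF can never occur in well-formed UTF-8:
print(looks_like_utf8(b'\xc0\xaf'))              # False (over-long sequence)
print(looks_like_utf8(b'\xff\xfe'))              # False (UTF-16 BOM bytes)
```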


    Because characters use varying numbers of bytes, finding the nth character in a UTF-8 string is an O(n) operation: the longer the string, the longer it takes to locate a particular character. Some bit manipulation is also needed to encode characters into bytes and decode bytes back into characters.


    4. Accept-Charset/Accept-Encoding/Accept-Language/Content-Type/Content-Encoding/Content-Language

    In HTTP, the message headers related to character sets and character encodings are mainly Accept-Charset, Accept-Encoding, Accept-Language, Content-Type, Content-Encoding and Content-Language:

    Accept-Charset: the browser declares which character sets it can receive; these are the character sets and encodings described earlier in this article, such as GB2312 or UTF-8 (usually "charset" here covers the corresponding character encoding scheme as well).

    Accept-Encoding: the browser declares which content encodings it can receive; this usually specifies compression: whether compression is supported and which methods (gzip, deflate). Note that this is not about character encoding.

    Accept-Language: the browser declares which languages it can receive. The difference between a language and a character set: Chinese is a language, and there are several character sets for Chinese, such as BIG5, GB2312 and GBK.

    Content-Type: the web server tells the browser the media type and character set of the object in its response, for example: Content-Type: text/html; charset=gb2312

    Content-Encoding: the web server tells the browser which compression method (gzip, deflate) it applied to the object in the response, for example: Content-Encoding: gzip

    Content-Language: the web server tells the browser the language of the object in its response.
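    A sketch of how a client might honor the charset parameter of Content-Type, using the standard library's email.message parser, which understands the same "type; param=value" header syntax (the header value and body here are illustrative, not from a real server):

```python
from email.message import Message

# Pretend we received this response header and body from a web server:
header_value = 'text/html; charset=gb2312'
body = '<html>你好</html>'.encode('gb2312')

# email.message parses MIME-style "type; param=value" headers:
msg = Message()
msg['Content-Type'] = header_value
charset = msg.get_content_charset() or 'utf-8'   # fall back if absent
print(charset)                                   # gb2312

print(body.decode(charset))                      # <html>你好</html>

# Decoding with the wrong charset is exactly the mojibake from the intro:
print(body.decode('latin-1'))
```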

    References & Further Reading
        - Baidu Encyclopedia. Character Set. 2010-12-28.
        - Wikipedia. Character encoding. 2011-1-5.
        - Wikipedia. ASCII. 2011-4-5.
        - Wikipedia. GB2312. 2011-3-17.
        - Wikipedia. GB18030. 2010-3-10.
        - Wikipedia. GBK. 2011-3-7.
        - Wikipedia. Unicode. 2011-4-30.
        - Laruence. Character encodings explained (fundamentals). 2009-8-22.
        - Jan Hunt. Character Sets and Encoding for Web Designers - UCS/Unicode.

    Author: Wu Qin
    This article is published under the Creative Commons Attribution 2.5 China Mainland license. You are welcome to reprint or adapt it, including for commercial purposes, provided you retain the attribution to Wu Qin (including links).
