What Every Programmer Must Know: Character Sets and Character Encodings

Source: Internet
Author: User
Tags: printable characters
1. Basic Knowledge

Information stored in a computer is represented as binary numbers. The English letters, Chinese characters, and other characters we see on the screen are the result of converting those binary numbers. Put simply, the rule by which characters are stored in a computer, for example which binary number represents 'A', is called "encoding". Conversely, parsing the binary numbers stored in the computer and displaying them as characters is called "decoding", much like encryption and decryption in cryptography. If the wrong decoding rule is used, 'A' may be parsed as 'B', or the text may come out garbled.

Character set: the collection of all abstract characters supported by a system. "Character" is a general term for all kinds of text and symbols, including the characters of national scripts, punctuation marks, graphical symbols, and digits.

Character encoding: a set of rules that pairs a set of natural-language characters (such as an alphabet or syllabary) with a set of other things (such as numbers or electrical pulses). In other words, it establishes a correspondence between a set of symbols and a number system, which is a basic technique of information processing. People usually express information with a collection of symbols (usually text). Computer-based information processing systems store and process information by combining different states of their components (hardware). A combination of component states can represent a number in a number system. Character encoding therefore means converting symbols into numbers that a computer can accept, which are called digital codes.

2. Common Character Sets and Character Encodings

Common character sets include the ASCII character set, the GB2312 character set, the Big5 character set, the GB18030 character set, and the Unicode character set. To process the characters of these character sets accurately, a computer must encode them so that it can recognize and store the various kinds of text.

3. Unicode

  • Unicode deserves a separate discussion.

    As happened in China, when computers spread to countries around the world, each country designed and implemented an encoding scheme suited to its local language and script, similar to GB2312/GBK/GB18030/Big5. Used locally, these schemes cause no problems. Once text travels over a network, however, the schemes are incompatible with one another, and mutual access produces garbled text.

    To solve this problem, a great idea gave rise to Unicode. The Unicode encoding system is designed to express any character in any language. It represents each letter, symbol, or ideograph as a 4-byte number. Each number represents a unique symbol used in at least one language. (Not all the numbers are used, but more than 65535 of them are, so two bytes are not enough.) Characters shared by several languages are generally given the same number, unless there is a good etymological reason not to. Either way, each character corresponds to exactly one number, and each number corresponds to exactly one character, so there is no ambiguity. There are no "modes" to keep track of. U+0041 always represents 'A', even if the language does not contain the character 'A'.
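The one-number-per-character mapping can be observed directly in Python, whose built-in `ord` and `chr` convert between a character and its Unicode number. This is only an illustrative sketch; the variable names are arbitrary.

```python
# ord() gives the Unicode number of a character; chr() goes the other way.
code_point = ord("A")
label = f"U+{code_point:04X}"   # conventional "U+XXXX" notation

han = chr(0x4E2D)               # the character assigned to U+4E2D

print(label)   # U+0041
print(han)     # 中
```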

    In computer science, Unicode (also translated as Uniform Code, Wanguo code, Single Code, or Standard Wanguo code) is an industry standard that enables computers to represent the text of dozens of the world's writing systems. Unicode is developed on the basis of the Universal Character Set standard and is also published in book form [1]. Unicode keeps expanding, and each new version adds more characters. As of the sixth edition, Unicode contains more than 100,000 characters (the 100,000th character was accepted into the standard in 2005), a set of code charts for visual reference, a set of encoding methods and standard character encodings, and a set of enumerations of character properties such as superscript and subscript. The Unicode Consortium, a non-profit organization, leads the ongoing development of Unicode. Its goal is to replace existing character encoding schemes with Unicode, especially because the existing schemes have limited space and are incompatible with one another in multilingual environments.

    (It can be understood that Unicode is a character set, and UTF-32/UTF-16/UTF-8 are three character encoding schemes.)
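To illustrate the distinction between the character set and its encoding schemes, the sketch below encodes one character with each of the three schemes. The codec names are Python's; the big-endian variants are chosen here only so that no byte order mark is added.

```python
ch = "中"  # one character, code point U+4E2D

# The same code point serialized under three different encoding schemes.
encoded = {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}

for name, data in encoded.items():
    print(name, data.hex(" "), len(data))
# utf-8 e4 b8 ad 3
# utf-16-be 4e 2d 2
# utf-32-be 00 00 4e 2d 4
```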

    3.1. UCS & Unicode

    The Universal Character Set (UCS) is the standard character set defined by ISO 10646 (or ISO/IEC 10646), a standard developed by ISO. Historically, two independent organizations tried to create a single character set: the International Organization for Standardization (ISO) and the Unicode Consortium, which was formed by multilingual software manufacturers. The former developed the ISO/IEC 10646 project and the latter developed the Unicode project, so two different standards were developed at first.

    Around 1991, participants in both projects realized that the world does not need two incompatible character sets. They began to merge their work and cooperate on a single code table. Since Unicode 2.0, Unicode has used the same character repertoire and code points as ISO 10646-1. ISO has also promised that ISO 10646 will not assign any UCS-4 code beyond U+10FFFF, so that the two remain consistent. Both projects still exist and publish their standards independently. However, the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep their code tables compatible and to coordinate closely on any future extension. In its published charts, Unicode generally uses the most common typeface for the script concerned, whereas ISO 10646 uses the Century typeface wherever possible.



    The scheme described above, in which each letter, symbol, or ideograph is expressed as a 4-byte number and each number represents a unique symbol used in at least one language, is called UTF-32. UTF-32, also called UCS-4, is a Unicode character encoding that uses 4 bytes for every character. In terms of space, it is very inefficient.

    This method does have advantages. The most important is that the Nth character of a string can be located in constant time, because the Nth character starts at byte 4 × N. Even though every code point occupies a fixed number of bytes, UTF-32 is not as widely used as the other Unicode encodings.
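The constant-time lookup can be sketched as follows. The helper name `nth_char_utf32` is hypothetical, and the sketch assumes big-endian UTF-32 data without a byte order mark.

```python
def nth_char_utf32(data: bytes, n: int) -> str:
    """Return the n-th character (0-based) of BOM-less big-endian UTF-32 data."""
    start = 4 * n                      # every character occupies exactly 4 bytes
    return data[start:start + 4].decode("utf-32-be")

data = "a中😀z".encode("utf-32-be")

print(len(data))                # 16 bytes for 4 characters
print(nth_char_utf32(data, 2))  # 😀
```

No scanning is needed: the offset is computed by one multiplication regardless of how long the string is.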


    Although there are many Unicode characters, most people never actually use anything beyond the first 65535. There is therefore another Unicode encoding, called UTF-16 (because 16 bits = 2 bytes). UTF-16 encodes the characters in the range 0 to 65535 as 2 bytes each, and uses some odd tricks if you really need to express the rarely used "astral plane" Unicode characters beyond that range. The most obvious advantage of UTF-16 is that it is twice as space-efficient as UTF-32, because each character needs only 2 bytes of storage (except those beyond 65535) instead of the 4 bytes of UTF-32. In addition, if we assume that a string contains no astral plane characters, we can still find the Nth character in constant time, which is a good assumption right up until the moment it is not. The encoding method is:

  • If the code point U is less than 0x10000, that is, within 0 to 65535 in decimal, it is encoded as a single 16-bit unit (two bytes);
  • If the code point U is 0x10000 or greater, then since the Unicode range ends at 0x10FFFF, there are 0x100000 code points between 0x10000 and 0x10FFFF, so 20 bits are needed to identify them. Let U' = U - 0x10000, a value between 0 and 0xFFFFF. The high 10 bits of U' are combined with 0xD800 by a bitwise OR to form the high 16-bit unit, and the low 10 bits are combined with 0xDC00 by a bitwise OR to form the low 16-bit unit. Together these four bytes make up the encoding of U.
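The two rules above can be sketched in a few lines. The function name `utf16_code_units` is hypothetical, and the sketch deliberately skips validation (for example, it does not reject values above 0x10FFFF or code points in the reserved 0xD800 to 0xDFFF range).

```python
def utf16_code_units(cp: int) -> list:
    """Split a code point into one or two 16-bit UTF-16 code units."""
    if cp < 0x10000:
        return [cp]                    # fits in a single 16-bit unit
    u = cp - 0x10000                   # U', a 20-bit value
    high = 0xD800 | (u >> 10)          # high 10 bits of U'
    low = 0xDC00 | (u & 0x3FF)         # low 10 bits of U'
    return [high, low]

print([hex(x) for x in utf16_code_units(0x4E2D)])   # ['0x4e2d']
print([hex(x) for x in utf16_code_units(0x1F600)])  # ['0xd83d', '0xde00']
```

The result can be checked against Python's own codec by serializing each unit as two big-endian bytes and comparing with `chr(cp).encode("utf-16-be")`.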

    UTF-32 and UTF-16 also have some less obvious disadvantages. Different computer systems store bytes in different orders. This means that the character U+4E2D may be stored in UTF-16 as 4E 2D or as 2D 4E, depending on whether the system is big-endian or little-endian. (For UTF-32 there are even more possible byte arrangements.) As long as a document never leaves your computer, it is safe, because different programs on the same computer use the same byte order. But when we need to transfer the document between systems, perhaps over the World Wide Web, we need a way to indicate how our bytes are stored. Otherwise, the computer receiving the document cannot know whether the two bytes 4E 2D mean U+4E2D or U+2D4E.

    To solve this problem, the multi-byte Unicode encodings define a "byte order mark" (BOM), a special non-printable character that you can place at the beginning of a document to indicate the byte order you are using. For UTF-16, the byte order mark is U+FEFF. If you receive a UTF-16 document that starts with the bytes FF FE, you know the byte order is one way (little-endian); if it starts with FE FF, you know the byte order is reversed (big-endian).
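A short sketch of the byte order mark in practice, using the BOM constants from Python's standard `codecs` module. The sample character is the U+4E2D example from the text; the variable names are arbitrary.

```python
import codecs

text = "中"  # U+4E2D

# The same text serialized in both byte orders, each prefixed with its BOM.
little = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
big = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

print(little.hex(" "))  # ff fe 2d 4e
print(big.hex(" "))     # fe ff 4e 2d

# The generic "utf-16" codec reads the BOM, picks the byte order, and drops the BOM.
decoded_little = little.decode("utf-16")
decoded_big = big.decode("utf-16")

# Without a BOM, guessing the wrong order silently yields a different character.
wrong = text.encode("utf-16-be").decode("utf-16-le")
print(f"U+{ord(wrong):04X}")  # U+2D4E
```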


    UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode and also a prefix code. It can represent any character in the Unicode standard, and its single-byte encodings remain compatible with ASCII, so software originally written to process ASCII characters can continue to be used with little or no modification. As a result, it has gradually become the preferred encoding for storing or transmitting text in e-mail, web pages, and other applications. The Internet Engineering Task Force (IETF) requires all Internet protocols to support UTF-8.

    UTF-8 uses one to four bytes to encode each character:

  1. The 128 US-ASCII characters need only one byte (Unicode range U+0000 to U+007F).
  2. Latin letters with diacritics, and Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, and Thaana letters need two bytes (Unicode range U+0080 to U+07FF).
  3. The other characters in the Basic Multilingual Plane (BMP), which contains most commonly used characters, need three bytes.
  4. The rarely used characters in the other Unicode supplementary planes need four bytes.
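The four byte-length classes above correspond to a fixed bit layout, which can be sketched as follows. The function name `utf8_encode` is hypothetical, and the sketch is simplified: it does not reject surrogate code points or values above 0x10FFFF, which a real encoder must.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 (simplified: no range validation)."""
    if cp < 0x80:                      # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),   # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

for ch in "Aé中😀":
    print(ch, utf8_encode(ord(ch)).hex(" "))
# A 41
# é c3 a9
# 中 e4 b8 ad
# 😀 f0 9f 98 80
```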

    UTF-8 is very efficient for the frequently used ASCII characters. It is no worse than UTF-16 for extended Latin characters, and it is better than UTF-32 for Chinese characters. Also (you will have to trust me on this, because I am not going to show you the math), because of the nature of the bit operations involved, UTF-8 has no byte order problem. A UTF-8-encoded document is the same byte stream on every computer.

    In general, the number of code points in a Unicode string cannot tell you how much space is needed to display it, or where the cursor should be placed in a text buffer after the string is displayed; combining characters, variable-width fonts, non-printable characters, and right-to-left text all contribute to this. So although the relationship between the number of characters and the number of code points is more complex in a UTF-8 string than in UTF-32, in practice it rarely makes a difference.


  • UTF-8 is a superset of ASCII. Because a pure ASCII string is also a valid UTF-8 string, existing ASCII text needs no conversion. Software designed for traditional extended ASCII character sets can often be used with UTF-8 with little or no modification.
  • Sorting UTF-8 with standard byte-oriented sorting routines produces the same result as sorting by Unicode code point. (This is of limited use, though, because the resulting order is unlikely to be an acceptable text order in any particular language or culture.)
  • UTF-8 and UTF-16 are both standard encodings for Extensible Markup Language (XML) documents. All other encodings must be specified explicitly or through a text declaration.
  • Any byte-oriented string search algorithm can be used on UTF-8 data (as long as the input consists only of complete UTF-8 characters). Care is needed, however, with regular expressions or other constructs that contain character counts.
  • A UTF-8 string can be reliably identified by a simple algorithm. That is, the probability that a string in any other encoding happens to be valid UTF-8 is very low and decreases as the string gets longer. For example, the byte values C0, C1, and F5 through FF never appear. For better reliability, regular expressions can be used to account for invalid overlong forms and surrogate values (see the W3C FAQ on regular expressions that validate UTF-8 strings in multilingual forms).
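Two of the properties in the list above, identification and sort order, can be checked with a small sketch. The helper name `looks_like_utf8` is hypothetical; it relies on Python's strict UTF-8 decoder, which rejects malformed sequences including overlong forms.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes form a valid UTF-8 sequence."""
    try:
        data.decode("utf-8")           # strict decoding raises on invalid bytes
    except UnicodeDecodeError:
        return False
    return True

print(looks_like_utf8("中文".encode("utf-8")))    # True
print(looks_like_utf8("中文".encode("gb2312")))   # False
print(looks_like_utf8(b"\xc0\x80"))               # False (C0 never appears)

# Byte-wise sorting of UTF-8 matches sorting by code point
# (Python compares str values code point by code point).
words = ["z", "é", "中", "😀", "a"]
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
by_code_point = sorted(words)
print(by_bytes == by_code_point)                  # True
```

A short non-UTF-8 input can still pass this check by coincidence (pure ASCII always does), which is why the text notes that reliability grows with string length.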


    Because each character can take a different number of bytes, finding the Nth character of a string is an O(N) operation; that is, the longer the string, the longer it takes to locate a specific character. Bit manipulation is also needed to encode characters into bytes and to decode bytes back into characters.
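The linear scan can be sketched as follows. The function name `nth_char_utf8` is hypothetical, and the sketch assumes its input is already valid UTF-8. It relies on the fact that continuation bytes always have the bit pattern 10xxxxxx, so any other byte starts a new character.

```python
def nth_char_utf8(data: bytes, n: int) -> str:
    """Return the n-th character (0-based) of valid UTF-8 data by scanning."""
    if n < 0:
        raise IndexError("character index out of range")
    index = -1
    start = None
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:        # not a continuation byte: a character starts here
            index += 1
            if index == n:
                start = pos            # found the start of the wanted character
            elif index == n + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("character index out of range")
    return data[start:].decode("utf-8")  # wanted character was the last one

data = "a中😀z".encode("utf-8")
print(len(data))               # 9 bytes for 4 characters
print(nth_char_utf8(data, 2))  # 😀
```

Unlike the fixed-width case, the loop has to look at every byte before the wanted character, so the cost grows with the position being looked up.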

4. Accept-Charset/Accept-Encoding/Accept-Language/Content-Type/Content-Encoding/Content-Language

    In HTTP, the message headers related to character sets and character encodings are Accept-Charset and Content-Type. It is also important to distinguish among Accept-Charset, Accept-Encoding, Accept-Language, Content-Type, Content-Encoding, and Content-Language:

    Accept-Charset: the character sets the browser declares it can accept. These are the character sets and character encodings described earlier in this article, such as GB2312 and UTF-8 (a charset usually implies the corresponding character encoding scheme);

    Accept-Encoding: the content encodings the browser declares it can accept. It usually indicates whether compression is supported and which compression methods (gzip, deflate) are supported. (Note: this does not refer to character encoding);

    Accept-Language: the languages the browser declares it can accept. The difference between a language and a character set: Chinese is a language, and Chinese has multiple character sets, such as Big5, GB2312, and GBK;

    Content-Type: the web server tells the browser the type and character set of the object in the response. For example: Content-Type: text/html; charset=gb2312

    Content-Encoding: the compression method (gzip, deflate) that the web server used to compress the object in the response. For example: Content-Encoding: gzip

    Content-Language: the web server tells the browser the language of the object in the response.
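As a rough sketch of how a client might use the charset from Content-Type to decode a response body, the example below parses the header value with Python's standard `email.message.Message`. The helper name `charset_from_content_type` and the fallback to UTF-8 when no charset is given are assumptions made for illustration, not rules stated in this article.

```python
from email.message import Message

def charset_from_content_type(value: str, default: str = "utf-8") -> str:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() returns the charset in lower case, or None if absent.
    return msg.get_content_charset() or default   # default is an assumed fallback

header = "text/html; charset=gb2312"
body = "中文".encode("gb2312")          # bytes as they would arrive on the wire

charset = charset_from_content_type(header)
print(charset)                # gb2312
print(body.decode(charset))   # 中文
```

Compression named by Content-Encoding is a separate step: a compressed body would have to be decompressed first, and only then decoded with the charset.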
