Making sense of character sets and character encodings

Last Update:2016-06-16 Source: Internet

Author: User

Tags coding standards


1. What is a character set?

Characters: text and symbols, including the scripts of various countries, punctuation marks, graphic symbols, digits, and so on.

Character set: a collection of characters (a collection of text and symbols). Different character sets contain different numbers of characters.

2. What is character encoding?

Character encoding: characters cannot be stored or sent over a network as they are. For a computer to process and store the text and symbols of a character set accurately, each character must be encoded, that is, assigned a particular combination of 0s and 1s that identifies which character of which character set it is. Character encoding is this conversion of text and symbols into numbers a computer can handle, for storage and network transmission.

Simply put, a character encoding is the binary number that corresponds to each character in a character set. Character encoding is the basis of information exchange.

A standard that specifies which characters, symbols, and letters are included is called a character set.

A rule that specifies whether each character is stored in one byte or in multiple bytes is called an encoding.

When countries and regions developed their coding standards, they generally defined the character set and the encoding together. So what we usually call a "character set", such as GB2312 or ASCII, covers both the character set and its character encoding.

Common character sets and character encodings:

ASCII Character Set & encoding:

ASCII (American Standard Code for Information Interchange) is a computer coding system based on the Latin alphabet, mainly used to display modern English. It is currently the most common single-byte encoding system.

ASCII character set: mainly includes control characters (carriage return, backspace, line feed, etc.) and printable characters (uppercase and lowercase English letters, Arabic numerals, and punctuation marks).

ASCII encoding: the rule that converts the ASCII character set into numbers a computer can handle. It uses 7 bits to represent a character, so it can represent 2^7 = 128 characters. To cover more characters commonly used in European languages, ASCII was extended: the extended ASCII character set uses 8 bits per character and can represent 2^8 = 256 characters.
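The arithmetic above can be checked with a short Python sketch. It assumes ISO-8859-1 (`latin-1`) as one example of an 8-bit extension of ASCII; the source does not name a specific extended set, so that choice is an assumption made for illustration.

```python
import string

seven_bit = 2 ** 7   # 128 values available with 7 bits
eight_bit = 2 ** 8   # 256 values available with 8 bits

# Every English letter, digit, and punctuation mark fits in 7 bits.
printable = string.ascii_letters + string.digits + string.punctuation
fits_in_7_bits = all(b < seven_bit for b in printable.encode("ascii"))

# Control characters are part of the ASCII set too.
control_codes = {
    "backspace": ord("\b"),         # 8
    "line feed": ord("\n"),         # 10
    "carriage return": ord("\r"),   # 13
}

# Assumption: latin-1 stands in for "extended ASCII" here.
# The accented letter needs the 8th bit, so its byte value is 128 or higher.
e_acute = "\u00e9".encode("latin-1")   # e with acute accent

print(seven_bit, eight_bit, fits_in_7_bits, control_codes, e_acute[0])
```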

ASCII disadvantage:

The biggest disadvantage of ASCII is that it can display only the 26 basic Latin letters, Arabic numerals, and English punctuation marks, so it is suitable only for modern American English. Its extended version, EASCII, can represent some Western European languages but still cannot handle other languages, which is why Apple computers, for example, have switched to Unicode as their character encoding.
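This limitation is easy to observe in Python, where encoding text that falls outside ASCII raises `UnicodeEncodeError`. The helper name `try_ascii` is hypothetical and exists only for this sketch.

```python
def try_ascii(text):
    """Return the ASCII bytes for text, or None if ASCII cannot represent it."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError:
        return None

english = try_ascii("hello")           # plain English fits
french = try_ascii("caf\u00e9")        # "cafe" with an accented e does not
chinese = try_ascii("\u4e2d\u6587")    # the Chinese word for "Chinese" does not

print(english, french, chinese)
```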

GB2312 and GB12345 Character Set & encoding:

For a long time after the computer was invented, it was used only in a few developed Western countries such as the United States, and the ASCII character set and encoding were enough for their needs. When computers arrived in China, a set of encoding rules had to be designed to convert Chinese characters into numbers a computer can handle, so that Chinese could be displayed.

The experts decided to drop the extended symbols above 127 and stipulated the following: a byte with a value of 127 or less keeps its original meaning, that is, it still represents an English character, but two consecutive bytes greater than 127 together represent one Chinese character. In other words, a Chinese character occupies two bytes. This scheme can represent more than 7,000 characters, including 6,763 Simplified Chinese characters, and it also assigns codes to mathematical symbols, Roman and Greek letters, and Japanese kana.
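The two-byte rule described above can be sketched in Python using its built-in `gb2312` codec. The helper `count_chars` is hypothetical: it is a simplified scanner that applies only the rule as stated here (a byte above 127 starts a two-byte character) and does not validate its input.

```python
def count_chars(data):
    """Count characters in GB2312 bytes: a byte above 127 starts a 2-byte character."""
    count = 0
    i = 0
    while i < len(data):
        i += 2 if data[i] > 127 else 1
        count += 1
    return count

english = "A".encode("gb2312")       # one byte, same value as in ASCII
hanzi = "\u4e2d".encode("gb2312")    # a Chinese character: two bytes, both above 127
kana = "\u3042".encode("gb2312")     # hiragana "a": also two bytes
greek = "\u03b1".encode("gb2312")    # Greek alpha: also two bytes

mixed = "Hi\u4e2d\u6587!".encode("gb2312")   # "Hi", two Chinese characters, "!"

print(list(english), list(hanzi), len(mixed), count_chars(mixed))
```

English text keeps its single-byte ASCII values, so the scanner can tell the two kinds of characters apart by looking at the first byte alone.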

Full-width and half-width?

Full-width and half-width describe how much space characters other than Chinese characters (punctuation, letters, digits, etc.) occupy. On a computer screen, a Chinese character occupies the space of two English characters. The space occupied by one English character is called "half-width", and the space occupied by one Chinese character is called "full-width". Chinese input methods offer two input states, "half-width" and "full-width". English letters, symbols, and digits differ from Chinese characters in this respect: entered in the half-width state they are treated as English characters, and entered in the full-width state they are treated like Chinese characters.

When the input method is switched to the English state, symbols typed onto the screen look like this: ,.!?

When the input method is switched to the Chinese state, symbols typed onto the screen look like this: ，。！？

In the English state, the symbols, letters, and digits entered are half-width; in the Chinese state, the symbols (and, in full-width mode, the letters and digits) are full-width.
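The half-width and full-width distinction can be inspected in Python. This sketch assumes the standard `unicodedata` module and the built-in `gb2312` codec; the full-width characters are written as escape sequences so the difference is visible in the source.

```python
import unicodedata

half = "A1,!?"
# Full-width counterparts of the same five characters.
full = "\uff21\uff11\uff0c\uff01\uff1f"

# East Asian Width property: "Na" means narrow (half-width), "F" means full-width.
widths_half = [unicodedata.east_asian_width(c) for c in half]
widths_full = [unicodedata.east_asian_width(c) for c in full]

# In GB2312 a half-width character takes one byte and a full-width one takes two,
# mirroring the one-position versus two-position layout on screen.
bytes_half = len(half.encode("gb2312"))
bytes_full = len(full.encode("gb2312"))

# NFKC normalization folds these full-width forms back to their half-width equivalents.
normalized = unicodedata.normalize("NFKC", full)

print(widths_half, widths_full, bytes_half, bytes_full, normalized)
```

Normalizing user input this way is one common approach when full-width digits or letters need to compare equal to their half-width forms.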

The great Unicode (Universal Code):

As in China, once computers spread around the world, each country devised a character encoding suited to its own language. This works fine locally, but when text is transmitted over the network to another country, the encodings do not correspond to each other and the text appears garbled.
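The garbling described above can be reproduced in a few lines of Python. In this sketch the sender uses GB2312 and the receiver is assumed to decode with `latin-1`, a Western European encoding picked purely for illustration.

```python
original = "\u4e2d\u6587"            # the Chinese word for "Chinese"
sent = original.encode("gb2312")     # bytes as written by the sender

# The receiver wrongly assumes a Western European encoding.
garbled = sent.decode("latin-1")

# Decoding with the encoding the sender actually used restores the text.
recovered = sent.decode("gb2312")

print(repr(garbled), recovered)
```

The bytes never changed in transit; only the interpretation differed, which is the root cause of garbled text.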

To solve this problem, international organizations developed the Unicode character set, which covers the world's languages and assigns a uniform, unique number to every character in every language, meeting the requirements of cross-language and cross-platform text conversion and processing.

It assigns a number to each letter, symbol, or ideograph, and each number stands for exactly one symbol used in at least one language. The world's characters, digits, and symbols add up to more than 2 bytes can index (2^16 = 65,536), so a fixed-width representation uses 4 bytes per character. A character that is common to several languages is usually encoded with the same number.
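A small Python sketch shows why 2 bytes are not enough. The three sample characters are chosen for illustration; the last one is an emoji whose Unicode number lies beyond 2^16.

```python
chars = ["A", "\u4e2d", "\U0001F600"]   # Latin letter, Chinese character, emoji

code_points = [ord(c) for c in chars]   # the unique number Unicode gives each one
two_byte_limit = 2 ** 16                # 65,536 distinct values fit in 2 bytes

# Characters whose number cannot be indexed with only 2 bytes.
beyond_two_bytes = [hex(cp) for cp in code_points if cp >= two_byte_limit]

# Bits needed to write each number down.
bits_needed = [cp.bit_length() for cp in code_points]

print([hex(cp) for cp in code_points], beyond_two_bytes, bits_needed)
```

The emoji needs 17 bits, more than 2 bytes can hold, so a fixed-width form rounds up to 4 bytes.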

Unicode is constantly expanding, and each new version adds more characters. By version 6, Unicode already contained more than 100,000 characters. Unicode is run by a nonprofit organization, the Unicode Consortium, which leads its subsequent development. The goal is to replace existing character encoding schemes with Unicode encoding schemes and so resolve the incompatibilities between different character sets.

Note: Unicode is a character set, and UTF-32, UTF-16, and UTF-8 are three character encoding schemes for the Unicode character set.
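The difference between the three schemes shows up in how many bytes each one uses for the same character. The following is a minimal Python sketch; the helper name `sizes` is hypothetical, and the little-endian codec variants are used so that no byte order mark is added to the counts.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")

def sizes(ch):
    """Return the number of bytes ch occupies under each encoding scheme."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}

latin = sizes("A")            # {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}
hanzi = sizes("\u4e2d")       # {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
emoji = sizes("\U0001F600")   # {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}

print(latin, hanzi, emoji)
```

UTF-32 is fixed at 4 bytes per character, while UTF-8 and UTF-16 are variable-width and spend fewer bytes on common characters.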

This article is from the blog "the days when those tumultuous left."; be sure to keep this provenance: http://linuxzj.blog.51cto.com/6160158/1789927


