1. ANSI Code
Both Unicode and ANSI are ways of representing character codes. To let computers support more languages, the 0x80–0xFF byte range is typically used, with two bytes representing one character. For example, on a Chinese operating system the Chinese character '中' is stored as the two bytes [0xD6, 0xD0]. Different countries and regions developed different standards, producing GB2312, BIG5, JIS, and other encoding standards. These encodings, which all use two bytes to represent one extended (e.g. Chinese) character, are collectively called ANSI encodings.
On a Simplified Chinese system, ANSI encoding means GB2312; on a Japanese system, it means JIS. Different ANSI encodings are mutually incompatible, so when information is exchanged internationally, text in two different languages cannot be stored in the same ANSI-encoded document.
ANSI encoding represents an English character with one byte and a Chinese character with two bytes, whereas Unicode uses two bytes whether the character is English or Chinese.
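The byte counts described above can be checked directly. A minimal sketch, assuming Python 3 and its bundled codecs ("gbk", a superset of GB2312, stands in here for the Simplified Chinese ANSI code page; "utf-16-be" stands in for two-byte Unicode):

```python
# ANSI (GBK): English letter -> 1 byte, Chinese character -> 2 bytes.
ansi_zh = "中".encode("gbk")
ansi_en = "A".encode("gbk")

# Unicode (UTF-16): 2 bytes for any BMP character, English or Chinese.
uni_zh = "中".encode("utf-16-be")
uni_en = "A".encode("utf-16-be")

print([hex(b) for b in ansi_zh])                  # ['0xd6', '0xd0']
print(len(ansi_en), len(uni_zh), len(uni_en))     # 1 2 2
```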
Data inside a computer is ultimately stored in binary form. Each bit has two states, 0 and 1, and a group of 8 bits is called a byte, so one byte can represent 256 states, from 00000000 to 11111111.
In the 1960s, the American National Standards Institute (ANSI) standardized the ASCII code (American Standard Code for Information Interchange): it uses 7-bit binary numbers, 128 combinations in all, to represent all uppercase and lowercase letters, the digits 0 through 9, punctuation, and the special control characters used in American English.
Codes 0 through 32 and code 127 (34 in total) are control or communication-specific characters, such as LF (line feed), CR (carriage return), FF (form feed), DEL (delete), and BEL (bell).
Codes 33 through 126 (94 in total) are printable characters: 48 through 57 are the ten Arabic digits 0–9, 65 through 90 are the 26 uppercase English letters, 97 through 122 are the 26 lowercase English letters, and the rest are punctuation marks, arithmetic symbols, and so on.
PS: In storage, an ASCII value occupies one byte (8 bits), and its highest bit (b7) can be used as a parity bit. Parity checking is a method of detecting errors in transmitted codes, and it comes in two forms, odd and even. Odd parity rule: the number of 1 bits in a correct byte must be odd; if it is not, b7 is set to 1 to make it so. Even parity rule: the number of 1 bits in a correct byte must be even; if it is not, b7 is set to 1.
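The two parity rules can be sketched in Python; the helper name `add_parity` is illustrative, not a standard function:

```python
def add_parity(code7: int, odd: bool = True) -> int:
    """Return the 8-bit byte for a 7-bit ASCII value, with b7 as the parity bit.

    Odd parity: the number of 1 bits in the final byte must be odd.
    Even parity: the number of 1 bits in the final byte must be even.
    """
    ones = bin(code7 & 0x7F).count("1")
    if odd:
        set_b7 = (ones % 2 == 0)   # even count of 1s -> set b7 to make it odd
    else:
        set_b7 = (ones % 2 == 1)   # odd count of 1s -> set b7 to make it even
    return (code7 & 0x7F) | (0x80 if set_b7 else 0)

# 'A' = 0x41 = 0b1000001 contains two 1 bits (an even count):
print(hex(add_parity(0x41, odd=True)))    # 0xc1 (b7 set, count becomes odd)
print(hex(add_parity(0x41, odd=False)))   # 0x41 (count is already even)
```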
The lower 7 bits of a byte can represent only 128 different characters, which is enough for English but not for other languages. In French, for example, letters carry diacritical marks that ASCII cannot express. Some countries therefore used the idle highest bit of the byte to incorporate new symbols, allowing up to 256 symbols: the extended ASCII code. So there are now both 7-bit and 8-bit ASCII codes, and extended ASCII uses the 8th bit of each character to provide 128 additional special symbols, foreign letters, and graphics characters. In any case, codes 0–127 represent the same characters everywhere; only 128–255 differ.
PS: A tip for looking up the character for an ASCII value: open a new text document, hold ALT while typing the code value on the numeric keypad (in decimal), then release ALT to see the corresponding character.
But even extending to 256 symbols is not enough: Chinese, for example, has more than 100,000 characters by some counts. Worse, the same byte value means different things in different encodings; for instance, 130 represents one character in the French encoding but the letter Gimel in the Hebrew encoding. Unicode was born to solve this.
Unicode is short for Universal Multiple-Octet Coded Character Set, a character encoding scheme developed by an international organization that can accommodate all of the world's characters and symbols.
Unicode is a character encoding used on computers. It assigns a uniform and unique binary code to each character in each language, to meet the requirements of cross-language, cross-platform text conversion and processing. The Unicode standard always writes code points as hexadecimal digits with the prefix "U+": the letter "A" has code point U+0041 and the euro sign "€" has code point U+20AC. But Unicode is only a character set: it specifies each symbol's binary code, not how that code should be stored.
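In Python 3, where strings are sequences of Unicode code points, the two code points mentioned above can be inspected directly (a small illustrative sketch):

```python
# ord() returns a character's Unicode code point as an integer;
# format it in hex with the "U+" prefix used by the Unicode standard.
for ch in ("A", "€"):
    print(f"{ch!r} is U+{ord(ch):04X}")
# 'A' is U+0041
# '€' is U+20AC
```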
It turns out that Unicode is inefficient for characters that ASCII can already represent, because Unicode occupies much more space than ASCII, and the high byte of 0 for an ASCII character carries no information. To solve this problem, intermediate character formats were created, called Unicode (or UCS) Transformation Formats, UTF. Existing UTF formats include UTF-7, UTF-7.5, UTF-8, UTF-16, and UTF-32.
UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode and a prefix code. It can represent any character in the Unicode standard, and its single-byte encodings coincide with ASCII, so software that originally handled ASCII text can continue to be used with little or no modification. As a result, it has gradually become the preferred encoding for e-mail, web pages, and other applications that store or transmit text.
UTF-8 encodes Unicode using 1 to 4 bytes. The mapping from Unicode code points to UTF-8 is as follows:
000000-00007F║0xxxxxxx
000080-0007FF║110xxxxx 10xxxxxx
000800-00FFFF║1110xxxx 10xxxxxx 10xxxxxx
010000-10FFFF║11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
For the characters between 0x00-0x7f, the UTF-8 encoding is exactly the same as the ASCII encoding;
Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, and other alphabetic characters with diacritical marks require two-byte encoding (Unicode range U+0080 to U+07FF);
Other Basic Multilingual Plane (BMP) characters, which include most commonly used characters, use three-byte encoding;
Characters in the rarely used Unicode supplementary planes use four-byte encoding;
The maximum length of a UTF-8 sequence is 4 bytes. As the table above shows, the 4-byte template has 21 x's and can therefore hold a 21-bit binary number; the maximum Unicode code point, 0x10FFFF, is also exactly 21 bits.
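The byte counts in the list above can be verified with Python 3's built-in UTF-8 codec, one sample character per length class:

```python
# One sample character per UTF-8 length class:
samples = {
    "A": 1,    # U+0041, ASCII
    "é": 2,    # U+00E9, Latin letter with diacritic
    "中": 3,   # U+4E2D, BMP CJK character
    "😀": 4,   # U+1F600, supplementary plane
}
for ch, expected in samples.items():
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s)")
    assert len(encoded) == expected
```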
UTF-8 Parsing algorithm:
If a byte's first bit is 0, the byte is an ASCII code and independently represents one character;
If a byte's first bit is 1 and its second bit is 0, the byte belongs to a non-ASCII character (one represented by multiple bytes) but is not the first byte of that character;
If a byte's first two bits are 1 and its third bit is 0, the byte is the first byte of a non-ASCII character, and that character is represented by two bytes;
If a byte's first three bits are 1 and its fourth bit is 0, the byte is the first byte of a non-ASCII character, and that character is represented by three bytes;
If a byte's first four bits are 1 and its fifth bit is 0, the byte is the first byte of a non-ASCII character, and that character is represented by four bytes.
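The five rules above classify a byte by its leading bits. A minimal sketch (the function name `utf8_char_length` is illustrative):

```python
def utf8_char_length(first_byte: int) -> int:
    """Apply the leading-bit rules above to one byte.

    Returns the total length of the sequence this byte starts,
    or 0 if it is a continuation byte (10xxxxxx).
    """
    if first_byte & 0x80 == 0x00:   # 0xxxxxxx: ASCII, standalone character
        return 1
    if first_byte & 0xC0 == 0x80:   # 10xxxxxx: continuation, not a first byte
        return 0
    if first_byte & 0xE0 == 0xC0:   # 110xxxxx: starts a 2-byte character
        return 2
    if first_byte & 0xF0 == 0xE0:   # 1110xxxx: starts a 3-byte character
        return 3
    return 4                        # 11110xxx: starts a 4-byte character

data = "中".encode("utf-8")         # b'\xe4\xb8\xad'
print(utf8_char_length(data[0]))    # 3
print(utf8_char_length(data[1]))    # 0 (continuation byte)
```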