ASCII and Unicode

Source: Internet
Author: User
Tags: control characters, ftp protocol

What is ASCII

Computers originally represented everything in memory, including machine code, as binary 0s and 1s. How to use the bits in memory to represent text troubled people for a long time; after all, the main information that humans display and read is text, not raw 0101 patterns. The invention of the ASCII code later solved part of this problem. The ASCII code is a scheme for representing text in digital form.

ASCII stands for American Standard Code for Information Interchange. It has been adopted as an international standard by the International Organization for Standardization (ISO), where it is known as ISO 646. It applies to the Latin alphabet and comes in two forms, a 7-bit code and an 8-bit code. In a computer's storage, one ASCII value occupies one byte (8 bits).

The 7-bit ASCII code uses a seven-bit binary number and can represent 128 characters. When such a code is stored in a byte, the highest bit (b7) can be used as a parity bit. Parity checking is a method for detecting errors that occur while a code is being transmitted, and it comes in two forms, odd parity and even parity. Odd parity rule: a correct byte must contain an odd number of 1 bits; if the count is not odd, the highest bit b7 is set to 1. Even parity rule: a correct byte must contain an even number of 1 bits; if the count is not even, the highest bit b7 is set to 1.

Codes 0 to 32 and code 127 (34 in total) are control characters or communication-specific characters. Control characters include LF (line feed), CR (carriage return), FF (form feed), DEL (delete), and BEL (bell); communication-specific characters include SOH (start of heading), EOT (end of transmission), and ACK (acknowledge). Code 32 is the space character.

Codes 33 to 126 (94 in total) are printable characters: 48 to 57 are the ten Arabic digits 0 to 9, 65 to 90 are the 26 uppercase English letters, 97 to 122 are the 26 lowercase English letters, and the rest are punctuation marks, arithmetic symbols, and so on.

Let's look behind the scenes at how the ASCII code represents text in digital form, using two examples.

Take the ASCII character 'A': the byte stored in memory is "01000001" in binary, which is 0x41 in hexadecimal and 65 in decimal (this value is simply the number of 'A' in the ASCII code table).

Verification process:

char c = 'A';
printf("%c\n", c);  /* the character itself: A */
printf("%x\n", c);  /* its hexadecimal code: 41 */
printf("%d\n", c);  /* its decimal code: 65 */

Another example is the ASCII character '6': the byte stored in memory is "00110110" in binary, which is 0x36 in hexadecimal and 54 in decimal (this value is simply the number of '6' in the ASCII code table).

Verification process:

char c = '6';
printf("%c\n", c);  /* the character itself: 6 */
printf("%x\n", c);  /* its hexadecimal code: 36 */
printf("%d\n", c);  /* its decimal code: 54 */

A string is stored in memory as a sequence of ASCII codes, one character after another, so we generally do not need any special conversion when transferring strings.

The FTP protocol has two transfer modes: ASCII mode, that is, text mode, and binary mode. Here is an example of the difference between the two representations. Suppose we want to transfer the value 123. In hexadecimal it is 0x7B and in binary it is 01111011, so in binary mode 01111011 is the data actually transmitted. In ASCII mode the situation is completely different: if you also transmit 01111011, the other side receives '{' (whose ASCII code is 7B in hexadecimal). The correct way is to convert each digit of 123 to its ASCII code and transmit those codes. The ASCII codes of '1', '2', and '3' are 0x31, 0x32, and 0x33 in hexadecimal, so the data to transmit is "001100010011001000110011".

What is Unicode

Unicode is also an international standard encoding. In the form described here it uses two bytes per character, so it is not compatible with ANSI code pages, and it uses two bytes even for ASCII characters.

At first, characters were represented with the ASCII code. These characters could be letters, digits, punctuation marks, and control characters. This code is fine for representing English text, but to represent the writing of other languages, such as Arabic, Chinese, Japanese ..., it has to be extended.

For Chinese, a character must be represented with two bytes, and the first byte must be greater than 127 (so programs commonly test whether a byte value is greater than 127 to detect a Chinese character). Representing a Chinese character with two bytes in this way is customarily called a double-byte character set (DBCS); by contrast, the encoding of English characters is called a single-byte character set (SBCS).

Although DBCS is sufficient for mixing Chinese and English characters, converting between the character codes of different writing systems is cumbersome. To address this, the International Organization for Standardization set up the ISO/IEC JTC1/SC2/WG2 working group in April 1984 to define a unified encoding for the scripts and symbols of all countries. In 1991, American multinational companies founded the Unicode Consortium, which reached an agreement with WG2 in October 1991 to use the same coded character set. Unicode currently uses a 16-bit encoding system, and its character set is the same as the BMP (Basic Multilingual Plane) of ISO 10646. Unicode passed DIS (Draft International Standard) in June 1992. The current version, V2.0, was released in 1996. It contains 6,811 symbols, 20,902 Chinese characters, 11,172 Hangul syllables, 6,400 code points in the user-defined character area, and 20,249 reserved code points, for a total of 65,534.

The most significant difference between Unicode and the code pages in common use today is that Unicode encodes every character in a full two bytes, including ASCII characters. A code page instead uses the value range of a byte to decide whether it is an ASCII character or the high byte of a Chinese character. If data is corrupted at some point, the damage can garble the Chinese characters that follow. Unicode uses two bytes for every character, and the most obvious benefit is that it simplifies the processing of Chinese characters.

The original goal of Unicode was to provide a mapping for more than 65,000 characters with a single 16-bit encoding. But that is not enough: it cannot cover all historical scripts, and it does not solve transmission problems (implementation headaches), especially in web-based applications. So Unicode defines three encoding forms, along with some basic reserved characters: UTF-8, UTF-16, and UTF-32. As the name suggests, UTF-8 encodes characters as sequences of 8-bit units, using one or several bytes per character. The greatest benefit of this approach is that UTF-8 keeps the ASCII encoding as a subset. For example, the code for "A" is 0x41 in both UTF-8 and ASCII. UTF-16 and UTF-32 are the 16-bit and 32-bit encodings of Unicode, respectively. Given the original goal, "Unicode" generally means UTF-16.

Unicode is characterized by:

Every character, whatever the language, is represented with two bytes. For example, "A" in Unicode is stored as the two hexadecimal bytes 41 and 00 (low byte first), that is, 4100, where the byte 41 is the same as the ASCII code (65 = 'A'). Windows NT/2000 uses Unicode to represent its character set. For example, a SQL file generated by MS SQL Server can be saved either in Unicode or in the ordinary format; if it is saved in Unicode, much software on the 95/98 platform cannot read it correctly.

You may also notice that in the API definitions on 95/98 many names end with A, for example WriteProfileStringA.

The NT/2000 operating systems provide two sets of APIs; the other set is named with a W suffix, for example WriteProfileStringW, and the APIs ending in W are available only on NT/2000. (On NT, calling the API function ending in W is faster than calling the one ending in A, because the conversion between Unicode and DBCS/SBCS is skipped.)

Because of this, the functions we often use to get the length of a string give different results on NT and on 95/98, as follows:

95/98: Len("ABC中国") returns 7 (because each Chinese character is counted as two ASCII codes)

NT/2000: Len("ABC中国") returns 5 (because every character, letter or not, is counted as one Unicode character)
