Character Set Practice

Character Set Practice report

20135311 Fu Dong

Origins of character sets and the differences between them:

ASCII is a 7-bit code that is normally stored one character per 8-bit byte. Codes 0-31 and 127 are control characters; codes 32-126 are the visible characters, which include all English letters, the Arabic numerals, and some other common symbols. Byte values 128-255 are not defined by ASCII.
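The ranges above can be expressed as a small lookup. This is only an illustrative sketch; the function name classify_ascii is made up for this example.

```python
def classify_ascii(code):
    """Classify a byte value using the ASCII ranges described above."""
    if not 0 <= code <= 255:
        raise ValueError("expected a single byte value (0-255)")
    if code <= 31 or code == 127:
        return "control"
    if code <= 126:
        return "visible"
    return "not ASCII"  # 128-255 are outside the ASCII range


for sample in (10, 65, 126, 127, 200):
    print(sample, classify_ascii(sample))
```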

ASCII is sufficient for English-speaking countries but not for other Western European countries, so it was extended to the range 0-255, forming the ISO-8859-1 character set.
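A short sketch of the difference, assuming Python's built-in "ascii" and "iso-8859-1" codecs: a byte above 127 is rejected by ASCII but maps to an accented letter in ISO-8859-1, while the lower half stays identical.

```python
raw = bytes([0x63, 0x61, 0x66, 0xE9])  # "caf" followed by the byte 0xE9

# ISO-8859-1 assigns a character to 0xE9 (the letter e with acute accent).
latin1_text = raw.decode("iso-8859-1")

# ASCII has no character for 0xE9, so decoding fails.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Bytes 0-127 decode to the same text under both codecs.
lower_half = bytes(range(128))
same_lower_half = lower_half.decode("ascii") == lower_half.decode("iso-8859-1")

print(latin1_text, ascii_ok, same_lower_half)
```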

East Asian scripts such as Chinese have far more characters than a single byte can represent, so each character is stored in two bytes. Take the Chinese national standard character set GB2312 as an example: the first byte of a Chinese character is in the range 128-255. A program can therefore decide that whenever a byte is greater than 127, that byte and the one immediately after it together form one Chinese character. A character set that stores a character in more than one byte is called a multibyte character set (multibyte charset); one that stores each character in a single byte, such as ASCII, is called a single-byte character set (single-byte charset). In GB2312, ASCII characters are still stored in a single byte; in other words, ASCII is a subset of GB2312.

GB2312 contains only several thousand commonly used Chinese characters (6,763), which often fails to meet practical needs, so it was extended into the widely used GBK character set. GBK is currently the default character set of Chinese-language Windows and some other Chinese operating systems. It contains more than 20,000 characters; besides remaining compatible with GB2312, it also includes traditional Chinese characters and the Chinese characters used in Japanese and Korean. Note that GBK is only a specification rather than a national standard. The newer national standard is GB18030-2000, which contains more characters than GBK.
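The superset relationship can be probed with Python's "gb2312", "gbk" and "gb18030" codecs. This is a rough sketch; the three sample characters are assumptions picked for illustration (a simplified character, its traditional form, and a rarer character outside GBK).

```python
def can_encode(text, codec):
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


simplified = "东"            # present in GB2312
traditional = "東"           # traditional form, added by GBK
beyond_gbk = "\U00020000"    # a rare character that only GB18030 covers

for codec in ("gb2312", "gbk", "gb18030"):
    row = [can_encode(t, codec) for t in (simplified, traditional, beyond_gbk)]
    print(codec, row)

# Compatibility: a GB2312 character keeps the same bytes in the larger sets.
same_bytes = (
    simplified.encode("gb2312")
    == simplified.encode("gbk")
    == simplified.encode("gb18030")
)
print(same_bytes)
```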

The universal character set (Universal Character Set, or UCS) defined by the international standard ISO 10646 changed this situation radically. UCS is a superset of all other character set standards, and it guarantees round-trip compatibility with them: if you convert any text string to UCS and then convert it back to the original encoding, no information is lost.
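The round trip can be checked on a small sample. This sketch assumes Python's str type as the UCS-side representation and uses a two-character GB2312 string chosen for illustration.

```python
original = "北京".encode("gb2312")      # legacy GB2312 bytes

text = original.decode("gb2312")        # convert to universal code points
code_points = ["U+%04X" % ord(ch) for ch in text]

restored = text.encode("gb2312")        # convert back to the legacy encoding

print(code_points)
print(restored == original)
```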

To modify the default character set:

Use the command locale -a | grep zh_CN to list the Chinese locales currently installed. If GB2312 is not among them, install it with the command sudo locale-gen zh_CN. Then open the Vim configuration file with vim ~/.vimrc and enter the following lines:

set fileencodings=utf-8,ucs-bom,gb18030,gbk,gb2312,cp936

set termencoding=utf-8

set encoding=utf-8
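The fileencodings list is tried in order when a file is opened. The sketch below imitates that idea in Python and is not Vim's actual algorithm: the function name guess_decode is hypothetical, only the UTF-8 byte order mark is handled for the ucs-bom entry, and the first candidate that decodes without error is accepted.

```python
import codecs

# Same order as the fileencodings setting above.
FILEENCODINGS = ["utf-8", "ucs-bom", "gb18030", "gbk", "gb2312", "cp936"]


def guess_decode(data, candidates=FILEENCODINGS):
    """Return (name, text) for the first candidate that decodes cleanly."""
    for name in candidates:
        codec = name
        if name == "ucs-bom":
            # Simplification: only a UTF-8 byte order mark is recognized.
            if not data.startswith(codecs.BOM_UTF8):
                continue
            codec = "utf-8-sig"
        try:
            return name, data.decode(codec)
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding fits the data")


print(guess_decode("北京".encode("utf-8")))
print(guess_decode("北京".encode("gb2312")))
```

Because GB2312 bytes are usually not valid UTF-8, the second call falls through to the first GB-family entry in the list.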

To write files in multiple character sets:

First, using the ASCII character set:

First, look up in the ASCII table the hexadecimal code of each character to be written. The content I want to enter is "I'amFDJ", and the corresponding hexadecimal string is 4927616d46444a;
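The table lookup can be reproduced programmatically. In this sketch the seven characters I'amFDJ are the ones whose codes give the hexadecimal string 4927616d46444a quoted above (27 is the apostrophe); the variable names are illustrative.

```python
text = "I'amFDJ"

# One (character, hex code) pair per character, as read off the ASCII table.
codes = [(ch, format(ord(ch), "02x")) for ch in text]
hex_string = "".join(code for _, code in codes)

print(codes)
print(hex_string)
```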

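Next, open the target file in vi. Type :%!xxd to switch to the hexadecimal view, press i to enter insert mode, and type 4927616d46444a. Leave insert mode and type :%!xxd -r to convert back; the buffer now shows the intended text. Save the file. If the line holds only the bare hex string with no offset column, the plain-hex form :%!xxd -r -p is the one to use.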
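The same result can be produced without the editor. The following is a minimal sketch that writes the bytes for the hex string to a file and reads them back; the file name ascii.txt and the helper write_hex_file are made up for this example, and a temporary directory is used so nothing is left behind.

```python
import os
import tempfile

HEX_STRING = "4927616d46444a"


def write_hex_file(path, hex_string):
    """Write the bytes described by a plain hex string to path."""
    with open(path, "wb") as handle:
        handle.write(bytes.fromhex(hex_string))


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "ascii.txt")
    write_hex_file(path, HEX_STRING)
    with open(path, "rb") as handle:
        stored = handle.read()

shown = stored.decode("ascii")  # what a viewer such as cat would display
print(stored.hex())
print(shown)
```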

Use the cat command to view the text content:

Second, using the GB2312 character set:

As before, look up the GB2312 codes of the content to enter. The content I want to enter is the Chinese sentence translated here as "Fu Dong is not rich Tokyo"; the corresponding codes are B5B8 B6AC DDBC B2BB CAC7 B8BB B6AB BEA9;
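The code lookup can also be done with Python's "gb2312" codec. In this sketch the sample string is an assumption (five Chinese characters whose codes match the last five codes listed above), and the helper name gb2312_codes is hypothetical.

```python
def gb2312_codes(text):
    """Return the GB2312 code of each character as an uppercase hex string."""
    return [ch.encode("gb2312").hex().upper() for ch in text]


sample = "不是富东京"  # assumed sample text, for illustration only
codes = gb2312_codes(sample)
print(" ".join(codes))
```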

First, change the terminal's encoding to GB2312. Afterwards, use the command locale -a | grep zh_CN to check whether a GB2312 locale is present. If it is not, install it as described in the section on modifying the default character set.

Open the text file with vi and again type :%!xxd to switch to the hexadecimal view. Enter the codes, type :%!xxd -r to convert back, then save.

Use the cat command to view:

Third, using the UTF-8 character set:

Again, look up the encoded form of the content to enter. What I want to enter is the Chinese phrase translated here as "not rich Tokyo". Its UTF-8 encoding is: e4b8b0 e698af e5afbc e4b89c e4babc. The file is then written in the same way as in the previous two examples.
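Each of the codes above is three bytes long because UTF-8 spreads the bits of a code point over a lead byte and continuation bytes. The encoder below is a simplified sketch of that bit layout (the name utf8_encode is illustrative, and surrogate code points are not rejected); in practice the built-in str.encode("utf-8") does this.

```python
def utf8_encode(code_point):
    """Encode one code point into UTF-8 bytes by explicit bit manipulation."""
    if code_point < 0x80:
        return bytes([code_point])
    if code_point < 0x800:
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x110000:
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point out of range")


# U+4E1C and U+662F are two of the characters whose codes appear above.
for cp in (0x4E1C, 0x662F):
    print("U+%04X" % cp, utf8_encode(cp).hex())
```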
