Four. Python character encoding

Source: Internet
Author: User
Tags uppercase letter

1. What is character encoding

Because the computer is also energized, we learned in junior high school physics circuit, computer design is also used in circuit design, similar to the lamp switch, only power on or off, that is, 0 is not through, 1 represents the pass. Therefore the computer knows only 0 and 1.

For example, the computer is a foreigner, then how can we communicate with it, according to the Convention, there must be an interpreter. The translation is to tell the computer what we want to say to the computer, 0 and 1. The process of this translation is the process of coding.

2. Character encoding classification

Since computers are invented by Americans, it means that only Americans can communicate with computers. American Translator is (ASCII coding ), then we Chinese how to communicate with the computer, according to the Convention, to find a Chinese and computer translation, if every country to communicate with computers, it will have to find a lot of translations, this many translation is the type of character encoding.

Later China customized the GB2312 code, the Japanese custom Shift_JIS, South Korea custom EUC-KR, when countries have national standards, when many countries use computers to communicate, there will be conflicts, is what we often say garbled, and later in order to unify the world's text, The ISO (International Organization for Standardization) has developed a new coding specification that places world-wide text in a table called "Universal multiple-octet Coded Character Set", referred to as UCS, commonly known as "UNICODE ".

Later, because Unicode commonly uses 2 bytes to represent one character, some characters may not require 2 bytes of space, which creates a huge waste. At this point, a code is derived that, depending on the different symbols, is long and short, and you want to save a lot of space for Unicode.

2.1. ASCII code

We know that inside the computer, all the information is ultimately represented as a binary string. Each bits (bit) has 0 and 12 states, so eight bits can combine 256 states, which is called a byte. In other words, a byte can be used to represent 256 different states, each of which corresponds to a symbol, which is 256 symbols, from 0000000 to 11111111.

In the 60 's, the United States developed a set of character encodings, which made a uniform provision for the relationship between English characters and bits. This is known as ASCII code and has been used so far.

The ASCII code specifies a total of 128 characters, such as a space "space" is 32 (binary 00100000), the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 32 control symbols that cannot be printed out) take up only one byte of the latter 7 bits, and the first 1-bit uniform is 0.

2.2 UnicodeCode

It is important to note that Unicode is just a set of symbols, which only specifies the binary code of the symbol, but does not specify how the binary code should be stored. For example, the Chinese character "strict" Unicode is hexadecimal number 4E25, converted to a binary number is a full 15 bits (100111000100101), that is to say, the symbol of at least 2 bytes. Representing other larger symbols, it may take 3 bytes or 4 bytes, or more.

There are two serious problems here, and the first question is, how can you differentiate between Unicode and ASCII? How does the computer know that three bytes represents a symbol instead of three symbols? The second problem is that we already know that the English alphabet is only one byte to express enough, if Unicode uniform rules, each symbol with three or four bytes, then each letter must have two to three bytes is 0, which is a great waste for storage, the size of the text file will be two or three times times larger , it is unacceptable.

2.3 UTF-8

The popularization of the Internet has strongly demanded the emergence of a unified coding method. UTF-8 is the most widely used form of Unicode implementation on the Internet. Other implementations include UTF-16 (characters in two-byte or four-byte notation) and UTF-32 (characters in four-byte notation), but not on the Internet. Again , the relationship here is that UTF-8 is one of the ways Unicode is implemented.

One of the biggest features of UTF-8 is that it is a variable-length coding method. It can use 1~4 bytes to represent a symbol, varying the length of a byte depending on the symbol.

The coding rules for UTF-8 are simple, with only two lines:

1) for a single-byte symbol, the first bit of the byte is set to 0, and the next 7 bits are the Unicode code for the symbol. So for the English alphabet, the UTF-8 encoding and ASCII code are the same.

2) for n-byte notation (n>1), the first n bits are set to 1, the n+1 bit is set to 0, and the first two bits of the subsequent bytes are set to 10. The rest of the bits are not mentioned, all of which are Unicode codes for this symbol.

3. Encoding and decoding
    • File from memory brush to hard disk operation abbreviation for file (encoding process)
    • Files read from hard disk to memory for short read files (decoding process)
    • With what code, with what decoding, otherwise will appear garbled situation.

Four. Python character encoding

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.