Python encoding and decoding

Source: Internet
Author: User

I. What is encoding?

Encoding refers to the process of converting information from one form or format into another.

In computers, encoding, in short, translates information that people can read (often called plaintext) into information that a computer can read. As we all know, a computer can only read high and low voltage levels, that is, bits (combinations of 0 and 1).

Decoding is the reverse: converting information the computer can read back into information a person can read.

II. How encodings developed

Earlier posts mentioned that, since the computer was first invented and used in the United States, people initially used the ASCII encoding. An ASCII character occupies 1 byte (8 bits); strictly speaking ASCII itself defines 7-bit codes for 128 characters, and a full byte can represent at most 2**8 = 256 characters.

As computers spread, ASCII could no longer meet the needs of people all over the world. The world's languages are numerous, with far more than 256 characters in total, so each country designed its own national encoding on top of ASCII.

For example, in China the GB2312 encoding was designed to handle Chinese characters; it contains 7,445 characters in total, including 6,763 Chinese characters and 682 other symbols. The 1995 extension specification GBK 1.0 contains 21,886 symbols. GB18030, published in 2000, is the official national standard that replaced GBK 1.0; it contains 27,484 Chinese characters, as well as the scripts of major minority languages such as Tibetan, Mongolian, and Uyghur.


With every country "fragmented" into its own encoding, however, it was hard to exchange text. Unicode then appeared: a character encoding scheme developed by an international organization, designed to accommodate all the characters and symbols in the world.

Unicode (as originally implemented) specifies that each character is represented with at least 2 bytes, so it can represent at least 2**16 = 65,536 characters. With that, the problem seemed solved: people all over the world could put their characters and symbols into Unicode and communicate easily.

At the time, however, computer memory was worth its weight in gold, and the United States and other North American countries did not accept this encoding: it inflated the size of their files, which in turn hurt memory usage and productivity. Awkward.

Obviously an international standard that the United States would not adopt could not stand, so the UTF-8 encoding was born.

UTF-8 is a compression and optimization of the Unicode encoding. It no longer requires a minimum of 2 bytes per character; instead it classifies characters and symbols: ASCII content is stored in 1 byte, most European characters in 2 bytes, and East Asian characters in 3 bytes.
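This byte-length rule is easy to verify in Python 3 with the standard `str.encode` method (a minimal sketch; the three sample characters are just illustrative):

```python
# UTF-8 byte lengths per script class:
assert len("A".encode("utf-8")) == 1   # ASCII character: 1 byte
assert len("é".encode("utf-8")) == 2   # European accented character: 2 bytes
assert len("杰".encode("utf-8")) == 3  # East Asian (CJK) character: 3 bytes
```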

With that, everyone was happy.

III. How UTF-8 saves storage space and bandwidth

While the computer is running, data in memory is always represented in Unicode; the UTF-8 encoding is used when the data is saved to disk or transmitted over the network.

In the Unicode character set, the string "I'm 杰克" ("I'm Jack") corresponds to this encoding table:

I → 0x49    ' → 0x27    m → 0x6d    (space) → 0x20    杰 → 0x6770    克 → 0x514b

Each character corresponds to a hexadecimal number (convenient for people to read; the 0x prefix marks hexadecimal), but the computer can only read binary, so the actual representation is:

I → 0b1001001    ' → 0b100111    m → 0b1101101    (space) → 0b100000    杰 → 0b110011101110000    克 → 0b101000101001011
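These code points and their binary forms can be checked with Python's built-in `ord`, `hex`, and `bin`:

```python
# Code point of each character, matching the tables above:
assert hex(ord("杰")) == "0x6770"
assert bin(ord("杰")) == "0b110011101110000"
assert hex(ord("克")) == "0x514b"
assert bin(ord("I")) == "0b1001001"
```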

Because the Unicode rules require each character to occupy at least 2 bytes, the string actually occupies memory as follows:

I → 00000000 01001001    ' → 00000000 00100111    m → 00000000 01101101    (space) → 00000000 00100000    杰 → 01100111 01110000    克 → 01010001 01001011

This string occupies 12 bytes in total, but looking at the binary for the English characters, the leading bits are all 0: a great waste of space and bandwidth.

See how UTF-8 solves this:

I → 01001001    ' → 00100111    m → 01101101    (space) → 00100000    杰 → 11100110 10011101 10110000    克 → 11100101 10000101 10001011

UTF-8 uses 10 bytes, 2 bytes fewer than Unicode. Chinese is rare in most programs; if 90% of a program's content is English, UTF-8 saves roughly 40% of storage or traffic (each English character shrinks from 2 bytes to 1, while each Chinese character grows from 2 bytes to 3, so per character about 0.9 × 1 − 0.1 × 1 = 0.8 bytes are saved out of 2).
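The byte counts above can be reproduced by encoding the same string (a sketch; `utf-16-be` stands in for the fixed 2-bytes-per-character Unicode layout described above):

```python
s = "I'm 杰克"                          # 6 characters: I ' m (space) 杰 克
utf8_len = len(s.encode("utf-8"))       # 4 ASCII bytes + 2 × 3 CJK bytes
utf16_len = len(s.encode("utf-16-be"))  # 2 bytes per character
assert utf8_len == 10
assert utf16_len == 12
```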

So, most of the time, UTF-8 is the encoding used for storage and transmission.


IV. Encoding and decoding in Python 2.x and Python 3.x

1. In Python 2.x there are two string types: str and unicode. str stores byte data; unicode stores Unicode data.

Inspecting the two types shows that str stores hexadecimal byte data while unicode stores Unicode data; a UTF-8-encoded Chinese character takes 3 bytes, and a Unicode-encoded one takes 2 bytes.

Byte data is commonly used for storage and transmission, while Unicode data is used to display plaintext. How do we convert between the two types?
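In Python 3 terms, the conversion runs through `str.encode` (text to bytes) and `bytes.decode` (bytes back to text):

```python
text = "I'm 杰克"             # Unicode (plaintext) data
data = text.encode("utf-8")   # text -> bytes, using the UTF-8 rule
back = data.decode("utf-8")   # bytes -> text, using the same rule
assert isinstance(data, bytes)
assert back == text           # a lossless round trip
```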

Whether UTF-8 or GBK, each is just an encoding rule: a rule for encoding Unicode data into byte data. Bytes encoded with UTF-8 must therefore be decoded with the UTF-8 rule; otherwise you get mojibake or an error.
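Decoding with a mismatched codec is easy to demonstrate (a Python 3 sketch):

```python
data = "杰克".encode("utf-8")          # 6 bytes produced by the UTF-8 rule
assert data.decode("utf-8") == "杰克"  # matching rule: correct text
try:
    data.decode("ascii")               # mismatched rule: cannot decode
    failed = False
except UnicodeDecodeError:
    failed = True
assert failed
```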

A quirk of Python 2.x encoding: concatenating a str with a unicode string succeeds for English text but raises an error for Chinese text. Why does the English concatenation succeed while the Chinese one errors out?

This is because in Python 2.x the interpreter silently performs the byte-to-Unicode conversion. As long as the data is all ASCII, every conversion succeeds; once a non-ASCII character sneaks into your program, the default decoding fails, raising a UnicodeDecodeError. Python 2.x's encoding model makes ASCII easy to handle; the price is that handling non-ASCII fails.

2. In Python 3.x there are also exactly two string types: str and bytes.

The str type holds Unicode data and the bytes type holds byte data; compared with Python 2.x, on the surface it is just a change of names.

Remember this line from the earlier blog post? "Everything is Unicode now."

Python 3 renamed the unicode type to str, and the old str type has been replaced by bytes.

Perhaps the most important new feature of Python 3 is the much clearer distinction between text and binary data: byte strings are no longer decoded automatically. Text is always Unicode, represented by the str type; binary data is represented by the bytes type. Python 3 never mixes str and bytes implicitly, which makes the distinction sharp: you cannot concatenate a string with a bytes object, search for a string inside a bytes object (or vice versa), or pass a string to a function that expects bytes (or vice versa).
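A minimal Python 3 illustration of this strictness: mixing the types raises TypeError, and the only bridge between them is an explicit encode/decode:

```python
text = "abc"     # str (Unicode text)
data = b"abc"    # bytes (binary data)
try:
    text + data  # implicit mixing is forbidden in Python 3
    mixed = True
except TypeError:
    mixed = False
assert mixed is False
assert text.encode("ascii") == data   # explicit conversion is the bridge
```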

Note: in both Python 2 and Python 3, Unicode data corresponds directly to plaintext, and printing Unicode data displays the corresponding plaintext (whether English or Chinese).

V. How files are encoded from disk to memory

When we edit text, the characters in memory are in Unicode, because Unicode has the widest coverage and can display almost any character. But what happens to the data when we save the text to disk?

The answer: it becomes a byte string encoded in some way, for example the variable-length UTF-8, which saves space well, or of course historical encodings such as GBK. So every text-editing program has a default encoding for saving files, such as UTF-8 or GBK; when we click Save, the editor has "silently" done the encoding work for us.

When we open the file again, the software silently does the decoding work for us: the data is decoded back into Unicode and can then be rendered clearly to the user. So Unicode is the form closer to the user, and bytes is the form closer to the computer.
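This round trip through disk can be sketched with Python 3's `open()`, which takes the encoding explicitly (the temp-file path here is just for illustration):

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "encoding_demo.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("杰克")                        # Save: text is encoded to bytes
with open(path, "rb") as f:
    raw = f.read()                         # On disk the file holds UTF-8 bytes
assert raw == b"\xe6\x9d\xb0\xe5\x85\x8b"
with open(path, "r", encoding="utf-8") as f:
    assert f.read() == "杰克"              # Open: bytes are decoded back to text
os.remove(path)
```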

In fact, the Python interpreter is much like a text editor: it too has its own default encoding. Python 2.x defaults to ASCII and Python 3.x to UTF-8, which you can query as follows:

import sys
print(sys.getdefaultencoding())

If we do not want the interpreter's default encoding, we need to declare one at the top of the file. Remember the declaration we used in Python 2.x?

# coding: utf-8

If the Python 2 interpreter executes a UTF-8-encoded file, it decodes the UTF-8 bytes with its default ASCII codec, so as soon as the program contains Chinese, decoding naturally fails. Declaring # coding: utf-8 at the top of the file tells the interpreter not to decode the file with its default encoding but with UTF-8 instead. The Python 3 interpreter is much more convenient, since its default encoding is already UTF-8.

Resources

1. http://www.cnblogs.com/yuanchenqi/articles/5956943.html

2. http://www.cnblogs.com/284628487a/p/5584714.html
