Encoding and decoding from a Python perspective

Source: Internet
Author: User

Lead:

Python 2 and Python 3 use different default character sets, and the trouble this causes is a programmer's nightmare. To say goodbye to the problem for good, you need to understand what encoding and decoding really are.

Why do we need encoding?

For a Chinese speaker who does not speak English, translating English into Chinese is decoding, and translating Chinese into English is, naturally, encoding.

The same is true of computers.

Computers can only recognize 0 and 1, so to a computer any text is a combination of 0s and 1s. But we humans cannot read such combinations of 0s and 1s!

Naturally, the 0s and 1s need to be converted into text we can read, such as Chinese or English.

Mapping 0s and 1s to text is called decoding, and mapping text to 0s and 1s is, naturally, encoding.

Where there is a mapping, there needs to be something like a "table" that records it. This "table" is what we usually call a character set (also called an encoding), such as the ASCII character set or the UTF-8 character set.

By convention, the text that humans can read is called characters, and the 0s and 1s that the computer works with are called bytes (that is, a byte stream). A character set is the mapping between characters and bytes.
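To make the "table" idea concrete, here is a minimal Python 3 sketch. It looks up the number behind a character and shows that two character sets map the same character to different bytes. The variable names are only for illustration.

```python
# Python 3 sketch: a character set is a table from characters to bytes.

# ord() and chr() walk the table in both directions for a single character.
code_point = ord("A")          # the number assigned to "A"
same_char = chr(code_point)    # back from the number to the character

# The same character is mapped to different bytes by different character sets.
char = "中"
utf8_bytes = char.encode("utf-8")   # three bytes in UTF-8
gbk_bytes = char.encode("gbk")      # two bytes in GBK

print(code_point, same_char)
print(utf8_bytes, gbk_bytes)
```

The byte values for "中" match the first half of the "中国" examples later in this article.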

Why do garbled characters appear?

As mentioned above, there are many character sets, and that is where the problem comes from.

Suppose I use the UTF-8 character set to map text to 0s and 1s when encoding, but use the GBK character set to map the 0s and 1s back to text when decoding.

The two character sets have different mapping rules, so the result is naturally garbled!

It is like buying a lock from Zhang San's shop and then trying to open it with a key from John Doe's: it would be a miracle if the door opened!

(In fact, as far as the computer is concerned, there is no such thing as garbled text. The combination of 0s and 1s stays exactly the same; what changes is the text it gets mapped to.)
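The mismatch is easy to reproduce. The Python 3 sketch below encodes with UTF-8 and decodes with GBK. It passes errors="replace" so that byte sequences GBK cannot map become replacement characters instead of raising an exception; that choice is made for this illustration and is not something the original text prescribes.

```python
# Python 3 sketch: encode with one character set, decode with another.
text = "中国"
data = text.encode("utf-8")

# Wrong character set: the bytes are untouched, only their interpretation changes.
# errors="replace" substitutes U+FFFD for sequences that GBK cannot map.
garbled = data.decode("gbk", errors="replace")

# Right character set: the original text comes back.
restored = data.decode("utf-8")

print(repr(garbled))
print(restored)
```

The bytes in data are never modified, which is exactly the point of the parenthetical remark above.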

About file encoding

Has anyone else wondered how a CPU, a device that can only operate on 0s and 1s, handles resources such as text, pictures, video and sound?

In fact, this is about file encoding.

The following text is from a Zhihu user's answer:

DJ Hitori
Links: https://www.zhihu.com/question/27805272/answer/74539468
Source: Zhihu
Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please cite the source.
Text: Several standards can be used to represent a character with an 8- to 16-bit binary number, such as ASCII, which uses 8 bits to represent English letters and digits, and Unicode, which uses 16 bits to represent nearly every character in almost all languages. ASCII, for example, uses the number 01000001 to denote uppercase A, so when word processing software sees this number it knows it is an A, asks a font file what the letter A looks like, and then draws it on the screen for you to see.

Image: The simplest format is the bitmap (BMP), in which a 24-bit binary number represents one point in the image. Of the 24 bits, 8 indicate how much red the point has, 8 how much green, and 8 how much blue; together the three primary colors can represent almost any color. Many 24-bit numbers strung together are many points, and so they form an image. There are also other formats, such as JPG and PNG, that represent the same image content with fewer numbers.

Sound: Sound is a wave, mathematically a continuous function, and a computer cannot work with a continuous function directly. So the sound is sampled 44,100 times per second, turning 1 second of sound into 44,100 recorded numbers; this is the recording process. On playback, these numbers are given to the sound card, which makes the speakers vibrate with the amplitudes the numbers describe, and that produces sound. Similarly, besides storing all 44,100 numbers (this is the WAV format), there are other formats, such as MP3 and Ogg, that represent the same sound with fewer numbers.

Video: Given standards for images and sound, 60 images plus 1 second of sound make up 1 second of video content. But that amount of data is extremely large, so nobody stores video this way; scientists have invented various coding methods that represent the same video content with far fewer numbers than the uncompressed form, such as H.

Software of all kinds directs the CPU to interpret multimedia content according to these standards, computes what color each point on the screen should be (as in a bitmap), and hands the results to the video card, which turns the numbers into colors on the screen. This is called a refresh. The screen is generally refreshed 60 times per second, so you can see the multimedia content you have opened.
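The numbers in the answer above can be checked with a few lines of Python 3. This is only a sketch: the pixel color and the 440 Hz sine wave are hypothetical stand-ins for real image and audio data.

```python
import math

# Text: ASCII represents uppercase "A" as the bit pattern 01000001.
a_bits = format(ord("A"), "08b")

# Image: one bitmap pixel is 24 bits, 8 each for red, green and blue.
red, green, blue = 255, 128, 0          # hypothetical orange pixel
pixel = bytes([red, green, blue])
pixel_bits = len(pixel) * 8

# Sound: one second of audio sampled 44,100 times.
# The 440 Hz sine wave stands in for a recorded signal.
sample_rate = 44100
samples = [math.sin(2 * math.pi * 440 * n / sample_rate)
           for n in range(sample_rate)]

print(a_bits, pixel_bits, len(samples))
```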

With that general background on encoding and decoding, let us look at encoding and decoding in Python:

String type

Python 2 has two string types, unicode and str.

>>> s = '中国'
>>> s
'\xe4\xb8\xad\xe5\x9b\xbd'
>>> u = u'中国'
>>> u
u'\u4e2d\u56fd'
>>> type(s)
<type 'str'>
>>> type(u)
<type 'unicode'>

Python 3 has only one string type, str.

>>> s = '中国'
>>> s
'中国'
>>> type(s)
<class 'str'>
>>> u = u'中国'
>>> u
'中国'
>>> type(u)
<class 'str'>

Default Character Set

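In Python 2 the default character set is ASCII, although the Chinese text in the examples here is stored as UTF-8 bytes (that comes from the encoding of the terminal or source file, not from Python's default).

>>> import sys
>>> sys.getdefaultencoding()
'ascii'

In Python 3 the default character set is UTF-8, whatever the text.

>>> import sys
>>> sys.getdefaultencoding()
'utf-8'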
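In practice, the Python 3 default means that str.encode() and bytes.decode() fall back to UTF-8 when no character set is named. A small sketch, with illustrative variable names:

```python
import sys

# Python 3 sketch: the default character set for str <-> bytes conversion.
default_encoding = sys.getdefaultencoding()

text = "中国"
implicit = text.encode()          # no argument: UTF-8 is used
explicit = text.encode("utf-8")   # same result, stated explicitly

print(default_encoding, implicit == explicit)
```

Naming the character set explicitly is still the safer habit. File I/O is a separate matter: on many Python 3 versions, open() without an encoding argument uses a locale-dependent encoding rather than this default.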

About encoding and byte streams

The result of the encoding is a byte stream.

In Python 2, str is a byte stream.

>>> u.encode('gbk')
'\xd6\xd0\xb9\xfa'
>>> u.encode('utf-8')
'\xe4\xb8\xad\xe5\x9b\xbd'
>>> s
'\xe4\xb8\xad\xe5\x9b\xbd'
>>> s = b'中国'
>>> s
'\xe4\xb8\xad\xe5\x9b\xbd'
>>> type(s)
<type 'str'>
>>> bytes(s)
'\xe4\xb8\xad\xe5\x9b\xbd'

In Python 3, you cannot write a byte stream for Chinese text directly as a literal.

>>> s.encode('utf-8')
b'\xe4\xb8\xad\xe5\x9b\xbd'
>>> s = b'中国'
  File "<stdin>", line 1
SyntaxError: bytes can only contain ASCII literal characters.
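Since a Python 3 bytes literal may contain only ASCII characters, the usual ways to get the bytes of Chinese text are to encode a str or to spell the bytes out with \x escapes. A minimal sketch:

```python
# Python 3 sketch: three ways to obtain the same byte stream.
text = "中国"

encoded = text.encode("utf-8")            # encode a str
built = bytes(text, "utf-8")              # bytes() with an explicit character set
literal = b"\xe4\xb8\xad\xe5\x9b\xbd"     # bytes literal using \x escapes

# A different character set gives a different byte stream for the same text.
gbk_encoded = text.encode("gbk")

print(encoded == built == literal)
print(gbk_encoded)
```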

About decoding

The result of decoding is str (Python 3) or unicode (Python 2).

Python 2

>>> s.decode('utf-8')
u'\u4e2d\u56fd'
>>> u
u'\u4e2d\u56fd'

Python 3

>>> a
b'\xe4\xb8\xad\xe5\x9b\xbd'
>>> a.decode()
'中国'
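Decoding and encoding combine naturally when bytes in one character set have to be converted to another: decode to text first, then encode again. The transcode helper below is a hypothetical name used for this Python 3 sketch, not a standard library function.

```python
def transcode(data, source_encoding, target_encoding):
    """Decode bytes with one character set, then encode them with another."""
    text = data.decode(source_encoding)     # bytes -> str
    return text.encode(target_encoding)     # str -> bytes

gbk_bytes = b"\xd6\xd0\xb9\xfa"             # "中国" encoded with GBK
text = gbk_bytes.decode("gbk")
utf8_bytes = transcode(gbk_bytes, "gbk", "utf-8")

print(text)
print(utf8_bytes)
```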

Detection

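You can use isinstance() to determine the type. The results below are from Python 2, where a bytes literal such as b'qq' is a str; in Python 3 the second call returns False.

>>> isinstance(a, str)
True
>>> isinstance(b'qq', str)
True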
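A common use of this check in Python 3 is to accept either bytes or str and always hand back text. The to_text helper below is a hypothetical name for this sketch, and its UTF-8 default is an assumption.

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is a byte stream."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    if isinstance(value, str):
        return value
    raise TypeError("expected bytes or str, got %s" % type(value).__name__)

print(to_text(b"\xe4\xb8\xad\xe5\x9b\xbd"))
print(to_text("qq"))
print(isinstance(b"qq", str))   # False in Python 3: bytes and str are distinct
```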

About errors

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd6 in position 0: invalid continuation byte

This error occurs because the character set used for encoding and the character set used for decoding do not match.
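The error can be reproduced in Python 3 by decoding GBK bytes as UTF-8. The sketch below also shows one possible way to cope, trying a list of candidate character sets in order; decode_with_fallback and its candidate list are hypothetical, not a standard API.

```python
gbk_bytes = "中国".encode("gbk")   # b'\xd6\xd0\xb9\xfa'

# Reproduce the error: decode GBK bytes with the UTF-8 character set.
try:
    gbk_bytes.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

def decode_with_fallback(data, encodings=("utf-8", "gbk")):
    """Try each candidate character set in turn and return the first success."""
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")

print(error_message)
print(decode_with_fallback(gbk_bytes))
```

A fallback list is only a heuristic: bytes can decode without any error under the wrong character set and still come out garbled, so knowing the real character set is always better than guessing.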

Further notes:

\0x: when the number converted to hexadecimal has only 1 digit, a 0 is put in front of it, as in 0a; otherwise the digits are output as they are. \x: the hexadecimal digits are output exactly as the conversion produces them. Lowercase x and uppercase X also differ: lowercase x outputs lowercase hexadecimal digits and uppercase X outputs uppercase ones (this only matters for the six digits a to f).
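Assuming the note above refers to printf-style hexadecimal formatting (this is an interpretation, since the notation in the note is unclear), the behavior looks like this in Python:

```python
# Assumption: the note describes printf-style hexadecimal format specifiers.
padded = "%02x" % 10     # one hex digit is padded with a leading 0
unpadded = "%x" % 10     # digits are output as they are
lower = "%x" % 255       # lowercase x gives lowercase digits
upper = "%X" % 255       # uppercase X gives uppercase digits

# Building the \x.. escapes shown throughout this article from a byte stream.
escaped = "".join("\\x%02x" % byte for byte in "中".encode("utf-8"))

print(padded, unpadded, lower, upper)
print(escaped)
```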
