Python encoding and decoding

Last Update:2018-03-01 Source: Internet

Author: User


Encoding and decoding

First, it should be clear that all information stored in a computer is binary. Encoding and decoding are essentially a mapping (a correspondence). For example, 'A' in ASCII is 65, which the computer stores as 01000001. The screen, however, should display 'A', not 01000001. How does the computer know that 01000001 means 'A'? It has to decode: when ASCII decoding is selected and the computer reads 01000001, it looks the value up in the ASCII table, finds 'A', and displays 'A'.

Encoding: the correspondence from real characters to binary strings (real characters → binary string).
Decoding: the correspondence from binary strings to real characters (binary string → real characters).

ASCII & UTF-8

ASCII, as is well known, uses 1 byte (8 bits) per character with the first bit fixed at 0, so it covers only 128 characters, which is obviously not enough.

The Unicode coding system is designed to express any language. To avoid redundant storage (for example, for the part that corresponds to ASCII), it is stored with a variable-length encoding. Variable length makes decoding harder, though, because the decoder cannot tell how many bytes represent one character.

UTF-8 is a prefix-based variable-length encoding designed for Unicode, in which the number of bytes that represent a character can be read from the first byte. If the first bit of a byte is 0, the byte is a single character on its own. If the first bit is 1, the number of consecutive leading 1s is the number of bytes the current character occupies, and every following byte of that character starts with 10.

For example, the Unicode code point of 严 is 4E25 (100111000100101). 4E25 falls in the three-byte range (0000 0800-0000 FFFF), so the UTF-8 encoding of 严 requires three bytes, in the format "1110xxxx 10xxxxxx 10xxxxxx". Then, starting from the last bit of 严, fill the x positions of the format from back to front and pad the remaining high positions with 0. The resulting UTF-8 encoding of 严 is "11100100 10111000 10100101".
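The bit-filling procedure described above can be sketched by hand. This is a minimal illustration written for Python 3, not a replacement for the built-in codec: the function names `utf8_encode_code_point` and `utf8_sequence_length` are made up for this example, and no validation (surrogates, out-of-range values, continuation bytes passed as a first byte) is attempted.

```python
def utf8_encode_code_point(cp):
    """Encode one Unicode code point to UTF-8 by filling the bit templates."""
    if cp < 0x80:        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf8_sequence_length(first_byte):
    """Number of bytes in a character, judged from its leading byte only.

    Assumes first_byte really is a leading byte, not a 10xxxxxx continuation.
    """
    if (first_byte & 0x80) == 0:
        return 1
    length = 0
    mask = 0x80
    while first_byte & mask:   # count the consecutive leading 1 bits
        length += 1
        mask >>= 1
    return length


encoded = utf8_encode_code_point(0x4E25)             # the code point of 严
bits = " ".join(format(b, "08b") for b in encoded)
print(bits)                                          # 11100100 10111000 10100101
print(encoded == "严".encode("utf-8"))               # True
print(utf8_sequence_length(encoded[0]))              # 3
print(format(ord("A"), "08b"))                       # 01000001
```

The last line repeats the ASCII example: 'A' is 65, stored as 01000001.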
Decoding and encoding in Python

In Python (the examples here use Python 2), encoding and decoding are really conversions between different encoding systems, with Unicode as the default pivot: encoding is unicode → str and decoding is str → unicode, where str refers to a byte stream.

str.decode decodes the byte stream str with the given encoding and converts it into a unicode object. u.encode converts the unicode object into a byte stream (str) with the given encoding.

Note that encode is meant to be called on a unicode object to produce a byte stream, and decode is meant to be called on a str object (byte stream) to produce a unicode object. If encode is called on a str object, the str is first decoded to a unicode object with the system default encoding and only then encoded. Overlooking this implicit decode often leads to errors.

When writing code, just remember: a str byte stream calls decode, and a unicode object calls encode.
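The implicit decode is easier to see when it is written out. The sketch below is for Python 3, where `str` plays the role of Python 2's unicode and `bytes` plays the role of Python 2's str. The helper `py2_style_encode` is hypothetical and only imitates what Python 2 does when encode is called on a byte stream, assuming the default encoding is ASCII (the usual Python 2 default).

```python
text = "严"                           # text object (u'严' in Python 2)
raw = text.encode("utf-8")            # encode: text -> byte stream
assert raw == b"\xe4\xb8\xa5"
assert raw.decode("utf-8") == text    # decode: byte stream -> text


def py2_style_encode(byte_stream, target, default_encoding="ascii"):
    """Hypothetical helper imitating Python 2's str.encode():
    the byte stream is implicitly decoded with the default encoding first."""
    return byte_stream.decode(default_encoding).encode(target)


try:
    py2_style_encode(raw, "gbk")
    implicit_decode_failed = False
except UnicodeDecodeError as exc:
    # The hidden decode step fails before any encoding happens.
    implicit_decode_failed = True
    print("implicit decode failed:", exc.reason)

# The explicit version names the real source encoding and works.
explicit = raw.decode("utf-8").encode("gbk")
print(explicit.decode("gbk") == text)
```

Spelling out the decode step, as in the last two lines, avoids depending on the default encoding at all.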

`s = u'严'`
`s`
`print type(s), s`

The first line defines a unicode object (not UTF-8 bytes). The second line outputs u'\u4e25'. The third line outputs <type 'unicode'> 严.

`u = s.encode('utf8')`
`u`
`print type(u), u`

Calling s.encode('utf8') encodes s with UTF-8 and saves the result as a byte stream. The second line outputs '\xe4\xb8\xa5'. The third line outputs <type 'str'> followed by the garbled character 涓 instead of 严.

Note that the default encoding of the terminal here is GBK. In Windows CMD it can be viewed and changed with chcp, or the default can be changed in the registry (the CodePage value under HKEY_CURRENT_USER\Console, for CMD or PowerShell): 936 is Simplified Chinese and 65001 is UTF-8. Both can display Chinese, but to make Chinese input easier I set the default to 936.

When print sends content to the terminal, a unicode object is converted to the terminal's encoding, which is why the first print above displays correctly. A UTF-8 byte stream, however, is decoded by the terminal with its default GBK, so it displays wrongly; here '\xe4\xb8' happens to be 涓 in GBK.
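The terminal's behaviour can be imitated without a GBK console by decoding the UTF-8 bytes as GBK. This is a Python 3 sketch under that assumption; `errors="replace"` merely stands in for whatever placeholder a real terminal would draw for the leftover third byte, and the variable names are illustrative.

```python
utf8_bytes = "严".encode("utf-8")      # b'\xe4\xb8\xa5'

# A GBK terminal reads the same bytes two at a time: '\xe4\xb8' becomes 涓,
# and the dangling '\xa5' cannot form a character on its own.
shown_on_gbk_terminal = utf8_bytes.decode("gbk", errors="replace")
print(shown_on_gbk_terminal == "严")   # False: the display is garbled

# Text converted to the terminal's own encoding survives the round trip,
# which is what print does for a text object.
gbk_bytes = "严".encode("gbk")
shown_correctly = gbk_bytes.decode("gbk")
print(shown_correctly == "严")         # True
```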

`t = s.encode('utf8').decode('utf8')`
`t`

The second line outputs u'\u4e25'.

File encoding

Saved text also has an encoding format. For example, a TXT file can be saved as ASCII, UTF-8, and so on. A .py file can declare its encoding with a comment in one of its first two lines:

`# -*- coding: utf-8 -*-`

Reading files in Python

`fr = open('encode.py', 'r')`
`fstr = fr.read()`

Just remember that fstr is a byte stream; for other operations, see above.

Note: all of the operations above were run under CMD or PowerShell. Python's own bundled interpreter has a problem: after `s = u'你好'`, evaluating s shows a unicode object, but the code values inside it are GBK, not Unicode.
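For the file-reading step, a Python 3 sketch of the same idea looks like this. It assumes the file is saved as UTF-8; the file name sample.txt is hypothetical and the file is created in a temporary directory so the example does not depend on any existing file.

```python
import os
import tempfile

# Create a small UTF-8 file to read back.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as fw:
    fw.write("严".encode("utf-8"))

# Binary mode returns the raw byte stream, like fr.read() in Python 2.
with open(path, "rb") as fr:
    fstr = fr.read()

# Decode explicitly with the encoding the file was saved in.
text = fstr.decode("utf-8")

# Alternatively, let open() do the decoding while reading.
with open(path, "r", encoding="utf-8") as fr:
    same_text = fr.read()

print(fstr)                 # b'\xe4\xb8\xa5'
print(text == same_text)    # True
```

Either way, the encoding used for decoding has to match the one the file was saved with.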
