Encoding and decoding

First, be clear that all information stored in a computer is binary. Encoding and decoding are essentially a mapping (a correspondence relationship). For example, the ASCII code of 'A' is 65, so the computer stores 01000001. The display cannot show 01000001, though; it has to show 'A'. How does the computer know that 01000001 is 'A'? That requires decoding: when ASCII is chosen for decoding, the computer reads 01000001, looks it up in the ASCII table, finds 'A', and displays 'A'.

Encoding: the correspondence from real characters to binary strings, real character → binary string.
Decoding: the correspondence from binary strings to real characters, binary string → real character.

ASCII & UTF-8

As is well known, ASCII uses 1 byte (8 bits) to represent a character, and the first bit is always 0, so the character set it can represent is obviously not large enough.
Unicode was designed to represent the characters of any language. Storing every character at a fixed width would waste space (for example, on the part that corresponds to ASCII), so variable-length encoding is used. Variable-length encoding makes decoding harder, however: without further structure, the decoder cannot tell how many bytes make up one character.
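The storage argument can be illustrated with a short sketch. It assumes UTF-32 as the fixed-width example and UTF-8 as the variable-length one; neither choice comes from the text above, and the variable names are illustrative.

```python
# Fixed-width storage: every character takes 4 bytes in UTF-32
fixed_a = u'A'.encode('utf-32-be')          # 3 zero bytes of padding + 1 useful byte
fixed_yan = u'\u4e25'.encode('utf-32-be')   # the character 严, also 4 bytes

# Variable-length storage: UTF-8 uses only as many bytes as needed
var_a = u'A'.encode('utf-8')                # 1 byte, identical to ASCII
var_yan = u'\u4e25'.encode('utf-8')         # 3 bytes

sizes = (len(fixed_a), len(fixed_yan), len(var_a), len(var_yan))
```

For ASCII-heavy text, the fixed-width form spends three of every four bytes on zeros, which is the redundancy a variable-length encoding avoids.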
UTF-8 is a prefix-based variable-length encoding designed for Unicode: the leading bits of a byte tell how many bytes represent a character. If the first bit of a byte is 0, that byte is a single character on its own. If the first bit is 1, the number of consecutive 1 bits at the start gives the number of bytes in the current character, and each following byte of that character starts with 10. For example, the Unicode code point of "严" is 4E25 (100111000100101). 4E25 falls in the three-byte range (0000 0800-0000 FFFF), so the UTF-8 encoding of "严" requires three bytes in the format "1110xxxx 10xxxxxx 10xxxxxx". Starting from the last bit of "严", fill in the x positions from back to front and pad the remaining high positions with 0. The resulting UTF-8 code of "严" is "11100100 10111000 10100101".

Decoding and encoding in Python

In Python, encoding and decoding are really conversions between different encoding systems, and by default Unicode is the intermediate form: encoding is unicode → str and decoding is str → unicode, where str refers to a byte stream. str.decode decodes the byte stream str with the given codec and converts it into a unicode object; u.encode converts the unicode object into a byte stream with the given encoding.

Note that a unicode object calls encode to produce a byte stream, and a str object (byte stream) calls decode to produce a unicode object. If a str object calls encode, it is first decoded to a unicode object implicitly with the default codec, and overlooking this implicit decode often results in an error. When writing your own code, just remember: a str byte stream calls decode, a unicode object calls encode.
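The bit-filling steps for code point 4E25 can be reproduced by hand and then compared with what Python's own encode returns. This is a sketch under stated assumptions: utf8_encode_3byte and sequence_length are hypothetical helper names introduced here, the first handles only the three-byte range and does not reject surrogate code points, and the second expects a leading byte rather than a continuation byte.

```python
def utf8_encode_3byte(code_point):
    """Fill 1110xxxx 10xxxxxx 10xxxxxx for a code point in 0x0800-0xFFFF."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError('this sketch only handles the three-byte range')
    # Pad to 16 bits with high zeros, then cut into 4 + 6 + 6 payload bits
    bits = format(code_point, '016b')
    return ['1110' + bits[:4], '10' + bits[4:10], '10' + bits[10:]]


def sequence_length(first_byte):
    """Count the leading 1 bits of a leading byte (0 leading ones means 1 byte)."""
    if first_byte & 0x80 == 0:
        return 1
    length = 0
    mask = 0x80
    while first_byte & mask:
        length += 1
        mask >>= 1
    return length


groups = utf8_encode_3byte(0x4E25)                        # code point of 严
hand_encoded = bytearray(int(g, 2) for g in groups)       # the three filled-in bytes
builtin_encoded = bytearray(u'\u4e25'.encode('utf-8'))    # Python's own result
```

Here groups holds the three bit strings from the worked example, hand_encoded equals builtin_encoded, and sequence_length(0xE4) reports 3 because the first byte starts with 1110.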
    s = u'严'
    s
    print type(s), s
The first line defines a unicode object (not UTF-8). The second line outputs u'\u4e25'. The third line outputs <type 'unicode'> 严.
    u = s.encode('utf8')
    u
    print type(u), u
s.encode('utf8') encodes s with UTF-8 and saves the result as a byte stream. The second line outputs '\xe4\xb8\xa5'; the third line outputs <type 'str'> 涓.

Note also that the terminal's default encoding is GBK. On Windows, cmd can view and change it with chcp, or the terminal's default encoding can be changed in the registry (codepage under HKEY_CURRENT_USER\Console or PowerShell). 936 is Simplified Chinese and 65001 is UTF-8; both can display Chinese, but for convenient Chinese input I keep the default at 936.

When print sends output to the terminal, a unicode object is first converted to the terminal's encoding, which is why the first print above displayed correctly. When a UTF-8 byte stream is printed, the terminal decodes it with its default GBK and the display goes wrong: here '\xe4\xb8' happens to be 涓 in GBK.
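What happens when UTF-8 bytes are decoded with the wrong codec can be sketched without a GBK terminal by calling decode directly. The variable names are illustrative, and passing 'replace' is a choice made here so the dangling third byte does not raise. The ASCII case is included because ASCII is the default codec Python 2 uses for the implicit decode when a byte string calls encode, unless that default has been reconfigured.

```python
utf8_bytes = u'\u4e25'.encode('utf-8')     # the bytes '\xe4\xb8\xa5' for 严

# Roughly what a GBK terminal does: the first two bytes form a valid
# GBK character, and the third byte is left without a partner.
first_two_as_gbk = utf8_bytes[:2].decode('gbk')
whole_as_gbk = utf8_bytes.decode('gbk', 'replace')

# Decoding as ASCII fails outright, because every byte is above 0x7F.
try:
    utf8_bytes.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Decoding with the codec that produced the bytes recovers the character.
recovered = utf8_bytes.decode('utf-8')
```

first_two_as_gbk is the GBK character 涓, ascii_failed ends up True, and recovered is the original 严, so the bytes themselves are never damaged; only the choice of codec changes what is shown.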
    t = s.encode('utf8').decode('utf8')
    t
The second line outputs u'\u4e25'.

File encoding

Text is also encoded when it is saved to a file; a TXT file, for example, can be saved as ASCII, UTF-8, and so on. Reading a file in Python:
    fr = open('encode.py', 'r')
    fstr = fr.read()
    fr.close()
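The read above can be extended into a small round trip that makes the file's encoding explicit. This is only a sketch: the scratch file sample.txt is a hypothetical stand-in for encode.py, the file is assumed to be saved as UTF-8, and io.open is used because it accepts an encoding argument.

```python
import io
import os
import tempfile

# Hypothetical scratch file standing in for 'encode.py'
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')

# Saving: the unicode text is encoded to UTF-8 bytes on disk
with io.open(path, 'w', encoding='utf-8') as fw:
    fw.write(u'\u4e25')                     # the character 严

# Reading raw bytes, then decoding with the encoding used when saving
with io.open(path, 'rb') as fr:
    fstr = fr.read()
text = fstr.decode('utf-8')

# Or letting io.open do the decoding while reading
with io.open(path, 'r', encoding='utf-8') as fr:
    text_direct = fr.read()
```

Both routes give the same text; they differ only in whether the decode step is written out or handled by io.open.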
Just remember that fstr is a byte stream; the other operations work as shown above.

Note: the operations above were done under cmd or PowerShell. Python's bundled interpreter has a problem: for a Chinese literal such as s = u'你好', evaluating s shows a unicode object, but the content is GBK-encoded rather than Unicode.

References
- Introduction to character encoding http://blog.csdn.net/trochiluses/article/details/8782019
- chcp http://baike.baidu.com/link?url=_ Qajtlxmrjod5ppv8ykh7om7uhqtucqud5wqawfrtmcmg3ii3f3s7r11xd6rqf6zkzh_ljz-1dwzexyxei2_lq
- Python character encoding and decoding http://blog.csdn.net/trochiluses/article/details/16825269