Encoding and decoding in Python

Source: Internet
Author: User

########################## character types in Python ########################### There are two types of characters in Python:# 1. STR type: A character in an ASCII table, which occupies one byte, so is also called a byte character. The literal is expressed in double quotation marks.# 2. Unicode type: The number of bytes consumed by a string is related to the encoding format used when saving. The literal is represented by double quotation marks with a "u" prefix.S=' OK, 'U=U ' Me, 'U1=U ' Me 'U2=U ' Love python 'Print' s: ',SPrint' U1: ',U1Print' U2: ',U2# parsers typically convert Unicode characters to Unicode escape sequences# escape sequence begins with "\u"Print' Repr (s): ',Repr(S)Print' Repr (U1): ',Repr(U1)Print' Repr (U2): ',Repr(U2)# This escape sequence is valid only in Unicode literalsPrint"‘\u6211‘: ",‘\u6211‘Print"U '\u6211‘: ",U\u6211‘Print‘‘# You can also create a str string with the STR () function and create a Unicode string with the Unicode () functionPrint' Type of STR (s): ',Type(Str(S))Print' Type of Unicode (s): ',Type(Unicode(S))# You can pass a Unicode string to the Unicode () functionPrint' Type of Unicode (U): ',Type(Unicode(U))# But what if we pass the Unicode () function to ' me '?Try:Print"Unicode (' Me '):",Unicode(I)ExceptUnicodedecodeerrorAsE:# error message# parser attempts to decode our incoming parameters with ASCII encoding, the reason isPrintE########################## Encode and Decode ########################### encoding is the process of converting Unicode characters to str characters in a certain encoding format# non-ASCII characters are encoded as hexadecimal escape characters in bytes# decoding uses encoding format related to settings and environmentutf8_s=S.Encode(' Utf-8 ')Utf8_u=U.Encode(' Utf-8 ')Utf8_u1=U1.Encode(' Utf-8 ')Utf8_u2=U2.Encode(' Utf-8 ')Print' utf8_s: ',SPrint' Repr Utf8_u: ',Repr(S)Print' Utf8_u: ',UPrint' Repr Utf8_u: ',Repr(Utf8_u)Print' UTF8_U1: ',Utf8_u1Print' Repr utf8_u1: ',Repr(Utf8_u1)Print' 
UTF8_U2: ',Utf8_u2Print' Repr utf8_u2: ',Repr(Utf8_u2)# If we have non-ASCII characters in our str literal, the parser will encode it automaticallyPrint"' I love Python ':",' I love python 'Print"Repr ' I Love Python ':",Repr(' I love python ')# to see the problem above, we'll pass the ' I ' (str type) to the Unicode function, and the result is an error.Try:Print"Unicode (' Me '):",Unicode(I)ExceptUnicodedecodeerrorAsE:# An error has occurred and the parser is trying to decode our incoming parameters with ASCII encodingPrintE# The reason is that the parser first encodes the parameter in the default encoding format (here, utf-8) and then passes it to the Unicode () function,# The Help information for the Unicode function, which is said in paragraph:"Unicode (string[, encoding[, errors]), Unicode object|| Create a new Unicode object from the given encoded string.| Encoding defaults to the current default string encoding.‘‘‘# The Unicode class always decodes the first parameter with the encoding format specified by the second parameter, and if the second argument is empty, the default format is used.# The script starts with the UTF-8 encoding format, so the incoming ' I ' is automatically encoded with Utf-8.# However, Unicode is not decoded using the UTF-8 format we specified at the beginning, but ASCII, which of course will be an error.# Why the ASCII code, I guess the reason is that, Python 2.7.x in the parser is the default in ASCII as the default encoding format,# and the Utf-8 format we specify at the beginning of the file is valid only for string literals in this file, and Unicode classes are defined in other module files.# we used "Coding:utf-8" at the beginning of the file to specify the encoding format,# So the parser uses this format to decode the encoded strings that are encountered in this file.# If this encoded string is not encoded in UTF-8 format, an error will occur# with print output, this error is ignored, print blankPrint' GBK_U1: ',U1.Encode(' GBK 
')Print' Repr gbk_u1 ',Repr(U1.Encode(' GBK '))Try:# But if you use the Decode () function to decode an incorrect encoded string, you'll get an error.PrintU1.Encode(' GBK ').Decode(' Utf-8 ')ExceptUnicodedecodeerrorAsE:PrintEPrint‘‘########################## More about encoded strings ########################### for the parser, the encoded string is nothing special, it's a string of type str# As you can understand, Unicode strings are encoded to get a string of type strPrint' Type of UTF8_U1: ',Type(Utf8_u1)# like the normal str string, the parser uses the default encoding to decode the hexadecimal escape characters in the STR stringPrintR "' \xe7\x88\xb1python ':",‘\xe7\x88\xb1Python '# You can also specify the encoding formatPrintR "' \xe7\x88\xb1python '",‘\xe7\x88\xb1Python '.Decode(' Utf-8 ')# The STR string is decoded to be a Unicode string, even if it is decoded with ASCIIPrint"Type of decoded ' a ':",A.Decode(' Utf-8 ')# We want characters like "\xe7" in the STR string to be escaped with "\"PrintR "'\\Xe7\\x88\\Xb1python ': ",‘\\Xe7\\x88\\Xb1python '# What if the word "\xe7" is included in the Unicode string?PrintR "U ' \xe7\x88\xb1python ':",U\xe7\x88\xb1Python '# The hexadecimal escape character in the encoded string represents a byte, and the ASCII character is also a byte,# so we can get the number of bytes occupied by the string by using the Len function to calculate the length of the encoded string encoded by the string.Print' Length of utf8_u1: ',Len(Utf8_u1)Print‘‘########################## Strings Connected ########################### str string can be connectedPrint' s + S: ',S+S# Unicode strings can also be connected directlyPrint' U + u1 + U2: ',U+U1+U2# when the encoded string is connected, the connection is first, and then the output is decoded together.Print' UTF8_U1 + utf8_u2: ',Utf8_u1+Utf8_u2Try:# when a Unicode string is connected to an encoded string, the system first decodes the encoded string in an encoded format.# If the encoded 
string cannot be decoded correctly, an error will bePrint' U + utf8_u1 + utf8_u2: ',U+Utf8_u1+Utf8_u2ExceptUnicodedecodeerrorAsE:PrintE# but using the STR character to connect to the coded string is not an error.# because the STR character's encoded character is the same as itself, adding directly to the encoded string does not break the original encoded string# The system will first multibyte the STR character into the encoded string and then decode it togetherPrint' s + utf8_u1 + utf8_u2: ',S+Utf8_u1+Utf8_u2# Encode strings are decoded and then concatenated with Unicode characters without errorPrint' Decoded s + utf8_u1 + utf8_u2: ',S+Utf8_u1.Decode(' Utf-8 ')+Utf8_u2.Decode(' Utf-8 ')# encoded strings in different encoding format are connected, will not be error (all str type), but the result of the connection is messed up, the system can not decodePrint' Repr gbk_u1 + utf8_u2: ',Repr(U1.Encode(' GBK ')+U2.Encode(' Utf-8 '))Print' GBK_U1 + utf8_u2: ',U1.Encode(' GBK ')+U2.Encode(' Utf-8 ')# Summarize the strings attached, the main point is:# The same type of direct connection; different types, first encode the Unicode type (get the STR type encoded string), and then connect.Print‘‘########################## Encoding of the file ##########################U=U ' I love python 'PrintUPrintRepr(U)# Create a sample file firstW=Open(' Demo.txt ',' W ')Try:# directly passing in a Unicode string may cause an errorW.Write(u) W.Close()exceptUnicodeencodeerror ase: # The reason for the error is that the write () method only accepts strings of type STR# The parser uses ASCII encoding to encode Unicode charactersPrintePrint"'# So we need to encode and write firstW.Write(u.encode(' Utf-8 '))W.Close()# We'll read it out again #R= Open(' Demo.txt ', ' R ')content= R.Read()R.Close()# by Repr and type checking, we can see the encoded string of the read or str type# without being converted into a UNICDE stringPrint' repr content: ', repr(content)Print' type of content: ', 
type(content)Print' content: ', content# Summary below:The # Open () function reads data from disk in bytes, and gets the text of str type,# If a byte is not a asccii code character, it is represented by an escaped hexadecimal (in fact, the encoded string we call). # text is processed in this format by the STR character in the resulting file object. This approach avoids unpredictable coding problems and decodes the encoding# The problem is left to the caller to resolve. # Therefore, we should pay attention to this when dealing with the text and other text in the program that we read from the file, always remembering:# The same type of direct connection; different types, first encode the Unicode type (get the STR type encoded string), and then connect. # So what if we have to deal with the Uncode string directly? # with codecs module
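The script ends by pointing to the codecs module without showing it. The following is a minimal sketch of that idea, not the author's own code: it assumes UTF-8 as the file encoding, and the file name codecs_demo.txt and the variable names are made up for illustration. It is written to run on both Python 2.7 and Python 3.

```python
import codecs
import os
import tempfile

# Hypothetical sample text: the escapes are the two Chinese characters
# used in the script above, followed by "Python".
text = u'\u6211\u7231Python'

# Write to a temporary directory so the example does not touch demo.txt.
path = os.path.join(tempfile.mkdtemp(), 'codecs_demo.txt')

# codecs.open() returns a wrapper that encodes text on write ...
with codecs.open(path, 'w', encoding='utf-8') as w:
    w.write(text)

# ... and decodes on read, so the caller gets text rather than raw bytes.
with codecs.open(path, 'r', encoding='utf-8') as r:
    content = r.read()

# Reading in binary mode shows the bytes that were actually stored.
with open(path, 'rb') as raw:
    raw_bytes = raw.read()

print('round trip ok: %s' % (content == text))
print('characters: %d' % len(content))
print('bytes on disk: %d' % len(raw_bytes))
```

Because the wrapper does the encoding and decoding at the file boundary, the rest of the program can work with one string type only, which avoids the mixed-type concatenation errors shown earlier.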
