Character encoding in Python: encode/decode

Source: Internet
Author: User
This article covers character encoding in Python 2, where str is a byte string and unicode is a text string. When converting between encodings, you must know which encoding the string you read is in, that is, which encoding to decode it from. Use decode to turn the byte string into a Unicode string, then use encode to turn that Unicode string into bytes in the target encoding. Note that even when the content is read and decoded correctly, the re-encoded result can still look garbled if whatever displays it expects a different encoding.

The general form, where str1 is the string we read, is str1.decode("GBK").encode("UTF-8"). Do not call encode on a str: it is already encoded. Do not call decode on a unicode string: it is already decoded. Python 2 does not reject these calls outright. It first performs an implicit conversion with the default ASCII codec, which raises UnicodeDecodeError or UnicodeEncodeError as soon as the data contains non-ASCII characters.

Python 2 has two string types: the byte string (str) and the Unicode string (unicode). Encoding and decoding are therefore conversions between the two: encoding is unicode -> bytes, and decoding is bytes -> unicode. Strictly speaking, a str is a byte string: the sequence of bytes produced by encoding a Unicode string. A unicode object is the true string, obtained by decoding a byte string with the correct character encoding.

If a source file contains non-ASCII characters, you must declare the character encoding in the file header. In fact, Python only looks for the #, the word coding, and the encoding name; any other characters on that line (such as -*-) are decoration. Python supports many encodings, and most have several aliases that are not case-sensitive; UTF-8, for example, can be written as u8. See http://docs.python.org/library/codecs.html#standard-encodings.

A common way to convert is s.decode('gbk').encode('utf-8'). However, today this conversion raised the following exception: UnicodeDecodeError: 'gbk' codec can't decode bytes in position 7-8: illegal multibyte sequence. It is caused by invalid characters in the input.
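The decode-then-encode conversion described above can be sketched as follows. The text is about Python 2; this sketch assumes Python 3, where bytes takes the role of Python 2's str and str takes the role of unicode. The helper name transcode is hypothetical and only serves the illustration.

```python
import codecs


def transcode(data, src="gbk", dst="utf-8"):
    """Decode bytes from the src encoding, then encode the text as dst."""
    return data.decode(src).encode(dst)


# Two CJK characters (U+4E2D, U+6587), written with escapes so the example
# does not depend on the encoding of the file it is saved in.
text = "\u4e2d\u6587"

gbk_bytes = text.encode("gbk")      # encoding: text -> bytes
utf8_bytes = transcode(gbk_bytes)   # GBK bytes -> text -> UTF-8 bytes

print(gbk_bytes)    # b'\xd6\xd0\xce\xc4'
print(utf8_bytes)   # b'\xe4\xb8\xad\xe6\x96\x87'

# Decoding restores the original text when the matching codec is used.
print(utf8_bytes.decode("utf-8") == text)   # True

# Encoding names are case-insensitive and have aliases such as "u8".
print(codecs.lookup("U8").name)             # utf-8

# Python 3 removes the wrong-direction methods: bytes has no encode(),
# and str has no decode().
print(hasattr(gbk_bytes, "encode"), hasattr(text, "decode"))  # False False
```

Keeping the conversion in one small function makes the source encoding explicit at the call site, which is the piece of information the text says you must know before converting.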
This happens especially with data from programs written in C/C++, where the full-width space is often produced in different ways, for example as \xa3\xa0 or \xa4\x57. These are all meant as full-width spaces, but they are not "valid" full-width spaces (the real full-width space is \xA1\xA1), so an exception occurs during transcoding. This is a real headache, because a single invalid character makes the entire string, sometimes an entire article, impossible to transcode.

Solution: s.decode('gbk', 'ignore').encode('utf-8'). The prototype of decode is decode([encoding], [errors='strict']), so the second parameter controls the error-handling policy. The default, strict, raises an exception when an invalid character is encountered. ignore drops the invalid character. replace substitutes a replacement marker (U+FFFD when decoding, ? when encoding). xmlcharrefreplace substitutes an XML character reference; it is only supported when encoding.

http://blog.iamzsx.me/show.html?id=81001

Source files that contain Chinese characters should be saved as UTF-8, declare coding: UTF-8 at the beginning of the file, and use Unicode strings for all string literals. Convert data read from files or other input with encode and decode as needed. Decoding incoming data to Unicode first reduces encoding trouble.
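The effect of the errors parameter can be seen in a short sketch. As an assumption, it targets Python 3 (bytes.decode and str.encode) instead of the Python 2 calls in the text, and the byte 0xff is a made-up stand-in for the invalid characters described above, chosen because it is never a valid GBK lead byte.

```python
# GBK bytes for U+4E2D and U+6587 with one invalid byte (0xff) in between.
bad = b"\xd6\xd0\xff\xce\xc4"

# errors="strict" is the default: the invalid byte raises an exception.
try:
    bad.decode("gbk")
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print("strict:", exc.reason)

# errors="ignore" silently drops the invalid byte.
ignored = bad.decode("gbk", "ignore")

# errors="replace" puts U+FFFD (the replacement character) in its place.
replaced = bad.decode("gbk", "replace")

# Transcoding the whole string now succeeds despite the bad byte.
utf8_bytes = bad.decode("gbk", "ignore").encode("utf-8")

# On the encoding side, "replace" emits "?" and "xmlcharrefreplace"
# emits XML numeric character references.
text = "\u4e2d\u6587"
ascii_replaced = text.encode("ascii", "replace")
ascii_xml = text.encode("ascii", "xmlcharrefreplace")

print(ascii_replaced)   # b'??'
print(ascii_xml)        # b'&#20013;&#25991;'
```

Note that ignore loses data without any trace, while replace leaves a visible marker where the invalid bytes were, which can make damaged input easier to spot later.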
