Common bugs in Python character conversion

Source: Internet
Author: User

1. Why does writing a Unicode string to a file raise an error in Python?

The write method of a file takes a str argument. In Python 2, str is a byte string that carries no encoding information. When you pass it a unicode object, Python implicitly converts the object to str before handing it to write. That conversion is an encoding step. If no encoding is specified, Python 2 uses ASCII by default, and ASCII has no mapping for Chinese characters, so a UnicodeEncodeError is raised.
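The failure can be reproduced by performing the ASCII encoding step explicitly. This is a minimal sketch, not the code Python runs internally: the variable names are illustrative, and the explicit encode("ascii") call stands in for the implicit conversion that Python 2 does inside write. It uses a u"" literal with escapes so it should run on both Python 2.7 and Python 3 (where the article's unicode type is named str and its str type is named bytes).

```python
# Two Chinese characters, written as escapes so the source file stays ASCII.
text = u"\u4e2d\u6587"

error_codec = None
try:
    # Python 2 performs this step implicitly when a unicode object
    # is passed to the write method of a file opened without an encoding.
    text.encode("ascii")
except UnicodeEncodeError as exc:
    # The exception records which codec failed.
    error_codec = exc.encoding

print(error_codec)
```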

The correct approach is to specify the encoding in the code. Either pass it when opening the file, for example fp = open('test.txt', 'w', encoding='utf-8') (the built-in open accepts the encoding argument in Python 3; in Python 2 use io.open or codecs.open instead), or encode the unicode object yourself with its encode method when writing, so that write receives a str: fp.write(s.encode('utf8')). Note that encode is only meaningful for unicode objects. Python 2 also lets you call encode on a str object, but it first decodes the str implicitly with the default encoding, so this works only when the bytes are valid in that default encoding. Beginners are therefore advised not to call encode directly on a str.
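Both approaches can be sketched as follows. This is an illustration under stated assumptions rather than the author's code: the temporary directory and the variable names are made up for the example, and io.open is used because it accepts an encoding argument on both Python 2.7 and Python 3.

```python
import io
import os
import tempfile

text = u"\u4e2d\u6587 text"  # Chinese characters plus ASCII

# Hypothetical location: a fresh temporary directory, so nothing is overwritten.
path = os.path.join(tempfile.mkdtemp(), "test.txt")

# Approach 1: give the encoding to open and write the Unicode string directly.
with io.open(path, "w", encoding="utf-8") as fp:
    fp.write(text)

# Approach 2: open in binary mode and encode by hand at write time.
with io.open(path, "wb") as fp:
    fp.write(text.encode("utf-8"))

# Read the file back with the same encoding to check the round trip.
with io.open(path, "r", encoding="utf-8") as fp:
    round_trip = fp.read()
```

Either way, the encoding is chosen explicitly instead of being left to the interpreter's default.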

2. Error: UnicodeEncodeError: 'gbk' codec can't encode character u'\u200e' in position 43: illegal multibyte sequence

The root cause of the "'gbk' codec can't encode" error is as follows. In the preceding code, any of

titleHtml.decode("UTF-8")

or titleHtml.decode("UTF-8", 'ignore')

or titleHtml.decode("UTF-8", 'replace')

yields a valid Unicode string, titleUni. That string then has to be printed. The local system is Windows 7, and the default code page in cmd is CP936, that is, GBK. The Unicode string titleUni therefore has to be encoded as GBK before it can be displayed in cmd. Because titleUni contains characters that GBK cannot represent, the encoding step fails with "'gbk' codec can't encode".

For this class of problem:

(1) UnicodeEncodeError --> the problem occurred while encoding a Unicode string;

(2) 'gbk' codec can't encode character --> the problem occurred while encoding Unicode characters as GBK.

In this situation, the most likely cause is that the Unicode string contains characters that cannot be converted to GBK.
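The diagnosis above can be checked directly from the exception object. This is a small sketch with a made-up sample string (the real titleUni from the article is not shown in the source); it uses U+200E, the character named in the error message, and reads the attributes that UnicodeEncodeError carries.

```python
# Hypothetical sample: ASCII text with U+200E (left-to-right mark) embedded.
title_uni = u"Python\u200e tips"

bad_char = None
position = None
failed_codec = None
try:
    # This is the step print performs for a console whose code page is CP936.
    title_uni.encode("gbk")
except UnicodeEncodeError as exc:
    failed_codec = exc.encoding                # which codec could not encode
    position = exc.start                       # index of the offending character
    bad_char = exc.object[exc.start:exc.end]   # the character itself

print(failed_codec, position, repr(bad_char))
```

Inspecting exc.start and exc.object this way shows which character is responsible before deciding how to handle it.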

The solutions are:

Solution 1:

When encoding the Unicode string, pass the 'ignore' argument so that characters which cannot be encoded are skipped and the rest is encoded as GBK normally.

The corresponding code is:

gbkTypeStr = unicodeTypeStr.encode("GBK", 'ignore')
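A short sketch of what the 'ignore' handler does, next to 'replace' for comparison. The sample string and variable names are assumptions made for the illustration. Note that 'ignore' silently drops data, so it is best suited to output meant only for display.

```python
# Hypothetical sample containing U+200E, which GBK cannot represent.
unicode_type_str = u"Python\u200e \u4e2d\u6587"

# 'ignore' drops the characters that cannot be encoded.
gbk_ignored = unicode_type_str.encode("gbk", "ignore")

# 'replace' substitutes a question mark for each such character.
gbk_replaced = unicode_type_str.encode("gbk", "replace")

# Decoding again shows what survived the 'ignore' encoding.
survived = gbk_ignored.decode("gbk")
```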

Solution 2:

Alternatively, encode the string as GB18030, which is a superset of GBK (that is, GBK is a subset of GB18030):

gb18030TypeStr = unicodeTypeStr.encode("GB18030")

The resulting string is encoded in GB18030.
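As a rough check, the same kind of string that fails under GBK encodes under GB18030 without an error handler and decodes back unchanged. The sample string and names below are assumptions for the illustration.

```python
# Hypothetical sample containing U+200E, which GBK cannot represent.
unicode_type_str = u"Python\u200e \u4e2d\u6587"

# No error handler is needed here, unlike with the GBK codec.
gb18030_type_str = unicode_type_str.encode("gb18030")

# Nothing was dropped, so decoding restores the original string.
restored = gb18030_type_str.decode("gb18030")

# Characters that GBK does cover produce the same bytes in both encodings.
same_bytes = u"\u4e2d\u6587".encode("gbk") == u"\u4e2d\u6587".encode("gb18030")
```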

This article is from the "Small Stop" blog; please keep this source attribution: http://10541556.blog.51cto.com/10531556/1971510
