UnicodeEncodeError: 'gbk' codec can't encode character u'\u200e' in position 43: illegal multibyte sequence

Source: Internet
Author: User

Problem

I used Python to fetch the HTML source of this web page:

http://blog.csdn.net/hfahe/article/details/5494895

The HTML source is UTF-8 encoded.

I extracted its title section:

        <span class="link_title"><a href="/hfahe/article/details/5494895">        Speech at the Beijing Perl Conference - Using Mason to Develop High-Performance Web Sites        </a></span>

The title text is:

Speech at the Beijing Perl Conference - Using Mason to Develop High-Performance Web Sites

Then I used:

titleUni = unicode(titleHtml, "utf-8");

or

titleUni = titleHtml.decode("utf-8");

to decode it to Unicode, but this produced an error:

UnicodeEncodeError: 'gbk' codec can't encode character u'\u200e' in position 43: illegal multibyte sequence

"Resolution Process"

1. I have run into Python encoding problems involving GB18030, UTF-8, Unicode and so on many times before, and solved them. What is strange here is that for other, similar web pages, such as:

http://blog.csdn.net/v_july_v/article/details/6543438

http://blog.csdn.net/v_july_v/article/details/5934051

and so on, the corresponding extracted content decodes to Unicode without any problem, because the encoding really is UTF-8.

2. I used chardet.detect to check the actual encoding of the content. The result was:

encInfo= {'confidence': 0.99, 'encoding': 'utf-8'}

which is the same as the result for the other web pages.

3. What is very strange about this problem is that I am calling decode with UTF-8, yet the error message talks about GBK (and about encoding), rather than reporting a UTF-8 decoding error.
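The distinction in item 3 can be checked directly: a failed decode and a failed encode raise different exception types, and each exception records which codec was involved. The sketch below is written in Python 3 syntax (an assumption; the article itself uses Python 2.x, where str.decode and unicode.encode play the same roles). classify is a hypothetical helper, not part of any library.

```python
def classify(action):
    """Run action() and report which kind of Unicode error it raised, if any."""
    try:
        action()
    except UnicodeDecodeError as exc:
        # bytes -> text failed; exc.encoding names the codec used for decoding
        return ("decode", exc.encoding)
    except UnicodeEncodeError as exc:
        # text -> bytes failed; exc.encoding names the codec used for encoding
        return ("encode", exc.encoding)
    return ("ok", None)


# 0xFF is never valid in UTF-8, so this is a genuine UTF-8 decoding failure.
decode_result = classify(lambda: b"\xff".decode("utf-8"))

# U+200E (LEFT-TO-RIGHT MARK) has no GBK representation, so this is an encoding failure.
encode_result = classify(lambda: "\u200e".encode("gbk"))

# Plain ASCII encodes to GBK without trouble.
ok_result = classify(lambda: "abc".encode("gbk"))

print(decode_result, encode_result, ok_result)
```

An error that names 'gbk' and says "can't encode" therefore cannot come from the UTF-8 decode call itself; it has to come from a later step that turns text back into bytes.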

4. I found some other posts:

Python UnicodeEncodeError: illegal multibyte sequence

But they discuss errors that occur when encoding from Unicode to GBK or GB2312.

My case looked different: the content itself is UTF-8 and I wanted to convert it to Unicode, yet the error complained about GBK ...

5. One post, "Explore UTF-8 Chinese code BOM tag problem", mentions that a UTF-8 BOM can prevent normal decoding. So I saved the returned HTML to a file and opened it in Notepad++, but I still could not tell whether it had a BOM; all I could see was the text content.

Then I tried code like this:

titleUni = titleHtml[1:].decode("utf-8");

titleUni = titleHtml[2:].decode("utf-8");

But it still did not work.

Later I read another explanation of the UTF-8 BOM problem, but it was not what I needed either.
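For reference, a BOM check does not need an editor. The sketch below (Python 3 syntax, an assumption; strip_utf8_bom and the sample bytes are made up for illustration) tests for the three-byte UTF-8 BOM with the standard codecs module. Because the BOM is three bytes long (EF BB BF), slicing off one or two bytes as tried above would not remove it cleanly in any case.

```python
import codecs


def strip_utf8_bom(data):
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical sample: a UTF-8 BOM followed by ASCII text.
with_bom = codecs.BOM_UTF8 + b"title"

has_bom = with_bom.startswith(codecs.BOM_UTF8)

# Two equivalent ways to get the text without the BOM:
text_manual = strip_utf8_bom(with_bom).decode("utf-8")
text_sig = with_bom.decode("utf-8-sig")  # this codec skips a leading BOM itself

print(has_bom, text_manual, text_sig)
```

In this article's case the BOM turned out to be unrelated to the error, so this only rules out one suspect.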

6. In the post "Python Unicode and Chinese Processing (digest)", I saw:

s.decode('GBK', 'ignore').encode('utf-8')

That reminded me of a similar explanation I had seen before, namely adding ignore to skip illegal characters. I then referred to:

An issue with illegal characters encountered in Python string decode

and looked up the corresponding syntax:

str.decode([encoding[, errors]])

Decodes the string using the codec registered for encoding. encoding defaults to the default string encoding. errors may be given to set a different error handling scheme. The default is 'strict', meaning that encoding errors raise UnicodeError. Other possible values are 'ignore', 'replace' and any other name registered via codecs.register_error(), see section Codec Base Classes.

New in version 2.2.

Changed in version 2.3: Support for other error handling schemes added.

Changed in version 2.7: Support for keyword arguments added.
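As a small illustration of the three error handling schemes named in the documentation above, the following sketch decodes a deliberately broken byte string. It uses Python 3 syntax (an assumption; in Python 2.x the same arguments apply to str.decode), and the sample bytes are made up.

```python
# b"\xff" can never occur in valid UTF-8, so this input is deliberately broken.
raw = b"abc\xffdef"

# 'strict' (the default) raises on the first invalid byte.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'ignore' silently drops the invalid byte.
ignored = raw.decode("utf-8", "ignore")

# 'replace' substitutes U+FFFD (REPLACEMENT CHARACTER) for the invalid byte.
replaced = raw.decode("utf-8", "replace")

print(strict_failed, ignored, replaced)
```

Note that these schemes only matter when the input bytes are invalid for the chosen codec; they do not change what happens to a string that decodes cleanly.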

I tried it:

titleUni = titleHtml.decode("utf-8", 'ignore');

and:

titleUni = titleHtml.decode("utf-8", 'replace');

But the result was the same. At this line:

print "titleUni=", titleUni;

the "'gbk' codec can't encode" error above still appeared.

Then I happened to notice that a line of debugging code I had added before printing titleUni:

print "len(titleUni)=", len(titleUni);

printed normally. This means that titleUni had been decoded correctly into a Unicode value, i.e. the decode above had worked.

I then went back to the earlier version:

titleUni = titleHtml.decode("utf-8");

The result was the same: print "len(titleUni)=", len(titleUni) produced its output normally.

Only then did I understand the root cause of the "'gbk' codec can't encode" error. Whether I used

titleHtml.decode("utf-8");

or

titleHtml.decode("utf-8", 'ignore');

or

titleHtml.decode("utf-8", 'replace');

I got a valid Unicode string titleUni every time. The error came from printing it. My local system is Windows 7 with cmd, whose default code page is CP936, i.e. GBK. To display titleUni in cmd, print first has to encode the Unicode string as GBK. Because titleUni contains a character that GBK cannot represent (here u'\u200e'), the "'gbk' codec can't encode" error is raised at that point.
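The root cause described above can be reproduced without a web page or a Windows console. This is a minimal sketch in Python 3 syntax (an assumption; the article uses Python 2.x), and the string title is a made-up stand-in for titleUni that contains the same problem character, U+200E (LEFT-TO-RIGHT MARK).

```python
# Hypothetical stand-in for titleUni: ordinary text plus the invisible U+200E.
title = "Mason\u200e"

# Measuring the length involves no encoding, so it always works.
length = len(title)

# Encoding to GBK is what print has to do on a CP936 console, and it fails.
bad_char = None
failed_codec = None
try:
    title.encode("gbk")
except UnicodeEncodeError as exc:
    failed_codec = exc.encoding               # the codec that rejected the text
    bad_char = exc.object[exc.start:exc.end]  # the character it could not encode
    bad_position = exc.start                  # index of that character

print(length, failed_codec, repr(bad_char))
```

This matches the observation that the len() line printed fine while the line printing titleUni itself failed.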

Summary

For this kind of problem:

(1) UnicodeEncodeError --> the problem occurred while encoding a Unicode string;

(2) 'gbk' codec can't encode character --> the problem occurred while encoding Unicode characters as GBK.

In this situation, the most likely cause is that the Unicode string contains some characters that cannot be converted to GBK.
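To find out which characters are responsible, one option is to try encoding each character on its own. The helper below is a hypothetical sketch in Python 3 syntax (an assumption; the article uses Python 2.x); unencodable_chars and the sample string are made up for illustration.

```python
import unicodedata


def unencodable_chars(text, encoding="gbk"):
    """Return (index, char, name) for each character the codec rejects."""
    found = []
    for index, char in enumerate(text):
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            name = unicodedata.name(char, "<unnamed>")
            found.append((index, char, name))
    return found


# Hypothetical sample: ASCII, the invisible U+200E, a space, two Chinese characters.
sample = "Perl\u200e \u5927\u4f1a"
report = unencodable_chars(sample)

for index, char, name in report:
    print(index, repr(char), name)
```

Invisible characters such as U+200E are easy to miss in a page title, so printing their Unicode names can make the cause obvious.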

The solutions are:

    • Option 1:

When encoding the Unicode string, add the ignore parameter so that characters that cannot be encoded are skipped and the rest is encoded as GBK normally.

The corresponding code is:

gbkTypeStr = unicodeTypeStr.encode("GBK", 'ignore');

    • Option 2:

Alternatively, encode it as GB18030, which is a superset of GBK (i.e. GBK is a subset of GB18030):

gb18030TypeStr = unicodeTypeStr.encode("GB18030");

The resulting string is GB18030-encoded.
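The two options above can be compared side by side. This is a sketch in Python 3 syntax (an assumption; the article uses Python 2.x) with a made-up sample string; it also shows 'replace', which is not one of the article's two options but is a standard error handler for encode.

```python
# Hypothetical sample containing U+200E, which GBK cannot represent.
text = "Perl\u200e \u5927\u4f1a"

# Option 1: encode as GBK and drop whatever cannot be encoded.
gbk_ignore = text.encode("gbk", "ignore")

# Variant of option 1: put a '?' where a character could not be encoded.
gbk_replace = text.encode("gbk", "replace")

# Option 2: encode as GB18030, which can represent every Unicode character.
gb18030_bytes = text.encode("gb18030")

# Decoding again shows what each approach kept.
roundtrip_ignore = gbk_ignore.decode("gbk")
roundtrip_replace = gbk_replace.decode("gbk")
roundtrip_gb18030 = gb18030_bytes.decode("gb18030")

print(repr(roundtrip_ignore), repr(roundtrip_replace), repr(roundtrip_gb18030))
```

Option 1 loses the unencodable character, while option 2 keeps it. The GB18030 bytes are only useful to a consumer that actually decodes GB18030, though; a console set to CP936 may still not render such a character.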

"Off-topic"

To be on the safe side when converting the original UTF-8 string to Unicode, you can also replace:

titleUni = titleHtml.decode("utf-8");

with:

titleUni = titleHtml.decode("utf-8", 'ignore');

This way, even if the input contains a few relatively insignificant special characters that are not valid UTF-8, it still decodes successfully, which avoids decoding errors and makes the program more robust. The trade-off is that the invalid bytes are dropped silently.

"PostScript 2012-12-01"

Later, I took the time to summarize the most common types of these errors. If you are interested, see:

[Summary] Common character encoding and decoding errors in Python 2.x and their solutions
