A study of Python encoding problems, based on experience with Scrapy

Source: Internet
Author: User

Python transcoding decoding

This article is based on Python 2. Scrapy is a very lightweight crawler framework, but because it hides so much detail about network requests, we sometimes run into awkward bugs. Of course, this mostly happens with non-standard websites.

There are many articles online about encoding conversion in Python. If you are not familiar with the topic, the references below are a good place to start.

Ned Batchelder's explanation of Python Unicode and str, which is easy to follow

Getting started with Scrapy

Background on encode

With the references above, Python encoding and decoding can be understood well, so here I will just give my own view. The first languages I learned, starting with C, were basically statically typed: in C, every parameter passed to a function must have a declared type. Python relaxes this. Python is object-oriented too, but its objects follow duck typing: anything that behaves the right way is accepted. This suits a dynamic language, where we cannot be sure what the next line of code will receive. Ever since I learned Python, I have felt that it is not strict about types, and that gave me an illusion: as long as two values look alike, they can be used alike. Take the two strings '中国' and u'中国'. They look almost the same, but writing u'中国' to a file (without specifying an encoding) raises an error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 344-351: ordinal not in range(128)

This is a Unicode encoding error. To understand it, you need to understand the Unicode character set and Unicode encodings; I recommend reading this blog post on character encoding. Python uses the Unicode character set to represent all decoded characters. Why use the Unicode character set? Let me give an example:

A is an American programmer who sends mail as an ASCII-encoded file. B is a Chinese programmer who sends mail as a GBK-encoded file. Now C has to write a program that processes the mail of both A and B. He has two options: decode A's file and re-encode it as GBK like B's, or decode B's file and re-encode it as ASCII. ASCII cannot represent Chinese, so only the first option works, and A's file gets converted to GBK. Then one day D shows up, and D is Russian. GBK may not cover Russian, so what now? We urgently need one code that can hold all characters, and that is why Unicode appeared. Unicode divides its code space into 17 planes, numbered 0 to 16, and each plane has 2^16 = 65,536 code points. The Unicode code space therefore has 17 * 65,536 = 1,114,112 code points in total. With that many available, we no longer have to worry.
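The arithmetic above can be checked with a short sketch. It is written for Python 3 (the article itself is based on Python 2), and the constant names are only illustrative.

```python
import sys

# Assumed names for illustration; the figures come from the paragraph above.
PLANES = 17                      # planes 0 through 16
CODE_POINTS_PER_PLANE = 2 ** 16  # 65,536 code points in each plane

total_code_points = PLANES * CODE_POINTS_PER_PLANE
print(total_code_points)  # 1114112

# The highest code point is one less than the total, i.e. 0x10FFFF.
print(sys.maxunicode == total_code_points - 1)

# English, Chinese and Russian characters all have a place in the same set.
for ch in (u"A", u"中", u"П"):
    code_point = ord(ch)
    print(ch, hex(code_point), "plane", code_point >> 16)
```

All three sample characters fall in plane 0, which is where most everyday text lives.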

Python uses the Unicode character set internally as the intermediary for conversion: as long as a character can be found in your encoding scheme, its position in the Unicode character set can be found too. Unicode is therefore a good solution for converting between encoding schemes (such as GBK and UTF-8). Of course, for an encoding scheme to be converted through Unicode, it must have a one-to-one mapping to Unicode, and the mainstream schemes today, such as GBK, GB2312, and UTF-8, all have one.
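A minimal sketch of Unicode acting as the intermediary, written for Python 3. The helper name `transcode` and the sample mails are hypothetical; only the built-in `bytes.decode` and `str.encode` methods are used.

```python
def transcode(data, source_encoding, target_encoding):
    """Convert bytes between two schemes by passing through Unicode."""
    unicode_text = data.decode(source_encoding)   # bytes -> Unicode text
    return unicode_text.encode(target_encoding)   # Unicode text -> bytes


ascii_mail = b"hello"                # A's mail, ASCII bytes
gbk_mail = u"中国".encode("gbk")      # B's mail, GBK bytes

# Both mails can be brought to one common scheme.
utf8_from_ascii = transcode(ascii_mail, "ascii", "utf-8")
utf8_from_gbk = transcode(gbk_mail, "gbk", "utf-8")

# Going from GBK to ASCII fails, because ASCII has no Chinese characters.
try:
    transcode(gbk_mail, "gbk", "ascii")
    gbk_to_ascii_ok = True
except UnicodeEncodeError:
    gbk_to_ascii_ok = False

print(utf8_from_gbk.decode("utf-8"), gbk_to_ascii_ok)
```

The decode step is where the source scheme must be known; the encode step is where the target scheme must be able to hold every character.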

Knowing the basics, you can see why u'中国' cannot be stored in a file as it is: Unicode itself does not say how a character is stored. A value such as \u234e is just a hexadecimal number, and the screen does not know which glyph it corresponds to. Python therefore requires that a file be written as a byte stream. Unicode is a higher-level character stream that can represent all the characters defined in the world today, but it is only a character set. We just look up the position of a character in it, without worrying about whether the screen knows that character. Storing the character is the job of an encoding scheme such as UTF-8. Without an encoding scheme that can store these characters, we would have the character in Unicode but could not write it out. So we have to convert Unicode into an ordinary byte stream. Someone might ask: what if I really cannot find a suitable encoding scheme to store all the languages? In that case we can encode the text as unicode-escape, which I will not go into here.
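To make the difference between Unicode text and stored bytes concrete, here is a small Python 3 sketch. The file name `example.txt` and the temporary directory are assumptions for illustration; `io.open` with an `encoding` argument also exists in Python 2.

```python
import io
import os
import tempfile

text = u"中国"  # Unicode text: code points, not bytes

# Writing needs an encoding scheme to turn code points into bytes.
path = os.path.join(tempfile.mkdtemp(), "example.txt")  # hypothetical file
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(text)

# Reading in binary mode shows the byte stream that was actually stored.
with io.open(path, "rb") as handle:
    raw = handle.read()
print(raw)  # b'\xe4\xb8\xad\xe5\x9b\xbd'

# unicode-escape stores the code point numbers themselves as ASCII text.
escaped = text.encode("unicode_escape")
print(escaped)  # b'\\u4e2d\\u56fd'
print(escaped.decode("unicode_escape") == text)
```

The escaped form is pure ASCII, so it can be stored anywhere, at the cost of being unreadable until it is decoded again.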

This explains most of the UnicodeDecodeError and UnicodeEncodeError errors we run into: they happen because the character encoding scheme is not known. Many posts online say that when you hit this error you should just call encode or decode. But without the knowledge behind it, you will stay confused.
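Both errors can be reproduced in a few lines. This is a Python 3 sketch in which the wrong scheme is named explicitly; in Python 2 the same failures usually appear because the ASCII codec is applied implicitly when no scheme is given.

```python
text = u"中国"
gbk_bytes = text.encode("gbk")

# Encoding error: the target scheme cannot represent the characters.
try:
    text.encode("ascii")
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc

# Decoding error: the bytes are not valid in the scheme we assumed.
try:
    gbk_bytes.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

print(type(encode_error).__name__)
print(type(decode_error).__name__)

# With the correct scheme named, the round trip works.
print(gbk_bytes.decode("gbk") == text)
```

In both cases the fix is not "call encode" or "call decode" blindly, but to name the scheme that the bytes really use.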

Next I'll describe the error I ran into. It happened when crawling http://yjsy.ncu.edu.cn/yjs_showmsg.asp?id=2770 (a non-standard page that does not set a charset), because each spider calls the

    

The selector returns Unicode text, but what it receives is a byte stream, so the spider has probably called response.body.decode(response.encoding) to convert it. However, response.encoding is sometimes detected wrongly; for example, my GBK-encoded page was detected as cp1253. If we decode with the wrong scheme and then encode into some other scheme, we get garbled text. How can this be corrected? Take each content item out of the resulting list, use content.encode(response.encoding) to get back the original byte stream, and then decode that byte stream to Unicode with the correct encoding.
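The repair described above can be sketched without Scrapy itself. In this Python 3 sketch, `repair_text` and the sample data are hypothetical; the string "cp1253" stands in for a wrongly detected response.encoding, and "gbk" for the page's real scheme.

```python
def repair_text(garbled, wrong_encoding, right_encoding):
    """Undo a decode done with the wrong scheme, then decode correctly."""
    original_bytes = garbled.encode(wrong_encoding)  # back to the raw bytes
    return original_bytes.decode(right_encoding)     # decode with the real scheme


# Simulate the problem: GBK bytes decoded as cp1253 give garbled text.
page_bytes = u"中国".encode("gbk")
contents = [page_bytes.decode("cp1253")]   # what the selector would hand back

fixed = [repair_text(item, "cp1253", "gbk") for item in contents]
print(fixed[0] == u"中国")
```

This only works when the wrong decode kept every byte; if undecodable bytes were replaced or dropped along the way, the original bytes cannot be recovered, and decoding response.body directly with the correct scheme is the safer route.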

My project for this Scrapy crawler is on GitHub; the coding_pitch.py file contains the handling for this garbled text.

Nanchang University Office of Academic Affairs notice crawler
