Research on Python encoding: experience from using Scrapy

Source: Internet
Author: User


Python encoding and decoding

Scrapy (used here with Python 2) is a very lightweight crawler framework, but because it hides many details of the network request, we sometimes run into awkward bugs. Admittedly, these are mainly caused by nonstandard websites.

There are many articles online about Python encoding and decoding. If you are not familiar with the topic, the following references are a good place to start.

Ned Batchelder's explanation of Python's unicode and str types is easy to follow.

About scrapy

About encode

With the references above, we can get a good grasp of Python's encoding and decoding. Here I want to describe my own understanding. The C-family languages I first learned are basically strongly typed: in C, every parameter of a function must have a declared type. Python relaxes this. It is object-oriented and relies on duck typing: anything that walks and quacks like a duck can be treated as a duck, whether it is really a cat or a tiger. Loose typing suits a dynamic language, where we cannot be sure what the next line of code will do. Ever since I started learning Python, I have felt that it is not strict about types, which gave me the illusion that things that look alike can be treated alike. Take two strings such as '中国' and u'中国' (a byte string and a Unicode string, both holding the Chinese word for China). They look similar, but if you save u'中国' to a file without specifying an encoding, you get an error such as UnicodeEncodeError: 'ascii' codec can't encode characters in position 344-351: ordinal not in range(128), that is, a Unicode encoding error. To understand the Unicode character set and Unicode encodings, read this blog's article on character encoding. Python uses the Unicode character set to represent text internally. Why should we use the Unicode character set? Consider an example:
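The error described above can be reproduced in a few lines. This is a minimal sketch, assuming Python 3, where str holds Unicode text and bytes holds encoded data (in Python 2 these roles are played by unicode and str). The sample string and the helper name try_encode are illustrative and do not come from the original project.

```python
# Sample text: the two characters for "China" (U+4E2D, U+56FD).
text = "\u4e2d\u56fd"


def try_encode(s, encoding):
    """Return the encoded bytes, or the name of the error that was raised."""
    try:
        return s.encode(encoding)
    except UnicodeEncodeError as exc:
        return type(exc).__name__


# ASCII only covers code points 0-127, so these characters cannot be encoded.
ascii_result = try_encode(text, "ascii")

# UTF-8 can encode every Unicode character, so this succeeds.
utf8_result = try_encode(text, "utf-8")

print(ascii_result)
print(utf8_result)
```

Naming the encoding explicitly whenever text is written out is what avoids the implicit fallback to ASCII that triggers the error.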

A is a programmer in the United States who uploads an email in an ASCII-encoded file. B is a programmer in China who uploads an email in a GBK-encoded file. Now C wants to use one program to process both emails. There are two options: decode A's file and re-encode it as GBK, or decode B's file and re-encode it as ASCII. ASCII cannot represent Chinese, so only the first option works: convert A's file to GBK. But then D arrives. He is Russian, and GBK may not cover every character he needs. What now? We urgently need a single character set that holds all of these characters, and that is why Unicode emerged. The Unicode code space is divided into 17 planes, numbered 0 to 16, and each plane has 2^16 = 65536 code points, so Unicode has 17*65536 = 1114112 code points in total. With that much room, we no longer have to worry about running out.
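The arithmetic above can be checked directly. This is a small sketch, assuming Python 3.3 or later, where sys.maxunicode reports the highest code point the interpreter supports; the constant names are illustrative.

```python
import sys

PLANES = 17                      # planes are numbered 0 through 16
CODE_POINTS_PER_PLANE = 2 ** 16  # 65536

total_code_points = PLANES * CODE_POINTS_PER_PLANE
print(total_code_points)         # 1114112

# Code points are numbered from 0, so the highest one is total - 1.
highest_code_point = total_code_points - 1
print(hex(highest_code_point))   # 0x10ffff

# The interpreter's own limit matches the figure derived above.
print(sys.maxunicode == highest_code_point)
```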

Python uses the Unicode character set as a transfer station between encodings. As long as a character exists in your own encoding scheme, its position in the Unicode character set can be found, so Unicode resolves the problems caused by mixing encoding schemes (such as GBK and UTF-8). Of course, for Unicode to serve as the intermediate form for another encoding scheme, that scheme's characters must map one-to-one onto Unicode. The mainstream schemes, such as GBK, GB2312, and UTF-8, all satisfy this.
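The "transfer station" idea can be sketched as a two-step conversion: decode the source bytes into Unicode text, then encode that text into the target scheme. This sketch assumes Python 3; the function name transcode and the sample bytes are illustrative.

```python
def transcode(data, source_encoding, target_encoding):
    """Convert bytes from one encoding to another, going through Unicode."""
    text = data.decode(source_encoding)   # bytes -> Unicode text
    return text.encode(target_encoding)   # Unicode text -> bytes


# The GBK bytes for the two characters U+4E2D and U+56FD ("China").
gbk_bytes = b"\xd6\xd0\xb9\xfa"

utf8_bytes = transcode(gbk_bytes, "gbk", "utf-8")
print(utf8_bytes)   # b'\xe4\xb8\xad\xe5\x9b\xbd'
```

The two byte strings differ, but both decode to the same Unicode text, which is why Unicode works as the common intermediate form.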

With this background, you can understand why saving u'中国' fails. Unicode alone gives the output layer nothing it can use directly: a code point such as \u234e is just a hexadecimal number, and by itself it says nothing about which bytes to write or which glyph to draw on screen. Python therefore requires that a file be saved as a byte stream. Unicode is a higher-level character stream: it can hold every character defined in the world today, but it is only a character set. Each newly assigned character simply gets its own position, and we do not have to consider whether a screen knows the character. Storing the character is the job of an encoding scheme such as UTF-8. Without an encoding scheme, we cannot write or print a character even though it exists in Unicode, so Unicode text must be converted into an ordinary byte stream. Some will ask: what if I cannot find an encoding scheme that covers all the languages I need? In that case the text can be encoded with unicode-escape, which we will not discuss further here.
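The distinction between a code point and its stored bytes can be made concrete. This is a minimal sketch, assuming Python 3; the variable names are illustrative. It shows the same text as abstract code points, as bytes under two encoding schemes, and as the ASCII-only unicode_escape form mentioned above.

```python
text = "\u4e2d\u56fd"   # the two characters for "China"

# Code points are just numbers; they say nothing about storage.
code_points = [hex(ord(ch)) for ch in text]
print(code_points)      # ['0x4e2d', '0x56fd']

# Each encoding scheme turns the same code points into different bytes.
utf8_form = text.encode("utf-8")
gbk_form = text.encode("gbk")
print(utf8_form)        # b'\xe4\xb8\xad\xe5\x9b\xbd'
print(gbk_form)         # b'\xd6\xd0\xb9\xfa'

# unicode_escape writes each non-ASCII character as a \uXXXX sequence,
# so the result contains only ASCII bytes.
escaped_form = text.encode("unicode_escape")
print(escaped_form)     # b'\\u4e2d\\u56fd'

# Decoding the escaped bytes restores the original text.
restored = escaped_form.decode("unicode_escape")
```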

This explains most of the errors we encounter, UnicodeDecodeError and UnicodeEncodeError: both come from not knowing which character encoding is in play. Many people online say that when you hit such an error, you just need to try encode or decode until it goes away, but if you do not understand the knowledge behind it, you will stay confused.
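The decoding side of the problem looks like this. The sketch assumes Python 3, and the helper name try_decode is illustrative: bytes produced with one encoding are decoded with the wrong one, and then with the right one.

```python
# Bytes as they might arrive from a GBK-encoded page.
gbk_bytes = "\u4e2d\u56fd".encode("gbk")


def try_decode(data, encoding):
    """Return the decoded text, or the name of the error that was raised."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return type(exc).__name__


# These bytes are not a valid UTF-8 sequence, so decoding fails.
utf8_attempt = try_decode(gbk_bytes, "utf-8")

# Decoding with the encoding that actually produced the bytes succeeds.
gbk_attempt = try_decode(gbk_bytes, "gbk")

print(utf8_attempt)
print(gbk_attempt == "\u4e2d\u56fd")
```

A wrong codec does not always raise an error. Some single-byte codecs accept almost any byte and silently return the wrong characters, which is the harder case to notice.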

Next, let me describe my own mistake, which occurred on this page: http://yjsy.ncu.edu.cn/yjs_showmsg.asp?id=2770 (a nonstandard page that does not set a charset). Every spider calls

    response.xpath('//xpath').extract() 

The selector returns Unicode text, but what it receives is a response byte stream, so somewhere the bytes must be decoded, presumably with response.body.decode(response.encoding). However, response.encoding is sometimes detected wrongly: for example, a GBK-encoded page may be identified as cp1253. Decoding with the wrong encoding produces garbled characters, and re-encoding that garbled text with yet another scheme does not help. How can we correct it? First take the extracted content, then call content.encode(response.encoding) to turn it back into the original byte stream. Those bytes can then be converted to Unicode using the correct encoding.
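The repair described above can be sketched without Scrapy, using only the standard codecs. This is not the code from the project's coding_pitch.py; it assumes Python 3, and the function name repair_mojibake and the sample data are hypothetical. The wrongly decoded text is re-encoded with the wrongly detected encoding to recover the original bytes, which are then decoded with the correct encoding.

```python
def repair_mojibake(garbled, wrong_encoding, right_encoding):
    """Undo a decode that used the wrong encoding.

    Assumes the wrong decode was lossless, i.e. every byte mapped to a
    character that wrong_encoding can encode back to the same byte.
    """
    raw = garbled.encode(wrong_encoding)   # recover the original bytes
    return raw.decode(right_encoding)      # decode them correctly


# Simulate the bug: a GBK body that was decoded as cp1253.
original = "\u4e2d\u56fd"
body = original.encode("gbk")
garbled = body.decode("cp1253")   # Greek letters instead of Chinese

fixed = repair_mojibake(garbled, "cp1253", "gbk")
print(fixed == original)
```

This only works when the wrong decode kept every byte. If the wrong codec has no character for some byte, the first decode either fails or drops information, and the page body has to be decoded again from the raw bytes instead.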

The following is my Scrapy project on GitHub. The coding_pitch.py file handles this garbled text.

Nanchang University Academic Affairs Office announcement crawling

 
