Python Crawler Encoding Problems

Source: Internet
Author: User
Tags: utf8, decode, CPU usage, Python script

Crawler novices easily run into encoding and decoding problems, which are summarized here.

If encoding and decoding are not handled properly, a crawler at best produces garbled text and at worst raises an error such as UnicodeDecodeError: 'xxxxxx' codec can't decode byte 0xc6 in position 1034: invalid continuation byte, where xxxxxx may be ascii, utf8, gbk, and so on.

You should set aside dedicated time to learn this area; there are plenty of resources online. Encoding and decoding rarely matters to a program's core logic, but almost every program runs into it, so it is worth devoting time to study and practice in order to avoid frequent trouble.

1. First, pick two URLs: one GB2312 page and one UTF8 page.

For UTF8, choose https://www.baidu.com (Baidu).

For GB2312, choose http://www.autohome.com.cn/beijing/#pvareaid=100519 (Autohome).

import requests

url1 = 'https://www.baidu.com'
url2 = 'http://www.autohome.com.cn/beijing/#pvareaid=100519'
contentx = requests.get(url2).content

print unicode(contentx)
print contentx.encode('gbk')
print contentx.encode('utf8')

Let's first talk about requesting url1.

Consider the code above. If you request url1, it will not raise an error on py2.7.13, but it will on py2.7.12. Whether the printed results appear garbled in the PyCharm console I cannot guarantee: PyCharm has a Project Encoding setting, and whether you set it to UTF8 or GBK, either the second or the third print is bound to display garbled, since one prints GBK bytes and the other UTF8 bytes. Likewise, the same py code that displays normally with the PyCharm editor set to UTF8 will inevitably show garbled results when you run python xx.py from cmd.


The above describes py2.7.13. On 2.7.12 the outcome is different: the code above does not merely print garbled output, it raises an error directly.

Run either of the following lines on 2.7.12:

print requests.get(url1).content.encode('gbk')

or

unicode(requests.get(url1).content)

and you will get the UnicodeDecodeError prompt shown earlier.

This is because converting a byte string directly to another encoding makes Python 2 first decode it with the default encoding, and only then encode it in the format you specified.
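As a minimal sketch of that implicit decode (assuming Python 2 with the usual ascii default encoding; utf8_bytes here just simulates UTF8 page content):

# -*- coding: utf-8 -*-
import sys

utf8_bytes = u'中文'.encode('utf8')   # stand-in for UTF8 page content

# In Python 2 these two lines are equivalent; the hidden decode is what fails:
#   utf8_bytes.encode('gbk')
#   utf8_bytes.decode(sys.getdefaultencoding()).encode('gbk')
try:
    utf8_bytes.encode('gbk')
except UnicodeDecodeError as e:
    print 'implicit decode failed:', e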


You can run these lines in a Python script to check the default encoding:

import sys
print sys.getdefaultencoding()





The py2.7.13 printing result is utf8, while the py2.7.12 printing result is ascii. The url1 page is encoded in UTF8, so decoding it with ascii raises the error.

To stop py2.7.12 from raising this error, you can add the classic lines below:

import sys
reload(sys)
sys.setdefaultencoding('utf8')

After adding them, the ascii codec can't decode hint no longer appears. If you do not want to add them but still want to encode the content to GBK, you can do the following instead: change

print requests.get(url1).content.encode('gbk')

to

print requests.get(url1).content.decode('utf8').encode('gbk')

so that the content is not decoded with the default encoding, and the reload and setdefaultencoding above are unnecessary.

These three lines of code are not a cure-all, though; do not assume that writing them means you will never hit an encoding problem again. If you request url2, the Autohome page is in GBK format. If you use print requests.get(url2).content.encode('utf8'), you get the same kind of error: the content string is implicitly decoded to unicode with the utf8 default before being re-encoded, and GBK bytes do not decode as UTF8. This time you would need sys.setdefaultencoding('gbk'), not utf8.
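A minimal sketch of that reverse failure, again assuming Python 2 (gbk_bytes simulates GBK page content):

# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf8')

gbk_bytes = u'北京'.encode('gbk')   # stand-in for GBK page content

try:
    gbk_bytes.encode('utf8')   # implicitly decode('utf8') first, which fails on GBK bytes
except UnicodeDecodeError as e:
    print 'still fails:', e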

But your program will probably request both url1 and url2, and then no matter what you set the default to, if you do not explicitly specify how to decode the content, one of the requests will raise an error. So the best way to solve this problem is:

print requests.get(url1).content.decode('utf8').encode('xxxx')
print requests.get(url2).content.decode('gbk').encode('xxxx')

where xxxx stands for whatever encoding you want to encode to.

This approach is very robust: whether you are on 2.7.12 or 2.7.13, and whatever you set sys.setdefaultencoding or the default encoding to, there will be no encoding problem.
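Wrapped up as a small sketch (fetch_as is a hypothetical helper name, assuming the requests package and the page encodings identified above):

# -*- coding: utf-8 -*-
import requests

def fetch_as(url, page_encoding, target_encoding='utf8'):
    # decode with the page's real encoding, then encode to whatever you need;
    # the default encoding never gets a chance to interfere
    raw = requests.get(url).content
    return raw.decode(page_encoding).encode(target_encoding)

print fetch_as('https://www.baidu.com', 'utf8', 'gbk')
print fetch_as('http://www.autohome.com.cn/beijing/#pvareaid=100519', 'gbk', 'utf8')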





Part Two: How to tell what encoding a webpage uses

The 360 Browser's right-click menu conveniently includes an encoding viewer. You can try switching a GBK page to UTF8, or a UTF8 page to GBK, and the browser will display garbled text. What the 360 right-click menu reports is the page's real encoding in 99% of cases; in my years of using 360 I have rarely seen it get this wrong.

Besides the browser's encoding viewer, another way is to right-click and view the page source. Take Baidu: you can see that it uses UTF8, so decoding the response content with UTF8 is no problem. If the page source's charset is gb2312, decoding with GBK will do it.


At present I am doing public opinion analysis and need to crawl news from tens of thousands of websites, so checking each one with a browser is obviously impossible, and what encoding a page uses is unknown in advance. For directed crawling of known sites you can hard-code which format to decode with.

You can use the following line to extract the page's declared encoding from its source:

re.findall('<meta[\s\S]*?charset="?(.*?)"', content, re.IGNORECASE)[0]
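A minimal sketch of that approach, assuming Python 2 and the requests package (sniff_charset and the utf8 fallback are my own hypothetical additions; a page may declare no charset at all):

# -*- coding: utf-8 -*-
import re
import requests

def sniff_charset(content, default='utf8'):
    # look for charset=... inside a <meta> tag; fall back if nothing matches
    found = re.findall(r'<meta[\s\S]*?charset="?(.*?)"', content, re.IGNORECASE)
    return found[0] if found else default

content = requests.get('https://www.baidu.com').content
print sniff_charset(content)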


A colleague once introduced a package for detecting encodings called chardet; the usage is chardet.detect(content). This method is very accurate, but its drawback is that it is too heavy: detection takes a long time of CPU computation, CPU usage is very high, and the whole crawler's speed is dragged down. When the page content is large, chardet is painful to use; detecting the encoding of a single page can even take 15 seconds.


To see what chardet is doing for so long, print out its logs.
The code is pasted below; anyone interested in the package can take a look.


# coding=utf8
import requests
import chardet
import logging

logger = logging.getLogger('')
logger.setLevel(logging.DEBUG)
stream_handler = logging.StreamHandler()
stream_handler.setLevel(logging.DEBUG)
fmt = logging.Formatter('%(asctime)s-%(name)s-%(levelname)s-%(message)s')
stream_handler.setFormatter(fmt)
logger.addHandler(stream_handler)

url1 = 'https://www.baidu.com'
url2 = 'http://www.autohome.com.cn/beijing/#pvareaid=100519'

contentx = requests.get(url2).content
bianma = chardet.detect(contentx)
print bianma
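As a usage note (my suggestion, not something the original script does): a common way to cut chardet's cost is to feed it only a prefix of the body instead of the whole page:

# -*- coding: utf-8 -*-
import requests
import chardet

content = requests.get('https://www.baidu.com').content
# detecting on the first few KB is usually enough for a confident guess
print chardet.detect(content[:4096])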

After reading this article, you should no longer run into encoding problems.


