Newcomers to web crawling run into encoding and decoding problems all the time; this post sums them up.
If encoding and decoding are not handled well, at best the crawler produces garbled text, and at worst it dies with UnicodeDecodeError: 'xxxxxx' codec can't decode byte 0xc6 in position 1034: invalid continuation byte, where the xxxxxx may be ascii, utf8, gbk, and so on.
It is worth setting aside dedicated time to study this area; there are plenty of resources online. Encoding and decoding have little to do with a program's overall logic, but almost every program runs into them, so you have to invest time in learning and practicing to avoid being tripped up again and again.
Part One: pick two URLs, one GB2312 page and one UTF8 page.
For UTF8, use Baidu: https://www.baidu.com
For GB2312, use Autohome: http://www.autohome.com.cn/beijing/#pvareaid=100519
# coding=utf8
import requests

url1 = 'https://www.baidu.com'
url2 = 'http://www.autohome.com.cn/beijing/#pvareaid=100519'
contentx = requests.get(url2).content
print unicode(contentx)        # decodes the bytes with the default encoding
print contentx.encode('gbk')   # implicitly decodes first, then encodes to gbk
print contentx.encode('utf8')  # implicitly decodes first, then encodes to utf8
Let's first talk about requesting url1 (swap url2 for url1 in the snippet above).
With url1, the code above will not raise an error on py2.7.13, but it will on py2.7.12. Whether the result printed in the PyCharm console comes out garbled, I can't promise either way: PyCharm's settings include a Project Encoding option, and whether you set it to UTF8 or to GBK, one of the second and third prints is bound to display as mojibake, because those two lines produce bytes in different encodings. Likewise, if the PyCharm editor is set to UTF8 and the same script displays normally there, running python xx.py from cmd will inevitably show garbled output.
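Incidentally, you can ask Python what encoding it believes the current console uses. This is a quick diagnostic sketch; the values in the comment are typical, not guaranteed:

import sys
# Typically cp936 (GBK) in a Chinese-locale cmd window, the IDE's
# configured encoding in a PyCharm console, and None when stdout
# is redirected to a pipe or file.
print sys.stdout.encoding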
All of the above describes py2.7.13. If you are on 2.7.12 the outcome is different: the code does not merely print mojibake, it raises an error outright.
Run either of these lines on 2.7.12:

print requests.get(url1).content.encode('gbk')

or

unicode(requests.get(url1).content)

and you will be greeted with a UnicodeDecodeError: the 'ascii' codec can't decode the content.
The reason is that converting a byte string straight to another encoding makes Python decode it with the default encoding first, and only then encode it in the format you specified.
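You can see the hidden decode step directly. Here is a minimal sketch, assuming the default encoding is ascii as on 2.7.12:

raw = '\xe4\xb8\xad'  # the UTF8 byte sequence of one Chinese character
# On Python 2, raw.encode('gbk') behaves like
#   raw.decode(sys.getdefaultencoding()).encode('gbk')
# so it is the hidden decode, not the encode, that fails:
try:
    raw.encode('gbk')
except UnicodeDecodeError as e:
    print e  # 'ascii' codec can't decode byte 0xe4 in position 0 ...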
You can check what that default is by running this in a Python script:

import sys
print sys.getdefaultencoding()
On py2.7.13 this prints utf8, while on py2.7.12 it prints ascii. url1's page encoding is UTF8, and decoding UTF8 bytes with ascii raises the error.
To stop py2.7.12 from reporting this error, you can add the classic snippet:

import sys
print sys.getdefaultencoding()  # ascii before the switch
reload(sys)                     # reload restores setdefaultencoding, which site.py removes at startup
sys.setdefaultencoding('utf8')
print sys.getdefaultencoding()  # utf8 afterwards

Once it is added, the 'ascii' codec can't decode hint no longer appears. If you don't want to add it but still want the content encoded to GBK, change

print requests.get(url1).content.encode('gbk')

into

print requests.get(url1).content.decode('utf8').encode('gbk')

so the bytes are decoded explicitly instead of with the default encoding, and the reload and setdefaultencoding lines above become unnecessary.
These three lines are no panacea, though; don't imagine that once you write them you will never hit an encoding problem again. If you request url2, the Autohome page is in GBK format, and print requests.get(url2).content.encode('utf8') fails in just the same way: the content bytes get decoded to unicode with the default encoding first, which chokes on GBK bytes, so it errors. For url2 you would need sys.setdefaultencoding('gbk'), not utf8.
But your program presumably requests both url1 and url2, and then no matter what you set the default to, one of the two will error unless you explicitly state how each content should be decoded. So the best practice is to do exactly that:
print requests.get(url1).content.decode('utf8').encode('xxxx')  # xxxx is whatever encoding you want
print requests.get(url2).content.decode('gbk').encode('xxxx')
This approach is very robust: whether you run 2.7.12 or 2.7.13, and whatever you pass to sys.setdefaultencoding (or nothing at all), there will be no encoding problem.
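If you want the pattern in one place, a small helper works. This is only an illustrative sketch; fetch_recoded and its parameters are names I made up, not anything from requests:

import requests

def fetch_recoded(url, page_encoding, target_encoding='utf8'):
    # Decode with the page's real encoding, then re-encode into
    # whatever the rest of the program expects; the interpreter's
    # default encoding never gets involved.
    content = requests.get(url).content
    return content.decode(page_encoding).encode(target_encoding)

print fetch_recoded('https://www.baidu.com', 'utf8')
print fetch_recoded('http://www.autohome.com.cn/beijing/#pvareaid=100519', 'gbk')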
Part Two: How to know the encoding of a webpage
The 360 Browser makes this easy: its right-click menu has an encoding item where you can see and switch the page encoding. Try switching GBK to UTF8 or UTF8 to GBK and the page turns to mojibake in the browser. The encoding shown by 360's right-click menu is right in 99% of cases; in all my years of using 360 I have rarely seen it get a page's encoding wrong.
Besides the browser menu, you can also right-click and view the page source. Take Baidu: the source shows it uses utf-8, so decoding the response content with UTF8 is no problem; if the source's charset says gb2312, decoding with GBK does the job.
These days I do public-opinion analysis, crawling news from tens of thousands of websites. Checking each page in a browser is obviously impossible, and what pages will turn up is unknown in advance. For a targeted crawl you can hard-code the decode format in your code; for everything else you need to read it from the page.
You can grab a page's declared encoding with a line like this: re.findall('<meta[\s\S]*?charset="?(.*?)"', content, re.IGNORECASE)[0]
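Wrapped with a fallback, that one-liner might look like the sketch below. guess_charset, the utf8 default, and the gb2312-to-gbk mapping are my own choices, and the regex, like the post's one-liner, assumes the charset value is quoted:

import re
import requests

def guess_charset(content, default='utf8'):
    # Read charset="..." out of the meta tags; fall back when absent.
    found = re.findall('<meta[\s\S]*?charset="?(.*?)"', content, re.IGNORECASE)
    return found[0].strip().lower() if found else default

content = requests.get('http://www.autohome.com.cn/beijing/').content
charset = guess_charset(content)
# gb2312 is a subset of gbk, so gbk decodes gb2312 pages safely
text = content.decode('gbk' if charset in ('gb2312', 'gbk') else charset)
print charset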
A colleague once recommended a package for detecting encodings called chardet; the usage is chardet.detect(content). It is very accurate, but its drawback is heavy: detection is long-running CPU work, CPU usage spikes, and the whole crawler's speed gets dragged down. When the page content is large, chardet is a bad fit; detecting a single page's encoding can take as long as 15 seconds.
To see what chardet spends all that time on, turn on logging and print its messages. The code is pasted below for anyone curious about the package.
# coding=utf8
import requests
import chardet
import logging

# route chardet's DEBUG messages to the console
logger = logging.getLogger('')
logger.setLevel(logging.DEBUG)
stream_handler = logging.StreamHandler()
stream_handler.setLevel(logging.DEBUG)
fmt = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
stream_handler.setFormatter(fmt)
logger.addHandler(stream_handler)

url1 = 'https://www.baidu.com'
url2 = 'http://www.autohome.com.cn/beijing/#pvareaid=100519'
contentx = requests.get(url2).content
bianma = chardet.detect(contentx)
print bianma
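If you want chardet's accuracy without paying for a scan of the whole page, one workaround (my suggestion, not something from the original post) is chardet's incremental UniversalDetector, which lets you stop feeding bytes as soon as it is confident:

import requests
from chardet.universaldetector import UniversalDetector

def detect_encoding(content, chunk_size=1024):
    # Feed the page in chunks and bail out early once the
    # detector reports that it is confident.
    detector = UniversalDetector()
    for i in range(0, len(content), chunk_size):
        detector.feed(content[i:i + chunk_size])
        if detector.done:
            break
    detector.close()
    return detector.result  # e.g. {'encoding': 'utf-8', 'confidence': 0.99}

print detect_encoding(requests.get('https://www.baidu.com').content)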
After reading this article, you should be able to steer clear of these encoding problems.