Newcomers to web crawling run into encoding and decoding problems all the time; this post sums them up.
If encoding and decoding are not handled well, at best the crawler produces garbled text, and at worst it dies with UnicodeDecodeError: 'xxxxxx' codec can't decode byte 0xc6 in position 1034: invalid continuation byte, where the xxxxxx may be ascii, utf8, gbk, and so on.
It is worth setting aside dedicated time to study this area; there are plenty of resources online. Encoding and decoding have little to do with a program's overall logic, but almost every program runs into them, so you have to invest time in learning and practicing to avoid being tripped up again and again.
Part One: pick two URLs, one GB2312 page and one UTF8 page.
For UTF8, use Baidu: https://www.baidu.com
For GB2312, use Autohome: http://www.autohome.com.cn/beijing/#pvareaid=100519
# coding=utf8
import requests

url1 = 'https://www.baidu.com'
url2 = 'http://www.autohome.com.cn/beijing/#pvareaid=100519'
contentx = requests.get(url2).content
print unicode(contentx)        # decodes the bytes with the default encoding
print contentx.encode('gbk')   # implicitly decodes first, then encodes to gbk
print contentx.encode('utf8')  # implicitly decodes first, then encodes to utf8
Let's first talk about requesting url1 (swap url2 for url1 in the snippet above).
With url1, the code above will not raise an error on py2.7.13, but it will on py2.7.12. Whether the result printed in the PyCharm console comes out garbled, I can't promise either way: PyCharm's settings include a Project Encoding option, and whether you set it to UTF8 or to GBK, one of the second and third prints is bound to display as mojibake, because those two lines produce bytes in different encodings. Likewise, if the PyCharm editor is set to UTF8 and the same script displays normally there, running python xx.py from cmd will inevitably show garbled output.
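Incidentally, you can ask Python what encoding it believes the current console uses. This is a quick diagnostic sketch; the values in the comment are typical, not guaranteed:

import sys
# Typically cp936 (GBK) in a Chinese-locale cmd window, the IDE's
# configured encoding in a PyCharm console, and None when stdout
# is redirected to a pipe or file.
print sys.stdout.encoding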
All of the above describes py2.7.13. If you are on 2.7.12 the outcome is different: the code does not merely print mojibake, it raises an error outright.
Run either of these lines on 2.7.12:

print requests.get(url1).content.encode('gbk')

or

unicode(requests.get(url1).content)

and you will be greeted with a UnicodeDecodeError: the 'ascii' codec can't decode the content.
The reason is that converting a byte string straight to another encoding makes Python decode it with the default encoding first, and only then encode it in the format you specified.
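You can see the hidden decode step directly. Here is a minimal sketch, assuming the default encoding is ascii as on 2.7.12:

raw = '\xe4\xb8\xad'  # the UTF8 byte sequence of one Chinese character
# On Python 2, raw.encode('gbk') behaves like
#   raw.decode(sys.getdefaultencoding()).encode('gbk')
# so it is the hidden decode, not the encode, that fails:
try:
    raw.encode('gbk')
except UnicodeDecodeError as e:
    print e  # 'ascii' codec can't decode byte 0xe4 in position 0 ...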
You can check what that default is by running this in a Python script:

import sys
print sys.getdefaultencoding()
On py2.7.13 this prints utf8, while on py2.7.12 it prints ascii. url1's page encoding is UTF8, and decoding UTF8 bytes with ascii raises the error.
To stop py2.7.12 from reporting this error, you can add the classic snippet:

import sys
print sys.getdefaultencoding()  # ascii before the switch
reload(sys)                     # reload restores setdefaultencoding, which site.py removes at startup
sys.setdefaultencoding('utf8')
print sys.getdefaultencoding()  # utf8 afterwards

Once it is added, the 'ascii' codec can't decode hint no longer appears. If you don't want to add it but still want the content encoded to GBK, change

print requests.get(url1).content.encode('gbk')

into

print requests.get(url1).content.decode('utf8').encode('gbk')

so the bytes are decoded explicitly instead of with the default encoding, and the reload and setdefaultencoding lines above become unnecessary.
These three lines are no panacea, though; don't imagine that once you write them you will never hit an encoding problem again. If you request url2, the Autohome page is in GBK format, and print requests.get(url2).content.encode('utf8') fails in just the same way: the content bytes get decoded to unicode with the default encoding first, which chokes on GBK bytes, so it errors. For url2 you would need sys.setdefaultencoding('gbk'), not utf8.
But your program presumably requests both url1 and url2, and then no matter what you set the default to, one of the two will error unless you explicitly state how each content should be decoded. So the best practice is to do exactly that:
print requests.get(url1).content.decode('utf8').encode('xxxx')  # xxxx is whatever encoding you want
print requests.get(url2).content.decode('gbk').encode('xxxx')
This approach is very robust: whether you run 2.7.12 or 2.7.13, and whatever you pass to sys.setdefaultencoding (or nothing at all), there will be no encoding problem.
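If you want the pattern in one place, a small helper works. This is only an illustrative sketch; fetch_recoded and its parameters are names I made up, not anything from requests:

import requests

def fetch_recoded(url, page_encoding, target_encoding='utf8'):
    # Decode with the page's real encoding, then re-encode into
    # whatever the rest of the program expects; the interpreter's
    # default encoding never gets involved.
    content = requests.get(url).content
    return content.decode(page_encoding).encode(target_encoding)

print fetch_recoded('https://www.baidu.com', 'utf8')
print fetch_recoded('http://www.autohome.com.cn/beijing/#pvareaid=100519', 'gbk')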
Part Two: How to know the encoding of a webpage
The 360 Browser makes this easy: its right-click menu has an encoding item where you can see and switch the page encoding. Try switching GBK to UTF8 or UTF8 to GBK and the page turns to mojibake in the browser. The encoding shown by 360's right-click menu is right in 99% of cases; in all my years of using 360 I have rarely seen it get a page's encoding wrong.
Besides the browser menu, you can also right-click and view the page source. Take Baidu: the source shows it uses utf-8, so decoding the response content with UTF8 is no problem; if the source's charset says gb2312, decoding with GBK does the job.
These days I do public-opinion analysis, crawling news from tens of thousands of websites. Checking each page in a browser is obviously impossible, and what pages will turn up is unknown in advance. For a targeted crawl you can hard-code the decode format in your code; for everything else you need to read it from the page.
You can grab a page's declared encoding with a line like this: re.findall('<meta[\s\S]*?charset="?(.*?)"', content, re.IGNORECASE)[0]
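Wrapped with a fallback, that one-liner might look like the sketch below. guess_charset, the utf8 default, and the gb2312-to-gbk mapping are my own choices, and the regex, like the post's one-liner, assumes the charset value is quoted:

import re
import requests

def guess_charset(content, default='utf8'):
    # Read charset="..." out of the meta tags; fall back when absent.
    found = re.findall('<meta[\s\S]*?charset="?(.*?)"', content, re.IGNORECASE)
    return found[0].strip().lower() if found else default

content = requests.get('http://www.autohome.com.cn/beijing/').content
charset = guess_charset(content)
# gb2312 is a subset of gbk, so gbk decodes gb2312 pages safely
text = content.decode('gbk' if charset in ('gb2312', 'gbk') else charset)
print charset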
A colleague once recommended a package for detecting encodings called chardet; the usage is chardet.detect(content). It is very accurate, but its drawback is heavy: detection is long-running CPU work, CPU usage spikes, and the whole crawler's speed gets dragged down. When the page content is large, chardet is a bad fit; detecting a single page's encoding can take as long as 15 seconds.
To see what chardet spends all that time on, turn on logging and print its messages. The code is pasted below for anyone curious about the package.
# coding=utf8
import requests
import chardet
import logging

# route chardet's DEBUG messages to the console
logger = logging.getLogger('')
logger.setLevel(logging.DEBUG)
stream_handler = logging.StreamHandler()
stream_handler.setLevel(logging.DEBUG)
fmt = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
stream_handler.setFormatter(fmt)
logger.addHandler(stream_handler)

url1 = 'https://www.baidu.com'
url2 = 'http://www.autohome.com.cn/beijing/#pvareaid=100519'
contentx = requests.get(url2).content
bianma = chardet.detect(contentx)
print bianma
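If you want chardet's accuracy without paying for a scan of the whole page, one workaround (my suggestion, not something from the original post) is chardet's incremental UniversalDetector, which lets you stop feeding bytes as soon as it is confident:

import requests
from chardet.universaldetector import UniversalDetector

def detect_encoding(content, chunk_size=1024):
    # Feed the page in chunks and bail out early once the
    # detector reports that it is confident.
    detector = UniversalDetector()
    for i in range(0, len(content), chunk_size):
        detector.feed(content[i:i + chunk_size])
        if detector.done:
            break
    detector.close()
    return detector.result  # e.g. {'encoding': 'utf-8', 'confidence': 0.99}

print detect_encoding(requests.get('https://www.baidu.com').content)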
After reading this article, you should be able to steer clear of these encoding problems.