On character encoding when crawling web pages with Python
Background
During the Mid-Autumn Festival, a friend emailed me saying that while crawling Lianjia he found the pages the site returned were garbled, and asked me for advice (working overtime during the Mid-Autumn Festival, truly dedicated!). I actually ran into this problem long ago: I glanced at it while crawling a novel site but never took it seriously. At bottom, the problem comes from a poor understanding of character encoding.
Problem
A typical crawler looks like this:
# coding=utf-8
import re
import requests
import sys

reload(sys)
sys.setdefaultencoding('utf8')

url = 'http://jb51.net/ershoufang/rs%E6%8B%9B%E5%95%86%E6%9E%9C%E5%B2%AD/'
res = requests.get(url)
print res.text
The goal is simple: fetch the Lianjia listings page. But after running this, everything in the response that involves Chinese comes back garbled, for example:
<script type="text/template" id="newAddHouseTpl"> <div class="newAddHouse"> 自从您上次æµè§ˆï¼ˆ<%=time%>)之åŽï¼Œè¯¥æœç´¢æ¡ä»¶ä¸‹æ–°å¢žåŠ 了<%=count%>å¥—æˆ¿æº <a href="<%=url%>" class="LOGNEWERSHOUFANGSHOW" <%=logText%>><%=linkText%></a> <span class="newHouseRightClose">x</span> </div></script>
Such data is useless.
Problem Analysis
The problem is obvious: the text is being decoded with the wrong encoding, which produces the garbled characters.
View the webpage code
Looking at the header of the target page, the page declares itself as UTF-8:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Therefore, the text must ultimately be handled as UTF-8. In other words, the final processing step should decode the bytes with UTF-8, i.e. decode('utf-8').
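Since the page declares UTF-8, the raw bytes must be decoded with the 'utf-8' codec. A minimal, self-contained sketch (the byte string below is simply the UTF-8 encoding of two Chinese characters, chosen for illustration):

```python
# -*- coding: utf-8 -*-
# The UTF-8 bytes for the two characters u'\u4e2d\u6587' ("Chinese").
raw = b'\xe4\xb8\xad\xe6\x96\x87'

# Decoding with the codec the page declares yields the correct text.
text = raw.decode('utf-8')
assert text == u'\u4e2d\u6587'
```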
Text encoding and decoding
Python's encoding/decoding flow works like this: raw bytes ==> decode(source encoding) ==> Unicode ==> encode(target encoding). Hard-coding the default encoding with

import sys

reload(sys)
sys.setdefaultencoding('utf8')

is not particularly recommended. It is a lazy shortcut that causes no great harm in some situations, but the better practice is to decode and encode the text explicitly after fetching the source.
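The explicit alternative looks like this (a sketch only; the byte string stands in for whatever the crawler actually fetched):

```python
# -*- coding: utf-8 -*-
# Pretend these bytes arrived off the wire from a UTF-8 page.
fetched = u'\u4e2d\u6587'.encode('utf-8')

text = fetched.decode('utf-8')   # bytes -> Unicode: do this once, on input
out = text.encode('utf-8')       # Unicode -> bytes: do this once, on output
assert out == fetched
```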
Back to question
The big question now is the encoding of the source file. When we use requests normally, it automatically guesses the source encoding and transcodes the body to Unicode. But it is, after all, only a guess, and it can guess wrong; when it does, we need to specify the encoding manually. The official documentation describes it as follows:
When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property.
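The effect of r.encoding can be demonstrated offline by building a Response object by hand. Note this pokes at requests internals (the _content attribute) purely for illustration; it is an assumption-laden sketch, not a supported way to use the library:

```python
# -*- coding: utf-8 -*-
import requests

resp = requests.models.Response()
resp._content = u'\u4e2d\u6587'.encode('utf-8')  # pretend this is the body

resp.encoding = 'ISO-8859-1'  # what requests may guess from the headers
garbled = resp.text           # decoded with the wrong codec: mojibake

resp.encoding = 'utf-8'       # override the guess
fixed = resp.text             # now decodes correctly

assert fixed == u'\u4e2d\u6587'
assert garbled != fixed
```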
So let's check which encoding requests has detected:
# coding=utf-8
import re
import requests
from bs4 import BeautifulSoup
import sys

reload(sys)
sys.setdefaultencoding('utf8')

url = 'http://jb51.net/ershoufang/rs%E6%8B%9B%E5%95%86%E6%9E%9C%E5%B2%AD/'
res = requests.get(url)
print res.encoding
The output is as follows:
ISO-8859-1
In other words, requests decided the source was encoded as ISO-8859-1 (it falls back to ISO-8859-1 when the Content-Type header is text/* with no explicit charset, following the old HTTP/1.1 default). A quick search for ISO-8859-1 turns up:
ISO-8859-1, usually called Latin-1, extends ASCII with the additional characters indispensable for writing the Western European languages.
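Because Latin-1 assigns a character to every one of the 256 byte values, decoding UTF-8 bytes as ISO-8859-1 never raises an error; it just quietly produces gibberish. A small sketch of why the crawl "succeeds" yet prints mojibake:

```python
# -*- coding: utf-8 -*-
word = u'\u4e2d\u6587'             # two Chinese characters
utf8_bytes = word.encode('utf-8')  # what the server actually sent (6 bytes)

# Wrong codec: no exception is raised, because every byte maps to *some*
# Latin-1 character -- the result is simply one garbage character per byte.
mojibake = utf8_bytes.decode('iso-8859-1')
assert mojibake != word
assert len(mojibake) == len(utf8_bytes)
```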
Problem Solving
Once the cause is found, the fix is easy: just specify the correct encoding and the Chinese characters print properly. The code is as follows:
# coding=utf-8
import requests
import sys

reload(sys)
sys.setdefaultencoding('utf8')

url = 'http://jb51.net/ershoufang/rs%E6%8B%9B%E5%95%86%E6%9E%9C%E5%B2%AD/'
res = requests.get(url)
res.encoding = 'utf8'
print res.text
This time the printed text displays correctly.
Another approach is to re-encode and then decode the fetched text ourselves. The code is as follows:
# coding=utf-8
import requests
import sys

reload(sys)
sys.setdefaultencoding('utf8')

url = 'http://jb51.net/ershoufang/rs%E6%8B%9B%E5%95%86%E6%9E%9C%E5%B2%AD/'
res = requests.get(url)
# res.encoding = 'utf8'
print res.text.encode('ISO-8859-1').decode('utf-8')
Note: ISO-8859-1 is also called Latin-1, so using 'latin1' as the codec name works just as well.
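This encode-then-decode trick works because Latin-1 decoding is lossless byte for byte: encoding the mojibake back to Latin-1 restores the original UTF-8 bytes exactly, and a second decode with the right codec then fixes the text. A sketch of the round trip:

```python
# -*- coding: utf-8 -*-
word = u'\u4e2d\u6587'

# What happened inside requests: UTF-8 bytes decoded with the wrong codec.
mojibake = word.encode('utf-8').decode('iso-8859-1')

# Undo it: Latin-1 round-trips every byte unchanged, so this recovers the
# original UTF-8 bytes, which the correct codec then decodes properly.
recovered = mojibake.encode('iso-8859-1').decode('utf-8')
assert recovered == word
```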
A great deal more could be said about character encoding; for more information, see the following reference:
• The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
That is all for this discussion of encoding when crawling web pages with Python. I hope it gives you a useful reference.