Character encoding when crawling web pages with Python

Source: Internet
Author: User

Background

During the Mid-Autumn Festival, a friend emailed me to say that while he was crawling a real-estate listings site, the page content he got back was garbled, and he asked me for advice (working overtime during the Mid-Autumn Festival, truly dedicated!). In fact, I ran into this problem long ago. I looked into it briefly when I was crawling a novel site, but I didn't take it seriously. The root cause is a poor understanding of character encoding.

Problem

A typical crawler script looks like this:

# coding=utf-8
# Python 2 script: fetch the listing page and print the decoded text.
import re
import requests
import sys

reload(sys)
sys.setdefaultencoding('utf8')

url = 'http://jb51.net/ershoufang/rs%E6%8B%9B%E5%95%86%E6%9E%9C%E5%B2%AD/'
res = requests.get(url)
print res.text

The goal is simple: crawl the listings from the Lianjia real-estate site. However, when this script runs, all the Chinese content in the returned result comes out garbled, for example:

<script type="text/template" id="newAddHouseTpl"> <div class="newAddHouse">  自从您上次浏览(<%=time%>ï¼‰ä¹‹åŽï¼Œè¯¥æœç´¢æ¡ä»¶ä¸‹æ–°å¢žåŠ äº†<%=count%>套房源  <a href="<%=url%>" class="LOGNEWERSHOUFANGSHOW" <%=logText%>><%=linkText%></a>  <span class="newHouseRightClose">x</span> </div></script>

Such data is useless.

Problem Analysis

The problem is clear: the text is being decoded with the wrong encoding, which produces garbled characters.
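To make the cause concrete, here is a minimal Python 3 sketch (an assumption: the article's own scripts are Python 2) that reproduces this kind of garbling offline. The sample string is only an illustration; no request is made.

```python
# Python 3 sketch: reproduce the garbling by decoding UTF-8 bytes
# with the wrong codec. The sample text is illustrative only.
original = "套房源"
raw_bytes = original.encode("utf-8")      # the bytes a UTF-8 page would send

wrong = raw_bytes.decode("iso-8859-1")    # decoded with the wrong codec
right = raw_bytes.decode("utf-8")         # decoded with the correct codec

print(repr(wrong))   # one Latin-1 character per byte, unreadable
print(right)         # the original Chinese text
```

Each Chinese character here occupies three bytes in UTF-8, so the wrongly decoded string has three meaningless characters for every original one.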

Check the webpage encoding

According to the head section of the target webpage, the page is encoded in UTF-8.

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Therefore, the text must ultimately be handled as UTF-8. That is, the final step should decode the response bytes with UTF-8, i.e. decode('utf-8').
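Reading the declared charset can also be done in code. The following is a rough Python 3 sketch; `declared_charset` and the regular expression are hypothetical helpers written for this illustration, not part of requests, and a real HTML parser would be more robust.

```python
import re

# Hypothetical helper: look for a charset declaration in a <meta> tag.
# Works on raw bytes because the encoding is not known yet.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)

def declared_charset(html_bytes, default=None):
    """Return the charset named in a <meta> tag, lowercased, or `default`."""
    match = META_CHARSET.search(html_bytes[:2048])  # declarations sit near the top
    if match:
        return match.group(1).decode("ascii").lower()
    return default

sample = b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
print(declared_charset(sample))
```

With a live response, the bytes would come from `res.content` rather than from a literal.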

Text encoding and decoding

The encoding and decoding flow in Python is as follows: source text ==> encode(encoding) ==> bytes ==> decode(encoding) ==> text. The bytes must be decoded with the same encoding that was used to encode them. For the most part, the following approach is not recommended:

import sys
reload(sys)
sys.setdefaultencoding('utf8')

This approach forcibly changes the interpreter's default encoding. Being lazy like this is not a big problem in some cases, but it is better to handle the text explicitly with encode and decode once you have obtained the source.
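As a sketch of what explicit handling looks like, the Python 3 snippet below (an assumption, since the article's scripts target Python 2) decodes once on input, works with text, and encodes once on output. The sample strings are illustrative.

```python
# Python 3 sketch: decode at the input boundary, encode at the output boundary.
raw = "二手房".encode("utf-8")    # bytes, as they would arrive from a socket or file
text = raw.decode("utf-8")        # decode once, right after reading
label = "listing: " + text        # all processing happens on text
out = label.encode("utf-8")       # encode once, right before writing

# Mixing text and bytes is an error in Python 3 instead of a silent conversion.
try:
    "listing: " + raw
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(label, mixed_ok)
```

Python 2 would attempt the mixed concatenation with an implicit conversion using the default encoding, which is the behavior that sys.setdefaultencoding papers over.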

Back to the problem

The key question now is how the source is encoded. In normal use, requests guesses the encoding of the response and decodes it to Unicode for us. But it is only a program and can guess wrong; when it does, we need to specify the encoding manually. The official documentation says:

When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property.

So we need to check which encoding requests has chosen:

# coding=utf-8
# Python 2 script: print the encoding that requests guessed for the response.
import re
import requests
from bs4 import BeautifulSoup
import sys

reload(sys)
sys.setdefaultencoding('utf8')

url = 'http://jb51.net/ershoufang/rs%E6%8B%9B%E5%95%86%E6%9E%9C%E5%B2%AD/'
res = requests.get(url)
print res.encoding

The output is as follows:

ISO-8859-1

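In other words, requests decoded the response as ISO-8859-1. requests falls back to this value when the Content-Type header names a text type but carries no charset, which is presumably what happened here, so it does not mean the page is actually encoded that way. Searching Baidu for ISO-8859-1 gives the following: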

ISO8859-1, usually called Latin-1. Latin-1 includes additional characters that are indispensable for writing all Western European languages.
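The reported ISO-8859-1 can be understood with a simplified imitation of header-based guessing. This is a standard-library Python 3 sketch under stated assumptions: `guess_encoding_from_content_type` is a hypothetical function written for this illustration, not the Requests source, and it only models the two cases discussed here (an explicit charset, and a text type without one).

```python
from email.message import Message

def guess_encoding_from_content_type(content_type):
    """Simplified, hypothetical imitation of guessing an encoding from a header."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    if isinstance(charset, str) and charset:
        return charset                      # the server declared a charset
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"                 # text type, no charset: fall back
    return None                             # nothing to go on

print(guess_encoding_from_content_type("text/html"))
print(guess_encoding_from_content_type("text/html; charset=utf-8"))
```

Under this model, a page whose body is UTF-8 but whose header is just `text/html` ends up decoded as ISO-8859-1, which matches the symptom described above.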

Problem Solving

Once the cause is known, the fix is easy. You only need to specify the encoding explicitly for the Chinese characters to be output correctly. The code is as follows:

# coding=utf-8
# Python 2 script: override the guessed encoding before reading res.text.
import requests
import sys

reload(sys)
sys.setdefaultencoding('utf8')

url = 'http://jb51.net/ershoufang/rs%E6%8B%9B%E5%95%86%E6%9E%9C%E5%B2%AD/'
res = requests.get(url)
res.encoding = 'utf-8'
print res.text

The printed output now displays the Chinese text correctly.

Another method is to encode the wrongly decoded text back into its original bytes and then decode those bytes as UTF-8. The code is as follows:

# coding=utf-8
# Python 2 script: undo the wrong ISO-8859-1 decode, then decode as UTF-8.
import requests
import sys

reload(sys)
sys.setdefaultencoding('utf8')

url = 'http://jb51.net/ershoufang/rs%E6%8B%9B%E5%95%86%E6%9E%9C%E5%B2%AD/'
res = requests.get(url)
# res.encoding = 'utf-8'
print res.text.encode('ISO-8859-1').decode('utf-8')

Note: ISO-8859-1 is also called latin1, so using latin1 as the codec name in place of ISO-8859-1 gives the same correct result.
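The round trip can be checked offline. The Python 3 sketch below (an assumption, as the scripts above are Python 2) simulates the wrong decode on an illustrative string, repairs it under both codec names, and confirms that Python resolves the two names to the same codec.

```python
import codecs

# Simulate the wrong decode on a sample string (illustrative only).
garbled = "套房源".encode("utf-8").decode("iso-8859-1")

# Repair: recover the original bytes, then decode them as UTF-8.
fixed_a = garbled.encode("ISO-8859-1").decode("utf-8")
fixed_b = garbled.encode("latin1").decode("utf-8")

# Both names resolve to the same codec in Python's registry.
same_codec = codecs.lookup("latin1").name == codecs.lookup("ISO-8859-1").name

print(fixed_a, fixed_b, same_codec)
```

The repair works because ISO-8859-1 maps every byte value to exactly one character, so the wrong decode loses no information and can be reversed.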

Many things can be said about character encoding. For more information, see the following documents.

• The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
