On character encoding when crawling web pages with Python
Background
During the Mid-Autumn Festival, a friend emailed me saying that while crawling Lianjia he found the pages the site returned were garbled, and asked me for advice (working overtime during the Mid-Autumn Festival, truly dedicated!). I actually ran into this problem long ago: I glanced at it while crawling a novel site but never took it seriously. At bottom, the problem comes from a poor understanding of character encoding.
Problem
A typical crawler looks like this:
# coding=utf-8
import re
import requests
import sys

reload(sys)
sys.setdefaultencoding('utf8')

url = 'http://jb51.net/ershoufang/rs%E6%8B%9B%E5%95%86%E6%9E%9C%E5%B2%AD/'
res = requests.get(url)
print res.text
The goal is simple: fetch the Lianjia listings page. But after running this, everything in the response that involves Chinese comes back garbled, for example:
<script type="text/template" id="newAddHouseTpl"> <div class="newAddHouse"> 自从您上次æµè§ˆï¼ˆ<%=time%>)之åŽï¼Œè¯¥æœç´¢æ¡ä»¶ä¸‹æ–°å¢žåŠ 了<%=count%>å¥—æˆ¿æº <a href="<%=url%>" class="LOGNEWERSHOUFANGSHOW" <%=logText%>><%=linkText%></a> <span class="newHouseRightClose">x</span> </div></script>
Such data is useless.
Problem Analysis
The problem is obvious: the text is being decoded with the wrong encoding, which produces the garbled characters.
View the webpage code
Looking at the header of the target page, the page declares itself as UTF-8:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Therefore, the text must ultimately be handled as UTF-8. In other words, the final processing step should decode the bytes with UTF-8, i.e. decode('utf-8').
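Since the page declares UTF-8, the raw bytes must be decoded with the 'utf-8' codec. A minimal, self-contained sketch (the byte string below is simply the UTF-8 encoding of two Chinese characters, chosen for illustration):

```python
# -*- coding: utf-8 -*-
# The UTF-8 bytes for the two characters u'\u4e2d\u6587' ("Chinese").
raw = b'\xe4\xb8\xad\xe6\x96\x87'

# Decoding with the codec the page declares yields the correct text.
text = raw.decode('utf-8')
assert text == u'\u4e2d\u6587'
```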
Text encoding and decoding
Python's encoding/decoding flow works like this: raw bytes ==> decode(source encoding) ==> Unicode ==> encode(target encoding). Hard-coding the default encoding with

import sys

reload(sys)
sys.setdefaultencoding('utf8')

is not particularly recommended. It is a lazy shortcut that causes no great harm in some situations, but the better practice is to decode and encode the text explicitly after fetching the source.
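The explicit alternative looks like this (a sketch only; the byte string stands in for whatever the crawler actually fetched):

```python
# -*- coding: utf-8 -*-
# Pretend these bytes arrived off the wire from a UTF-8 page.
fetched = u'\u4e2d\u6587'.encode('utf-8')

text = fetched.decode('utf-8')   # bytes -> Unicode: do this once, on input
out = text.encode('utf-8')       # Unicode -> bytes: do this once, on output
assert out == fetched
```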
Back to question
The big question now is the encoding of the source file. When we use requests normally, it automatically guesses the source encoding and transcodes the body to Unicode. But it is, after all, only a guess, and it can guess wrong; when it does, we need to specify the encoding manually. The official documentation describes it as follows:
When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property.
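The effect of r.encoding can be demonstrated offline by building a Response object by hand. Note this pokes at requests internals (the _content attribute) purely for illustration; it is an assumption-laden sketch, not a supported way to use the library:

```python
# -*- coding: utf-8 -*-
import requests

resp = requests.models.Response()
resp._content = u'\u4e2d\u6587'.encode('utf-8')  # pretend this is the body

resp.encoding = 'ISO-8859-1'  # what requests may guess from the headers
garbled = resp.text           # decoded with the wrong codec: mojibake

resp.encoding = 'utf-8'       # override the guess
fixed = resp.text             # now decodes correctly

assert fixed == u'\u4e2d\u6587'
assert garbled != fixed
```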
So let's check which encoding requests has detected:
# coding=utf-8
import re
import requests
from bs4 import BeautifulSoup
import sys

reload(sys)
sys.setdefaultencoding('utf8')

url = 'http://jb51.net/ershoufang/rs%E6%8B%9B%E5%95%86%E6%9E%9C%E5%B2%AD/'
res = requests.get(url)
print res.encoding
The output is as follows:
ISO-8859-1
In other words, requests decided the source was encoded as ISO-8859-1 (it falls back to ISO-8859-1 when the Content-Type header is text/* with no explicit charset, following the old HTTP/1.1 default). A quick search for ISO-8859-1 turns up:
ISO-8859-1, usually called Latin-1, extends ASCII with the additional characters indispensable for writing the Western European languages.
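Because Latin-1 assigns a character to every one of the 256 byte values, decoding UTF-8 bytes as ISO-8859-1 never raises an error; it just quietly produces gibberish. A small sketch of why the crawl "succeeds" yet prints mojibake:

```python
# -*- coding: utf-8 -*-
word = u'\u4e2d\u6587'             # two Chinese characters
utf8_bytes = word.encode('utf-8')  # what the server actually sent (6 bytes)

# Wrong codec: no exception is raised, because every byte maps to *some*
# Latin-1 character -- the result is simply one garbage character per byte.
mojibake = utf8_bytes.decode('iso-8859-1')
assert mojibake != word
assert len(mojibake) == len(utf8_bytes)
```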
Problem Solving
Once the cause is found, the fix is easy: just specify the correct encoding and the Chinese characters print properly. The code is as follows:
# coding=utf-8
import requests
import sys

reload(sys)
sys.setdefaultencoding('utf8')

url = 'http://jb51.net/ershoufang/rs%E6%8B%9B%E5%95%86%E6%9E%9C%E5%B2%AD/'
res = requests.get(url)
res.encoding = 'utf8'
print res.text
This time the printed text displays correctly.
Another approach is to re-encode and then decode the fetched text ourselves. The code is as follows:
# coding=utf-8
import requests
import sys

reload(sys)
sys.setdefaultencoding('utf8')

url = 'http://jb51.net/ershoufang/rs%E6%8B%9B%E5%95%86%E6%9E%9C%E5%B2%AD/'
res = requests.get(url)
# res.encoding = 'utf8'
print res.text.encode('ISO-8859-1').decode('utf-8')
Note: ISO-8859-1 is also called Latin-1, so using 'latin1' as the codec name works just as well.
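This encode-then-decode trick works because Latin-1 decoding is lossless byte for byte: encoding the mojibake back to Latin-1 restores the original UTF-8 bytes exactly, and a second decode with the right codec then fixes the text. A sketch of the round trip:

```python
# -*- coding: utf-8 -*-
word = u'\u4e2d\u6587'

# What happened inside requests: UTF-8 bytes decoded with the wrong codec.
mojibake = word.encode('utf-8').decode('iso-8859-1')

# Undo it: Latin-1 round-trips every byte unchanged, so this recovers the
# original UTF-8 bytes, which the correct codec then decodes properly.
recovered = mojibake.encode('iso-8859-1').decode('utf-8')
assert recovered == word
```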
A great deal more could be said about character encoding; for more information, see the following reference:
• The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
That is all for this discussion of encoding when crawling web pages with Python. I hope it gives you a useful reference.