On handling encodings when crawling web pages with Python


Background

During the Mid-Autumn Festival a friend e-mailed me: he was crawling Lianjia and found that the pages came back garbled, and he asked me to help him figure it out (working overtime over Mid-Autumn Festival, now that is dedication!). I actually ran into this problem quite early on, while crawling a novel site; I glanced at it at the time but never took it seriously. At bottom, the issue is an incomplete understanding of text encodings.

Problem

A very common crawler script looks like this:

# coding=utf-8
import re
import requests
import sys

reload(sys)
sys.setdefaultencoding('utf8')

url = 'http://jb51.net/ershoufang/rs%E6%8B%9B%E5%95%86%E6%9E%9C%E5%B2%AD/'
res = requests.get(url)
print res.text

The goal is simple: crawl the Lianjia listing page. But when this runs, every bit of Chinese in the returned content comes back garbled, for example:

<script type="text/template" id="newaddhousetpl">
 <div class="newaddhouse">
  自仞æ ' ¨ä¸Šæ¬¡æµ 览(<%=time%>)之åžï¼œè¯¥æœç´¢æ¡ä»¶ä¸‹æ–°å¢žåšäº†<%=count%>å¥-房æº
  <a href="<%=url%>" class="lognewershoufangshow" <%=logText%>><%=linkText%></a>
  <span class="newhouserightclose">x</span>
 </div>
</script>

Data like this is of no use at all.

Problem analysis

The problem here is obvious: the text is being decoded with the wrong encoding, which is what produces the garbled output.

Checking the web page's encoding

The <head> of the crawled target page declares that it is encoded in UTF-8:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

So the final text must also be handled as UTF-8; in other words, the raw bytes should be decoded with decode('utf-8').
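
If you prefer not to rely on res.text at all, you can decode the raw response bytes yourself. A minimal sketch, assuming the page really is UTF-8 as its <meta> tag declares:

# coding=utf-8
# Sketch: decode the raw bytes in res.content directly, bypassing requests' guess.
import requests

url = 'http://jb51.net/ershoufang/rs%E6%8B%9B%E5%95%86%E6%9E%9C%E5%B2%AD/'
res = requests.get(url)
print res.content.decode('utf-8')   # bytes -> unicode, using the declared UTF-8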

Encoding and decoding of text

Python's encode/decode flow looks like this: source data ===> encode (to raw bytes) ===> decode (with the correct encoding). For that reason, it is largely not recommended to use

import sys
reload(sys)
sys.setdefaultencoding('utf8')

to force a default text encoding. In situations where it does no harm, being a little lazy like this is not a big problem, but the recommended approach is to take the source data and convert it explicitly with encode and decode.
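
As a quick illustration of that explicit flow, here is a minimal Python 2 sketch (illustrative only, not part of the original crawler):

# coding=utf-8
# Minimal sketch of explicit conversion: the byte string is decoded with its
# real encoding, and the resulting unicode object can be re-encoded as needed.
raw = '中文编码'                          # str: UTF-8 bytes, thanks to the coding declaration above
text = raw.decode('utf-8')                # decode: bytes -> unicode
back = text.encode('utf-8')               # encode: unicode -> bytes again when writing out
print type(raw), type(text), type(back)   # <type 'str'> <type 'unicode'> <type 'str'>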

Back to the problem

The key question now is which encoding the source actually uses. When we use requests in the usual way, it guesses the encoding of the response and then converts the text to Unicode for us. But it is only a guess, and a guess can be wrong; when it is wrong, we need to specify the encoding manually. The official documentation describes it as follows:

When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property.

So let's check which encoding requests has chosen.

# coding=utf-8
import re
import requests
from bs4 import BeautifulSoup
import sys

reload(sys)
sys.setdefaultencoding('utf8')

url = 'http://jb51.net/ershoufang/rs%E6%8B%9B%E5%95%86%E6%9E%9C%E5%B2%AD/'

res = requests.get(url)
print res.encoding

The output is as follows:

ISO-8859-1

In other words, requests has decided that the response is encoded in ISO-8859-1. (It is simply following the old HTTP convention here: when the Content-Type header carries no charset parameter, requests falls back to ISO-8859-1 and never looks at the <meta> tag inside the HTML.) A quick search for ISO-8859-1 turns up:

ISO-8859-1, usually called Latin-1, adds the extra characters that are indispensable for writing all Western European languages.
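
To see where this guess comes from, you can inspect the response yourself. A small sketch (apparent_encoding asks the bundled chardet library to sniff the body, so it is also only a guess):

# coding=utf-8
# Sketch: inspect requests' encoding decision for this response.
import requests

url = 'http://jb51.net/ershoufang/rs%E6%8B%9B%E5%95%86%E6%9E%9C%E5%B2%AD/'
res = requests.get(url)

print res.headers.get('Content-Type')   # likely 'text/html' with no charset parameter
print res.encoding                      # ISO-8859-1, the HTTP default requests falls back to
print res.apparent_encoding             # chardet's sniffed guess, probably 'utf-8' here

# One option: trust the sniffed encoding instead of the header fallback.
res.encoding = res.apparent_encoding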

Problem solving

Once we know this, the problem is easy to solve: just tell requests the right encoding and the Chinese prints correctly. The code is as follows:

# coding=utf-8
import requests
import sys

reload(sys)
sys.setdefaultencoding('utf8')

url = 'http://jb51.net/ershoufang/rs%E6%8B%9B%E5%95%86%E6%9E%9C%E5%B2%AD/'

res = requests.get(url)
res.encoding = 'utf-8'   # tell requests the real encoding before touching res.text

print res.text

This time the printed result speaks for itself: the Chinese is displayed correctly.

Another approach is to re-encode and then decode the text yourself, as follows:

# coding=utf-8
import requests
import sys

reload(sys)
sys.setdefaultencoding('utf8')

url = 'http://jb51.net/ershoufang/rs%E6%8B%9B%E5%95%86%E6%9E%9C%E5%B2%AD/'

res = requests.get(url)
# res.encoding = 'utf-8'

# Undo the wrong ISO-8859-1 decoding, then decode the raw bytes as UTF-8.
print res.text.encode('iso-8859-1').decode('utf-8')

Note: ISO-8859-1 is also known as Latin-1; using 'latin1' as the codec name in the encode step gives exactly the same result.
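
You can confirm that the two names point at the same codec with a tiny sketch:

# coding=utf-8
# Tiny sketch: 'latin1' and 'iso-8859-1' are aliases of the same codec,
# so either name works in the encode step above.
import codecs

print codecs.lookup('latin1').name        # iso8859-1
print codecs.lookup('iso-8859-1').name    # iso8859-1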

There is a lot more that could be said about character encodings; if you want to dig deeper, this classic article is a good reference:

* The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

That wraps up this article on handling encodings when crawling web pages with Python; I hope it gives you a useful reference.
