On handling encodings when crawling web pages with Python


Background

During the Mid-Autumn Festival a friend e-mailed me: he was crawling Lianjia and found that the pages came back garbled, and he asked me to help him figure it out (working overtime over Mid-Autumn Festival, now that is dedication!). I actually ran into this problem quite early on, while crawling a novel site; I glanced at it at the time but never took it seriously. At bottom, the issue is an incomplete understanding of text encodings.

Problem

A very common crawler script looks like this:

# coding=utf-8
import re
import requests
import sys

reload(sys)
sys.setdefaultencoding('utf8')

url = 'http://jb51.net/ershoufang/rs%E6%8B%9B%E5%95%86%E6%9E%9C%E5%B2%AD/'
res = requests.get(url)
print res.text

The goal is simple: crawl the Lianjia listing page. But when this runs, every bit of Chinese in the returned content comes back garbled, for example:

<script type="text/template" id="newaddhousetpl">
 <div class="newaddhouse">
  自仞æ ' ¨ä¸Šæ¬¡æµ 览(<%=time%>)之åžï¼œè¯¥æœç´¢æ¡ä»¶ä¸‹æ–°å¢žåšäº†<%=count%>å¥-房æº
  <a href="<%=url%>" class="lognewershoufangshow" <%=logText%>><%=linkText%></a>
  <span class="newhouserightclose">x</span>
 </div>
</script>

Data like this is of no use at all.

Problem analysis

The problem here is obvious: the text is being decoded with the wrong encoding, which is what produces the garbled output.

Checking the web page's encoding

The <head> of the crawled target page declares that it is encoded in UTF-8:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

So the final text must also be handled as UTF-8; in other words, the raw bytes should be decoded with decode('utf-8').
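
If you prefer not to rely on res.text at all, you can decode the raw response bytes yourself. A minimal sketch, assuming the page really is UTF-8 as its <meta> tag declares:

# coding=utf-8
# Sketch: decode the raw bytes in res.content directly, bypassing requests' guess.
import requests

url = 'http://jb51.net/ershoufang/rs%E6%8B%9B%E5%95%86%E6%9E%9C%E5%B2%AD/'
res = requests.get(url)
print res.content.decode('utf-8')   # bytes -> unicode, using the declared UTF-8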

Encoding and decoding of text

Python's encode/decode flow looks like this: source data ===> encode (to raw bytes) ===> decode (with the correct encoding). For that reason, it is largely not recommended to use

import sys
reload(sys)
sys.setdefaultencoding('utf8')

to force a default text encoding. In situations where it does no harm, being a little lazy like this is not a big problem, but the recommended approach is to take the source data and convert it explicitly with encode and decode.
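
As a quick illustration of that explicit flow, here is a minimal Python 2 sketch (illustrative only, not part of the original crawler):

# coding=utf-8
# Minimal sketch of explicit conversion: the byte string is decoded with its
# real encoding, and the resulting unicode object can be re-encoded as needed.
raw = '中文编码'                          # str: UTF-8 bytes, thanks to the coding declaration above
text = raw.decode('utf-8')                # decode: bytes -> unicode
back = text.encode('utf-8')               # encode: unicode -> bytes again when writing out
print type(raw), type(text), type(back)   # <type 'str'> <type 'unicode'> <type 'str'>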

Back to the problem

The key question now is which encoding the source actually uses. When we use requests in the usual way, it guesses the encoding of the response and then converts the text to Unicode for us. But it is only a guess, and a guess can be wrong; when it is wrong, we need to specify the encoding manually. The official documentation describes it as follows:

When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property.

So let's check which encoding requests has chosen.

# coding=utf-8
import re
import requests
from bs4 import BeautifulSoup
import sys

reload(sys)
sys.setdefaultencoding('utf8')

url = 'http://jb51.net/ershoufang/rs%E6%8B%9B%E5%95%86%E6%9E%9C%E5%B2%AD/'

res = requests.get(url)
print res.encoding

The output is as follows:

ISO-8859-1

In other words, requests has decided that the response is encoded in ISO-8859-1. (It is simply following the old HTTP convention here: when the Content-Type header carries no charset parameter, requests falls back to ISO-8859-1 and never looks at the <meta> tag inside the HTML.) A quick search for ISO-8859-1 turns up:

ISO-8859-1, usually called Latin-1, adds the extra characters that are indispensable for writing all Western European languages.
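
To see where this guess comes from, you can inspect the response yourself. A small sketch (apparent_encoding asks the bundled chardet library to sniff the body, so it is also only a guess):

# coding=utf-8
# Sketch: inspect requests' encoding decision for this response.
import requests

url = 'http://jb51.net/ershoufang/rs%E6%8B%9B%E5%95%86%E6%9E%9C%E5%B2%AD/'
res = requests.get(url)

print res.headers.get('Content-Type')   # likely 'text/html' with no charset parameter
print res.encoding                      # ISO-8859-1, the HTTP default requests falls back to
print res.apparent_encoding             # chardet's sniffed guess, probably 'utf-8' here

# One option: trust the sniffed encoding instead of the header fallback.
res.encoding = res.apparent_encoding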

Problem solving

Once we know this, the problem is easy to solve: just tell requests the right encoding and the Chinese prints correctly. The code is as follows:

# coding=utf-8
import requests
import sys

reload(sys)
sys.setdefaultencoding('utf8')

url = 'http://jb51.net/ershoufang/rs%E6%8B%9B%E5%95%86%E6%9E%9C%E5%B2%AD/'

res = requests.get(url)
res.encoding = 'utf-8'   # tell requests the real encoding before touching res.text

print res.text

This time the printed result speaks for itself: the Chinese is displayed correctly.

Another approach is to re-encode and then decode the text yourself, as follows:

# coding=utf-8
import requests
import sys

reload(sys)
sys.setdefaultencoding('utf8')

url = 'http://jb51.net/ershoufang/rs%E6%8B%9B%E5%95%86%E6%9E%9C%E5%B2%AD/'

res = requests.get(url)
# res.encoding = 'utf-8'

# Undo the wrong ISO-8859-1 decoding, then decode the raw bytes as UTF-8.
print res.text.encode('iso-8859-1').decode('utf-8')

Note: ISO-8859-1 is also known as Latin-1; using 'latin1' as the codec name in the encode step gives exactly the same result.
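
You can confirm that the two names point at the same codec with a tiny sketch:

# coding=utf-8
# Tiny sketch: 'latin1' and 'iso-8859-1' are aliases of the same codec,
# so either name works in the encode step above.
import codecs

print codecs.lookup('latin1').name        # iso8859-1
print codecs.lookup('iso-8859-1').name    # iso8859-1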

There is a lot more that could be said about character encodings; if you want to dig deeper, this classic article is a good reference:

* The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

That wraps up this article on handling encodings when crawling web pages with Python; I hope it gives you a useful reference.
