Two ways to detect the encoding of a crawled web page in Python

Source: Internet
Author: User

In web development we often need to fetch and parse web pages, and many languages can do this. I like to use Python because it provides many mature modules that make it easy to write a web crawler.

During crawling, however, we run into encoding problems, so today we look at how to determine the encoding of a web page:

Web pages on the Internet use different encodings, most commonly gbk, gb2312, and utf-8.

After fetching a page's data, we must first determine its encoding so that we can convert the crawled content into an encoding we can handle and avoid garbled text.
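As a rough illustration of why the encoding has to be known before converting, the sketch below encodes a sample string as gb2312, shows that decoding those bytes with the wrong codec fails, and then transcodes them to utf-8. It is written for Python 3, whereas the listings in this article are Python 2. The sample text and the helper name `transcode` are made up for illustration.

```python
def transcode(raw, source_encoding, target_encoding="utf-8"):
    """Decode raw bytes with the known source encoding, then re-encode them."""
    text = raw.decode(source_encoding)
    return text.encode(target_encoding)


# Hypothetical sample: bytes as a gb2312 page would deliver them.
raw = "中文网页".encode("gb2312")

# Decoding with the wrong codec fails here; in other cases it can
# silently produce garbled text instead.
try:
    raw.decode("utf-8")
    wrong_decode_failed = False
except UnicodeDecodeError:
    wrong_decode_failed = True

utf8_bytes = transcode(raw, "gb2312")
print(wrong_decode_failed)
print(utf8_bytes.decode("utf-8"))
```

The conversion itself is trivial once the source encoding is known, which is why the rest of the article concentrates on detecting it.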

Here are two ways to determine Web page encoding:

Method One: Using the getparam method of the urllib module

# Python 2 code (urllib.urlopen and the print statement)
import urllib
# author: pythontab.com
fopen1 = urllib.urlopen('http://www.baidu.com').info()
print fopen1.getparam('charset')  # Baidu

The output is:

gbk

In fact, the encoding obtained above is not correct. If we open the page and view its source, we find that Baidu declares gb2312. This method is a bit of a trap: getparam only reads the charset parameter of the HTTP Content-Type response header, so the result is sometimes inaccurate and sometimes missing altogether, which makes it very unreliable. A more reliable method is introduced below.
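Before turning to that, the header-based check can be reproduced offline to see where an empty result comes from. The sketch below is Python 3 and parses hypothetical Content-Type values with the standard library's `email.message.Message`, the class that the Python 3 object returned by `urlopen(...).info()` derives from. The helper name `charset_from_content_type` and the header strings are assumptions for illustration, not values captured from a real server.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Return the charset parameter of a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_param returns None when the header carries no such parameter.
    return msg.get_param("charset")


# Hypothetical header values.
print(charset_from_content_type("text/html; charset=gbk"))
print(charset_from_content_type("text/html"))
```

When the server sends no charset parameter, nothing is detected, and when it sends one that differs from what the page declares, the answer is simply whatever the server chose to send.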


Method Two: Using the chardet module

# If your Python installation does not have the chardet module, install chardet first
# author: pythontab.com
# Python 2 code (urllib.urlopen and the print statement)
import chardet
import urllib
# First fetch the page content
data1 = urllib.urlopen('http://www.baidu.com').read()
# Analyze the content with chardet
chardit1 = chardet.detect(data1)

print chardit1['encoding']  # Baidu

The output is:

gb2312

This result is correct; you can verify it yourself.
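For readers on Python 3, a minimal sketch of the same idea follows: ask a detector for the encoding, then decode the bytes with it. The function name `detect_and_decode`, the `fallback` parameter, and the stand-in `fake_detector` are assumptions introduced here so that the example runs without chardet or network access. The only chardet call relied on is `chardet.detect`, which returns a dict that includes an 'encoding' key.

```python
def detect_and_decode(data, detector=None, fallback="utf-8"):
    """Decode bytes with the encoding a chardet-style detector reports.

    `detector` takes bytes and returns a dict with an 'encoding' key,
    as chardet.detect does. `fallback` is used when it reports None.
    """
    if detector is None:
        import chardet  # third-party; must be installed first
        detector = chardet.detect
    guess = detector(data)
    encoding = guess.get("encoding") or fallback
    # errors="replace" avoids an exception if the guess is slightly off.
    return data.decode(encoding, errors="replace"), encoding


# Stand-in detector (hypothetical) so the example needs no third-party module.
def fake_detector(data):
    return {"encoding": "gb2312", "confidence": 0.99}


text, used = detect_and_decode("中文网页".encode("gb2312"), detector=fake_detector)
print(used, text)
```

With chardet installed, omitting `detector` makes the function call `chardet.detect` itself, for example on the bytes returned by `urllib.request.urlopen(url).read()`.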

Summary: The second method is very accurate. When determining a web page's encoding, using a Python module to analyze the content itself is the most accurate approach, whereas relying on the declared header information is not very accurate.
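The manual check mentioned under Method One, opening the page source to see which encoding it declares, can also be scripted. The sketch below is Python 3 and uses a deliberately simplified regular expression rather than a real HTML parser. The names `META_CHARSET`, `charset_from_meta`, and `scan_limit` are hypothetical, and the sample markup is made up. The result is only what the page claims about itself, so it is best treated as a cross-check against the other methods.

```python
import re

# Simplified pattern: finds charset=... inside a <meta> tag, covering both
# <meta charset="..."> and <meta http-equiv=... content="...; charset=...">.
META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([\w-]+)", re.IGNORECASE
)


def charset_from_meta(html_bytes, scan_limit=4096):
    """Return the charset declared in a <meta> tag near the top of the page, or None."""
    match = META_CHARSET.search(html_bytes[:scan_limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


# Made-up markup for illustration.
sample = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"></head>'
print(charset_from_meta(sample))
```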
