Crawling pages with bs4 and urllib2: all the pitfalls



Today I spent a day using Python to crawl the news on the Sina portal. It is not actually difficult; the key is that I got stuck on the following three problems.

Issue one: Sina News returns data in gzip format

My first instinct after reading the data was to decode the bytes into a Unicode string, which is the usual routine for handling messy strings in Python. But I spent the whole morning running into all kinds of encode/decode errors and assumed the returned data contained garbage characters. Then I remembered that during my internship, the code I borrowed for crawling web content ran the response through gzip decompression first, so it was likely that the data returned by the server was compressed with gzip.

So when you receive the data returned by the server, check whether the Content-Encoding header is 'gzip'. If it is, decompress the body with the gzip module; otherwise read the response data directly. You can see the following code.

# coding=utf8
import gzip
import urllib2
from StringIO import StringIO
from bs4 import BeautifulSoup


def loaddata(url):
    # Ask the server for a gzip-compressed response.
    request = urllib2.Request(url)
    request.add_header('Accept-Encoding', 'gzip')
    response = urllib2.urlopen(request)
    print response.info().get('Content-Encoding')
    if response.info().get('Content-Encoding') == 'gzip':
        print 'Response data is in gzip format.'
        # Decompress the response body with the gzip module.
        buf = StringIO(response.read())
        f = gzip.GzipFile(fileobj=buf)
        data = f.read()
    else:
        data = response.read()
    return data


if __name__ == '__main__':
    page = loaddata('http://news.sina.com.cn/')
    soup = BeautifulSoup(page, from_encoding='gb18030')
    print soup.prettify().encode('gb18030')

Issue two: string encoding problems

Pages that contain Chinese mostly use one of the GB family of encodings. The three main ones are GB2312, GBK, and GB18030, developed in that order, and each later one covers more characters than the one before (GB2312 < GBK < GB18030). If a page declared as GB2312 actually contains characters that only exist in GBK, we can specify from_encoding='gbk' when creating the BeautifulSoup object. And since GB18030 can encode the most characters, you can always pass from_encoding='gb18030' when constructing BeautifulSoup, regardless of whether the page itself uses GB2312 or GBK; see the small sketch below.
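For instance, a minimal sketch of both options, where page is the byte string returned by loaddata above (which encoding bs4 actually detects for the real Sina page is not something I have verified):

from bs4 import BeautifulSoup

# Option 1: let BeautifulSoup guess the encoding on its own.
soup_guess = BeautifulSoup(page)
print soup_guess.original_encoding  # what bs4 detected by itself

# Option 2: force GB18030, a superset of GB2312 and GBK, so pages
# declared as either of those decode without errors.
soup_forced = BeautifulSoup(page, from_encoding='gb18030')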

In addition, when dealing with character encodings in Python, the usual routine is to decode a string to Unicode as soon as we read it and to encode it only when we output it, which ensures that all strings processed in memory are Unicode strings.
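A minimal sketch of that routine, assuming the page bytes are GB-encoded and the terminal expects UTF-8 (the slice length and the output encoding are only for illustration):

raw = loaddata('http://news.sina.com.cn/')   # byte string straight from the server
text = raw.decode('gb18030', 'replace')      # decode to unicode on the way in
# ... all in-memory processing works on the unicode string `text` ...
print text[:100].encode('utf-8')             # encode only on the way out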

Issue three: using BeautifulSoup

BeautifulSoup is a convenient and efficient package for handling HTML or XML content, and I have only used a small part of it here. For details, refer to the official Beautiful Soup documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
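As a quick illustration of typical usage (pulling headline links with these particular tags and attributes is my own example, not something from the article), with soup being the object built in the code above:

# print the text and target of every <a> tag that carries an href attribute
for a in soup.find_all('a', href=True):
    title = a.get_text().strip()
    if title:
        print title.encode('gb18030'), a['href']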


A small rant to finish: life rarely goes the way you want. I hope that things not being at their best right now just means I am saving up good luck for the future.

