Crawling pages with bs4 and urllib2: all the pitfalls



Today I spent a day using Python to crawl the news on the Sina portal. It is not actually difficult; the key is that I got stuck on the following three problems.

Issue one: Sina News returns data in gzip format

My first instinct after reading the data was to decode the bytes into a Unicode string, which is the usual routine for handling messy strings in Python. But I spent the whole morning running into all kinds of encode/decode errors and assumed the returned data contained garbage characters. Then I remembered that during my internship, the code I borrowed for crawling web content ran the response through gzip decompression first, so it was likely that the data returned by the server was compressed with gzip.

So when you receive the data returned by the server, check whether the Content-Encoding header is 'gzip'. If it is, decompress the body with the gzip module; otherwise read the response data directly. You can see the following code.

# coding=utf8
import gzip
import urllib2
from StringIO import StringIO
from bs4 import BeautifulSoup


def loaddata(url):
    # Ask the server for a gzip-compressed response.
    request = urllib2.Request(url)
    request.add_header('Accept-Encoding', 'gzip')
    response = urllib2.urlopen(request)
    print response.info().get('Content-Encoding')
    if response.info().get('Content-Encoding') == 'gzip':
        print 'Response data is in gzip format.'
        # Decompress the response body with the gzip module.
        buf = StringIO(response.read())
        f = gzip.GzipFile(fileobj=buf)
        data = f.read()
    else:
        data = response.read()
    return data


if __name__ == '__main__':
    page = loaddata('http://news.sina.com.cn/')
    soup = BeautifulSoup(page, from_encoding='gb18030')
    print soup.prettify().encode('gb18030')

Issue two: string encoding problems

Pages that contain Chinese mostly use one of the GB family of encodings. The three main ones are GB2312, GBK, and GB18030, developed in that order, and each later one covers more characters than the one before (GB2312 < GBK < GB18030). If a page declared as GB2312 actually contains characters that only exist in GBK, we can specify from_encoding='gbk' when creating the BeautifulSoup object. And since GB18030 can encode the most characters, you can always pass from_encoding='gb18030' when constructing BeautifulSoup, regardless of whether the page itself uses GB2312 or GBK; see the small sketch below.
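For instance, a minimal sketch of both options, where page is the byte string returned by loaddata above (which encoding bs4 actually detects for the real Sina page is not something I have verified):

from bs4 import BeautifulSoup

# Option 1: let BeautifulSoup guess the encoding on its own.
soup_guess = BeautifulSoup(page)
print soup_guess.original_encoding  # what bs4 detected by itself

# Option 2: force GB18030, a superset of GB2312 and GBK, so pages
# declared as either of those decode without errors.
soup_forced = BeautifulSoup(page, from_encoding='gb18030')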

In addition, when dealing with character encodings in Python, the usual routine is to decode a string to Unicode as soon as we read it and to encode it only when we output it, which ensures that all strings processed in memory are Unicode strings.
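A minimal sketch of that routine, assuming the page bytes are GB-encoded and the terminal expects UTF-8 (the slice length and the output encoding are only for illustration):

raw = loaddata('http://news.sina.com.cn/')   # byte string straight from the server
text = raw.decode('gb18030', 'replace')      # decode to unicode on the way in
# ... all in-memory processing works on the unicode string `text` ...
print text[:100].encode('utf-8')             # encode only on the way out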

Issue three: using BeautifulSoup

BeautifulSoup is a convenient and efficient package for handling HTML or XML content, and I have only used a small part of it here. For details, refer to the official Beautiful Soup documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
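As a quick illustration of typical usage (pulling headline links with these particular tags and attributes is my own example, not something from the article), with soup being the object built in the code above:

# print the text and target of every <a> tag that carries an href attribute
for a in soup.find_all('a', href=True):
    title = a.get_text().strip()
    if title:
        print title.encode('gb18030'), a['href']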


A small rant to finish: life rarely goes the way you want. I hope that things not being at their best right now just means I am saving up good luck for the future.

