Implementing a crawler in Python with BeautifulSoup

Source: Internet
Author: User
I previously wrote about using PhantomJS as a crawler to fetch web pages (www.jb51.net/article/55789.htm); that approach worked together with selectors.

BeautifulSoup (documentation: www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python module that makes it easy to extract content from web pages.


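# coding=utf-8
# Python 3: urlencode and urlopen live in urllib.parse and urllib.request
from urllib.parse import urlencode
from urllib.request import urlopen

from bs4 import BeautifulSoup

url = 'http://www.baidu.com/s'
values = {'wd': 'Tennis'}
encoded_param = urlencode(values)        # 'wd=Tennis'
full_url = url + '?' + encoded_param
response = urlopen(full_url)             # download the results page
soup = BeautifulSoup(response, 'html.parser')
alinks = soup.find_all('a')              # every <a> tag on the page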

The code above fetches the Baidu search results page for "Tennis" and collects every link (a tag) found on it.
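As a rough sketch of what can be done with the collected links, the snippet below builds the query URL and then pulls the text and target of each link out of a page. The helper names build_search_url and extract_links, and the sample_html string standing in for a downloaded page, are made up for illustration so that the example runs without network access.

```python
from urllib.parse import urlencode, urljoin

from bs4 import BeautifulSoup


def build_search_url(base_url, params):
    """Append URL-encoded query parameters to base_url."""
    return base_url + "?" + urlencode(params)


def extract_links(html, base_url):
    """Return (text, absolute_url) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):  # skip anchors without an href
        text = a.get_text(strip=True)
        links.append((text, urljoin(base_url, a["href"])))
    return links


# Hypothetical stand-in for a downloaded results page (not real Baidu markup).
sample_html = """
<div>
  <a href="/link?id=1">Tennis news</a>
  <a href="http://example.com/rules">Rules of tennis</a>
  <a name="top">No href here</a>
</div>
"""

search_url = build_search_url("http://www.baidu.com/s", {"wd": "Tennis"})
links = extract_links(sample_html, search_url)

print(search_url)
for text, href in links:
    print(text, "->", href)
```

Relative links are resolved against the search URL with urljoin, so every returned address can be fetched directly in a later step.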

BeautifulSoup has many useful methods built in.

Some of the more useful features:

Constructing a node element

The code is as follows:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>

Attributes can be retrieved with attrs; the result is a dictionary.

The code is as follows:

tag.attrs
# {'class': ['boldest']}   (class can hold several values, so current bs4 returns a list)

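Or you can read a single attribute directly by treating the tag like a dictionary, for example tag['class']. (tag.class does not work, because class is a reserved word in Python.)

You can also add, modify, and delete attributes freely.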
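A minimal sketch of the three ways to read attributes mentioned above. The id="intro" attribute is added here purely for illustration and does not appear in the original example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest" id="intro">Extremely bold</b>', "html.parser")
tag = soup.b

# .attrs is a plain dictionary holding every attribute on the tag.
all_attrs = tag.attrs

# Dictionary-style access; "class" may hold several values, so it comes back as a list.
classes = tag["class"]
tag_id = tag["id"]

# .get() returns None (or a chosen default) instead of raising KeyError.
missing = tag.get("title")
fallback = tag.get("title", "untitled")

print(all_attrs, classes, tag_id, missing, fallback)
```

Using get() is the safer choice when crawling pages whose markup is not under your control, since a missing attribute then does not abort the run.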


tag['class'] = 'verybold'
tag['id'] = 1
tag
# <b class="verybold" id="1">Extremely bold</b>

del tag['class']
del tag['id']
tag
# <b>Extremely bold</b>

tag['class']
# KeyError: 'class'
print(tag.get('class'))
# None

You can also navigate to DOM elements in many different ways, as the following example shows.

1. Build a document


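html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

2. Various ways of navigating it

# Jump straight to the first tag with a given name
soup.head
# <head><title>The Dormouse's story</title></head>
soup.title
# <title>The Dormouse's story</title>
soup.body.b
# <b>The Dormouse's story</b>
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

# find_all returns every matching tag
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# .contents lists a tag's direct children
head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>
head_tag.contents
# [<title>The Dormouse's story</title>]
title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# ["The Dormouse's story"]
len(soup.contents)
# 1
soup.contents[0].name
# 'html'

# A string has no children of its own
text = title_tag.contents[0]
text.contents
# raises AttributeError

# .children iterates over direct children, .descendants over the whole subtree
for child in title_tag.children:
    print(child)
# The Dormouse's story
head_tag.contents
# [<title>The Dormouse's story</title>]
for child in head_tag.descendants:
    print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story
len(list(soup.children))
# 1
len(list(soup.descendants))
# 26 with html.parser (the count can differ with other parsers)

# .string returns the text of a tag that has a single child
title_tag.string
# "The Dormouse's story"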
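To make the difference between children, descendants, and string easier to see, here is a small sketch on a made-up document. The markup in doc is an assumption chosen to contain no whitespace between tags, so that the node counts are easy to check by hand.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical document, written without whitespace between tags.
doc = "<div><p>Hello <b>world</b></p><p>Bye</p></div>"
soup = BeautifulSoup(doc, "html.parser")
div = soup.div

# .children yields only the direct children of a tag.
child_names = [child.name for child in div.children]

# .descendants walks the whole subtree, text nodes included.
descendant_kinds = [
    node.name if isinstance(node, Tag) else "text"
    for node in div.descendants
]

# .string is set only when a tag has exactly one child; otherwise it is None.
bold_text = div.p.b.string
first_p_string = div.p.string

print(child_names)
print(descendant_kinds)
print(bold_text, first_p_string)
```

The first p tag holds both a text node and a b tag, which is why its string is None while the string of the b tag is the text itself.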