I previously wrote about using PhantomJS as a crawler to fetch web pages (www.jb51.net/article/55789.htm); that approach worked together with selectors.
With BeautifulSoup (documentation: www.crummy.com/software/BeautifulSoup/bs4/doc/), a Python module, scraping web content becomes easy:
# coding=utf-8
from urllib.parse import urlencode
from urllib.request import urlopen

from bs4 import BeautifulSoup

url = 'http://www.baidu.com/s'
values = {'wd': 'tennis'}
encoded_param = urlencode(values)  # build the query string: wd=tennis
full_url = url + '?' + encoded_param
response = urlopen(full_url)
soup = BeautifulSoup(response, 'html.parser')
alinks = soup.find_all('a')  # every <a> tag on the results page
The code above fetches Baidu's search results for "tennis" and collects every link on the results page.
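Once the links are collected, a common next step is to read each link's target and text. The following is only a sketch: it assumes a small inline HTML snippet (hypothetical, standing in for the downloaded page) so that it runs without network access, and the names sample_html and results are illustrative rather than part of the code above.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded results page.
sample_html = """
<div>
  <a href="http://example.com/one">First result</a>
  <a href="http://example.com/two">Second <em>result</em></a>
  <a name="anchor-only">No href here</a>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

results = []
for link in soup.find_all('a'):
    href = link.get('href')  # None when the attribute is missing
    if href is None:
        continue  # skip anchors that do not point anywhere
    # Join nested text pieces with a space and trim surrounding whitespace.
    results.append((href, link.get_text(' ', strip=True)))

for href, text in results:
    print(href, text)
```

Using link.get('href') rather than link['href'] avoids a KeyError on anchors that have no href attribute.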
BeautifulSoup comes with many useful built-in methods.
Some of the more useful features:
Constructing a tag (node) element
The code is as follows:
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>
Attributes can be read through .attrs; the result is a dictionary.
The code is as follows:
tag.attrs
# {'class': ['boldest']}
You can also read an attribute directly by treating the tag like a dictionary, for example tag['class'].
Attributes can be added, modified, and deleted freely:
tag['class'] = 'verybold'
tag['id'] = 1
tag
# <b class="verybold" id="1">Extremely bold</b>

del tag['class']
del tag['id']
tag
# <b>Extremely bold</b>

tag['class']
# KeyError: 'class'
print(tag.get('class'))
# None
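The attribute operations above can be combined with a few safer lookups. The sketch below is illustrative only: the markup and the variable names (classes, tag_id, has_title, title) are made up for the example. It shows that class is treated as a multi-valued attribute and comes back as a list, while id is a plain string, and that has_attr() and get() with a default avoid the KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
soup = BeautifulSoup('<p class="story intro" id="first">text</p>', 'html.parser')
tag = soup.p

classes = list(tag['class'])  # copy of the multi-valued attribute: a list of strings
tag_id = tag['id']            # single-valued attribute: a plain string

has_title = tag.has_attr('title')     # check before reading
title = tag.get('title', 'untitled')  # fallback value instead of a KeyError

tag['class'].append('highlight')      # the stored list can be modified in place
print(tag)
```

Copying the list with list(...) keeps classes unchanged when the tag's own class list is modified afterwards.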
You can also navigate to DOM elements in many different ways, as in the following example.
1. Build a document
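# The "three sisters" sample document from the BeautifulSoup documentation
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')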
2. Various ways to navigate the tree
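soup.head
# <head><title>The Dormouse's story</title></head>

soup.title
# <title>The Dormouse's story</title>

soup.body.b
# <b>The Dormouse's story</b>

soup.a  # only the first matching tag
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')  # all matching tags
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>
head_tag.contents  # direct children, as a list
# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# ["The Dormouse's story"]

len(soup.contents)
# 1
soup.contents[0].name
# 'html'

text = title_tag.contents[0]
text.contents  # a string has no children
# AttributeError: 'NavigableString' object has no attribute 'contents'

for child in title_tag.children:  # iterate over direct children
    print(child)
# The Dormouse's story

head_tag.contents
# [<title>The Dormouse's story</title>]
for child in head_tag.descendants:  # iterate over children, grandchildren, and so on
    print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story

len(list(soup.children))
# 1
len(list(soup.descendants))
# 26 with html.parser (the exact count depends on the parser)

title_tag.string
# "The Dormouse's story"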
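One detail of .string is worth illustrating: it only returns text when a tag has exactly one child, and returns None otherwise. The sketch below uses a made-up one-line snippet (the variable names are illustrative) to contrast .string with .strings and get_text(), which handle tags containing mixed content.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: <p> holds a string, a <b> tag, and another string.
html = '<p>Hello, <b>world</b>!</p>'
soup = BeautifulSoup(html, 'html.parser')
p = soup.p

single = soup.b.string    # <b> has exactly one child string
mixed = p.string          # None, because <p> has three children
pieces = list(p.strings)  # every string inside <p>, in document order
joined = p.get_text()     # all strings concatenated

print(single, mixed, pieces, joined)
```

When the structure of a tag is not known in advance, get_text() is usually the safer choice for extracting its text.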