This article introduces how to implement a crawler in Python with BeautifulSoup; readers who need it can use it as a reference. A previous article covered crawling web pages with phantomjs (http://www.jb51.net/article/55789.htm), which relied on selectors.
With the BeautifulSoup Python module (documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/), you can easily scrape web content.
# coding=utf-8
# Python 2 API; in Python 3 these calls are urllib.parse.urlencode and urllib.request.urlopen.
import urllib
from bs4 import BeautifulSoup

url = 'http://...'  # Baidu search URL
values = {'wd': 'Tennis'}
encoded_param = urllib.urlencode(values)
full_url = url + '?' + encoded_param
response = urllib.urlopen(full_url)
soup = BeautifulSoup(response)
alinks = soup.find_all('a')
The code above fetches the Baidu search results for "Tennis" and collects every link (a tag) on the result page.
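For readers on Python 3, where urlencode and urlopen moved into urllib.parse and urllib.request, the same flow can be sketched roughly as follows. The helper names build_search_url and extract_links, the https://example.com/s endpoint, and the inline sample HTML are assumptions made for illustration and do not come from the original code; the sample HTML stands in for a downloaded page so that the sketch runs without network access.

```python
from urllib.parse import urlencode

from bs4 import BeautifulSoup


def build_search_url(base_url, params):
    """Append URL-encoded query parameters to base_url (hypothetical helper)."""
    return base_url + '?' + urlencode(params)


def extract_links(html):
    """Return the href of every <a> tag that has one (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    return [a['href'] for a in soup.find_all('a', href=True)]


# Placeholder endpoint, not the URL used in the article.
full_url = build_search_url('https://example.com/s', {'wd': 'Tennis'})

# Stand-in for a downloaded page; with network access the HTML could instead
# come from urllib.request.urlopen(full_url).read().
sample_html = (
    '<p><a href="/a">First</a> '
    '<a name="x">No href</a> '
    '<a href="/b">Second</a></p>'
)
links = extract_links(sample_html)

print(full_url)
print(links)
```

Passing href=True to find_all skips anchors that have no href attribute, so the list comprehension cannot raise KeyError.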
BeautifulSoup has many useful built-in methods.
Several of the most useful features are shown below:
Construct a node element
The code is as follows:
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>
Attributes can be obtained with attrs; the result is a dictionary.
The code is as follows:
tag.attrs
# {u'class': [u'boldest']}
A single attribute can be read directly with tag['class'].
Attributes can also be modified and deleted freely.
tag['class'] = 'verybold'
tag['id'] = 1
tag
# <b class="verybold" id="1">Extremely bold</b>

del tag['class']
del tag['id']
tag
# <b>Extremely bold</b>

tag['class']
# KeyError: 'class'
print(tag.get('class'))
# None
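The attribute operations above can be collected into one runnable sketch. It assumes Python 3 and the built-in html.parser backend (the article specifies neither), so strings appear without the u prefix; has_attr is included as one more way to test for an attribute.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b

# class is a multi-valued attribute, so bs4 returns its value as a list.
original_class = tag['class']

# Set, then delete, attributes as if the tag were a dictionary.
tag['id'] = '1'
with_id = str(tag)
del tag['class']
del tag['id']
stripped = str(tag)

# .get() returns None for a missing attribute instead of raising KeyError.
missing = tag.get('class')
has_id = tag.has_attr('id')

print(original_class, with_id, stripped, missing, has_id)
```

Using get (or has_attr) is usually the safer choice when scraping real pages, where an attribute may be absent on some tags.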
You can also look up DOM elements as needed, as in the following example:
1. Build a document
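html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)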
2. Various operations
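soup.head
# <head><title>The Dormouse's story</title></head>

soup.title
# <title>The Dormouse's story</title>

soup.body.b
# <b>The Dormouse's story</b>

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>

title_tag.contents
# [u'The Dormouse's story']

len(soup.contents)
# 1

soup.contents[0].name
# u'html'

text = title_tag.contents[0]
text.contents
# AttributeError: a string has no .contents

for child in title_tag.children:
    print(child)
# The Dormouse's story

head_tag.contents
# [<title>The Dormouse's story</title>]

for child in head_tag.descendants:
    print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story

len(list(soup.children))
# 1

len(list(soup.descendants))
# 25

title_tag.string
# u'The Dormouse's story'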
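The difference between children (direct children only) and descendants (every nested node, text included) is easy to miss in the listing above. Below is a minimal sketch, assuming Python 3, the html.parser backend, and a shortened document made up for illustration:

```python
from bs4 import BeautifulSoup

# Shortened, made-up document; not the full html_doc from the article.
html = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p><b>Hi</b></p></body></html>"
)
soup = BeautifulSoup(html, 'html.parser')
head_tag = soup.head

children = list(head_tag.children)        # only the <title> tag
descendants = list(head_tag.descendants)  # the <title> tag and its text node

child_names = [child.name for child in children]
title_text = soup.title.string
bold_texts = [b.get_text() for b in soup.find_all('b')]

print(child_names, len(descendants), title_text, bold_texts)
```

children stops at one level, while descendants walks the whole subtree, which is why the counts for the same tag differ.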