A previous article (http://www.jb51.net/article/55789.htm) covered using PhantomJS as a crawler to capture a web page with selectors. With the Python module BeautifulSoup (documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/), crawling web content is just as easy:
# coding=utf-8
import urllib
from bs4 import BeautifulSoup

url = 'http://www.baidu.com/s'
values = {'wd': 'tennis'}
encoded_param = urllib.urlencode(values)
full_url = url + '?' + encoded_param
response = urllib.urlopen(full_url)
soup = BeautifulSoup(response)
alinks = soup.find_all('a')
The code above fetches Baidu's search results for 'tennis' and collects every link on the results page.
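The snippet above is Python 2 (urllib.urlencode and urllib.urlopen were removed in Python 3). A minimal Python 3 sketch of the same request, assuming the same Baidu endpoint and parameter, looks like this:

```python
# Python 3 version of the crawl above; the URL and the 'wd'
# parameter are taken from the original Python 2 snippet.
from urllib.parse import urlencode

url = 'http://www.baidu.com/s'
values = {'wd': 'tennis'}
full_url = url + '?' + urlencode(values)
print(full_url)  # http://www.baidu.com/s?wd=tennis

# The actual fetch and parse (needs network access and bs4 installed):
#   from urllib.request import urlopen
#   from bs4 import BeautifulSoup
#   soup = BeautifulSoup(urlopen(full_url), 'html.parser')
#   alinks = soup.find_all('a')
```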
BeautifulSoup has a number of very useful methods built in. Some of the more useful features:
Construct a node element
soup = BeautifulSoup('<b class="boldest">extremely bold</b>')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>
Attributes can be obtained with .attrs, and the result is a dictionary:
tag.attrs
# {u'class': u'boldest'}
Or take a single attribute directly with tag['class'] (tag.class is not valid Python, since class is a keyword).
You can also manipulate attributes freely:
tag['class'] = 'verybold'
tag['id'] = 1
tag
# <b class="verybold" id="1">extremely bold</b>
del tag['class']
del tag['id']
tag
# <b>extremely bold</b>
tag['class']
# KeyError: 'class'
print(tag.get('class'))
# None
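For reference, the same attribute round trip as a runnable script under a current bs4. Note that modern bs4 versions treat class as a multi-valued attribute, so .attrs returns it as a list rather than the single string shown above:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">extremely bold</b>', 'html.parser')
tag = soup.b

print(tag.attrs)         # {'class': ['boldest']} -- class comes back as a list
tag['class'] = 'verybold'
tag['id'] = 1
print(tag)               # the tag now carries both attributes
del tag['class']
del tag['id']
print(tag.get('class'))  # None -- .get() avoids the KeyError of tag['class']
```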
You can also look up DOM elements freely, as in the following example.
1. Build a document
html_doc = """
soup = BeautifulSoup(html_doc)
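The html_doc above is truncated in the original, so the sample document below is made up for illustration; the lookup calls themselves (find_all, find, select) are standard bs4 API:

```python
from bs4 import BeautifulSoup

# A made-up sample document (the original html_doc is not shown in full).
html_doc = """
<html><body>
<p class="title"><b>The Dormouse's story</b></p>
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</body></html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

alinks = soup.find_all('a')        # every <a> tag in the document
first = soup.find('a')             # first match only (or None)
by_id = soup.find(id='link2')      # look up by attribute value
sisters = soup.select('a.sister')  # CSS selector syntax

print(len(alinks))       # 2
print(first['href'])     # http://example.com/elsie
print(by_id.get_text())  # Lacie
print(len(sisters))      # 2
```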
2. Various