Python web-page parsing with BeautifulSoup: installation and usage
Parsing web pages in Python is hard to do without BeautifulSoup. Consider this the preface.
Install
From BeautifulSoup 4 onward, installation requires easy_install. If you do not need the latest features, version 3 is enough — and do not assume the old version is no good; it has been used by millions of people. Installation is easy:
$ wget "http://www.crummy.com/software/BeautifulSoup/download/3.x/BeautifulSoup-3.2.1.tar.gz"
$ tar zxvf BeautifulSoup-3.2.1.tar.gz
Then copy the BeautifulSoup.py file from the extracted archive into the site-packages directory of your Python installation.
site-packages is where third-party Python packages are stored. Its location varies from system to system; you can find it as follows:
$ sudo find / -name "site-packages" -maxdepth 5 -type d
$ find ~ -name "site-packages" -maxdepth 5
If you do not have root permission, search under the current user's home directory instead:
$ find ~ -name "site-packages" -maxdepth 5 -type d
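Alternatively, Python itself can report where its package directories live — a minimal sketch using the standard library's sysconfig and site modules (Python 3 syntax; this is a stdlib shortcut, not something the original article describes):

```python
import site
import sysconfig

# The canonical site-packages path for this interpreter
print(sysconfig.get_paths()['purelib'])

# The per-user site-packages directory (useful when you lack root permission)
print(site.getusersitepackages())
```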
If you are using a Mac — lucky you — the directory is under /Library/Python, and there may be subdirectories for several Python versions. It doesn't matter; just put the file into the latest version's site-packages. Before use, import the module:
from BeautifulSoup import BeautifulSoup
Use
Before using it, let's look at an example. Suppose you are given this page:
http://movie.douban.com/tag/%E5%96%9C%E5%89%A7
It is the comedy category on Douban. If you want to find the top 100 highest-rated movies there, what would you do?
Well, let me show what I built first. Since I am still a beginner at CSS with no natural artistic talent, the interface looks the way it looks — please don't laugh.
Next, we will learn some basic BeautifulSoup methods that make handling pages like the one above much easier.
Given that the page of Douban is complex, let's take a simple example. Suppose we process the following webpage code:
<html>
<head><title>Page title</title></head>
<body>
<p id="firstpara" align="center">
This is paragraph
<b>
one
</b>
.
</p>
<p id="secondpara" align="blah">
This is paragraph
<b>
two
</b>
.
</p>
</body>
</html>
You are not mistaken — this is the example from the official documentation. If you are patient, reading the official documentation is really all you need:
http://www.leeon.me/upload/other/beautifulsoup-documentation-zh.html
Initialization
First, assign the HTML code above to the variable html, as follows. The single-line form here is easier to copy and paste; the version with line breaks above lets you see the HTML structure clearly.
html = '<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p><p id="secondpara" align="blah">This is paragraph <b>two</b>.</p></body></html>'
The initialization is as follows:
soup = BeautifulSoup(html)
We know that HTML code can be regarded as a tree. This operation parses the HTML into a tree-shaped data structure and stores it in soup. Note that the root of this data structure is the soup object itself, not the <html> tag:
print soup
print soup.contents[0]
print soup.contents[1]
The first two outputs are identical — the entire HTML document — because soup has exactly one child. The third line therefore raises IndexError: list index out of range.
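To make the "HTML is a tree" idea concrete without BeautifulSoup 3 (which is Python 2-only), here is a rough stdlib analogue using html.parser in Python 3 — it merely prints each start tag indented by its nesting depth, and is not BeautifulSoup's API:

```python
from html.parser import HTMLParser

class TreeDumper(HTMLParser):
    """Record each start tag, indented by its nesting depth."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append('  ' * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

# The same example document as above, on one line
html_doc = ('<html><head><title>Page title</title></head><body>'
            '<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
            '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
            '</body></html>')

dumper = TreeDumper()
dumper.feed(html_doc)
print('\n'.join(dumper.lines))
```

The indentation makes the nesting visible: html contains head and body, body contains two p tags, each p contains a b.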
Search nodes
There are two kinds of search on a node: one returns a single node, the other returns a list of nodes. The corresponding functions are find and findAll, respectively.
Single Node
1. Based on the node name
# search for the head node
print soup.find('head')  # output: <head><title>Page title</title></head>
# or
head = soup.head
This method finds the node nearest to the node you start searching from. Here the search starts at soup, so it finds the head closest to soup (if there are multiple).
2. Based on Attributes
# find the node whose id attribute is firstpara
print soup.find(attrs={'id': 'firstpara'})
# output: <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
# node name and attributes can also be combined
print soup.find('p', attrs={'id': 'firstpara'})  # same output as above
3. Based on the node relationship
Node relationships are nothing more than parent, child, and sibling nodes.
p1 = soup.find(attrs={'id': 'firstpara'})  # get the first p node
print p1.nextSibling  # next sibling node
# output: <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
p2 = soup.find(attrs={'id': 'secondpara'})  # get the second p node
print p2.previousSibling  # previous sibling node
# output: <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
print p2.parent  # parent node; the output is long, so the <body>...</body> part is omitted here
print p2.contents[0]  # first child node; output: u'This is paragraph '
Multiple nodes
Change the find described above to findAll to get back a list of all matching nodes; the parameters are the same.
1. Based on the node name
# search for all p nodes
soup.findAll('p')
2. Search by attribute
# search for all nodes with id = firstpara
soup.findAll(attrs={'id': 'firstpara'})
Note that, although only one node is found in this example, a list object is returned.
The above basic search functions can handle most cases. If you need advanced search functions, such as regular expressions, you can go to the official documentation.
Get Text
The getText method obtains all the text under a node. A separator character can be passed in to join the text of the individual nodes.
# get the text under the head node
soup.head.getText()  # u'Page title'
# or
soup.head.text
# get all the text under body, separated by \n
soup.body.getText('\n')  # u'This is paragraph\none\n.\nThis is paragraph\ntwo\n.'
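The same separator idea can be sketched with the standard library's html.parser (Python 3 syntax). This is only an approximation of getText — it strips whitespace around each text fragment, which BeautifulSoup does not do:

```python
from html.parser import HTMLParser

class TextGrabber(HTMLParser):
    """Collect every text fragment, roughly mimicking getText(separator)."""
    def __init__(self):
        super().__init__()
        self.pieces = []

    def handle_data(self, data):
        # Keep only non-blank fragments, trimmed
        if data.strip():
            self.pieces.append(data.strip())

    def get_text(self, separator=''):
        return separator.join(self.pieces)

grabber = TextGrabber()
grabber.feed('<body><p>This is paragraph <b>one</b>.</p></body>')
print(grabber.get_text('\n'))  # This is paragraph\none\n.
```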
Practice
With these functions, the task posed at the beginning of the article is within reach. Let's look at that Douban page again:
http://movie.douban.com/tag/%E5%96%9C%E5%89%A7
To get the top 100 movies, you need to extract two things from this page: 1. the page-turning links; 2. the information for each movie (external link, image, rating, introduction, title, etc.).
After extracting the information for all movies, sort them by score and take the highest-rated ones. Here is the code for page turning and for extracting movie information:
# filename: Grab.py
from BeautifulSoup import BeautifulSoup, Tag
import urllib2
import re
import sys

def LOG(*argv):
    sys.stderr.write(*argv)
    sys.stderr.write('\n')

class Grab():
    url = ''
    soup = None

    def GetPage(self, url):
        if url.find('http://', 0, 7) != 0:
            url = 'http://' + url
        self.url = url
        LOG('input url is: %s' % self.url)
        req = urllib2.Request(url, headers={'User-Agent': "Magic Browser"})
        try:
            page = urllib2.urlopen(req)
        except:
            return
        return page.read()

    def ExtractInfo(self, buf):
        if not self.soup:
            try:
                self.soup = BeautifulSoup(buf)
            except:
                LOG('soup failed in ExtractInfo: %s' % self.url)
                return
        try:
            items = self.soup.findAll(attrs={'class': 'item'})
        except:
            LOG('failed on find items: %s' % self.url)
            return
        links = []
        objs = []
        titles = []
        scores = []
        comments = []
        intros = []
        for item in items:
            try:
                pic = item.find(attrs={'class': 'nbg'})
                link = pic['href']
                obj = pic.img['src']
                info = item.find(attrs={'class': 'pl2'})
                title = re.sub('[\t]+', '', info.a.getText().replace(' ', '').replace('\n', ''))
                star = info.find(attrs={'class': 'star clearfix'})
                score = star.find(attrs={'class': 'rating_nums'}).getText().replace(' ', '')
                comment = star.find(attrs={'class': 'pl'}).getText().replace(' ', '')
                intro = info.find(attrs={'class': 'pl'}).getText().replace(' ', '')
            except Exception, e:
                LOG('process error in ExtractInfo: %s' % self.url)
                continue
            links.append(link)
            objs.append(obj)
            titles.append(title)
            scores.append(score)
            comments.append(comment)
            intros.append(intro)
        return (links, objs, titles, scores, comments, intros)

    def ExtractPageTurning(self, buf):
        links = set([])
        if not self.soup:
            try:
                self.soup = BeautifulSoup(buf)
            except:
                LOG('soup failed in ExtractPageTurning: %s' % self.url)
                return
        try:
            pageturning = self.soup.find(attrs={'class': 'paginator'})
            a_nodes = pageturning.findAll('a')
            for a_node in a_nodes:
                href = a_node['href']
                if href.find('http://', 0, 7) == -1:
                    href = self.url.split('?')[0] + href
                links.add(href)
        except:
            LOG('get pageturning failed in ExtractPageTurning: %s' % self.url)
        return links

    def Destroy(self):
        del self.soup
        self.soup = None
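One fragile spot in ExtractPageTurning is gluing relative hrefs onto the base URL by hand with split('?'). The standard library's urljoin (urlparse.urljoin in Python 2, urllib.parse.urljoin in Python 3) resolves relative links more robustly; the hrefs below are made-up examples, not taken from a real crawl:

```python
from urllib.parse import urljoin  # urlparse.urljoin in Python 2

base = 'http://movie.douban.com/tag/comedy?start=160&type=t'

# A relative paginator href resolves against the base URL's scheme and host
print(urljoin(base, '/tag/comedy?start=180&type=t'))
# An absolute href passes through unchanged
print(urljoin(base, 'http://movie.douban.com/tag/comedy?start=0'))
```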
Next, let's write a test example.
# filename: test.py
# encoding: utf-8
from Grab import Grab
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

grab = Grab()
buf = grab.GetPage('http://movie.douban.com/tag/comedy?start=160&type=t')
if not buf:
    print 'GetPage failed!'
    sys.exit()
links, objs, titles, scores, comments, intros = grab.ExtractInfo(buf)
for link, obj, title, score, comment, intro in zip(links, objs, titles, scores, comments, intros):
    print link + '\t' + obj + '\t' + title + '\t' + score + '\t' + comment + '\t' + intro
pageturning = grab.ExtractPageTurning(buf)
for link in pageturning:
    print link
grab.Destroy()
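The article stops short of the final ranking step. Assuming scores come back from ExtractInfo as strings like '8.7' (with '' for unrated movies — an assumption, since the real page data varies), a minimal sketch of sorting by score and keeping the top N looks like this; the movie names are made up:

```python
# Hypothetical data in the same parallel-list shape that ExtractInfo returns
titles = ['Movie A', 'Movie B', 'Movie C', 'Movie D']
scores = ['8.7', '9.2', '', '7.5']  # rating strings; '' means unrated

def top_n(titles, scores, n=100):
    """Pair titles with numeric scores, drop unrated items, sort descending."""
    rated = [(t, float(s)) for t, s in zip(titles, scores) if s]
    rated.sort(key=lambda pair: pair[1], reverse=True)
    return rated[:n]

print(top_n(titles, scores, n=3))
# [('Movie B', 9.2), ('Movie A', 8.7), ('Movie D', 7.5)]
```

Collect the tuples from every page returned by ExtractPageTurning, then apply the same sort once at the end with n=100.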
OK, that's it — run it and see the results for yourself.
This article only introduces the basics of BeautifulSoup, to help you pick them up quickly. When I started, I worked through the BeautifulSoup source code function by function before things finally clicked, so I hope this lets later readers master the basic functions in a more convenient way. Every word here was written by hand, and laying out these code samples was a real headache.