Preface
This article introduces how to install and use BeautifulSoup, a Python tool for parsing web pages, and walks through a complete example step by step. If you parse web pages in Python, it is hard to do without BeautifulSoup; refer to this article if you need it.
Install
BeautifulSoup 4 and later must be installed with easy_install. If you do not need the newest features, version 3 is enough, and do not assume the old version is no good: it has been used by millions of people.
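If you do want version 4, easy_install can fetch it from PyPI, where the version-4 package is named beautifulsoup4; something like the following should work.
The code is as follows:
$ easy_install beautifulsoup4
To install version 3, download and unpack the tarball instead.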
The code is as follows:
$ wget "http://www.crummy.com/software/BeautifulSoup/download/3.x/BeautifulSoup-3.2.1.tar.gz"
$ tar zxvf BeautifulSoup-3.2.1.tar.gz
Then copy the BeautifulSoup.py file inside the archive to the site-packages directory under your Python installation.
site-packages is where third-party Python packages live. Its location varies by system, but it can be found in the following ways.
The code is as follows:
$ sudo find / -name "site-packages" -maxdepth 5 -type d
$ find ~ -name "site-packages" -maxdepth 5
If you do not have root permission, search from the current user's home directory instead.
The code is as follows:
$ find ~ -name "site-packages" -maxdepth 5 -type d
If you are using a Mac, haha, you are in luck: the directory is under /Library/Python, and there may be several version directories below it. It doesn't matter; just put the file in the latest version's site-packages.
Import before use
The code is as follows:
from BeautifulSoup import BeautifulSoup
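If the file landed in the right place, the import succeeds silently; a quick sanity check from the shell looks like this (a sketch).
The code is as follows:
$ python -c "from BeautifulSoup import BeautifulSoup; print 'ok'"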
Use
Let's look at a concrete task before going through the API.
Say you are given a page like this one:
The code is as follows:
http://movie.douban.com/tag/%E5%96%9C%E5%89%A7
It lists the comedy movies in Douban's tag category. If you want to find the 100 highest-rated movies among them, what should you do?
Well, let me first show what I made of it. I am still at the beginner stage with CSS and have no natural artistic talent, so please don't mock the interface.
Next, we will learn some basic BeautifulSoup methods that make handling a page like the one above much easier.
Given that the Douban page is fairly complex, let's start with a simpler example. Suppose we are processing the following web page code:
The code is as follows:
<html>
<head><title>Page title</title></head>
<body>
<p id="firstpara" align="center">This is paragraph <b>one</b>.
<p id="secondpara" align="blah">This is paragraph <b>two</b>.
</body>
</html>
You are not mistaken: this is the example from the official documentation. If you are patient, reading the official documentation alone is enough.
http://www.leeon.me/upload/other/beautifulsoup-documentation-zh.html
Initialization
First, assign the above HTML code to the variable html, as follows. The version here has the line breaks removed so it is easy to copy and paste; the version above keeps them so everyone can see the HTML structure clearly.
The code is as follows:
html = '<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is paragraph <b>one</b>.<p id="secondpara" align="blah">This is paragraph <b>two</b>.</body></html>'
The initialization is as follows:
The code is as follows:
soup = BeautifulSoup(html)
We know that HTML code can be regarded as a tree. This operation parses the HTML code into a tree-shaped data structure and stores it in soup. Note that the root node of this data structure is not <html>, but soup itself; the html tag is soup's only child node. If you do not believe it, try the following operations:
The code is as follows:
print soup
print soup.contents[0]
print soup.contents[1]
The first two print the same thing, the entire HTML document; the third raises IndexError: list index out of range, because soup has only one child.
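If you want to convince yourself of this structure, a short sketch like the following helps (the variable name root is only illustrative).
The code is as follows:
root = soup.contents[0]
print root.name           # 'html' -- the html tag is soup's only child
print len(soup.contents)  # 1, so soup.contents[1] is out of range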
Search nodes
Searching for nodes takes two forms: one returns a single node, the other returns a list of nodes. The corresponding functions are find and findAll, respectively.
Single node
1. By node name
The code is as follows:
# Find the head node
print soup.find('head')  # output: <head><title>Page title</title></head>
# or
# head = soup.head
This method returns the matching node nearest to the node you start the search from; here the search starts from soup, so it finds the head closest to soup (if there were several matches, the nearest one wins).
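To see this "nearest match" behavior with multiple candidates, try it on the two p nodes of the example (a quick sketch).
The code is as follows:
print soup.find('p')['id']  # 'firstpara' -- the first p in document order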
2. By attribute
The code is as follows:
# Find the node whose id attribute is firstpara
print soup.find(attrs={'id': 'firstpara'})
# output: <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
# The node name and attributes can also be combined
print soup.find('p', attrs={'id': 'firstpara'})  # same output as above
3. By node relationship
Node relationships are nothing more than siblings, parents, and children.
The code is as follows:
p1 = soup.find(attrs={'id': 'firstpara'})  # get the first p node
print p1.nextSibling  # next sibling node
# output: <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
p2 = soup.find(attrs={'id': 'secondpara'})  # get the second p node
print p2.previousSibling  # previous sibling node
# output: <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
print p2.parent       # parent node; the output is too long to show here
print p2.contents[0]  # first child node; outputs u'This is paragraph '
Multiple nodes
Change "find" described above to "findAll" to return the list of found nodes. the required parameters are consistent.
1. By node name
The code is as follows:
# Find all p nodes
soup.findAll('p')
2. By attribute
The code is as follows:
# Find all nodes whose id is firstpara
soup.findAll(attrs={'id': 'firstpara'})
Note that, although only one node is found in this example, a list object is returned.
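A short sketch makes the point; the result behaves like an ordinary Python list:
The code is as follows:
nodes = soup.findAll(attrs={'id': 'firstpara'})
print isinstance(nodes, list)  # True, even for a single match
print len(nodes)               # 1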
The above basic search functions can handle most cases. if you need advanced search functions, such as regular expressions, you can go to the official documentation.
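For example, find and findAll also accept a compiled regular expression in place of the node name; a minimal sketch:
The code is as follows:
import re
soup.findAll(re.compile('^p'))  # both p nodes (and any other tag whose name starts with 'p')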
Get text
The getText method retrieves all the text under a node, and a separator character can be passed in to be placed between the text of each child node.
The code is as follows:
# Get the text under the head node
soup.head.getText()  # u'Page title'
# or
soup.head.text
# Get all the text in the body, separated by \n
soup.body.getText('\n')  # u'This is paragraph \none\n.\nThis is paragraph \ntwo\n.'
Practice
With these functions, the demo promised at the beginning of the article is within reach. Let's look at that Douban page again.
http://movie.douban.com/tag/%E5%96%9C%E5%89%A7
To get the top 100 of all the movies, two pieces of information have to be extracted from this page: 1. the pagination links; 2. the information for each movie (external link, image, rating, introduction, title, and so on).
After extracting the information for all movies, we sort it by rating and select the highest-scoring entries. Here is the code for page turning and for extracting the movie information.
The code is as follows:
# Filename: Grab.py
from BeautifulSoup import BeautifulSoup, Tag
import urllib2
import re
import sys

def LOG(*argv):
    sys.stderr.write(*argv)
    sys.stderr.write('\n')

class Grab():
    url = ''
    soup = None

    def GetPage(self, url):
        # fetch a page, prepending http:// if the caller left it off
        if url.find('http://', 0, 7) != 0:
            url = 'http://' + url
        self.url = url
        LOG('input url is: %s' % self.url)
        req = urllib2.Request(url, headers={'User-Agent': "Magic Browser"})
        try:
            page = urllib2.urlopen(req)
        except:
            return
        return page.read()

    def ExtractInfo(self, buf):
        # pull the link, image, title, score, comment count and intro of every movie item
        if not self.soup:
            try:
                self.soup = BeautifulSoup(buf)
            except:
                LOG('soup failed in ExtractInfo: %s' % self.url)
                return
        try:
            items = self.soup.findAll(attrs={'class': 'item'})
        except:
            LOG('failed on find items: %s' % self.url)
            return
        links = []
        objs = []
        titles = []
        scores = []
        comments = []
        intros = []
        for item in items:
            try:
                pic = item.find(attrs={'class': 'nbg'})
                link = pic['href']
                obj = pic.img['src']
                info = item.find(attrs={'class': 'pl2'})
                title = re.sub('[\t]+', '', info.a.getText().replace(' ', '').replace('\n', ''))
                star = info.find(attrs={'class': 'star clearfix'})
                score = star.find(attrs={'class': 'rating_nums'}).getText().replace(' ', '')
                comment = star.find(attrs={'class': 'pl'}).getText().replace(' ', '')
                intro = info.find(attrs={'class': 'pl'}).getText().replace(' ', '')
            except Exception, e:
                LOG('process error in ExtractInfo: %s' % self.url)
                continue
            links.append(link)
            objs.append(obj)
            titles.append(title)
            scores.append(score)
            comments.append(comment)
            intros.append(intro)
        return (links, objs, titles, scores, comments, intros)

    def ExtractPageTurning(self, buf):
        # collect the pagination links, turning relative hrefs into absolute ones
        links = set([])
        if not self.soup:
            try:
                self.soup = BeautifulSoup(buf)
            except:
                LOG('soup failed in ExtractPageTurning: %s' % self.url)
                return
        try:
            pageturning = self.soup.find(attrs={'class': 'paginator'})
            a_nodes = pageturning.findAll('a')
            for a_node in a_nodes:
                href = a_node['href']
                if href.find('http://', 0, 7) == -1:
                    href = self.url.split('?')[0] + href
                links.add(href)
        except:
            LOG('get pageturning failed in ExtractPageTurning: %s' % self.url)
        return links

    def Destroy(self):
        del self.soup
        self.soup = None
Next, let's write a test example.
The code is as follows:
# Filename: test.py
# encoding: utf-8
from Grab import Grab
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

grab = Grab()
buf = grab.GetPage('http://movie.douban.com/tag/comedy?start=160&type=t')
if not buf:
    print 'GetPage failed!'
    sys.exit()
links, objs, titles, scores, comments, intros = grab.ExtractInfo(buf)
for link, obj, title, score, comment, intro in zip(links, objs, titles, scores, comments, intros):
    print link + '\t' + obj + '\t' + title + '\t' + score + '\t' + comment + '\t' + intro
pageturning = grab.ExtractPageTurning(buf)
for link in pageturning:
    print link
grab.Destroy()
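The sorting step itself is not shown above; a minimal sketch of it, reusing the lists returned by ExtractInfo and assuming the scores are strings such as u'8.7' (treating an empty score as 0), might look like this.
The code is as follows:
movies = zip(links, objs, titles, scores, comments, intros)
movies.sort(key=lambda m: float(m[3] or 0), reverse=True)  # m[3] is the score
top100 = movies[:100]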
OK, run it and take a look at the output.
This article only introduces the basics of BeautifulSoup to help you get started quickly. When I learned it, I had to dig through BeautifulSoup's source code function by function to figure out what everything did, so I hope later readers can master the basics in a more convenient way. This article was typed out word by word, and laying out all the code was a real headache.