When parsing web pages in Python, you can hardly do without BeautifulSoup. That's the preface.
Installation
beautifulsoup4 requires easy_install (or pip) to install. If you don't need the latest features, version 3 is enough; don't assume the old version is bad just because it's old — it has served millions of people. Installation is simple:
$ wget "http://www.crummy.com/software/BeautifulSoup/download/3.x/BeautifulSoup-3.2.1.tar.gz"
$ tar zxvf BeautifulSoup-3.2.1.tar.gz
Then put the BeautifulSoup.py file into the site-packages directory under your Python installation.
site-packages is where Python's third-party packages live. Where that directory is differs from system to system, but it can usually be found like this:
$ sudo find / -name "site-packages" -maxdepth 5 -type d
$ find ~ -name "site-packages" -maxdepth 5
Of course, if you do not have root permission, search from the current user's home directory instead:
$ find ~ -name "site-packages" -maxdepth 5 -type d
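If the shell find feels clumsy, you can also ask Python itself where its site-packages is. This is a small stdlib-only sketch (sysconfig ships with Python 2.7/3.2 and later; not part of the original article):

```python
import sysconfig

# 'purelib' is the install location for pure-Python third-party
# packages, i.e. this interpreter's site-packages directory
print(sysconfig.get_paths()['purelib'])
```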
If you use a Mac, haha, you're in luck: I can tell you directly that on a Mac this directory is under /Library/Python/. There may be several version directories in there; it doesn't matter, just put the file into the site-packages of the latest version. Before use, import it first:
from BeautifulSoup import BeautifulSoup
Use
Before we use it, let's look at an example first. Suppose you are given a page like this:
http://movie.douban.com/tag/%E5%96%9C%E5%89%A7
It is Douban's listing of films tagged "comedy". If you want to find the 100 highest-rated ones among them, how would you do it?
OK, let me first show off what I built. Given that I'm still a CSS novice and was born without the art gene, the interface I made is merely passable — please don't mock it.
Next we'll learn some basic BeautifulSoup methods that make pages like the one above easy to handle.
Since the Douban page is fairly complex, let's start with a simpler example. Suppose we are dealing with the following page code:
<html>
<head>
<title>Page title</title>
</head>
<body>
<p id="firstpara" align="center">
This is paragraph
<b>
one
</b>
.
</p>
<p id="secondpara" align="blah">
This is paragraph
<b>
two
</b>
.
</p>
</body>
</html>
You're right — that's the example from the official documentation. If you're patient, reading the official documentation alone is enough, and you can skip the rest of this article:
http://www.leeon.me/upload/other/beautifulsoup-documentation-zh.html
Initialization
First, assign the HTML code above to a variable html, as follows. To make it easy to copy, the version here has no line breaks; the line-broken version above is just to let you see the HTML structure clearly:
html = '<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is paragraph<b>one</b>.</p><p id="secondpara" align="blah">This is paragraph<b>two</b>.</p></body></html>'
Initialize it like this:
soup = BeautifulSoup(html)
We know that HTML code can be seen as a tree. This operation parses the HTML code into a tree-shaped data structure and stores it in soup. Note that the root of this data structure is not <html> but the soup object itself:
print soup
print soup.contents[0]
print soup.contents[1]
The first two outputs are identical — the entire HTML document; the third raises an error: IndexError: list index out of range.
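BeautifulSoup is not the only way to see this tree. Purely to illustrate the tree idea, here is the same sample parsed with the standard library's xml.etree.ElementTree (this is not part of the original article, and it only works because the sample happens to be well-formed; real-world HTML usually is not, which is exactly why BeautifulSoup exists):

```python
import xml.etree.ElementTree as ET

html = '<body><p id="firstpara" align="center">This is paragraph<b>one</b>.</p><p id="secondpara" align="blah">This is paragraph<b>two</b>.</p></body>'
root = ET.fromstring(html)

print(root.tag)           # body
print(len(root))          # 2 -- the two p children
print(root[0].get('id'))  # firstpara
```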
Find nodes
Node lookup comes in two forms: one returns a single node, the other returns a list of nodes; the corresponding lookup functions are find and findAll.
Single node
1. By tag name
## find the head node
print soup.find('head') ## outputs <head><title>Page title</title></head>
## or
## head = soup.head
This kind of lookup finds the matching node nearest to the node you start searching from; here the search starts from soup, so we find the head nearest to soup (if there were more than one).
2. By attribute
## find the node whose id attribute is firstpara
print soup.find(attrs={'id': 'firstpara'})
## outputs <p id="firstpara" align="center">This is paragraph<b>one</b>.</p>
## you can also combine the tag name with attributes
print soup.find('p', attrs={'id': 'firstpara'}) ## same output as above
3. By node relationships
Node relationships are nothing more than siblings and parent-child:
p1 = soup.find(attrs={'id': 'firstpara'}) ## get the first p node
print p1.nextSibling ## next sibling node
## outputs <p id="secondpara" align="blah">This is paragraph<b>two</b>.</p>
p2 = soup.find(attrs={'id': 'secondpara'}) ## get the second p node
print p2.previousSibling ## previous sibling node
## outputs <p id="firstpara" align="center">This is paragraph<b>one</b>.</p>
print p2.parent ## parent node; the output is long, so the <body>...</body> part is omitted here
print p2.contents[0] ## first child node, outputs u'This is paragraph'
Multiple nodes
Change find to findAll and you get back the list of nodes found; the required parameters are the same.
1. By tag name
## find all p nodes
soup.findAll('p')
2. By attribute
## find all nodes with id=firstpara
soup.findAll(attrs={'id': 'firstpara'})
Note that although only one node is found in this example, the return value is still a list object.
These basic lookups already cover most situations; if you need advanced lookups, such as regular expressions, see the official documentation.
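Per the official docs, the regular expressions you would pass to find/findAll for tag names or attribute values are ordinary compiled re patterns. The snippet below (not from the original article) only demonstrates such a pattern on its own, since it is a plain standard-library object:

```python
import re

# a pattern that could match both paragraph ids in the sample page
para_id = re.compile('^(first|second)para$')

print(bool(para_id.match('firstpara')))   # True
print(bool(para_id.match('secondpara')))  # True
print(bool(para_id.match('thirdpara')))   # False
```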
Get text
The getText method gets all the text under a node; it can take a character argument used to separate the text of adjacent nodes:
## get the text under the head node
soup.head.getText() ## u'Page title'
## or
soup.head.text
## get all the text under body, separated by \n
soup.body.getText('\n') ## u'This is paragraph\none\n.\nThis is paragraph\ntwo\n.'
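For comparison only, the same join-the-text-with-a-separator idea can be reproduced on the well-formed sample with the standard library (again an illustration, not part of the original article; getText is the convenient way once you have a soup):

```python
import xml.etree.ElementTree as ET

body = ET.fromstring('<body><p>This is paragraph<b>one</b>.</p><p>This is paragraph<b>two</b>.</p></body>')
# itertext() walks the tree and yields every piece of text in document order
print('\n'.join(body.itertext()))
```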
Putting it into practice
With these features, the demo from the beginning of the article is easy to build. Let's look at that Douban page again:
http://movie.douban.com/tag/%E5%96%9C%E5%89%A7
If you want the top 100 of all the movies, you need to extract two kinds of information from this page: 1. the pagination links; 2. each movie's information (external link, picture, rating, introduction, title, and so on).
After extracting all the movies' information, we sort by rating and pick the highest. Here is the code for pagination extraction and movie-information extraction:
## filename: Grab.py
from BeautifulSoup import BeautifulSoup, Tag
import urllib2
import re
import sys

def LOG(*argv):
    sys.stderr.write(*argv)
    sys.stderr.write('\n')

class Grab():
    url = ''
    soup = None

    def getPage(self, url):
        if url.find('http://', 0, 7) != 0:
            url = 'http://' + url
        self.url = url
        LOG('input url is: %s' % self.url)
        req = urllib2.Request(url, headers={'User-Agent': "Magic Browser"})
        try:
            page = urllib2.urlopen(req)
        except:
            return
        return page.read()

    def extractInfo(self, buf):
        if not self.soup:
            try:
                self.soup = BeautifulSoup(buf)
            except:
                LOG('soup failed in extractInfo: %s' % self.url)
                return
        try:
            items = self.soup.findAll(attrs={'class': 'item'})
        except:
            LOG('failed on find items: %s' % self.url)
            return
        links = []
        objs = []
        titles = []
        scores = []
        comments = []
        intros = []
        for item in items:
            try:
                pic = item.find(attrs={'class': 'nbg'})
                link = pic['href']
                obj = pic.img['src']
                info = item.find(attrs={'class': 'pl2'})
                title = re.sub('[\t]+', '', info.a.getText().replace(' ', '').replace('\n', ''))
                star = info.find(attrs={'class': 'star clearfix'})
                score = star.find(attrs={'class': 'rating_nums'}).getText().replace(' ', '')
                comment = star.find(attrs={'class': 'pl'}).getText().replace(' ', '')
                intro = info.find(attrs={'class': 'pl'}).getText().replace(' ', '')
            except Exception, e:
                LOG('process error in extractInfo: %s' % self.url)
                continue
            links.append(link)
            objs.append(obj)
            titles.append(title)
            scores.append(score)
            comments.append(comment)
            intros.append(intro)
        return (links, objs, titles, scores, comments, intros)

    def extractPageTurning(self, buf):
        links = set([])
        if not self.soup:
            try:
                self.soup = BeautifulSoup(buf)
            except:
                LOG('soup failed in extractPageTurning: %s' % self.url)
                return
        try:
            pageTurning = self.soup.find(attrs={'class': 'paginator'})
            a_nodes = pageTurning.findAll('a')
            for a_node in a_nodes:
                href = a_node['href']
                if href.find('http://', 0, 7) == -1:
                    href = self.url.split('?')[0] + href
                links.add(href)
        except:
            LOG('get pageTurning failed in extractPageTurning: %s' % self.url)
        return links

    def Destroy(self):
        del self.soup
        self.soup = None
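A note on the href handling in extractPageTurning: splitting on '?' and concatenating works for Douban's query-only ?start=... pagination links, but the standard library's urljoin handles relative links more generally. A sketch (the URLs below are illustrative; urlparse in Python 2 became urllib.parse in Python 3):

```python
try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2

base = 'http://movie.douban.com/tag/comedy?start=160&type=t'

# a query-only href, like those in the paginator, replaces the query string
print(urljoin(base, '?start=180&type=t'))
# an absolute href passes through unchanged
print(urljoin(base, 'http://movie.douban.com/tag/comedy?start=0'))
```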
Then we write a test case:
## filename: test.py
# encoding: utf-8
from Grab import Grab
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

grab = Grab()
buf = grab.getPage('http://movie.douban.com/tag/comedy?start=160&type=t')
if not buf:
    print 'getPage failed!'
    sys.exit()
links, objs, titles, scores, comments, intros = grab.extractInfo(buf)
for link, obj, title, score, comment, intro in zip(links, objs, titles, scores, comments, intros):
    print link + '\t' + obj + '\t' + title + '\t' + score + '\t' + comment + '\t' + intro
pageTurning = grab.extractPageTurning(buf)
for link in pageTurning:
    print link
grab.Destroy()
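The goal stated at the start was the 100 highest-rated films. Once extractInfo's parallel lists have been collected across all the pages, the ranking itself is plain Python. A sketch with made-up placeholder data (the titles and scores here are invented, not real Douban output; extractInfo returns scores as strings):

```python
titles = ['movie-a', 'movie-b', 'movie-c', 'movie-d']
scores = ['8.7', '9.2', '7.9', '9.0']

# pair each title with its score, convert the score to float,
# sort descending, and keep the top 100
top = sorted(zip(titles, scores), key=lambda ts: float(ts[1]), reverse=True)[:100]
for title, score in top:
    print(title + '\t' + score)
```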
OK, once this step is done, the rest you can finish yourself.
This article only scratches the surface of BeautifulSoup; the goal is to help everyone quickly pick up some basic essentials. Back in the day, whenever I wanted some feature I had to read through the BeautifulSoup source code function by function before I could use it — a tale of bitter tears. So I hope later readers can master the basic functions in a more convenient way, and that this article — especially the code formatting, which really hurt my brain — was not typed out in vain.