Parsing web pages in Python is hard to imagine without BeautifulSoup. Consider this the preface.
Installation
BeautifulSoup 4 has to be installed with easy_install. If you do not need the latest features, version 3 is enough; do not assume the old version must be bad, tens of thousands of people are still using it. Installation is simple:
$ wget "http://www.crummy.com/software/BeautifulSoup/download/3.x/BeautifulSoup-3.2.1.tar.gz"
$ tar zxvf BeautifulSoup-3.2.1.tar.gz
Then put BeautifulSoup.py into the site-packages directory under your Python installation.
site-packages is where Python stores third-party packages. Where this directory lives differs from system to system, but you can usually find it like this:
$ sudo find / -name "site-packages" -maxdepth 5 -type d
$ find ~ -name "site-packages" -maxdepth 5
Of course, if you do not have root permissions, search from the current user's home directory instead:
$ find ~ -name "site-packages" -maxdepth 5 -type d
If you are using a Mac, you are in luck: I can tell you directly that on a Mac the directory is /Library/Python/. There may be more than one version directory under it; no matter, just drop the file into the site-packages of the newest version. Import it before use:
from BeautifulSoup import BeautifulSoup
Usage
Before using it, let's look at an example. Suppose you are given a page like this:
http://movie.douban.com/tag/%E5%96%9C%E5%89%A7
It lists the comedy movies on Douban. If you wanted to pick out the top 100, what would you do?
Okay, I will show my hand first: I did build one, and given that I am still a CSS beginner with no artistic talent, the interface is just barely presentable, so bear with it.
Let's start learning some basic BeautifulSoup methods; they make a page like the one above easy to handle.
Given the complexity of the Douban page, let's start with a simpler example. Suppose we are working with the following page code:
<html><head><title>Page title</title></head>
<body>
<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
<p id="secondpara" align="center">This is paragraph <b>two</b>.</p>
</body></html>
That's right, this is the example from the official documentation. If you have the patience, reading the official documentation is enough and you do not need to read the rest:
http://www.leeon.me/upload/other/beautifulsoup-documentation-zh.html
Initialization
First, assign the HTML code above to the variable html, as follows. To make it easy to copy, the version here has no line breaks; the line-broken version above is there so you can see the HTML structure clearly.
html = '<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p><p id="secondpara" align="center">This is paragraph <b>two</b>.</p></body></html>'
Initialize as follows:
soup = BeautifulSoup(html)
We know that HTML code can be viewed as a tree. This operation parses the HTML code into a tree data structure and stores it in soup. Note that the root node of this structure is not the html tag but soup itself, and the html tag is soup's only child node. If you don't believe it, try the following:
print soup
print soup.contents[0]
print soup.contents[1]
The first two outputs are identical, the whole HTML document; the third raises IndexError: list index out of range.
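You can check this yourself with a minimal, self-contained sketch. Since BeautifulSoup 3 only runs on Python 2, the sketch below uses the bs4 port with its built-in html.parser; the tree model described above is the same.

```python
from bs4 import BeautifulSoup  # bs4 port; BS3 would be: from BeautifulSoup import BeautifulSoup

html = ('<html><head><title>Page title</title></head>'
        '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
        '<p id="secondpara" align="center">This is paragraph <b>two</b>.</p>'
        '</body></html>')

soup = BeautifulSoup(html, 'html.parser')

# The root of the tree is soup itself; the <html> tag is its only child,
# so soup.contents has exactly one element and contents[1] raises IndexError.
print(len(soup.contents))     # 1
print(soup.contents[0].name)  # html
```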
Find nodes
Node lookup comes in two forms: one returns a single node, the other returns a list of nodes. The corresponding lookup functions are find and findAll, respectively.
Single node
1. By node name
## find the head node
print soup.find('head')  ## outputs <head><title>Page title</title></head>
## or
## head = soup.head
This method finds the matching node nearest to the node you search from; here the search starts from soup, so it returns the head closest to soup (if there were several).
2. By attribute
## find the node whose id attribute is firstpara
print soup.find(attrs={'id': 'firstpara'})
## outputs <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
## you can also combine the node name and attributes
print soup.find('p', attrs={'id': 'firstpara'})  ## same output as above
3. By node relationship
Node relationships are just siblings and parent-child:
p1 = soup.find(attrs={'id': 'firstpara'})  ## get the first p node
print p1.nextSibling  ## next sibling node
## outputs <p id="secondpara" align="center">This is paragraph <b>two</b>.</p>
p2 = soup.find(attrs={'id': 'secondpara'})  ## get the second p node
print p2.previousSibling  ## previous sibling node
## outputs <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
print p2.parent  ## parent node; output too long, omitted here...
print p2.contents[0]  ## first child node, outputs u'This is paragraph '
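The sibling and parent lookups can be tried end to end with the bs4 port (BeautifulSoup 3 itself is Python 2 only). bs4 spells the BS3 attributes nextSibling/previousSibling as next_sibling/previous_sibling; everything else in this sketch works the same way.

```python
from bs4 import BeautifulSoup

html = ('<html><head><title>Page title</title></head>'
        '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
        '<p id="secondpara" align="center">This is paragraph <b>two</b>.</p>'
        '</body></html>')
soup = BeautifulSoup(html, 'html.parser')

p1 = soup.find(attrs={'id': 'firstpara'})
p2 = soup.find(attrs={'id': 'secondpara'})

print(p1.next_sibling.get_text())      # BS3: p1.nextSibling -> the second paragraph
print(p2.previous_sibling.get_text())  # BS3: p2.previousSibling -> the first paragraph
print(p2.parent.name)                  # body
print(p2.contents[0])                  # first child: the text node 'This is paragraph '
```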
Multiple nodes
Change find in the lookups described above to findAll to get back the list of matching nodes; the required parameters are the same.
1. By node name
## find all p nodes
soup.findAll('p')
2. By attribute
## find all nodes with id=firstpara
soup.findAll(attrs={'id': 'firstpara'})
Note that although only one node is found in this example, the return value is still a list object.
These basic lookup functions already cover most situations. If you need advanced lookups, for example with regular expressions, see the official documentation.
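A quick way to convince yourself that findAll always returns a list, again sketched with the bs4 port (where findAll is spelled find_all, though the old name still works):

```python
from bs4 import BeautifulSoup

html = ('<body><p id="firstpara">This is paragraph one.</p>'
        '<p id="secondpara">This is paragraph two.</p></body>')
soup = BeautifulSoup(html, 'html.parser')

all_ps = soup.find_all('p')                       # BS3: soup.findAll('p')
by_id = soup.find_all(attrs={'id': 'firstpara'})  # only one match...

print(len(all_ps))              # 2
print(len(by_id))               # 1
print(isinstance(by_id, list))  # True: ...but still a list, not a single node
```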
Get text
The getText method gets all the text under a node, and you can pass it a string argument that is used to join the text of each child node:
## get the text under the head node
soup.head.getText()  ## u'Page title'
## or
soup.head.text
## get all the text under body, joined with \n
soup.body.getText('\n')  ## u'This is paragraph \none\n.\nThis is paragraph \ntwo\n.'
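Here is the same getText behavior sketched with the bs4 port, where the method is named get_text (getText remains as an alias); the separator argument joins the text of every node under the tag:

```python
from bs4 import BeautifulSoup

html = ('<html><head><title>Page title</title></head>'
        '<body><p>This is paragraph <b>one</b>.</p>'
        '<p>This is paragraph <b>two</b>.</p></body></html>')
soup = BeautifulSoup(html, 'html.parser')

print(soup.head.get_text())  # Page title
# join every text node under body with a newline
print(soup.body.get_text('\n'))
```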
In practice
With these features, the demo from the beginning of the article is easy to build. Let's look at the Douban page again:
http://movie.douban.com/tag/%E5%96%9C%E5%89%A7
To get all of the top 100 movies, we need to extract two kinds of information from this page: 1) the pagination links and 2) the information of each movie (external link, picture, score, introduction, title, and so on).
Once we have extracted the information of all the movies, we sort by score and select the highest. Here is the code for extracting the pagination links and the movie information.
## filename: Grab.py
from BeautifulSoup import BeautifulSoup, Tag
import urllib2
import re
import sys

def LOG(*argv):
    sys.stderr.write(*argv)
    sys.stderr.write('\n')

class Grab():
    url = ''
    soup = None

    def getPage(self, url):
        if url.find('http://', 0, 7) != 0:
            url = 'http://' + url
        self.url = url
        LOG('input url is: %s' % self.url)
        req = urllib2.Request(url, headers={'User-Agent': "Magic Browser"})
        try:
            page = urllib2.urlopen(req)
        except:
            return
        return page.read()

    def extractInfo(self, buf):
        if not self.soup:
            try:
                self.soup = BeautifulSoup(buf)
            except:
                LOG('soup failed in extractInfo: %s' % self.url)
                return
        try:
            items = self.soup.findAll(attrs={'class': 'item'})
        except:
            LOG('failed on find items: %s' % self.url)
            return
        links = []
        objs = []
        titles = []
        scores = []
        comments = []
        intros = []
        for item in items:
            try:
                pic = item.find(attrs={'class': 'nbg'})
                link = pic['href']
                obj = pic.img['src']
                info = item.find(attrs={'class': 'pl2'})
                title = re.sub('[ \t]+', ' ', info.a.getText().replace(' ', '').replace('\n', ''))
                star = info.find(attrs={'class': 'star clearfix'})
                score = star.find(attrs={'class': 'rating_nums'}).getText().replace(' ', '')
                comment = star.find(attrs={'class': 'pl'}).getText().replace(' ', '')
                intro = info.find(attrs={'class': 'pl'}).getText().replace(' ', '')
            except Exception, e:
                LOG('process error in extractInfo: %s' % self.url)
                continue
            links.append(link)
            objs.append(obj)
            titles.append(title)
            scores.append(score)
            comments.append(comment)
            intros.append(intro)
        return (links, objs, titles, scores, comments, intros)

    def extractPageTurning(self, buf):
        links = set([])
        if not self.soup:
            try:
                self.soup = BeautifulSoup(buf)
            except:
                LOG('soup failed in extractPageTurning: %s' % self.url)
                return
        try:
            pageTurning = self.soup.find(attrs={'class': 'paginator'})
            a_nodes = pageTurning.findAll('a')
            for a_node in a_nodes:
                href = a_node['href']
                if href.find('http://', 0, 7) == -1:
                    href = self.url.split('?')[0] + href
                links.add(href)
        except:
            LOG('get pageTurning failed in extractPageTurning: %s' % self.url)
        return links

    def Destroy(self):
        del self.soup
        self.soup = None
Then we write a test example:
## filename: test.py
# encoding: utf-8
from Grab import Grab
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

grab = Grab()
buf = grab.getPage('http://movie.douban.com/tag/comedy?start=160&type=t')
if not buf:
    print 'getPage failed!'
    sys.exit()
links, objs, titles, scores, comments, intros = grab.extractInfo(buf)
for link, obj, title, score, comment, intro in zip(links, objs, titles, scores, comments, intros):
    print link + '\t' + obj + '\t' + title + '\t' + score + '\t' + comment + '\t' + intro
pageTurning = grab.extractPageTurning(buf)
for link in pageTurning:
    print link
grab.Destroy()
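Once extractInfo has returned its parallel lists, picking out the top movies is just a sort by score. A stdlib-only sketch, with hypothetical sample data standing in for the scraped results:

```python
# hypothetical titles and scores, in the same shape extractInfo returns
titles = ['Movie A', 'Movie B', 'Movie C', 'Movie D']
scores = ['8.7', '9.2', '7.9', '8.7']

# scores arrive as strings, so convert to float for the sort key;
# sort descending and keep at most the top 100
ranked = sorted(zip(titles, scores), key=lambda pair: float(pair[1]), reverse=True)
top = ranked[:100]

for title, score in top:
    print(title + '\t' + score)
```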
OK, the rest is up to you.
This article only covers the surface of BeautifulSoup; the goal is to help you quickly pick up a few basic essentials. Whenever I wanted to use some feature, I had to read through the BeautifulSoup source function by function before I understood it, a tale of bitter tears, so I hope later readers can master the basics in a more convenient way. That would also make this article, which I typed out word by word (laying out the code especially hurt my brain), worth the effort.