BeautifulSoup, Python's Sharp Tool for Web Page Analysis: Installation and Usage Introduction


When parsing web pages in Python, you can hardly do without BeautifulSoup. So much for the preface.

Installation

BeautifulSoup 4 has to be installed with easy_install. If you don't need the latest features, version 3 is enough; don't assume the old version must be bad, since it has been used by millions of people. Installation is simple.

The code is as follows:

$ wget "http://www.crummy.com/software/BeautifulSoup/download/3.x/BeautifulSoup-3.2.1.tar.gz"
$ tar zxvf BeautifulSoup-3.2.1.tar.gz
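
As an aside: if you do want version 4, a minimal install sketch, assuming you have setuptools' easy_install (or pip) available; the package name on PyPI is beautifulsoup4.

The code is as follows:

$ easy_install beautifulsoup4
## or, if you prefer pip
$ pip install beautifulsoup4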

Then put the BeautifulSoup.py file into the site-packages directory under your Python installation directory.

site-packages is where Python's third-party packages are stored. Its location differs from system to system, but it can basically always be found in one of the following ways.

The code is as follows:

$ sudo find / -name "site-packages" -maxdepth 5 -type d
$ find ~ -name "site-packages" -maxdepth 5

Of course, if you don't have root permission, search from the current user's home directory instead:

The code is as follows:

$ find ~ -name "site-packages" -maxdepth 5 -type d

If you use a Mac, you're in luck: I can tell you directly that on a Mac this directory is under /Library/Python/. There may be more than one version directory in there; it doesn't matter, just put the file into the site-packages of the latest version. Import it before use.

The code is as follows:

from BeautifulSoup import BeautifulSoup
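
To check that the installation worked, a quick sanity sketch; the version string will be whatever release you installed:

The code is as follows:

$ python -c "import BeautifulSoup; print BeautifulSoup.__version__"
3.2.1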

Use

Before using it, let's look at an example. Suppose you are given a page like this:

http://movie.douban.com/tag/%E5%96%9C%E5%89%A7

It is Douban's list of films tagged as comedies. If you wanted to find the 100 highest-rated films among them, how would you do it?
Okay, let me first show what I made. Given that my CSS is still at the beginner stage and I have no artistic genes to speak of, the interface I built is merely presentable, so please don't mock it.

Next we'll learn some basic BeautifulSoup methods; with them, a page like the one above becomes easy to handle.

Since the Douban page is fairly complex, let's start with a simple example. Suppose we are dealing with the following page code.

The code is as follows:

<html>
<head>
<title>Page title</title>
</head>
<body>
<p id="firstpara" align="center">
This is paragraph
<b>
one
</b>
.
</p>
<p id="secondpara" align="blah">
This is paragraph
<b>
two
</b>
.
</p>
</body>
</html>

That's right, this is the example from the official documentation. If you are patient, reading the official documentation alone is enough, and you can skip the rest of this article:
http://www.leeon.me/upload/other/beautifulsoup-documentation-zh.html

Initialization

First, assign the above HTML code to a variable html, as follows. To make it easy to copy, the version here has no line breaks; the line-broken version above is just so you can see the HTML structure clearly.

The code is as follows:

html = '<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is paragraph<b>one</b>.</p><p id="secondpara" align="blah">This is paragraph<b>two</b>.</p></body></html>'

Initialize it as follows:

The code is as follows:

soup = BeautifulSoup(html)

We know that HTML code can be seen as a tree. This operation parses the HTML code into a tree-shaped data structure and stores it in soup. Note that the root node of this data structure is not <html> but soup itself, whose only child is <html>.
The code is as follows:

print soup
print soup.contents[0]
print soup.contents[1]

The first two outputs are identical, the entire HTML document; the third raises an error: IndexError: list index out of range.
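
A small sketch of walking this tree, just to make the structure concrete (tag attributes and child lists are standard BeautifulSoup 3 features):

The code is as follows:

print soup.contents[0].name ## u'html', the single child of soup
print soup.html.body.contents[0]['id'] ## u'firstpara', the first p's id attribute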

Find nodes

Node lookup comes in two forms: one returns a single node, the other returns a list of nodes. The corresponding lookup functions are find and findAll.

Single node

1. By tag name

The code is as follows:

## find the head node
print soup.find('head') ## output: <head><title>Page title</title></head>
## or
## head = soup.head

This kind of lookup finds the node nearest to the node you start the search from. Here we start from soup, so it finds the head closest to soup (in case there is more than one).
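
A small sketch of that behavior; a lookup started from a subtree only sees that subtree:

The code is as follows:

print soup.find('b') ## <b>one</b>, the first match in document order
p2 = soup.find(attrs={'id': 'secondpara'})
print p2.find('b') ## <b>two</b>, searched only within the second p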

2. By attribute

The code is as follows:

## find the node whose id attribute is firstpara
print soup.find(attrs={'id': 'firstpara'})
## output: <p id="firstpara" align="center">This is paragraph<b>one</b>.</p>
## you can also combine tag name and attributes
print soup.find('p', attrs={'id': 'firstpara'}) ## same output as above

3. By node relationship

Node relationships are just sibling relationships and parent-child relationships.

The code is as follows:

p1 = soup.find(attrs={'id': 'firstpara'}) ## get the first p node
print p1.nextSibling ## next sibling node
## output: <p id="secondpara" align="blah">This is paragraph<b>two</b>.</p>
p2 = soup.find(attrs={'id': 'secondpara'}) ## get the second p node
print p2.previousSibling ## previous sibling node
## output: <p id="firstpara" align="center">This is paragraph<b>one</b>.</p>
print p2.parent ## parent node; the output is too long, the <body>...</body> part is omitted here
print p2.contents[0] ## first child node, outputs u'This is paragraph'
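
Besides these navigation properties, BeautifulSoup 3 also has search variants of the same relationships, such as findNextSibling, findPreviousSibling and findParent, which take the same matching arguments as find. A small sketch:

The code is as follows:

print p1.findNextSibling('p') ## same node as p1.nextSibling here
print p2.findParent('body').name ## u'body'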

Multiple nodes

Change find to findAll and you get back the list of all matching nodes; the required parameters are the same.

1. By tag name

The code is as follows:

## find all p nodes
soup.findAll('p')

2. By attribute

The code is as follows:

## find all nodes whose id is firstpara
soup.findAll(attrs={'id': 'firstpara'})

Note that although only one node matches in this example, findAll still returns a list object.
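
A quick sketch (in BeautifulSoup 3 the returned object is a ResultSet, a list subclass):

The code is as follows:

nodes = soup.findAll(attrs={'id': 'firstpara'})
print len(nodes) ## 1
print nodes[0] ## the matching p node itself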

These basic lookup functions already cover most situations; if you need more advanced lookups, for example with regular expressions, see the official documentation.
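
For instance, BeautifulSoup 3 accepts a compiled regular expression in place of an attribute value. A small sketch:

The code is as follows:

import re
## find all nodes whose id ends with 'para'
print soup.findAll(attrs={'id': re.compile('para$')})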

Get text

The getText method gets all the text under a node; you can pass it a separator argument that will be inserted between the text of adjacent nodes.

The code is as follows:

## get the text under the head node
soup.head.getText() ## u'Page title'
## or
soup.head.text
## get all the text under body, separated by \n
soup.body.getText('\n') ## u'This is paragraph\none\n.\nThis is paragraph\ntwo\n.'
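
Note that without a separator argument the text fragments are joined together directly, which is exactly why the separator is often needed. A small sketch:

The code is as follows:

print soup.body.getText() ## u'This is paragraphone.This is paragraphtwo.'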

Putting it into practice

With these features, the demo shown at the beginning of the article is easy to build. Let's look at that Douban page again:
http://movie.douban.com/tag/%E5%96%9C%E5%89%A7
If you want the top 100 movies out of all of them, you need to extract two kinds of information from this page: 1. the paging links; 2. the information for each movie (link, poster image, rating, introduction, title, and so on).
Once all the movie information is extracted, we sort it by rating and pick the highest. Here I only post the code for page turning and movie information extraction.

The code is as follows:

## filename: Grab.py
from BeautifulSoup import BeautifulSoup, Tag
import urllib2
import re
import sys

def LOG(*argv):
    sys.stderr.write(*argv)
    sys.stderr.write('\n')

class Grab():
    url = ''
    soup = None

    def getPage(self, url):
        if url.find('http://', 0, 7) != 0:
            url = 'http://' + url
        self.url = url
        LOG('input url is: %s' % self.url)
        req = urllib2.Request(url, headers={'User-Agent': "Magic Browser"})
        try:
            page = urllib2.urlopen(req)
        except:
            return
        return page.read()

    def extractInfo(self, buf):
        if not self.soup:
            try:
                self.soup = BeautifulSoup(buf)
            except:
                LOG('soup failed in extractInfo: %s' % self.url)
                return
        try:
            items = self.soup.findAll(attrs={'class': 'item'})
        except:
            LOG('failed on find items: %s' % self.url)
            return
        links = []
        objs = []
        titles = []
        scores = []
        comments = []
        intros = []
        for item in items:
            try:
                pic = item.find(attrs={'class': 'nbg'})
                link = pic['href']
                obj = pic.img['src']
                info = item.find(attrs={'class': 'pl2'})
                ## collapse whitespace in the title
                title = re.sub('[ \t]+', ' ', info.a.getText().replace('\n', ''))
                star = info.find(attrs={'class': 'star clearfix'})
                score = star.find(attrs={'class': 'rating_nums'}).getText().replace(' ', '')
                comment = star.find(attrs={'class': 'pl'}).getText().replace(' ', '')
                intro = info.find(attrs={'class': 'pl'}).getText().replace(' ', '')
            except Exception, e:
                LOG('process error in extractInfo: %s' % self.url)
                continue
            links.append(link)
            objs.append(obj)
            titles.append(title)
            scores.append(score)
            comments.append(comment)
            intros.append(intro)
        return (links, objs, titles, scores, comments, intros)

    def extractPageTurning(self, buf):
        links = set([])
        if not self.soup:
            try:
                self.soup = BeautifulSoup(buf)
            except:
                LOG('soup failed in extractPageTurning: %s' % self.url)
                return
        try:
            pageTurning = self.soup.find(attrs={'class': 'paginator'})
            a_nodes = pageTurning.findAll('a')
            for a_node in a_nodes:
                href = a_node['href']
                ## relative paging links need the base URL prepended
                if href.find('http://', 0, 7) == -1:
                    href = self.url.split('?')[0] + href
                links.add(href)
        except:
            LOG('get pageTurning failed in extractPageTurning: %s' % self.url)
        return links

    def destroy(self):
        del self.soup
        self.soup = None

Then we write a test script.

The code is as follows:

## filename: test.py
#encoding: utf-8
from Grab import Grab
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

grab = Grab()
buf = grab.getPage('http://movie.douban.com/tag/comedy?start=160&type=T')
if not buf:
    print 'getPage failed!'
    sys.exit()
links, objs, titles, scores, comments, intros = grab.extractInfo(buf)
for link, obj, title, score, comment, intro in zip(links, objs, titles, scores, comments, intros):
    print link + '\t' + obj + '\t' + title + '\t' + score + '\t' + comment + '\t' + intro
pageTurning = grab.extractPageTurning(buf)
for link in pageTurning:
    print link
grab.destroy()
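
The sorting step described above isn't posted here, so a minimal sketch of it follows, assuming the lists returned by extractInfo; to really rank the top 100 you would first accumulate results from every page returned by extractPageTurning.

The code is as follows:

## a hypothetical final step: sort the collected movies by score, highest first
movies = zip(links, objs, titles, scores, comments, intros)
movies.sort(key=lambda m: float(m[3]), reverse=True)
for link, obj, title, score, comment, intro in movies[:100]:
    print title + '\t' + score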

OK, with this done, the rest you can try yourself.
This article only scratches the surface of BeautifulSoup; the aim is to help everyone pick up the basic essentials quickly. Back when I started, whenever I wanted to use some feature I had to read through the BeautifulSoup source code function by function, a tale of bitter tears. So I hope this gives you a more convenient way to master the basic functions, and that typing this article out word by word was not in vain; formatting the code, in particular, was a real headache.
