Python web parsing tool BeautifulSoup: installation and usage introduction

When you parse web pages in Python, there is no getting around BeautifulSoup. Consider this the preface.

Installation

BeautifulSoup 4 has to be installed with easy_install. If you don't need the latest features, version 3 is enough; don't assume the old version must be bad: tens of thousands of people are still using it. Installation is simple:


$ wget "http://www.crummy.com/software/BeautifulSoup/download/3.x/BeautifulSoup-3.2.1.tar.gz"
$ tar zxvf beautifulsoup-3.2.1.tar.gz


Then put BeautifulSoup.py into the site-packages directory under your Python installation.
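As an aside, if you decide you want version 4 after all, it is published on PyPI under the name beautifulsoup4, so, assuming you have easy_install set up, a single command replaces the manual copy above:

$ easy_install beautifulsoup4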

site-packages is where Python keeps third-party packages. Where this directory lives differs from system to system, but the following commands will usually find it:


$ sudo find / -name "site-packages" -maxdepth 5 -type d
$ find ~ -name "site-packages" -maxdepth 5


Of course, if you don't have root permission, just search under the current user's home directory:


$ find ~ -name "site-packages" -maxdepth 5 -type d


If you are using a Mac, you're in luck: I can tell you directly that on a Mac it is under /Library/Python/. There may be more than one version directory in there; never mind, just drop the file into the latest version's site-packages. Before using it, import it first:


from BeautifulSoup import BeautifulSoup
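To confirm which version you ended up with, you can print the module's version string (BS3 exposes a module-level __version__ attribute); a quick sanity check:

import BeautifulSoup
print BeautifulSoup.__version__ ## should print something like 3.2.1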

Usage

Before we start using it, let's look at an example. Say you are given a page like this:


http://movie.douban.com/tag/%E5%96%9C%E5%89%A7


It lists the comedy movies under that tag on Douban. If you want to pick out the top 100 of them, what do you do?
OK, let me show off my own result first. Given that I'm still a CSS novice with no artistic talent whatsoever, the interface is merely presentable; please don't complain.

Now let's learn some basic BeautifulSoup methods; with them, a page like the one above becomes easy to handle.

Given how complex the Douban page is, let's start with a simple example and assume we are working with the following page code:



<html>
<head><title>Page title</title></head>
<body>
<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
<p id="secondpara" align="center">This is paragraph <b>two</b>.</p>
</body>
</html>





You guessed it: this is the example from the official documentation. If you have the patience, reading the official documentation is enough and you can skip the rest of this article:
http://www.leeon.me/upload/other/beautifulsoup-documentation-zh.html

Initialization

First, assign the HTML code above to a variable named html, as follows. To make it easy to copy, the version here has no line breaks; the version above, with line breaks, lets you see the HTML structure clearly:


html = '<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p><p id="secondpara" align="center">This is paragraph <b>two</b>.</p></body></html>'


Initialize as follows:


soup = BeautifulSoup(html)


We know that HTML can be regarded as a tree. This operation parses the HTML code into a tree-shaped data structure and stores it in soup. Note that the root of that structure is not the <html> tag but soup itself; the <html> tag is soup's only child node. If you don't believe it, try the following:


print soup
print soup.contents[0]
print soup.contents[1]


The first two print the same thing, namely the entire HTML document; the third raises IndexError: list index out of range.
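A quick way to see that the <html> tag really is soup's only child is to inspect contents directly; a small sketch against our sample html:

print len(soup.contents) ## 1, soup has exactly one child
print soup.contents[0].name ## 'html', and that child is the html tag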

Find nodes

Lookups come in two forms: one returns a single node, the other returns a list of nodes. The corresponding functions are find and findAll, respectively.

Single node

1. By node name


## find the head node
print soup.find('head') ## outputs <head><title>Page title</title></head>
## or equivalently:
## head = soup.head

This method returns the matching node closest to the node it is called on; here it is called on soup, so it finds the head closest to soup (if there were several).
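The same applies when find is called on an inner node: the search starts from that node and descends. A small sketch on our sample html:

body = soup.body ## shorthand for soup.find('body')
print body.find('b') ## <b>one</b>, the first <b> below body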

2. By attribute


## find the node whose id attribute is firstpara
print soup.find(attrs={'id': 'firstpara'})
## outputs <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>

## node name and attributes can also be combined
print soup.find('p', attrs={'id': 'firstpara'}) ## same output as above

3. By node relationship

Node relationships are nothing more than siblings and parents/children:


p1 = soup.find(attrs={'id': 'firstpara'}) ## get the first p node
print p1.nextSibling ## its next sibling
## outputs <p id="secondpara" align="center">This is paragraph <b>two</b>.</p>

p2 = soup.find(attrs={'id': 'secondpara'}) ## get the second p node
print p2.previousSibling ## its previous sibling
## outputs <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>

print p2.parent ## its parent node; output too long, omitted here...
print p2.contents[0] ## its first child node, outputs u'This is paragraph '
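BS3 also has method forms of these relations (findNextSibling, findPreviousSibling, findParent), which accept the same name/attrs filters as find; a small sketch reusing p1 and p2 from above:

print p1.findNextSibling('p') ## same node as p1.nextSibling here
print p2.findParent('body').name ## 'body'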

Multiple nodes

Change find in the lookups above to findAll to get back the list of all matching nodes; the parameters are the same.

1. By node name


## find all p nodes
soup.findAll('p')


2. By attribute


## find all nodes with id=firstpara
soup.findAll(attrs={'id': 'firstpara'})

Note that although this example matches only one node, the return value is still a list.
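Because the result is a plain Python list, the usual list operations apply; a small sketch:

ps = soup.findAll('p')
print len(ps) ## 2
print ps[0]['id'] ## u'firstpara', attribute access on the first match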

These basic search functions already cover most situations. If you need advanced lookups, for example with regular expressions, see the official documentation.
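As a taste of those advanced lookups, BS3 accepts a compiled regular expression in place of a node name or an attribute value; a quick sketch against our sample html:

import re
print soup.findAll(re.compile('^p')) ## every tag whose name starts with 'p'
print soup.findAll('p', attrs={'id': re.compile('para$')}) ## p tags whose id ends in 'para'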

Get text

The getText method collects all of the text under a node; you can pass a separator character that will be placed between the text of each child node:


## get the text under the head node
soup.head.getText() ## u'Page title'
## or
soup.head.text
## get all the text under body, joined with \n
soup.body.getText('\n') ## u'This is paragraph\none\n.\nThis is paragraph\ntwo\n.'
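If you want the individual text nodes rather than one joined string, find and findAll also take a text argument; a sketch, assuming BS3's text=True behavior of returning all text nodes as a list:

print soup.body.findAll(text=True)
## [u'This is paragraph ', u'one', u'.', u'This is paragraph ', u'two', u'.']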

Putting it into practice

With these functions, the demo shown at the beginning of the article is easy to build. Let's look at the Douban page again:
http://movie.douban.com/tag/%E5%96%9C%E5%89%A7
To get the top 100 movies, we need to extract two kinds of information from this page: 1. the pagination links; 2. the information for each movie (link, poster image, rating, introduction, title, and so on).
Once the information for all the movies has been extracted, we sort by rating and pick the highest. Here is the code for following the pagination and extracting the movie information:


## filename: Grab.py
from BeautifulSoup import BeautifulSoup, Tag
import urllib2
import re
import sys

def LOG(*argv):
    sys.stderr.write(*argv)
    sys.stderr.write('\n')

class Grab():
    url = ''
    soup = None

    def getPage(self, url):
        if url.find('http://', 0, 7) != 0:
            url = 'http://' + url
        self.url = url
        LOG('input url is: %s' % self.url)
        req = urllib2.Request(url, headers={'User-Agent': "Magic Browser"})
        try:
            page = urllib2.urlopen(req)
        except:
            return
        return page.read()

    def extractInfo(self, buf):
        if not self.soup:
            try:
                self.soup = BeautifulSoup(buf)
            except:
                LOG('soup failed in extractInfo: %s' % self.url)
                return
        try:
            items = self.soup.findAll(attrs={'class': 'item'})
        except:
            LOG('failed on find items: %s' % self.url)
            return
        links = []
        objs = []
        titles = []
        scores = []
        comments = []
        intros = []
        for item in items:
            try:
                pic = item.find(attrs={'class': 'nbg'})
                link = pic['href']
                obj = pic.img['src']
                info = item.find(attrs={'class': 'pl2'})
                title = re.sub('[ \t]+', ' ', info.a.getText().replace('\n', ''))
                star = info.find(attrs={'class': 'star clearfix'})
                score = star.find(attrs={'class': 'rating_nums'}).getText().replace(' ', '')
                comment = star.find(attrs={'class': 'pl'}).getText().replace(' ', '')
                intro = info.find(attrs={'class': 'pl'}).getText().replace(' ', '')
            except Exception, e:
                LOG('process error in extractInfo: %s' % self.url)
                continue
            links.append(link)
            objs.append(obj)
            titles.append(title)
            scores.append(score)
            comments.append(comment)
            intros.append(intro)
        return (links, objs, titles, scores, comments, intros)

    def extractPageTurning(self, buf):
        links = set([])
        if not self.soup:
            try:
                self.soup = BeautifulSoup(buf)
            except:
                LOG('soup failed in extractPageTurning: %s' % self.url)
                return
        try:
            pageTurning = self.soup.find(attrs={'class': 'paginator'})
            a_nodes = pageTurning.findAll('a')
            for a_node in a_nodes:
                href = a_node['href']
                if href.find('http://', 0, 7) == -1:
                    href = self.url.split('?')[0] + href
                links.add(href)
        except:
            LOG('get pageTurning failed in extractPageTurning: %s' % self.url)

        return links

    def destroy(self):
        del self.soup
        self.soup = None

Then we write a test script:


## filename: test.py
# encoding: utf-8
from Grab import Grab
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

grab = Grab()
buf = grab.getPage('http://movie.douban.com/tag/%E5%96%9C%E5%89%A7?start=160&type=t')
if not buf:
    print 'getPage failed!'
    sys.exit()
links, objs, titles, scores, comments, intros = grab.extractInfo(buf)
for link, obj, title, score, comment, intro in zip(links, objs, titles, scores, comments, intros):
    print link + '\t' + obj + '\t' + title + '\t' + score + '\t' + comment + '\t' + intro
pageTurning = grab.extractPageTurning(buf)
for link in pageTurning:
    print link
grab.destroy()
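The plan was to sort by rating and keep the best; the test above only prints. A minimal sketch of that last step, reusing the lists returned by extractInfo (the variable names below are mine, not from the original code; entries with an unparsable score are skipped):

movies = []
for link, title, score in zip(links, titles, scores):
    try:
        movies.append((float(score), title, link))
    except ValueError: ## some entries have no rating
        continue
movies.sort(reverse=True) ## highest rating first
for score, title, link in movies[:100]: ## keep the top 100
    print '%s\t%s\t%s' % (score, title, link)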

OK, from here on you can take it the rest of the way yourself.
This article only scratches the surface of BeautifulSoup; the goal is to let everyone quickly pick up the basic essentials. I had to learn which function did what by reading the BeautifulSoup source function by function, a tale of bitter tears, so I hope readers who come after me can master the basics in a more convenient way. Then typing this article out word by word, especially laying out all this code, won't have been in vain.
