Python BeautifulSoup Simple Notes

Source: Internet
Author: User
Tags tag name xml parser

2013-07-30 22:54 by Lake, 2359 Read, 0 reviews, Favorites, compilation

Beautiful Soup is a html/xml parser written in Python that can handle nonstandard tags and generate parse trees very well. Typically used to analyze Web documents crawled by crawlers. For irregular HTML documents, there are many complementary functions, saving developers time and effort.

Beautiful Soup's official documentation is complete, and the official examples can be mastered once and for all. Official English documents, Chinese documents

One installation Beautiful Soup

Install BeautifulSoup is very simple, download BeautifulSoup source code. Decompression Run

Python setup.py install.

Test whether the installation was successful. Type Import BeautifulSoup If there is no exception, the installation is successful

Two use BeautifulSoup

1. Import BeautifulSoup, create BeautifulSoup Object

From BeautifulSoup import beautifulsoup           # htmlfrom BeautifulSoup import beautifulstonesoup      # XmlImport BeautifulSoup                              # all                                                                                               doc = [    ' 

2. Introduction to BeautifulSoup Objects

When parsing an HTML document with BeautifulSoup, BeautifulSoup handles the HTML document like a DOM document tree. There are three basic objects of the BeautifulSoup document tree.

2.1. Soup Beautifulsoup.beautifulsoup

Type (soup) <class ' Beautifulsoup.beautifulsoup ' >

2.2. Mark Beautifulsoup.tag

Type (soup.html) <class ' Beautifulsoup.tag ' >

2.3 Text beautifulsoup.navigablestring

Type (soup.title.string) <class ' beautifulsoup.navigablestring ' >

3. BeautifulSoup Parse Tree

3.1 Beautifulsoup.tag Object Methods

Get tag object (tag)

Tag name acquisition method, directly with the Soup object tag name, return the tag object. This way, it is useful to choose a unique label. or according to the structure of the tree to choose, a layer of choice

>>> html = soup.html>>> html

Content methods

The content method searches according to the document tree, returning a list of Tag objects (tag)

>>> Soup.contents[
>>> soup.contents[0].contents[

Use contents the backward traversal tree, using the parent forward traversal tree

Next method

Gets the descendant elements of the tree, including the Tag object and the Navigablestring object ...

>>> head.next<title>page title</title>>>> head.next.nextu ' Page title '
>>> p1 = soup.p>>> p1<p id= "Firstpara" align= "center" >this is Paragraph<b>one</b>. </p>>>> P1.nextu ' This is paragraph '

NextSibling next Sibling object includes the Tag object and the Navigablestring object

>>> head.nextsibling<body><p id= "Firstpara" align= "center" >this is paragraph<b>one</b >.</p><p id= "Secondpara" align= "blah" >this is paragraph<b>two</b>.</p></body >>>> p1.next.nextsibling<b>one</b>

Similar to nextSibling is the previoussibling, which is the previous sibling node.

ReplaceWith method

Replace object with, accept string argument

>>> head = soup.head>>> head

Search method

Search offers two methods, one is find and one is findall. The two methods here (FindAll and find) are valid only for the tag object and the top profile object, but navigablestring is not available.

findAll(Name, Attrs, recursive, text, limit, **kwargs)

Accept a parameter, sign the

Look for all P tags in the document and return a list

>>> soup.findall (' P ') [<p id= "Firstpara" align= "center" >this is paragraph<b>one</b>.</ P>, <p id= "Secondpara" align= "blah" >this is paragraph<b>two</b>.</p>]>>> type ( Soup.findall (' P ')) <type ' list ' >

Look for the P tag of id= "Secondpara", return a result set

>>> pid = Type (Soup.findall (' P ', id= ' Firstpara ') >>> pid<class ' Beautifulsoup.resultset ' >

Pass a property or multiple property pairs

>>> P2 = soup.findall (' p ', {' align ': ' Blah '}) >>> p2[<p id= "Secondpara" align= "blah" >this is Paragraph<b>two</b>.</p>]>>> type (p2) <class ' Beautifulsoup.resultset ' >

Using regular expressions

>>> Soup.findall (Id=re.compile ("para$")) [<p id= "Firstpara" align= "center" >this is paragraph<b> One</b>.</p>, <p id= "Secondpara" align= "blah" >this is paragraph<b>two</b>.</p>]

Reading and modifying properties

>>> p1 = soup.p>>> p1<p id= "Firstpara" align= "center" >this is Paragraph<b>one</b>. </p>>>> p1[' id ']u ' firstpara ' >>> p1[' id '] = ' Changeid ' >>> p1<p id= ' Changeid ' align = "Center" >this is paragraph<b>one</b>.</p>>>> p1[' class ') = ' new class ' >>> P1 <p id= "Changeid" align= "center" class= "new Class" >this is Paragraph<b>one</b>.</p>>> >

The basic methods of parsing trees are these, and others, and how to match regular expressions. Please see the official documentation for details.

3.2 Beautifulsoup.navigablestring Object Methods

Navigablestring object method is relatively simple, get its contents

>>> soup.title<title>page title</title>>>> title = soup.title.next>>> Titleu ' Page title ' >>> type (title) <class ' beautifulsoup.navigablestring ' >>>> Title.stringu ' Page Title

As for how to traverse the tree, and then analyze the document, the XML document analysis method, can refer to the official document.

Python BeautifulSoup Simple Notes

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.