2013-07-30 22:54 by Lake
Beautiful Soup is an HTML/XML parser written in Python. It copes well with nonstandard markup and builds a parse tree from it. It is typically used to analyze web pages fetched by crawlers. For irregular HTML documents it offers many helper functions that save developers time and effort.
Beautiful Soup's official documentation is thorough, and working through the official examples once is enough to get comfortable with the library. The official documentation is available in English and in Chinese.
I. Installing Beautiful Soup
Installing BeautifulSoup is simple: download the BeautifulSoup source code, unpack it, and run
python setup.py install
To check that the installation succeeded, type import BeautifulSoup in the Python interpreter. If no exception is raised, the installation was successful.
II. Using BeautifulSoup
1. Import BeautifulSoup and create a BeautifulSoup object
from BeautifulSoup import BeautifulSoup        # HTML
from BeautifulSoup import BeautifulStoneSoup   # XML
import BeautifulSoup                           # all
doc = ['
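The listing above is cut off before the sample document is defined. The sketch below shows one way to build a soup object, under two assumptions that are not from the post: it uses the newer bs4 package (Beautiful Soup 4) rather than the BeautifulSoup 3 module, and the sample document is reconstructed from the outputs shown later in these notes.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4, not the BeautifulSoup 3 module

# Assumed sample document, reconstructed from outputs shown later in the post.
doc = [
    '<html><head><title>Page title</title></head>',
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>',
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>',
    '</body></html>',
]

# "html.parser" is the parser shipped with the Python standard library.
soup = BeautifulSoup("".join(doc), "html.parser")
print(soup.title)
```

Naming the parser explicitly keeps the result the same on machines that have different optional parsers installed.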
2. Introduction to BeautifulSoup Objects
BeautifulSoup parses an HTML document into a tree, much like a DOM tree. The tree is made up of three basic kinds of object.
2.1. Soup: BeautifulSoup.BeautifulSoup
>>> type(soup)
<class 'BeautifulSoup.BeautifulSoup'>
2.2. Tag: BeautifulSoup.Tag
>>> type(soup.html)
<class 'BeautifulSoup.Tag'>
2.3. Text: BeautifulSoup.NavigableString
>>> type(soup.title.string)
<class 'BeautifulSoup.NavigableString'>
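The same three kinds of object can be checked in code. This is a minimal sketch that assumes the bs4 package (Beautiful Soup 4), where the classes live in the bs4 namespace instead of the BeautifulSoup module; the one-line markup is made up for illustration.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed
from bs4.element import NavigableString, Tag

# Hypothetical minimal document, just enough to show each object type.
soup = BeautifulSoup(
    "<html><head><title>Page title</title></head></html>", "html.parser"
)

print(type(soup).__name__)                              # the soup object
print(isinstance(soup.html, Tag))                       # a tag
print(isinstance(soup.title.string, NavigableString))   # a piece of text
```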
3. BeautifulSoup Parse Tree
3.1 BeautifulSoup.Tag Object Methods
Getting a Tag object
To get a tag, use its name as an attribute of the soup object; this returns the Tag object. This is handy when the tag is unique in the document. You can also follow the structure of the tree and select one level at a time.
>>> html = soup.html
>>> html
The contents attribute
contents follows the document tree and returns a list of a node's direct children (Tag and NavigableString objects).
>>> soup.contents[
>>> soup.contents[0].contents[
Use contents to walk down the tree (toward the leaves) and parent to walk back up (toward the root).
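Walking down with contents and back up with parent can be sketched as follows. The example assumes the bs4 package (Beautiful Soup 4) and a sample document reconstructed from the outputs in these notes; neither is taken verbatim from the post.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

# Assumed sample document (the post's own listing is truncated).
html = (
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# Walk down: contents lists the direct children of a node.
html_tag = soup.contents[0]
child_names = [child.name for child in html_tag.contents]
print(child_names)  # ['head', 'body']

# Walk up: parent returns the enclosing node.
title = soup.title
print(title.parent.name)         # head
print(title.parent.parent.name)  # html
```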
The next attribute
next returns the element that follows in document order, which may be either a Tag object or a NavigableString object.
>>> head.next
<title>Page title</title>
>>> head.next.next
u'Page title'
>>> p1 = soup.p
>>> p1
<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
>>> p1.next
u'This is paragraph '
nextSibling returns the next sibling node, which may be a Tag object or a NavigableString object.
>>> head.nextSibling
<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p><p id="secondpara" align="blah">This is paragraph <b>two</b>.</p></body>
>>> p1.next.nextSibling
<b>one</b>
previousSibling is the counterpart of nextSibling: it returns the previous sibling node.
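A runnable sketch of the same navigation, assuming the bs4 package (Beautiful Soup 4), where these attributes are spelled next_element, next_sibling and previous_sibling. The sample document is an assumption reconstructed from the outputs above.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

# Assumed sample document (the post's own listing is truncated).
html = (
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(html, "html.parser")

head = soup.head
p1 = soup.p

# next_element follows document order, descending into children first.
print(head.next_element)               # the <title> tag
print(head.next_element.next_element)  # the text inside <title>

# next_sibling / previous_sibling stay on the same level of the tree.
print(head.next_sibling.name)            # body
print(p1.next_element.next_sibling)      # the <b> tag after the leading text
print(soup.body.previous_sibling.name)   # head
```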
The replaceWith method
replaceWith replaces an object with another one; it accepts a string argument.
>>> head = soup.head
>>> head
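The transcript above stops before the replacement itself. As an illustration only, here is a minimal sketch of replacing a tag with a plain string. It assumes the bs4 package (Beautiful Soup 4), where the method is spelled replace_with, and the one-paragraph markup is a made-up example rather than the post's own.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

# Hypothetical one-paragraph document for illustration.
soup = BeautifulSoup(
    '<p id="firstpara">This is paragraph <b>one</b>.</p>', "html.parser"
)

bold = soup.b
bold.replace_with("1")  # swap the <b> tag for a plain string

print(soup.p)
print(soup.p.get_text())
```

After the call, the replaced tag is detached from the tree, so soup.b no longer finds it.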
Search methods
Two search methods are provided: find and findAll. Both are available only on Tag objects and on the top-level soup object; NavigableString objects do not have them.
findAll(name, attrs, recursive, text, limit, **kwargs)
Pass a single argument, the tag name
Find all p tags in the document; a list is returned
>>> soup.findAll('p')
[<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>, <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]
>>> type(soup.findAll('p'))
<type 'list'>
Find the p tag with id="firstpara"; a result set is returned
>>> pid = type(soup.findAll('p', id='firstpara'))
>>> pid
<class 'BeautifulSoup.ResultSet'>
Pass one or more attribute/value pairs as a dictionary
>>> p2 = soup.findAll('p', {'align': 'blah'})
>>> p2
[<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]
>>> type(p2)
<class 'BeautifulSoup.ResultSet'>
Using regular expressions
>>> import re
>>> soup.findAll(id=re.compile("para$"))
[<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>, <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]
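The four search styles above can be gathered into one runnable sketch. It assumes the bs4 package (Beautiful Soup 4), where findAll is spelled find_all, and a sample document reconstructed from the outputs in these notes.

```python
import re

from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

# Assumed sample document (the post's own listing is truncated).
html = (
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("p")                       # by tag name
by_keyword = soup.find_all("p", id="firstpara")    # by keyword argument
by_attrs = soup.find_all("p", {"align": "blah"})   # by attribute dictionary
by_regex = soup.find_all(id=re.compile("para$"))   # by regular expression

# find returns only the first match (or None) instead of a list.
first = soup.find("p")

print([p["id"] for p in by_name])
print([p["id"] for p in by_attrs])
print(first["id"])
```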
Reading and modifying attributes
>>> p1 = soup.p
>>> p1
<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
>>> p1['id']
u'firstpara'
>>> p1['id'] = 'changeid'
>>> p1
<p id="changeid" align="center">This is paragraph <b>one</b>.</p>
>>> p1['class'] = 'new class'
>>> p1
<p id="changeid" align="center" class="new class">This is paragraph <b>one</b>.</p>
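Reading and writing attributes works like a dictionary on the tag. A minimal sketch, assuming the bs4 package (Beautiful Soup 4) and a one-paragraph document made up for illustration:

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

# Hypothetical one-paragraph document for illustration.
soup = BeautifulSoup(
    '<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>',
    "html.parser",
)
p1 = soup.p

original_id = p1["id"]       # read an attribute
p1["id"] = "changeid"        # overwrite an existing attribute
p1["class"] = "new class"    # add a new attribute
missing = p1.get("title")    # get() returns None when the attribute is absent

print(original_id)
print(p1)
```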
These are the basic methods for working with the parse tree. There are others, along with more ways to match using regular expressions; see the official documentation for details.
3.2 BeautifulSoup.NavigableString Object Methods
NavigableString objects are simpler to work with: you just read their content.
>>> soup.title
<title>Page title</title>
>>> title = soup.title.next
>>> title
u'Page title'
>>> type(title)
<class 'BeautifulSoup.NavigableString'>
>>> title.string
u'Page title'
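For comparison, a short sketch of reading a text node, assuming the bs4 package (Beautiful Soup 4) on Python 3, where NavigableString is a subclass of str. The markup is a made-up minimal example.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed
from bs4.element import NavigableString

# Hypothetical minimal document for illustration.
soup = BeautifulSoup(
    "<html><head><title>Page title</title></head></html>", "html.parser"
)

title_text = soup.title.string   # the text node inside <title>
plain = str(title_text)          # convert to an ordinary string

print(isinstance(title_text, NavigableString))
print(plain)
print(title_text.parent.name)    # text nodes still know their parent tag
```

Converting to a plain str is useful when the text should outlive the parse tree, since a NavigableString keeps a reference to the tree it came from.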
For more on traversing the tree and analyzing documents, and for parsing XML documents, refer to the official documentation.
Python BeautifulSoup Simple Notes