1. Install BS4
I use Ubuntu 14.04, so I simply install it with the apt-get command.
sudo apt-get install python-bs4
2. Install the parser
Beautiful Soup supports the HTML parser in the Python standard library and also supports some third-party parsers, one of which is lxml.
sudo apt-get install python-lxml
3. How to use
Pass a document to the BeautifulSoup constructor to get a document object. The document can be either a string or a file handle.
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("index.html"))
soup = BeautifulSoup("<html>data</html>")
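The constructor also accepts a parser name as its second argument. Here is a minimal sketch, assuming the standard-library parser "html.parser" (pass "lxml" instead if you installed it in step 2). The markup string is made up purely for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
markup = "<html><head><title>Demo</title></head><body><p>data</p></body></html>"

# Naming the parser explicitly avoids depending on whichever parser
# happens to be installed on the machine.
soup = BeautifulSoup(markup, "html.parser")

title_text = soup.title.string
paragraph_text = soup.p.string
print(title_text)
print(paragraph_text)
```

When reading from a file instead, opening it in a with block makes sure the handle is closed after parsing.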
4. Types of objects
Beautiful Soup transforms a complex HTML document into a tree structure in which each node is a Python object. All of these objects fall into 4 types: Tag, NavigableString, BeautifulSoup, and Comment.
Tag
A Tag object corresponds to a tag in the original XML or HTML document:
soup = BeautifulSoup('<b class="boldest">extremely bold</b>')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>
Each tag has a name, which is available as .name:
tag.name
# u'b'
A tag may have many attributes. You can read them like dictionary entries, or get all of them at once through .attrs. Note that class is a multi-valued attribute, so its value is a list.
tag['class']
# [u'boldest']
tag.attrs
# {u'class': [u'boldest']}
NavigableString
Strings are often contained within tags. Beautiful Soup wraps such a string in a NavigableString object.
tag.string
# u'extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>
BeautifulSoup
The BeautifulSoup object represents the entire contents of a document.
soup
# <html><body><b class="boldest">extremely bold</b></body></html>
type(soup)
# <class 'bs4.BeautifulSoup'>
Comment
A Comment object typically represents a comment section of the document.
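To make the Comment type concrete, here is a small sketch. The markup is a made-up example and "html.parser" is an assumed parser choice. A Comment is a special kind of NavigableString, so it can be compared with an ordinary string.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical markup containing a single HTML comment.
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "html.parser")

# The only child of <b> is the comment, so .string returns it.
comment = soup.b.string
is_comment = isinstance(comment, Comment)
print(type(comment))
print(comment)
```

Checking isinstance(node, Comment) is one way to skip comments when collecting the visible text of a page.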
5. Traverse the document tree
Tag names
You can get a tag by using its name as an attribute, and you can chain these accesses to go deeper into the tree.
soup.head
# <head><title>The Dormouse's story</title></head>
soup.title
# <title>The Dormouse's story</title>
Accessing a tag name as an attribute returns only the first tag with that name:
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
To get all the <a> tags, use find_all():
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
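The examples in sections 5 and 6 query a page that this article never shows. Judging from the outputs, it appears to be the "three sisters" sample from the Beautiful Soup documentation. The sketch below is my reconstruction of that page (an assumption, as is the "html.parser" choice), so that the examples can be run as written.

```python
from bs4 import BeautifulSoup

# Reconstructed sample page (an assumption; the article does not include it).
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

first_link = soup.a                  # attribute access: first <a> only
all_links = soup.find_all("a")       # find_all: every <a> in the tree
link_ids = [a["id"] for a in all_links]

print(soup.title)
print(first_link["href"])
print(link_ids)
```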
6. Search the document tree
Beautiful Soup's two most important search methods are find() and find_all().
Filters
The simplest filter is a string:
soup.find_all('b')
# [<b>The Dormouse's story</b>]
You can also pass in a regular expression. The following finds every tag whose name starts with "b":
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b
Or pass in a list to match any of its items:
soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
If none of these filters fits, you can also pass in a function of your own.
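A custom filter is a function that takes a tag and returns True when the tag should be kept. Here is a minimal sketch, assuming a shortened version of the sample page and the "html.parser" parser. The helper name has_class_but_no_id is only illustrative.

```python
from bs4 import BeautifulSoup

# Shortened sample page (an assumption, for illustration only).
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")


def has_class_but_no_id(tag):
    """Keep tags that define a class attribute but no id attribute."""
    return tag.has_attr("class") and not tag.has_attr("id")


matches = soup.find_all(has_class_but_no_id)
names = [tag.name for tag in matches]
print(names)
```

Only the two <p> tags pass this filter: the <a> tags have an id, and the remaining tags have no class.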
find_all()
find_all(name, attrs, recursive, text, **kwargs)
name parameter
The name parameter finds all tags with the given name, such as title, head, body, or p.
Keyword parameters
If a keyword argument is not one of find_all()'s built-in parameter names, it is treated as a filter on the tag attribute of that name. For example, if you pass an argument named id, Beautiful Soup filters on each tag's "id" attribute:
soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
If you pass an href argument, Beautiful Soup filters on each tag's "href" attribute:
soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
The value of such a keyword argument can be a string, a regular expression, a list, or True.
The following example finds all tags in the document tree that have an id attribute, regardless of its value:
soup.find_all(id=True)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
You can filter on several attributes at once by passing more than one keyword argument:
soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
Search by CSS class
Because class is a reserved word in Python, Beautiful Soup uses the keyword argument class_ instead.
Like the other keyword arguments, class_ accepts different types of filters: a string, a regular expression, a function, or True.
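Here is a short sketch of the four filter types with class_, assuming a shortened version of the sample page and the "html.parser" parser. The helper has_six_characters is a hypothetical name.

```python
import re
from bs4 import BeautifulSoup

# Shortened sample page (an assumption, for illustration only).
html_doc = """<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>"""
soup = BeautifulSoup(html_doc, "html.parser")


def has_six_characters(css_class):
    """Keep tags whose class value is exactly six characters long."""
    return css_class is not None and len(css_class) == 6


by_string = soup.find_all("a", class_="sister")      # exact class name
by_regex = soup.find_all(class_=re.compile("itl"))   # class matching a pattern
by_function = soup.find_all(class_=has_six_characters)
by_true = soup.find_all(class_=True)                 # any tag that has a class

print([tag["id"] for tag in by_string])
print([tag.name for tag in by_regex])
print(len(by_function), len(by_true))
```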
text parameter
The text parameter searches the string content of the document instead of the tags. Like the name parameter, it accepts a string, a regular expression, a list, or True.
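Here is a minimal sketch of text searches, assuming a shortened version of the sample page and the "html.parser" parser. Recent Beautiful Soup releases also accept this filter under the name string and may warn that text is deprecated. The sketch keeps text to match this article.

```python
import re
from bs4 import BeautifulSoup

# Shortened sample page (an assumption, for illustration only).
html_doc = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>"""
soup = BeautifulSoup(html_doc, "html.parser")

# Each call returns the matching strings themselves, not the enclosing tags.
exact = [str(s) for s in soup.find_all(text="Elsie")]
from_list = [str(s) for s in soup.find_all(text=["Tillie", "Elsie", "Lacie"])]
from_regex = [str(s) for s in soup.find_all(text=re.compile("ie$"))]

print(exact)
print(from_list)
print(from_regex)
```

Results come back in document order, so the order of the items in the list argument does not matter.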
Calling a tag is like calling find_all()
find_all() is by far the most commonly used search method in Beautiful Soup, so it has a shorthand: calling a BeautifulSoup object or a Tag object as if it were a function is the same as calling find_all() on that object. The following two lines are equivalent:
soup.find_all("a")
soup("a")
These two lines are also equivalent:
soup.title.find_all(text=True)
soup.title(text=True)
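The equivalence is easy to check directly. Here is a small sketch, assuming a made-up two-link page and the "html.parser" parser:

```python
from bs4 import BeautifulSoup

# Hypothetical two-link page, for illustration only.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body><p><a id="link1">Elsie</a> <a id="link2">Lacie</a></p></body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

# Calling the object forwards the arguments to find_all().
same_links = soup("a") == soup.find_all("a")
link_count = len(soup("a"))

print(same_links)
print(link_count)
```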
CSS selectors
Beautiful Soup supports most CSS selectors [6]. Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax:
soup.select("title")
# [<title>The Dormouse's story</title>]
soup.select("p:nth-of-type(3)")
# [<p class="story">...</p>]
Find tags nested beneath other tags, level by level:
soup.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("html head title")
# [<title>The Dormouse's story</title>]
Find tags that are direct children of another tag [6]:
soup.select("head > title")
# [<title>The Dormouse's story</title>]
soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("p > a:nth-of-type(2)")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
soup.select("p > #link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select("body > a")
# []
Find the siblings of a tag:
soup.select("#link1 ~ .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("#link1 + .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
Find tags by CSS class name:
soup.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("[class~=sister]")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
Find tags by id:
soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
Find tags by whether an attribute exists:
soup.select('a[href]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
Find tags by attribute value:
soup.select('a[href="http://example.com/elsie"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select('a[href^="http://example.com/"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select('a[href$="tillie"]')
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select('a[href*=".com/el"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
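Selectors can be combined, and the results are ordinary Tag objects, so attribute values and text can be pulled out of them directly. Here is a minimal sketch, assuming a shortened version of the sample page and the "html.parser" parser:

```python
from bs4 import BeautifulSoup

# Shortened sample page (an assumption, for illustration only).
html_doc = """<html><body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

# Child combinator plus a prefix match on the href attribute.
links = soup.select('p.story > a[href^="http://example.com/"]')
hrefs = [a["href"] for a in links]

# Every later sibling of #link1 that has the class "sister".
later_sisters = [a.get_text() for a in soup.select("#link1 ~ .sister")]

print(hrefs)
print(later_sisters)
```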