Learning Python modules: BS4 (Beautiful Soup 4)

Source: Internet
Author: User

1. Installing BS4

I use Ubuntu 14.04, so I simply install it with the apt-get command.

sudo apt-get install python-bs4

2. Install the parser

Beautiful Soup supports the HTML parser in the Python standard library and also supports some third-party parsers, one of which is lxml.

sudo apt-get install python-lxml

3. How to use

Passing a document to the BeautifulSoup constructor gives you a document object. You can pass in either a string or a file handle.

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("index.html"))

soup = BeautifulSoup("<html>data</html>")
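As a minimal sketch of the constructor call, the example below also names the parser explicitly as the second argument. The markup string is hypothetical, and "html.parser" (the standard-library parser) is chosen here only so the example does not depend on lxml being installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration.
markup = "<html><head><title>Demo</title></head><body><p>data</p></body></html>"

# Naming the parser explicitly avoids depending on whichever parser
# happens to be installed on the machine.
soup = BeautifulSoup(markup, "html.parser")

print(soup.title.string)
print(soup.p.string)
```

Passing "lxml" instead of "html.parser" selects the lxml parser installed in step 2.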

4. Types of objects

Beautiful Soup transforms a complex HTML document into a tree structure in which each node is a Python object. All of these objects fall into 4 types: Tag, NavigableString, BeautifulSoup, and Comment.

Tag

A Tag object corresponds to a tag in the original XML or HTML document:

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>

Each tag has its own name, which is available as .name:

tag.name
# u'b'

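A tag may have many attributes. You can read them like dictionary entries, or get all of them at once through .attrs. Because class can hold several values, Beautiful Soup returns it as a list.

tag['class']
# [u'boldest']
tag.attrs
# {u'class': [u'boldest']}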
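The sketch below illustrates attribute access on a tag. The id attribute and its value are hypothetical additions to the example markup, and "html.parser" is assumed as the parser.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document; the id attribute is added for illustration.
soup = BeautifulSoup('<b class="boldest" id="x1">Extremely bold</b>', "html.parser")
tag = soup.b

tag_id = tag["id"]          # single-valued attribute: a plain string
classes = tag["class"]      # multi-valued attribute: a list of strings
missing = tag.get("title")  # .get() returns None instead of raising KeyError

print(tag_id, classes, missing)
```

Using .get() is a reasonable choice when an attribute may be absent, since indexing with tag["title"] would raise a KeyError here.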

NavigableString

Strings are usually contained within tags. Beautiful Soup wraps such a string in a NavigableString object.

tag.string
# u'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>

BeautifulSoup

The BeautifulSoup object represents the entire contents of a document.

soup
# <html><body><b class="boldest">Extremely bold</b></body></html>
type(soup)
# <class 'bs4.BeautifulSoup'>

Comment

A Comment object generally represents a comment in the document.
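A small sketch of how a comment shows up in the tree, assuming the "html.parser" parser. The markup is illustrative. Comment is a subclass of NavigableString, so it behaves like a string but can be told apart with isinstance.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Illustrative markup: a <b> tag whose only child is an HTML comment.
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "html.parser")

comment = soup.b.string  # the tag's single child, here a Comment

print(type(comment))
print(isinstance(comment, Comment))
```

Checking isinstance(node, Comment) is one way to skip comments when collecting the visible text of a page.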

5. Traversing the document tree

Tag's name

You can get a tag by accessing its name as an attribute, and you can chain such accesses.

soup.head
# <head><title>The Dormouse's story</title></head>
soup.title
# <title>The Dormouse's story</title>

Accessing a tag as an attribute returns only the first tag with that name:

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

If you want to get all the <a> tags, use find_all():

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

6. Search the document tree

Beautiful Soup's two most important search methods are find() and find_all().
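The difference between the two methods can be sketched as follows. The markup is a shortened, illustrative version of the sample document, and "html.parser" is assumed as the parser. find() returns the first match or None, while find_all() returns a list that may be empty.

```python
from bs4 import BeautifulSoup

# Shortened sample document, for illustration only.
html_doc = """
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first = soup.find("a")                # first matching tag only
all_links = soup.find_all("a")        # every matching tag, in document order
missing = soup.find("table")          # no match: None
missing_all = soup.find_all("table")  # no match: empty list

print(first["id"], len(all_links), missing, len(missing_all))
```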

Filters

The simplest filter is a string, which matches tags with exactly that name:

soup.find_all('b')
# [<b>The Dormouse's story</b>]

You can also pass in a regular expression, which is matched against tag names:

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

You can pass in a list, which matches any item in the list:

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]


If none of these filters is suitable, you can also define your own function. It takes a tag as its only argument and returns True if the tag matches.
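A minimal sketch of a function filter, assuming the "html.parser" parser and a shortened sample document. The function name has_class_but_no_id is illustrative. find_all() calls it once per tag and keeps the tags for which it returns True.

```python
from bs4 import BeautifulSoup

# Shortened sample document, for illustration only.
html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")


def has_class_but_no_id(tag):
    """Match tags that define a class attribute but no id attribute."""
    return tag.has_attr("class") and not tag.has_attr("id")


matches = soup.find_all(has_class_but_no_id)
names = [tag.name for tag in matches]
print(names)
```

Here the two <p> tags match, the <b> tag is skipped because it has no class, and the <a> tag is skipped because it has an id.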

find_all()

find_all(name, attrs, recursive, text, **kwargs)

name parameter

The name parameter finds all tags with the given name, such as title, head, body, or p.

Keyword arguments

Any keyword argument that is not one of the built-in parameter names is treated as a filter on the tag attribute of that name. For example, if you pass an argument named id, Beautiful Soup filters on each tag's "id" attribute.

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

If you pass in an href argument, Beautiful Soup filters on each tag's "href" attribute:

soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

The value of such a keyword argument can be a string, a regular expression, a list, or True.

The following example finds all tags in the document tree that have an id attribute, regardless of its value:

soup.find_all(id=True)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

You can filter on several attributes of a tag at once by passing more than one keyword argument:

soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Search by CSS class

Because class is a reserved word in Python, Beautiful Soup uses the keyword argument class_ to search by CSS class.

Like other keyword arguments, class_ accepts different types of filters: a string, a regular expression, a function, or True.
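The class_ argument can be sketched with three of those filter types. The markup is a shortened, illustrative version of the sample document, and "html.parser" is assumed as the parser.

```python
import re

from bs4 import BeautifulSoup

# Shortened sample document, for illustration only.
html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

sisters = soup.find_all("a", class_="sister")          # exact class name
by_regex = soup.find_all(class_=re.compile("^tit"))    # class matching a pattern
with_class = soup.find_all(class_=True)                # any tag that has a class

print(len(sisters), by_regex[0].name, len(with_class))
```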

text parameter

The text parameter searches for string content in the document instead of tags. Like the name parameter, it accepts a string, a regular expression, a list, or True.
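A minimal sketch of searching by string content. It uses the keyword string, which is the name this parameter has in Beautiful Soup 4.4.0 and later, so it assumes such a version is installed. On older versions the same calls would be written with text=. The markup is illustrative and "html.parser" is assumed as the parser.

```python
import re

from bs4 import BeautifulSoup

# Shortened sample document, for illustration only.
html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Without a tag name, the results are the matching strings themselves.
exact = [str(s) for s in soup.find_all(string="Elsie")]
from_list = [str(s) for s in soup.find_all(string=["Elsie", "Lacie"])]
by_regex = [str(s) for s in soup.find_all(string=re.compile("Dormouse"))]

print(exact, from_list, by_regex)
```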

Calling a tag is like calling find_all()

find_all() is almost the most commonly used search method in Beautiful Soup, so there is a shorthand for it. Calling a BeautifulSoup object or a Tag object as if it were a function is the same as calling find_all() on that object. The following two lines of code are equivalent:

soup.find_all("a")
soup("a")

These two lines of code are also equivalent:

soup.title.find_all(text=True)
soup.title(text=True)

CSS selectors

Beautiful Soup supports most CSS selectors. Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax:

soup.select("title")
# [<title>The Dormouse's story</title>]
soup.select("p:nth-of-type(3)")
# [<p class="story">...</p>]

Find tags beneath other tags, level by level:

soup.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("html head title")
# [<title>The Dormouse's story</title>]

Find tags that are direct children of another tag:

soup.select("head > title")
# [<title>The Dormouse's story</title>]
soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("p > a:nth-of-type(2)")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
soup.select("p > #link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select("body > a")
# []

Find the siblings of a tag:

soup.select("#link1 ~ .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("#link1 + .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find tags by CSS class name:

soup.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("[class~=sister]")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by ID:

soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find tags by whether an attribute exists:

soup.select('a[href]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by attribute value:

soup.select('a[href="http://example.com/elsie"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select('a[href^="http://example.com/"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select('a[href$="tillie"]')
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select('a[href*=".com/el"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
