Learning Python modules: BS4 (Beautiful Soup 4)

Last Update:2015-04-07 Source: Internet

Author: User

1. Install BS4

I use Ubuntu 14.04, so I simply install it with the apt-get command.

sudo apt-get install python-bs4

2. Install the parser

Beautiful Soup supports the HTML parser in the Python standard library and also supports some third-party parsers, one of which is lxml.

sudo apt-get install python-lxml

3. How to use

Pass a document to the BeautifulSoup constructor to get a document object. The document can be a string or an open file handle.

from bs4 import BeautifulSoup
soup = BeautifulSoup(open("index.html"))
soup = BeautifulSoup("<html>data</html>")
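A minimal sketch of both forms, assuming Python 3 and the built-in html.parser. Naming the parser explicitly avoids the warning Beautiful Soup prints when it has to choose one itself. The markup and the use of io.StringIO in place of a real index.html are illustrative assumptions:

```python
import io
from bs4 import BeautifulSoup

# Illustrative markup (an assumption, not taken from the article).
markup = "<html><body><p>data</p></body></html>"

# Parse from a string.
soup_from_string = BeautifulSoup(markup, "html.parser")

# Parse from a file-like object; io.StringIO stands in for open("index.html").
soup_from_file = BeautifulSoup(io.StringIO(markup), "html.parser")

print(soup_from_string.p.string)
print(soup_from_file.p.string)
```

Swapping "html.parser" for "lxml" should select the lxml parser installed in step 2, provided it is available.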

4. Types of objects

Beautiful Soup transforms a complex HTML document into a tree structure in which each node is a Python object. All objects fall into four types: Tag, NavigableString, BeautifulSoup, and Comment.

Tag

A Tag object corresponds to a tag in the original XML or HTML document:

soup = BeautifulSoup('<b class="boldest">extremely bold</b>')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>

Each tag has a name, which is available as .name:

tag.name
# u'b'

A tag may have any number of attributes. You access them the same way as dictionary entries, and .attrs returns all of them as a dictionary. Because class can hold several values, Beautiful Soup returns it as a list:

tag['class']
# [u'boldest']

tag.attrs
# {u'class': [u'boldest']}
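The attribute access described above can be tried with a short sketch. It assumes Python 3 (so strings print without the u prefix) and adds an id attribute to the sample tag purely for illustration:

```python
from bs4 import BeautifulSoup

# The id attribute is an illustrative addition to the article's sample tag.
soup = BeautifulSoup('<b class="boldest" id="intro">extremely bold</b>', "html.parser")
tag = soup.b

print(tag.name)          # the tag name
print(tag["id"])         # a single-valued attribute is a string
print(tag["class"])      # class is multi-valued, so it is a list
print(tag.get("href"))   # .get() returns None for a missing attribute

# Attributes can be changed like dictionary entries.
tag["id"] = "summary"
print(tag)
```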

NavigableString

Strings are usually contained within tags. Beautiful Soup wraps them in the NavigableString class:

tag.string
# u'extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>

BeautifulSoup

The BeautifulSoup object represents the entire contents of a document.

soup
# <html><body><b class="boldest">extremely bold</b></body></html>
type(soup)
# <class 'bs4.BeautifulSoup'>

Comment

A Comment object represents a comment in the document. It is a special type of NavigableString.
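A small sketch of how a comment surfaces in the tree, assuming Python 3 and the built-in html.parser. The markup is an illustrative assumption:

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Illustrative markup: a tag whose only child is an HTML comment.
soup = BeautifulSoup("<b><!--Hey, buddy. Want to buy a used parser?--></b>", "html.parser")
comment = soup.b.string

print(type(comment))                          # the Comment class
print(comment)                                # the comment text without the <!-- --> markers
print(isinstance(comment, NavigableString))   # Comment is a kind of NavigableString
```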

5. Traversing the document tree

Tag names

You can reach a tag by using its name as an attribute, and you can chain these lookups (for example, soup.body.b).

soup.head
# <head><title>The Dormouse's story</title></head>
soup.title
# <title>The Dormouse's story</title>

Attribute access returns only the first tag with that name:

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

To get all the <a> tags, use find_all():

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
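The outputs in this section and the next appear to come from the "three sisters" sample used in the Beautiful Soup documentation, which the article does not reproduce. A shortened stand-in, assumed here for illustration, is enough to try these calls under Python 3:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the sample document (an assumption, not the article's exact input).
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>.</p>
<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

print(soup.title.string)                       # text of the <title> tag
print(soup.body.b)                             # chained attribute lookups
print(soup.a["id"])                            # attribute access: first <a> only
print([a["id"] for a in soup.find_all("a")])   # find_all: every <a>
```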

6. Search the document tree

The two most important search methods in Beautiful Soup are find() and find_all().
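The difference between the two is easiest to see side by side. This is a sketch assuming Python 3 and illustrative markup: find() returns the first match or None, while find_all() returns a list of every match.

```python
from bs4 import BeautifulSoup

# Illustrative markup (an assumption, not taken from the article).
html = '<p><a id="link1">Elsie</a> <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")               # a single Tag: the first match
every = soup.find_all("a")           # a list of all matches
missing = soup.find("table")         # None when nothing matches
no_tables = soup.find_all("table")   # an empty list when nothing matches

print(first)
print(every)
print(missing, len(no_tables))
```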

Filters

The simplest filter is a string, which matches tags with exactly that name:

soup.find_all('b')
# [<b>The Dormouse's story</b>]

If you pass in a regular expression, Beautiful Soup matches tag names against it:

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

If you pass in a list, Beautiful Soup returns tags that match any item in the list:

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

If none of these filters fits, you can define your own function that takes a tag and returns True when the tag matches.
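One way such a function filter might look, assuming Python 3. The markup and the function name has_class_but_no_id are illustrative choices:

```python
from bs4 import BeautifulSoup

# Illustrative markup (an assumption, not taken from the article).
html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time <a class="sister" id="link1">Elsie</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

def has_class_but_no_id(tag):
    """Match tags that define a class attribute but no id attribute."""
    return tag.has_attr("class") and not tag.has_attr("id")

# find_all() calls the function once per tag and keeps the tags it accepts.
matches = soup.find_all(has_class_but_no_id)
print([tag.name for tag in matches])
```

Here the two <p> tags are kept, the <a> tag is rejected because it has an id, and the <b> tag is rejected because it has no class.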

find_all()

find_all(name, attrs, recursive, text, **kwargs)

The name parameter

The name parameter finds all tags with the given name, such as title, head, body, or p.

Keyword arguments

If a keyword argument is not one of the built-in parameter names, Beautiful Soup treats it as a filter on the tag attribute of that name. For example, if you pass an argument named id, Beautiful Soup filters on each tag's "id" attribute:

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

If you pass the href argument, Beautiful Soup filters on each tag's "href" attribute:

soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

A keyword argument used to filter on an attribute can be a string, a regular expression, a list, or True.

The following example finds all tags in the document tree that have an id attribute, regardless of its value:

soup.find_all(id=True)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

You can filter on several attributes at once by passing more than one keyword argument:

soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Searching by CSS class

Because class is a reserved word in Python, Beautiful Soup uses the keyword argument class_ to search by CSS class.

Like other keyword arguments, class_ accepts different types of filters: a string, a regular expression, a function, or True.
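A sketch of the four filter types applied to class_, assuming Python 3 and illustrative markup. The function filter receives one class value at a time and may receive None for tags without a class, hence the guard:

```python
import re
from bs4 import BeautifulSoup

# Illustrative markup (an assumption, not taken from the article).
html = """
<p class="title">Heading</p>
<p class="story">Body <a class="sister" href="#1">Elsie</a></p>
<p class="story intro">More</p>
"""
soup = BeautifulSoup(html, "html.parser")

by_string = soup.find_all("a", class_="sister")       # exact class name
by_regex = soup.find_all(class_=re.compile("itl"))    # class matching a pattern
by_function = soup.find_all(
    class_=lambda value: value is not None and len(value) == 5
)                                                     # classes five characters long
any_class = soup.find_all(class_=True)                # any tag that has a class

print(len(by_string), len(by_regex), len(by_function), len(any_class))
```

A tag with several classes, such as the last <p>, matches if any one of its classes passes the filter.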

The text parameter

The text parameter searches the string content of a document instead of its tags. Like the name parameter, it accepts a string, a regular expression, a list, or True.
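A sketch of string searches, assuming Python 3 and illustrative markup. In Beautiful Soup 4.4.0 and later this argument is named string, with text kept as the older name; the example uses string on the assumption that a recent version is installed:

```python
import re
from bs4 import BeautifulSoup

# Illustrative markup (an assumption, not taken from the article).
html = "<p><a>Elsie</a>, <a>Lacie</a> and <a>Tillie</a></p>"
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(string="Elsie")                         # one exact string
either = soup.find_all(string=["Tillie", "Elsie", "Lacie"])   # any string in the list
pattern = soup.find_all(string=re.compile("ie$"))             # strings ending in "ie"
tags = soup.find_all("a", string="Lacie")                     # <a> tags whose string is "Lacie"

print(exact, either, pattern, tags)
```

Without a tag name the results are the strings themselves, in document order; combined with a tag name, the results are the tags that contain the matching string.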

Calling a tag is like calling find_all()

Because find_all() is the most commonly used search method in Beautiful Soup, there is a shorthand for it. Calling a BeautifulSoup object or a Tag object as if it were a function is the same as calling find_all() on that object. The following two lines of code are equivalent:

soup.find_all("a")
soup("a")

These two lines of code are also equivalent:

soup.title.find_all(text=True)
soup.title(text=True)
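The equivalence can be checked directly. This is a sketch assuming Python 3, illustrative markup, and a recent Beautiful Soup in which the text argument is also available as string:

```python
from bs4 import BeautifulSoup

# Illustrative markup (an assumption, not taken from the article).
soup = BeautifulSoup("<p><a>one</a><a>two</a></p>", "html.parser")

# Calling the object is shorthand for find_all() with the same arguments.
same_tags = soup("a") == soup.find_all("a")
same_strings = soup.p(string=True) == soup.p.find_all(string=True)

print(same_tags, same_strings)
```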

CSS selectors

Beautiful Soup supports most CSS selectors. Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find the matching tags:

soup.select("title")
# [<title>The Dormouse's story</title>]
soup.select("p:nth-of-type(3)")
# [<p class="story">...</p>]

Find tags beneath other tags, at any depth:

soup.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("html head title")
# [<title>The Dormouse's story</title>]

Find tags directly beneath another tag:

soup.select("head > title")
# [<title>The Dormouse's story</title>]
soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("p > a:nth-of-type(2)")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
soup.select("p > #link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select("body > a")
# []

Find the siblings of a tag:

soup.select("#link1 ~ .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("#link1 + .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find tags by CSS class:

soup.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("[class~=sister]")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by ID:

soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find tags by whether an attribute exists:

soup.select('a[href]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by attribute value:

soup.select('a[href="http://example.com/elsie"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select('a[href^="http://example.com/"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select('a[href$="tillie"]')
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select('a[href*=".com/el"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
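To experiment with these selectors, a compact sketch like the following may help. It assumes Python 3, a shortened stand-in for the sample document, and a hypothetical helper named ids that reduces each result to the matched id values:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the sample document (an assumption).
html = """
<html><head><title>The Dormouse's story</title></head>
<body><p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p></body></html>
"""
soup = BeautifulSoup(html, "html.parser")

def ids(selector):
    """Hypothetical helper: return the id of every tag matched by a CSS selector."""
    return [tag["id"] for tag in soup.select(selector)]

print(ids("p > a"))                # direct children of <p>
print(ids("#link1 ~ .sister"))     # later siblings of #link1
print(ids("#link1 + .sister"))     # the sibling right after #link1
print(ids('a[href$="tillie"]'))    # href ending in "tillie"
print(ids("body > a"))             # no <a> sits directly under <body>
```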

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
