Basic use of the Python BeautifulSoup library

Last Update:2017-10-14 Source: Internet

Author: User


Beautiful Soup is an HTML/XML parser written in Python that handles malformed markup gracefully and generates a parse tree. It provides simple, commonly used operations for navigating, searching, and modifying the parse tree, which can save a great deal of programming time.

Installation

1. You can install it with pip or easy_install; either of the following commands works:

easy_install beautifulsoup4
pip install beautifulsoup4

2. If you want to install the latest version, you can also download the installation package and install it manually, which is also convenient.

Download: https://pypi.python.org/pypi/beautifulsoup4/4.3.2

Unzip the package after the download completes.

Run the following command in the unpacked directory to complete the installation:

sudo python setup.py install
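
Whichever installation route is used, a quick check like the one below can confirm that the package is importable. This is only a minimal sketch; the variable name installed_version is introduced here for illustration.

```python
# Minimal post-install check: import the package and report its version.
import bs4

installed_version = bs4.__version__  # version string of the installed package
print("Beautiful Soup version:", installed_version)
```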

Usage


This article uses the following HTML page to illustrate the use of the BeautifulSoup library. bs4 converts any HTML input to Unicode and writes its output as UTF-8 by default.

>>> import requests
>>> r = requests.get('http://python123.io/ws/demo.html')
>>> r.text  # the HTML source of the demo page
>>> demo = r.text  # kept as `demo` for the examples below

1. Importing the beautifulsoup4 library

from bs4 import BeautifulSoup
import bs4

2. Parsers for the beautifulsoup4 library

soup = BeautifulSoup('data', 'html.parser')

Parser	Usage	Requirement
bs4's HTML parser	BeautifulSoup(mk, 'html.parser')	Install the bs4 library
lxml's HTML parser	BeautifulSoup(mk, 'lxml')	Install the lxml library
lxml's XML parser	BeautifulSoup(mk, 'xml')	Install the lxml library
html5lib's parser	BeautifulSoup(mk, 'html5lib')	Install the html5lib library
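
Because the lxml and html5lib parsers depend on extra packages, a script may want to fall back to the built-in parser when they are missing. The sketch below shows one possible way to do that; the helper name make_soup and its preferred argument are hypothetical and are not part of bs4.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Hypothetical helper: try the preferred parser, else use html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised by bs4 when the requested parser library is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>data</p>")
print(soup.p.string)
```

Different parsers can build slightly different trees from malformed markup, so it is usually safer to name the parser explicitly than to rely on whatever happens to be installed.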
    
  
 
 

3. Basic elements of the BeautifulSoup class

Basic element	Description
Tag	A tag, the most basic unit of information organization; its start and end are marked with <> and </>
Name	The name of the tag; the name of <p>...</p> is 'p'. Format: <tag>.name
Attributes	The attributes of the tag, organized as a dictionary. Format: <tag>.attrs
NavigableString	The non-attribute string inside a tag, that is, the string between <> and </>. Format: <tag>.string
Comment	The comment part of the string inside a tag; Comment is a special string type
    
  
 
 
############## Tag ##############
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, 'html.parser')
>>> soup.title
<title>This is a python demo page</title>
>>> tag = soup.a
>>> tag
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

Any tag that exists in HTML syntax can be accessed with soup.<tag>. When the HTML document contains several <tag> elements with the same name, soup.<tag> returns the first one.

############## Tag name (name) ##############
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, 'html.parser')
>>> soup.a.name
'a'
>>> soup.a.parent.name  # name of the parent tag
'p'
>>> soup.a.parent.parent.name  # name of the parent's parent tag
'body'

Each <tag> has its own name, obtained with <tag>.name; it is a string.

############## Tag attrs (attributes) ##############
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, 'html.parser')
>>> tag = soup.a
>>> tag.attrs
{'class': ['py1'], 'id': 'link1', 'href': 'http://www.icourse163.org/course/BIT-268001'}
>>> tag.attrs['class']
['py1']
>>> tag.attrs['href']
'http://www.icourse163.org/course/BIT-268001'
>>> type(tag.attrs)
<class 'dict'>
>>> type(tag)
<class 'bs4.element.Tag'>

A <tag> can have zero or more attributes; they are stored in a dictionary.

############## Tag NavigableString ##############
>>> soup.a
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.string
'Basic Python'
>>> soup.title
<title>This is a python demo page</title>
>>> soup.title.string
'This is a python demo page'
>>> type(soup.title.string)
<class 'bs4.element.NavigableString'>

.string can reach through several levels of nesting: if a tag's only child is another tag, .string returns that child's string.

############## Tag Comment ##############
>>> newsoup = BeautifulSoup('<b><!--This is comment--></b><p>This is comment</p>', 'html.parser')
>>> newsoup.b.string
'This is comment'
>>> type(newsoup.b.string)
<class 'bs4.element.Comment'>
>>> newsoup.p.string
'This is comment'
>>> type(newsoup.p.string)
<class 'bs4.element.NavigableString'>

Comment is a special type of NavigableString.
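
Since .string returns the text of a comment without its <!-- --> markers, a comment and an ordinary string can look identical when printed. The sketch below shows one way to tell them apart by type. It uses a small inline document rather than the demo page, and the helper is_comment is a hypothetical name introduced for illustration.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Inline stand-in document (an assumption, not the demo page).
html = "<p class='note' id='n1'>text</p><b><!--hidden--></b>"
soup = BeautifulSoup(html, "html.parser")

tag = soup.p
print(tag.name)         # the tag's name
print(tag.attrs)        # the tag's attributes as a dictionary
print(tag.get("href"))  # .get() returns None for a missing attribute
print(tag.string)       # the string between <p> and </p>


def is_comment(node):
    """Hypothetical helper: True only for comment nodes."""
    return isinstance(node, Comment)


print(is_comment(soup.b.string))  # the <b> tag holds a comment
print(is_comment(soup.p.string))  # the <p> tag holds ordinary text
```

Checking the type before using .string avoids treating commented-out markup as page text.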

4. Traversal of the tag tree

The BeautifulSoup object is the root node of the tag tree.

1) Downward traversal

Attribute	Description
.contents	A list of the child nodes; all children of <tag> are stored in a list
.children	An iterator over the child nodes; similar to .contents, used to loop over the children
.descendants	An iterator over all descendant nodes, used to loop over every descendant

>>> soup.head
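
The three downward attributes can be compared side by side. The following is a minimal sketch on a small inline document (an assumption standing in for the demo page); all variable names are illustrative.

```python
from bs4 import BeautifulSoup, Tag

# Inline stand-in document (an assumption, not the demo page).
html = ("<html><head><title>Demo</title></head>"
        "<body><p>Hello <b>world</b></p></body></html>")
soup = BeautifulSoup(html, "html.parser")

# .contents is a plain list of the direct children.
head_children = soup.head.contents
print(head_children)

# .children iterates over the same direct children (strings and tags).
p_children = list(soup.p.children)
print(p_children)

# .descendants walks the whole subtree in document order;
# keep only Tag nodes and skip the strings.
tag_names = [node.name for node in soup.body.descendants
             if isinstance(node, Tag)]
print(tag_names)
```

Note that child nodes include strings as well as tags, which is why the descendant loop filters on Tag.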

2) Upward traversal

Attribute	Description
.parent	The parent tag of the node
.parents	An iterator over the node's ancestor tags, used to loop over the ancestors

>>> soup.title.parent
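
As a rough illustration of upward traversal, the sketch below collects the names of all ancestors of an <a> tag. The inline document is an assumption standing in for the demo page.

```python
from bs4 import BeautifulSoup

# Inline stand-in document (an assumption, not the demo page).
html = ("<html><head><title>Demo</title></head>"
        "<body><p><a href='#'>link</a></p></body></html>")
soup = BeautifulSoup(html, "html.parser")

# .parent is the immediately enclosing tag.
parent_name = soup.a.parent.name
print(parent_name)

# .parents yields every ancestor, ending with the BeautifulSoup
# object itself, whose name is '[document]'.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)
```

The BeautifulSoup object has no parent of its own, so the iteration stops there.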

3) Sibling traversal

Attribute	Description
.next_sibling	Returns the next sibling node in HTML document order
.previous_sibling	Returns the previous sibling node in HTML document order
.next_siblings	An iterator that returns all following sibling nodes in HTML document order
.previous_siblings	An iterator that returns all preceding sibling nodes in HTML document order

>>> soup.a.next_sibling
' and '
>>> soup.a.next_sibling.next_sibling
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
>>> soup.a.previous_sibling
'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'

# Iterate over the following siblings
for sibling in soup.a.next_siblings:
    print(sibling)

# Iterate over the preceding siblings
for sibling in soup.a.previous_siblings:
    print(sibling)
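
Siblings are often strings rather than tags, as the ' and ' result above shows. The sketch below, written against a small inline document (an assumption, not the demo page), collects the siblings and then keeps only the tags.

```python
from bs4 import BeautifulSoup, Tag

# Inline stand-in document (an assumption, not the demo page).
html = ("<p>Intro: <a id='link1'>Basic</a> and "
        "<a id='link2'>Advanced</a>.</p>")
soup = BeautifulSoup(html, "html.parser")

first = soup.a                     # the first <a> tag
next_node = first.next_sibling     # the string right after it
following = list(first.next_siblings)
preceding = list(first.previous_siblings)

# Keep only the siblings that are tags, skipping plain strings.
tag_siblings = [s for s in first.next_siblings if isinstance(s, Tag)]

print(repr(next_node))
print(following)
print(preceding)
print(tag_siblings)
```

Only nodes that share the same parent are siblings, so this traversal never leaves the enclosing <p> tag.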

5. More readable output with the prettify() method

>>> import requests
>>> r = requests.get('http://python123.io/ws/demo.html')
>>> r.text
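
A minimal sketch of prettify() itself follows. It uses a small inline document (an assumption standing in for the downloaded page) so that it runs without network access.

```python
from bs4 import BeautifulSoup

# Inline stand-in document (an assumption, not the demo page).
html = "<html><body><p>Hello <b>world</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# prettify() returns a string with one node per line, indented by depth.
pretty = soup.prettify()
print(pretty)

# It can also be called on a single tag to format only that subtree.
pretty_b = soup.b.prettify()
print(pretty_b)
```

Because prettify() adds whitespace, its output is meant for reading and debugging rather than for writing the document back unchanged.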

6. Searching the content

Method	Description
<>.find()	Searches and returns only one result; takes the same parameters as .find_all()
<>.find_parents()	Searches the ancestor nodes and returns a list; same parameters as .find_all()
<>.find_parent()	Returns one result from the ancestor nodes; same parameters as .find()
<>.find_next_siblings()	Searches the following sibling nodes and returns a list; same parameters as .find_all()
<>.find_next_sibling()	Returns one result from the following sibling nodes; same parameters as .find()
<>.find_previous_siblings()	Searches the preceding sibling nodes and returns a list; same parameters as .find_all()
<>.find_previous_sibling()	Returns one result from the preceding sibling nodes; same parameters as .find()
       
     
 
    

<>.find_all(name, attrs, recursive, string, **kwargs)

name: the tag name to search for, given as a string
attrs: a string to match against tag attribute values; a specific attribute can be named with a keyword argument
recursive: whether to search all descendants; defaults to True
string: a string to match against the text between <> and </>

<tag>(..) is equivalent to <tag>.find_all(..)
soup(..) is equivalent to soup.find_all(..)
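
The call shorthand can be checked directly. The following sketch compares both spellings on a small inline document (an assumption, not the demo page).

```python
from bs4 import BeautifulSoup

# Inline stand-in document (an assumption, not the demo page).
html = "<p><a id='link1'>Basic</a> and <a id='link2'>Advanced</a></p>"
soup = BeautifulSoup(html, "html.parser")

via_call = soup("a")               # shorthand form
via_find_all = soup.find_all("a")  # explicit form
same = via_call == via_find_all
print(same)

# The shorthand works on any tag, not only on the soup object.
p_links = soup.p("a")
print(len(p_links))
```

The explicit find_all() spelling is usually easier to read in longer scripts; the shorthand is mainly convenient in an interactive session.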

# Extract all URLs
>>> for link in soup.find_all('a'):
...     print(link.get('href'))
...
http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001
>>> soup.find_all('a')  # find all <a> tags; returns a list
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all(['a', 'b'])  # find all <a> and <b> tags; returns a list
[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all('p', 'course')  # find all <p> tags whose class is "course"
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses: <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
>>> soup.find_all(id='link1')  # find all tags whose id attribute is 'link1'
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
>>> import re
>>> # use a regular expression to find all tags whose id attribute contains 'link'
>>> soup.find_all(id=re.compile('link'))
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all('a', recursive=False)  # search only the direct children, not all descendants
[]
>>> soup.find_all(string="Basic Python")  # find all strings that are exactly "Basic Python"
['Basic Python']
>>> import re
>>> soup.find_all(string=re.compile('python'))  # find all strings that contain "python"
['This is a python demo page', 'The demo python introduces several python courses.']
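
The other search methods in the table at the start of this section follow the same pattern as find_all(). The sketch below exercises several of them on a small inline document; the document and all variable names are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Inline stand-in document (an assumption, not the demo page).
html = ("<body><p class='course'>Courses: "
        "<a id='link1' href='/basic'>Basic</a> and "
        "<a id='link2' href='/advanced'>Advanced</a>.</p></body>")
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")     # first match only
missing = soup.find("table")    # no match: find() returns None

# Search upward from the link.
parent_p = first_link.find_parent("p")

# Search sideways among nodes that share the same parent.
next_link = first_link.find_next_sibling("a")
later_links = first_link.find_next_siblings("a")

second_link = soup.find(id="link2")
prev_link = second_link.find_previous_sibling("a")

print(first_link["id"], parent_p["class"], next_link["id"], prev_link["id"])
```

Because find() returns None when nothing matches, it is worth checking the result before accessing attributes on it.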

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.


