Beautiful Soup is an HTML/XML parser written in Python. It handles malformed markup and builds a parse tree from it, and it provides simple, commonly used operations for navigating, searching, and modifying that tree. This can save a great deal of programming time.
Installation
1. You can install with pip or easy_install; either of the following commands works:
easy_install beautifulsoup4
pip install beautifulsoup4
2. Alternatively, download the installation package and install it manually, which is also convenient.
Download: https://pypi.python.org/pypi/beautifulsoup4/4.3.2
Unzip the package after the download completes.
Run the following command in the unpacked directory to complete the installation:
sudo python setup.py install
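A quick way to confirm the installation, sketched under the assumption that the package was installed into the interpreter you are running: import bs4 and print its version string.

```python
import bs4

# An ImportError on the line above means the installation did not succeed
# for this interpreter. Otherwise, show which version was installed.
print(bs4.__version__)
```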
Usage
This article uses the following HTML page to illustrate the BeautifulSoup library. bs4 converts any HTML input to Unicode internally and writes its output as UTF-8.
>>> import requests
>>> r = requests.get('http://python123.io/ws/demo.html')
>>> demo = r.text  # the page's HTML source, used as `demo` in the examples that follow
1. Importing the beautifulsoup4 library
from bs4 import BeautifulSoup
import bs4
2. beautifulsoup4 parsers
soup = BeautifulSoup('data', 'html.parser')
Parser | How to use | Conditions
bs4's HTML parser | BeautifulSoup(mk, 'html.parser') | Install the bs4 library
lxml's HTML parser | BeautifulSoup(mk, 'lxml') | Install the lxml library
lxml's XML parser | BeautifulSoup(mk, 'xml') | Install the lxml library
html5lib's parser | BeautifulSoup(mk, 'html5lib') | Install the html5lib library
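Because the lxml and html5lib parsers require extra installs, a script may want to fall back to the built-in parser when they are missing. The following is only a sketch: make_soup is a hypothetical helper name, not part of bs4; it relies on bs4 raising FeatureNotFound when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Hypothetical helper: try the preferred parser, else use html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the parser library is not installed.
        return BeautifulSoup(markup, "html.parser")


# Malformed input: neither <p> nor <b> is closed.
soup = make_soup("<p>unclosed <b>bold")
print(soup.b.string)      # text inside <b>
print(soup.p.get_text())  # all text inside <p>
```

Either parser repairs the unclosed tags, so the same lookups work regardless of which one was used.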
3. Basic elements of the BeautifulSoup class
Basic element | Description
Tag | A tag, the most basic unit of information organization; its start and end are marked with <> and </>
Name | The tag's name; the name of <p>...</p> is 'p'. Format: <tag>.name
Attributes | The tag's attributes, organized as a dictionary. Format: <tag>.attrs
NavigableString | The non-attribute string inside a tag, i.e. the text between <> and </>. Format: <tag>.string
Comment | The comment part of a string inside a tag; a special type of NavigableString
############## Tag ##############
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, 'html.parser')
>>> soup.title
<title>This is a python demo page</title>
>>> tag = soup.a
>>> tag
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

Any tag that exists in the HTML can be accessed as soup.<tag>. When the document contains several tags with the same name, soup.<tag> returns the first one.

############## Tag name (name) ##############
>>> soup.a.name
'a'
>>> soup.a.parent.name  # name of the parent tag
'p'
>>> soup.a.parent.parent.name  # name of the parent's parent tag
'body'

Every <tag> has its own name, obtained with <tag>.name; it is a string.

############## Tag attrs (attributes) ##############
>>> tag = soup.a
>>> tag.attrs
{'class': ['py1'], 'id': 'link1', 'href': 'http://www.icourse163.org/course/BIT-268001'}
>>> tag.attrs['class']
['py1']
>>> tag.attrs['href']
'http://www.icourse163.org/course/BIT-268001'
>>> type(tag.attrs)
<class 'dict'>
>>> type(tag)
<class 'bs4.element.Tag'>

A <tag> can have zero or more attributes; attrs is a dictionary.

############## Tag NavigableString ##############
>>> soup.a
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.string
'Basic Python'
>>> soup.title
<title>This is a python demo page</title>
>>> soup.title.string
'This is a python demo page'
>>> type(soup.title.string)
<class 'bs4.element.NavigableString'>

A NavigableString can span multiple levels: when a tag's only child is another tag, .string returns that child's string.

############## Tag Comment ##############
>>> newsoup = BeautifulSoup('<b><!--This is comment--></b><p>This is comment</p>', 'html.parser')
>>> newsoup.b.string
'This is comment'
>>> type(newsoup.b.string)
<class 'bs4.element.Comment'>
>>> newsoup.p.string
'This is comment'
>>> type(newsoup.p.string)
<class 'bs4.element.NavigableString'>

Comment is a special type of NavigableString. The two print identically, so check the type when the difference matters.
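The basic element types can be told apart with isinstance. The sketch below uses a small inline fragment (an assumption made for self-containment, not the demo page) and a hypothetical describe helper.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Inline stand-in markup: one comment and one plain string inside a <p>.
soup = BeautifulSoup('<p class="note" id="x"><!--hidden-->visible</p>', "html.parser")
p = soup.p


def describe(node):
    """Hypothetical helper: classify a node of the parse tree."""
    # Comment must be tested first because it subclasses NavigableString.
    if isinstance(node, Comment):
        return "comment"
    if isinstance(node, NavigableString):
        return "string"
    if isinstance(node, Tag):
        return "tag"
    return "other"


kinds = [describe(node) for node in p.contents]
print(p.name)    # tag name
print(p.attrs)   # attribute dictionary; class is multi-valued, so it is a list
print(kinds)     # one entry per child of <p>
print(p.string)  # None here, because <p> has more than one child
```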
4. Traversing the tag tree
The BeautifulSoup object is the root node of the tag tree.
1) Downward traversal
Attribute | Description
.contents | List of child nodes; all of <tag>'s children are stored in a list
.children | Iterator over the child nodes; like .contents, used to loop over the children
.descendants | Iterator over all descendant nodes, used to loop over them
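A minimal sketch of the three downward attributes, using an inline fragment chosen for illustration rather than the demo page. Strings are nodes too, so the example filters with isinstance when only tags are wanted.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<div><p>one <b>two</b></p><p>three</p></div>", "html.parser")
div = soup.div

# .contents is a real list, so len() and indexing work.
child_count = len(div.contents)

# .children iterates over the same direct children.
child_names = [child.name for child in div.children]

# .descendants walks every nested node in document order, strings included.
descendant_tags = [d.name for d in div.descendants if isinstance(d, Tag)]

print(child_count)
print(child_names)
print(descendant_tags)
```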
>>> soup.head
2) Upward traversal
Attribute | Description
.parent | The node's parent tag
.parents | Iterator over the node's ancestor tags, used to loop over the ancestors
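As a sketch of upward traversal on an inline fragment (an illustrative assumption): iterating .parents ends at the BeautifulSoup object itself, whose name is '[document]' and whose own parent is None.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p><a>link</a></p></body></html>", "html.parser")
a = soup.a

# The direct parent of <a>.
parent_name = a.parent.name

# Every ancestor, nearest first; the last one is the soup object.
ancestors = [parent.name for parent in a.parents]

print(parent_name)
print(ancestors)
print(soup.parent)  # the root has no parent
```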
>>> soup.title.parent
3) Sibling (parallel) traversal
Attribute | Description
.next_sibling | Returns the next sibling node in HTML text order
.previous_sibling | Returns the previous sibling node in HTML text order
.next_siblings | Iterator over all following sibling nodes in HTML text order
.previous_siblings | Iterator over all preceding sibling nodes in HTML text order
>>> soup.a.next_sibling
' and '
>>> soup.a.next_sibling.next_sibling
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
>>> soup.a.previous_sibling
'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'
# Iterate over the following siblings (note the plural attribute)
for sibling in soup.a.next_siblings:
    print(sibling)
# Iterate over the preceding siblings
for sibling in soup.a.previous_siblings:
    print(sibling)
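Siblings include text nodes, not only tags, which is why the first next_sibling above is a string. A sketch on an inline fragment (assumed for illustration) that keeps only the tag siblings:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<p>intro <a id="link1">Basic</a> and <a id="link2">Advanced</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# All following siblings: a string, a tag, and another string.
following = list(first.next_siblings)

# Keep only the siblings that are tags.
following_tags = [s for s in following if isinstance(s, Tag)]

print(repr(first.previous_sibling))
print(repr(first.next_sibling))
print([t["id"] for t in following_tags])
```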
5. More readable output with the prettify() method
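A minimal sketch of prettify() on a short inline document (an assumption for self-containment). It returns a string with each tag and string on its own indented line, and it can be called on the whole soup or on a single tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p>Hello <b>world</b></p></body></html>", "html.parser")

# Format the whole document.
pretty = soup.prettify()
print(pretty)

# Format only one tag.
pretty_p = soup.p.prettify()
print(pretty_p)
```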
>>> import requests
>>> r = requests.get('http://python123.io/ws/demo.html')
>>> r.text
6. Content lookup
Method | Description
<>.find() | Searches and returns only one result; same parameters as .find_all()
<>.find_parents() | Searches the ancestor nodes and returns a list; same parameters as .find_all()
<>.find_parent() | Returns one result from the ancestor nodes; same parameters as .find()
<>.find_next_siblings() | Searches the following sibling nodes and returns a list; same parameters as .find_all()
<>.find_next_sibling() | Returns one result from the following sibling nodes; same parameters as .find()
<>.find_previous_siblings() | Searches the preceding sibling nodes and returns a list; same parameters as .find_all()
<>.find_previous_sibling() | Returns one result from the preceding sibling nodes; same parameters as .find()
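The methods in this table can be sketched on a small inline fragment (the markup and ids are assumptions for illustration). Note that the single-result methods return None when nothing matches.

```python
from bs4 import BeautifulSoup

html = ('<div id="box"><p class="course">'
        '<a id="link1">Basic</a> and <a id="link2">Advanced</a>'
        '</p></div>')
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")               # first match only
second = soup.find("a", id="link2")  # same parameters as find_all()
missing = soup.find("table")         # no match: None

enclosing_div = second.find_parent("div")
ancestor_names = [t.name for t in second.find_parents(["p", "div"])]
later_links = first.find_next_siblings("a")
earlier_link = second.find_previous_sibling("a")

print(first["id"], second["id"], missing)
print(enclosing_div["id"], ancestor_names)
print([t["id"] for t in later_links], earlier_link["id"])
```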
<>.find_all(name, attrs, recursive, string, **kwargs)
- name: the tag name to search for, given as a string
- attrs: a string to match against tag attribute values; a specific attribute can be targeted with a keyword argument
- recursive: whether to search all descendants; defaults to True
- string: a string to match against the text between <>...</>
<tag>(..) is equivalent to <tag>.find_all(..)
soup(..) is equivalent to soup.find_all(..)
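The parameters and the call shorthand can be sketched together on an inline fragment. The markup and the href values here are placeholders invented for the example, not the demo page.

```python
import re

from bs4 import BeautifulSoup

html = ('<p class="title"><b>demo</b></p>'
        '<p class="course">'
        '<a class="py1" id="link1" href="/basic">Basic Python</a>'
        '<a class="py2" id="link2" href="/advanced">Advanced Python</a>'
        '</p>')
soup = BeautifulSoup(html, "html.parser")

all_links = soup.find_all("a")                       # name
shorthand = soup("a")                                # same as find_all("a")
course_paragraphs = soup.find_all("p", "course")     # attrs as a class value
by_id = soup.find_all(id=re.compile("^link"))        # keyword argument with a regex
top_level_links = soup.find_all("a", recursive=False)  # direct children of soup only
matching_text = soup.find_all(string=re.compile("Python"))  # string

print(len(all_links), shorthand == all_links)
print(len(course_paragraphs), [t["id"] for t in by_id])
print(top_level_links, matching_text)
```

The recursive=False search returns an empty list because both links sit inside a <p> rather than directly under the root.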
# Extract all URLs
for link in soup.find_all('a'):
    print(link.get('href'))
http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001
>>> soup.find_all('a')  # find all <a> tags; returns a list
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all(['a', 'b'])  # find all <a> and <b> tags; returns a list
[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all('p', 'course')  # find all <p> tags whose class is "course"
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
>>> soup.find_all(id='link1')  # find all tags whose id attribute is 'link1'
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
>>> import re
>>> # use a regular expression to find all tags whose id attribute contains 'link'
>>> soup.find_all(id=re.compile('link'))
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all('a', recursive=False)  # find <a> tags among the direct children only, without descending
[]
>>> soup.find_all(string="Basic Python")  # find all strings between tags that are exactly "Basic Python"
['Basic Python']
>>> import re
>>> soup.find_all(string=re.compile('python'))  # find all strings between tags that contain "python"
['This is a python demo page', 'The demo python introduces several python courses.']