Basic use of the Python BeautifulSoup library

Source: Internet
Author: User

Beautiful Soup is an HTML/XML parser written in Python. It handles non-canonical (malformed) markup and builds a parse tree from it, and it provides simple, commonly used operations for navigating, searching, and modifying that tree. This can save a great deal of programming time.

Installation

1. You can install Beautiful Soup with pip or easy_install; either of the following works:

easy_install beautifulsoup4
pip install beautifulsoup4

2. Alternatively, download the installation package and install it manually, which is also convenient. The link below points to release 4.3.2; check PyPI if you need the latest version.

Download: https://pypi.python.org/pypi/beautifulsoup4/4.3.2

Unpack the archive once the download is complete.

Then run the following command from the unpacked directory to complete the installation:

sudo python setup.py install

Usage


This article uses the following HTML page to illustrate the BeautifulSoup library. bs4 converts any HTML input to Unicode and writes output as UTF-8 by default.

>>> import requests
>>> r = requests.get('http://python123.io/ws/demo.html')
>>> r.text
>>> demo = r.text  # the examples below refer to this page source as demo

1. Importing the beautifulsoup4 library

from bs4 import BeautifulSoup
import bs4

2. beautifulsoup4 parsers

soup = BeautifulSoup('data', 'html.parser')

Parser             How to use                         Requirement
bs4 HTML parser    BeautifulSoup(mk, 'html.parser')   Install the bs4 library
lxml HTML parser   BeautifulSoup(mk, 'lxml')          Install the lxml library
lxml XML parser    BeautifulSoup(mk, 'xml')           Install the lxml library
html5lib parser    BeautifulSoup(mk, 'html5lib')      Install the html5lib library
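
The parser choice in the table above can be wrapped in a small helper. The following is a minimal sketch, not part of the original article: make_soup is a hypothetical name, and it assumes you would rather fall back to the built-in 'html.parser' than fail when an optional parser such as lxml is not installed. bs4 raises FeatureNotFound when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser.

    make_soup is a hypothetical helper used only for illustration.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser library is not installed; html.parser is
        # always available once bs4 itself is installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>data</p>")
print(soup.p.string)
```

Different parsers can build slightly different trees from malformed markup, so it is worth fixing one parser per project rather than relying on the fallback in production code.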

3. Basic elements of the BeautifulSoup class

Basic element     Description
Tag               A tag, the most basic unit of information organization; <> and </> mark its start and end
Name              The name of the tag; the name of <p>...</p> is 'p'. Format: <tag>.name
Attributes        The attributes of the tag, organized as a dictionary. Format: <tag>.attrs
NavigableString   The non-attribute string inside a tag, that is, the string in <>...</>. Format: <tag>.string
Comment           The comment part of a string inside a tag; a special type of NavigableString

############## Tag ##############
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, 'html.parser')
>>> soup.title
<title>This is a python demo page</title>
>>> tag = soup.a
>>> tag
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

Any tag that exists in HTML syntax can be accessed with soup.<tag>. When the HTML document contains several tags with the same name, soup.<tag> returns the first one.

############## Tag name (name) ##############
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, 'html.parser')
>>> soup.a.name
'a'
>>> soup.a.parent.name  # name of the parent tag
'p'
>>> soup.a.parent.parent.name  # name of the parent's parent tag
'body'

Each <tag> has its own name, obtained with <tag>.name; it is a string.

############## Tag attributes (attrs) ##############
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, 'html.parser')
>>> tag = soup.a
>>> tag.attrs
{'class': ['py1'], 'id': 'link1', 'href': 'http://www.icourse163.org/course/BIT-268001'}
>>> tag.attrs['class']
['py1']
>>> tag.attrs['href']
'http://www.icourse163.org/course/BIT-268001'
>>> type(tag.attrs)
<class 'dict'>
>>> type(tag)
<class 'bs4.element.Tag'>

A <tag> can have zero or more attributes; they are stored as a dictionary.

############## Tag NavigableString ##############
>>> soup.a
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.string
'Basic Python'
>>> soup.title
<title>This is a python demo page</title>
>>> soup.title.string
'This is a python demo page'
>>> type(soup.title.string)
<class 'bs4.element.NavigableString'>

A NavigableString can span multiple levels of tags.

############## Tag Comment ##############
>>> newsoup = BeautifulSoup('<b><!--This is comment--></b><p>This is comment</p>', 'html.parser')
>>> newsoup.b.string
'This is comment'
>>> type(newsoup.b.string)
<class 'bs4.element.Comment'>
>>> newsoup.p.string
'This is comment'
>>> type(newsoup.p.string)
<class 'bs4.element.NavigableString'>

Comment is a special type: .string returns the same text for a comment and for ordinary content, so check the type when the difference matters.
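
The basic elements above can be told apart at run time with isinstance. The sketch below is illustrative only: the HTML snippet and the classify helper are hypothetical and are not taken from the demo page. It also shows that .string is None when a tag has more than one child.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical snippet, not the demo page used elsewhere in this article.
html = '<p class="course" id="intro">Hello <b>world</b><!-- a note --></p>'
soup = BeautifulSoup(html, "html.parser")

p = soup.p
print(p.name)      # the tag name
print(p.attrs)     # attributes as a dict; class is a multi-valued list
print(p.b.string)  # <b> has exactly one string child
print(p.string)    # None, because <p> has several children


def classify(node):
    """Return a coarse label for a node of the parse tree (hypothetical helper)."""
    # Comment is tested first because it is a subclass of NavigableString.
    if isinstance(node, Comment):
        return "comment"
    if isinstance(node, NavigableString):
        return "text"
    if isinstance(node, Tag):
        return "tag"
    return "other"


kinds = [classify(child) for child in p.contents]
print(kinds)
```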

4. Traversal of the tag tree

The BeautifulSoup object is the root node of the tag tree.

1) Downward traversal

Property       Description
.contents      List of child nodes; all child nodes of <tag> are stored in a list
.children      Iterator over the child nodes; like .contents, used to loop over the children
.descendants   Iterator over the descendant nodes; covers all descendants, used for looping

>>> soup.head
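
As a rough illustration of the three downward properties, the sketch below parses a small hypothetical snippet (not the demo page) and compares what each property yields. Note that strings count as nodes too, so .descendants returns more than just tags.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet chosen to have no whitespace-only text nodes.
html = "<body><p>One <b>two</b></p><p>Three</p></body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.body

direct = body.contents                               # a real list of children
child_names = [child.name for child in body.children]  # iterate lazily
all_nodes = list(body.descendants)                   # tags and strings, depth-first

print(len(direct))
print(child_names)
for node in all_nodes:
    # Strings have name None, so repr() is used to show every node.
    print(repr(node))
```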

2) Upward traversal

Property   Description
.parent    The parent tag of the node
.parents   Iterator over the ancestor tags of the node, used to loop over its ancestors

>>> soup.title.parent
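
A short sketch of upward traversal, using a hypothetical snippet rather than the demo page. It shows that .parents walks all the way up to the BeautifulSoup object itself, whose name is '[document]' and whose own parent is None.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration.
html = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

a = soup.a
print(a.parent.name)

# Collect the names of every ancestor, nearest first.
ancestors = [parent.name for parent in a.parents]
print(ancestors)

# The root object has no parent, so guard against None when walking up manually.
print(soup.parent)
```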

3) Sibling (parallel) traversal

Property             Description
.next_sibling        The next sibling node in HTML text order
.previous_sibling    The previous sibling node in HTML text order
.next_siblings       Iterator over all following sibling nodes in HTML text order
.previous_siblings   Iterator over all preceding sibling nodes in HTML text order

>>> soup.a.next_sibling
' and '
>>> soup.a.next_sibling.next_sibling
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
>>> soup.a.previous_sibling
'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'

# iterate over the following siblings
for sibling in soup.a.next_siblings:
    print(sibling)

# iterate over the preceding siblings
for sibling in soup.a.previous_siblings:
    print(sibling)

Siblings are nodes, not only tags: as the output above shows, the sibling of a tag is often a string.
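
Because sibling iteration returns strings as well as tags, a common follow-up is to keep only the tags. The sketch below shows one way to do that on a hypothetical snippet modeled loosely on the demo page; the variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical snippet modeled loosely on the demo page.
html = '<p>Courses: <a id="link1">Basic</a> and <a id="link2">Advanced</a>.</p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.a
print(repr(first.previous_sibling))  # the string before the first link
print(repr(first.next_sibling))      # the string between the two links

# All following siblings, strings included.
following = list(first.next_siblings)

# Keep only real tags by checking the node type.
tag_siblings = [node for node in first.next_siblings if isinstance(node, Tag)]
print([tag["id"] for tag in tag_siblings])
```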

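5. Friendlier output with the prettify() method

Calling prettify() on a BeautifulSoup object or on a tag returns the markup as an indented string with one tag or string per line, which is much easier to read than the raw page source:

>>> import requests
>>> r = requests.get('http://python123.io/ws/demo.html')
>>> r.text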
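
A minimal sketch of prettify() on an inline document, so that it runs without network access. The HTML string is hypothetical; the exact indentation of the output can differ between bs4 versions, so no expected output is shown.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document standing in for a fetched page.
html = "<html><head><title>Demo</title></head><body><p>Hi <b>there</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# prettify() returns a str; printing it shows one tag or string per line.
pretty = soup.prettify()
print(pretty)

# It also works on a single tag.
print(soup.b.prettify())
```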

6. Searching content

Method                         Description
<>.find()                      Searches and returns only one result; same parameters as .find_all()
<>.find_parents()              Searches the ancestor nodes and returns a list; same parameters as .find_all()
<>.find_parent()               Returns one result from the ancestor nodes; same parameters as .find()
<>.find_next_siblings()        Searches the following sibling nodes and returns a list; same parameters as .find_all()
<>.find_next_sibling()         Returns one result from the following sibling nodes; same parameters as .find()
<>.find_previous_siblings()    Searches the preceding sibling nodes and returns a list; same parameters as .find_all()
<>.find_previous_sibling()     Returns one result from the preceding sibling nodes; same parameters as .find()
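
The methods in the table can be sketched as follows. The HTML snippet and variable names are hypothetical. The single-result methods return None when nothing matches, so check the result before using it.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration.
html = (
    '<div id="box"><p class="course">Intro '
    '<a id="link1">Basic</a> and <a id="link2">Advanced</a>.</p></div>'
)
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")   # first match, or None
missing = soup.find("table")  # nothing matches, so this is None

parent_p = first_link.find_parent("p")      # nearest <p> ancestor
outer_divs = first_link.find_parents("div")  # list of matching ancestors

next_link = first_link.find_next_sibling("a")              # skips the ' and ' string
prev_link = soup.find(id="link2").find_previous_sibling("a")
later_links = first_link.find_next_siblings("a")           # list of following <a> siblings

print(parent_p["class"], [d["id"] for d in outer_divs])
print(next_link["id"], prev_link["id"], len(later_links))
```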

<>.find_all(name, attrs, recursive, string, **kwargs)

    • name: the tag name to search for, given as a string
    • attrs: the attribute value to search for; a plain string is matched against the class attribute, and keyword arguments can name other attributes
    • recursive: whether to search all descendants; defaults to True
    • string: the string to search for in the text between <>...</>

<tag>(..) is equivalent to <tag>.find_all(..)
soup(..) is equivalent to soup.find_all(..)
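
The parameters listed above can be exercised on a small inline document. This is a sketch under stated assumptions: the HTML is a hypothetical stand-in for the demo page, and the string argument assumes Beautiful Soup 4.4.0 or later (older releases used text for the same purpose).

```python
import re

from bs4 import BeautifulSoup

# Hypothetical stand-in for the demo page.
html = (
    '<body><p class="title"><b>Demo courses</b></p>'
    '<p class="course"><a id="link1" href="/basic">Basic</a> '
    '<a id="link2" href="/advanced">Advanced</a></p></body>'
)
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("a")                      # name: one tag name
by_names = soup.find_all(["a", "b"])              # name: any of several tag names
by_class = soup.find_all("p", "course")           # attrs: a string matches the class
by_id_re = soup.find_all(id=re.compile("^link"))  # kwargs: regex on an attribute
shallow = soup.find_all("a", recursive=False)     # only direct children of soup
by_string = soup.find_all(string="Basic")         # string: match text, returns strings
shortcut = soup("a")                              # calling the object is find_all

hrefs = [a.get("href") for a in by_name]
print(hrefs)
print([tag.name for tag in by_names])
```

The regular expression here is anchored with ^ because bs4 matches a compiled pattern anywhere in the value; without the anchor, re.compile("link") would also match an id such as "mylink".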

# extract all URLs
>>> for link in soup.find_all('a'):
...     print(link.get('href'))
...
http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001

>>> soup.find_all('a')  # find all <a> tags; returns a list
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

>>> soup.find_all(['a', 'b'])  # find all <a> and <b> tags; returns a list
[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

>>> soup.find_all('p', 'course')  # find all <p> tags whose class is "course"
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]

>>> soup.find_all(id='link1')  # find all tags whose id attribute is 'link1'
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]

>>> import re
>>> # use a regular expression to find all tags whose id attribute contains 'link'
>>> soup.find_all(id=re.compile('link'))
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all('a', recursive=False)  # look for <a> tags among the direct children only, without descending
[]

>>> soup.find_all(string="Basic Python")  # find all strings between tags that equal "Basic Python"
['Basic Python']

>>> import re
>>> soup.find_all(string=re.compile('python'))  # find all strings between tags that contain "python"
['This is a python demo page', 'The demo python introduces several python courses.']
