Beautiful Soup, a parsing library for Python crawlers

Source: Internet
Author: User
Tags xml parser

Beautiful Soup 4 operations

Why use Beautiful Soup

Beautiful Soup is a Python library that extracts data from HTML or XML files. It works with your preferred parser to provide idiomatic ways of navigating, searching, and modifying the document. Lookups are done by tag, somewhat like jQuery. This improves efficiency: in crawler development, content is often located and filtered with regular expressions, which is purely manual work and wastes a great deal of time.
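To make the comparison with regular expressions concrete, here is a minimal sketch. The markup is made up for illustration, and the built-in "html.parser" is used so that nothing beyond bs4 needs to be installed. It shows a hand-written pattern missing a link whose attributes appear in a different order, while a lookup by tag name finds both.

```python
import re
from bs4 import BeautifulSoup

# Two anchors whose attributes appear in different orders (made-up markup).
html = (
    '<a class="sister" href="http://example.com/elsie">Elsie</a>'
    '<a href="http://example.com/lacie" class="sister">Lacie</a>'
)

# A hand-written pattern that assumes class always comes before href.
pattern = re.compile(r'<a class="sister" href="([^"]+)"')
regex_links = pattern.findall(html)

# Beautiful Soup looks the tags up by name, whatever the attribute order.
soup = BeautifulSoup(html, "html.parser")
soup_links = [a["href"] for a in soup.find_all("a")]

print(regex_links)  # only the first link matches the pattern
print(soup_links)   # both links are found
```

The pattern could be extended to cover more cases, but every new variation in the markup needs another manual fix, which is the cost the paragraph above refers to.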

Beautiful Soup example (excerpted from the official documentation)
html_doc = """ ... """
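The body of the sample document did not survive on this page. As an assumption, the sketch below uses the "three sisters" document from the official Beautiful Soup documentation, which matches the output quoted in the rest of this article. The author's own html_doc.html seems to differ slightly (its head also contains a meta charset tag, judging by the .contents output further down), so treat this as a stand-in rather than the original file. The built-in "html.parser" is used here so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Assumed sample document, taken from the official Beautiful Soup docs.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# Parse the string directly instead of reading it from a file.
soup = BeautifulSoup(html_doc, "html.parser")

title_text = soup.title.string
anchor_count = len(soup.find_all("a"))
print(title_text, anchor_count)
```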

Here is a brief description of how Beautiful Soup looks things up: it works on a tree of tags.

To use it, instantiate a BeautifulSoup object, which represents the entire HTML document. Each tag is exposed as an attribute of that object, so you look it up with the "." operator.

Here's how to do it:

Simple operations
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("html_doc.html"), "lxml")

# Simple operations

# Print the title tag of the HTML file
# print(soup.title)
# <title>The Dormouse's story</title>

# Print the name of the tag
# print(soup.title.name)
# title

# Print the contents of the tag
# print(soup.title.string)
# The Dormouse's story

# Print the p tag in soup; only the first one found is returned
# print(soup.p)
# <p class="title"><b>The Dormouse's story</b></p>

# Print the class of the p tag in soup; again, only the first one found
# print(soup.p['class'], type(soup.p['class']))
# ['title'] <class 'list'>    (the type is a list)

# Print the a tag in soup; only the first one found is returned
# print(soup.a)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

# Print all the a tags
# print(soup.find_all('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Print the tag with id="link3"
# print(soup.find(id="link3"))
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

# Find the links of all <a> tags in the document:
# for link in soup.find_all('a'):
#     print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

# Get all the text from the document:
# print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
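The walkthrough above reads html_doc.html from disk and keeps its print calls commented out. As a runnable variation, the sketch below parses a trimmed-down stand-in for that file from a string and collects the links into a list. The helper name extract_links and the markup are assumptions made for illustration, and "html.parser" replaces "lxml" only so that no extra install is needed.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for html_doc.html (an assumption, not the author's file).
HTML = """
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
"""


def extract_links(markup):
    """Return (id, text, href) for every <a> tag that has an href attribute."""
    soup = BeautifulSoup(markup, "html.parser")
    return [
        (a.get("id"), a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)
    ]


links = extract_links(HTML)
for link_id, text, href in links:
    print(link_id, text, href)
```

Passing href=True to find_all skips anchors without an href, so the a["href"] lookup inside the list comprehension cannot raise KeyError.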
Tag
soup1 = BeautifulSoup('<b class="boldest">extremely bold</b>', "lxml")
tag = soup1.b
# print(type(tag))
# <class 'bs4.element.Tag'>
Tag's name property
# print(tag.name)
# b
# If you change a tag's name, the change shows up in any HTML markup generated by the current Beautiful Soup object:
# tag.name = "blockquote"
# print(tag)
# <blockquote class="boldest">extremely bold</blockquote>
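The claim that a rename affects the markup generated by the whole soup object can be checked with a short sketch. It assumes the same one-tag snippet as above, parsed with "html.parser" so that no html/body wrapper is added around it.

```python
from bs4 import BeautifulSoup

soup1 = BeautifulSoup('<b class="boldest">extremely bold</b>', "html.parser")
tag = soup1.b

# Rename the tag in place.
tag.name = "blockquote"

# The rename is visible in markup generated from the whole soup object,
# and a lookup under the old name no longer finds anything.
rendered = str(soup1)
old_lookup = soup1.b
new_lookup = soup1.blockquote

print(rendered)
print(old_lookup)
```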
Tag's attributes

# A tag can have many attributes. The tag <b class="boldest"> has a "class" attribute.
# print(tag['class'])
# ['boldest']
#
# print(tag.attrs)
# {'class': ['boldest']}
# print(soup.a.attrs['class'])
# ['sister']
# A tag's attributes can be added, deleted, or modified. Again, they are manipulated in the same way as a dictionary.

# tag['class'] = 'verybold'
# tag['id'] = 1
# print(tag)
# <blockquote class="verybold" id="1">extremely bold</blockquote>

# del tag['class']
# del tag['id']
# print(tag)
# <blockquote>extremely bold</blockquote>

# tag['class']
# KeyError: 'class'
# print(tag.get('class'))
# None
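The dictionary-style behaviour described above can be sketched as follows. The markup is made up for illustration; the point is that class comes back as a list because it is a multi-valued attribute, while id comes back as a plain string, and that get() and has_attr() avoid the KeyError shown above.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="story intro" id="first">text</p>', "html.parser")
p = soup.p

# class is a multi-valued attribute, so it comes back as a list;
# id is single-valued, so it comes back as a string.
classes = p["class"]
ident = p["id"]

# get() returns None (or the given default) instead of raising KeyError.
missing = p.get("title")
fallback = p.get("title", "untitled")

# has_attr() tests for presence without reading the value.
had_id = p.has_attr("id")
del p["id"]
has_id_now = p.has_attr("id")

print(classes, ident, missing, fallback, had_id, has_id_now)
```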

Child node operations: the .contents attribute
# .contents
# A tag's .contents attribute returns the tag's child nodes as a list:
# print(soup)
# print(soup.contents)    (this prints the whole html tag)
# print("________")
# print(soup.head.contents)    (this prints the list of nodes under head; you can use the tuple to re-...)
# ['\n', <meta charset="utf-8"/>, '\n', <title>The Dormouse's story</title>, '\n']
# print(len(soup.head.contents))
# 5
# print(soup.head.contents[1].name)
# meta
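The list shown above contains '\n' strings as well as tags, which is why its length is 5 and the meta tag sits at index 1. A minimal sketch of filtering the children down to tags only follows. The head markup is an assumption reconstructed from the output quoted above, not the author's file.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumed head section, reconstructed from the output quoted in the article.
HTML = """<html><head>
<meta charset="utf-8"/>
<title>The Dormouse's story</title>
</head><body></body></html>"""

soup = BeautifulSoup(HTML, "html.parser")

# .contents is a list holding both tags and the whitespace strings between them.
all_children = soup.head.contents

# .children iterates over the same nodes; keep only the Tag objects.
tag_children = [child for child in soup.head.children if isinstance(child, Tag)]
names = [child.name for child in tag_children]

print(len(all_children), names)
```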


Parsers:

  

Parser: Python standard library
How to use: BeautifulSoup(markup, "html.parser")
Advantages:
  • Part of Python's built-in standard library
  • Moderate execution speed
  • Strong tolerance of malformed documents
Disadvantages:
  • Poor tolerance of malformed documents in versions prior to Python 2.7.3 or 3.2.2

Parser: lxml HTML parser
How to use: BeautifulSoup(markup, "lxml")
Advantages:
  • Fast
  • Strong tolerance of malformed documents
Disadvantages:
  • Requires a C library to be installed

Parser: lxml XML parser
How to use: BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")
Advantages:
  • Fast
  • The only parser that supports XML
Disadvantages:
  • Requires a C library to be installed

Parser: html5lib
How to use: BeautifulSoup(markup, "html5lib")
Advantages:
  • Best tolerance of malformed documents
  • Parses documents the way a browser does
  • Generates documents in HTML5 format
Disadvantages:
  • Slow
  • External Python dependency (no C extension required)
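Since lxml and html5lib are separate installs while "html.parser" ships with Python, one possible approach is to choose the parser at run time. The sketch below is an assumption about how that could be done, not something from the original article, and pick_parser is a hypothetical helper name. It relies on the fact that the parser names "lxml" and "html5lib" are also the names of the Python modules that provide them.

```python
import importlib.util

from bs4 import BeautifulSoup


def pick_parser(preferred=("lxml", "html5lib")):
    """Return the first installed parser from `preferred`, else "html.parser"."""
    for name in preferred:
        # find_spec returns None when the module is not installed.
        if importlib.util.find_spec(name) is not None:
            return name
    return "html.parser"


parser = pick_parser()

# Deliberately malformed markup: neither tag is closed.
soup = BeautifulSoup("<p>Unclosed paragraph<b>bold", parser)
text = soup.get_text()
print(parser, text)
```

The parsers differ in how they repair broken markup (html5lib and lxml wrap the fragment in html and body tags, for example), so pinning one parser explicitly is the safer choice when the exact tree matters.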

  

  
