Beautiful Soup: a parsing library for Python crawlers

Last Update:2018-07-11 Source: Internet

Author: User

Tags xml parser


Beautiful Soup 4 basics: why use Beautiful Soup

Beautiful Soup is a Python library that extracts data from HTML or XML files. It works with your preferred parser to provide idiomatic ways of navigating, searching, and modifying the document.

Elements are looked up by tag, somewhat like selectors in jQuery. This also improves efficiency: in crawler development, filtering content by hand with regular expressions is tedious and wastes a great deal of time.
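To make the comparison with regular expressions concrete, here is a minimal sketch. The markup and the regex pattern are hypothetical, chosen only for illustration, and the built-in "html.parser" is used so that no extra parser needs to be installed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the two links list their attributes in a different order.
html = '<p><a href="/a" class="x">A</a> <a class="y" href="/b">B</a></p>'

# A hand-written pattern that assumes href is always the first attribute.
regex_links = re.findall(r'<a href="([^"]*)"', html)

# Beautiful Soup parses the tags, so attribute order does not matter.
soup = BeautifulSoup(html, "html.parser")
soup_links = [a.get("href") for a in soup.find_all("a")]

print(regex_links)  # ['/a']
print(soup_links)   # ['/a', '/b']
```

The regex silently misses the second link, while the tag-based lookup finds both.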

Beautiful Soup example, excerpted from the official documentation

html_doc = """ ... """
Here is a brief description of how Beautiful Soup searches a document: it treats the document as a tree of tags.
To use it, instantiate an object that represents the entire HTML file. Each tag is exposed as an attribute of that object, so you reach it with the "." operator.
Here's how to do it, starting with some simple operations:

from bs4 import BeautifulSoup

with open("html_doc.html") as f:
    soup = BeautifulSoup(f, "lxml")

# Print the <title> tag of the HTML file
print(soup.title)
# <title>The Dormouse's story</title>

# Print the name of the tag
print(soup.title.name)
# title

# Print the contents of the tag
print(soup.title.string)
# The Dormouse's story

# Print the <p> tag in soup; only the first one found is returned
print(soup.p)
# <p class="title"><b>The Dormouse's story</b></p>

# Print the class of the <p> tag; again, only the first one found is used
print(soup.p['class'], type(soup.p['class']))
# ['title'] <class 'list'>  (the type is a list)

# Print the <a> tag in soup; only the first one found is returned
print(soup.a)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

# Print all the <a> tags
print(soup.find_all('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Print the tag with id="link3"
print(soup.find(id="link3"))
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

# Find the links of all <a> tags in the document:
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

# Get all the text from the document:
print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
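The snippet below is a self-contained sketch of the same lookups. The `html_doc` string is an assumption: it is reconstructed from the outputs shown above (the "three sisters" sample) and may differ in whitespace from the file the author used. It also swaps in the built-in "html.parser" so that it runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Assumed sample markup, reconstructed from the printed outputs above.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

title_name = soup.title.name      # name of the first <title> tag
title_text = soup.title.string    # text inside <title>
first_p_class = soup.p["class"]   # class is multi-valued, so this is a list
hrefs = [link.get("href") for link in soup.find_all("a")]
link3 = soup.find(id="link3")     # first tag whose id is "link3"

print(title_name, title_text, first_p_class)
print(hrefs)
print(link3.get_text())
```

Dotted access such as `soup.p` only ever returns the first matching tag; `find_all` is the call to use when every match is needed.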
Tag

soup1 = BeautifulSoup('<b class="boldest">Extremely bold</b>', "lxml")
tag = soup1.b
print(type(tag))
# <class 'bs4.element.Tag'>

Tag's name attribute

print(tag.name)
# b

# If you change a tag's name, the change is reflected in any HTML markup
# generated by the current Beautiful Soup object:
tag.name = "blockquote"
print(tag)
# <blockquote class="boldest">Extremely bold</blockquote>

Tag's attributes

A tag can have any number of attributes. The tag <b class="boldest"> has an attribute "class" whose value is "boldest".

print(tag['class'])
# ['boldest']

print(tag.attrs)
# {'class': ['boldest']}

print(soup.a.attrs['class'])
# ['sister']
  
# A tag's attributes can be added, removed, or modified. Again, they are manipulated in the same way as dictionary entries.

tag['class'] = 'verybold'
tag['id'] = 1
print(tag)
# <blockquote class="verybold" id="1">Extremely bold</blockquote>

del tag['class']
del tag['id']
print(tag)
# <blockquote>Extremely bold</blockquote>

# Indexing a missing attribute raises an error, while .get() returns None:
# tag['class']
# KeyError: 'class'
print(tag.get('class'))
# None
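As a self-contained sketch of the dictionary-style attribute handling described above (using the built-in "html.parser" and a string value for `id`, both choices made here for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

# Rename the tag, then add and modify attributes like dictionary entries.
tag.name = "blockquote"
tag["class"] = "verybold"
tag["id"] = "1"
rendered = str(tag)

# Remove the attributes again.
del tag["class"]
del tag["id"]
stripped = str(tag)

# .get() avoids the KeyError that tag["class"] would raise here.
missing = tag.get("class")
fallback = tag.get("class", [])  # optional default, as with dict.get

print(rendered)
print(stripped)
print(missing, fallback)
```

Using `.get()` with a default is usually the safer choice in a crawler, where a given attribute may be absent on some pages.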
Child node operations: the .contents attribute

# A tag's .contents attribute returns the tag's child nodes as a list:
print(soup)
print(soup.contents)  # this prints the entire HTML document, wrapped in a list
print("________")
print(soup.head.contents)  # this prints the list of nodes under <head>
# ['\n', <meta charset="utf-8"/>, '\n', <title>The Dormouse's story</title>, '\n']
print(len(soup.head.contents))
# 5
print(soup.head.contents[1].name)
# meta
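The following sketch shows why the list above has five entries: the newlines between tags are child nodes too. The markup is an assumption chosen to match the output shown, and "html.parser" is used so the example runs without lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumed markup matching the <head> contents printed above.
html = """<html><head>
<meta charset="utf-8"/>
<title>The Dormouse's story</title>
</head><body></body></html>"""

soup = BeautifulSoup(html, "html.parser")

head_children = soup.head.contents  # a plain list of child nodes
# Keep only real tags, skipping the whitespace strings between them.
tag_names = [child.name for child in head_children if isinstance(child, Tag)]

print(len(head_children))     # 5
print(head_children[1].name)  # meta
print(tag_names)              # ['meta', 'title']
```

Filtering with `isinstance(child, Tag)` is one way to ignore the whitespace strings when only the elements matter.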


Parsers:

Parser	How to use	Advantages	Disadvantages
Python standard library	`BeautifulSoup(markup, "html.parser")`	Built into Python; moderate speed; lenient with malformed documents	Less lenient in Python versions before 2.7.3 or 3.2.2
lxml HTML parser	`BeautifulSoup(markup, "lxml")`	Very fast; lenient with malformed documents	Requires an external C library
lxml XML parser	`BeautifulSoup(markup, ["lxml-xml"])` `BeautifulSoup(markup, "xml")`	Very fast; the only parser that supports XML	Requires an external C library
html5lib	`BeautifulSoup(markup, "html5lib")`	Most lenient; parses documents the way a browser does; generates documents in HTML5 format	Slow; external Python dependency (pure Python, no C extension)
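To see how the parser choices in the table above behave in practice, the sketch below feeds the same invalid fragment to each HTML parser that happens to be installed. The fragment is a hypothetical example; lxml and html5lib are optional here and are skipped when missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

markup = "<a></p>"  # deliberately invalid: a stray closing tag

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed.
        results[parser] = None

for parser, rendered in results.items():
    print(parser, "->", rendered if rendered is not None else "not installed")
```

Each parser repairs broken markup differently, so naming the parser explicitly keeps a crawler's results consistent across machines.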

