Beautiful Soup 4 Operations
Why use Beautiful Soup
Beautiful Soup is a Python library that extracts data from HTML or XML files. It works with your preferred parser to provide idiomatic ways of navigating, searching, and modifying the document. Tags are looked up by name, somewhat as in jQuery. It also improves efficiency: in crawler development, finding and filtering content by hand with regular expressions wastes a great deal of time.
Beautiful Soup sample, excerpted from the official website
html_doc = """ ... """
Here is a brief description of how Beautiful Soup searches: it treats the document as a tree of tags.
To use it, instantiate an object that represents the entire HTML file. Each tag is exposed as an attribute of that object, so tags are looked up with ".".
Here's how to do it:
Simple operations
from bs4 import BeautifulSoup

with open("html_doc.html") as f:
    soup = BeautifulSoup(f, "lxml")

# Print the <title> tag of the HTML file
print(soup.title)
# <title>The Dormouse's story</title>

# Print the name of the tag
print(soup.title.name)
# title

# Print the contents of the tag
print(soup.title.string)
# The Dormouse's story

# Print the <p> tag in soup; only the first match is returned
print(soup.p)
# <p class="title"><b>The Dormouse's story</b></p>

# Print the class of the <p> tag in soup; again, only the first match is used
print(soup.p['class'], type(soup.p['class']))
# ['title'] <class 'list'>  (the type is a list)

# Print the <a> tag in soup; only the first match is returned
print(soup.a)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

# Print all the <a> tags
print(soup.find_all('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Print the tag with id="link3"
print(soup.find(id="link3"))
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

# Find the links of all <a> tags in the document:
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

# Get all the text from the document:
print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
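The file html_doc.html is not reproduced in this post, so the following is only a minimal self-contained sketch of the same dotted access and search calls. The inline markup and the name sample_html are hypothetical stand-ins, not the official sample document, and the sketch assumes the built-in "html.parser" so that nothing beyond bs4 needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup (not the official sample document).
sample_html = """
<html><head><title>Demo page</title></head>
<body>
<p class="intro"><b>Hello</b></p>
<a class="nav" href="http://example.com/one" id="link1">One</a>
<a class="nav" href="http://example.com/two" id="link2">Two</a>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Dotted access returns the first matching tag in the tree.
title_text = soup.title.string
# "class" is a multi-valued attribute, so it comes back as a list.
first_p_classes = soup.p["class"]
# find_all returns every match; .get reads one attribute from each tag.
hrefs = [a.get("href") for a in soup.find_all("a")]
# find returns only the first tag matching the keyword filter.
second_link = soup.find(id="link2")

print(title_text)
print(first_p_classes)
print(hrefs)
print(second_link)
```

Dotted access is convenient for the first match only; anything that may occur more than once is usually better fetched with find_all.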
Tag
soup1 = BeautifulSoup('<b class="boldest">extremely bold</b>', "lxml")
tag = soup1.b
print(type(tag))
# <class 'bs4.element.Tag'>
Tag's name property
print(tag.name)
# b
# If you change the name of a tag, the change is reflected in any HTML generated by the current Beautiful Soup object:
tag.name = "blockquote"
print(tag)
# <blockquote class="boldest">extremely bold</blockquote>
Tag's attributes
A tag can have many attributes. The tag <b class="boldest"> has a "class" attribute.
print(tag['class'])
# ['boldest']
print(tag.attrs)
# {'class': ['boldest']}
print(soup.a.attrs['class'])
# ['sister']
# The attributes of a tag can be added, deleted, or modified. Again, a tag's attributes are manipulated in the same way as a dictionary.
tag['class'] = 'verybold'
tag['id'] = 1
print(tag)
# <blockquote class="verybold" id="1">extremely bold</blockquote>
del tag['class']
del tag['id']
print(tag)
# <blockquote>extremely bold</blockquote>
# Indexing a deleted attribute with tag['class'] now raises:
# KeyError: 'class'
print(tag.get('class'))
# None
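The dictionary-style behaviour described above can be sketched in a self-contained form as follows. This is an illustration under assumptions rather than the author's own script: it uses the built-in "html.parser" instead of lxml, and the variable names after_set, after_del, missing, fallback and has_class are hypothetical.

```python
from bs4 import BeautifulSoup

soup1 = BeautifulSoup('<b class="boldest">extremely bold</b>', "html.parser")
tag = soup1.b

# Modify an existing attribute and add a new one, as with a dict.
tag["class"] = "verybold"
tag["id"] = "1"
after_set = str(tag)

# Delete both attributes again.
del tag["class"]
del tag["id"]
after_del = str(tag)

# tag["class"] would raise KeyError here; .get avoids the exception.
missing = tag.get("class")
fallback = tag.get("class", [])
has_class = tag.has_attr("class")

print(after_set)
print(after_del)
print(missing, fallback, has_class)
```

Using .get (or has_attr) is the safer choice in crawler code, where a tag may or may not carry a given attribute.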
Child node operations: the .contents attribute
# .contents
# A tag's .contents attribute returns the tag's child nodes as a list:
print(soup)
print(soup.contents)  # this prints the entire html tag
print("________")
print(soup.head.contents)  # this prints the list of child nodes under head
# ['\n', <meta charset="utf-8"/>, '\n', <title>The Dormouse's story</title>, '\n']
print(len(soup.head.contents))
# 5
print(soup.head.contents[1].name)
# meta
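The list above has five entries because the newlines between tags are child nodes too. A minimal sketch of separating real tags from that whitespace is shown below; the markup in head_html is a hypothetical stand-in, and the built-in "html.parser" is assumed.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical stand-in for the <head> of a document.
head_html = '<head>\n<meta charset="utf-8"/>\n<title>Demo</title>\n</head>'
soup = BeautifulSoup(head_html, "html.parser")

children = soup.head.contents

# Whitespace between tags appears as NavigableString children;
# keep only the children that are actual Tag objects.
tag_children = [child for child in children if isinstance(child, Tag)]
names = [child.name for child in tag_children]

print(len(children))
print(names)
```

Filtering on Tag is one simple way to ignore whitespace nodes when indexing into .contents.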
Parsers:
| Parser | How to use | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python's standard library; moderate execution speed; tolerant of malformed documents | Poor tolerance of malformed documents in versions prior to Python 2.7.3 or 3.2.2 |
| lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; tolerant of malformed documents | Requires installing a C library |
| lxml XML parser | BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML | Requires installing a C library |
| html5lib | BeautifulSoup(markup, "html5lib") | Most tolerant of malformed documents; parses documents the way a browser does; generates documents in HTML5 format | Slow; requires an external Python dependency |
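Since lxml and html5lib are optional installs, one possible way to choose among the parsers is to try them in order of preference and fall back to the built-in one. The helper make_soup and its preferred parameter below are hypothetical names introduced for illustration; the sketch relies on bs4 raising FeatureNotFound when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Returns a (soup, parser_name) pair. Hypothetical helper, not part of bs4.
    """
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser library is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>Unclosed paragraph")
print(used)
print(soup.p.get_text())
```

Different parsers can build different trees from the same malformed markup, so it may be worth logging which parser was actually used.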