The regular expressions covered in the last article are, frankly, inconvenient for many people to use: there are a lot of rules to remember, so it is hard to get fluent with them. In this section we introduce BeautifulSoup, a very powerful tool and a staple weapon for crawlers.

BeautifulSoup: "Beautiful Soup, so rich and green"

It is a flexible and convenient page-parsing library that processes documents efficiently and supports a variety of parsers. With it, you can conveniently scrape information from a web page without writing any regular expressions.
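If the libraries are not installed yet, a typical setup (assuming pip is available) is:

pip install beautifulsoup4 lxml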
Quick start
Here is a simple example to get a feel for bs4 and its strengths:
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(soup.title)
print(soup.title.name)
print(soup.title.string)
print(soup.title.parent.name)
print(soup.p)
print(soup.p["class"])
print(soup.a)
print(soup.find_all('a'))
print(soup.find(id='link3'))
Parsing this code with BeautifulSoup gives us a BeautifulSoup object; prettify() outputs the document in a standard indented format, and the remaining print calls show the corresponding elements.
At the same time, we can extract all the links and the full text content separately with the following code:
for link in soup.find_all('a'):
    print(link.get('href'))

print(soup.get_text())
Parser
Beautiful Soup supports the HTML parser in the Python standard library and also supports a number of third-party parsers. If no third-party parser is installed, Python's default html.parser is used.
The most common parsers are Python's built-in html.parser, lxml's HTML parser ('lxml'), lxml's XML parser ('xml'), and html5lib.
It is recommended to use lxml as the parser because it is more efficient. For versions before Python 2.7.3, or Python 3 versions before 3.2.2, lxml or html5lib must be installed, because the HTML parser built into the standard library of those Python versions is not stable enough.
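The parser is chosen by the second argument to the BeautifulSoup constructor; a minimal sketch:

from bs4 import BeautifulSoup

# the built-in parser, always available
soup = BeautifulSoup('<p>Hello</p>', 'html.parser')
# the lxml parser, faster, but requires the lxml package to be installed
soup = BeautifulSoup('<p>Hello</p>', 'lxml')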
Basic use
Tag Selector
Building on the quick-start example, add the following code:
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)
With soup.<tag name> we can get that tag and its contents.

One thing to note: when selecting a tag this way, if the document contains multiple tags with the same name, only the first match is returned. For example, soup.p returns only the first p tag, even though the document contains several.
Get Name
Through soup.title.name we can get the name of the title tag, which is simply title.
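For example, continuing with the quick-start document:

print(soup.title.name)   # prints: title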
Get attributes
print(soup.p.attrs['name'])
print(soup.p['name'])
Either of the two lines above returns the value of the p tag's name attribute.
Get content
print(soup.p.string)
The result is the content of the first p tag:

The Dormouse's story
Nested selection
We can also select elements by nesting attribute access directly:
print(soup.head.title.string)
Child nodes and descendant nodes
Using contents
This is illustrated by the following example:
HTML ="""""" fromBs4ImportBeautifulsoupsoup= BeautifulSoup (HTML,'lxml')Print(soup.p.contents)
The result is a list holding all the direct children of the first p tag; both text fragments and child tags are stored in this list.
Using children
children returns the same child nodes of the p tag as contents does; the difference is that soup.p.children is an iterator rather than a list, so its values can only be read by looping over it.
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)
Both contents and children return only direct children; to get all the descendant nodes, use descendants.
print(soup.p.descendants)

The result of this is also a generator.
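To actually see the descendants, iterate over the generator, for example:

for i, child in enumerate(soup.p.descendants):
    print(i, child)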
Parent and ancestor nodes
soup.a.parent returns the parent node of the first a tag.

All ancestor nodes can be obtained through list(enumerate(soup.a.parents)). The result is a list in which the a tag's parent comes first, then each enclosing node in turn, and finally the whole document; the last element of the list, and the second-to-last, hold the information for the entire document.
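A minimal sketch against the quick-start document:

# the direct parent of the first a tag
print(soup.a.parent)
# every ancestor, from the immediate parent up to the whole document
print(list(enumerate(soup.a.parents)))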
Sibling nodes
soup.a.next_siblings gets all following sibling nodes
soup.a.previous_siblings gets all preceding sibling nodes
soup.a.next_sibling gets the next sibling tag
soup.a.previous_sibling gets the previous sibling tag
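For example:

# generators over all following and all preceding siblings
print(list(enumerate(soup.a.next_siblings)))
print(list(enumerate(soup.a.previous_siblings)))
# the single next and previous sibling
print(soup.a.next_sibling)
print(soup.a.previous_sibling)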
Standard selectors: find_all
find_all(name, attrs, recursive, text, **kwargs)
It can search the document by tag name, attributes, or text content.
Using name
Html=" "<div class= "Panel" > <div class= "panel-heading" > " " fromBs4ImportBeautifulsoupsoup= BeautifulSoup (HTML,'lxml')Print(Soup.find_all ('ul'))Print(Type (Soup.find_all ('ul') [0]))
find_all returns its results as a list. We can also call find_all again on each result to get all the li tags inside every ul:
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
Attrs
Examples are as follows:
Html=" "<div class= "Panel" > <div class= "panel-heading" > " " fromBs4ImportBeautifulsoupsoup= BeautifulSoup (HTML,'lxml')Print(Soup.find_all (attrs={'ID':'list-1'}))Print(Soup.find_all (attrs={'name':'Elements'}))
attrs takes a dictionary of attribute names and values to search by. class needs special treatment, because class is a reserved word in Python: use the keyword argument class_='element', or the dictionary form soup.find_all(attrs={'class': 'element'}). Special attributes such as id and class do not need attrs at all and can be passed directly as keyword arguments, as shown below.
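A short sketch against the same panel/list document:

# class is a reserved word in Python, so Beautiful Soup uses class_
print(soup.find_all(class_='element'))
# common attributes such as id can also be passed directly as keyword arguments
print(soup.find_all(id='list-1'))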
Text
Examples are as follows:
Html=" "<div class= "Panel" > <div class= "panel-heading" > " " fromBs4ImportBeautifulsoupsoup= BeautifulSoup (HTML,'lxml')Print(Soup.find_all (text='Foo'))
The result is a list of all the text fragments that match text='Foo'.
Find
find(name, attrs, recursive, text, **kwargs)
find returns only the first element of the matching results.
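For example, against the panel/list document above:

# find returns the first match only
print(soup.find('ul'))
# and None when nothing matches
print(soup.find('page'))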
Some other similar usage:
find_parents() returns all ancestor nodes; find_parent() returns the immediate parent node.
find_next_siblings() returns all following sibling nodes; find_next_sibling() returns the first following sibling.
find_previous_siblings() returns all preceding sibling nodes; find_previous_sibling() returns the first preceding sibling.
find_all_next() returns all matching nodes after a node; find_next() returns the first matching node.
find_all_previous() returns all matching nodes before a node; find_previous() returns the first matching node.
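As a minimal sketch of two of these methods (assuming soup holds the quick-start "Dormouse" document from the beginning of the article):

# the a tag that follows the first a tag
print(soup.a.find_next_sibling('a'))
# the p tag that contains the first a tag
print(soup.a.find_parent('p'))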
CSS Selector
select() lets you select elements by passing a CSS selector directly.

If you are familiar with front-end development, you already know CSS selectors; the usage here is exactly the same.
. selects by class, # selects by id
tag1, tag2 selects all tag1 and all tag2 elements
tag1 tag2 selects all tag2 elements inside tag1
[attr] selects all tags that have the attribute attr
[attr=value], for example [target=_blank], selects all tags with target=_blank
Html=" "<div class= "Panel" > <div class= "panel-heading" > " " fromBs4ImportBeautifulsoupsoup= BeautifulSoup (HTML,'lxml')Print(Soup.select ('. Panel. Panel-heading'))Print(Soup.select ('ul Li'))Print(Soup.select ('#list-2. Element'))Print(Type (Soup.select ('ul') [0]))
Get content
The text content of an element can be obtained with get_text():
Html=" "<div class= "Panel" > <div class= "panel-heading" > " " fromBs4ImportBeautifulsoupsoup= BeautifulSoup (HTML,'lxml') forLiinchSoup.select ('Li'): Print(Li.get_text ())
Get attributes
An attribute can be read either by indexing with the attribute name, [attribute name], or through attrs[attribute name]:
Html=" "<div class= "Panel" > <div class= "panel-heading" > " " fromBs4ImportBeautifulsoupsoup= BeautifulSoup (HTML,'lxml') forUlinchSoup.select ('ul'): Print(ul['ID']) Print(ul.attrs['ID'])
Summary
It is recommended to use the lxml parsing library, falling back to html.parser when necessary.
Tag selection (soup.tag) offers only weak filtering but is fast.
Use find() and find_all() to match a single result or multiple results.
Use select() if you are familiar with CSS selectors.
Remember the common ways of getting attribute values and text content.
Python Crawler from Getting Started to Giving Up (Part 6): Using the BeautifulSoup Library