What I wrote this morning was lost. The editor has a bug: it claims to auto-save drafts, but the draft was never actually saved. I'm speechless.
Tonight, we will continue to discuss how to parse an HTML document.
1. String
# Find elements directly by tag name
soup.find_all('b')
2. Regular expressions
# Find by regular expression
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
3. List
Find all <a> and <b> tags
soup.find_all(["a", "b"])
4. True
Find all tags
for tag in soup.find_all(True):
    print(tag.name)
5. Methods
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
# Pass the function itself as the filter. Only elements for which it returns True are returned
soup.find_all(has_class_but_no_id)
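To compare the five filter types above side by side, here is a minimal sketch. The html_doc in it is my own trimmed stand-in for the sample document (an assumption, since the snippets above never show it), and the built-in "html.parser" is used so that nothing besides bs4 is needed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample document, trimmed down for illustration
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")


def names(tags):
    """Return the tag names of a result list, in document order."""
    return [tag.name for tag in tags]


def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")


by_string = names(soup.find_all("b"))                    # exact tag name
by_regex = names(soup.find_all(re.compile("^b")))        # names starting with "b"
by_list = names(soup.find_all(["a", "b"]))               # any name in the list
by_true = names(soup.find_all(True))                     # every tag
by_function = names(soup.find_all(has_class_but_no_id))  # custom predicate

print(by_string)     # ['b']
print(by_regex)      # ['body', 'b']
print(by_list)       # ['b', 'a', 'a', 'a']
print(len(by_true))  # 10
print(by_function)   # ['p', 'p']
```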
6. find_all
The find_all() method searches all descendant tags of the current tag and checks whether each one matches the filter. Here are a few examples:
soup.find_all("title")
# Find <p> elements with class="title"
soup.find_all("p", "title")
# Find all <a> elements
soup.find_all("a")
# Find by id
soup.find_all(id="link2")
# Find by text content
import re
soup.find(text=re.compile("sisters"))
# Find by regular expression: elements whose attribute value matches the pattern
soup.find_all(href=re.compile("elsie"))
# Find elements that have an id attribute
soup.find_all(id=True)
# Find by several conditions at once
soup.find_all(href=re.compile("elsie"), id='link1')
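The keyword-argument searches above can be checked with a small sketch like the following. The markup is a hypothetical, shortened sample of my own, and the ids() helper is only there to make the results easy to read.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration
html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")


def ids(tags):
    """Return the id attribute of every tag in a result list."""
    return [tag.get("id") for tag in tags]


# A string passed as the second positional argument is treated as a CSS class
title_paragraphs = soup.find_all("p", "title")

by_id = ids(soup.find_all(id="link2"))
by_href = ids(soup.find_all(href=re.compile("elsie")))
with_id = ids(soup.find_all(id=True))
# Several keyword arguments must all match
combined = ids(soup.find_all(href=re.compile("elsie"), id="link1"))
no_match = ids(soup.find_all(href=re.compile("elsie"), id="link2"))

print(len(title_paragraphs))  # 1
print(by_id)                  # ['link2']
print(by_href)                # ['link1']
print(with_id)                # ['link1', 'link2', 'link3']
print(combined)               # ['link1']
print(no_match)               # []
```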
Some attributes cannot be used as keyword arguments in a search, such as the data-* attributes in HTML5:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(data-foo="value")
# SyntaxError: a keyword argument name cannot contain a hyphen
However, you can pass a dictionary to the attrs parameter of find_all() to search for tags that have such attributes:
data_soup.find_all(attrs={"data-foo": "value"})
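A minimal runnable version of the attrs search, assuming a second <div> without the attribute (my addition) so there is something to filter out:

```python
from bs4 import BeautifulSoup

# The second <div> is a hypothetical addition so the filter has something to exclude
data_soup = BeautifulSoup(
    '<div data-foo="value">foo!</div><div>bar</div>', "html.parser"
)

# Attribute names that are not valid Python identifiers go into the attrs dictionary
matches = data_soup.find_all(attrs={"data-foo": "value"})

print(len(matches))           # 1
print(matches[0].get_text())  # foo!
```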
# Searching by CSS class: pay attention to how class is written
Searching for tags by CSS class name is very useful, but class is a reserved word in Python, so using class as a keyword argument causes a syntax error. Starting with version 4.1.1 of Beautiful Soup, you can search for tags with a given CSS class name through the class_ parameter:
soup.find_all("a", class_="sister")
The class_ parameter also accepts the different filter types: a string, a regular expression, a function, or True:
soup.find_all(class_=re.compile("itl"))
def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6
soup.find_all(class_=has_six_characters)
The class attribute of a tag is multi-valued. When searching by CSS class name, you can match any single one of the tag's class names:
css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.find_all("p", class_="strikeout")
css_soup.find_all("p", class_="body")
You can also search for the exact string value of the class attribute:
css_soup.find_all("p", class_="body strikeout")
If the class names are given in a different order from the actual value of class, nothing is found.
The same class search can also be written with the attrs parameter:
soup.find_all("a", attrs={"class": "sister"})
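The multi-valued behaviour of class is easy to get wrong, so here is a short sketch of the cases described above. The last line uses a CSS selector as one possible way to require both classes regardless of order. That is my suggestion, not something the text above prescribes.

```python
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")

single = css_soup.find_all("p", class_="strikeout")        # one of the class names
exact = css_soup.find_all("p", class_="body strikeout")    # exact attribute string
reordered = css_soup.find_all("p", class_="strikeout body")  # order differs
both = css_soup.select("p.strikeout.body")                 # order-independent selector

print(len(single))     # 1
print(len(exact))      # 1
print(len(reordered))  # 0
print(len(both))       # 1
```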
The text parameter lets you search the string content of the document.
Like the name parameter, text accepts a string, a regular expression, a list, a function, or True.
soup.find_all(text="Elsie")
soup.find_all(text=["Tillie", "Elsie", "Lacie"])
soup.find_all(text=re.compile("Dormouse"))
def is_the_only_string_within_a_tag(s):
    return (s == s.parent.string)
soup.find_all(text=is_the_only_string_within_a_tag)
Although the text parameter searches for strings, it can be combined with other parameters to filter tags. Beautiful Soup then finds the tags whose .string matches the value of text. The following code finds the <a> tags whose content is "Elsie":
soup.find_all("a", text="Elsie")
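Here is a small sketch of the string searches above on a hypothetical one-paragraph sample. One caveat: from Beautiful Soup 4.4.0 on, this argument is also accepted under the name string, and that is the spelling I use here, so on an older version you would write text= instead.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration
html_doc = (
    '<p class="story"><a id="link1">Elsie</a>, '
    '<a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(html_doc, "html.parser")


def is_the_only_string_within_a_tag(s):
    """True when the string is the sole content of its parent tag."""
    return s == s.parent.string


def as_text(strings):
    return [str(s) for s in strings]


exact = as_text(soup.find_all(string="Elsie"))
from_list = as_text(soup.find_all(string=["Tillie", "Elsie", "Lacie"]))
by_regex = as_text(soup.find_all(string=re.compile("^L")))
only_strings = as_text(soup.find_all(string=is_the_only_string_within_a_tag))

# Combined with a tag name, the result is the tags, not the strings
tags = soup.find_all("a", string="Elsie")

print(exact)          # ['Elsie']
print(from_list)      # ['Elsie', 'Lacie', 'Tillie'] (document order, not list order)
print(by_regex)       # ['Lacie']
print(only_strings)   # ['Elsie', 'Lacie', 'Tillie']
print(tags[0]["id"])  # link1
```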
The find_all() method returns every match, so searches are slow when the document tree is large. If you don't need all the results, use the limit parameter to cap the number returned. It works like the LIMIT keyword in SQL: once the number of results reaches limit, the search stops and the results are returned.
soup.find_all("a", limit=2)
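A tiny sketch of limit in action, using three hypothetical links:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with three links
soup = BeautifulSoup("<p><a>one</a><a>two</a><a>three</a></p>", "html.parser")

all_links = soup.find_all("a")
first_two = soup.find_all("a", limit=2)  # stops after the second match

print(len(all_links))                     # 3
print([a.get_text() for a in first_two])  # ['one', 'two']
```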
When you call find_all() on a tag, Beautiful Soup searches all descendants of that tag. If you only want to search its direct children, pass recursive=False.
soup.html.find_all("title")
soup.html.find_all("title", recursive=False)
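To see the difference, here is a sketch on a hypothetical minimal document where <title> is a grandchild of <html>, not a direct child:

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document: <title> sits inside <head>, not directly under <html>
soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)

deep = soup.html.find_all("title")                      # searches all descendants
shallow = soup.html.find_all("title", recursive=False)  # direct children only
direct = soup.html.find_all("head", recursive=False)    # <head> is a direct child

print(len(deep))     # 1
print(len(shallow))  # 0
print(len(direct))   # 1
```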
find_all() is probably the most commonly used search method in Beautiful Soup, so there is a shorthand for it. Calling a BeautifulSoup object or a tag object as if it were a function is the same as calling find_all() on that object. The following two lines are equivalent:
soup.find_all("a")
soup("a")
So are these two:
soup.title.find_all(text=True)
soup.title(text=True)
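The equivalence can be checked directly. The markup below is a hypothetical sample, and string= is the newer spelling of text= (Beautiful Soup 4.4.0 and later):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration
soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><a>one</a><a>two</a></body></html>",
    "html.parser",
)

links_long = soup.find_all("a")
links_short = soup("a")  # calling the object is shorthand for find_all()

title_strings_long = soup.title.find_all(string=True)
title_strings_short = soup.title(string=True)

print(links_long == links_short)                  # True
print(title_strings_long == title_strings_short)  # True
print([str(s) for s in title_strings_short])      # ["The Dormouse's story"]
```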
7. find
soup.find_all('title', limit=1) finds the same element as soup.find('title').
find() returns the first match. find_all() returns a list, while find() returns a single object.
When nothing matches, find_all() returns an empty list and find() returns None.
soup.head.title is shorthand for navigating by tag name. It works by calling find() on the current tag repeatedly, so these two are equivalent:
soup.head.title and soup.find("head").find("title")
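The differences between find() and find_all() matter most when nothing matches, because calling a method on None raises an error. A minimal sketch on a hypothetical document with no <table>:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration
soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head><body></body></html>",
    "html.parser",
)

as_list = soup.find_all("title", limit=1)  # a list holding one tag
as_tag = soup.find("title")                # the tag itself

missing_list = soup.find_all("table")      # empty list when nothing matches
missing_tag = soup.find("table")           # None when nothing matches

# Attribute-style navigation is repeated find() calls
same = soup.head.title == soup.find("head").find("title")

print(as_list[0] == as_tag)  # True
print(missing_list)          # []
print(missing_tag)           # None
print(same)                  # True
```

So when you use find(), check for None before touching the result.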
8. find_parents() and find_parent()
soup = BeautifulSoup(html_doc, "lxml")
a_string = soup.find(text="Lacie")
print('1---------------------------')
print(a_string)
print('2---------------------------')
# Find all ancestors that are <a> tags
print(a_string.find_parents("a"))
print('3---------------------------')
# Find the nearest ancestor that is a <p> tag
print(a_string.find_parent("p"))
print('4---------------------------')
# Find all ancestors that are <p> tags with class="title"
print(a_string.find_parents("p", class_="title"))
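Here is a self-contained sketch of the same calls, assuming a hypothetical sample in which "Lacie" sits inside an <a> that sits inside a story paragraph. It uses "html.parser" instead of "lxml" so it runs without extra packages, and string= is the newer spelling of text=.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration
html_doc = (
    '<html><body><p class="story">Once upon a time '
    '<a id="link2">Lacie</a></p></body></html>'
)
soup = BeautifulSoup(html_doc, "html.parser")

a_string = soup.find(string="Lacie")

link_ancestors = a_string.find_parents("a")                # all <a> ancestors
nearest_paragraph = a_string.find_parent("p")              # closest <p> ancestor
title_paragraphs = a_string.find_parents("p", class_="title")  # none match

print([t["id"] for t in link_ancestors])  # ['link2']
print(nearest_paragraph["class"])         # ['story']
print(title_paragraphs)                   # []
```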
9. find_next_siblings() and find_next_sibling()
soup = BeautifulSoup(html_doc, "lxml")
a_string = soup.find(text="Lacie")
print('1---------------------------')
first_link = soup.a
print(first_link)
print('2---------------------------')
# Find all later siblings of the current element that are <a> tags
print(first_link.find_next_siblings("a"))
print('3---------------------------')
first_story_paragraph = soup.find("p", "story")
# Find the first later sibling of the current element that is a <p> tag
print(first_story_paragraph.find_next_sibling("p"))
10. find_previous_siblings() and find_previous_sibling()
The same as section 9, but in the opposite direction
last_link = soup.find("a", id="link3")
last_link
last_link.find_previous_siblings("a")
first_story_paragraph = soup.find("p", "story")
first_story_paragraph.find_previous_sibling("p")
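The sibling searches in both directions can be tried together. This is a sketch on a hypothetical three-paragraph sample. Note the order of the results from find_previous_siblings(): the nearest sibling comes first.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration
html_doc = (
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story">'
    '<a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a>'
    '</p>'
    '<p class="story">...</p>'
)
soup = BeautifulSoup(html_doc, "html.parser")

first_link = soup.a
last_link = soup.find("a", id="link3")
first_story_paragraph = soup.find("p", "story")

later_links = [t["id"] for t in first_link.find_next_siblings("a")]
earlier_links = [t["id"] for t in last_link.find_previous_siblings("a")]
next_paragraph = first_story_paragraph.find_next_sibling("p")
previous_paragraph = first_story_paragraph.find_previous_sibling("p")

print(later_links)                  # ['link2', 'link3']
print(earlier_links)                # ['link2', 'link1']
print(next_paragraph.get_text())    # ...
print(previous_paragraph["class"])  # ['title']
```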
11. find_all_next() and find_next()
These two methods use the .next_elements property to iterate over the tags and strings that come after the current tag. find_all_next() returns all matching nodes, and find_next() returns the first matching node:
first_link.find_all_next(text=True)
first_link.find_next("p")
12. find_all_previous() and find_previous()
These two methods use the .previous_elements property to iterate over the tags and strings that come before the current node. find_all_previous() returns all matching nodes, and find_previous() returns the first matching node:
first_link.find_all_previous("p")
first_link.find_previous("title")
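A point that is easy to miss: these methods follow document order, not the tree structure, so the results include the tag's own children (forward) and its own parents (backward). A minimal sketch on a hypothetical sample, using string= as the newer spelling of text=:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration
html_doc = (
    "<html><head><title>The Dormouse's story</title></head><body>"
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story">Once upon a time '
    '<a id="link1">Elsie</a> and <a id="link2">Lacie</a></p>'
    '<p class="story">...</p>'
    "</body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")
first_link = soup.a

# Forward: starts with the link's own text, then everything parsed after it
strings_after = [str(s) for s in first_link.find_all_next(string=True)]
next_paragraph = first_link.find_next("p")

# Backward: the enclosing <p> was opened before the link, so it counts as "previous"
paragraphs_before = [p["class"] for p in first_link.find_all_previous("p")]
previous_title = first_link.find_previous("title")

print(strings_after)              # ['Elsie', ' and ', 'Lacie', '...']
print(next_paragraph.get_text())  # ...
print(paragraphs_before)          # [['story'], ['title']]
print(previous_title.string)      # The Dormouse's story
```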
13. CSS selectors
Find elements by tag name
soup.select("title")
soup.select("p:nth-of-type(3)")
Find by element hierarchy
soup.select("body a")
soup.select("html head title")
Find direct child elements
soup.select("head > title")
soup.select("p > a")
soup.select("p > a:nth-of-type(2)")
soup.select("p > #link1")
soup.select("body > a")
Find sibling tags
soup.select("#link1 ~ .sister")
soup.select("#link1 + .sister")
Search by CSS class name
soup.select(".sister")
Note that class takes no trailing underscore here
soup.select("[class~=sister]")
Search by tag id
soup.select("#link1")
Search by whether an attribute exists
soup.select('a[href]')
Search by attribute value
soup.select('a[href="http://example.com/elsie"]')
# href ends with "tillie"
soup.select('a[href$="tillie"]')
# href contains ".com/el"
soup.select('a[href*=".com/el"]')
Search by language setting, which is really just another search on an element attribute
multilingual_markup = """
<p lang="en">Hello</p>
<p lang="en-us">Howdy, y'all</p>
<p lang="en-gb">Pip-pip, old fruit</p>
<p lang="fr">Bonjour mes amis</p>
"""
multilingual_soup = BeautifulSoup(multilingual_markup)
multilingual_soup.select('p[lang|=en]')
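To check what a few of the selectors above actually return, here is a sketch on a hypothetical shortened sample. The ids() helper is my own. Recent versions of Beautiful Soup delegate select() to the soupsieve package, which is normally installed along with bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration
html_doc = (
    "<html><head><title>The Dormouse's story</title></head><body>"
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story">'
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, '
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and '
    '<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>'
    "</p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")


def ids(selector):
    """Return the id of every element matched by a CSS selector."""
    return [tag.get("id") for tag in soup.select(selector)]


children = ids("p > a")                  # direct children of <p>
second = ids("p > a:nth-of-type(2)")     # the second <a> inside its parent
not_direct = ids("body > a")             # the links are not direct children of <body>
later_siblings = ids("#link1 ~ .sister")  # all following siblings
adjacent = ids("#link1 + .sister")       # only the immediately following sibling
ends_with = ids('a[href$="tillie"]')
contains = ids('a[href*=".com/el"]')

multilingual_soup = BeautifulSoup(
    '<p lang="en">Hello</p><p lang="en-us">Howdy</p><p lang="fr">Bonjour</p>',
    "html.parser",
)
english = [p["lang"] for p in multilingual_soup.select("p[lang|=en]")]

print(children)        # ['link1', 'link2', 'link3']
print(second)          # ['link2']
print(not_direct)      # []
print(later_siblings)  # ['link2', 'link3']
print(adjacent)        # ['link2']
print(ends_with)       # ['link3']
print(contains)        # ['link1']
print(english)         # ['en', 'en-us']
```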
Anyone who knows jQuery will understand this section at a glance.
As a programmer, you should learn to reason by analogy.
Python crawler series (vi): Search the document tree