Python crawler series (vi): Search the document tree

Source: Internet
Author: User

What I wrote this morning was lost. The platform claims it auto-saves drafts, but it turns out it didn't save anything. Speechless.

Tonight, we continue with how to search an HTML document tree.
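All of the snippets below operate on a soup object built from the usual "three sisters" sample document in the Beautiful Soup documentation. A minimal setup sketch (the document string is reproduced here for convenience; later sections reuse html_doc and the lxml parser):

from bs4 import BeautifulSoup

# The "three sisters" sample document; later snippets refer to its
# <a class="sister"> links (ids link1/link2/link3) and its
# <p class="title"> / <p class="story"> paragraphs.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, "lxml")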

1. String

# Find tags directly by name
soup.find_all('b')

2. Regular expressions

# Find tags whose name matches a regular expression
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

3. List

Find both <a> and <b> tags:

soup.find_all(["a", "b"])

4. True

Find all tags:

for tag in soup.find_all(True):
    print(tag.name)

5. Methods

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

# Pass the function as a filter; only tags for which it returns True are kept
soup.find_all(has_class_but_no_id)
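On the sample document above, this should return the three <p> paragraphs: they have a class attribute but no id, while the <a> links are excluded because they carry an id.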

6. find_all

The find_all() method searches all of the current tag's descendants and checks each one against the filter. Here are a few examples:

soup.find_all("title")

# Find <p> elements with class "title"
soup.find_all("p", "title")

# Find all <a> elements
soup.find_all("a")

# Find by id
soup.find_all(id="link2")

# Find by text content
import re
soup.find(text=re.compile("sisters"))

# Use a regular expression to match an attribute value
soup.find_all(href=re.compile("elsie"))

# Find elements that have an id attribute
soup.find_all(id=True)

# Combine multiple conditions
soup.find_all(href=re.compile("elsie"), id='link1')

Some attributes cannot be used as keyword arguments in a search, for example the HTML5 data-* attributes:

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(data-foo="value")   # this raises a SyntaxError

However, you can pass such attributes through the attrs parameter of find_all() as a dictionary:

data_soup.find_all(attrs={"data-foo": "value"})

Searching by CSS class: note how class is handled.

Searching for a tag by its CSS class name is very useful, but class is a reserved word in Python, so using class as a keyword argument raises a syntax error. Since Beautiful Soup 4.1.1, you can search by CSS class name with the class_ parameter:

soup.find_all("a", class_="sister")

The class_ parameter also accepts the different kinds of filters: a string, a regular expression, a function, or True:

soup.find_all(class_=re.compile("itl"))

def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

soup.find_all(class_=has_six_characters)

The class attribute of a tag is a multi-valued attribute. When searching by CSS class name, you can match any one of the tag's class names individually:

css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.find_all("p", class_="strikeout")

css_soup.find_all("p", class_="body")

You can also match the exact string value of the class attribute:

css_soup.find_all("p", class_="body strikeout")

If the class names are not in the same order as in the actual class attribute, the search finds nothing (see the sketch below). The class name can also be passed through the attrs parameter:

soup.find_all("a", attrs={"class": "sister"})
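As a quick sketch of that order-sensitivity, reusing the css_soup object defined above (the commented results follow the Beautiful Soup documentation):

# The class attribute is literally "body strikeout", so a reordered string matches nothing
css_soup.find_all("p", class_="strikeout body")   # []

# To match two or more CSS classes at once, a CSS selector is the better tool
css_soup.select("p.strikeout.body")               # [<p class="body strikeout"></p>]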

The text parameter lets you search the document's string content. Like the name parameter, text accepts a string, a regular expression, a list, or True:

soup.find_all(text="Elsie")

soup.find_all(text=["Tillie", "Elsie", "Lacie"])

soup.find_all(text=re.compile("Dormouse"))

def is_the_only_string_within_a_tag(s):
    return s == s.parent.string

soup.find_all(text=is_the_only_string_within_a_tag)

Although the text parameter is meant for searching strings, it can be combined with other parameters to filter tags: Beautiful Soup finds tags whose .string matches the text value. The following code finds the <a> tags whose content is "Elsie":

soup.find_all("a", text="Elsie")

The find_all() method returns every match, so a search over a large document tree can be slow. If you don't need all of the results, use the limit parameter to cap how many are returned. It works like the LIMIT keyword in SQL: once the number of matches reaches limit, the search stops and the results are returned:

soup.find_all("a", limit=2)

When you call a tag's find_all() method, Beautiful Soup searches all of that tag's descendants. If you only want to search the tag's direct children, pass recursive=False:

soup.html.find_all("title")

# <title> is not a direct child of <html>, so this finds nothing
soup.html.find_all("title", recursive=False)

find_all() is probably the most commonly used search method in Beautiful Soup, so it has a shorthand: calling a BeautifulSoup object or a Tag object as if it were a function is the same as calling that object's find_all() method. The following pairs of lines are equivalent:

soup.find_all("a")
soup("a")

soup.title.find_all(text=True)
soup.title(text=True)

7. find

soup.find_all('title', limit=1) is equivalent to soup.find('title').

find() returns the first match that satisfies the filter: find_all() returns a list, while find() returns a single object.

When nothing matches, find_all() returns an empty list, whereas find() returns None.

soup.head.title is shorthand for navigating by tag name; it works by calling find() repeatedly on the current tag:

soup.head.title is equivalent to soup.find("head").find("title").
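A small sketch of these differences, assuming the sample soup from the setup at the top:

print(soup.find_all('title', limit=1))   # [<title>The Dormouse's story</title>]
print(soup.find('title'))                # <title>The Dormouse's story</title>

# When nothing matches:
print(soup.find_all('nosuchtag'))        # []
print(soup.find('nosuchtag'))            # None

# Navigating by name is repeated find() under the hood:
print(soup.head.title)                   # <title>The Dormouse's story</title>
print(soup.find("head").find("title"))   # same result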

8. find_parents() and find_parent()

soup = BeautifulSoup(html_doc, "lxml")
a_string = soup.find(text="Lacie")
print('1---------------------------')
print(a_string)
print('2---------------------------')
# Find all <a> ancestors of the string
print(a_string.find_parents("a"))
print('3---------------------------')
# Find the nearest <p> ancestor
print(a_string.find_parent("p"))
print('4---------------------------')
# Find <p> ancestors with class "title"
print(a_string.find_parents("p", class_="title"))

9. find_next_siblings() and find_next_sibling()

soup = BeautifulSoup(html_doc, "lxml")
a_string = soup.find(text="Lacie")
print('1---------------------------')
first_link = soup.a
print(first_link)
print('2---------------------------')
# Find all following <a> siblings of the current element
print(first_link.find_next_siblings("a"))
print('3---------------------------')
first_story_paragraph = soup.find("p", "story")
# Find the first following <p> sibling
print(first_story_paragraph.find_next_sibling("p"))

10. find_previous_siblings() and find_previous_sibling()

These work like the methods in section 9, but search in the opposite direction:

last_link = soup.find("a", id="link3")
last_link
last_link.find_previous_siblings("a")
first_story_paragraph = soup.find("p", "story")
first_story_paragraph.find_previous_sibling("p")

11. find_all_next() and find_next()

These two methods use the .next_elements property to iterate over the tags and strings that come after the current tag. find_all_next() returns all matching nodes, while find_next() returns only the first:

first_link.find_all_next(text=True)
first_link.find_next("p")

12. find_all_previous() and find_previous()

These two methods use the .previous_elements property to iterate over the tags and strings that come before the current node. find_all_previous() returns all matching nodes, while find_previous() returns only the first:

first_link.find_all_previous("p")

first_link.find_previous("title")
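A side-by-side sketch of the two directions, reusing first_link = soup.a (the "Elsie" link) from section 9; the commented results follow the Beautiful Soup documentation for the sample document:

first_link = soup.a

# Everything after first_link, in document order
print(first_link.find_all_next(text=True))
# ['Elsie', ',\n', 'Lacie', ' and\n', 'Tillie', ...]

print(first_link.find_next("p"))
# <p class="story">...</p>

# Everything before first_link, walking backwards through the document
print(first_link.find_all_previous("p"))
# [<p class="story">Once upon a time ...</p>, <p class="title">...</p>]

print(first_link.find_previous("title"))
# <title>The Dormouse's story</title>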

13. CSS selectors

Find elements by tag name:

soup.select("title")
soup.select("p:nth-of-type(3)")

Find elements by descendant hierarchy:

soup.select("body a")

soup.select("html head title")

Find direct child elements:

soup.select("head > title")

soup.select("p > a")

soup.select("p > a:nth-of-type(2)")

soup.select("p > #link1")

soup.select("body > a")

Find sibling tags:

soup.select("#link1 ~ .sister")

soup.select("#link1 + .sister")

Search by CSS class name:

soup.select(".sister")

Note that here the class name is used directly, without the trailing underscore:

soup.select("[class~=sister]")

Search by tag id:

soup.select("#link1")

Search by whether an attribute exists:

soup.select('a[href]')

Search by attribute value:

soup.select('a[href="http://example.com/elsie"]')

# The attribute value ends with a given string
soup.select('a[href$="tillie"]')

# The attribute value contains a given substring
soup.select('a[href*=".com/el"]')

Find by language setting, which again is just matching an element attribute:

multilingual_markup = """
<p lang="en">Hello</p>
<p lang="en-us">Howdy, y'all</p>
<p lang="en-gb">Pip-pip, old fruit</p>
<p lang="fr">Bonjour mes amis</p>
"""

multilingual_soup = BeautifulSoup(multilingual_markup)

multilingual_soup.select('p[lang|=en]')
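For reference, a quick check of what that last selector matches (a sketch; it should pick up the en, en-us and en-gb paragraphs but not the French one):

for p in multilingual_soup.select('p[lang|=en]'):
    print(p['lang'], '->', p.get_text())
# en -> Hello
# en-us -> Howdy, y'all
# en-gb -> Pip-pip, old fruit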

Anyone who knows jQuery will understand this part at a glance.

As a programmer, you have to learn to reason by analogy.
