A detailed guide to searching content in Python with the Beautiful Soup module

Source: Internet
Author: User
This article introduces the search methods of the Beautiful Soup module in Python. Each method accepts different kinds of filter parameters, which can be used to narrow the search to the results you want. The explanations are detailed and should serve as a useful reference.

Objective

We will use the search methods of the Beautiful Soup module to search by tag name, tag attributes, document text, and regular expressions.

Search methods

Beautiful Soup's built-in search methods are as follows:

    • find()

    • find_all()

    • find_parent()

    • find_parents()

    • find_next_sibling()

    • find_next_siblings()

    • find_previous_sibling()

    • find_previous_siblings()

    • find_previous()

    • find_all_previous()

    • find_next()

    • find_all_next()

Search using the find() method

The first thing you need to do is create an HTML file for testing.
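The test file itself is not reproduced here. The following is a minimal stand-in, not the original file: the structure, ids, class names, and text are assumptions inferred from the examples later in this article, and the built-in html.parser is used instead of lxml so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for search.html. Every id, class name, and text
# value below is an assumption inferred from the later examples.
SEARCH_HTML = """
<html>
<body>
<div class="ecopyramid">
<ul id="producers">
  <li class="producerlist"><p>plants</p></li>
  <li class="producerlist"><p>algae</p></li>
</ul>
<ul id="primaryconsumers">
  <li class="primaryconsumerlist"><p>deer</p></li>
  <li class="primaryconsumerlist"><p>rabbit</p></li>
</ul>
<ul id="tertiaryconsumers">
  <li class="tertiaryconsumerlist"><p>lion</p></li>
  <li class="tertiaryconsumerlist"><p>tiger</p></li>
</ul>
</div>
</body>
</html>
"""

# Save the text above as search.html to follow along, or parse the
# string directly as done here.
soup = BeautifulSoup(SEARCH_HTML, "html.parser")
first_ul = soup.find("ul")
print(first_ul["id"])
print(first_ul.li.p.string)
```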


We can use the find() method to get the <ul> tag; by default it returns the first one that appears in the document. From it we take the first <li> tag and then the first <p> tag, and print the text content to verify that the first occurrence was returned.

    from bs4 import BeautifulSoup

    with open('search.html', 'r') as filename:
        soup = BeautifulSoup(filename, 'lxml')

    first_ul_entries = soup.find('ul')
    print(first_ul_entries.li.p.string)

The signature of the find() method is:

    find(name, attrs, recursive, text, **kwargs)

As the signature shows, find() accepts five parameters: name, attrs, recursive, text, and **kwargs. The name, attrs, and text parameters can all act as filters in find(), which narrows the match to the element you want.
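To make the role of each parameter concrete, here is a small sketch. The HTML snippet and variable names are invented for illustration, and html.parser is used instead of lxml only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Invented sample markup: one <p> nested deeper, one <p> as a direct child.
html = (
    "<div id='outer'>"
    "<section><p id='deep'>nested</p></section>"
    "<p id='top'>direct</p>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div")

# name: the first <p> anywhere below the div, in document order
by_name = outer.find("p")

# attrs: a dictionary of attribute names and the values they must have
by_attrs = outer.find("p", attrs={"id": "top"})

# recursive=False: look at direct children only, not deeper descendants
direct_only = outer.find("p", recursive=False)

# **kwargs: an attribute passed as a keyword argument
by_kwarg = outer.find(id="deep")

print(by_name["id"], by_attrs["id"], direct_only["id"], by_kwarg["id"])
```

Note that by_name and direct_only differ: the nested <p> comes first in document order, but it is not a direct child of the div.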

Search tags

Besides searching for the <ul> tag as in the code above, we can also search for the <li> tag; the result is again the first match in the document.

    tag_li = soup.find('li')  # equivalent: tag_li = soup.find(name='li')
    print(type(tag_li))
    print(tag_li.p.string)
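One detail worth keeping in mind is that find() returns None when nothing matches, so chained access such as .p.string can fail on a missing tag. A minimal sketch, using an invented one-line document and html.parser:

```python
from bs4 import BeautifulSoup

# Invented sample markup for illustration.
soup = BeautifulSoup("<ul><li><p>plants</p></li></ul>", "html.parser")

tag_li = soup.find("li")      # a Tag object
missing = soup.find("table")  # no <table> in the document, so None

# Guard against None before navigating further into the result.
label = tag_li.p.string if tag_li is not None else None
missing_label = missing.p.string if missing is not None else None

print(type(tag_li).__name__, label, missing_label)
```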

Search text

If we only want to search for text content, we can just pass in the text parameter:


    search_for_text = soup.find(text='plants')
    print(type(search_for_text))
    # <class 'bs4.element.NavigableString'>

The returned result is a NavigableString object rather than a tag.
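The sketch below shows what can be done with the returned string. It uses the string argument, which is the name Beautiful Soup 4.4.0 and later use for the text parameter; the HTML is invented and parsed with html.parser.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Invented sample markup for illustration.
soup = BeautifulSoup(
    "<ul><li><p>plants</p></li><li><p>algae</p></li></ul>", "html.parser"
)

# 'string' is the newer spelling of the 'text' parameter.
match = soup.find(string="plants")
is_navigable = isinstance(match, NavigableString)

# The string knows which tag contains it.
parent_name = match.parent.name

# A plain string filter must equal the whole text, so a fragment finds nothing.
partial = soup.find(string="plant")

print(is_navigable, parent_name, partial)
```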

Search by regular expression

Consider the following HTML text:

    <p>The below HTML has the information that has email ids.</p>
    abc@example.com
    <p>xyz@example.com</p>
    <span>foo@example.com</span>

You can see that the abc@example.com address is not enclosed in any tag, so it cannot be found by searching for a tag. In this case, we can match the text with a regular expression.


    import re
    from bs4 import BeautifulSoup

    email_id_example = """<p>The below HTML has the information that has email ids.</p>
    abc@example.com
    <p>xyz@example.com</p>
    <span>foo@example.com</span>
    """
    email_soup = BeautifulSoup(email_id_example, 'lxml')
    print(email_soup)

    # pattern = r"\w+@\w+\.\w+"
    emailid_regexp = re.compile(r"\w+@\w+\.\w+")
    first_email_id = email_soup.find(text=emailid_regexp)
    print(first_email_id)

When several strings match the regular expression, find() returns only the first one.
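To collect every address rather than only the first, the same pattern can be passed to find_all(). This is a sketch under a few assumptions: html.parser instead of lxml, the string argument (the newer name for text), and a simplified email pattern that is not a complete validator.

```python
import re
from bs4 import BeautifulSoup

html = """<p>The HTML below has email ids.</p>
abc@example.com
<p>xyz@example.com</p>
<span>foo@example.com</span>"""
soup = BeautifulSoup(html, "html.parser")

# Simplified pattern, good enough for this sample text only.
email_re = re.compile(r"\w+@\w+\.\w+")

# find() stops at the first matching string, find_all() returns all of them.
first = soup.find(string=email_re)
all_nodes = soup.find_all(string=email_re)

# A matched string may carry surrounding whitespace, so extract the
# address itself with the same pattern.
emails = [email_re.search(node).group() for node in all_nodes]
print(emails)
```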

Search by tag attribute value

You can search by the value of a tag's attribute:

    search_for_attribute = soup.find(id='primaryconsumers')
    print(search_for_attribute.li.p.string)

Searching by attribute value works for most attributes, for example id, style, and title.

However, two kinds of attributes are exceptions:

    • Custom attributes

    • The class attribute

For these, we cannot pass the attribute name directly as a keyword argument; instead, we pass a dictionary through the attrs parameter of find().

Search by custom attribute

In HTML5, you can add custom attributes to tags, such as a data-custom attribute.

As the following code shows, if we search for it the same way we searched by id, we get an error, because a Python keyword argument name cannot contain the - character.

    customattr = """<p data-custom="custom">custom attribute example</p>"""
    customsoup = BeautifulSoup(customattr, 'lxml')
    customsoup.find(data-custom="custom")
    # SyntaxError: keyword can't be an expression
    # (the exact wording of the message varies by Python version)

In this case, pass a dictionary through the attrs parameter:

    using_attrs = customsoup.find(attrs={'data-custom': 'custom'})
    print(using_attrs)
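The attrs dictionary can also match on the mere presence of an attribute by using True as the value. A short sketch with invented markup, parsed with html.parser:

```python
from bs4 import BeautifulSoup

# Invented sample markup: two tags carry data-custom, one does not.
html = (
    '<p data-custom="custom">first</p>'
    '<p data-custom="other">second</p>'
    "<p>third</p>"
)
soup = BeautifulSoup(html, "html.parser")

# Exact value match on a hyphenated attribute name.
exact = soup.find(attrs={"data-custom": "custom"})

# True matches any tag that has the attribute, whatever its value.
any_value = soup.find_all(attrs={"data-custom": True})

print(exact.string, [tag.string for tag in any_value])
```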

Search by CSS class

Because class is a reserved keyword in Python, it cannot be passed directly as a keyword argument. One option is to treat it like a custom attribute and pass a dictionary through the attrs parameter.

Alternatively, you can use the class_ keyword argument, which is spelled with a trailing underscore to distinguish it from the keyword and does not cause an error.

    css_class = soup.find(attrs={'class': 'producerlist'})
    css_class2 = soup.find(class_="producerlist")
    print(css_class)
    print(css_class2)
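Since a tag can carry several CSS classes, a class search matches against any one of them. The following sketch illustrates this with invented markup (the "featured" class is hypothetical) and html.parser:

```python
from bs4 import BeautifulSoup

# Invented sample markup: the first <li> has two classes.
html = (
    "<ul>"
    '<li class="producerlist featured">plants</li>'
    '<li class="producerlist">algae</li>'
    "</ul>"
)
soup = BeautifulSoup(html, "html.parser")

# Both <li> tags carry the class producerlist.
by_keyword = soup.find_all(class_="producerlist")

# Only the first <li> also carries the class featured.
by_attrs = soup.find_all(attrs={"class": "featured"})

print(len(by_keyword), len(by_attrs))
```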

Search using a custom function

You can pass a function to find() so that the search follows the criteria the function defines.

The function receives a tag and should return True or False.

    def is_producers(tag):
        return tag.has_attr('id') and tag.get('id') == 'producers'

    tag_producers = soup.find(is_producers)
    print(tag_producers.li.p.string)

The code defines an is_producers function that checks whether the tag has an id attribute and whether its value equals producers. It returns True if both conditions hold and False otherwise.
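A function filter is most useful for conditions that keyword arguments cannot express, such as the absence of an attribute. The sketch below uses an invented snippet and a hypothetical helper name, with html.parser:

```python
from bs4 import BeautifulSoup

# Invented sample markup for illustration.
html = (
    '<ul id="producers" class="list">'
    '<li class="producerlist">plants</li>'
    "<li>algae</li>"
    "</ul>"
)
soup = BeautifulSoup(html, "html.parser")


def has_class_but_no_id(tag):
    """Return True for tags that define class but not id."""
    return tag.has_attr("class") and not tag.has_attr("id")


# The <ul> has an id and the second <li> has no class, so only the
# first <li> passes the filter.
matches = soup.find_all(has_class_but_no_id)
print([tag.string for tag in matches])
```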

Combining search methods

Beautiful Soup provides a variety of search filters, and they can be combined in a single call to make the search more precise.

    combine_html = """
    <p class="identical">
      Example of p tag with class identical
    </p>
    <p class="identical">
      Example of p tag with class identical
    </p>
    """
    combine_soup = BeautifulSoup(combine_html, 'lxml')
    identical_p = combine_soup.find("p", class_="identical")
    print(identical_p)
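As a further illustration, a tag name, a class, and a regular expression on the tag's text can be combined in one call. This is a sketch with invented markup; it assumes html.parser and the string argument (the newer name for text).

```python
import re
from bs4 import BeautifulSoup

# Invented sample markup: same class on different tags and texts.
html = (
    '<p class="identical">first paragraph</p>'
    '<div class="identical">a div</div>'
    '<p class="identical">second paragraph</p>'
)
soup = BeautifulSoup(html, "html.parser")

# All three filters must hold: tag name, CSS class, and text pattern.
match = soup.find("p", class_="identical", string=re.compile("second"))

# With the class filter alone, the <div> is matched as well.
same_class = soup.find_all(class_="identical")

print(match.string, len(same_class))
```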

Search using the find_all() method

The find() method returns the first match from the search results, while find_all() returns all matching items.

The filters used with find() can also be used with find_all(). In fact, they can be used with any of the search methods, for example find_parents() and find_next_siblings().

    # Search for all tags whose class attribute equals tertiaryconsumerlist.
    all_tertiaryconsumers = soup.find_all(class_='tertiaryconsumerlist')
    print(type(all_tertiaryconsumers))
    for tertiaryconsumers in all_tertiaryconsumers:
        print(tertiaryconsumers.p.string)

The signature of the find_all() method is:

    find_all(name, attrs, recursive, text, limit, **kwargs)

Its parameters are the same as those of find(), with one addition: limit. The limit parameter restricts the number of results returned; find() behaves like find_all() with limit set to 1.
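The effect of limit can be seen in a short sketch. The list items are generated for illustration and parsed with html.parser:

```python
from bs4 import BeautifulSoup

# Invented sample markup: five list items.
html = "<ul>" + "".join(f"<li>item {i}</li>" for i in range(5)) + "</ul>"
soup = BeautifulSoup(html, "html.parser")

everything = soup.find_all("li")
first_two = soup.find_all("li", limit=2)  # stops after two matches
only_one = soup.find_all("li", limit=1)   # a list with a single element

# find() returns the element itself rather than a one-element list.
single = soup.find("li")

print(len(everything), [li.string for li in first_two], single.string)
```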

We can also pass a list of strings as a filter to search for several tag names, attribute values, custom attribute values, or CSS classes at once.

    # Search for all p and li tags
    p_li_tags = soup.find_all(["p", "li"])
    print(p_li_tags)
    print()

    # Search for all tags whose class is producerlist or primaryconsumerlist
    all_css_class = soup.find_all(class_=["producerlist", "primaryconsumerlist"])
    print(all_css_class)
    print()

Search related tags

So far we have used find() and find_all() to search for specific tags. We can also search for other tags that are related to them.

Search parent tags

You can use the find_parent() or find_parents() method to search for the parent tags of a tag.

find_parent() returns the first match, while find_parents() returns all matches, which mirrors the relationship between find() and find_all().

    # Search parent tags
    primaryconsumers = soup.find_all(class_='primaryconsumerlist')
    print(len(primaryconsumers))

    # Take the first matching tag
    primaryconsumer = primaryconsumers[0]

    # Search all ul parent tags
    parent_ul = primaryconsumer.find_parents('ul')
    print(len(parent_ul))

    # The result contains the full contents of each parent tag
    print(parent_ul)
    print()

    # Search for the first matching parent tag; there are two ways to do it
    immediateprimary_consumer_parent = primaryconsumer.find_parent()
    # immediateprimary_consumer_parent = primaryconsumer.find_parent('ul')
    print(immediateprimary_consumer_parent)
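The difference between the two parent searches is easier to see on a small document. The markup below is invented (the "pyramid" id is hypothetical) and parsed with html.parser:

```python
from bs4 import BeautifulSoup

# Invented sample markup: <li> inside <ul> inside <div>.
html = (
    '<div id="pyramid"><ul id="primaryconsumers">'
    '<li class="primaryconsumerlist"><p>deer</p></li>'
    "</ul></div>"
)
soup = BeautifulSoup(html, "html.parser")
item = soup.find("li", class_="primaryconsumerlist")

# Without a filter, find_parent() returns the nearest enclosing tag.
immediate = item.find_parent()

# With a filter, it walks upward until a matching ancestor is found.
nearest_div = item.find_parent("div")

# find_parents() collects every matching ancestor, nearest first.
ancestor_names = [tag.name for tag in item.find_parents(["ul", "div"])]

print(immediate.name, nearest_div["id"], ancestor_names)
```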

Search for sibling tags

Beautiful Soup also provides the ability to search for sibling tags.

The find_next_siblings() method returns all the following tags at the same level, and find_next_sibling() returns only the first following tag at the same level.

    producers = soup.find(id='producers')
    next_siblings = producers.find_next_siblings()
    print(next_siblings)

Similarly, you can use the find_previous_siblings() and find_previous_sibling() methods to search for the preceding sibling tags.
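The four sibling methods can be compared side by side in a short sketch. The markup is invented (the secondaryconsumers id is an assumption) and parsed with html.parser:

```python
from bs4 import BeautifulSoup

# Invented sample markup: three sibling lists inside one div.
html = (
    "<div>"
    '<ul id="producers"></ul>'
    '<ul id="primaryconsumers"></ul>'
    '<ul id="secondaryconsumers"></ul>'
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

producers = soup.find(id="producers")
last = soup.find(id="secondaryconsumers")

# Forward direction: one sibling versus all of them.
next_one = producers.find_next_sibling("ul")
next_ids = [tag["id"] for tag in producers.find_next_siblings("ul")]

# Backward direction: results are ordered from nearest to farthest.
previous_one = last.find_previous_sibling("ul")
previous_ids = [tag["id"] for tag in last.find_previous_siblings("ul")]

print(next_one["id"], next_ids, previous_one["id"], previous_ids)
```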

Search for the next tag

The find_next() method searches for the first matching tag that appears after the current one in the document, and find_all_next() returns all matching tags that follow it.

    # Search for the li tags that follow the first p tag
    first_p = soup.p
    all_li_tags = first_p.find_all_next("li")
    print(all_li_tags)
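Unlike the sibling methods, these searches follow document order, so they can find tags at any nesting level and outside the current parent. A sketch with invented markup and html.parser:

```python
from bs4 import BeautifulSoup

# Invented sample markup: the <li> tags are not siblings of the <p>.
html = (
    "<div><p>intro</p><ul><li>plants</li><li>algae</li></ul></div>"
    "<ul><li>deer</li></ul>"
)
soup = BeautifulSoup(html, "html.parser")
first_p = soup.p

# The first <li> after the <p> in document order.
next_li = first_p.find_next("li")

# Every <li> after the <p>, including the one outside the <div>.
all_li = [li.string for li in first_p.find_all_next("li")]

# For comparison: no <li> is a sibling of the <p>.
sibling_li = first_p.find_next_sibling("li")

print(next_li.string, all_li, sibling_li)
```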

Search for the previous tag

Similar to searching for the next tag, use the find_previous() and find_all_previous() methods to search for the tags that come before the current one in the document.
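A minimal sketch of the backward search, again with invented markup and html.parser. Results are returned from nearest to farthest:

```python
from bs4 import BeautifulSoup

# Invented sample markup for illustration.
html = (
    "<div><p>intro</p><ul><li>plants</li><li>algae</li></ul></div>"
    "<ul><li>deer</li></ul>"
)
soup = BeautifulSoup(html, "html.parser")
last_li = soup.find_all("li")[-1]

# The closest <li> before the last one, walking backward through the document.
previous_li = last_li.find_previous("li")

# All earlier <li> tags, nearest first.
earlier = [li.string for li in last_li.find_all_previous("li")]

# The search is not limited to siblings: the <p> lives in another branch.
previous_p = last_li.find_previous("p")

print(previous_li.string, earlier, previous_p.string)
```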
