Searching with Python's Beautiful Soup module

Source: Internet
Author: User

Preface

We will use the search functions of the Beautiful Soup module to search by tag name, tag attribute, document text, and regular expression.

Search Method

Beautiful Soup's built-in search methods are as follows:

  • find()
  • find_all()
  • find_parent()
  • find_parents()
  • find_next_sibling()
  • find_next_siblings()
  • find_previous_sibling()
  • find_previous_siblings()
  • find_previous()
  • find_all_previous()
  • find_next()
  • find_all_next()

Search using the find() method

First, create an HTML file for testing; the examples below read it from search.html.

We can use the find() method to obtain the first <ul> tag, because find() returns only the first match by default. From that tag we then take the first <li> tag and, inside it, the first <div> tag. Printing the content of that <div> verifies that the first match was returned.

    from bs4 import BeautifulSoup

    with open('search.html', 'r') as html_file:
        soup = BeautifulSoup(html_file, 'lxml')

    # find() returns the first <ul> tag in the document.
    first_ul_entries = soup.find('ul')
    print(first_ul_entries.li.div.string)
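Because the contents of search.html are not shown, the following is a minimal self-contained sketch of the same idea. The inline markup is a hypothetical stand-in for that file, and Python's built-in html.parser is assumed in place of lxml so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for search.html; the original file is not shown.
html = """
<ul id="producers">
  <li class="producerlist"><div class="name">plants</div></li>
  <li class="producerlist"><div class="name">algae</div></li>
</ul>
<ul id="primaryconsumers">
  <li class="primaryconsumerlist"><div class="name">deer</div></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

first_ul = soup.find("ul")            # first <ul> in document order
first_name = first_ul.li.div.string   # first <li>, then its first <div>
missing = soup.find("table")          # no <table> exists in this markup

print(first_name)
print(missing)
```

When nothing matches, find() returns None, so it is worth checking the result before chaining attribute access onto it.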

The signature of the find() method is as follows:

    find(name, attrs, recursive, text, **kwargs)

As the signature shows, find() accepts five parameters: name, attrs, recursive, text, and **kwargs. The name, attrs, and text parameters act as filters that narrow down the matching results. In Beautiful Soup 4.4.0 and later, the text parameter is also available under the name string.
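The recursive parameter is not demonstrated elsewhere in this article. As a rough sketch, assuming a small hypothetical snippet of markup and the built-in html.parser, passing recursive=False restricts the search to the direct children of a tag instead of all of its descendants:

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration.
html = '<ul id="producers"><li><div>plants</div></li></ul>'
soup = BeautifulSoup(html, "html.parser")
ul = soup.find("ul")

nested_div = ul.find("div")                    # looks through all descendants
direct_div = ul.find("div", recursive=False)   # looks at direct children only
direct_li = ul.find("li", recursive=False)     # <li> is a direct child of <ul>

print(nested_div)
print(direct_div)
print(direct_li)
```

Here the <div> is a grandchild of the <ul>, so the non-recursive search does not find it, while the <li> is found either way.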

Search tags

In addition to the <ul> tag in the code above, we can also search for the <li> tag. Again, only the first match is returned.

    tag_li = soup.find('li')
    # Equivalent: tag_li = soup.find(name='li')
    print(type(tag_li))
    print(tag_li.div.string)

Search Text

If we only want to search by text content, we can pass only the text parameter:

    search_for_text = soup.find(text='plants')
    print(type(search_for_text))
    # <class 'bs4.element.NavigableString'>

The returned result is a NavigableString object rather than a tag.
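A self-contained sketch of a text search follows. It assumes hypothetical inline markup, the built-in html.parser, and the string keyword, which is the newer name for the text parameter (Beautiful Soup 4.4.0 or later is assumed).

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical markup used only for illustration.
html = (
    '<ul><li><div class="name">plants</div></li>'
    '<li><div class="name">algae</div></li></ul>'
)
soup = BeautifulSoup(html, "html.parser")

match = soup.find(string="plants")    # exact match on a whole text node
partial = soup.find(string="plant")   # a plain string must match exactly

print(type(match))
print(match.parent.name)              # the tag that contains the text
print(partial)
```

Because the result is the text node itself, its .parent attribute is the way back to the enclosing tag.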

Search by Regular Expression

Consider the following HTML text:

    <div>The below HTML has the information that has email ids.</div>
    abc@example.com
    <div>xyz@example.com</div>
    <span>foo@example.com</span>

The abc@example.com address is not enclosed in any tag, so it cannot be found by tag name. In this case, we can match it with a regular expression.

    import re
    from bs4 import BeautifulSoup

    email_id_example = """
    <div>The below HTML has the information that has email ids.</div>
    abc@example.com
    <div>xyz@example.com</div>
    <span>foo@example.com</span>
    """
    email_soup = BeautifulSoup(email_id_example, 'lxml')
    print(email_soup)

    # Raw string, so the backslashes reach the regex engine unchanged.
    emailid_regexp = re.compile(r"\w+@\w+\.\w+")
    first_email_id = email_soup.find(text=emailid_regexp)
    print(first_email_id)

When a regular expression matches more than one string, find() returns only the first match.
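To collect every match rather than only the first, the same pattern can be handed to find_all(). The sketch below assumes the built-in html.parser and the string keyword (the newer name for text). Each result is a whole text node, so the pattern is applied a second time to extract just the address.

```python
import re
from bs4 import BeautifulSoup

html = (
    "<div>The below HTML has the information that has email ids.</div>"
    " abc@example.com "
    "<div>xyz@example.com</div>"
    "<span>foo@example.com</span>"
)
soup = BeautifulSoup(html, "html.parser")
email_pattern = re.compile(r"\w+@\w+\.\w+")

# Each result is a whole text node that contains a match somewhere.
matching_strings = soup.find_all(string=email_pattern)

# Pull the address itself out of each node (drops surrounding whitespace).
emails = [email_pattern.search(text).group() for text in matching_strings]
print(emails)
```

The address that sits outside any tag is returned alongside the two that are wrapped in <div> and <span>.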

Search by tag attribute value

You can search by tag attribute values:

    search_for_attribute = soup.find(id='primaryconsumers')
    print(search_for_attribute.li.div.string)

Searching by attribute value as a keyword argument works for most attributes, such as id, style, and title.
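As a small self-contained illustration, the sketch below searches by id and by title. The markup and the title values are invented for this example, and the built-in html.parser is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the title values are invented for illustration.
html = (
    '<ul id="producers" title="first level"><li>plants</li></ul>'
    '<ul id="primaryconsumers" title="second level"><li>deer</li></ul>'
)
soup = BeautifulSoup(html, "html.parser")

by_id = soup.find(id="primaryconsumers")
by_title = soup.find(title="first level")

print(by_id.li.string)
print(by_title["id"])
```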

Two cases need different handling:

  • Custom attributes
  • Class attributes

For these, the attribute name cannot be used directly as a keyword argument; the attrs parameter of find() handles both cases.

Search by custom attributes

You can add custom attributes to tags in HTML5, for example a data-custom attribute on a <p> tag.

As the following code shows, searching for it the same way as for id raises an error, because a Python keyword argument name cannot contain the - character.

    customattr = """
    <p data-custom="custom">custom attribute example</p>
    """
    customsoup = BeautifulSoup(customattr, 'lxml')
    customsoup.find(data-custom="custom")
    # SyntaxError: keyword can't be an expression

In this case, pass a dictionary through the attrs parameter instead:

    using_attrs = customsoup.find(attrs={'data-custom': 'custom'})
    print(using_attrs)

Search Based on classes in CSS

Because class is a reserved keyword in Python, the CSS class attribute cannot be passed as a keyword argument named class. One option is to search for it like a custom attribute, passing a dictionary through the attrs parameter.

Alternatively, use the class_ keyword argument; the trailing underscore distinguishes it from the class keyword, so no error is raised.

    css_class = soup.find(attrs={'class': 'producerlist'})
    css_class2 = soup.find(class_='producerlist')
    print(css_class)
    print(css_class2)
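The following sketch, which assumes hypothetical markup and the built-in html.parser, shows that both forms return the same tag. It also shows one detail worth knowing: when a tag carries several classes, a search for any single one of them matches.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the "featured" class is invented for illustration.
html = (
    '<ul><li class="producerlist featured">plants</li>'
    '<li class="primaryconsumerlist">deer</li></ul>'
)
soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find(class_="producerlist")
by_attrs = soup.find(attrs={"class": "producerlist"})
by_second_class = soup.find(class_="featured")

print(by_keyword is by_attrs)
print(by_second_class["class"])   # class values are stored as a list
```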

Search with a custom function

You can pass a function to the find() method, so that the search is performed according to the conditions defined by the function.

The function should return True or False.

    def is_producers(tag):
        return tag.has_attr('id') and tag.get('id') == 'producers'

    tag_producers = soup.find(is_producers)
    print(tag_producers.li.div.string)

The code defines an is_producers function, which checks whether the tag has an id attribute and whether its value equals producers. If the tag meets both conditions, the function returns True; otherwise it returns False.
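A function filter is most useful for conditions that keyword arguments cannot express directly. As a hedged sketch, the hypothetical helper has_id_but_no_class below (not part of the original article) selects a tag that has an id attribute but no class attribute; the markup is invented and the built-in html.parser is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration.
html = (
    '<ul id="producers" class="level"><li>plants</li></ul>'
    '<ul id="primaryconsumers"><li>deer</li></ul>'
)
soup = BeautifulSoup(html, "html.parser")


def has_id_but_no_class(tag):
    """Hypothetical filter: True for tags with an id and without a class."""
    return tag.has_attr("id") and not tag.has_attr("class")


match = soup.find(has_id_but_no_class)
print(match["id"])
```

The first <ul> is skipped because it has a class attribute, and the <li> tags are skipped because they have no id.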

Combined use of various search methods

Beautiful Soup provides various search filters, and they can be combined to make a search more precise. The following example matches on the tag name and the CSS class at the same time, so only the <div> is returned even though the <p> has the same class.

    combine_html = """
    <p class="identical">
      Example of p tag with class identical
    </p>
    <div class="identical">
      Example of div tag with class identical
    </div>
    """
    combine_soup = BeautifulSoup(combine_html, 'lxml')
    identical_div = combine_soup.find("div", class_="identical")
    print(identical_div)

Use find_all() to search

The find() method returns only the first match from the search results, whereas the find_all() method returns all matching items.

The filters used with find() can also be used with find_all(). In fact, they can be used with any of the search methods, for example find_parents() and find_next_siblings().

    # Search for all tags whose class attribute is tertiaryconsumerlist.
    all_tertiaryconsumers = soup.find_all(class_='tertiaryconsumerlist')
    print(type(all_tertiaryconsumers))
    for tertiaryconsumers in all_tertiaryconsumers:
        print(tertiaryconsumers.div.string)

The signature of the find_all() method is as follows:

    find_all(name, attrs, recursive, text, limit, **kwargs)

Its parameters are the same as those of find(), plus an additional limit parameter that caps the number of results returned. The find() method behaves like find_all() with a limit of 1.
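The effect of limit can be sketched as follows, assuming hypothetical markup and the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration.
html = "<ul><li>plants</li><li>algae</li><li>deer</li></ul>"
soup = BeautifulSoup(html, "html.parser")

all_items = soup.find_all("li")
first_two = soup.find_all("li", limit=2)   # stop after two matches
only_first = soup.find_all("li", limit=1)  # a list holding one tag

print(len(all_items))
print([li.string for li in first_two])
print(only_first[0] is soup.find("li"))
```

Note that find_all() with limit=1 still returns a list, whereas find() returns the tag itself.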

We can also pass a list of strings to match any one of several tag names, tag attribute values, custom attribute values, or CSS classes.

    # Search for all div and li tags.
    div_li_tags = soup.find_all(["div", "li"])
    print(div_li_tags)
    print()

    # Search for all tags whose class is producerlist or primaryconsumerlist.
    all_css_class = soup.find_all(class_=["producerlist", "primaryconsumerlist"])
    print(all_css_class)
    print()
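A self-contained version of the list filter, assuming hypothetical markup and the built-in html.parser, makes the "any one of" behaviour easy to check:

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration.
html = (
    '<div class="intro">food chain</div>'
    '<ul><li class="producerlist">plants</li>'
    '<li class="primaryconsumerlist">deer</li>'
    '<li class="tertiaryconsumerlist">lion</li></ul>'
)
soup = BeautifulSoup(html, "html.parser")

# A tag matches if its name is any one of the listed names.
div_and_li = soup.find_all(["div", "li"])

# A tag matches if its class is any one of the listed classes.
two_classes = soup.find_all(class_=["producerlist", "primaryconsumerlist"])

print([tag.name for tag in div_and_li])
print([tag.string for tag in two_classes])
```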

Search related tags

Generally, we use find() and find_all() to search for tags. Starting from a tag found this way, we can also search for other tags of interest that are related to it.

Search for parent tags

Use the find_parent() or find_parents() method to search for the parent tags of a tag.

The find_parent() method returns the first match, while find_parents() returns all matches, in the same way that find() and find_all() do.

    # Search for tags with the primaryconsumerlist class.
    primaryconsumers = soup.find_all(class_='primaryconsumerlist')
    print(len(primaryconsumers))

    # Take the first one.
    primaryconsumer = primaryconsumers[0]

    # Search for all <ul> parent tags.
    parent_ul = primaryconsumer.find_parents('ul')
    print(len(parent_ul))
    # The result contains the full content of each parent tag.
    print(parent_ul)
    print()

    # Two ways to get the first parent tag:
    immediateprimary_consumer_parent = primaryconsumer.find_parent()
    # immediateprimary_consumer_parent = primaryconsumer.find_parent('ul')
    print(immediateprimary_consumer_parent)
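The difference between the two parent methods is easiest to see with nested lists. The sketch below assumes hypothetical markup with an inner <ul> inside an outer one, and the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical nested markup used only for illustration.
html = (
    '<div id="pyramid"><ul id="outer"><li>'
    '<ul id="inner"><li class="primaryconsumerlist">deer</li></ul>'
    '</li></ul></div>'
)
soup = BeautifulSoup(html, "html.parser")
consumer = soup.find(class_="primaryconsumerlist")

closest_ul = consumer.find_parent("ul")   # nearest enclosing <ul>
all_uls = consumer.find_parents("ul")     # every enclosing <ul>, nearest first
immediate = consumer.find_parent()        # no filter: the direct parent

print(closest_ul["id"])
print([ul["id"] for ul in all_uls])
print(immediate is closest_ul)
```

The search walks upward from the starting tag, so the results of find_parents() are ordered from the innermost ancestor outward.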

Search for sibling tags

Beautiful Soup also provides the ability to search for sibling tags, that is, tags at the same level.

The find_next_siblings() method returns all the following tags at the same level, while find_next_sibling() returns only the next one.

    producers = soup.find(id='producers')
    next_siblings = producers.find_next_siblings()
    print(next_siblings)

Likewise, the find_previous_siblings() and find_previous_sibling() methods search for the preceding tags at the same level.
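A short sketch of sibling searches in both directions follows. The three adjacent lists are hypothetical markup (the secondaryconsumers id is invented for the example), and the built-in html.parser is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three <ul> tags at the same level.
html = (
    '<ul id="producers"><li>plants</li></ul>'
    '<ul id="primaryconsumers"><li>deer</li></ul>'
    '<ul id="secondaryconsumers"><li>fox</li></ul>'
)
soup = BeautifulSoup(html, "html.parser")

middle = soup.find(id="primaryconsumers")
next_one = middle.find_next_sibling()           # the sibling right after
previous_one = middle.find_previous_sibling()   # the sibling right before

first = soup.find(id="producers")
following = first.find_next_siblings("ul")      # all later <ul> siblings

print(next_one["id"], previous_one["id"])
print([ul["id"] for ul in following])
```

The <li> tags never appear in these results because they are children of the lists, not siblings of them.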

Search for the next tag

The find_next() method returns the first matching tag that appears after the current tag in the document, and find_all_next() returns all matching tags that appear after it.

    # Search for all <li> tags that come after the first <div>.
    first_div = soup.div
    all_li_tags = first_div.find_all_next("li")
    print(all_li_tags)
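These methods follow document order rather than nesting, which the sketch below tries to make visible. It assumes hypothetical markup in which the <li> tags come after the <div> but are not inside it, and it uses the built-in html.parser.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <li> tags follow the <div>, they are not inside it.
html = (
    '<div id="intro">food chain</div>'
    '<ul><li>plants</li><li>deer</li></ul>'
    '<p>end</p>'
)
soup = BeautifulSoup(html, "html.parser")
first_div = soup.find("div")

next_li = first_div.find_next("li")        # first <li> after the <div>
later_li = first_div.find_all_next("li")   # every <li> after the <div>

print(next_li.string)
print([li.string for li in later_li])
```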

Search for the previous tag

Similar to searching for the next tag, use the find_previous() and find_all_previous() methods to search for tags that appear before the current tag.
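The backward search can be sketched the same way, again assuming hypothetical markup and the built-in html.parser. The results come back nearest first, so they are in reverse document order.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration.
html = (
    '<div id="intro">food chain</div>'
    '<ul><li>plants</li><li>deer</li></ul>'
    '<p>end</p>'
)
soup = BeautifulSoup(html, "html.parser")
last_p = soup.find("p")

previous_li = last_p.find_previous("li")      # closest <li> before the <p>
earlier_li = last_p.find_all_previous("li")   # every earlier <li>, nearest first

print(previous_li.string)
print([li.string for li in earlier_li])
```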

Summary

That is all for this article. I hope it helps you learn or use Python. If you have any questions, please leave a message. Thank you for your support.
