Detailed description: How Python uses the BeautifulSoup module to search for content

Source: Internet
Author: User
This article introduces the search methods of the Beautiful Soup module in Python. Passing different types of filter parameters narrows a search down to the desired results. The walkthrough is detailed and should be a useful reference. Let's take a look.

Preface

We will use the search function of the Beautiful Soup module to search by tag name, tag attribute, document text, and regular expression.

Search methods

Beautiful Soup's built-in search methods are as follows:

  • find()

  • find_all()

  • find_parent()

  • find_parents()

  • find_next_sibling()

  • find_next_siblings()

  • find_previous_sibling()

  • find_previous_siblings()

  • find_previous()

  • find_all_previous()

  • find_next()

  • find_all_next()

Search using the find() method

First, create an HTML file named search.html for testing. Rendered in a browser, it shows the following entries:


  • plants

    100000

  • algae

    100000

  • deer

    1000

  • rabbit

    2000

  • fox

    100

  • bear

    100

  • lion

    80

  • tiger

    50
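
The markup of search.html is not reproduced above. A minimal sketch of a file that is consistent with the rest of this article might look like the string below. The ids producers and primaryconsumers and the classes producerlist, primaryconsumerlist, and tertiaryconsumerlist appear in the later examples. Everything else is an assumption made for illustration: the ids secondaryconsumers and tertiaryconsumers, the class secondaryconsumerlist, the name and number classes, and the use of Python's built-in html.parser in place of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical reconstruction of search.html: one <ul> per group,
# one <li> per entry, and two <p> tags (name, number) per entry.
SEARCH_HTML = """
<html>
<body>
<ul id="producers">
  <li class="producerlist"><p class="name">plants</p><p class="number">100000</p></li>
  <li class="producerlist"><p class="name">algae</p><p class="number">100000</p></li>
</ul>
<ul id="primaryconsumers">
  <li class="primaryconsumerlist"><p class="name">deer</p><p class="number">1000</p></li>
  <li class="primaryconsumerlist"><p class="name">rabbit</p><p class="number">2000</p></li>
</ul>
<ul id="secondaryconsumers">
  <li class="secondaryconsumerlist"><p class="name">fox</p><p class="number">100</p></li>
  <li class="secondaryconsumerlist"><p class="name">bear</p><p class="number">100</p></li>
</ul>
<ul id="tertiaryconsumers">
  <li class="tertiaryconsumerlist"><p class="name">lion</p><p class="number">80</p></li>
  <li class="tertiaryconsumerlist"><p class="name">tiger</p><p class="number">50</p></li>
</ul>
</body>
</html>
"""

# The string is parsed directly here; saving it as search.html and
# passing the open file to BeautifulSoup should behave the same way.
soup = BeautifulSoup(SEARCH_HTML, "html.parser")

# Collect the text of every <p class="name"> to check the structure.
names = [p.string for p in soup.find_all("p", class_="name")]
print(names)
```

With this layout, expressions used later in the article, such as soup.find('ul').li.p.string, have a tag to resolve against at every step.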

We can use the find() method to obtain the <ul> tag; by default, it returns the first one that appears. From it we then retrieve the <li> tag and, inside that, the <p> tag, each time getting the first one by default. Printing the content of the <p> tag verifies that we obtained the first tag.

    from bs4 import BeautifulSoup

    with open('search.html', 'r') as filename:
        soup = BeautifulSoup(filename, 'lxml')

    first_ul_entries = soup.find('ul')
    print(first_ul_entries.li.p.string)

The signature of the find() method is as follows:

    find(name, attrs, recursive, text, **kwargs)

As the signature shows, the find() method accepts five parameters: name, attrs, recursive, text, and **kwargs. The name, attrs, and text parameters act as filters in the find() method and make the matching results more precise.
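
The recursive parameter is not demonstrated in this article. As a minimal sketch, assuming a small hypothetical document and Python's built-in html.parser, setting recursive=False restricts the search to the direct children of the tag that find() is called on:

```python
from bs4 import BeautifulSoup

# Hypothetical document: the <p> is a descendant of <div>, but not a child.
html = '<div id="outer"><ul><li><p>plants</p></li></ul></div>'
soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div")

# Default (recursive=True): all descendants of <div> are examined.
deep_match = outer.find("p")

# recursive=False: only direct children of <div> are examined,
# so the nested <p> is not found and None is returned.
shallow_match = outer.find("p", recursive=False)

print(deep_match)
print(shallow_match)
```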

Search tags

Besides the <ul> tag searched for in the code above, you can also search for the <li> tag. The returned result is again the first match.

    tag_li = soup.find('li')
    # Equivalent: tag_li = soup.find(name="li")
    print(type(tag_li))
    print(tag_li.p.string)

Search text

To search by text content only, pass just the text parameter:

    search_for_text = soup.find(text='plants')
    print(type(search_for_text))

The returned result is a NavigableString object rather than a Tag.
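
Because a text search returns a string rather than a tag, a common follow-up is to step up to the enclosing tag through .parent. A minimal sketch, assuming a one-entry document and the string argument, which is the name that newer Beautiful Soup releases (4.4.0 and later) use for text:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html = '<ul><li class="producerlist"><p>plants</p></li></ul>'
soup = BeautifulSoup(html, "html.parser")

# Search by exact text content; the match is the string itself.
found = soup.find(string="plants")
print(isinstance(found, NavigableString))

# .parent is the tag that directly contains the string.
enclosing_tag = found.parent
print(enclosing_tag.name)
```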

Search by regular expression

Consider the following HTML text, shown here as rendered:

    The below HTML has the information that has email ids.

    abc@example.com

    xyz@example.com

    foo@example.com

The address abc@example.com is not enclosed in any tag, so it cannot be found by tag name. In this case, we can match it with a regular expression.

    import re

    # The <p> and <span> tags in this string are assumed;
    # the original markup is not shown.
    email_id_example = """<p>The below HTML has the information that has email ids.</p>
    abc@example.com
    <p>xyz@example.com</p>
    <span>foo@example.com</span>"""

    email_soup = BeautifulSoup(email_id_example, 'lxml')
    print(email_soup)

    # pattern = r"\w+@\w+\.\w+"
    emailid_regexp = re.compile(r"\w+@\w+\.\w+")
    first_email_id = email_soup.find(text=emailid_regexp)
    print(first_email_id)

When several strings match the regular expression, find() returns only the first one.
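
To collect every address instead of only the first, the same compiled pattern can be passed to find_all(). The sketch below assumes markup in which the first address sits outside any tag (the <p> and <span> tags are illustrative guesses) and uses Python's built-in html.parser. It also calls the pattern's search() method on each matched string to pull out the bare address, since a matched string may carry surrounding whitespace:

```python
import re
from bs4 import BeautifulSoup

html = """<p>The below HTML has the information that has email ids.</p>
abc@example.com
<p>xyz@example.com</p>
<span>foo@example.com</span>"""
soup = BeautifulSoup(html, "html.parser")

email_pattern = re.compile(r"\w+@\w+\.\w+")

# find_all(string=pattern) returns every string the pattern matches.
matching_strings = soup.find_all(string=email_pattern)

# Extract just the address from each string, dropping stray whitespace.
emails = [email_pattern.search(s).group() for s in matching_strings]
print(emails)
```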

Search by tag attribute value

You can search by tag attribute values:

    search_for_attribute = soup.find(id='primaryconsumers')
    print(search_for_attribute.li.p.string)

Searching by attribute value works for most attributes, such as id, style, and title.

However, two cases need different handling:

  • Custom attributes

  • Class attributes

In these cases the attribute name cannot be passed directly as a keyword argument; it must be passed to the find() function through the attrs parameter.

Search by custom attributes

HTML5 allows custom attributes on tags, for example a data-custom attribute.

As the following code shows, searching for it in the same way as for id raises an error, because a Python keyword argument name cannot contain the - character.

    # The <p> tag in this string is assumed; the original markup is not shown.
    customattr = """<p data-custom="custom">custom attribute example</p>"""
    customsoup = BeautifulSoup(customattr, 'lxml')
    # Left commented out because this line raises a SyntaxError:
    # customsoup.find(data-custom="custom")

Instead, pass a dictionary through the attrs parameter:

    using_attrs = customsoup.find(attrs={'data-custom': 'custom'})
    print(using_attrs)

Search based on classes in CSS

Because class is a reserved keyword in Python, the CSS class attribute cannot be passed as a keyword argument. It can be searched like a custom attribute, by passing a dictionary through the attrs parameter.

Alternatively, use the class_ keyword argument; the trailing underscore distinguishes it from the class keyword, so no error is raised.

    css_class = soup.find(attrs={'class': 'producerlist'})
    css_class2 = soup.find(class_="producerlist")
    print(css_class)
    print(css_class2)
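
One detail worth knowing is that class is a multi-valued attribute, so searching for a single class name also matches tags that carry additional classes. A minimal sketch, assuming a hypothetical highlight class that is not part of the article's test file:

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li class="producerlist highlight">plants</li>
  <li class="producerlist">algae</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Matches both <li> tags: each one has "producerlist" among its classes.
by_one_class = soup.find_all(class_="producerlist")
print([li.string for li in by_one_class])

# Matches only the first <li>, the one that also has "highlight".
by_other_class = soup.find_all(class_="highlight")
print([li.string for li in by_other_class])
```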

Use custom function search

You can pass a function to the find() method, so that the search follows the conditions the function defines.

The function takes a tag as its argument and should return True or False.

    def is_producers(tag):
        return tag.has_attr('id') and tag.get('id') == 'producers'

    tag_producers = soup.find(is_producers)
    print(tag_producers.li.p.string)

The code defines an is_producers function, which checks whether the tag has an id attribute and whether its value equals producers. If the tag meets the condition, True is returned; otherwise, False is returned.
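
A function filter can also express conditions that keyword arguments cannot, such as the absence of an attribute. The sketch below assumes a small hypothetical document and a helper named has_class_but_no_id (an illustrative name), and passes it to find_all():

```python
from bs4 import BeautifulSoup

html = """
<ul id="producers" class="level">
  <li class="producerlist">plants</li>
  <li>algae</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

def has_class_but_no_id(tag):
    # Called once per tag; return True to keep the tag.
    return tag.has_attr("class") and not tag.has_attr("id")

# <ul> has an id and the second <li> has no class, so only the
# first <li> satisfies the condition.
matches = soup.find_all(has_class_but_no_id)
print([tag.name for tag in matches])
```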

Combined use of various search methods

Beautiful Soup provides various search methods, and they can be combined to make a search more precise.

    # The <p> tags and class attributes in this string are assumed;
    # the original markup is not shown.
    combine_html = """
    <p class="identical">Example of p tag with class identical</p>
    <p class="identical">Example of p tag with class identical</p>
    """
    combine_soup = BeautifulSoup(combine_html, 'lxml')
    identical_p = combine_soup.find("p", class_="identical")
    print(identical_p)

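Combining filters matters most when different tags share an attribute value. A minimal sketch, assuming a hypothetical document in which a <div> and a <p> both carry the class identical:

```python
from bs4 import BeautifulSoup

html = """
<div class="identical">Example of div tag with class identical</div>
<p class="identical">Example of p tag with class identical</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Class alone: the first tag with that class is returned, here the <div>.
by_class = soup.find(class_="identical")
print(by_class.name)

# Tag name plus class: only a <p> with that class can match.
by_name_and_class = soup.find("p", class_="identical")
print(by_name_and_class.name)
```
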
Use find_all() to search

The find() method returns only the first match, whereas the find_all() method returns all matching items.

The filters used with the find() method can also be used with the find_all() method. In fact, they can be used with any search method, for example find_parents() and find_next_siblings().

    # Search for tags whose class attribute is tertiaryconsumerlist
    all_tertiaryconsumers = soup.find_all(class_='tertiaryconsumerlist')
    print(type(all_tertiaryconsumers))
    for tertiaryconsumers in all_tertiaryconsumers:
        print(tertiaryconsumers.p.string)

The signature of the find_all() method is as follows:

    find_all(name, attrs, recursive, text, limit, **kwargs)

Its parameters are the same as those of the find() method, plus one more: limit, which caps the number of results returned. The find() method behaves like find_all() with limit set to 1.
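
As a minimal sketch of the limit parameter, assuming a short hypothetical list and Python's built-in html.parser:

```python
from bs4 import BeautifulSoup

html = "<ul><li>plants</li><li>algae</li><li>deer</li><li>rabbit</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# Stop after the first two matches.
first_two = soup.find_all("li", limit=2)
print([li.string for li in first_two])

# find() returns the tag itself; find_all(limit=1) returns a
# one-element list holding that same tag.
single = soup.find("li")
limited = soup.find_all("li", limit=1)
print(single is limited[0])
```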

We can also pass a list of strings as the filter when searching by tag name, tag attribute value, custom attribute value, or CSS class. A tag matches if it matches any item in the list.

    # Search for all p and li tags
    p_li_tags = soup.find_all(["p", "li"])
    print(p_li_tags)
    print()
    # Search for all tags whose class is producerlist or primaryconsumerlist
    all_css_class = soup.find_all(class_=["producerlist", "primaryconsumerlist"])
    print(all_css_class)
    print()

Search related tags

Usually we use find() and find_all() to search for specific tags. We can also search for other tags of interest that are related to the ones found.

Search for parent tags

The find_parent() and find_parents() methods search for the parent tags of a tag.

The find_parent() method returns the first match, while find_parents() returns all matches, in the same way that find() and find_all() do.

    # Search for parent tags
    primaryconsumers = soup.find_all(class_='primaryconsumerlist')
    print(len(primaryconsumers))
    # Take the first one
    primaryconsumer = primaryconsumers[0]
    # Search for all ul parent tags
    parent_ul = primaryconsumer.find_parents('ul')
    print(len(parent_ul))
    # The result contains the full content of each parent tag
    print(parent_ul)
    print()
    # Search for the first parent tag that appears; either form works
    immediateprimary_consumer_parent = primaryconsumer.find_parent()
    # immediateprimary_consumer_parent = primaryconsumer.find_parent('ul')
    print(immediateprimary_consumer_parent)
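
The difference between the two methods is easier to see when a tag has several ancestors of the same name. A minimal sketch, assuming a hypothetical document with two nested <div> tags whose ids outer and inner are illustrative:

```python
from bs4 import BeautifulSoup

html = """
<div id="outer">
  <div id="inner">
    <ul id="primaryconsumers">
      <li class="primaryconsumerlist"><p>deer</p></li>
    </ul>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
deer = soup.find("p")

# No arguments: the immediate parent of <p>, which is the <li>.
immediate_parent = deer.find_parent()

# With a name: the nearest ancestor that is a <div>.
nearest_div = deer.find_parent("div")

# find_parents(): every <div> ancestor, nearest first.
all_divs = deer.find_parents("div")

print(immediate_parent.name)
print(nearest_div["id"])
print([div["id"] for div in all_divs])
```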

Search for sibling tags

Beautiful Soup also provides the ability to search for sibling tags, that is, tags at the same level.

The find_next_siblings() method returns all the following tags at the same level, while the find_next_sibling() method returns only the next one.

    producers = soup.find(id='producers')
    next_siblings = producers.find_next_siblings()
    print(next_siblings)

You can also use the find_previous_siblings() and find_previous_sibling() methods to search for the preceding tags at the same level.
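
A minimal sketch of the two backward-looking methods, assuming three empty <ul> tags whose ids follow the naming used in this article (secondaryconsumers is an assumed id):

```python
from bs4 import BeautifulSoup

html = (
    '<ul id="producers"></ul>'
    '<ul id="primaryconsumers"></ul>'
    '<ul id="secondaryconsumers"></ul>'
)
soup = BeautifulSoup(html, "html.parser")
last_ul = soup.find(id="secondaryconsumers")

# The closest preceding tag at the same level.
previous_one = last_ul.find_previous_sibling()

# All preceding tags at the same level, closest first.
previous_all = last_ul.find_previous_siblings()

print(previous_one["id"])
print([ul["id"] for ul in previous_all])
```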

Search for the next tag

The find_next() method returns the first match among the tags that come after the current one in the document, and find_all_next() returns all matching tags that come after it.

    # Search for the tags that follow the first <p> tag
    first_p = soup.p
    all_li_tags = first_p.find_all_next("li")
    print(all_li_tags)

Search for the previous tag

Similarly, the find_previous() and find_all_previous() methods search for tags that come before the current one in the document.
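
Unlike the sibling methods, these four methods follow document order and are not limited to tags at the same level. A minimal sketch, assuming a hypothetical document in which a <p> sits between two lists:

```python
from bs4 import BeautifulSoup

html = (
    "<ul><li>plants</li><li>algae</li></ul>"
    "<p>note</p>"
    "<ul><li>deer</li></ul>"
)
soup = BeautifulSoup(html, "html.parser")
note = soup.find("p")

# <li> tags that come after <p> in document order.
next_li = note.find_next("li")
all_next_li = note.find_all_next("li")

# <li> tags that come before <p> in document order, closest first.
previous_li = note.find_previous("li")
all_previous_li = note.find_all_previous("li")

print(next_li.string)
print([li.string for li in all_next_li])
print(previous_li.string)
print([li.string for li in all_previous_li])
```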

The above is a detailed description of how Python uses the Beautiful Soup module to search for content.
