The Beautiful Soup module

Source: Internet
Author: User

I. Introduction

Beautiful Soup is a Python library for extracting data from HTML and XML files. Working with the parser of your choice, it lets you navigate, search, and modify the document tree, which can save hours or even days of work. If you are looking for the Beautiful Soup 3 documentation, note that Beautiful Soup 3 is no longer developed. The official website recommends Beautiful Soup 4 (BS4) for current projects, including porting existing Beautiful Soup 3 code to it.

# Install Beautiful Soup
$ pip install beautifulsoup4

# Install a parser
Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers, one of which is lxml. Depending on your operating system, you can install lxml with one of the following commands:

$ apt-get install python-lxml
$ easy_install lxml
$ pip install lxml

Another alternative is html5lib, a parser implemented in pure Python that parses pages the same way a web browser does. You can install it with one of the following commands:

$ apt-get install python-html5lib
$ easy_install html5lib
$ pip install html5lib

The following summary lists the main parsers with their advantages and disadvantages. The official documentation recommends lxml because it is more efficient. On Python 2 versions earlier than 2.7.3, and on Python 3 versions earlier than 3.2.2, you must install lxml or html5lib, because the HTML parser built into the standard library of those versions is not robust enough.

Parser: Python standard library
Usage: BeautifulSoup(markup, "html.parser")
Advantages:
  • Built into the Python standard library
  • Moderate speed
  • Good fault tolerance
Disadvantages:
  • Poor fault tolerance in versions before Python 2.7.3 (or 3.2.2 in Python 3)

Parser: lxml HTML parser
Usage: BeautifulSoup(markup, "lxml")
Advantages:
  • Fast
  • Good fault tolerance
Disadvantages:
  • Requires a C library to be installed

Parser: lxml XML parser
Usage: BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml")
Advantages:
  • Fast
  • The only parser that supports XML
Disadvantages:
  • Requires a C library to be installed

Parser: html5lib
Usage: BeautifulSoup(markup, "html5lib")
Advantages:
  • Best fault tolerance
  • Parses documents the way a browser does
  • Generates documents in HTML5 format
Disadvantages:
  • Slow
  • External Python dependency (pure Python, so no C extension is needed)
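As a minimal sketch of how the parser argument is passed in practice, the snippet below tries lxml first and falls back to the built-in html.parser when lxml is not installed. The helper name make_soup and the sample markup are assumptions made for illustration; they are not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Hypothetical helper: parse with the preferred parser, else fall back."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


# Sample markup invented for this example.
soup = make_soup("<p>Hello, <b>world</b></p>")
print(soup.b.string)
```

Falling back keeps a script usable on machines without lxml, at the cost of slightly different handling of malformed markup between parsers.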

Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

II. Basic usage
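A minimal sketch of basic usage, assuming html_doc holds the "three sisters" sample from the official Beautiful Soup documentation (an assumption; this page does not show the original string). It parses the document with the built-in html.parser, which needs no extra installation, and prints a formatted copy of the tree.

```python
from bs4 import BeautifulSoup

# Assumed sample document, modeled on the official documentation's example.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

# Build the tree; swap "html.parser" for "lxml" if lxml is installed.
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.title.string)   # text of the <title> tag
print(soup.prettify())     # the whole tree, re-indented
```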
III. Traversing the document tree
Traversing the document tree means selecting elements directly by tag name. This is fast, but if several tags share the same name, only the first one is returned. The operations covered are:

#1. Basic usage
#2. Getting a tag's name
#3. Getting a tag's attributes
#4. Getting a tag's content
#5. Nested selection
#6. Child nodes and descendant nodes
#7. Parent node and ancestor nodes
#8. Sibling nodes
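The operations listed above can be sketched as follows. The markup is a shortened sample invented for this example, and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Assumed sample markup, kept on few lines so the child counts are easy to follow.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body><p class="story" id="p1">Once upon a time <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a></p></body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

first_a = soup.a                        # 1. select by tag name: only the first <a>
tag_name = first_a.name                 # 2. the tag's name
attrs = first_a.attrs                   # 3. the tag's attributes as a dict
text = first_a.string                   # 4. the tag's content
nested = soup.head.title.string         # 5. nested selection
children = list(soup.p.children)        # 6. direct children (text nodes and tags)
descendants = list(soup.p.descendants)  #    all nodes below <p>, at any depth
parent_name = first_a.parent.name       # 7. parent node
ancestors = [t.name for t in first_a.parents]  # all ancestors, nearest first
next_link = first_a.find_next_sibling("a")     # 8. next sibling that is an <a>

print(tag_name, attrs, text, nested, parent_name, ancestors[:3])
```

Note that next_sibling on its own would return the text node between the two links, which is why the sketch uses find_next_sibling("a") to reach the next tag.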
IV. Searching the document tree

1. Five filters

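Beautiful Soup defines many methods for searching the document tree. This section introduces two of them, find() and find_all(); the other methods take similar parameters and are used in much the same way. The five filter types that these methods accept are a string, a regular expression, a list, True, and a function.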
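A minimal sketch of the five filter types, applied to tag names with find_all(). The markup and the helper has_class_but_no_id are assumptions made for this example.

```python
import re

from bs4 import BeautifulSoup

# Assumed sample markup.
html_doc = """<html><head><title>The Dormouse's story</title></head><body>
<p class="title"><b>The Dormouse's story</b></p>
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

# 1. String: exact tag name.
by_string = soup.find_all("b")

# 2. Regular expression: tag names starting with "b" (body and b).
by_regex = soup.find_all(re.compile("^b"))

# 3. List: any tag whose name is in the list.
by_list = soup.find_all(["a", "b"])

# 4. True: every tag in the document, but no text nodes.
by_true = soup.find_all(True)


# 5. Function: receives a tag and returns True to keep it.
def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")


by_func = soup.find_all(has_class_but_no_id)

print([t.name for t in by_regex], [t.name for t in by_func])
```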

2. find_all(name, attrs, recursive, text, **kwargs)

#2.1. name: matches against tag names. The value can be any of the five filters: a string, a regular expression, a list, a function, or True.
print(soup.find_all(name=re.compile('^t')))

#2.2. keyword arguments: key=value, where the value can be a string, a regular expression, a list, or True.
print(soup.find_all(id=re.compile('my')))
print(soup.find_all(href=re.compile('lacie'), id=re.compile(r'\d')))  # note: to filter on class, use class_
print(soup.find_all(id=True))  # find tags that have an id attribute

# Some attribute names cannot be used as keyword arguments, for example the HTML5 data-* attributes:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'lxml')
# data_soup.find_all(data-foo="value")  # SyntaxError: keyword can't be an expression
# Instead, pass a dictionary through the attrs parameter of find_all():
print(data_soup.find_all(attrs={"data-foo": "value"}))
# [<div data-foo="value">foo!</div>]

#2.3. Searching by class name: the keyword is class_ (class_=value), and the value can be any of the five filters.
print(soup.find_all('a', class_='sister'))  # <a> tags that have the class sister
print(soup.find_all('a', class_='sister sss'))  # <a> tags with both classes sister and sss (the string must match the attribute value exactly, in order)
print(soup.find_all(class_=re.compile('^sis')))  # all tags with a class that starts with "sis"

#2.4. attrs
print(soup.find_all('p', attrs={'class': 'story'}))

#2.5. text: the value can be a string, a list, True, or a regular expression. (Beautiful Soup 4.4.0 and later call this argument string; text is the older name.)
print(soup.find_all(text='Elsie'))
print(soup.find_all('a', text='Elsie'))

#2.6. limit: if the document tree is large, searching can be slow. If you do not need all the results, use limit to cap the number returned. It works like the LIMIT keyword in SQL: the search stops once the number of results reaches the limit.
print(soup.find_all('a', limit=2))

#2.7. recursive: when find_all() is called on a tag, Beautiful Soup searches all of that tag's descendants. To search only the tag's direct children, pass recursive=False.
print(soup.html.find_all('a'))
print(soup.html.find_all('a', recursive=False))

'''
Calling a tag is like calling find_all()
find_all() is probably the most commonly used search method in Beautiful Soup, so it has a shorthand: a BeautifulSoup object or a Tag object can be called like a function, with the same result as calling find_all() on that object. The following two lines are equivalent:
soup.find_all("a")
soup("a")
These two lines are also equivalent:
soup.title.find_all(text=True)
soup.title(text=True)
'''
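The parameters described above can be tried on a small, self-contained document. This is a sketch under stated assumptions: the markup is invented for the example, and it uses the string argument (the newer name for text), which assumes Beautiful Soup 4.4.0 or later.

```python
import re

from bs4 import BeautifulSoup

# Assumed sample markup; the second link carries two classes on purpose.
html_doc = """<html><body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister big" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
<div data-foo="value">foo!</div>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

by_id = soup.find_all(id=re.compile(r"\d"))            # keyword filter with a regex
by_class = soup.find_all("a", class_="sister")         # matches any one of a tag's classes
exact = soup.find_all("a", class_="sister big")        # exact attribute string
wrong_order = soup.find_all("a", class_="big sister")  # a different order does not match
by_attrs = soup.find_all(attrs={"data-foo": "value"})  # names that are not valid keywords
by_string = soup.find_all("a", string="Elsie")         # tags whose text is exactly "Elsie"
limited = soup.find_all("a", limit=2)                  # stop after two results
direct = soup.body.find_all("a", recursive=False)      # direct children of <body> only

print(len(by_id), len(by_class), len(exact), len(limited), len(direct))
```

The last call returns nothing because the links sit inside the paragraph rather than directly under the body tag.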

3. find(name, attrs, recursive, text, **kwargs)

find_all() returns every tag in the document that matches, but sometimes we only want one result. For example, a document has only one <body> tag, so searching for it with find_all() is a poor fit. Rather than calling find_all() with limit=1, call find() directly. The following two lines are nearly equivalent:

soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]
soup.find('title')
# <title>The Dormouse's story</title>

The only difference is that find_all() returns a list containing a single element, while find() returns the element itself. When nothing matches, find_all() returns an empty list and find() returns None.

print(soup.find("nosuchtag"))
# None

soup.head.title is navigation by tag name, which works by calling find() repeatedly on the current tag:

soup.head.title
# <title>The Dormouse's story</title>
soup.find("head").find("title")
# <title>The Dormouse's story</title>
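A short runnable sketch of the difference in return values, including a guard for the None case. The one-line document is an assumption made for the example.

```python
from bs4 import BeautifulSoup

# Assumed minimal document.
soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head><body></body></html>",
    "html.parser",
)

as_list = soup.find_all("title", limit=1)  # a list with one element
single = soup.find("title")                # the element itself
missing = soup.find("nosuchtag")           # None when nothing matches
missing_list = soup.find_all("nosuchtag")  # empty list when nothing matches
chained = soup.find("head").find("title")  # what soup.head.title does internally

# find() can return None, so check before using the result.
title_text = single.string if single is not None else ""
print(title_text)
```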

4. Other Methods

The remaining search methods, such as find_parents() and find_parent(), are described in the official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#find-parents-find-parent
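As a brief sketch of a few of those methods, the snippet below searches upward and sideways from a tag instead of downward. The markup is invented for the example; the methods shown take the same filters as find() and find_all().

```python
from bs4 import BeautifulSoup

# Assumed sample markup with three sibling links and no whitespace between them.
html_doc = (
    '<body><p class="story">'
    '<a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a>'
    "</p></body>"
)
soup = BeautifulSoup(html_doc, "html.parser")

first = soup.find("a", id="link1")
last = soup.find("a", id="link3")

parent = first.find_parent("p")           # nearest matching ancestor
bodies = first.find_parents("body")       # all matching ancestors
later = [t["id"] for t in first.find_next_siblings("a")]        # siblings after
previous = last.find_previous_sibling("a")["id"]                # nearest sibling before
earlier = [t["id"] for t in last.find_previous_siblings("a")]   # nearest first

print(parent["class"], later, previous, earlier)
```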

5. CSS Selector
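A minimal sketch of CSS selection with select() and select_one(), assuming a sample document modeled on the official documentation's example (the markup here is an assumption). In recent Beautiful Soup releases these methods rely on the soupsieve package, which is normally installed together with beautifulsoup4.

```python
from bs4 import BeautifulSoup

# Assumed sample markup.
html_doc = """<html><head><title>The Dormouse's story</title></head><body>
<p class="story"><a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a></p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

titles = soup.select("title")                 # by tag name; always returns a list
sisters = soup.select("p.story > a.sister")   # class and child combinator
link2 = soup.select_one("#link2")             # by id; first match or None
tillie = [a["href"] for a in soup.select('a[href$="tillie"]')]  # attribute suffix

print(titles[0].string, len(sisters), link2.string, tillie)
```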

This module provides the select() method for CSS selector support. For details, see the official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id37

V. Modifying the document tree

Link: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id40
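The linked page covers the modification API in full. As a brief, hedged sketch of the most common edits (renaming a tag, changing attributes and text, removing a tag, and appending a new one), assuming a small piece of markup invented for this example:

```python
from bs4 import BeautifulSoup

# Assumed sample markup.
soup = BeautifulSoup(
    '<p class="story"><b>bold</b> text <i>junk</i></p>', "html.parser"
)

tag = soup.b
tag.name = "strong"        # rename the tag
tag["class"] = "hot"       # add or change an attribute
tag.string = "New text"    # replace the tag's content

del soup.p["class"]        # delete an attribute
soup.i.decompose()         # remove a tag and its content from the tree

# Create a new tag and append it to the paragraph.
new_link = soup.new_tag("a", href="http://example.com/")
new_link.string = "link"
soup.p.append(new_link)

print(soup)
```

All of these edits change the tree in place, so printing soup afterwards shows the modified markup.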

Summary
#1. The lxml parser is recommended.
#2. Three ways of selecting were covered: tag selectors, find() and find_all(), and CSS selectors.
  1. Tag selectors have weak filtering but are fast.
  2. Use find() to match a single result and find_all() to match multiple results.
  3. If you are familiar with CSS selectors, use select().
#3. Remember the common ways to read a tag's attributes (attrs) and its text (get_text()).
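For the third point, a minimal sketch of reading attributes and text. The markup is an assumption made for this example.

```python
from bs4 import BeautifulSoup

# Assumed sample markup.
soup = BeautifulSoup(
    '<p class="story">Once <a href="http://example.com/elsie" id="link1">Elsie</a>'
    ' and <a href="http://example.com/lacie">Lacie</a></p>',
    "html.parser",
)

link = soup.find("a")
all_attrs = link.attrs       # every attribute as a dict
href = link["href"]          # one attribute; raises KeyError if it is absent
title = link.get("title")    # one attribute; returns None if it is absent
text = soup.p.get_text()     # all text below <p>, joined together

print(all_attrs, href, title, text)
```

Using get() rather than indexing is the safer choice when an attribute may be missing from some tags.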
