Learning notes for the Python crawler library Beautiful Soup

Source: Internet
Author: User

Related content:

  • What is Beautiful Soup?
  • bs4 usage
    • Import the module
    • Select a parser
    • Filter by tag name
    • Search with find / find_all
    • Search with select

 


 

What is Beautiful Soup?
  • It is a Python library that extracts data from HTML or XML files. Working with your favorite parser, it provides idiomatic ways of navigating, searching, and modifying the parse tree. (official description)
  • In short, Beautiful Soup is a parsing helper that extracts specific content from a page, saving us the trouble of writing regular expressions.

 

 

Beautiful Soup 3 has stopped development; Beautiful Soup 4 (imported as bs4) is recommended for current projects.

bs4 usage: 1. Import the module:

from bs4 import BeautifulSoup

2. Select a parser and parse the specified content:

soup = BeautifulSoup(markup, parser)
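A minimal sketch of these two steps, using the built-in html.parser and a made-up HTML snippet for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment for illustration
html = "<html><body><p class='title'>Hello, soup</p></body></html>"

# Choose a parser and parse the markup into a soup object
soup = BeautifulSoup(html, "html.parser")

print(soup.p)       # <p class="title">Hello, soup</p>
print(soup.p.text)  # Hello, soup
```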

Common parsers: html.parser, lxml, xml, and html5lib.

Sometimes you need to install a parser first, e.g. pip3 install lxml.

BeautifulSoup uses Python's standard html.parser by default, but it also supports several third-party parsing libraries:

Differences between parsers # taken from the official documentation

Beautiful Soup presents the same interface for every parser, but the parsers themselves differ, so the same document parsed by different parsers can yield trees with different structures. The biggest difference is between the HTML parsers and the XML parser. The following fragment is parsed into an HTML structure:

BeautifulSoup("<a><b /></a>")
# <html><body><a><b></b></a></body></html>

Because the empty tag <b/> does not comply with the HTML standard, the HTML parser converts it into a <b></b> pair.

The same document parsed as XML looks as follows (the lxml library must be installed to parse XML). Note that the empty tag <b/> is retained and an XML declaration is added before the document, instead of the fragment being wrapped in <html> and <body> tags:

BeautifulSoup("<a><b /></a>", "xml")
# <?xml version="1.0" encoding="utf-8"?>
# <a><b/></a>

There are also differences between the HTML parsers. If the document being parsed is well-formed, there is no structural difference between them; they parse at different speeds, but all return the correct document tree.

However, if the document is not well-formed, different parsers may return different results. In the following example, the malformed fragment is parsed with lxml, and the stray </p> tag is simply ignored:

BeautifulSoup("<a></p>", "lxml")
# <html><body><a></a></body></html>

Parsing the same document with the html5lib library gives a different result:

BeautifulSoup("<a></p>", "html5lib")
# <html><head></head><body><a><p></p></a></body></html>

Instead of ignoring the stray </p> tag, html5lib completes it into a matching <p></p> pair and wraps the fragment in <html>, <head>, and <body> tags.

Using Python's built-in parser gives the following result:

BeautifulSoup("<a></p>", "html.parser")
# <a></a>

Like the lxml library, the built-in parser ignores the stray </p> tag. Unlike html5lib, the standard library makes no attempt to produce a standards-compliant document or to wrap the fragment in a <body> tag, and unlike lxml it does not even add an <html> tag.

Because the fragment "<a></p>" is malformed, all of the above results can be regarded as "correct". The html5lib library implements parts of the HTML5 standard, so it is closest to "correct", but all of the parser outputs can be considered reasonable.

Since the choice of parser can affect the results your code sees, it is best to tell BeautifulSoup explicitly which parser to use, to avoid unnecessary trouble.

3. Operations [by convention, soup is the parse object returned by BeautifulSoup(markup, parser)]:
  • Filter by tag name
    • Use the tag name to obtain the node:
      • soup.tagname (e.g. soup.p)
    • Use the tag name to obtain the node's tag name [this is mainly useful for getting the names of tags matched by a broader filter]:
      • soup.tag.name
    • Use the tag name to obtain node attributes:
      • soup.tag.attrs [get all attributes as a dict]
      • soup.tag.attrs[attribute name] [get the specified attribute]
      • soup.tag[attribute name] [get the specified attribute]
      • soup.tag.get(attribute name)
    • Use the tag name to obtain the text content of the node:
      • soup.tag.text
      • soup.tag.string
      • soup.tag.get_text()

Supplement 1: the filters above can be nested:

print(soup.p.a)  # the first <a> tag inside the first <p> tag

Supplement 2: the name, text, string, and attrs accessors above can be used whenever the result is a bs4.element.Tag object.
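As a sketch of the accessors above (the HTML below is made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical document for illustration
html = """
<html><body>
  <p class="story">Once upon a time
    <a href="http://example.com/one" id="link1">One</a>
  </p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.p.name)        # p
print(soup.a.attrs)       # {'href': 'http://example.com/one', 'id': 'link1'}
print(soup.a["href"])     # http://example.com/one
print(soup.a.get("id"))   # link1
print(soup.a.string)      # One
print(soup.a.get_text())  # One
print(soup.p.a)           # nested filtering: the first <a> inside the first <p>
```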

    
    • Obtain the child nodes [note: newline characters between tags are kept, so '\n' strings also appear as children]:
      • soup.tag.contents [the return value is a list]
      • soup.tag.children [the return value is an iterable; iterate it to get the actual child nodes]
    • Obtain all descendant nodes:
      • soup.tag.descendants [the return value is also an iterable; iterate it to get the actual descendants]
    • Obtain the parent node:
      • soup.tag.parent
    • Obtain the ancestor nodes [parent, grandparent, great-grandparent, ...]:
      • soup.tag.parents [the return value is an iterable]
    • Obtain the sibling nodes:
      • soup.tag.next_sibling [obtain the sibling node immediately after it]
      • soup.tag.next_siblings [obtain all following sibling nodes; the return value is an iterable]
      • soup.tag.previous_sibling [obtain the sibling node immediately before it]
      • soup.tag.previous_siblings [obtain all preceding sibling nodes; the return value is an iterable]

 

Supplement 3: as in Supplement 2, the navigation attributes above can be used whenever the result is a bs4.element.Tag object.
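A sketch of tree navigation (HTML made up for illustration; the tags are written without whitespace between them so no '\n' text nodes appear as children):

```python
from bs4 import BeautifulSoup

# Hypothetical document for illustration
html = "<div id='box'><p>first</p><p>second</p><p>third</p></div>"
soup = BeautifulSoup(html, "html.parser")

div = soup.div
print(div.contents)                    # list of the three <p> children
print([p.text for p in div.children])  # ['first', 'second', 'third']
print(len(list(div.descendants)))      # 6: three <p> tags plus their text nodes
print(soup.p.parent.name)              # div

second = div.contents[1]
print(second.previous_sibling.text)    # first
print(second.next_sibling.text)        # third
```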

 

  • Use the find / find_all methods:
    • find(name, attrs, recursive, text, **kwargs) [find tags matching the parameters, but return only the first match]
    • find_all(name, attrs, recursive, text, **kwargs) [find tags matching the parameters and return all matches as a list]

    • Filter parameters:

      • name: the tag name; tags are filtered by their name.

      • attrs: attributes; tags are filtered by attribute key-value pairs. You can pass either keyword arguments (attribute_name=value) or a dict (attrs={attribute_name: value}) [but because class is a Python keyword, use class_ as the keyword argument].

      • text: the text content; tags are filtered by the specified text. [Used on its own, text returns only the matching strings, so it is generally combined with other conditions.]

      • recursive: whether the search is recursive. If set to False, only direct children are searched, not the children's children.

    • Each result node is a bs4.element.Tag object. To obtain its attributes, text content, and tag name, use the accessors described in "Filter by tag name" above.

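A sketch of the find / find_all parameters (HTML made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical document for illustration
html = """
<div>
  <a class="news" href="/a">First</a>
  <a class="news" href="/b">Second</a>
  <a class="ad" href="/c">Third</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.find("a"))                             # first match only
print(len(soup.find_all("a")))                    # 3
print(soup.find_all("a", class_="news"))          # class is a keyword, so class_
print(soup.find_all("a", attrs={"class": "ad"}))  # same idea with a dict
print(soup.find_all(text="First"))                # text alone returns only strings
print(soup.find_all("a", recursive=False))        # [] -- soup's direct child is <div>
```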
  • Use select to filter [select uses CSS selector rules]:
    • soup.select('tag name') filters the specified tags by tag name.
    • In CSS, #xxx selects by id, so soup.select('#xxx') filters tags by id. The return value is a list.
    • In CSS, .xxx selects by class, so soup.select('.xxx') filters tags by class. The return value is a list.
    • Selectors can be nested: soup.select("#xxx .xxxx"), e.g. soup.select("#id2 .news") selects the tags with class="news" inside the tag with id="id2". The return value is a list.
    • Each result node is a bs4.element.Tag object. To obtain its attributes, text content, and tag name, use the accessors described in "Filter by tag name" above.
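A sketch of the select rules above (HTML made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical document for illustration
html = """
<div id="id1"><p class="news">home news</p></div>
<div id="id2"><p class="news">world news</p><p class="sport">scores</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.select("p"))           # all <p> tags, as a list
print(soup.select("#id2"))        # the tag with id="id2"
print(soup.select(".news"))       # every tag with class="news"
print(soup.select("#id2 .news"))  # class="news" tags inside id="id2"
```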

 

 

Supplement 4:

You can use soup.prettify() to print the parse tree as a nicely indented string, with any broken markup already completed by the parser. It is generally recommended to inspect this output to catch mismatched tags.
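A sketch of prettify() on a deliberately malformed fragment (made up for illustration):

```python
from bs4 import BeautifulSoup

# Malformed fragment: <p> and <a> are never closed.
# The parser completes the missing end tags, and prettify()
# prints the repaired tree with one tag per line, indented.
html = "<html><body><p>unclosed paragraph<a href='/x'>link</body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.prettify())
```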

 

 

For more details, refer to the official documentation. Fortunately, a Simplified Chinese version is available:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id49
