Python crawler-beautifulsoup

Source: Internet
Author: User

2017-07-26 10:10:11

Beautiful soup can parse files in HTML and XML format.

The Beautiful Soup Library is a library of functions that parse, traverse, and maintain the tag tree . Using the BeautifulSoup library is very simple, just two lines of code, you can complete the creation of the BeautifulSoup class, which is named soup, then the soup can be related to processing. A BeautifulSoup class corresponds to the entire contents of HTML or XML.

BeautifulSoup Library converts any HTML file to utf-8 format

First, the parser

When the BeautifulSoup class is created, the second argument is the parser, the parser in the code above is ' Html.parser ', and the parser that BeautifulSoup supports is:

Ii. basic elements of the BeautifulSoup class

    • Use Soup.tag to access the contents of a tag, such as: SOUP.TITLE;SOUP.A, etc., where the return value is the first occurrence of the access tag
    • Use Soup.tag.name to get the name of the current tag, the return value is a string, such as: Soup.a.name will return the string ' A ', you can also use Soup.a.parent.name to view a tag parent's name
    • Using Soup.tag.attrs, you can get the properties of the current tag, the return value is a dictionary, and if no property returns an empty dictionary, such as: Soup.a.attrs returns the property information of the A tag
    • Use Soup.tag.string to get a string of the current tag, such as: Soup.a.string returns the content string of the A tag
    • There are two types of content strings, one is the navigablestring type, one is the comment type, the format of the comment type is <p> <!--the is a comment--></p> The call to Soup.p.string is returned by the IS-an comment, but its type is comment type.

Iii. content Traversal of soup

There are three ways to traverse a tag tree, that is, downlink traversal, upstream traversal, and parallel traversal.

(1) Downlink traversal property

Example:

# Traverse son node  for child  in soup.body.children:    print(child)#  Traverse descendant nodes for children  in   soup.body.descendants:print(child)

It is important to note that descendant nodes contain not only labels, but also string types between tags, which need to be noted and excluded.

(2) Properties of upstream traversal

The soup.parent is empty and needs to be differentiated, and the parents can be traversed using a For loop:

(3) Properties of parallel traversal

# Traverse subsequent nodes  for sibling  in soup.a.next_sibling:    print(sibling)# traversing a previous node  for sibling  in soup.a.previous_sibling:    print(sibling)

Python crawler-beautifulsoup

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.