2017-07-26 10:10:11
Beautiful Soup can parse documents in HTML and XML format.
The Beautiful Soup library is a collection of functions for parsing, traversing, and maintaining the tag tree. Using it is simple: two lines of code are enough to create a BeautifulSoup object, conventionally named soup, which is then used for all further processing. One BeautifulSoup object corresponds to the entire contents of an HTML or XML document.
The BeautifulSoup library converts any incoming document to Unicode and writes its output as UTF-8.
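The two-line setup described above can be sketched as follows. The sample markup in `html_doc` is made up for illustration and is not from the original article; 'html.parser' is the parser bundled with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
html_doc = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# The second argument names the parser; 'html.parser' ships with Python.
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.title)       # the first <title> tag
print(soup.prettify())  # the whole tree, re-indented
```

The later examples assume a soup object created in this way.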
I. The parser
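When the BeautifulSoup object is created, the second argument selects the parser, for example 'html.parser'. The parsers that BeautifulSoup supports are 'html.parser' (Python's built-in parser), 'lxml', 'xml' (the XML parser from lxml), and 'html5lib'; the last three require the lxml or html5lib package to be installed.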
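As a rough sketch of choosing a parser, the snippet below tries an optional parser first and falls back to the built-in one. The helper name `make_soup` and the sample markup are hypothetical; `FeatureNotFound` is the exception bs4 raises when the requested parser library is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

html_doc = "<p>hi</p>"  # hypothetical sample markup


def make_soup(markup, preferred="lxml"):
    """Parse with the preferred parser, falling back to the built-in one."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser library is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup(html_doc)
print(soup.p.string)
```

Different parsers can build slightly different trees from the same malformed markup, so it may be worth naming the parser explicitly rather than relying on whichever is installed.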
II. Basic elements of the BeautifulSoup class
- Use soup.tag to access a tag, for example soup.title or soup.a; the return value is the first occurrence of that tag in the document.
- Use soup.tag.name to get the name of the tag as a string; for example, soup.a.name returns 'a'. soup.a.parent.name gives the name of the tag's parent.
- Use soup.tag.attrs to get the attributes of the tag as a dictionary; if the tag has no attributes, an empty dictionary is returned. For example, soup.a.attrs returns the attributes of the first a tag.
- Use soup.tag.string to get the string inside the tag; for example, soup.a.string returns the text content of the a tag.
- The content string has one of two types: NavigableString or Comment. For a comment such as <p><!--This is a comment--></p>, soup.p.string returns This is a comment (without the comment markers), but its type is Comment.
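The basic elements listed above can be tried out with a small sketch like the one below. The markup, URLs, and ids are invented for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical sample markup, for illustration only.
html_doc = (
    "<html><head><title>Demo</title></head><body>"
    "<p><!--This is a comment--></p>"
    '<a href="http://example.com/1" id="link1">First link</a>'
    '<a href="http://example.com/2" id="link2">Second link</a>'
    "</body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

first_a = soup.a             # first <a> tag in the document
print(first_a.name)          # tag name as a string
print(first_a.parent.name)   # name of the enclosing tag
print(first_a.attrs)         # attributes as a dictionary
print(soup.title.attrs)      # empty dictionary: <title> has no attributes

# Both kinds of content string are str subclasses, so check the type explicitly.
print(isinstance(first_a.string, NavigableString))
print(isinstance(soup.p.string, Comment))
```

Because a Comment prints like ordinary text, an explicit isinstance check is one way to avoid treating comments as page content.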
III. Content traversal of soup
There are three ways to traverse a tag tree: downward traversal, upward traversal, and sideways (sibling) traversal.
(1) Downward traversal properties
Example:
    # Traverse child nodes
    for child in soup.body.children:
        print(child)

    # Traverse descendant nodes
    for descendant in soup.body.descendants:
        print(descendant)

Note that child and descendant nodes include not only tags but also the strings between tags (including whitespace such as newlines), so these need to be filtered out when only tags are wanted.
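One possible way to exclude the strings is to keep only nodes that are instances of Tag, as in this sketch. The sample markup is made up, and the isinstance filter is just one option, not necessarily what the original author used.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup with newlines between tags.
html_doc = "<body>\n<p>one <b>bold</b></p>\n<p>two</p>\n</body>"
soup = BeautifulSoup(html_doc, "html.parser")

# Unfiltered children include the newline strings between the <p> tags.
all_children = list(soup.body.children)

# Keep only Tag nodes, dropping the strings.
child_tags = [c.name for c in soup.body.children if isinstance(c, Tag)]
descendant_tags = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(len(all_children), child_tags, descendant_tags)
```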
(2) Upward traversal properties
soup.parent is None, so that case needs to be handled separately. The parents of a tag can be traversed with a for loop:
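A minimal sketch of such a loop is shown below, with made-up markup. It assumes the html.parser tree, where the last parent reached is the BeautifulSoup object itself, whose name is '[document]'.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html_doc = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# The soup object is the root of the tree, so it has no parent.
print(soup.parent)

# Walk upward from the first <a> tag to the root.
parent_names = []
for parent in soup.a.parents:
    parent_names.append(parent.name)

print(parent_names)
```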
(3) Sideways (sibling) traversal properties
    # Traverse the following siblings
    for sibling in soup.a.next_siblings:
        print(sibling)

    # Traverse the preceding siblings
    for sibling in soup.a.previous_siblings:
        print(sibling)
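The difference between the singular and plural sibling properties can be seen in a short sketch. The markup and ids are invented; next_sibling and previous_sibling return a single node, while next_siblings and previous_siblings are iterators, and both may yield strings as well as tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: three links separated by plain text.
html_doc = '<p><a id="first">one</a>, <a id="second">two</a> and <a id="third">three</a></p>'
soup = BeautifulSoup(html_doc, "html.parser")

second = soup.find("a", id="second")

# The immediate siblings here are strings, not tags.
print(repr(second.next_sibling))
print(repr(second.previous_sibling))

# Iterate over all siblings and keep only the tags.
next_tags = [s["id"] for s in second.next_siblings if isinstance(s, Tag)]
prev_tags = [s["id"] for s in second.previous_siblings if isinstance(s, Tag)]
print(next_tags, prev_tags)
```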