Python crawler tool: Beautiful Soup


Beautiful Soup is a Python library that extracts data from HTML or XML files. Manipulating an HTML page with it is as handy as manipulating the HTML DOM tree with JavaScript. An official Chinese translation of the documentation is also available.

1. Installation

1.1 Installing Beautiful Soup

Beautiful Soup 3 is no longer maintained; Beautiful Soup 4 is recommended. It has been ported to the bs4 package, so import it from bs4. Install it as follows:

```shell
# Install with pip
pip install beautifulsoup4
# Or install with easy_install
easy_install beautifulsoup4
```
1.2 Installing a parser

You also need to install a parser; either lxml or html5lib will do.

```shell
# Install lxml
pip install lxml
# Install html5lib
pip install html5lib
```
1.3 How to use

After installing Beautiful Soup, you can import and use it. Pass a document into the BeautifulSoup constructor to get a document object; the constructor accepts either a string or a file handle.

```python
# First, import from bs4
from bs4 import BeautifulSoup
# Initialize with an HTML document and a parser
soup = BeautifulSoup(open("index.html"), 'lxml')
```

Beautiful Soup converts the incoming document to Unicode; HTML entities are converted to Unicode characters as well.

2. Objects in Beautiful Soup

Beautiful Soup transforms a complex HTML document into a tree structure, similar to the DOM tree in a browser. Each node is a Python object, and all objects fall into four types: Tag, NavigableString, BeautifulSoup, and Comment.

2.1 Tag Object

A Tag object corresponds to a tag node, the same as a tag in the original XML or HTML document, such as body, div, a, or span. Tag objects have many methods and properties, and a tag's attributes can be read and manipulated like a dictionary.

2.1.1 Name property

The name property holds the tag's name and is accessed via tag.name. If you change a tag's name, the change is reflected in all HTML generated from the current Beautiful Soup object.
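As a minimal sketch (the markup here is invented for illustration; the stdlib html.parser stands in for lxml to avoid the extra dependency), renaming a tag is reflected in the HTML the soup generates:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="x">text</b>', 'html.parser')
tag = soup.b
print(tag.name)      # b
tag.name = 'strong'  # renaming the tag...
print(soup)          # ...changes the generated HTML: <strong class="x">text</strong>
```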

2.1.2 The attrs property

A tag may have many attributes. Use tag.attrs to get all of the tag's attributes, which you can also add to and modify. Ways to access them:

    • tag.attrs: get all attributes as a dictionary
    • tag.attrs['href']: get the href attribute from that dictionary
    • tag.get('href'): get the href attribute (returns None if it is missing)
    • tag['href']: get the href attribute (raises KeyError if it is missing)
2.1.3 Multi-valued attribute

In an HTML document, some attributes, most typically class, can hold multiple values. These multi-valued attributes are returned as lists rather than strings. The attributes treated as multi-valued are:

    • class
    • rel
    • rev
    • accept-charset
    • headers
    • accesskey

XML documents have no multi-valued attributes.

```python
html = '<a href="index.html" class="button button-blue" data="1 2 3"></a>'
soup = BeautifulSoup(html, 'lxml')
tag = soup.a     # get the a tag
tag.name         # tag name: 'a'
tag.attrs        # attribute dict: {'href': 'index.html', 'class': ['button', 'button-blue'], 'data': '1 2 3'}
tag.get('href')  # get the href attribute: 'index.html'
tag['class']     # class is multi-valued, returned as a list: ['button', 'button-blue']
tag['data']      # data is single-valued, returned as a string: '1 2 3'
```
2.2 NavigableString Object

Tags often contain strings, and Beautiful Soup uses the NavigableString class to wrap the strings inside tags.

    • Use tag.string to get the string inside a tag, as a NavigableString
    • Convert it to a plain Unicode string with str(tag.string) (unicode() in Python 2)
    • The string inside a tag cannot be edited in place
    • tag.string.replace_with('content') replaces the string inside the tag
    • A string does not support the .contents or .string properties or the find() method
    • To use a NavigableString outside of Beautiful Soup, convert it with str() so it does not keep a reference to the whole tree
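A minimal sketch of these points (markup invented for illustration; html.parser stands in for lxml):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello</p>', 'html.parser')
s = soup.p.string        # a NavigableString
print(type(s).__name__)  # NavigableString
text = str(s)            # convert to a plain Python string
s.replace_with('World')  # replace the string inside the tag
print(soup.p)            # <p>World</p>
```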
2.3 BeautifulSoup Object

The BeautifulSoup object represents the entire document. Most of the time you can treat it as a Tag object.

Because the BeautifulSoup object does not correspond to a real HTML or XML tag, it has no name or attributes of its own; instead it defines a special .name property whose value is "[document]".
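A quick illustration (any small document works; the markup is invented):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>hi</p>', 'html.parser')
print(soup.name)    # the special value: [document]
print(soup.p.name)  # an ordinary tag name: p
```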

2.4 Comment Object

The Comment object is a special type of NavigableString used to represent the comment portions of a document.

```python
html = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(html, 'lxml')
comment = soup.b.string
type(comment)  # <class 'bs4.element.Comment'>
print(soup.b.prettify())
# <b>
#  <!--Hey, buddy. Want to buy a used parser?-->
# </b>
```
3. Traverse the document Tree

By traversing the document tree, you can find the specified content from the document.

3.1 Child nodes

A tag may contain multiple strings or other tags; these are its direct children. Beautiful Soup provides many properties for operating on and iterating over child nodes. Note:

    • A string has no child nodes
    • The BeautifulSoup object itself always has child nodes

Here are some simple ways to get child nodes:

    • By tag name: gets the first matching direct child node: soup.div.p
    • The .contents property: a list of all direct child nodes: soup.div.contents
    • The .children property: an iterator over the direct child nodes: soup.div.children
    • The .descendants property: recursively iterates over all descendants of the tag: soup.div.descendants
    • The .string property: gets the string of a tag that has exactly one string child: p.string
    • The .strings property: iterates over the strings of a tag with multiple string children: div.strings
```python
div_html = '<div><p>uu</p><p>sa</p><p><a>ma</a></p></div>'
soup = BeautifulSoup(div_html, 'lxml')
div = soup.div   # get the div node
div.p            # first direct child: <p>uu</p>
div.contents     # [<p>uu</p>, <p>sa</p>, <p><a>ma</a></p>]
div.contents[0]  # <p>uu</p>
for child in div.children:     # iterate over direct children
    print(child)
for child in div.descendants:  # recursively iterate over all descendants
    print(child)
```
3.2 Parent Node

Each tag or string has a parent node; that is, every node is contained in some tag. The .parent property gets an element's parent: p.parent. The .parents property lets you recursively iterate over all of an element's ancestors.

```python
soup = BeautifulSoup(div_html, 'lxml')  # div_html as defined in 3.1
div = soup.div             # the div node
sa = div.a.string          # the string of the first a node: 'ma'
sa.parent                  # the a node
sa.parent.parent           # the p node containing the a
for parent in sa.parents:  # iterate over all ancestors
    print(parent.name)     # a, p, div, ... up to [document]
```
3.3 Sibling nodes

Sibling nodes are nodes that share the same parent. The three p tags in the div_html defined in 3.1 are siblings. Use the following tag properties to access sibling nodes:

    • .next_sibling: the next sibling of the current node
    • .previous_sibling: the previous sibling of the current node
    • .next_siblings: all siblings after the current node
    • .previous_siblings: all siblings before the current node
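A sketch of sibling navigation (markup invented for illustration; note that whitespace between tags would itself appear as a string sibling, so the markup below has none):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><p>uu</p><p>sa</p><p>ma</p></div>', 'html.parser')
first = soup.div.p
print(first.next_sibling)                   # <p>sa</p>
print(first.next_sibling.previous_sibling)  # <p>uu</p>
for sib in first.next_siblings:             # all siblings after the first p
    print(sib)
```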
4. Search the document tree

Search methods are used constantly when writing a crawler to locate specific nodes. Beautiful Soup defines many search methods whose parameters and usage are very similar, and whose query capability is powerful. The following focuses on the find_all() method.

4.1 Filter

A filter is a matching rule used by the search methods; it is what you pass as a parameter value. A filter can take any of the following forms:

    • String: find_all('div')
    • List: find_all(['div', 'span'])
    • Regular expression: find_all(re.compile('[a-z]{1,3}'))
    • True: matches any tag node, but not strings
    • Function: a callback that takes a tag and returns True to indicate a match
```python
html = '<nav><a>a_1</a><a>a_2</a>string</nav>'
soup = BeautifulSoup(html, 'lxml')
nav_node = soup.nav
# <nav><a>a_1</a><a>a_2</a>string</nav>
nav_node.find_all(True)  # True matches tags but not the string
# [<a>a_1</a>, <a>a_2</a>]

def has_class_but_no_id(tag):  # define a matching function
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)  # nodes with a class attribute but no id
```
4.2 The find_all() and find() methods

The find_all() method returns a list of all matching nodes (or an empty list); the find() method directly returns the first matching result. Their signatures are:

```python
find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
find(name=None, attrs={}, recursive=True, text=None, **kwargs)
```

Each parameter has the following meanings:

    • name: matches the tag name
    • attrs / keyword arguments: match attribute values: find_all(href='index.html')
    • text: matches string content in the document
    • limit: limits the number of results returned
    • recursive: defaults to True; False searches only direct child nodes

Each of the above parameters accepts any of the filters described in 4.1. In addition, note the following:

    • The attrs parameter is a dictionary and can combine multiple attribute conditions
    • Because class is a reserved word in Python, search by CSS class with the class_ keyword
    • class is a multi-valued attribute; each CSS class name can be searched separately
    • To match a class attribute exactly, the CSS classes must appear in the same order
```python
# search all div tags
soup.find_all('div')
# search all nodes whose id is link1 or link2
soup.find_all(id=['link1', 'link2'])
# search all nodes whose class attribute contains button
soup.find_all(class_='button')
# search all p tags whose string matches a regular expression
soup.find_all('p', text=re.compile('game'))
# search a tags with class button and href link1
soup.find_all('a', {'class': 'button', 'href': 'link1'})
# search only direct children and return at most one a tag
soup.find_all('a', limit=1, recursive=False)
```
4.3 Other methods

The other search methods take parameters similar to find_all() and find(). They come in pairs that return, respectively, a list of results and the first matching result, but they differ in which part of the document they search. Common examples:

    • find_parents() and find_parent(): search only the ancestors of the current node
    • find_next_siblings() and find_next_sibling(): search only the siblings after the current node
    • find_previous_siblings() and find_previous_sibling(): search only the siblings before the current node
    • find_all_next() and find_next(): search the nodes after the current node
    • find_all_previous() and find_previous(): search the nodes before the current node
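A brief sketch of a few of these, reusing the markup from 3.1 (html.parser stands in for lxml):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><p>uu</p><p>sa</p><p><a>ma</a></p></div>', 'html.parser')
a = soup.a
print(a.find_parent('div').name)            # div: nearest ancestor matching 'div'
print(a.parent.find_previous_sibling('p'))  # <p>sa</p>: sibling before the p holding the a
print(soup.p.find_next_sibling('p'))        # <p>sa</p>: next p after the first p
```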
4.4 CSS Selector

Beautiful Soup supports most CSS selectors through the select() method, including id selectors, tag selectors, attribute selectors, and combinators. For example:

    • soup.select('body a')
    • soup.select('#top')
    • soup.select('div > span')
    • soup.select('.button + img')
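A small sketch of select() (markup invented for illustration):

```python
from bs4 import BeautifulSoup

html = '<div id="top"><span class="button"></span><a href="#">link</a></div>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.select('#top'))         # id selector
print(soup.select('div > a'))      # child combinator
print(soup.select('span.button'))  # tag plus class selector
```

select() always returns a list of matching Tag objects, even when the selector can match at most one node.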
5. Modify the document tree

After getting search results, you may want some nodes excluded from subsequent searches; deleting them requires modifying the document tree. The available methods are:

    • append(): appends content (a string) to the current tag
    • new_tag(), new_string(): create a new tag or string to insert
    • insert(index, content): inserts content at the given position
    • insert_before() and insert_after(): insert content before or after the current node
    • clear(): removes the contents of the current node
    • extract(): removes the current tag from the document tree and returns it
    • decompose(): removes the current node from the document tree and destroys it completely
    • replace_with(): removes a piece of content from the tree and replaces it with a new tag or string
    • wrap(): wraps the specified element in a tag and returns the wrapped result
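A sketch of the deletion methods (markup invented for illustration), showing the difference between decompose() and extract():

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><p class="ad">ad</p><p>keep</p></div>', 'html.parser')
soup.find('p', class_='ad').decompose()  # remove and destroy the ad node
print(soup.div)                          # <div><p>keep</p></div>
removed = soup.p.extract()               # detach the remaining p, keeping a reference
print(removed)                           # <p>keep</p>
print(soup.div)                          # <div></div>
```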
6. Output

Sometimes you need to display or save a document tree you have searched or modified.
The prettify() method formats the Beautiful Soup document tree and outputs it as Unicode, with each XML/HTML tag on its own line. Both the BeautifulSoup object and its Tag nodes can call prettify().

You can call Python's str() (unicode() in Python 2) on a BeautifulSoup or Tag object to get compressed, single-line output.
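A short sketch contrasting the two output styles (markup invented for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><p>hi</p></div>', 'html.parser')
print(soup.prettify())  # one tag per line, indented
print(str(soup))        # compressed: <div><p>hi</p></div>
```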

Original address: http://uusama.com/467.html
