Python Crawler Primer (4)--BeautifulSoup, the HTML Text Parsing Library, in Detail

Beautiful Soup is a Python library whose main purpose is to extract data from web pages. The following article introduces BeautifulSoup, an HTML text parsing library for Python crawlers. The material is covered in detail and should have real reference value for anyone learning the topic, so let's take a look together.

Objective

The third article in the Python crawler series introduced requests, the powerful web request library. Once a request returns, different websites give back content in many different formats: one is JSON, which is the most developer-friendly; another is XML; and there is the most common format of all, the HTML document. Today's topic is how to extract the data of interest from HTML.

Do you write an HTML parser yourself? Or do you use regular expressions? Neither is the best approach. Fortunately, the Python community solved this problem long ago: BeautifulSoup is the bane of this kind of problem. It focuses on HTML document operations, and its name comes from a poem of the same name by Lewis Carroll.

BeautifulSoup is a Python library for parsing HTML documents. With BeautifulSoup you can extract any content of interest from HTML with very little code, and it also has some HTML fault tolerance: it can correctly handle an HTML document that is not completely well-formed.

Installing BeautifulSoup

pip install beautifulsoup4

BeautifulSoup 3 is no longer officially maintained; you should download the latest version, BeautifulSoup 4.

HTML tags

Before learning BeautifulSoup4, it helps to have a basic understanding of HTML documents. As the sample document below shows, HTML is organized as a tree-like structure.
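A minimal sample document along these lines (reconstructed from the fragments quoted in the examples later in this article; the exact markup may have differed) is assumed throughout:

<html>
 <head>
  <title>hello, world</title>
 </head>
 <body>
  <h1>BeautifulSoup</h1>
  <p class="bold">如何使用BeautifulSoup</p>
  <p class="big" id="key1">第二个p标签</p>
  <a href="http://foofish.net">python</a>
 </body>
</html>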



 
    • It is made up of many tags; for example, html, head, and title are all tags.

    • A pair of tags forms a node; for example, <html>...</html> is the root node.

    • Nodes have relationships with one another; for example, h1 and p are neighbors, making them adjacent sibling nodes.

    • h1 is a direct child node of body, and also a descendant node of html.

    • body is the parent node of p, and html is an ancestor node of p.

    • A string nested between tags is a special child node of that node; for example, "hello, world" is also a node, it just has no name.

Using BeautifulSoup

Building a BeautifulSoup object requires two parameters: the first is the HTML text string to parse, and the second tells BeautifulSoup which parser to use to parse the HTML.

The parser is responsible for parsing the HTML into the corresponding objects, while BeautifulSoup is responsible for manipulating that data (adding, deleting, modifying, and querying). "html.parser" is Python's built-in parser; "lxml" is a parser implemented in C, which runs faster but requires a separate installation.
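For example, a minimal sketch of constructing the object with each parser (the second line assumes lxml has already been installed, e.g. with pip install lxml):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello</p>", "html.parser")  # built-in parser, no extra install
# soup = BeautifulSoup("<p>Hello</p>", "lxml")       # C-based parser, faster, extra install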

Through a BeautifulSoup object you can navigate to any tag node in the HTML document.

from bs4 import BeautifulSoup

# text holds the sample HTML document shown earlier
text = """
<html>
 <head><title>hello, world</title></head>
 <body>
  <h1>BeautifulSoup</h1>
  <p class="bold">如何使用BeautifulSoup</p>
  <p class="big" id="key1">第二个p标签</p>
  <a href="http://foofish.net">python</a>
 </body>
</html>
"""
soup = BeautifulSoup(text, "html.parser")

BeautifulSoup abstracts HTML into four main data types: Tag, NavigableString, BeautifulSoup, and Comment. Each tag node is a Tag object; a NavigableString object is usually the string wrapped inside a Tag object; and the BeautifulSoup object represents the whole HTML document. For example:

>>> type(soup)
<class 'bs4.BeautifulSoup'>
>>> type(soup.h1)
<class 'bs4.element.Tag'>
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>

Tag

Each Tag has a name, which corresponds to the HTML tag name.


>>> soup.h1.name
u'h1'
>>> soup.p.name
u'p'

Tags can also have attributes, which are accessed in the same way as a dictionary; a multi-valued attribute such as class is returned as a list object:

>>> soup.p['class']
[u'bold']

NavigableString

To get the content of a tag, use .string directly. It is a NavigableString object, which you can explicitly convert to a unicode string:

>>> soup.p.string
u'\u5982\u4f55\u4f7f\u7528BeautifulSoup'
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>
>>> unicode_str = unicode(soup.p.string)
>>> unicode_str
u'\u5982\u4f55\u4f7f\u7528BeautifulSoup'

With the basic concepts covered, we can now formally get to the topic: how do we find the data we care about in the HTML? BeautifulSoup provides two ways: one is traversal, the other is search, and usually the two are combined to complete a lookup task.

Traverse the document tree

Traversing the document tree, as the name implies, means starting from the root html tag and walking down until the target element is found. One drawback of traversal is that if the data you are looking for sits at the end of the document, you have to walk the whole document to reach it, which is slow. It therefore needs to be combined with the second method.

When traversing the document tree, a tag node can be fetched directly via its tag name, for example:

Get the body tag:

>>> soup.body
<body>
...

Get the p tag:

>>> soup.body.p
<p class="bold">\u5982\u4f55\u4f7f\u7528BeautifulSoup</p>

Get the content of the p tag:

>>> soup.body.p.string
\u5982\u4f55\u4f7f\u7528BeautifulSoup

As mentioned earlier, the content is also a node and can be obtained with .string. Another drawback of traversing the document tree is that you can only reach the first child node that matches; for example, when there are two adjacent p tags, the second one cannot be reached via .p, and you need the next_sibling attribute to get the adjacent following node. There are also many less commonly used attributes, such as .contents to get all child nodes and .parent to get the parent node; see the official documentation for more.
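For example, a short sketch of reaching that second p tag, assuming the soup object built from the sample document above:

# .next_sibling can land on the whitespace text node between the two tags,
# so find_next_sibling("p") is a convenient way to jump straight to the tag
first_p = soup.body.p
second_p = first_p.find_next_sibling("p")
print(second_p)  # <p class="big" id="key1">...</p>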

Search the document tree

Searching the document tree finds elements by tag name, and you can also pinpoint a node element by specifying its attribute values. The two most commonly used methods are find and find_all, and both can be called on BeautifulSoup objects as well as Tag objects.

find_all()

find_all(name, attrs, recursive, text, **kwargs)

find_all returns a list of Tags. The method can be invoked very flexibly, and all of its parameters are optional.

The first parameter, name, is the name of the tag node.

# Find all tag nodes named title
>>> soup.find_all("title")
[<title>hello, world</title>]
>>> soup.find_all("p")
[<p class="bold">\xc8\xe7\xba\xce\xca\xb9\xd3\xc3BeautifulSoup</p>,
 <p class="big">\xb5\xda\xb6\xfe\xb8\xf6p\xb1\xea\xc7\xa9</p>]

The second parameter is the tag's class attribute value:

# Find all p tags whose class attribute is "big"
>>> soup.find_all("p", "big")
[<p class="big">\xb5\xda\xb6\xfe\xb8\xf6p\xb1\xea\xc7\xa9</p>]

which is equivalent to:

>>> soup.find_all ("P", class_= "big") [<p class= "Big" > \xb5\xda\xb6\xfe\xb8\xf6p\xb1\xea\xc7\xa9</ P>]

Because class is a Python keyword, it is specified here as class_.

kwargs are a tag's attribute name-value pairs. For example, to find a tag whose href attribute is "http://foofish.net":

>>> soup.find_all(href="http://foofish.net")
[<a href="http://foofish.net">python</a>]

Of course, regular expressions are also supported:

>>> import re
>>> soup.find_all(href=re.compile("^http"))
[<a href="http://foofish.net">python</a>]

Besides a specific value or a regular expression, an attribute can also be a Boolean value (True/False):

>>> soup.find_all(id="key1")
[<p class="big" id="key1">\xb5\xda\xb6\xfe\xb8\xf6p\xb1\xea\xc7\xa9</p>]
>>> soup.find_all(id=True)
[<p class="big" id="key1">\xb5\xda\xb6\xfe\xb8\xf6p\xb1\xea\xc7\xa9</p>]

Traversal and search can be combined: first locate the body tag to narrow the search scope, then find the a tags within the body.

>>> body_tag = soup.body
>>> body_tag.find_all("a")
[<a href="http://foofish.net">python</a>]

find()

The find method is similar to find_all, except that it returns a single Tag object rather than a list. If no node matches, it returns None; if multiple nodes match, only the first one is returned.

>>> Body_tag.find ("a") <a href= "foofish.net" rel= "external nofollow" rel= "external nofollow" rel= "external nofollow "rel=" external nofollow "rel=" external nofollow "rel=" External nofollow ">python</a>>>> Body_tag.find ("P") <p class= "Bold" >\xc8\xe7\xba\xce\xca\xb9\xd3\xc3BeautifulSoup</p>

get_text()

Besides .string, you can also use the get_text method to get the content of a tag. The difference is that the former returns a NavigableString object, while the latter returns a unicode string:

>>> p1 = body_tag.find('p').get_text()
>>> type(p1)
<type 'unicode'>
>>> p1
u'\xc8\xe7\xba\xce\xca\xb9\xd3\xc3BeautifulSoup'
>>> p2 = body_tag.find("p").string
>>> type(p2)
<class 'bs4.element.NavigableString'>
>>> p2
u'\xc8\xe7\xba\xce\xca\xb9\xd3\xc3BeautifulSoup'

In real-world scenarios, we typically use the get_text method to get a tag's content.
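Note that get_text also gathers the text of all descendant nodes, and it accepts optional separator and strip arguments; a quick sketch against the sample document:

# join the text of every descendant of body, trimming surrounding whitespace
all_text = soup.body.get_text(" ", strip=True)
print(all_text)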

Summary

BeautifulSoup is a Python library for manipulating HTML documents. When initializing BeautifulSoup, you need to give it an HTML document string and specify the parser to use. BeautifulSoup has three commonly used data types: Tag, NavigableString, and BeautifulSoup. There are two ways to find HTML elements, traversing the document tree and searching the document tree, and the two are often combined for fast data retrieval.
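Putting the pieces together with the requests library from the previous article, a minimal end-to-end sketch might look like this (the URL is only a placeholder):

import requests
from bs4 import BeautifulSoup

# fetch a page, then list every link's address and text
response = requests.get("http://foofish.net")  # placeholder URL
soup = BeautifulSoup(response.text, "html.parser")
for a in soup.find_all("a", href=True):
    print(a["href"], a.get_text(strip=True))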

"Recommended"

1. Python Crawler Primer (5)--Regular expression example tutorial

2. Python Crawler Primer (3)--Using requests to build a Zhihu API

3. Python Crawler Primer (2)--HTTP library requests

4. A summary of Python's logical operator and

5. Python Crawler Primer (1)--A quick understanding of the HTTP protocol
