Python Crawler Primer (4) -- BeautifulSoup, an HTML Text Parsing Library in Detail

BeautifulSoup is a Python library whose main function is to extract data from web pages. This article introduces BeautifulSoup, the HTML text parsing library used in Python crawlers, in considerable detail; it should have some reference value for anyone learning the topic, so let's take a look together.

Introduction

The third article in this Python crawler series introduced requests, the indispensable library for making web requests. Once a request returns, different websites respond with content in many different formats: some in JSON, which is the friendliest to developers; some in XML; and, most commonly of all, as HTML documents. Today's topic is how to extract the data we care about from HTML.

Should you write an HTML parser yourself, or fall back on regular expressions? Neither is the best approach. Fortunately, the Python community solved this problem long ago: BeautifulSoup is the answer to exactly this kind of task. It focuses on HTML document manipulation, and its name comes from a poem of the same name by Lewis Carroll.

BeautifulSoup is a Python library for parsing HTML documents. With BeautifulSoup, you can extract any content of interest from HTML with very little code. It also has a degree of fault tolerance, so it can correctly handle HTML documents that are not well formed.

Installing BeautifulSoup

pip install beautifulsoup4

BeautifulSoup 3 has been officially abandoned; you should download the latest version, BeautifulSoup 4.
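
To verify the installation, you can print the installed version number, which the bs4 package exposes as __version__:

import bs4
print(bs4.__version__)  # should print a 4.x version number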

HTML tags

Before learning BeautifulSoup 4, it is worth having a basic understanding of HTML documents. As the markup below shows, HTML is organized as a tree-like structure.
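
A minimal sample document with this structure, reconstructed from the REPL output shown later in the article:

<html>
 <head>
  <title>hello, world</title>
 </head>
 <body>
  <h1>BeautifulSoup</h1>
  <p class="bold">如何使用BeautifulSoup</p>
  <p class="big" id="key1"> 第二个p标签</p>
  <a href="http://foofish.net">PYTHON</a>
 </body>
</html>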



 
    • It consists of many tags (tag); for example, html, head, and title are all tags.

    • A pair of tags forms a node; for example, <html>...</html> is the root node.

    • Nodes are related to one another; for example, h1 and p are neighbors, which makes them adjacent sibling nodes.

    • h1 is a direct child (children) node of body, and also a descendant (descendants) node of html.

    • body is the parent node of p, and html is an ancestor (parents) node of p.

    • A string nested between tags is a special child node under that node; for example, "hello, world" is also a node, just one without a name.

Using BeautifulSoup

Building a BeautifulSoup object requires two parameters: the first is the HTML text string to parse, and the second tells BeautifulSoup which parser to use to parse the HTML.

The parser is responsible for parsing the HTML into the corresponding objects, while BeautifulSoup is responsible for manipulating that data (adding, deleting, modifying, querying). "html.parser" is Python's built-in parser; "lxml" is a C-based parser that executes faster, but it requires a separate installation.
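
Since lxml may or may not be installed on a given machine, one common pattern is to prefer it when available and fall back to the built-in parser; a small sketch:

# prefer the faster lxml parser when it is installed,
# otherwise fall back to Python's built-in parser
try:
    import lxml  # only checking availability
    PARSER = "lxml"
except ImportError:
    PARSER = "html.parser"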

Through the BeautifulSoup object, you can navigate to any tag node in the HTML document.

from bs4 import BeautifulSoup

text = """
<html>
 <head>
  <title>hello, world</title>
 </head>
 <body>
  <h1>BeautifulSoup</h1>
  <p class="bold">如何使用BeautifulSoup</p>
  <p class="big" id="key1"> 第二个p标签</p>
  <a href="http://foofish.net">PYTHON</a>
 </body>
</html>
"""
soup = BeautifulSoup(text, "html.parser")

BeautifulSoup abstracts HTML into four main kinds of data, namely Tag, NavigableString, BeautifulSoup, and Comment. Each tag node is a Tag object; a NavigableString object is generally a string wrapped inside a Tag object; and the BeautifulSoup object represents the entire HTML document. For example:

>>> type(soup)
<class 'bs4.BeautifulSoup'>
>>> type(soup.h1)
<class 'bs4.element.Tag'>
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>
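
The fourth type, Comment, does not appear in this sample document; it is what comes back when a tag's content is an HTML comment. A small illustration:

from bs4 import BeautifulSoup

comment_soup = BeautifulSoup("<b><!--a hidden remark--></b>", "html.parser")
print(type(comment_soup.b.string))  # <class 'bs4.element.Comment'>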

Tag

Every Tag has a name, which corresponds to the HTML tag name.


>>> soup.h1.name
u'h1'
>>> soup.p.name
u'p'

Tags can also have attributes, which are accessed in a dictionary-like way; the class attribute returns a list object:

>>> soup.p['class']
[u'bold']
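
class is a multi-valued attribute, which is why a list comes back; most other attributes return a plain string, and a Tag's full attribute set is available as a dict. A short sketch against the sample document:

link = soup.a
print(link["href"])   # 'http://foofish.net' -- single-valued attributes are plain strings
print(link.attrs)     # dict of all of the tag's attributes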

NavigableString

To get the contents of a tag, use .string directly. It is a NavigableString object, which you can explicitly convert to a Unicode string:

>>> soup.p.string
u'如何使用BeautifulSoup'
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>
>>> unicode_str = unicode(soup.p.string)
>>> unicode_str
u'如何使用BeautifulSoup'

With the basic concepts out of the way, we can now get into the main topic: how do we find the data we care about in an HTML document? BeautifulSoup provides two ways, one is traversal and the other is search, and the two are usually combined to complete a lookup task.

Traversing the document tree

Traversing the document tree, as the name implies, means starting from the root html tag and walking down until the target element is found. One drawback of traversal is that if what you are looking for sits at the end of the document, you have to traverse the entire document to reach it, which is slow. It therefore needs to be paired with the second method.

When traversing the document tree, a tag node can be obtained directly as an attribute named after the tag, for example:

Get the body tag:

>>> soup.body
<body>
<h1>BeautifulSoup</h1>
<p class="bold">如何使用BeautifulSoup</p>
<p class="big" id="key1"> 第二个p标签</p>
<a href="http://foofish.net">PYTHON</a>
</body>

Get the p tag:

>>> soup.body.p
<p class="bold">如何使用BeautifulSoup</p>

Get the contents of the p tag:

>>> soup.body.p.string
u'如何使用BeautifulSoup'

As mentioned earlier, content counts as a node too and can be obtained with .string. Another drawback of traversing the document tree is that it only ever reaches the first matching child node: if there are two adjacent p tags, the second one cannot be obtained through .p, and you need the next_sibling property to get the node that follows it. There are also a number of less commonly used properties, for example .contents, which gets all child nodes, and .parent, which gets the parent node; see the official documentation for more. A short sketch of these follows.
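
A small sketch of these traversal properties against the sample document (note that .next_sibling may return the whitespace string between two tags rather than the next tag, so find_next_sibling is used here to skip over it):

first_p = soup.body.p                        # the first <p> under <body>
second_p = first_p.find_next_sibling("p")    # the adjacent <p class="big">
print(second_p)

print(soup.body.contents)   # all direct children of <body>, whitespace text nodes included
print(first_p.parent.name)  # 'body'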

Search the document tree

Searching the document tree means finding elements by specifying a tag name, and you can also pinpoint a node element by specifying its attribute values. The two most commonly used methods are find and find_all, and both can be called on BeautifulSoup objects as well as on Tag objects.

find_all()

find_all(name, attrs, recursive, text, **kwargs)

The return value of find_all is a list of Tags. The method can be invoked very flexibly, and all of its parameters are optional.

The first parameter, name, is the name of a tag node.

# Find all nodes whose tag name is title
>>> soup.find_all("title")
[<title>hello, world</title>]
>>> soup.find_all("p")
[<p class="bold">如何使用BeautifulSoup</p>, <p class="big" id="key1"> 第二个p标签</p>]
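
name does not have to be a single string; per the find_all documentation it can also be a list or a compiled regular expression, a sketch:

import re

soup.find_all(["title", "p"])      # all title tags plus all p tags
soup.find_all(re.compile("^ti"))   # all tags whose name starts with "ti", i.e. title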

The second parameter is the tag's class attribute value:

# Find all p tags whose class attribute is "big"
>>> soup.find_all("p", "big")
[<p class="big" id="key1"> 第二个p标签</p>]

which is equivalent to:

>>> soup.find_all ("P", class_= "big") [<p class= "Big" > \xb5\xda\xb6\xfe\xb8\xf6p\xb1\xea\xc7\xa9</ P>]

Because class is a Python keyword, it is specified here as class_.
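
The same query can also be written with the attrs dictionary parameter, which avoids the keyword clash altogether:

soup.find_all("p", attrs={"class": "big"})   # equivalent to class_="big"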

kwargs are a tag's attribute name/value pairs. For example, to find a tag whose href attribute is "http://foofish.net":

>>> soup.find_all(href="http://foofish.net")
[<a href="http://foofish.net">PYTHON</a>]

Of course, it also supports regular expressions:

>>> import re
>>> soup.find_all(href=re.compile("^http"))
[<a href="http://foofish.net">PYTHON</a>]

Besides a specific value or a regular expression, an attribute can also be a Boolean value (True/False):

>>> soup.find_all(id="key1")
[<p class="big" id="key1"> 第二个p标签</p>]
>>> soup.find_all(id=True)
[<p class="big" id="key1"> 第二个p标签</p>]

Traversal and search can be combined: first locate the body tag to narrow the search scope, then look up the a tags from within body.

>>> body_tag = soup.body
>>> body_tag.find_all("a")
[<a href="http://foofish.net">PYTHON</a>]

find()

The find method is similar to find_all, except that it returns a single Tag object instead of a list. If no node matches, it returns None; if multiple tags match, only the first one is returned.

>>> body_tag.find("a")
<a href="http://foofish.net">PYTHON</a>
>>> body_tag.find("p")
<p class="bold">如何使用BeautifulSoup</p>
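
Because find returns None on a miss, it is worth guarding the result before using it; a small defensive sketch:

tag = soup.find("h2")        # no <h2> in the sample document
if tag is not None:
    print(tag.get_text())
else:
    print("no matching node")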

get_text()

To get the contents of a tag, besides .string you can also use the get_text method. The difference is that the former returns a NavigableString object while the latter returns a string of Unicode type.

>>> p1 = body_tag.find("p").get_text()
>>> type(p1)
<type 'unicode'>
>>> p1
u'如何使用BeautifulSoup'
>>> p2 = body_tag.find("p").string
>>> type(p2)
<class 'bs4.element.NavigableString'>
>>> p2
u'如何使用BeautifulSoup'

In real-world scenarios, we typically use the get_text method to get the contents of a tag.
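
get_text also accepts two useful, documented parameters: a separator inserted between the text fragments of nested tags, and strip, which trims whitespace from each fragment:

soup.body.get_text(" ", strip=True)   # all text under <body>, space-separated, trimmed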

Summary

BeautifulSoup is a Python library for manipulating HTML documents. When initializing BeautifulSoup, you need to specify an HTML document string and a specific parser. BeautifulSoup has three commonly used data types: Tag, NavigableString, and BeautifulSoup. There are two ways to find HTML elements, traversing the document tree and searching the document tree, and the two are often combined for fast data retrieval.
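
Putting the whole series together, here is a compact sketch that fetches a page with requests (covered in the previous article) and extracts its links with BeautifulSoup; the URL is just the one used in the examples above:

import requests
from bs4 import BeautifulSoup

resp = requests.get("http://foofish.net")
soup = BeautifulSoup(resp.text, "html.parser")

# print every link's address and text
for a in soup.find_all("a"):
    print("{} {}".format(a.get("href"), a.get_text(strip=True)))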

"Recommended"

1. Python Crawler Primer (5)--Regular Expression Example tutorial

2. Python Crawler Introduction (3)--using requests to build a knowledge API

3. Python crawler Primer (2)--http Library requests

4. Summarize Python's logical operators and

5. Python crawler Primer (1)--Quick understanding of HTTP protocol
