Python Crawler Series (IV): Parsing HTML into Python Objects with Beautiful Soup

Last Update:2017-10-20 Source: Internet

Author: User

Tags cdata xml parser

In previous articles, we learned how to get the content of HTML documents, that is, to download pages from URLs. Starting today, we'll discuss how to turn HTML into a Python object and analyze the document in Python code.

(Niu Xiaomei has been at this for several days and still has not analyzed a single HTML document. The next few articles will make up for that.)

Beautiful Soup transforms a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four types: Tag, NavigableString, BeautifulSoup, and Comment.
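
To make the four types concrete, here is a minimal sketch that parses one small snippet and prints the type of each node. The markup is a made-up sample, and the html.parser parser is used only because it ships with Python; neither comes from the original article.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical sample markup, chosen only so that every node type appears once.
html = '<p class="intro">Hello<!-- a note --></p>'
soup = BeautifulSoup(html, "html.parser")

p = soup.p              # a Tag
text = p.contents[0]    # a NavigableString ("Hello")
note = p.contents[1]    # a Comment (" a note ")

for node in (soup, p, text, note):
    print(type(node).__name__)
# BeautifulSoup
# Tag
# NavigableString
# Comment
```

Comment is a subclass of NavigableString, and BeautifulSoup is a subclass of Tag, which is why the later sections describe them as behaving like those types most of the time.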

A Tag object corresponds to a tag in the original XML or HTML document.

Getting and modifying a tag's name and attributes

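from bs4 import BeautifulSoup

# Note: the second argument (the parser) must be given. Calling BeautifulSoup the way the
# official documentation shows, without it, reported an error. Beautiful Soup is at version 4.6.
# Sample markup; the class value "boldest" is illustrative.
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml-xml')
tag = soup.b
# The Python object that corresponds to the <b> tag
print(type(tag))
print(tag.name)
# Rename the tag: name is not an attribute called "name" but the tag's own name
tag.name = "blockquote"
print(tag)
# Get one attribute
print(tag['class'])
# Get all attributes as a dict
print(tag.attrs)

# Modify attributes
tag['class'] = 'verybold'
tag['id'] = 1
print(tag)

# Delete attributes
del tag['class']
del tag['id']
print(tag)
# The class attribute is gone, so tag['class'] would now raise KeyError
# print(tag['class'])
# tag.get() returns None instead of raising
print(tag.get('class'))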
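
Since indexing a missing attribute raises KeyError, it can help to see the safer access patterns side by side. The following is a small sketch with made-up markup and the html.parser parser (both assumptions, not part of the original listing); it uses only Tag.get() and Tag.has_attr().

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
soup = BeautifulSoup('<a href="/home" id="nav">Home</a>', "html.parser")
link = soup.a

print(link.has_attr("href"))            # True
print(link.get("title"))                # None: no exception for a missing attribute
print(link.get("title", "untitled"))    # untitled: a fallback value of our choosing

try:
    link["title"]                       # dictionary-style access is strict
except KeyError as exc:
    print("missing attribute:", exc)
```

Dictionary-style access suits attributes that must be present, while get() suits optional ones.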

Multi-valued attributes:

These are attributes that can hold more than one value.

Note: the lxml-xml parser is used here, so the multiple values are not visible and the attribute comes back as a single string. With the html.parser parser, the attribute comes back as multiple values.

css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'lxml-xml')
print(css_soup.p['class'])

css_soup = BeautifulSoup('<p class="body"></p>', 'lxml-xml')
print(css_soup.p['class'])

Corresponding results:

body strikeout

body

If an attribute looks as if it has multiple values but is not defined as multi-valued in any version of the HTML standard, Beautiful Soup returns it as a single string.

# Sample markup; the id value "my id" is illustrative.
id_soup = BeautifulSoup('<p id="my id"></p>', 'lxml-xml')

# The value is returned as a string
print(id_soup.p['id'])
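
For contrast with the XML parser, this sketch runs the same kind of markup through html.parser, where class is treated as multi-valued and id is not. The markup and variable names are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: class holds two values, id merely contains a space.
soup = BeautifulSoup('<p class="body strikeout" id="my id"></p>', "html.parser")

classes = soup.p["class"]
element_id = soup.p["id"]

print(classes)       # ['body', 'strikeout']
print(element_id)    # my id
```

With an HTML parser the class value arrives as a list, so membership tests such as "strikeout" in classes work without splitting the string by hand.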

When a tag is converted back to a string, the values of a multi-valued attribute are merged into a single value.

rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', 'lxml-xml')
print(rel_soup.a['rel'])
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)

Result:

<p>Back to the <a rel="index contents">homepage</a></p>

Note the rel attribute of the <a> tag.
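
The same round trip can be sketched with html.parser, where rel is already a list before it is modified. The parser choice and the rel_before variable are assumptions made for this illustration.

```python
from bs4 import BeautifulSoup

rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', "html.parser")

rel_before = list(rel_soup.a["rel"])
print(rel_before)                            # ['index']

# Assign a list; it is joined with spaces when the tag is serialized.
rel_soup.a["rel"] = ["index", "contents"]
print(str(rel_soup.p))
# <p>Back to the <a rel="index contents">homepage</a></p>
```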

Navigable strings

Strings are usually contained within tags. Beautiful Soup uses the NavigableString class to wrap the string inside a tag.

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml-xml')
tag = soup.b
print(tag.string)
print(type(tag.string))

Results:

Extremely bold
<class 'bs4.element.NavigableString'>

A NavigableString behaves like a Unicode string in Python, and it also supports some of the features described under traversing and searching the document tree. A NavigableString can be converted directly to a plain Unicode string with str() in Python 3 (unicode() in Python 2).

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml-xml')
tag = soup.b

# In Python 3, str() plays the role of Python 2's unicode()
unicode_string = str(tag.string)
print(unicode_string)

Results:

Extremely bold
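
The difference between the wrapped string and the converted one is easy to check. This is a minimal sketch assuming Python 3 and the html.parser parser; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")
nav = soup.b.string      # NavigableString, still attached to the tree
plain = str(nav)         # plain str, detached from the tree

print(isinstance(nav, str))        # True: NavigableString subclasses str
print(type(plain) is str)          # True
print(nav == plain)                # True: same text
print(nav.parent.name)             # b
print(hasattr(plain, "parent"))    # False: the plain copy knows nothing about the tree
```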

The string contained in a tag cannot be edited in place, but it can be replaced with another string using the replace_with() method:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml-xml')
tag = soup.b

tag.string.replace_with("No longer bold")

print(tag)

Results:

<b class="boldest">No longer bold</b>
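
Two details of replacement are worth a quick sketch: replace_with() hands back the string it removed, and assigning to tag.string is another way to swap the text. The markup and parser here are assumptions for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")
tag = soup.b

# replace_with() returns the element that was taken out of the tree.
old = tag.string.replace_with("No longer bold")
print(tag)           # <b>No longer bold</b>
print(old)           # Extremely bold
print(old.parent)    # None: the old string is detached

# Assigning to .string replaces the tag's contents with the new text.
tag.string = "Bold again"
print(tag)           # <b>Bold again</b>
```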

Note:

NavigableString objects support most of the attributes described under traversing and searching the document tree, but not all of them. In particular, a string cannot contain anything else (a tag can contain strings or other tags), so strings do not support the .contents or .string attributes or the find() method.
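
As a rough illustration of what navigation from a string looks like, the sketch below moves from a string to its parent and siblings, and checks that it has no .contents. The markup is made up, and html.parser is assumed.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>One <b>two</b> three</p>", "html.parser")
first = soup.p.contents[0]              # the NavigableString "One "

print(first.parent.name)                # p
print(first.next_sibling)               # <b>two</b>
print(first.find_next("b").string)      # two
print(hasattr(first, "contents"))       # False: a string has no children
```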

If you want to use a NavigableString outside Beautiful Soup, call str() (unicode() in Python 2) to convert it to a plain Unicode string. Otherwise the string keeps a reference to the whole Beautiful Soup parse tree even after you have finished with Beautiful Soup, which wastes memory.

BeautifulSoup object

The BeautifulSoup object represents the entire contents of a document. Most of the time it can be treated as a Tag object: it supports most of the methods described under traversing and searching the document tree.

Because the BeautifulSoup object does not correspond to a real HTML or XML tag, it has no name and no attributes. It is sometimes convenient to look at its .name, however, so the BeautifulSoup object has a special .name whose value is "[document]".
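
A short sketch of the special name, using made-up markup and the html.parser parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello</p>", "html.parser")

print(soup.name)                   # [document]
print(soup.p.name)                 # p
print(soup.find("p") is soup.p)    # True: the soup can be searched like a Tag
```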

It is enough to be aware of this.

Comments and special strings

Tag, NavigableString, and BeautifulSoup cover almost everything in an HTML or XML document, but a few special objects remain. The one you are most likely to care about is the comment:

from bs4 import BeautifulSoup, CData

# Sample markup: a <b> tag whose only child is a comment (the comment text is illustrative)
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'lxml-xml')
comment = soup.b.string
print(type(comment))
# A Comment object is a special type of NavigableString:
print(comment)
# Pretty-printed output
print(soup.b.prettify())
# Other types defined in Beautiful Soup may appear in an XML document:
# CData, ProcessingInstruction, Declaration, Doctype. Like Comment,
# these classes are all subclasses of NavigableString that add something extra to the string.
# Here is an example that replaces the comment with a CDATA block:
cdata = CData("A CDATA block")
comment.replace_with(cdata)

print(soup.b.prettify())
# <b>
#  <![CDATA[A CDATA block]]>
# </b>
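
In a crawler, comments usually need to be found or stripped rather than inspected one at a time. The following is a minimal sketch of one way to do that with find_all() and an isinstance() check; the markup is invented, html.parser is assumed, and the string= keyword requires Beautiful Soup 4.4 or later.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical markup containing two comments.
html = "<div><!-- first note --><p>Text<!-- second note --></p></div>"
soup = BeautifulSoup(html, "html.parser")

# Collect every string node that is a Comment.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
print([c.strip() for c in comments])    # ['first note', 'second note']

# Remove the comments from the tree.
for c in comments:
    c.extract()
print(soup)                             # <div><p>Text</p></div>
```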

This article is an English version of an article originally published in Chinese on aliyun.com.
