In previous articles, we learned how to get the content of HTML documents, that is, to download pages from URLs. Starting today, we'll discuss how to turn HTML into a Python object and analyze the document in Python code.
(Niu Xiao-Mei struggled at school for several days and still could not get an HTML document parsed. She should read the next few articles carefully.)
Beautiful Soup transforms a complex HTML document into a tree structure in which each node is a Python object. All of these objects fall into four types: Tag, NavigableString, BeautifulSoup, and Comment.
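As a quick check of the four types, here is a minimal sketch. The markup string is made up for illustration, and html.parser is used only because it ships with Python; it is an assumption, not the parser used in the rest of this article.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one tag holding text, one tag holding only a comment
soup = BeautifulSoup('<b>bold</b><i><!--note--></i>', 'html.parser')

type_names = [
    type(soup).__name__,           # the whole document
    type(soup.b).__name__,         # a tag
    type(soup.b.string).__name__,  # the text inside a tag
    type(soup.i.string).__name__,  # a comment inside a tag
]
print(type_names)
```

Each entry in the printed list is the class name of one of the four object types.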
A Tag object corresponds to a tag in the original XML or HTML document.
Getting and modifying a tag's name and attributes
from bs4 import BeautifulSoup
# Note: the second argument must be a string naming the parser; following the
# official documentation as written produced an error. Beautiful Soup is at version 4.6 here.
soup = BeautifulSoup('<b class="boldest">extremely bold</b>', "lxml-xml")
tag = soup.b
# The Python object that corresponds to the <b> tag
print(type(tag))
print(tag.name)
# Rename the tag: name is not the tag's "name" attribute but the tag name itself
tag.name = "blockquote"
print(tag)
# Get one attribute
print(tag['class'])
# Get all attributes as a dictionary
print(tag.attrs)
# Modify attributes
tag['class'] = 'verybold'
tag['id'] = 1
print(tag)
# Delete attributes
del tag['class']
del tag['id']
print(tag)
# The class attribute is gone, so indexing raises KeyError
try:
    print(tag['class'])
except KeyError:
    print("KeyError: 'class'")
# tag.get() returns None for a missing attribute
print(tag.get('class'))
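The difference between indexing and get() matters when an attribute may be missing. The following is a small sketch under assumed markup (the id value "intro" is hypothetical) and the built-in html.parser; it uses only has_attr(), get(), and indexing.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with an id attribute but no class attribute
soup = BeautifulSoup('<b id="intro">extremely bold</b>', 'html.parser')
tag = soup.b

has_class = tag.has_attr('class')        # check without raising
class_value = tag.get('class', 'none')   # get() accepts a fallback value
id_value = tag.get('id')

try:
    tag['class']                         # indexing a missing attribute
    raised = False
except KeyError:
    raised = True

print(has_class, class_value, id_value, raised)
```

Use indexing when a missing attribute should count as an error, and get() when a default value is acceptable.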
Multi-valued attributes:
These are attributes that can hold more than one value.
Note: the lxml-xml parser is used here, so the multiple values are not visible and the attribute comes back as a single string. With html.parser the result is a list of values.
css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'lxml-xml')
print(css_soup.p['class'])
css_soup = BeautifulSoup('<p class="body"></p>', 'lxml-xml')
print(css_soup.p['class'])
Results:
body strikeout
body
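For comparison, here is a minimal sketch of the same two lookups with html.parser, which is assumed here because it ships with Python. With an HTML parser, class comes back as a list.

```python
from bs4 import BeautifulSoup

# Same markup as above, parsed as HTML instead of XML
html_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
classes = html_soup.p['class']
print(classes)

single = BeautifulSoup('<p class="body"></p>', 'html.parser').p['class']
print(single)
```

Even a single class value is returned as a one-element list, so code that handles class should expect a list whenever an HTML parser is in use.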
If an attribute looks like it has multiple values but is not defined as multi-valued in any version of the HTML standard, Beautiful Soup returns it as a single string.
id_soup = BeautifulSoup('<p id="my id"></p>', 'lxml-xml')
# The return value is a string
print(id_soup.p['id'])
When a tag is converted to a string, the values of a multi-valued attribute are merged into one.
rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', 'lxml-xml')
print(rel_soup.a['rel'])
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)
Results:
<p>Back to the <a rel="index contents">homepage</a></p>
Note the rel attribute of the <a> tag.
NavigableString (strings that can be traversed)
Strings are usually contained within tags. Beautiful Soup uses the NavigableString class to wrap the string inside a tag.
soup = BeautifulSoup('<b class="boldest">extremely bold</b>', 'html.parser')
tag = soup.b
print(tag.string)
print(type(tag.string))
Results:
extremely bold
<class 'bs4.element.NavigableString'>
A NavigableString behaves like a Python Unicode string, and it also supports some of the features described in navigating and searching the document tree. A NavigableString can be converted directly to a plain Unicode string. In Python 3 this is done with str(); unicode() is the Python 2 equivalent.
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">extremely bold</b>', 'html.parser')
tag = soup.b
# In Python 3 the built-in str() replaces Python 2's unicode(), so no extra import is needed
unicode_string = str(tag.string)
print(unicode_string)
Results:
extremely bold
The string contained in a tag cannot be edited in place, but it can be replaced with another string using the replace_with() method:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">extremely bold</b>', 'html.parser')
tag = soup.b
tag.string.replace_with("No longer bold")
print(tag)
Results:
<b class="boldest">No longer bold</b>
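Two related details may be useful here: replace_with() returns the string that was replaced, and assigning to tag.string is another way to swap the contents. A minimal sketch, assuming the built-in html.parser and made-up replacement text:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">extremely bold</b>', 'html.parser')
tag = soup.b

# replace_with() hands back the string that was removed from the tree
old = tag.string.replace_with("No longer bold")
old_text = str(old)
after_replace = str(tag)

# Assigning to .string replaces the tag's contents with the new text
tag.string = "Bold again"
after_assign = str(tag)

print(old_text)
print(after_replace)
print(after_assign)
```

Keeping the returned value is handy when the original text still needs to be logged or restored later.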
Note:
NavigableString objects support most of the attributes described in navigating and searching the document tree, but not all of them. In particular, a string cannot contain other content (a tag can contain strings or other tags), so strings do not support the .contents or .string attributes or the find() method.
If you want to use a NavigableString outside of Beautiful Soup, convert it to a normal Unicode string first with str() (unicode() in Python 2). Otherwise the string keeps a reference to the whole Beautiful Soup parse tree even after you have finished using Beautiful Soup, which wastes memory.
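The effect of the conversion can be seen in a short sketch. It assumes Python 3 and the built-in html.parser; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">extremely bold</b>', 'html.parser')
navigable = soup.b.string   # still attached to the parse tree
plain = str(navigable)      # detached copy of the text

parent_name = navigable.parent.name       # the string knows its parent tag
is_plain_str = type(plain) is str         # exactly the built-in str type
keeps_tree = hasattr(plain, 'parent')     # a plain str has no link to the tree

print(parent_name, is_plain_str, keeps_tree)
```

Once only plain strings are kept, nothing holds on to the parse tree and it can be garbage collected.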
The BeautifulSoup object
The BeautifulSoup object represents the entire document. Most of the time it can be treated as a Tag object; it supports most of the methods described in navigating and searching the document tree.
Because the BeautifulSoup object does not correspond to a real HTML or XML tag, it has no name and no attributes. It is sometimes convenient to look at its .name, though, so the BeautifulSoup object has been given the special .name value "[document]".
It is enough to be aware of this.
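A minimal sketch of both points, using made-up markup and the built-in html.parser as assumptions:

```python
from bs4 import BeautifulSoup

# Hypothetical two-paragraph document
soup = BeautifulSoup('<p>one</p><p>two</p>', 'html.parser')

doc_name = soup.name   # the special name of the document object

# The document object can be searched just like a Tag
paragraphs = [str(p.string) for p in soup.find_all('p')]

print(doc_name)
print(paragraphs)
```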
Comments and special strings
Tag, NavigableString, and BeautifulSoup cover almost everything found in HTML and XML documents, but there are a few special objects. The one you are most likely to need to worry about is the comment:
from bs4 import BeautifulSoup, CData
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
print(type(comment))
# A Comment object is a special type of NavigableString:
print(comment)
# Pretty-printed output
print(soup.b.prettify())
# Other types defined in Beautiful Soup may appear in XML documents:
# CData, ProcessingInstruction, Declaration, Doctype. Like Comment,
# these classes are all subclasses of NavigableString that add something extra to the string.
# Here is an example that replaces the comment with a CDATA block:
cdata = CData("A CDATA block")
comment.replace_with(cdata)
print(soup.b.prettify())
# <b>
#  <![CDATA[A CDATA block]]>
# </b>
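Because Comment is a subclass of NavigableString, comments can be picked out of a document with an isinstance check. The sketch below assumes made-up markup and the built-in html.parser; it passes a function as the string argument of find_all().

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Hypothetical markup containing ordinary text and two comments
markup = "<p>text<!--first--></p><div><!--second--></div>"
soup = BeautifulSoup(markup, 'html.parser')

# Keep only the strings that are Comment objects
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
comment_texts = [str(c) for c in comments]
all_navigable = all(isinstance(c, NavigableString) for c in comments)

print(comment_texts)
print(all_navigable)
```

The same pattern should work for the other special string classes, for example swapping Comment for CData.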
Python crawler series (4): using Beautiful Soup to parse HTML into Python objects